An Empirical Evaluation of Large Language Models Applying Software Architectural Patterns
Beyond coding, Large Language Models (LLMs) are increasingly explored as assistants for software design and architectural tasks. However, it remains unclear to what extent LLMs can reliably apply explicitly requested software architectural patterns when provided with specific, user-defined requirements. In this paper, we empirically evaluate the ability of multiple LLMs to instantiate specific architectural styles under controlled conditions. We conduct a series of experiments in which models are prompted with problem descriptions expressed at different levels of structure, ranging from loosely structured requirement lists to complete Software Requirements Specification (SRS) documents. The models are instructed, using single-shot prompts, to generate architectures in four representative styles: client–server, 3-tier, Model–View–Controller (MVC), and microservices. We assess the generated architectures with respect to structural correctness, requirement coverage, and adherence to the requested architectural pattern. Our results show that while LLMs can correctly apply simpler architectural patterns, performance decreases as architectural complexity and problem size increase. Model size and requirement representation significantly influence pattern adherence, whereas Retrieval-Augmented Generation (RAG) exhibits mixed effects depending mainly on the retrieved context material and the LLM's capacity. These findings provide insight into the current capabilities and limitations of LLMs in architectural pattern application and inform the design of future AI-assisted architectural tools.