GENERACIÓN DE CÓDIGO BASADA EN LLM: UNA REVISIÓN SISTEMÁTICA DE TÉCNICAS, MÉTRICAS Y EVALUACIÓN EMPÍRICA

Jorge Bergman Mostajo Pedraza

doi:10.23670/FT.2026.1.31

Vol. 2 No. 1 (2026), Review Articles

LLM-Based Code Generation: A Systematic Review of Techniques, Metrics, and Empirical Evaluation

Published 2026-06-12

Jorge Bergman Mostajo

https://orcid.org/0009-0008-5068-3096

GENERACIÓN DE CÓDIGO BASADA EN LLM: UNA REVISIÓN SISTEMÁTICA DE TÉCNICAS, MÉTRICAS Y EVALUACIÓN EMPÍRICA

PDF (Spanish)

Keywords

Large Language Models
Code Generation
NL2Code
Software Engineering
.NET
Systematic Literature Review (SLR)
Prompt Engineering
Agentic Systems
Code Evaluation Metrics

How to Cite

Mostajo Pedraza, J. B. (2026). LLM-Based Code Generation: A Systematic Review of Techniques, Metrics, and Empirical Evaluation. Scientific Newsletter Technological Frontiers, 2(1), 12. https://doi.org/10.23670/FT.2026.1.31

Abstract

This systematic literature review (SLR) critically examines the scientific evidence on the use of Large Language Models (LLMs) for code generation in software engineering, with particular attention to their applicability within the .NET ecosystem. The search was conducted across five databases (IEEE Xplore, ACM Digital Library, Google Scholar, Semantic Scholar, and arXiv) following the PRISMA protocol, identifying 7,159 initial records. After screening, eligibility assessment, and quality evaluation, 40 primary studies published between 2020 and 2025 were selected.

The results indicate that prompt engineering is the dominant technique (72.5%), while fine-tuning and specialized pretraining act as complementary strategies (40%). An emerging trend toward agentic systems is also identified, in which LLMs evolve from standalone code generators into components capable of orchestrating tools and solving repository-level tasks. Regarding evaluation, the studies rely heavily on automated metrics such as pass@k and on synthetic benchmarks, particularly HumanEval; this reliance introduces a systematic bias in performance estimation.
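For readers unfamiliar with pass@k, the following is a minimal sketch of the unbiased estimator commonly used with HumanEval-style benchmarks, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. The abstract does not state which formulation the reviewed studies use, so this formula is an assumption about them, and the function names `pass_at_k` and `mean_pass_at_k` are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, every draw of k samples
    # contains at least one passing sample.
    if n - c < k:
        return 1.0
    # Probability that a random draw of k samples contains no passing one,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark score is the mean of the per-problem estimates, for example `mean_pass_at_k([(10, 3), (10, 0)], k=1)`. Because the metric only checks whether generated code passes the supplied tests, it does not capture properties such as security or maintainability.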

The study identifies a structural gap, referred to as the benchmark saturation gap, between performance reported on synthetic benchmarks and real-world effectiveness, as evidenced by significantly lower results on more representative benchmarks such as BigCodeBench and SWE-bench. Additionally, persistent limitations are confirmed, including code hallucinations, security vulnerabilities, and degradation of code quality.

Finally, critical gaps in the literature are identified, including the lack of studies specifically addressing the .NET/C# ecosystem, the scarcity of longitudinal evaluations, and the absence of measurement frameworks in high-maturity contexts. These findings highlight the need to redefine evaluation approaches and adapt development practices to enable the effective and reliable integration of LLMs into real-world software engineering environments.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
