Code Generation via LLMs
From Text Completion to Agentic Software Engineering
This post traces the evolution of coding language models: not only how the models themselves improved, but how the systems built around them now dominate performance.
The key shift over the last three years:
Code generation is no longer a single inference problem.
It is a closed‑loop decision process over execution feedback.
We can understand modern coding LLMs through two layers:
- Model layer — foundation models trained on code
- Agent layer — execution, verification, and repair loops
1. The Model Layer: Foundation Coding LLMs
These models learn syntax, APIs, and patterns from large‑scale repositories.
AlphaCode (DeepMind, 2022)
Introduced the "generate many → filter" paradigm for solving programming problems.
Key idea:
Correctness emerges from search over many candidates.
Instead of predicting one solution, sample thousands and rank by tests.
Impact: cemented pass@k-style sampling metrics (introduced with Codex/HumanEval) as the standard evaluation for coding models.
Paper: https://arxiv.org/abs/2203.07814
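The pass@k metric can be computed with the standard unbiased estimator from the Codex/HumanEval evaluation methodology: given n sampled candidates of which c pass the tests, it estimates the probability that at least one of k drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n candidates of which c
    are correct, passes. Assumes k <= n."""
    if n - c < k:
        # Fewer than k incorrect candidates exist, so any k-sample
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 candidates of which 1 is correct, pass@1 is 0.5.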
Code Llama (Meta, 2023)
A code‑specialized version of Llama trained on large code corpora.
Capabilities:
- code completion
- infilling
- instruction following
Shifted open models from general text → usable coding assistants.
Paper: https://arxiv.org/abs/2308.12950
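Infilling (fill-in-the-middle) works by rearranging the known prefix and suffix into a left-to-right prompt using sentinel tokens; the model then generates the missing middle after the last sentinel. A minimal sketch with hypothetical sentinel strings (real models such as Code Llama define their own special tokens for this):

```python
# Hypothetical sentinels for illustration only; actual infilling models
# use dedicated special tokens from their tokenizer vocabulary.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Rearrange (prefix, suffix) into a left-to-right prompt; the model
    generates the missing middle after the MID sentinel."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"
```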
DeepSeek‑Coder (2024) / DeepSeek‑Coder‑V2 (2024)
Large-scale training focused on matching the distribution of real-world repositories.
Key improvements:
- long context
- multi‑language support
- stronger reasoning in code tasks
Papers: https://arxiv.org/abs/2401.14196 (V1), https://arxiv.org/abs/2406.11931 (V2)
2. Benchmarks Changed the Goal
HumanEval / MBPP
Short function synthesis tasks.
Good for syntax — weak for real development.
SWE‑bench (2024)
Real GitHub issues + repository context + tests.
Now success requires:
- navigation
- understanding
- editing multiple files
- debugging
Paper: https://arxiv.org/abs/2310.06770
This changed the definition of coding intelligence:
From writing code → fixing software
3. The Agent Layer: Where Performance Actually Comes From
Modern coding systems operate in a loop:
plan → edit → run → observe → repair
Most performance gains now come from improving this loop rather than scaling the base model.
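The loop above can be sketched as a closed-loop driver. Everything here is a toy stand-in: `propose_edit`, `apply`, and `run_tests` are hypothetical interfaces for a real model and workspace, not any particular framework's API:

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    passed: bool
    feedback: str = ""

class EchoModel:
    """Toy 'model': proposes the task description itself as the edit."""
    def propose_edit(self, task, observation):
        return f"fix: {task}"

class ToyWorkspace:
    """Toy workspace whose tests pass once any edit has been applied."""
    def __init__(self):
        self.edits = []
    def describe(self, task):
        return "initial repo state"
    def apply(self, edit):
        self.edits.append(edit)
    def run_tests(self):
        return RunResult(passed=bool(self.edits), feedback="tests failing")

def agent_loop(task, model, workspace, max_iters=5):
    """plan -> edit -> run -> observe -> repair, until tests pass."""
    observation = workspace.describe(task)            # observe initial state
    for _ in range(max_iters):
        edit = model.propose_edit(task, observation)  # plan + edit
        workspace.apply(edit)                         # edit
        result = workspace.run_tests()                # run
        if result.passed:
            return edit                               # done
        observation = result.feedback                 # feed failure back in
    return None
```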
4. Inference‑Time Agentic Improvements (No Training)
AlphaCodium — test‑driven generation
Generate → execute tests → repair → iterate
Turns coding into a verification search problem.
Paper: https://arxiv.org/abs/2401.08500
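A minimal sketch of the generate → execute tests → repair cycle, assuming a hypothetical `generate` callable that maps error feedback to a new candidate program (a real pipeline would sandbox the execution):

```python
def run_candidate(code: str, tests: str) -> tuple[bool, str]:
    """Execute a candidate program plus its tests in a scratch namespace;
    return (passed, error feedback)."""
    ns: dict = {}
    try:
        exec(code, ns)
        exec(tests, ns)
        return True, ""
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

def generate_and_repair(generate, tests, max_rounds=3):
    """Test-driven loop in the spirit of AlphaCodium: keep regenerating
    from error feedback until the tests pass or rounds run out."""
    feedback = ""
    for _ in range(max_rounds):
        code = generate(feedback)
        ok, feedback = run_candidate(code, tests)
        if ok:
            return code
    return None
```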
CodeAct — executable actions
Agents write executable Python blocks as actions instead of natural language plans.
This reduces ambiguity and stabilizes tool usage.
Paper: https://arxiv.org/abs/2402.01030
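A minimal sketch of the executable-action idea: a model-emitted Python block is run against persistent interpreter state, and the captured stdout becomes the observation returned to the agent (a real system would sandbox this, and the function name here is illustrative):

```python
import contextlib
import io

def execute_action(code: str, state: dict) -> str:
    """Run a model-emitted Python action against persistent `state`
    and return captured stdout (or the error) as the observation."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        try:
            exec(code, state)  # state persists across actions
        except Exception as e:
            print(f"Error: {type(e).__name__}: {e}")
    return buf.getvalue()
```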
SWE‑agent — interface matters
Shows that agent performance depends heavily on the agent-computer interface:
- file operations
- command execution
- feedback formatting
Paper: https://arxiv.org/abs/2405.15793
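One concrete example of interface design: rather than dumping whole files, SWE-agent-style interfaces show a small numbered window around the line of interest. A sketch of such feedback formatting (the function name and windowing details are illustrative, not the tool's actual API):

```python
def open_window(text: str, center: int, radius: int = 2) -> str:
    """Format a numbered window of `radius` lines around 1-indexed line
    `center`, instead of returning the whole file to the agent."""
    lines = text.splitlines()
    lo = max(0, center - radius - 1)
    hi = min(len(lines), center + radius)
    return "\n".join(f"{i + 1}: {line}"
                     for i, line in enumerate(lines[lo:hi], start=lo))
```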
Reflexion / Self‑Refine — learning from failures
Store verbal feedback from failed attempts and use it to guide the next attempt.
Papers: https://arxiv.org/abs/2303.11366 (Reflexion), https://arxiv.org/abs/2303.17651 (Self-Refine)
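A minimal sketch of the store-feedback-and-revise idea: failed attempts are summarized and prepended to the next prompt. The class and method names are illustrative, not from either paper:

```python
class ReflectionMemory:
    """Episodic memory sketch: record failures verbally, then inject
    the accumulated lessons into the next attempt's prompt."""
    def __init__(self):
        self.reflections: list[str] = []

    def record(self, attempt: str, error: str):
        self.reflections.append(f"Attempt `{attempt}` failed: {error}")

    def augment(self, prompt: str) -> str:
        if not self.reflections:
            return prompt
        notes = "\n".join(self.reflections)
        return f"Lessons from earlier attempts:\n{notes}\n\nTask: {prompt}"
```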
ChatDev — multi‑agent development
Simulates a software team (planner, coder, tester).
Paper: https://arxiv.org/abs/2307.07924
5. Training‑Time Agentic Improvement
Instead of only using tests at inference, newer work trains models using execution signals.
RL with verifiable rewards (RLVR)
Run code → reward correctness → optimize policy
Example direction: CodeRL+
Paper: https://arxiv.org/abs/2510.18471
Key idea:
Optimize probability that search finds a correct program
This aligns training with pass@k success.
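A minimal sketch of a verifiable reward plus group-relative advantages over a batch of sampled programs. The binary reward follows the run-code-then-score recipe above; the mean-centered grouping is an assumption about the RL setup, and a real pipeline would sandbox execution:

```python
def verifiable_reward(code: str, tests: str) -> float:
    """RLVR-style reward: 1.0 iff the program passes its tests."""
    ns: dict = {}
    try:
        exec(code, ns)
        exec(tests, ns)
        return 1.0
    except Exception:
        return 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """Mean-centered advantages across a group of samples for the same
    prompt: correct programs get positive advantage, incorrect negative."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```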
6. A Unifying Taxonomy
Inference‑time improvement
- sampling + reranking
- execution feedback
- reflection
- tool‑grounded actions
Training‑time improvement
- supervised traces
- preference optimization
- reinforcement learning with tests
7. What Actually Improved Coding LLMs
Not just bigger models — but feedback loops.
| Era | Optimization target |
|---|---|
| Early | single output accuracy |
| Sampling | pass@k |
| Agents | success under iteration |
| RLVR | success under search |
Final Takeaway
Coding LLM progress did not come primarily from scaling parameters.
It came from turning generation into a decision process interacting with execution.
A coding model is no longer a text generator.
It is a policy operating inside a compiler and runtime.
Future improvements will likely focus on richer feedback signals (execution traces, invariants, semantic correctness) rather than more data alone.