Code Generation via LLMs
From Text Completion to Agentic Software Engineering
This post traces the evolution of coding language models: not only how the models themselves improved, but how the systems built around them now dominate performance.
The key shift over the last three years:
Code generation is no longer a single inference problem.
It is a closed‑loop decision process over execution feedback.
We can understand modern coding LLMs through two layers:
- Model layer — foundation models trained on code
- Agent layer — execution, verification, and repair loops
1. The Model Layer: Foundation Coding LLMs
These models learn syntax, APIs, and patterns from large‑scale repositories.
AlphaCode (DeepMind, 2022)
Introduced the "generate many → filter" paradigm for solving programming problems.
Key idea:
Correctness emerges from search over many candidates.
Instead of predicting one solution, sample thousands and rank by tests.
Impact: cemented pass@k-style sampling metrics (introduced with Codex/HumanEval) as the standard evaluation for coding models.
Paper: https://arxiv.org/abs/2203.07814
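The pass@k metric can be computed with the standard unbiased estimator from the Codex/HumanEval evaluation methodology: given n sampled candidates of which c pass the tests, it estimates the probability that at least one of k drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n candidates of which c
    are correct, passes. Assumes k <= n."""
    if n - c < k:
        # Fewer than k incorrect candidates exist, so any k-sample
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 candidates of which 1 is correct, pass@1 is 0.5.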
Code Llama (Meta, 2023)
A code‑specialized version of Llama trained on large code corpora.
Capabilities:
- code completion
- infilling
- instruction following
Shifted open models from general text → usable coding assistants.
Paper: https://arxiv.org/abs/2308.12950
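Infilling (fill-in-the-middle) works by rearranging the known prefix and suffix into a left-to-right prompt using sentinel tokens; the model then generates the missing middle after the last sentinel. A minimal sketch with hypothetical sentinel strings (real models such as Code Llama define their own special tokens for this):

```python
# Hypothetical sentinels for illustration only; actual infilling models
# use dedicated special tokens from their tokenizer vocabulary.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Rearrange (prefix, suffix) into a left-to-right prompt; the model
    generates the missing middle after the MID sentinel."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"
```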
DeepSeek‑Coder (2024) / DeepSeek‑Coder‑V2 (2024)
Large-scale training focused on matching the distribution of real-world repositories.
Key improvements:
- long context
- multi‑language support
- stronger reasoning in code tasks
Papers: https://arxiv.org/abs/2401.14196 (V1), https://arxiv.org/abs/2406.11931 (V2)
2. Benchmarks Changed the Goal
HumanEval / MBPP
Short function synthesis tasks.
Good for syntax — weak for real development.
SWE‑bench (2024)
Real GitHub issues + repository context + tests.
Now success requires:
- navigation
- understanding
- editing multiple files
- debugging
Paper: https://arxiv.org/abs/2310.06770
This changed the definition of coding intelligence:
From writing code → fixing software
3. The Agent Layer: Where Performance Actually Comes From
Modern coding systems operate in a loop:
plan → edit → run → observe → repair
Most performance gains now come from improving this loop rather than scaling the base model.
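The loop above can be sketched as a closed-loop driver. Everything here is a toy stand-in: `propose_edit`, `apply`, and `run_tests` are hypothetical interfaces for a real model and workspace, not any particular framework's API:

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    passed: bool
    feedback: str = ""

class EchoModel:
    """Toy 'model': proposes the task description itself as the edit."""
    def propose_edit(self, task, observation):
        return f"fix: {task}"

class ToyWorkspace:
    """Toy workspace whose tests pass once any edit has been applied."""
    def __init__(self):
        self.edits = []
    def describe(self, task):
        return "initial repo state"
    def apply(self, edit):
        self.edits.append(edit)
    def run_tests(self):
        return RunResult(passed=bool(self.edits), feedback="tests failing")

def agent_loop(task, model, workspace, max_iters=5):
    """plan -> edit -> run -> observe -> repair, until tests pass."""
    observation = workspace.describe(task)            # observe initial state
    for _ in range(max_iters):
        edit = model.propose_edit(task, observation)  # plan + edit
        workspace.apply(edit)                         # edit
        result = workspace.run_tests()                # run
        if result.passed:
            return edit                               # done
        observation = result.feedback                 # feed failure back in
    return None
```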
4. Inference‑Time Agentic Improvements (No Training)
AlphaCodium — test‑driven generation
Generate → execute tests → repair → iterate
Turns coding into a verification search problem.
Paper: https://arxiv.org/abs/2401.08500
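A minimal sketch of the generate → execute tests → repair cycle, assuming a hypothetical `generate` callable that maps error feedback to a new candidate program (a real pipeline would sandbox the execution):

```python
def run_candidate(code: str, tests: str) -> tuple[bool, str]:
    """Execute a candidate program plus its tests in a scratch namespace;
    return (passed, error feedback)."""
    ns: dict = {}
    try:
        exec(code, ns)
        exec(tests, ns)
        return True, ""
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

def generate_and_repair(generate, tests, max_rounds=3):
    """Test-driven loop in the spirit of AlphaCodium: keep regenerating
    from error feedback until the tests pass or rounds run out."""
    feedback = ""
    for _ in range(max_rounds):
        code = generate(feedback)
        ok, feedback = run_candidate(code, tests)
        if ok:
            return code
    return None
```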
CodeAct — executable actions
Agents write executable Python blocks as actions instead of natural language plans.
This reduces ambiguity and stabilizes tool usage.
Paper: https://arxiv.org/abs/2402.01030
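A minimal sketch of the executable-action idea: a model-emitted Python block is run against persistent interpreter state, and the captured stdout becomes the observation returned to the agent (a real system would sandbox this, and the function name here is illustrative):

```python
import contextlib
import io

def execute_action(code: str, state: dict) -> str:
    """Run a model-emitted Python action against persistent `state`
    and return captured stdout (or the error) as the observation."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        try:
            exec(code, state)  # state persists across actions
        except Exception as e:
            print(f"Error: {type(e).__name__}: {e}")
    return buf.getvalue()
```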
SWE‑agent — interface matters
Shows that agent performance depends heavily on the agent-computer interface:
- file operations
- command execution
- feedback formatting
Paper: https://arxiv.org/abs/2405.15793
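One concrete example of interface design: rather than dumping whole files, SWE-agent-style interfaces show a small numbered window around the line of interest. A sketch of such feedback formatting (the function name and windowing details are illustrative, not the tool's actual API):

```python
def open_window(text: str, center: int, radius: int = 2) -> str:
    """Format a numbered window of `radius` lines around 1-indexed line
    `center`, instead of returning the whole file to the agent."""
    lines = text.splitlines()
    lo = max(0, center - radius - 1)
    hi = min(len(lines), center + radius)
    return "\n".join(f"{i + 1}: {line}"
                     for i, line in enumerate(lines[lo:hi], start=lo))
```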
Reflexion / Self‑Refine — learning from failures
Store verbal feedback from failed attempts and use it to guide the next attempt.
Papers: https://arxiv.org/abs/2303.11366 (Reflexion), https://arxiv.org/abs/2303.17651 (Self-Refine)
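A minimal sketch of the store-feedback-and-revise idea: failed attempts are summarized and prepended to the next prompt. The class and method names are illustrative, not from either paper:

```python
class ReflectionMemory:
    """Episodic memory sketch: record failures verbally, then inject
    the accumulated lessons into the next attempt's prompt."""
    def __init__(self):
        self.reflections: list[str] = []

    def record(self, attempt: str, error: str):
        self.reflections.append(f"Attempt `{attempt}` failed: {error}")

    def augment(self, prompt: str) -> str:
        if not self.reflections:
            return prompt
        notes = "\n".join(self.reflections)
        return f"Lessons from earlier attempts:\n{notes}\n\nTask: {prompt}"
```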
ChatDev — multi‑agent development
Simulates a software team (planner, coder, tester).
Paper: https://arxiv.org/abs/2307.07924
5. Training‑Time Agentic Improvement
Instead of only using tests at inference, newer work trains models using execution signals.
RL with verifiable rewards (RLVR)
Run code → reward correctness → optimize policy
Example direction: CodeRL+
Paper: https://arxiv.org/abs/2510.18471
Key idea:
Optimize probability that search finds a correct program
This aligns training with pass@k success.
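A minimal sketch of a verifiable reward plus group-relative advantages over a batch of sampled programs. The binary reward follows the run-code-then-score recipe above; the mean-centered grouping is an assumption about the RL setup, and a real pipeline would sandbox execution:

```python
def verifiable_reward(code: str, tests: str) -> float:
    """RLVR-style reward: 1.0 iff the program passes its tests."""
    ns: dict = {}
    try:
        exec(code, ns)
        exec(tests, ns)
        return 1.0
    except Exception:
        return 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """Mean-centered advantages across a group of samples for the same
    prompt: correct programs get positive advantage, incorrect negative."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```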
6. A Unifying Taxonomy
Inference‑time improvement
- sampling + reranking
- execution feedback
- reflection
- tool‑grounded actions
Training‑time improvement
- supervised traces
- preference optimization
- reinforcement learning with tests
7. What Actually Improved Coding LLMs
Not just bigger models — but feedback loops.
| Era | Optimization target |
|---|---|
| Early | single output accuracy |
| Sampling | pass@k |
| Agents | success under iteration |
| RLVR | success under search |
Final Takeaway
Coding LLM progress did not come primarily from scaling parameters.
It came from turning generation into a decision process interacting with execution.
A coding model is no longer a text generator.
It is a policy operating inside a compiler and runtime.
Future improvements will likely focus on richer feedback signals (execution traces, invariants, semantic correctness) rather than more data alone.