Code Generation via LLMs

From Text Completion to Agentic Software Engineering

This post summarizes the evolution of coding language models: not only how the models themselves improved, but how the systems built around them now dominate performance.

The key shift over the last three years:

Code generation is no longer a single inference problem.
It is a closed‑loop decision process over execution feedback.

We can understand modern coding LLMs through two layers:

  1. Model layer — foundation models trained on code
  2. Agent layer — execution, verification, and repair loops

1. The Model Layer: Foundation Coding LLMs

These models learn syntax, APIs, and patterns from large‑scale repositories.

AlphaCode (DeepMind, 2022)

Introduced the "generate many → filter" paradigm for solving programming problems.

Key idea:

Correctness emerges from search over many candidates.

Instead of predicting a single solution, sample thousands of candidates, filter them against the example tests, and cluster the survivors before submission.

Impact: popularized massive sampling and helped make pass@k the standard lens for evaluating coding models.

Paper: https://arxiv.org/abs/2203.07814
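pass@k can be estimated without bias from n sampled candidates of which c pass, using the combinatorial estimator introduced alongside HumanEval; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one
    of k samples, drawn from n candidates of which c are correct,
    passes the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 100 candidates sampled per problem, 5 pass the tests:
print(pass_at_k(100, 5, 10))  # pass@10 is roughly 0.416
```

Computing the complement (the chance that all k draws are wrong) avoids the high variance of simply resampling k-subsets.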


Code Llama (Meta, 2023)

A code‑specialized version of Llama trained on large code corpora.

Capabilities:

  • code completion
  • infilling
  • instruction following

Shifted open models from general text → usable coding assistants.

Paper: https://arxiv.org/abs/2308.12950


DeepSeek‑Coder (2024) / DeepSeek‑Coder‑V2 (2024)

Large‑scale training focused on the distribution of code in real repositories.

Key improvements:

  • long context
  • multi‑language support
  • stronger reasoning in code tasks

Paper: https://arxiv.org/abs/2401.14196


2. Benchmarks Changed the Goal

HumanEval / MBPP

Short function‑synthesis tasks.
Good for checking syntax and local reasoning, but a weak proxy for real development.

SWE‑bench (2024)

Real GitHub issues + repository context + tests.

Now success requires:

  • navigation
  • understanding
  • editing multiple files
  • debugging

Paper: https://arxiv.org/abs/2310.06770

This changed the definition of coding intelligence:

From writing code → fixing software


3. The Agent Layer: Where Performance Actually Comes From

Modern coding systems operate in a loop:

plan → edit → run → observe → repair

Most performance gains now come from improving this loop rather than scaling the base model.
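A minimal sketch of that controller; `generate_patch`, `apply_patch`, and `run_tests` are hypothetical stand-ins for the model call and the execution harness:

```python
from typing import Callable

def agent_loop(generate_patch: Callable[[str], str],
               apply_patch: Callable[[str], None],
               run_tests: Callable[[], tuple[bool, str]],
               max_iters: int = 5) -> bool:
    """plan -> edit -> run -> observe -> repair, until tests pass."""
    feedback = "initial task description"
    for _ in range(max_iters):
        patch = generate_patch(feedback)   # plan + edit proposal
        apply_patch(patch)                 # edit the workspace
        ok, log = run_tests()              # run
        if ok:
            return True
        feedback = log                     # observe; repair on next turn
    return False
```

The important design choice is that the test log, not a human, becomes the prompt for the next iteration.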


4. Inference‑Time Agentic Improvements (No Training)

AlphaCodium — test‑driven generation

Generate → execute tests → repair → iterate

Turns coding into a verification search problem.

Paper: https://arxiv.org/abs/2401.08500
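The execute-tests step can be sketched as follows, assuming each candidate defines a `solve` function; AlphaCodium's actual harness is richer (it also generates extra tests and runs everything sandboxed), so this only illustrates the feedback signal:

```python
def passes_tests(candidate_src: str,
                 tests: list[tuple[tuple, object]],
                 entry: str = "solve") -> tuple[bool, str]:
    """Execute a candidate in a scratch namespace and check it against
    input/output pairs. Returns (ok, feedback usable for repair)."""
    ns: dict = {}
    try:
        exec(candidate_src, ns)               # define the candidate
        fn = ns[entry]
        for args, expected in tests:
            got = fn(*args)
            if got != expected:
                return False, f"{entry}{args} returned {got!r}, expected {expected!r}"
        return True, "all tests passed"
    except Exception as e:
        return False, f"execution error: {e!r}"
```

The returned message is what makes this a search problem: a failing run produces a concrete counterexample to condition the next generation on.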


CodeAct — executable actions

Agents write executable Python blocks as actions instead of natural language plans.

This reduces ambiguity and stabilizes tool usage.

Paper: https://arxiv.org/abs/2402.01030
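A minimal sketch of the action-execution step, assuming actions arrive as plain Python source and observations are captured stdout; a real deployment would sandbox this rather than call `exec` directly:

```python
import contextlib
import io

def execute_action(code: str, state: dict) -> str:
    """Run a model-emitted Python block against a persistent namespace
    and return captured stdout as the observation for the next turn."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, state)  # 'state' persists variables across turns
    except Exception as e:
        return f"Error: {e!r}"
    return buf.getvalue() or "(no output)"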


SWE‑agent — interface matters

Shows that agent performance depends heavily on the agent-computer interface the model is given:

  • file operations
  • command execution
  • feedback formatting

Paper: https://arxiv.org/abs/2405.15793
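A sketch of one such interface choice: returning a compact, line-numbered window of a file instead of dumping the whole file into context. The function name and format here are hypothetical, not SWE-agent's actual commands:

```python
def open_window(file_text: str, center: int, radius: int = 3) -> str:
    """Format a windowed, line-numbered view of a file -- the kind of
    compact observation an agent interface feeds back to the model."""
    lines = file_text.splitlines()
    lo = max(0, center - radius)              # 'center' is a 0-based index
    hi = min(len(lines), center + radius + 1)
    header = f"(showing lines {lo + 1}-{hi} of {len(lines)})"
    body = "\n".join(f"{i + 1}: {lines[i]}" for i in range(lo, hi))
    return header + "\n" + body
```

Small formatting decisions like this one are exactly what the paper measures: they change how reliably the model can locate and edit code.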


Reflexion / Self‑Refine — learning from failures

Store verbal feedback from failed attempts and condition the next attempt on it.

Papers:

  • Reflexion: https://arxiv.org/abs/2303.11366
  • Self‑Refine: https://arxiv.org/abs/2303.17651

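The pattern can be sketched as a retry loop with a reflection memory; `attempt`, `evaluate`, and `reflect` are hypothetical stand-ins for the model and its evaluator:

```python
def reflect_and_retry(attempt, evaluate, reflect, max_trials: int = 3):
    """Reflexion-style loop: keep a memory of verbal self-reflections
    and condition each new attempt on it."""
    memory: list[str] = []
    for _ in range(max_trials):
        result = attempt(memory)          # act, conditioned on past lessons
        ok, feedback = evaluate(result)
        if ok:
            return result
        memory.append(reflect(feedback))  # store a lesson, not the raw trace
    return None
```

Storing a distilled reflection rather than the full failed trace keeps the context short while still steering the next attempt.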

ChatDev — multi‑agent development

Simulates a software team (planner, coder, tester).

Paper: https://arxiv.org/abs/2307.07924


5. Training‑Time Agentic Improvement

Instead of only using tests at inference, newer work trains models using execution signals.

RL with verifiable rewards (RLVR)

Run code → reward correctness → optimize policy

Example direction: CodeRL+

Paper: https://arxiv.org/abs/2510.18471

Key idea:

Optimize the probability that search finds a correct program

This aligns training with pass@k success.
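A minimal sketch of the reward side, again assuming candidates define `solve`. The group-relative centering mirrors what group-based RL methods do with such rewards; it is an illustration of the execute-and-score idea, not CodeRL+'s specific algorithm:

```python
def verifiable_reward(candidate_src: str,
                      tests: list[tuple[tuple, object]]) -> float:
    """Binary reward from execution: 1.0 if the candidate passes every
    unit test, else 0.0. No learned reward model involved."""
    ns: dict = {}
    try:
        exec(candidate_src, ns)
        return 1.0 if all(ns["solve"](*a) == e for a, e in tests) else 0.0
    except Exception:
        return 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """Center rewards within a group of samples for the same prompt,
    so the policy is pushed toward its better-than-average candidates."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Because the reward is computed by running the code, it cannot be gamed the way a learned preference model can, which is what "verifiable" refers to.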


6. A Unifying Taxonomy

Inference‑time improvement

  • sampling + reranking
  • execution feedback
  • reflection
  • tool‑grounded actions

Training‑time improvement

  • supervised traces
  • preference optimization
  • reinforcement learning with tests

7. What Actually Improved Coding LLMs

Not just bigger models — but feedback loops.

Era        Optimization target
--------   ------------------------
Early      single-output accuracy
Sampling   pass@k
Agents     success under iteration
RLVR       success under search

Final Takeaway

Coding LLM progress did not come primarily from scaling parameters.

It came from turning generation into a decision process interacting with execution.

A coding model is no longer a text generator.
It is a policy operating inside a compiler and runtime.

Future improvements will likely focus on richer feedback signals (execution traces, invariants, semantic correctness) rather than more data alone.