DSPy Analysis
DSPy by Stanford: Programming (and Compiling) LM Pipelines into Self-Improving Systems
References (core):
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (Khattab et al., 2023/ICLR 2024): https://arxiv.org/abs/2310.03714
- Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs (Opsahl-Ong et al., 2024/EMNLP 2024): https://arxiv.org/abs/2406.11695
- DSPy Docs (Learn + Tutorials): https://dspy.ai/learn/ and https://dspy.ai/tutorials/
- DSPy GitHub: https://github.com/stanfordnlp/dspy
Why DSPy exists
The DSPy papers start from a blunt observation: most “LLM apps” are really programs (pipelines of LM calls, retrieval, tools, and control flow), but we still build them by hand-tuning free-form prompt strings. That approach is brittle across model swaps, domain shifts, and even small code changes. DSPy proposes a programming model where you write structured Python modules and then compile them to prompts (and sometimes weights) that maximize an end metric. (arxiv.org)
In short:
Instead of prompt engineering, you write a program and let DSPy optimize how the LM should behave inside it. (arxiv.org)
The DSPy mental model
DSPy separates an LM pipeline into two layers:
- Program structure (code): what components exist and how information flows.
- Parameters (learned): per-component instructions and demonstrations (and optionally finetuned weights).
The compiler searches over the parameters while keeping the structure fixed, optimizing for a metric you define (accuracy, EM/F1, constraint satisfaction, task success, etc.). (arxiv.org)
A minimal vocabulary
- Signature: a typed I/O contract for an LM call (inputs/outputs with names).
- Module: a reusable component implementing a step in the pipeline.
- Program: a composition of modules (RAG, multi-hop, agents, etc.).
- Metric: function to score end behavior.
- Optimizer / Teleprompter: procedure that proposes and selects better instructions/demos (and possibly weights).
DSPy docs explicitly present the workflow as three stages: Programming → Evaluation → Optimization. (dspy.ai)
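As a concrete picture of those stages, here is a minimal sketch of the first two (programming and evaluation), assuming a placeholder model name and a toy dev set; the keyword arguments follow the current DSPy docs but may differ slightly across versions:

import dspy

# Programming: configure an LM and declare a one-step program via a string signature.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model name
program = dspy.Predict("question -> answer")

# Evaluation: a metric plus a dev set of dspy.Example objects.
devset = [dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question")]

def exact_match(example, pred, trace=None):
    return pred.answer.strip() == example.answer.strip()

evaluate = dspy.Evaluate(devset=devset, metric=exact_match, display_progress=True)
score = evaluate(program)  # runs the program over devset and aggregates the metric

Optimization, the third stage, is what the rest of this note is about.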
What “compiling” means in DSPy
In the original DSPy paper, “compiling” means producing a new version of your program in which each LM call gets improved:
- instructions (system/prompt guidance)
- few-shot demonstrations (examples in-context)
- optionally fine-tuned weights (for supported setups)
The compilation procedure is metric-driven: it evaluates candidate parameterizations and chooses the one that maximizes your metric on a dev set. (arxiv.org)
Bootstrapping demonstrations (the classic DSPy trick)
A canonical optimizer family in DSPy is the BootstrapFewShot teleprompter: it builds a prompt’s demos from a mix of labeled training examples and bootstrapped examples produced by a teacher model, filtered by your metric/threshold. (dspy.ai)
This is an important “DSPy move”:
Use the LM itself (possibly as a teacher) to generate candidate traces/demos, then keep only the ones that score well.
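Continuing the sketch above (same LM configuration and `exact_match` metric), here is a hedged look at BootstrapFewShot's main knobs; the parameter names follow the current optimizer docs and may shift between versions:

student = dspy.ChainOfThought("question -> answer")
trainset = [dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question")]

optimizer = dspy.BootstrapFewShot(
    metric=exact_match,          # a bootstrapped trace is kept only if the metric passes
    max_bootstrapped_demos=4,    # demos generated by running the (teacher) program
    max_labeled_demos=4,         # demos copied directly from trainset labels
)
compiled_qa = optimizer.compile(student, trainset=trainset)  # compile() also accepts a teacher program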
The second paper: prompt optimization for multi-stage programs (MIPRO)
The 2024 paper (Optimizing Instructions and Demonstrations…) focuses on a harder setting: multi-stage LM programs, where changing one module’s prompt affects downstream stages, and you typically lack module-level labels and gradients.
They factorize the optimization into per-module instruction + demo optimization and introduce MIPRO, which combines:
- Program- and data-aware proposal generation for instructions (so candidates aren’t random strings).
- Stochastic mini-batch evaluation to make searching feasible.
- Meta-optimization: iteratively improve how the LM proposes new instructions over time.
Empirically, they report improvements across diverse pipelines and show MIPRO can beat baseline optimizers (their paper highlights gains up to ~13% accuracy in some settings). (arxiv.org)
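In the library this is exposed as the MIPROv2 optimizer, which shares the compile interface shown above. A hedged sketch, with `my_pipeline` standing in for any multi-module dspy.Module and the `auto` budget presets taken from the current docs (exact arguments may vary across versions):

optimizer = dspy.MIPROv2(metric=exact_match, auto="light")      # "light" / "medium" / "heavy" search budgets
optimized = optimizer.compile(my_pipeline, trainset=trainset)   # candidates scored on the end metric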
Why MIPRO matters
If you’ve ever tuned a RAG-then-reason pipeline, you’ve seen this failure mode:
- “Better retrieval prompt” makes retrieval outputs longer
- downstream answerer now hallucinates more
- end metric goes down
MIPRO is explicitly designed to navigate this credit assignment across modules by scoring candidates on the program’s end metric and using structured proposal strategies. (arxiv.org)
DSPy in practice: a tiny example
Below is a conceptual sketch (not a full runnable notebook) that shows the DSPy flow:
import dspy

# Assumes an LM has been configured, e.g. dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")),
# and that `trainset` is a list of dspy.Example objects with question/answer fields.

# 1) Define a signature: what goes in/out of one LM call
class Answer(dspy.Signature):
    """Answer questions concisely and correctly."""
    question = dspy.InputField()
    answer = dspy.OutputField()

# 2) Wrap it in a module
class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.Predict(Answer)

    def forward(self, question):
        return self.predict(question=question)

# 3) Define a metric (end-to-end scoring)
def exact_match(example, pred, trace=None):
    return pred.answer.strip() == example.answer.strip()

# 4) Compile/optimize (teleprompter)
teleprompter = dspy.BootstrapFewShot(metric=exact_match)
compiled = teleprompter.compile(QA(), trainset=trainset)
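The compiled program is used exactly like the original module, and its learned parameters can be inspected; a hedged continuation of the sketch (attribute names as exposed by recent DSPy versions):

prediction = compiled(question="Who wrote Pride and Prejudice?")
print(prediction.answer)

# What compilation actually changed: each predictor now carries an optimized
# instruction and a set of selected few-shot demonstrations.
for name, predictor in compiled.named_predictors():
    print(name, len(predictor.demos), "demos")

# compiled.save("qa_compiled.json")  # learned prompts/demos can be persisted and reloaded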
The key idea: you don’t manually write prompt strings. You declare intent (signatures) and structure (modules), then let an optimizer produce the prompts and demonstrations. (arxiv.org)
What DSPy is best at
1) LM programs that are more than one call
DSPy shines when your pipeline has multiple interacting steps:
- multi-hop RAG
- decomposition → solve subtasks → aggregate
- tool-use loops (ReAct-like)
- agent pipelines with memory/state
This is also where prompt brittleness hurts the most, so compilation has the highest ROI. (arxiv.org)
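For concreteness, a minimal sketch of such a program: a two-stage query-then-answer pipeline whose stages would each get their own optimized instructions and demos. `my_search` is a hypothetical retrieval function you would supply:

import dspy

class SimpleRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        # Each stage is a separate predictor with its own compiled prompt.
        self.gen_query = dspy.ChainOfThought("question -> search_query")
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        query = self.gen_query(question=question).search_query
        context = my_search(query)  # hypothetical retriever (vector store, BM25, API, ...)
        return self.answer(context=context, question=question)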
2) Rapid iteration across model backends
Because the “parameters” are re-optimized for the new LM, DSPy aims to reduce the pain of switching from (say) GPT to an open-weight model or from one open model family to another. (arxiv.org)
3) Data-driven prompt engineering
If you can define a metric and gather even a modest dev set, DSPy turns prompt work into something closer to ML training: propose → evaluate → select. (dspy.ai)
Sharp edges and design choices
Metrics are everything
DSPy optimizers will faithfully optimize whatever you measure. If your metric is noisy, underspecified, or easy to game, the compiler may produce prompts that “win” on dev but fail in production. The docs explicitly emphasize learning DSPy in the order: programming first, then evaluation, then optimization. (dspy.ai)
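One practical mitigation is to make the metric encode the constraints you actually care about, and to exploit the common DSPy convention that `trace` is not None during bootstrapping; a hedged sketch (the conciseness constraint and its threshold are illustrative):

def grounded_and_concise(example, pred, trace=None):
    correct = pred.answer.strip().lower() == example.answer.strip().lower()
    concise = len(pred.answer.split()) <= 25  # illustrative extra constraint
    if trace is not None:
        # Bootstrapping: be strict, only keep demos that satisfy everything.
        return correct and concise
    # Evaluation: return a graded score so partial progress stays visible.
    return (int(correct) + int(concise)) / 2.0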
Optimization cost
Multi-stage search can get expensive: each candidate prompt requires running (some subset of) the full pipeline. MIPRO addresses feasibility with stochastic mini-batch evaluation and proposal strategies, but you still need to budget for compilation runs. (arxiv.org)
When you don’t need DSPy
If your task is a single prompt with no pipeline and you’re happy with it, DSPy may be overkill. Its main value emerges when modular structure and metric-driven optimization interact.
How to read the DSPy ecosystem (2026 snapshot)
DSPy has grown beyond the two foundational papers:
- An expanding library of modules (e.g., ChainOfThought, ReAct, BestOfN)
- Multiple optimizers (BootstrapFewShot, MIPROv2, etc.)
- Practical tutorials for RAG, agents, evaluation, caching, and deployment
The official docs organize this into “Learn” (concepts) and “Tutorials” (hands-on). (dspy.ai)
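For example, the ReAct module wraps a tool-calling loop behind the same signature/module interface, so agents compose and compile like any other program; a hedged sketch with a placeholder model name and a toy tool:

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model name

def city_population(city: str) -> str:
    """Toy tool: a real tool would call an API or a search index."""
    return f"{city} has roughly 2.1 million inhabitants."

agent = dspy.ReAct("question -> answer", tools=[city_population])
result = agent(question="How many people live in Paris?")
print(result.answer)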
Takeaways
- DSPy reframes prompt engineering as compilation. You write modular code and optimize prompts/demos/weights against a metric. (arxiv.org)
- Bootstrapping is central. Many optimizers create candidate demonstrations/traces using LMs and filter them by a metric. (dspy.ai)
- MIPRO is about multi-stage credit assignment. It focuses on jointly effective prompts for multi-module programs via structured proposals, stochastic evaluation, and meta-optimization. (arxiv.org)
- DSPy is most valuable when pipelines get complex. That’s when manual prompt tuning becomes unscalable and brittle.
References
- Khattab et al. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714. (arxiv.org)
- Opsahl-Ong et al. Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs (MIPRO). arXiv:2406.11695; EMNLP 2024. (arxiv.org)
- DSPy GitHub repository. (github.com)
- DSPy Docs: Learning overview and Tutorials. (dspy.ai)
- DSPy Optimizer reference (e.g., BootstrapFewShot). (dspy.ai)