The Evolution of Mamba: From Selective State Spaces to Linear-Time Reasoning Models
Transformers dominated language modeling for years because of one core property:
$$
\text{Global attention} \Rightarrow \text{long-range dependency modeling}
$$
But they pay a price:
$$
\text{Cost} = O(n^2)
$$
As context length grows, compute explodes.
Mamba introduced a different path:
keep long-range reasoning, drop the quadratic cost of attention.
Instead of attention, it uses a learned dynamical system — a Selective State Space Model (SSM).
This post traces how Mamba evolved across versions and why each step mattered.
Before Mamba: Structured State Space Models (S4)
The precursor to Mamba is S4 (Structured State Space Sequence Model).
Key idea:
Represent sequence processing as a continuous-time dynamical system:
$$
h'(t) = A h(t) + B u(t) \\
y(t) = C h(t)
$$
Instead of attention comparing tokens, the model updates an internal state that remembers history.
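To make this concrete, here is a minimal NumPy sketch of a discretized linear SSM with fixed parameters. The matrices, dimensions, and values are illustrative, not taken from the S4 paper, and discretization details such as zero-order hold and HiPPO initialization are omitted:

```python
import numpy as np

def linear_ssm(u, A, B, C):
    """Discretized linear SSM: h_{t+1} = A h_t + B u_t,  y_t = C h_t."""
    h = np.zeros(A.shape[0])      # hidden state that carries all history
    ys = []
    for u_t in u:                 # one constant-size update per token: O(n) overall
        h = A @ h + B * u_t       # fixed dynamics, identical for every token
        ys.append(C @ h)          # readout
    return np.array(ys)

# toy example: scalar input sequence, 4-dimensional state
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)               # slowly decaying memory of the past
B = rng.normal(size=4)
C = rng.normal(size=4)
y = linear_ssm(rng.normal(size=16), A, B, C)
print(y.shape)                    # (16,)
```

Because A, B, and C never change, every token is filtered in exactly the same way, which is the weakness listed below.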
Advantages:
- linear time complexity
- theoretically infinite context
Weakness:
- static filtering — cannot adapt per token
So S4 remembered everything but could not decide what matters.
Mamba (2023) — Selective State Space Models
Paper: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
The core innovation:
Make the state update depend on the token.
Instead of fixed dynamics:
$$
h_{t+1} = Ah_t + Bx_t
$$
Mamba introduces selection:
$$
h_{t+1} = A(x_t)h_t + B(x_t)x_t
$$
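A minimal sketch of the selective update follows. It is illustrative only: in the actual Mamba layer the step size Δ and the matrices B and C are linear projections of the full token embedding, and A becomes input-dependent through discretization; the scalar-feature toy below just shows the gating idea:

```python
import numpy as np

def selective_ssm(x, w_delta, W_B, W_C, A):
    """Toy selective scan: each token chooses how much to forget and what to write."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                                  # x_t: a scalar token feature (toy setup)
        delta = np.log1p(np.exp(w_delta * x_t))    # softplus -> per-token step size
        A_bar = np.exp(delta * A)                  # input-dependent decay (forget gate)
        B_t = W_B * x_t                            # input-dependent write direction
        C_t = W_C * x_t                            # input-dependent readout
        h = A_bar * h + delta * B_t * x_t          # selective state update
        ys.append(np.dot(C_t, h))
    return np.array(ys)

rng = np.random.default_rng(0)
d_state = 8
A = -np.abs(rng.normal(size=d_state))   # negative entries -> stable, decaying dynamics
w_delta = rng.normal()
W_B, W_C = rng.normal(size=d_state), rng.normal(size=d_state)
y = selective_ssm(rng.normal(size=32), w_delta, W_B, W_C, A)
print(y.shape)                           # (32,)
```

A large Δ lets a token overwrite the state; a Δ near zero lets it pass through almost untouched, which is what gives the model the abilities listed next.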
Now the model can:
- forget irrelevant tokens
- preserve key information
- behave similarly to attention
This is the moment SSMs become competitive with Transformers.
Why It Works
Attention performs dynamic routing:
every token chooses what to read
Mamba performs dynamic memory:
every token chooses what to remember
Same capability — different mechanism.
Complexity
| Model | Complexity |
|---|---|
| Transformer | O(n²) |
| Mamba | O(n) |
This enables extremely long contexts.
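A rough back-of-the-envelope comparison makes the gap tangible (the dimensions below are made up; only the growth rates matter):

```python
def attention_ops(n, d=64):
    """Self-attention: every token interacts with every token -> O(n^2 * d)."""
    return n * n * d

def ssm_ops(n, d_state=16, d=64):
    """SSM scan: one constant-size state update per token -> O(n * d_state * d)."""
    return n * d_state * d

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: attention/ssm cost ratio ~ {attention_ops(n) / ssm_ops(n):,.0f}")
# the ratio grows linearly with n: ~62, ~625, ~6,250
```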
Mamba-2 (2024) — Transformers Are Just Special SSMs
Paper: Transformers are SSMs: Generalized Models and Efficient Algorithms
This paper reframed the theory:
A broad class of attention variants can be written as discretized state-space models.
Meaning:
Transformers and SSMs are not competitors — they are the same family.
Mamba-2 introduces:
- improved parallel training
- better stability
- higher throughput kernels
The biggest conceptual shift:
$$
\text{Attention} \subset \text{State Space Models}
$$
This unified sequence modeling theory.
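The duality is easiest to see with causal linear attention, which can be computed either as a masked matrix product (attention view) or as a recurrence over a fixed-size state (SSM view). The sketch below uses unnormalized linear attention with made-up dimensions as a simplified stand-in for the structured masked attention family analyzed in the Mamba-2 paper:

```python
import numpy as np

def linear_attention_quadratic(Q, K, V):
    """Attention view: score every pair of tokens, apply a causal mask. O(n^2)."""
    scores = np.tril(Q @ K.T)        # (n, n) pairwise interactions, causal
    return scores @ V

def linear_attention_recurrent(Q, K, V):
    """SSM view: the same map computed with a running (d_k, d_v) state. O(n)."""
    S = np.zeros((K.shape[1], V.shape[1]))   # fixed-size summary of the past
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        S = S + np.outer(K[t], V[t])         # state update: absorb this token
        out[t] = Q[t] @ S                    # readout: query the state
    return out

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
V = rng.normal(size=(16, 4))
print(np.allclose(linear_attention_quadratic(Q, K, V),
                  linear_attention_recurrent(Q, K, V)))   # True: same function, two views
```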
Practical Impact
Mamba-1 demonstrated that a linear-time model can match attention quality.
Mamba-2 supplied the theory that connects SSMs and attention.
SSMs are no longer positioned as an approximation of attention; they are a general framework in their own right.
Hybrid Models — Jamba and Mamba-Transformer Mixtures
After Mamba-2, research converged on a division of labor:
attention is strong at precise recall and in-context retrieval
SSM layers carry long-range context cheaply
So hybrid models emerged:
| Layer type | Role |
|---|---|
| Attention | precise reasoning |
| Mamba | long memory |
This led to models with context windows in the hundreds of thousands of tokens (Jamba supports 256K).
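At the architectural level, a hybrid is just an interleaving pattern of layer types. The sketch below is a hypothetical configuration, not Jamba's actual block layout (Jamba additionally uses mixture-of-experts layers); it only illustrates mixing a few attention layers into a mostly-SSM stack:

```python
from dataclasses import dataclass

@dataclass
class LayerSpec:
    kind: str     # "attention" or "mamba"
    index: int

def build_hybrid_stack(n_layers: int, attention_every: int = 8) -> list:
    """Mostly Mamba layers, with one attention layer every `attention_every` layers."""
    stack = []
    for i in range(n_layers):
        kind = "attention" if (i + 1) % attention_every == 0 else "mamba"
        stack.append(LayerSpec(kind=kind, index=i))
    return stack

stack = build_hybrid_stack(n_layers=32)
print([layer.kind for layer in stack[:8]])
# ['mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'attention']
```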
Mamba-3 Direction — Toward Reasoning Models
Recent follow-up work (2024-2025) explores using Mamba-style recurrence for reasoning:
Observation:
Reasoning chains resemble recurrent computation more than retrieval.
Instead of:
$$
\text{look up past tokens}
$$
the model:
$$
\text{updates a latent thought state}
$$
This aligns SSMs with cognitive architectures rather than text similarity engines.
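Purely as an illustration (this is not an algorithm from any of the cited papers), a "latent thought state" can be pictured as a fixed-size vector revised once per reasoning step instead of re-reading the whole history; the recurrence and dimensions below are hypothetical:

```python
import numpy as np

def reasoning_scan(step_embeddings, W, U):
    """Hypothetical sketch: fold reasoning-step embeddings into one latent state."""
    thought = np.zeros(W.shape[0])                 # latent "thought" state
    for s in step_embeddings:                      # one update per step, O(n) overall
        thought = np.tanh(W @ thought + U @ s)     # revise the belief, don't re-read history
    return thought

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d)) / np.sqrt(d)
U = rng.normal(size=(d, d)) / np.sqrt(d)
steps = rng.normal(size=(10, d))                   # ten reasoning-step embeddings
print(reasoning_scan(steps, W, U).shape)           # (16,)
```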
Comparison Across Versions
| Version | Key Idea | Limitation Solved |
|---|---|---|
| S4 | continuous-time memory | quadratic attention cost |
| Mamba | selective memory | no per-token selectivity |
| Mamba-2 | unified with attention | weak theory, training throughput |
| Hybrid Mamba | interleave attention and SSM layers | precise recall and reasoning |
| SSM reasoning models | latent thought state | scaling long reasoning chains |
Intuition
Transformers:
read the past repeatedly
Mamba:
carry the past forward
This difference explains why Mamba excels at long sequences.
Why This Matters
For transformers, serving a longer context directly costs more compute:
$$
\text{more context} \rightarrow \text{more cost}
$$
Mamba changes the relationship, because the recurrent state has a fixed size:
$$
\text{more context} \rightarrow \text{same per-token cost}
$$
So future models may extend context length without a proportional increase in hardware.
Final Takeaway
Mamba is not just a faster transformer.
It is a shift from retrieval-based memory to state-based cognition.
Transformers simulate memory by repeatedly reading tokens.
State-space models simulate memory by continuously updating an internal belief state.
The long-term implication:
AI may move from attention architectures to dynamical systems.
References
Gu, Goel & Ré, Efficiently Modeling Long Sequences with Structured State Spaces (S4), 2021.
https://arxiv.org/abs/2111.00396
Gu & Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces, 2023.
https://arxiv.org/abs/2312.00752
Dao & Gu, Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2), 2024.
https://arxiv.org/abs/2405.21060
Lieber et al. (AI21 Labs), Jamba: A Hybrid Transformer-Mamba Language Model, 2024.
https://arxiv.org/abs/2403.19887