The Evolution of Mamba: From Selective State Spaces to Linear-Time Reasoning Models


Transformers dominated language modeling for years because of one core property:

$$
\text{Global attention} \Rightarrow \text{long-range dependency modeling}
$$

But they pay a price:

$$
\text{Cost} = O(n^2)
$$

As context length grows, compute explodes.

Mamba introduced a different path:
keep long-range reasoning, remove quadratic attention.

Instead of attention, it uses a learned dynamical system — a Selective State Space Model (SSM).

This post traces how Mamba evolved across versions and why each step mattered.


Before Mamba: Structured State Space Models (S4)

The precursor to Mamba is S4 (Structured State Space Sequence Model).

Key idea:

Represent sequence processing as a continuous-time dynamical system:

$$
\begin{aligned}
h'(t) &= A h(t) + B u(t) \\
y(t) &= C h(t)
\end{aligned}
$$

Instead of attention comparing tokens, the model updates an internal state that remembers history.

Advantages:

  • linear time and constant memory per generated token
  • in principle unbounded context (limited by what the fixed-size state can retain, not by compute)

Weakness:

  • static filtering — cannot adapt per token

So S4 remembered everything but could not decide what matters.
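
To make the recurrence concrete, here is a minimal NumPy sketch of a discretized linear SSM scan. It assumes a single scalar input channel, toy sizes, and a bilinear discretization of (A, B) with step size dt; the function name and parameters are illustrative, not taken from the S4 paper.

```python
import numpy as np

def s4_style_scan(u, A, B, C, dt=0.1):
    """Toy discretized linear SSM: h_k = A_bar @ h_{k-1} + B_bar * u_k, y_k = C @ h_k.

    u: (seq_len,) scalar input sequence; A: (d, d); B, C: (d,).
    """
    d = A.shape[0]
    I = np.eye(d)
    # Bilinear (Tustin) discretization of the continuous-time system.
    A_bar = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
    B_bar = np.linalg.solve(I - dt / 2 * A, dt * B)
    h = np.zeros(d)
    ys = []
    for u_k in u:                    # one pass, linear in sequence length
        h = A_bar @ h + B_bar * u_k  # fixed dynamics: same filter for every token
        ys.append(C @ h)             # readout from the compressed history
    return np.array(ys)

# Usage: stable random dynamics on a length-1000 sequence.
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))
y = s4_style_scan(rng.standard_normal(1000), A, rng.standard_normal(4), rng.standard_normal(4))
```

Note that A_bar and B_bar are computed once and reused for every token: that is exactly the static filtering that S4 cannot escape.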


Mamba (2023) — Selective State Space Models

Paper: Mamba: Linear-Time Sequence Modeling with Selective State Spaces

The core innovation:

Make the state update depend on the token.

Instead of fixed dynamics:

$$
h_{t+1} = Ah_t + Bx_t
$$

Mamba introduces selection:

$$
h_{t+1} = A(x_t)h_t + B(x_t)x_t
$$

Now the model can:

  • forget irrelevant tokens
  • preserve key information
  • behave similarly to attention

This is the moment SSMs become competitive with Transformers.
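
For precision: in the Mamba paper the input-dependence enters through a step size Δ(x_t) and projections B(x_t), C(x_t), while A stays a fixed learned diagonal matrix; the discretized transition Ā_t = exp(Δ(x_t)·A) then varies per token, which is what the A(x_t) shorthand above captures. Below is a minimal single-sequence NumPy sketch of that selective recurrence, with illustrative names and a simplified Euler-style B̄_t ≈ Δ_t·B_t; it is not the paper's hardware-aware kernel.

```python
import numpy as np

def selective_scan(x, A, W_B, W_C, W_dt):
    """Toy selective SSM (one sequence, no batching).

    x:    (L, D)  input sequence with D channels
    A:    (D, N)  fixed negative diagonal dynamics per channel
    W_B:  (D, N)  input -> B_t projection (state write direction)
    W_C:  (D, N)  input -> C_t projection (state read direction)
    W_dt: (D, D)  input -> per-channel step size
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                    # one small state per channel
    y = np.zeros((L, D))
    for t in range(L):
        dt = np.log1p(np.exp(x[t] @ W_dt))  # softplus: positive, input-dependent step sizes
        B_t = x[t] @ W_B                    # input-dependent write direction
        C_t = x[t] @ W_C                    # input-dependent read direction
        A_bar = np.exp(dt[:, None] * A)     # per-token decay: small dt keeps memory, large dt overwrites
        h = A_bar * h + (dt * x[t])[:, None] * B_t[None, :]
        y[t] = h @ C_t                      # readout
    return y

rng = np.random.default_rng(0)
L, D, N = 512, 8, 4
out = selective_scan(rng.standard_normal((L, D)),
                     -np.exp(rng.standard_normal((D, N))),   # fixed negative dynamics
                     0.1 * rng.standard_normal((D, N)),
                     0.1 * rng.standard_normal((D, N)),
                     0.1 * rng.standard_normal((D, D)))
```

In practice this loop is replaced by a parallel associative scan fused into a single GPU kernel; the Python loop is only to make the per-token dependence of Ā_t, B_t, and C_t explicit.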


Why It Works

Attention performs dynamic routing:

every token chooses what to read

Mamba performs dynamic memory:

every token chooses what to remember

Same capability — different mechanism.


Complexity

| Model | Complexity |
| --- | --- |
| Transformer | O(n²) |
| Mamba | O(n) |

This enables extremely long contexts.
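
As a rough worked example (ignoring constants and memory traffic), growing the context from n to 32n tokens, say 4k to 128k, changes the two costs very differently:

$$
\frac{(32n)^2}{n^2} = 1024 \qquad \text{vs.} \qquad \frac{32n}{n} = 32
$$

Quadratic attention pays roughly 1024× more compute; a linear-time scan pays 32×.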


Mamba-2 (2024) — Transformers Are Just Special SSMs

Paper: Transformers are SSMs: Generalized Models and Efficient Algorithms

This paper reframed the theory:

A broad class of attention mechanisms can be rewritten as structured state-space models, and vice versa; the paper calls this correspondence structured state space duality (SSD).

Meaning:

Transformers and SSMs are not competitors — they are the same family.

Mamba-2 introduces:

  • improved parallel training
  • better stability
  • higher throughput kernels

The biggest conceptual shift:

$$
\text{Attention} \subset \text{State Space Models}
$$

This unified sequence modeling theory.
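
One way to see the family resemblance, as a toy illustration rather than the SSD algorithm from the paper: drop the softmax from a causally masked linear attention layer, and the same output can be computed either as an O(n²) masked matrix product (the attention view) or as an O(n) recurrence over a running state S_t = S_{t-1} + k_t v_t^T (the state-space view).

```python
import numpy as np

def causal_linear_attention_quadratic(Q, K, V):
    """O(L^2): explicit pairwise scores with a causal mask (attention view)."""
    scores = np.tril(Q @ K.T)        # token t only sees positions s <= t
    return scores @ V

def causal_linear_attention_recurrent(Q, K, V):
    """O(L): carry a state matrix forward (state-space / RNN view)."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))         # running sum of outer products k_s v_s^T
    out = np.zeros((Q.shape[0], d_v))
    for t in range(Q.shape[0]):
        S = S + np.outer(K[t], V[t]) # state update (here A = I; SSMs add a decay)
        out[t] = Q[t] @ S            # readout plays the role of C
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 3)) for _ in range(3))
assert np.allclose(causal_linear_attention_quadratic(Q, K, V),
                   causal_linear_attention_recurrent(Q, K, V))
```

Roughly speaking, Mamba-2's duality generalizes this correspondence: input-dependent decay on the state-update side corresponds to a structured (semiseparable) mask on the attention side.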


Practical Impact

Mamba-1 showed that a linear-time model can compete with transformers.

Mamba-2 showed why: attention and SSMs sit in the same family.

Now SSMs are not approximations — they are a general framework.


Hybrid Models — Jamba and Mamba-Transformer Mixtures

After Mamba-2, a practical division of labor became clear:

Attention is good at precise token-level retrieval and reasoning
SSM layers are good at cheap long-range memory

So hybrid models emerged:

| Layer type | Role |
| --- | --- |
| Attention | precise reasoning |
| Mamba | long memory |

This led to models such as Jamba that handle context windows in the hundreds of thousands of tokens, with research prototypes pushing toward million-token contexts.
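
Architecturally, a hybrid is mostly a layer-ordering decision. The sketch below is a schematic, not Jamba's actual configuration: the class names are hypothetical and the 1-in-8 attention ratio is chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class LayerSpec:
    kind: str       # "attention" or "mamba"
    d_model: int

def build_hybrid_stack(n_layers: int, d_model: int, attention_every: int = 8):
    """Interleave one attention layer per `attention_every` layers, Mamba elsewhere."""
    return [
        LayerSpec("attention" if i % attention_every == attention_every - 1 else "mamba", d_model)
        for i in range(n_layers)
    ]

stack = build_hybrid_stack(n_layers=32, d_model=4096)
print(sum(s.kind == "attention" for s in stack), "attention /", len(stack), "total layers")
```

The trade is explicit: each attention layer buys precise token-to-token lookup at quadratic cost over its window, while the Mamba layers keep the bulk of the stack linear in sequence length.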


Mamba-3 Direction — Toward Reasoning Models

Recent follow-up work (2024-2025) explores using Mamba-style recurrence for reasoning:

Observation:

Reasoning chains resemble recurrent computation more than retrieval.

Instead of:

$$
\text{look up past tokens}
$$

the model:

$$
\text{updates a latent thought state}
$$

This aligns SSMs with cognitive architectures rather than text similarity engines.


Comparison Across Versions

| Version | Key Idea | Limitation Solved |
| --- | --- | --- |
| S4 | continuous memory | no token selectivity |
| Mamba | selective memory | weak theory |
| Mamba-2 | unified with transformers | training stability |
| Hybrid Mamba | combine attention | reasoning precision |
| New SSM reasoning | latent thought state | scaling reasoning |

Intuition

Transformers:

read the past repeatedly

Mamba:

carry the past forward

This difference explains why Mamba excels at long sequences.


Why This Matters

For transformers, compute cost is tied directly to context length:

$$
\text{more context} \rightarrow \text{more cost}
$$

Mamba changes that relationship: the recurrent state has a fixed size, so the cost of each new token stays roughly flat as the context grows:

$$
\text{more context} \approx \text{same cost per token}
$$

So future models may scale by processing longer sequences and running longer chains of computation, rather than by buying quadratically more hardware.


Final Takeaway

Mamba is not just a faster transformer.

It is a shift from retrieval-based memory to state-based cognition.

Transformers simulate memory by repeatedly reading tokens.
State-space models simulate memory by continuously updating belief.

The long-term implication:

AI may move from attention architectures to dynamical systems.

References

Gu, Goel & Ré, Efficiently Modeling Long Sequences with Structured State Spaces (S4), 2021
https://arxiv.org/abs/2111.00396

Gu & Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces, 2023
https://arxiv.org/abs/2312.00752

Dao & Gu, Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2), 2024
https://arxiv.org/abs/2405.21060

Lieber et al. (AI21 Labs), Jamba: A Hybrid Transformer-Mamba Language Model, 2024
https://arxiv.org/abs/2403.19887