The Evolution of Mamba: From Selective State Spaces to Linear-Time Reasoning Models
Transformers dominated language modeling for years because of one core property:
$$
\text{Global attention} \Rightarrow \text{long-range dependency modeling}
$$
But they pay a price:
$$
\text{Cost} = O(n^2)
$$
As context length grows, compute explodes.
Mamba introduced a different path:
keep long-range reasoning, drop the quadratic cost of attention.
Instead of attention, it uses a learned dynamical system — a Selective State Space Model (SSM).
This post traces how Mamba evolved across versions and why each step mattered.
Before Mamba: Structured State Space Models (S4)
The precursor to Mamba is S4 (Structured State Space Sequence Model).
Key idea:
Represent sequence processing as a continuous-time dynamical system:
$$
h'(t) = A h(t) + B u(t) \\
y(t) = C h(t)
$$
Instead of attention comparing tokens, the model updates an internal state that remembers history.
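To make this concrete, here is a minimal NumPy sketch of a discretized linear SSM with fixed parameters. The matrices, dimensions, and values are illustrative, not taken from the S4 paper, and discretization details such as zero-order hold and HiPPO initialization are omitted:

```python
import numpy as np

def linear_ssm(u, A, B, C):
    """Discretized linear SSM: h_{t+1} = A h_t + B u_t,  y_t = C h_t."""
    h = np.zeros(A.shape[0])      # hidden state that carries all history
    ys = []
    for u_t in u:                 # one constant-size update per token: O(n) overall
        h = A @ h + B * u_t       # fixed dynamics, identical for every token
        ys.append(C @ h)          # readout
    return np.array(ys)

# toy example: scalar input sequence, 4-dimensional state
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)               # slowly decaying memory of the past
B = rng.normal(size=4)
C = rng.normal(size=4)
y = linear_ssm(rng.normal(size=16), A, B, C)
print(y.shape)                    # (16,)
```

Because A, B, and C never change, every token is filtered in exactly the same way, which is the weakness listed below.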
Advantages:
- linear time complexity
- theoretically infinite context
Weakness:
- static filtering — cannot adapt per token
So S4 remembered everything but could not decide what matters.
Mamba (2023) — Selective State Space Models
Paper: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
The core innovation:
Make the state update depend on the token.
Instead of fixed dynamics:
$$
h_{t+1} = Ah_t + Bx_t
$$
Mamba introduces selection:
$$
h_{t+1} = A(x_t)h_t + B(x_t)x_t
$$
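A minimal sketch of the selective update follows. It is illustrative only: in the actual Mamba layer the step size Δ and the matrices B and C are linear projections of the full token embedding, and A becomes input-dependent through discretization; the scalar-feature toy below just shows the gating idea:

```python
import numpy as np

def selective_ssm(x, w_delta, W_B, W_C, A):
    """Toy selective scan: each token chooses how much to forget and what to write."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                                  # x_t: a scalar token feature (toy setup)
        delta = np.log1p(np.exp(w_delta * x_t))    # softplus -> per-token step size
        A_bar = np.exp(delta * A)                  # input-dependent decay (forget gate)
        B_t = W_B * x_t                            # input-dependent write direction
        C_t = W_C * x_t                            # input-dependent readout
        h = A_bar * h + delta * B_t * x_t          # selective state update
        ys.append(np.dot(C_t, h))
    return np.array(ys)

rng = np.random.default_rng(0)
d_state = 8
A = -np.abs(rng.normal(size=d_state))   # negative entries -> stable, decaying dynamics
w_delta = rng.normal()
W_B, W_C = rng.normal(size=d_state), rng.normal(size=d_state)
y = selective_ssm(rng.normal(size=32), w_delta, W_B, W_C, A)
print(y.shape)                           # (32,)
```

A large Δ lets a token overwrite the state; a Δ near zero lets it pass through almost untouched, which is what gives the model the abilities listed next.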
Now the model can:
- forget irrelevant tokens
- preserve key information
- behave similarly to attention
This is the moment SSMs become competitive with Transformers.
Why It Works
Attention performs dynamic routing:
every token chooses what to read
Mamba performs dynamic memory:
every token chooses what to remember
Same capability — different mechanism.
Complexity
| Model | Complexity |
|---|---|
| Transformer | O(n²) |
| Mamba | O(n) |
This enables extremely long contexts.
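A rough back-of-the-envelope comparison makes the gap tangible (the dimensions below are made up; only the growth rates matter):

```python
def attention_ops(n, d=64):
    """Self-attention: every token interacts with every token -> O(n^2 * d)."""
    return n * n * d

def ssm_ops(n, d_state=16, d=64):
    """SSM scan: one constant-size state update per token -> O(n * d_state * d)."""
    return n * d_state * d

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: attention/ssm cost ratio ~ {attention_ops(n) / ssm_ops(n):,.0f}")
# the ratio grows linearly with n: ~62, ~625, ~6,250
```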
Mamba-2 (2024) — Transformers Are Just Special SSMs
Paper: Transformers are SSMs: Generalized Models and Efficient Algorithms
This paper reframed the theory:
A broad class of attention variants can be written as discretized state-space models.
Meaning:
Transformers and SSMs are not competitors — they are the same family.
Mamba-2 introduces:
- improved parallel training
- better stability
- higher throughput kernels
The biggest conceptual shift:
$$
\text{Attention} \subset \text{State Space Models}
$$
This unified sequence modeling theory.
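The duality is easiest to see with causal linear attention, which can be computed either as a masked matrix product (attention view) or as a recurrence over a fixed-size state (SSM view). The sketch below uses unnormalized linear attention with made-up dimensions as a simplified stand-in for the structured masked attention family analyzed in the Mamba-2 paper:

```python
import numpy as np

def linear_attention_quadratic(Q, K, V):
    """Attention view: score every pair of tokens, apply a causal mask. O(n^2)."""
    scores = np.tril(Q @ K.T)        # (n, n) pairwise interactions, causal
    return scores @ V

def linear_attention_recurrent(Q, K, V):
    """SSM view: the same map computed with a running (d_k, d_v) state. O(n)."""
    S = np.zeros((K.shape[1], V.shape[1]))   # fixed-size summary of the past
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        S = S + np.outer(K[t], V[t])         # state update: absorb this token
        out[t] = Q[t] @ S                    # readout: query the state
    return out

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
V = rng.normal(size=(16, 4))
print(np.allclose(linear_attention_quadratic(Q, K, V),
                  linear_attention_recurrent(Q, K, V)))   # True: same function, two views
```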
Practical Impact
Mamba-1 demonstrated that a linear-time model can match attention quality.
Mamba-2 supplied the theory that connects SSMs and attention.
SSMs are no longer positioned as an approximation of attention; they are a general framework in their own right.
Hybrid Models — Jamba and Mamba-Transformer Mixtures
After Mamba-2, research converged on a division of labor:
attention is strong at precise recall and in-context retrieval
SSM layers carry long-range context cheaply
So hybrid models emerged:
| Layer type | Role |
|---|---|
| Attention | precise reasoning |
| Mamba | long memory |
This led to models with context windows in the hundreds of thousands of tokens (Jamba supports 256K).
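At the architectural level, a hybrid is just an interleaving pattern of layer types. The sketch below is a hypothetical configuration, not Jamba's actual block layout (Jamba additionally uses mixture-of-experts layers); it only illustrates mixing a few attention layers into a mostly-SSM stack:

```python
from dataclasses import dataclass

@dataclass
class LayerSpec:
    kind: str     # "attention" or "mamba"
    index: int

def build_hybrid_stack(n_layers: int, attention_every: int = 8) -> list:
    """Mostly Mamba layers, with one attention layer every `attention_every` layers."""
    stack = []
    for i in range(n_layers):
        kind = "attention" if (i + 1) % attention_every == 0 else "mamba"
        stack.append(LayerSpec(kind=kind, index=i))
    return stack

stack = build_hybrid_stack(n_layers=32)
print([layer.kind for layer in stack[:8]])
# ['mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'attention']
```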
Mamba-3 Direction — Toward Reasoning Models
Recent follow-up work (2024-2025) explores using Mamba-style recurrence for reasoning:
Observation:
Reasoning chains resemble recurrent computation more than retrieval.
Instead of:
$$
\text{look up past tokens}
$$
the model:
$$
\text{updates a latent thought state}
$$
This aligns SSMs with cognitive architectures rather than text similarity engines.
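Purely as an illustration (this is not an algorithm from any of the cited papers), a "latent thought state" can be pictured as a fixed-size vector revised once per reasoning step instead of re-reading the whole history; the recurrence and dimensions below are hypothetical:

```python
import numpy as np

def reasoning_scan(step_embeddings, W, U):
    """Hypothetical sketch: fold reasoning-step embeddings into one latent state."""
    thought = np.zeros(W.shape[0])                 # latent "thought" state
    for s in step_embeddings:                      # one update per step, O(n) overall
        thought = np.tanh(W @ thought + U @ s)     # revise the belief, don't re-read history
    return thought

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d)) / np.sqrt(d)
U = rng.normal(size=(d, d)) / np.sqrt(d)
steps = rng.normal(size=(10, d))                   # ten reasoning-step embeddings
print(reasoning_scan(steps, W, U).shape)           # (16,)
```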
Comparison Across Versions
| Version | Key Idea | Limitation Solved |
|---|---|---|
| S4 | continuous-time memory | quadratic attention cost |
| Mamba | selective memory | no per-token selectivity |
| Mamba-2 | unified with attention | weak theory, training throughput |
| Hybrid Mamba | interleave attention and SSM layers | precise recall and reasoning |
| SSM reasoning models | latent thought state | scaling long reasoning chains |
Intuition
Transformers:
read the past repeatedly
Mamba:
carry the past forward
This difference explains why Mamba excels at long sequences.
Why This Matters
For transformers, serving a longer context directly costs more compute:
$$
\text{more context} \rightarrow \text{more cost}
$$
Mamba changes the relationship, because the recurrent state has a fixed size:
$$
\text{more context} \rightarrow \text{same per-token cost}
$$
So future models may extend context length without a proportional increase in hardware.
Final Takeaway
Mamba is not just a faster transformer.
It is a shift from retrieval-based memory to state-based cognition.
Transformers simulate memory by repeatedly reading tokens.
State-space models simulate memory by continuously updating an internal belief state.
The long-term implication:
AI may move from attention architectures to dynamical systems.
References
Gu, Goel & Ré, Efficiently Modeling Long Sequences with Structured State Spaces (S4), 2021.
https://arxiv.org/abs/2111.00396
Gu & Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces, 2023.
https://arxiv.org/abs/2312.00752
Dao & Gu, Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2), 2024.
https://arxiv.org/abs/2405.21060
Lieber et al. (AI21 Labs), Jamba: A Hybrid Transformer-Mamba Language Model, 2024.
https://arxiv.org/abs/2403.19887