Multiplex Thinking vs Maximum Likelihood Reinforcement Learning (MLRL)
How two 2025 reasoning training paradigms independently rediscovered “optimize search success instead of single‑trajectory accuracy”.
TL;DR
Both papers try to train language models for Pass@K / best‑of‑N decoding success rather than single‑sample correctness — but they approach the problem from completely different directions:
Multiplex Thinking improves the reasoning path
MLRL improves the probability landscape
Together, they approximate training a model where search almost always succeeds.
The Shared Problem
Traditional alignment trains models to maximize expected reward over single sampled responses:
$$ \mathbb{E}_{y \sim \pi}[r(y)] $$
But reasoning systems are evaluated using:
$$
\text{Pass@}K = P(\exists\, y_i \text{ correct among } K \text{ samples})
$$
This mismatch explains why:
- majority voting works
- best‑of‑N sampling works
- self‑consistency works
- tree search works
Modern reasoning papers therefore change training to match inference.
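To make the mismatch concrete, here is a minimal Monte Carlo sketch (all numbers hypothetical) comparing the two quantities for a toy policy whose samples are independently correct with probability 0.2: the expected reward stays at 0.2, while Pass@8 is roughly 0.83.

```python
import random

# Toy setup (hypothetical numbers): a policy whose samples are
# independently correct with probability p per draw.
p_correct = 0.2   # single-sample accuracy = E[r(y)]
K = 8             # number of samples drawn at inference time

def sample_is_correct() -> bool:
    """Simulate drawing one answer and grading it."""
    return random.random() < p_correct

def pass_at_k(k: int, trials: int = 100_000) -> float:
    """Monte Carlo estimate of P(at least one of k samples is correct)."""
    hits = sum(any(sample_is_correct() for _ in range(k)) for _ in range(trials))
    return hits / trials

expected_reward = p_correct                  # what single-sample RL optimizes
closed_form = 1 - (1 - p_correct) ** K       # Pass@K for i.i.d. samples

print(f"E[r(y)]      = {expected_reward:.3f}")
print(f"Pass@{K} (MC) = {pass_at_k(K):.3f}")
print(f"Pass@{K} (CF) = {closed_form:.3f}")
```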
What Each Paper Treats as the Training Unit
| Method | Training Unit |
|---|---|
| PPO / RLHF | trajectory |
| DPO | trajectory pair |
| Multiplex Thinking | reasoning states (graph nodes) |
| MLRL | success event likelihood |
So:
- Multiplex changes the data structure
- MLRL changes the objective function
Multiplex Thinking (Microsoft 2025)
Core Idea
Instead of treating sampled reasoning chains independently, merge them into a reasoning graph (DAG):
```
A → B → C
   ↙ ↘
  D   X
 ↙ ↘   ↘
E   Z   Y
```
The model is trained so that, at each reasoning step, at least one branch remains on a correct path.
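As a rough illustration of the data structure (not the paper's exact branch-and-merge construction), sampled chains that share a prefix of reasoning steps can be folded into a trie-like graph whose nodes are reasoning states and whose edges are candidate next steps. The hypothetical `merge_chains` helper below does only the prefix merge and does not re-merge branches that later reconverge.

```python
from collections import defaultdict

# Minimal sketch (not the paper's exact construction): merge sampled
# reasoning chains that share step prefixes into a trie-like graph,
# so each node is a reasoning state and edges are candidate next steps.

def merge_chains(chains: list[list[str]]) -> dict[tuple[str, ...], set[str]]:
    """Map each shared prefix (reasoning state) to its observed next steps."""
    graph: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for chain in chains:
        for t in range(len(chain)):
            prefix = tuple(chain[:t])      # state reached after t steps
            graph[prefix].add(chain[t])    # branch taken from that state
    return graph

# Hypothetical sampled chains over reasoning steps A, B, C, ...
chains = [["A", "B", "C"], ["A", "B", "D"], ["A", "X", "Y"]]
for state, branches in merge_chains(chains).items():
    print(state, "->", sorted(branches))
```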
Objective
$$
\max_\theta \sum_t
\log
\frac{\text{probability mass on good continuations at step } t}
     {\text{probability mass on all candidate continuations at step } t}
$$
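A minimal sketch of what one term of this objective could look like, assuming we already know each candidate continuation's probability under the model and which candidates lead to a correct answer (the paper's actual credit-assignment rule and normalization may differ; `multiplex_step_loss` is a hypothetical helper):

```python
import math

# Sketch of the per-step loss (assumed form, not the paper's exact code):
# maximize log( mass on good continuations / mass on all candidate continuations ).

def multiplex_step_loss(probs: dict[str, float], good: set[str]) -> float:
    """Negative log of the fraction of next-step probability mass that
    stays on continuations known to lead to a correct answer."""
    total_mass = sum(probs.values())
    good_mass = sum(p for branch, p in probs.items() if branch in good)
    return -math.log(good_mass / total_mass)

# Hypothetical branch probabilities at one reasoning state.
probs = {"B": 0.5, "X": 0.3, "D": 0.2}
loss = multiplex_step_loss(probs, good={"B", "D"})
print(f"step loss = {loss:.3f}")   # lower when more mass sits on good branches
```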
What Gets Reinforced
Correct reasoning decisions rather than full trajectories.
Interpretation
The model learns:
Place probability mass where correct answers live.
It optimizes search success probability.
What Is Actually Being Optimized
| Method | Reinforces |
|---|---|
| PPO | high‑reward trajectory |
| DPO | preferred trajectory |
| Multiplex | correct reasoning decisions |
| MLRL | successful outcome set |
Relation to Pass@K
| Paper | How Pass@K Improves |
|---|---|
| Multiplex | ensures at least one reasoning branch survives |
| MLRL | increases probability mass of the success region |
Structural improvement (Multiplex) vs distributional improvement (MLRL).
Training Signal Granularity
| Level | Multiplex | MLRL |
|---|---|---|
| Token | ✔ | ✖ |
| Step | ✔ | ✖ |
| Trajectory | ✔ | ✔ |
| Trajectory Set | ✔ | ✔ |
| Success Event | indirect | direct |
Multiplex = micro credit assignment
MLRL = macro probability shaping
Why They Produce Similar Results
Both improve best‑of‑N performance while barely affecting single‑sample accuracy, because the quantity they optimize is search success, and
$$
\text{search success} \neq \text{single-sample accuracy}
$$
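One way to see the decoupling, under the assumptions of i.i.d. sampling and greedy decoding for the single-sample metric (hypothetical numbers): training can move sampled probability mass toward correct answers without changing the greedy answer, so greedy accuracy stays flat while Pass@K rises sharply.

```python
# Hypothetical answer distributions before and after "search-aware" training.
# The greedy (argmax) answer is the same wrong answer in both cases, so
# greedy single-sample accuracy does not move; Pass@K (i.i.d. samples) does.

def pass_at_k(p_correct: float, k: int) -> float:
    return 1 - (1 - p_correct) ** k

before = {"wrong_A": 0.6, "correct": 0.1, "wrong_B": 0.3}
after  = {"wrong_A": 0.5, "correct": 0.35, "wrong_B": 0.15}

for name, dist in [("before", before), ("after", after)]:
    greedy = max(dist, key=dist.get)
    greedy_acc = 1.0 if greedy == "correct" else 0.0
    print(f"{name}: greedy acc = {greedy_acc:.0%}, "
          f"Pass@8 = {pass_at_k(dist['correct'], 8):.2f}")
```

Here greedy accuracy is 0% in both cases, but Pass@8 climbs from about 0.57 to about 0.97.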
Unified Mental Model
| Paradigm | Meaning |
|---|---|
| RLHF | learn correct answers |
| Multiplex | learn reasoning that survives search |
| MLRL | learn a probability distribution searchable by sampling |
The Deeper Connection
They are actually dual views of the same latent objective:
$$
\max P(\text{tree search finds solution})
$$
- Multiplex approximates gradient via credit assignment in reasoning space
- MLRL approximates gradient via likelihood in outcome space
Modern reasoning RL combines both.
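To make the outcome-space view concrete, here is a toy sketch (an assumed objective for illustration, not the released maxrl code): treat the success event as the thing whose log-likelihood is maximized, and compare its gradient with the standard expected-reward gradient on a three-way categorical "policy".

```python
import torch

# Toy sketch (assumed objective for illustration, not the released maxrl code):
# maximize log P(success) = log E_{y ~ pi}[1{y is correct}] for a tiny
# categorical "policy" over three candidate answers, where answer 1 is correct.

logits = torch.zeros(3, requires_grad=True)   # policy parameters
correct = torch.tensor([0.0, 1.0, 0.0])       # indicator of the success region

probs = torch.softmax(logits, dim=0)
p_success = (probs * correct).sum()           # P(success) under the policy

# MLRL-style objective: negative log-likelihood of the success event.
mlrl_loss = -torch.log(p_success)
mlrl_loss.backward()
print("grad of -log P(success):", logits.grad)

# Standard expected-reward objective, for comparison.
logits.grad = None
probs = torch.softmax(logits, dim=0)
rl_loss = -(probs * correct).sum()
rl_loss.backward()
print("grad of -E[r]:          ", logits.grad)
```

In this toy, the two gradients point in the same direction, but the log-likelihood version is scaled by 1/P(success), so rare successes produce proportionally larger updates; that rescaling is the intuition behind "likelihood in outcome space" here, not a claim about the paper's exact implementation.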
Final Intuition
Multiplex improves the path
MLRL improves the probability landscape
Future methods improve both — making search reliable rather than lucky
References
- Tang et al., "Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge", 2026. Paper: https://arxiv.org/abs/2601.08808. Code: https://github.com/GMLR-Penn/Multiplex-Thinking
- Tajwar et al., "Maximum Likelihood Reinforcement Learning", 2026. Paper: https://arxiv.org/abs/2602.02710. Code: https://github.com/tajwarfahim/maxrl