Multiplex Thinking vs Maximum Likelihood Reinforcement Learning (MLRL)

How two 2025 reasoning training paradigms independently rediscovered “optimize search success instead of single‑trajectory accuracy”.

TL;DR

Both papers try to train language models for Pass@K / best‑of‑N decoding success rather than single‑sample correctness — but they approach the problem from completely different directions:

Multiplex Thinking improves the reasoning path
MLRL improves the probability landscape

Together, they approximate training a model where search almost always succeeds.

The Shared Problem

Traditional alignment trains models to maximize the expected reward of a single sampled response:

$$
\max_\theta \; \mathbb{E}_{y \sim \pi_\theta}[r(y)]
$$
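In practice this expectation is climbed with the standard score‑function (REINFORCE) identity, shown here only to make the later "training unit = one trajectory" point concrete; it is generic policy‑gradient math, not quoted from either paper:

$$
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta}[r(y)]
= \mathbb{E}_{y \sim \pi_\theta}\!\big[\, r(y)\, \nabla_\theta \log \pi_\theta(y) \,\big]
$$

Each gradient sample credits or blames exactly one sampled trajectory.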

But reasoning systems are evaluated using:

$$
\text{Pass@}K = P\big(\exists\, i \le K : y_i \text{ correct}\big),
\qquad y_1, \dots, y_K \sim \pi_\theta
$$
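In evaluations, Pass@K is usually estimated from n ≥ K samples per problem with the combinatorial estimator popularized by code‑generation benchmarks; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n samples per problem, c of which were correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples drawn, 3 of them correct
print(pass_at_k(n=16, c=3, k=8))  # 0.9
```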

This mismatch explains why:

  • majority voting works
  • best‑of‑N sampling works
  • self‑consistency works
  • tree search works

Modern reasoning papers therefore change training to match inference.
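The strategies in the list above are all thin wrappers around "draw K samples, then pick"; a minimal sketch, where `generate`, `score`, and `extract_answer` are placeholder callables rather than any particular library's API:

```python
from collections import Counter
from typing import Callable

def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Best-of-N: draw n samples and keep the one a verifier / reward model scores highest."""
    samples = [generate() for _ in range(n)]
    return max(samples, key=score)

def self_consistency(generate: Callable[[], str], extract_answer: Callable[[str], str], n: int) -> str:
    """Majority voting / self-consistency: draw n chains and return the most common final answer."""
    answers = [extract_answer(generate()) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Both only pay off when the sampled set contains at least one good candidate, which is exactly the Pass@K event defined above.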

What Each Paper Treats as the Training Unit

| Method | Training Unit |
| --- | --- |
| PPO / RLHF | trajectory |
| DPO | trajectory pair |
| Multiplex Thinking | reasoning states (graph nodes) |
| MLRL | success event likelihood |

So:

  • Multiplex changes the data structure
  • MLRL changes the objective function
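To make those two bullets concrete, here is a sketch of what one training example might look like under each paradigm; every class name is hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Trajectory:                  # PPO / RLHF: one sampled chain with a scalar reward
    tokens: List[int]
    reward: float

@dataclass
class TrajectoryPair:              # DPO: a preferred / rejected pair
    preferred: List[int]
    rejected: List[int]

@dataclass
class ReasoningNode:               # Multiplex: a shared reasoning state in the merged graph
    state: str
    children: List["ReasoningNode"] = field(default_factory=list)
    can_reach_correct_answer: bool = False

@dataclass
class SampleSet:                   # MLRL: the K-sample set whose success event is the target
    samples: List[List[int]]
    any_correct: bool
```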

Multiplex Thinking (Microsoft 2025)

Core Idea

Instead of treating sampled reasoning chains independently, merge them into a reasoning graph (DAG):

A → B → C
       ↙ ↘
      D   X
     ↙ ↘   ↘
    E   Z   Y

The model is trained so that, at each reasoning step, at least one branch still leads to a correct answer.
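A toy sketch of the merge itself, assuming reasoning states can be identified by a hashable key (here the raw step text, which is a big simplification; the letters mirror the diagram above):

```python
from collections import defaultdict
from typing import Dict, List, Set

def merge_chains(chains: List[List[str]]) -> Dict[str, Set[str]]:
    """Merge sampled reasoning chains into a DAG: state -> set of successor states."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for chain in chains:
        for parent, child in zip(chain, chain[1:]):
            graph[parent].add(child)
    return graph

# Chains sharing the prefix A -> B -> C collapse onto shared nodes
chains = [
    ["A", "B", "C", "D", "E"],
    ["A", "B", "C", "D", "Z"],
    ["A", "B", "C", "X", "Y"],
]
print(dict(merge_chains(chains)))
# e.g. {'A': {'B'}, 'B': {'C'}, 'C': {'D', 'X'}, 'D': {'E', 'Z'}, 'X': {'Y'}}
```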

Objective

$$
\max_\theta \sum_t
\log
\frac{\text{probability of good continuations}}
{\text{probability of all continuations}}
$$
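A hedged PyTorch sketch of that ratio at one reasoning step, assuming we already have log‑probabilities for a handful of candidate continuations plus labels marking which of them can still reach a correct answer; the function name, tensor layout, and labeling scheme are illustrative, not the paper's implementation:

```python
import torch

def step_level_loss(cont_logprobs: torch.Tensor, is_good: torch.Tensor) -> torch.Tensor:
    """Negative log of (mass on good continuations / mass on all candidates) at one step.

    cont_logprobs: (num_candidates,) log pi_theta(continuation | state)
    is_good:       (num_candidates,) bool, True if the continuation can still reach
                   a correct final answer
    """
    log_good = torch.logsumexp(cont_logprobs[is_good], dim=0)
    log_all = torch.logsumexp(cont_logprobs, dim=0)
    return -(log_good - log_all)

# Example: three candidate continuations at a node, two of which stay on a correct path
logp = torch.log(torch.tensor([0.5, 0.3, 0.2]))
good = torch.tensor([True, False, True])
print(step_level_loss(logp, good))  # -log(0.7 / 1.0) ~= 0.357
```

Summing this loss over the nodes of the merged graph gives the per‑step credit assignment described above.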

What Gets Reinforced

Correct reasoning decisions rather than full trajectories.

Interpretation

The model learns:

Place probability mass where correct answers live.

It optimizes search success probability.


What Is Actually Being Optimized

| Method | Reinforces |
| --- | --- |
| PPO | high‑reward trajectory |
| DPO | preferred trajectory |
| Multiplex | correct reasoning decisions |
| MLRL | successful outcome set |

Relation to Pass@K

| Paper | How Pass@K Improves |
| --- | --- |
| Multiplex | ensures at least one reasoning branch survives |
| MLRL | increases probability mass of the success region |

Structural improvement vs distributional improvement


Training Signal Granularity

| Level | Multiplex | MLRL |
| --- | --- | --- |
| Token | | |
| Step | ✓ | |
| Trajectory | | |
| Trajectory Set | | ✓ |
| Success Event | indirect | direct |

Multiplex = micro credit assignment
MLRL = macro probability shaping


Why They Produce Similar Results

Both improve best‑of‑N performance while barely moving single‑sample accuracy, because the quantity they optimize is search success, and

$$
\text{search success} \neq \text{single-sample accuracy}
$$
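A small numeric illustration, assuming for simplicity that a "diverse" model's K samples are independent while a "collapsed" model always returns the same answer (the numbers are made up):

```python
def pass_at_k_iid(p_correct: float, k: int) -> float:
    """Pass@K when the K samples are independent and each is correct with p_correct."""
    return 1.0 - (1.0 - p_correct) ** k

# Collapsed model: 60% single-sample accuracy, but every sample is identical,
# so extra samples add nothing.
collapsed_pass_at_8 = 0.60

# Diverse model: only 40% single-sample accuracy, but samples are independent.
diverse_pass_at_8 = pass_at_k_iid(0.40, 8)   # ~0.983

print(collapsed_pass_at_8, round(diverse_pass_at_8, 3))
```

The higher-accuracy model loses at K = 8, which is why optimizing single‑sample accuracy and optimizing search success are genuinely different targets.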

Unified Mental Model

| Paradigm | Meaning |
| --- | --- |
| RLHF | learn correct answers |
| Multiplex | learn reasoning that survives search |
| MLRL | learn a probability distribution searchable by sampling |

The Deeper Connection

They are actually dual views of the same latent objective:

$$
\max_\theta \; P_\theta(\text{tree search finds a solution})
$$

  • Multiplex approximates this gradient via credit assignment in reasoning space
  • MLRL approximates this gradient via likelihood in outcome space

Modern reasoning RL combines both.
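Under the simplifying assumption that the K explored branches succeed independently, the shared objective factors into a form that exposes both levers at once; this is an illustration of the duality, not a derivation from either paper:

$$
P_\theta(\text{tree search finds a solution})
\;\approx\; 1 - \prod_{i=1}^{K} \big(1 - p_\theta^{(i)}\big)
$$

where $p_\theta^{(i)}$ is the probability that branch $i$ reaches a correct answer. Multiplex pushes up the per‑branch terms $p_\theta^{(i)}$ by keeping good continuations alive at each step; MLRL pushes up the whole expression by shifting probability mass toward the success region.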


Final Intuition

Multiplex improves the path
MLRL improves the probability landscape
Future methods improve both, making search reliable rather than lucky

References