Multiplex Thinking vs Maximum Likelihood Reinforcement Learning (MLRL)

How two 2025 reasoning training paradigms independently rediscovered “optimize search success instead of single‑trajectory accuracy”.

TL;DR

Both papers try to train language models for Pass@K / best‑of‑N decoding success rather than single‑sample correctness — but they approach the problem from completely different directions:

Multiplex Thinking improves the reasoning path
MLRL improves the probability landscape

Together, they approximate training a model where search almost always succeeds.

The Shared Problem

Traditional alignment trains models to maximize the expected reward of a single sampled response:

$$
\max_\theta \; \mathbb{E}_{y \sim \pi_\theta}[r(y)]
$$
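In practice this expectation is climbed with the standard score‑function (REINFORCE) identity, shown here only to make the later "training unit = one trajectory" point concrete; it is generic policy‑gradient math, not quoted from either paper:

$$
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta}[r(y)]
= \mathbb{E}_{y \sim \pi_\theta}\!\big[\, r(y)\, \nabla_\theta \log \pi_\theta(y) \,\big]
$$

Each gradient sample credits or blames exactly one sampled trajectory.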

But reasoning systems are evaluated using:

$$
\text{Pass@}K = P\big(\exists\, i \le K : y_i \text{ correct}\big),
\qquad y_1, \dots, y_K \sim \pi_\theta
$$
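In evaluations, Pass@K is usually estimated from n ≥ K samples per problem with the combinatorial estimator popularized by code‑generation benchmarks; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n samples per problem, c of which were correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples drawn, 3 of them correct
print(pass_at_k(n=16, c=3, k=8))  # 0.9
```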

This mismatch explains why:

  • majority voting works
  • best‑of‑N sampling works
  • self‑consistency works
  • tree search works

Modern reasoning papers therefore change training to match inference.
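The strategies in the list above are all thin wrappers around "draw K samples, then pick"; a minimal sketch, where `generate`, `score`, and `extract_answer` are placeholder callables rather than any particular library's API:

```python
from collections import Counter
from typing import Callable

def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Best-of-N: draw n samples and keep the one a verifier / reward model scores highest."""
    samples = [generate() for _ in range(n)]
    return max(samples, key=score)

def self_consistency(generate: Callable[[], str], extract_answer: Callable[[str], str], n: int) -> str:
    """Majority voting / self-consistency: draw n chains and return the most common final answer."""
    answers = [extract_answer(generate()) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Both only pay off when the sampled set contains at least one good candidate, which is exactly the Pass@K event defined above.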

What Each Paper Treats as the Training Unit

| Method | Training Unit |
| --- | --- |
| PPO / RLHF | trajectory |
| DPO | trajectory pair |
| Multiplex Thinking | reasoning states (graph nodes) |
| MLRL | success event likelihood |

So:

  • Multiplex changes the data structure
  • MLRL changes the objective function
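To make those two bullets concrete, here is a sketch of what one training example might look like under each paradigm; every class name is hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Trajectory:                  # PPO / RLHF: one sampled chain with a scalar reward
    tokens: List[int]
    reward: float

@dataclass
class TrajectoryPair:              # DPO: a preferred / rejected pair
    preferred: List[int]
    rejected: List[int]

@dataclass
class ReasoningNode:               # Multiplex: a shared reasoning state in the merged graph
    state: str
    children: List["ReasoningNode"] = field(default_factory=list)
    can_reach_correct_answer: bool = False

@dataclass
class SampleSet:                   # MLRL: the K-sample set whose success event is the target
    samples: List[List[int]]
    any_correct: bool
```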

Multiplex Thinking (Microsoft 2025)

Core Idea

Instead of treating sampled reasoning chains independently, merge them into a reasoning graph (DAG):

A → B → C
       ↙ ↘
      D   X
     ↙ ↘   ↘
    E   Z   Y

The model is trained so that, at each reasoning step, at least one branch still leads to a correct answer.
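A toy sketch of the merge itself, assuming reasoning states can be identified by a hashable key (here the raw step text, which is a big simplification; the letters mirror the diagram above):

```python
from collections import defaultdict
from typing import Dict, List, Set

def merge_chains(chains: List[List[str]]) -> Dict[str, Set[str]]:
    """Merge sampled reasoning chains into a DAG: state -> set of successor states."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for chain in chains:
        for parent, child in zip(chain, chain[1:]):
            graph[parent].add(child)
    return graph

# Chains sharing the prefix A -> B -> C collapse onto shared nodes
chains = [
    ["A", "B", "C", "D", "E"],
    ["A", "B", "C", "D", "Z"],
    ["A", "B", "C", "X", "Y"],
]
print(dict(merge_chains(chains)))
# e.g. {'A': {'B'}, 'B': {'C'}, 'C': {'D', 'X'}, 'D': {'E', 'Z'}, 'X': {'Y'}}
```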

Objective

$$
\max_\theta \sum_t
\log
\frac{\text{probability of good continuations}}
{\text{probability of all continuations}}
$$
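A hedged PyTorch sketch of that ratio at one reasoning step, assuming we already have log‑probabilities for a handful of candidate continuations plus labels marking which of them can still reach a correct answer; the function name, tensor layout, and labeling scheme are illustrative, not the paper's implementation:

```python
import torch

def step_level_loss(cont_logprobs: torch.Tensor, is_good: torch.Tensor) -> torch.Tensor:
    """Negative log of (mass on good continuations / mass on all candidates) at one step.

    cont_logprobs: (num_candidates,) log pi_theta(continuation | state)
    is_good:       (num_candidates,) bool, True if the continuation can still reach
                   a correct final answer
    """
    log_good = torch.logsumexp(cont_logprobs[is_good], dim=0)
    log_all = torch.logsumexp(cont_logprobs, dim=0)
    return -(log_good - log_all)

# Example: three candidate continuations at a node, two of which stay on a correct path
logp = torch.log(torch.tensor([0.5, 0.3, 0.2]))
good = torch.tensor([True, False, True])
print(step_level_loss(logp, good))  # -log(0.7 / 1.0) ~= 0.357
```

Summing this loss over the nodes of the merged graph gives the per‑step credit assignment described above.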

What Gets Reinforced

Correct reasoning decisions rather than full trajectories.

Interpretation

The model learns:

Place probability mass where correct answers live.

It optimizes search success probability.


What Is Actually Being Optimized

| Method | Reinforces |
| --- | --- |
| PPO | high‑reward trajectory |
| DPO | preferred trajectory |
| Multiplex | correct reasoning decisions |
| MLRL | successful outcome set |

Relation to Pass@K

| Paper | How Pass@K Improves |
| --- | --- |
| Multiplex | ensures at least one reasoning branch survives |
| MLRL | increases probability mass of the success region |

Structural improvement vs distributional improvement


Training Signal Granularity

| Level | Multiplex | MLRL |
| --- | --- | --- |
| Token | | |
| Step | ✓ | |
| Trajectory | | |
| Trajectory Set | | ✓ |
| Success Event | indirect | direct |

Multiplex = micro credit assignment
MLRL = macro probability shaping


Why They Produce Similar Results

Both improve best‑of‑N performance while barely moving single‑sample accuracy, because the quantity they optimize is search success, and

$$
\text{search success} \neq \text{single-sample accuracy}
$$
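A small numeric illustration, assuming for simplicity that a "diverse" model's K samples are independent while a "collapsed" model always returns the same answer (the numbers are made up):

```python
def pass_at_k_iid(p_correct: float, k: int) -> float:
    """Pass@K when the K samples are independent and each is correct with p_correct."""
    return 1.0 - (1.0 - p_correct) ** k

# Collapsed model: 60% single-sample accuracy, but every sample is identical,
# so extra samples add nothing.
collapsed_pass_at_8 = 0.60

# Diverse model: only 40% single-sample accuracy, but samples are independent.
diverse_pass_at_8 = pass_at_k_iid(0.40, 8)   # ~0.983

print(collapsed_pass_at_8, round(diverse_pass_at_8, 3))
```

The higher-accuracy model loses at K = 8, which is why optimizing single‑sample accuracy and optimizing search success are genuinely different targets.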

Unified Mental Model

| Paradigm | Meaning |
| --- | --- |
| RLHF | learn correct answers |
| Multiplex | learn reasoning that survives search |
| MLRL | learn a probability distribution searchable by sampling |

The Deeper Connection

They are actually dual views of the same latent objective:

$$
\max_\theta \; P_\theta(\text{tree search finds a solution})
$$

  • Multiplex approximates this gradient via credit assignment in reasoning space
  • MLRL approximates this gradient via likelihood in outcome space

Modern reasoning RL combines both.
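Under the simplifying assumption that the K explored branches succeed independently, the shared objective factors into a form that exposes both levers at once; this is an illustration of the duality, not a derivation from either paper:

$$
P_\theta(\text{tree search finds a solution})
\;\approx\; 1 - \prod_{i=1}^{K} \big(1 - p_\theta^{(i)}\big)
$$

where $p_\theta^{(i)}$ is the probability that branch $i$ reaches a correct answer. Multiplex pushes up the per‑branch terms $p_\theta^{(i)}$ by keeping good continuations alive at each step; MLRL pushes up the whole expression by shifting probability mass toward the success region.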


Final Intuition

Multiplex improves the path
MLRL improves the probability landscape
Future methods improve both, making search reliable rather than lucky

References