How On Policy Self Distillation Works - Sasha Rush
TL;DR
On-policy distillation trains on the student's actual behavior: Instead of copying a teacher's full output sequence, the student generates its own trajectory and then matches the teacher's token probabilities on that exact path.
Rush's Nadal analogy makes the difference click: Classic distillation is like watching Rafael Nadal play and trying to imitate him, while on-policy distillation (OPD) is like playing your own bad tennis and having Nadal correct each mistake over your shoulder.
Composer 2.5 uses self-distillation because there is no better teacher model: Rather than relying on a stronger external model, the system creates a synthetic teacher by adding text feedback to the student's existing trajectory and comparing the log probabilities of the same tokens with and without that feedback.
This helps with credit assignment in long RL traces: When trajectories can run for hundreds of turns, RL struggles to pinpoint where an error began, so a reader model flags a specific problematic message and inserts feedback right there.
There is only one rollout, not two: The model does not regenerate a better sequence after feedback; it simply re-scores the same tokens under a slightly modified context, which avoids an extra decode step.
The tradeoff is local improvement, not dramatic jumps: Because the model learns from its own imperfect trajectory, it gets incremental corrections rather than the full benefit of seeing an ideal teacher path from start to finish.
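To make the first point concrete, here is a minimal sketch of a per-token matching signal computed along the student's own trajectory. It assumes the signal is a reverse KL, KL(student || teacher), at each position, which is one common choice and not something the summary confirms. The function names and the toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def per_token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position of the student's own path.

    Both inputs are lists of per-position logit rows. The teacher rows are
    assumed to come from scoring the exact tokens the student generated.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        kls.append(sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp)))
    return kls

# Toy example: 3-token vocabulary, 2 positions (made-up numbers).
student_logits = [[2.0, 0.5, 0.1], [0.2, 0.2, 3.0]]
teacher_logits = [[2.0, 0.5, 0.1], [3.0, 0.2, 0.2]]  # teacher disagrees at position 1
kls = per_token_reverse_kl(student_logits, teacher_logits)
```

In this toy case the first position contributes no signal because the two distributions agree, while the second position carries a large penalty. That is the "correction over your shoulder" in the analogy: the feedback lands on the specific step where the student went wrong.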
The Breakdown
Composer 2.5 speeds up RL by turning a model's own flawed trajectory into a teaching signal, without needing a stronger teacher model or a second rollout. Sasha Rush's key point is that you can inject targeted text feedback into the same token sequence, re-score it under the modified context, and use the KL gap between the two scorings as a local correction signal.
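The re-scoring step can be sketched as follows. This is an illustration under stated assumptions, not Composer 2.5's actual implementation: `score_fn` is a hypothetical callable that returns per-token log probabilities for fixed tokens given a context, `toy_score_fn` is a made-up stand-in for a real model, and the per-token log-probability difference is used as a single-sample proxy for the KL gap.

```python
def feedback_gap(score_fn, context, feedback, tokens):
    """Score the SAME tokens twice: without and with feedback in the context.

    No new tokens are generated, so there is one rollout and two scoring passes.
    Returns, per token, logp(with feedback) - logp(without feedback).
    """
    plain = score_fn(context, tokens)
    hinted = score_fn(context + feedback, tokens)
    return [h - p for p, h in zip(plain, hinted)]

def toy_score_fn(context, tokens):
    """Hypothetical stand-in for a model: 'rm' becomes unlikely once the
    context contains a warning against deleting files."""
    out = []
    for tok in tokens:
        logp = -1.0
        if "do not delete" in context and tok == "rm":
            logp = -4.0
        out.append(logp)
    return out

# The student's existing trajectory (toy tokens) and reader-style feedback.
tokens = ["ls", "rm", "cd"]
context = "Task: tidy the repo. "
feedback = "Feedback: do not delete files. "
gaps = feedback_gap(toy_score_fn, context, feedback, tokens)
```

Here only the second token gets a negative gap, so the correction is local to the step the feedback addresses. The other tokens are left alone, which matches the tradeoff described above: incremental fixes to an imperfect path, not a fresh ideal trajectory.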