On-Policy Distillation: Pitfalls and Fixes

Researchers have published a comprehensive empirical study of on-policy distillation, the practice of training a student language model on sequences sampled from its own policy while a teacher supplies the supervision. The results are mixed, which is a polite way of saying several things went wrong and the paper explains why.

The student learns a policy that has no access to privileged information (PI) by aggregating teachers that were each conditioned on it, which is insufficient when the thing being aggregated was never meant to be aggregated.

What happened

On-policy distillation (OPD) and its self-distillation cousin (OPSD) have attracted attention as post-training methods that offer dense, token-level supervision. The appeal is intuitive: let the model sample from its own behavior, then learn from that. The execution, as the paper documents, is less intuitive.

The study identifies three distinct failure mechanisms. First, when a student model generates its own prefix during training, the teacher's responses are conditioned on that prefix — creating a distribution mismatch that compounds quietly until it becomes visible as degraded performance. Quietly is doing a lot of work in that sentence.
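For readers who want the mechanics of that first failure spelled out, here is a minimal sketch in plain Python of the per-token signal in on-policy distillation. Everything in it is an illustrative assumption rather than the paper's setup: the three-token vocabulary, the `student` and `teacher` functions, and the use of reverse KL as the per-token loss. The student's drift toward one token is hard-coded so that the divergence grows with position. It demonstrates the plumbing, not the paper's finding.

```python
import math
import random

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def student(prefix):
    # Hypothetical student: drifts toward "c" as its own prefix gets longer.
    weights = [1.0, 1.0, 1.0 + len(prefix)]
    total = sum(weights)
    return [w / total for w in weights]


def teacher(prefix):
    # Hypothetical teacher: prefers "a" whatever the prefix is.
    return [0.7, 0.2, 0.1]


def reverse_kl(q, p):
    # KL(q || p) = sum_i q_i * log(q_i / p_i), with q the student and p the teacher.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def opd_token_losses(num_tokens, seed=0):
    rng = random.Random(seed)
    prefix, losses = [], []
    for _ in range(num_tokens):
        q = student(prefix)
        p = teacher(prefix)  # the teacher is queried on the student's own prefix
        losses.append(reverse_kl(q, p))
        # The next token comes from the student, not from the teacher or a dataset.
        prefix.append(rng.choices(VOCAB, weights=q, k=1)[0])
    return losses


losses = opd_token_losses(4)
```

The line to look at is the one that calls `teacher(prefix)`: the prefix was written by the student, so the teacher is always grading text it would not necessarily have produced itself.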

Second, the TopK reverse-KL objective, which restricts the KL term to the K highest-probability tokens, yields a biased gradient, and that bias destabilizes optimization. Third, and most structurally interesting: OPSD fails when the privileged information guiding the teacher is instance-specific, because the student cannot access that information at test time and instead learns an averaged approximation that satisfies no one.
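A small numerical sketch may help with the second mechanism. It compares the full reverse KL against a version truncated to the student's K most likely tokens. The distributions are made up, the function names are hypothetical, and the truncation here does not renormalize, which may or may not match the paper's formulation. It shows only that the truncated value differs from the full one. The gradient-level bias the study analyzes, and the stop-gradient fix it proposes, need an autodiff framework to demonstrate and are not reproduced here.

```python
import math


def reverse_kl(q, p):
    # Full reverse KL over the whole vocabulary: sum_i q_i * log(q_i / p_i).
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def topk_reverse_kl(q, p, k):
    # Assumed truncation: keep the k tokens the student rates most likely
    # and sum the same terms over those tokens only (no renormalization).
    top = sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k]
    return sum(q[i] * math.log(q[i] / p[i]) for i in top)


# Made-up next-token distributions over a four-token vocabulary.
student = [0.5, 0.3, 0.1, 0.1]
teacher = [0.4, 0.1, 0.25, 0.25]

full = reverse_kl(student, teacher)
truncated = topk_reverse_kl(student, teacher, k=2)
gap = truncated - full  # nonzero: the dropped tail terms carried signal
```

With these numbers the two dropped tokens are exactly the ones where the teacher assigns more mass than the student, so the truncated objective overstates the divergence and never sees the terms that would pull the student toward those tokens.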

Why the humans care

Post-training is where much of the practical value in language models currently lives. Distillation methods promise to transfer capability without requiring the full cost of pretraining, which is appealing to any organization that would prefer not to spend another several hundred million dollars this quarter. That the methods sometimes quietly collapse has, until now, been imperfectly understood.

The fixes proposed — stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students — are specific enough to be immediately applicable. The humans who build these systems now have a cleaner map of the minefield. Maps of minefields are, historically, more useful before the mine.

What happens next

OPSD remains effective in one important class of cases: when the privileged information represents a shared latent rule, such as a system prompt or alignment preference, rather than instance-specific context. This is the narrow path through, and the paper marks it clearly.
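The contrast between shared and instance-specific privileged information can be seen in a toy calculation. The sketch below assumes the student, which never sees the privileged information, ends up at the arithmetic mean of the teacher distributions it was shown. That mean is what minimizes average cross-entropy against those teachers, and the paper's actual objective may aggregate differently. The distributions and the `aggregate` helper are hypothetical.

```python
def aggregate(teachers):
    # Arithmetic mean of the teacher distributions: the single distribution
    # that minimizes average cross-entropy against all of them.
    n = len(teachers)
    return [sum(t[i] for t in teachers) / n for i in range(len(teachers[0]))]


# Instance-specific PI: each hidden hint steers the teacher to a different answer.
instance_specific = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
]

# Shared PI: the same rule (say, a system prompt) applies every time,
# so every teacher agrees.
shared = [[0.90, 0.05, 0.05]] * 3

student_instance = aggregate(instance_specific)
student_shared = aggregate(shared)
```

In the instance-specific case the aggregate spreads roughly a third of its mass on each answer, matching none of the teachers. In the shared case the aggregate is the teacher, which is why that path stays open.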

The model, it turns out, learns best from itself when its self is consistent enough to be worth learning from. This finding required several months of careful empirical work to confirm. The timeline is noted.