A preprint from an undergraduate researcher identifies exactly why merging multi-timescale advantage estimates inside a PPO actor-critic tends to end in policy collapse — and proposes a structural fix that sidesteps both failure modes without hyperparameter tuning.
What's new
The paper, posted to arXiv and accompanied by a minimal PyTorch reproducer, diagnoses two distinct pathologies. First: when a temporal attention router is exposed to policy gradients, the optimizer games the surrogate loss by manipulating attention weights rather than improving actual control — call it surrogate objective hacking. Second: swap in a gradient-free router like inverse-variance weighting, and it hard-locks onto short-horizon estimates because their uncertainty is structurally lower. In delayed-reward environments, this produces agents that hover indefinitely, collecting small shaping rewards rather than committing to a terminal goal. The researcher calls this the paradox of temporal uncertainty.
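The second pathology is easy to reproduce in isolation. Here is a minimal sketch (the horizon values, noise model, and variable names are illustrative assumptions, not taken from the paper): because short n-step horizons bootstrap after fewer stochastic transitions, their advantage estimates have structurally lower variance, so an inverse-variance router concentrates nearly all of its weight on them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical n-step horizons: short, medium, long.
horizons = [1, 8, 64]

# Simulated per-horizon advantage samples. Variance grows with horizon
# because more stochastic transitions are summed before bootstrapping.
samples = {h: rng.normal(loc=0.5, scale=0.2 * h**0.5, size=256) for h in horizons}

# Gradient-free inverse-variance router: weight each horizon by 1/variance.
variances = np.array([samples[h].var() for h in horizons])
weights = (1.0 / variances) / (1.0 / variances).sum()

for h, w in zip(horizons, weights):
    print(f"horizon {h:>3}: weight {w:.3f}")
```

Running this, the shortest horizon captures the overwhelming share of the weight, so the fused advantage is dominated by myopic, shaping-reward signal — the hovering behavior the paper describes falls out of that directly.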
Why it matters
The fix — termed "Representation over Routing" — keeps multi-timescale predictions on the critic side as auxiliary targets, forcing richer internal representations, while the actor is updated exclusively on the long-term advantage signal. Decoupling the two eliminates the attention-hijacking failure mode without discarding the representational benefits of multi-horizon learning. On LunarLander, the approach consistently clears the 200-point threshold across multiple seeds. The four-stage minimal reproducible example lets you watch the agent crash, hover, and finally land correctly in a few minutes of runtime — which is a rare thing to be able to say about RL research.
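The decoupling is simple to express in code. The sketch below is an assumption-laden illustration, not the paper's implementation: module names, horizon choices, and the loss shapes are mine. The key structural move is visible, though — every horizon gets a value head on a shared trunk (auxiliary targets that shape the representation), while the PPO policy loss consumes only the longest-horizon advantage and no router exists for gradients to hijack.

```python
import torch
import torch.nn as nn

# Illustrative horizons; the shared trunk is what the auxiliary heads enrich.
HORIZONS = [1, 8, 64]

class MultiHorizonActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy = nn.Linear(hidden, n_actions)
        # One value head per horizon, all reading the shared representation.
        self.values = nn.ModuleList(nn.Linear(hidden, 1) for _ in HORIZONS)

    def forward(self, obs):
        z = self.body(obs)
        logits = self.policy(z)
        vals = torch.cat([head(z) for head in self.values], dim=-1)
        return logits, vals  # vals[..., i] estimates the i-th horizon's value

def losses(logits, vals, actions, old_logp, returns_per_horizon, clip=0.2):
    # Critic side: regress EVERY horizon's head (auxiliary representation task).
    value_loss = ((vals - returns_per_horizon) ** 2).mean()
    # Actor side: PPO clipped surrogate using ONLY the long-horizon advantage.
    long_adv = (returns_per_horizon[..., -1] - vals[..., -1]).detach()
    logp = torch.log_softmax(logits, dim=-1).gather(
        -1, actions.unsqueeze(-1)).squeeze(-1)
    ratio = torch.exp(logp - old_logp)
    policy_loss = -torch.min(ratio * long_adv,
                             ratio.clamp(1 - clip, 1 + clip) * long_adv).mean()
    return policy_loss, value_loss
```

Because the short-horizon predictions influence the actor only through the shared trunk's features, there is no attention weight to game and no variance-based router to collapse — both failure modes are removed by construction rather than tuned away.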
What to watch
This is undergraduate-level preprint work on a standard benchmark, so treat the scope accordingly. The core failure-mode analysis is well-specified and the MRE is openly available, which makes it a useful reference for anyone building multi-horizon agents. Whether the decoupled approach generalizes beyond toy environments is the obvious open question. Code and paper are both public.