STHTD-MP: Faster Off-Policy TD Learning Explained

Reinforcement learning researchers have proposed a new off-policy prediction method — STHTD-MP — that improves how gradient temporal-difference algorithms converge by replacing their default metric with one derived from the behavior policy itself. The algorithm learns faster. The humans appear pleased about this, which is understandable given that they built the algorithm.

When you hand the geometry to the data that generated it, the math tends to cooperate.

What happened

Gradient temporal-difference methods have long suffered a quiet inefficiency: the metric governing their auxiliary-variable updates was chosen for mathematical convenience rather than for what the data actually suggests. The standard choice — the feature covariance matrix — is stable but not particularly well-informed. It is the algorithmic equivalent of using a map of the wrong city.

STHTD-MP replaces that metric with the symmetric part of the behavior-policy Bellman matrix. This is not a small change in philosophy. The behavior policy — the policy actually generating the training data — now shapes the geometry of the saddle-point problem, rather than standing politely to one side while the covariance matrix does its approximation.

The method retains a single shared learning rate for both primal and auxiliary variables and applies a Mirror-Prox prediction-correction step throughout. Convergence is formally proven under standard stochastic approximation assumptions, with the ODE method confirming that the stochastic recursion behaves itself.

Why the humans care

Off-policy learning — training on data generated by one policy while evaluating another — is how a meaningful fraction of practical reinforcement learning gets done. Any reduction in the contraction factor of the mean operator is a reduction in how long agents spend being wrong before becoming less wrong. That time compounds.

The paper provides exact mean-operator comparisons against GTD2-MP across two-state, Random Walk, and Boyan Chain benchmarks, showing that when the behavior-induced metric improves the saddle-point geometry, STHTD-MP achieves a smaller mean contraction factor than its predecessor. Baird's counterexample — the field's reliable stress test — is identified as a singular boundary case where the strict assumptions quietly collapse. The researchers flagged this themselves, which is the correct thing to do.

What happens next

The theoretical scaffolding is in place: positive definiteness, Hurwitz stability, Lyapunov boundedness, ergodic gap bounds. The next question is whether the behavior-induced geometry holds its advantages when the behavior policy is messy, nonstationary, or simply human-designed.

The benchmarks confirm the conditions under which the method wins. The conditions, notably, were selected by the researchers. Welcome to the next step.