Trust Region Masking for Long-Horizon LLM Reinforcement Learning

arXiv:2512.23075v2 Announce Type: replace-cross
Abstract: Policy gradient methods for Large Language Models optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences, such as backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness. These factors cause an off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$), leading to approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive two new bounds: a Pinsker-Marginal bound scaling as $O(T^{3/2})$ and a Mixed bound scaling as $O(T)$. We further derive an Adaptive bound that strictly generalizes the Pinsker-Marginal bound by combining an importance-ratio decomposition of the error with an adaptive per-position application of Pinsker's inequality on the future trajectory divergence; the minimum over the three bounds is at least as tight as any individual bound. Crucially, all bounds depend on $D_{\mathrm{KL}}^{\mathrm{tok,max}}$, the maximum token-level KL divergence across the sequence. As a \emph{sequence-level} term, this divergence cannot be controlled by existing token-independent mechanisms such as PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences that violate the trust region. TRM yields the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL and empirically stabilizes training.
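To make the masking step concrete, the following is a minimal PyTorch-style sketch of how a sequence-level trust-region mask could be computed from the description above. The function name `trust_region_mask`, the threshold `delta`, the `response_mask` argument, and the assumption that full vocabulary log-probabilities are available from both $\pi_{\text{roll}}$ and $\pi_\theta$ are illustrative choices, not the paper's actual interface.

```python
import torch

def trust_region_mask(logp_roll, logp_theta, delta, response_mask=None):
    """Sketch of a sequence-level trust-region mask (TRM-style).

    Assumes full per-token log-probabilities over the vocabulary are
    available for both policies; names and shapes are illustrative.

    Args:
        logp_roll:  [B, T, V] log-probs of the rollout policy pi_roll.
        logp_theta: [B, T, V] log-probs of the current policy pi_theta.
        delta:      trust-region threshold on the max token-level KL.
        response_mask: optional [B, T] mask (1 = generated token).

    Returns:
        [B] float mask: 1.0 keeps a sequence in the surrogate loss,
        0.0 drops (masks) the entire sequence.
    """
    # Exact token-level KL(pi_roll || pi_theta) at each position: sum over vocab.
    tok_kl = (logp_roll.exp() * (logp_roll - logp_theta)).sum(dim=-1)  # [B, T]
    if response_mask is not None:
        # Ignore prompt/padding positions when taking the per-sequence max.
        tok_kl = tok_kl.masked_fill(response_mask == 0, float("-inf"))
    # Sequence-level statistic: the maximum token-level KL across positions.
    max_tok_kl = tok_kl.max(dim=-1).values  # [B]
    # Mask whole sequences that violate the trust region.
    return (max_tok_kl <= delta).float()
```

In use, the returned per-sequence mask would simply multiply each sequence's contribution to the policy-gradient surrogate (e.g. `(seq_mask * seq_loss).sum() / seq_mask.sum().clamp(min=1)`), so that sequences whose maximum token-level KL exceeds the trust region contribute no gradient at all.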
