ESPO: Entropy Importance Sampling Policy Optimization
arXiv:2512.00499v2 Announce Type: replace-cross
Abstract: Reinforcement learning (RL) has become a central component of post-training for large language models (LLMs), particularly for complex reasoning tasks that require stable optimization over long generation horizons. However, achieving performance at scale often introduces a fundamental trade-off between training stability and training efficiency. Token-level optimization applies fine-grained updates to individual tokens but is prone to high variance in gradient estimation, which can lead to unstable training dynamics. In contrast, sequence-level optimization often relies on aggressive clipping mechanisms to ensure stable updates; such designs may discard a large fraction of valid training samples, leading to inefficient gradient utilization and reduced training efficiency. We refer to this phenomenon as gradient underutilization. In this work, we propose Entropy Importance Sampling Policy Optimization (ESPO), a novel framework that combines fine-grained updates with stable training. ESPO decomposes sequences into groups based on predictive entropy, enabling (1) Entropy Grouping Importance Sampling to capture intra-sequence heterogeneity, and (2) Entropy Adaptive Clipping to dynamically allocate trust regions based on model uncertainty. Extensive experiments on mathematical reasoning benchmarks demonstrate that ESPO not only accelerates convergence but also achieves state-of-the-art performance, with notable accuracy gains on the most challenging benchmarks.
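To make the two components concrete, the sketch below shows one plausible way an ESPO-style surrogate loss could be organized: tokens are bucketed into entropy groups, each group contributes its own importance ratio, and the clip range widens with the group's mean entropy. This is an illustrative reading of the abstract only, not the authors' implementation; the function name `espo_surrogate_loss`, the quantile-based grouping, and the linear entropy-to-clip mapping are all assumptions introduced here.

```python
import torch

def espo_surrogate_loss(logp_new, logp_old, advantages, entropies,
                        num_groups=3, base_clip=0.2, clip_scale=0.1):
    """Hedged sketch of an ESPO-style loss (not the paper's exact formulation).

    logp_new, logp_old: per-token log-probs under the current / behavior policy.
    advantages:         per-token advantage estimates.
    entropies:          per-token predictive entropy of the behavior policy.
    Tokens are bucketed into `num_groups` entropy groups; each group gets its
    own importance ratio and an entropy-dependent clip range (assumed forms).
    """
    # Assumed grouping rule: quantile boundaries over the token entropies.
    edges = torch.quantile(entropies, torch.linspace(0, 1, num_groups + 1))
    loss_terms = []
    for g in range(num_groups):
        lo, hi = edges[g], edges[g + 1]
        mask = (entropies >= lo) & (entropies <= hi) if g == num_groups - 1 \
            else (entropies >= lo) & (entropies < hi)
        if mask.sum() == 0:
            continue
        # Group-level importance ratio: geometric mean of the token ratios.
        ratio = (logp_new[mask] - logp_old[mask]).mean().exp()
        # Assumed entropy-adaptive clip: wider trust region for higher entropy.
        eps = base_clip + clip_scale * entropies[mask].mean().detach()
        adv = advantages[mask].mean()
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
        # PPO-style pessimistic objective per entropy group.
        loss_terms.append(-torch.min(unclipped, clipped))
    return torch.stack(loss_terms).mean()
```

Under this reading, the grouping replaces a single sequence-level ratio with several entropy-conditioned ratios (limiting how much one aggressive clip can discard), while the per-group clip range encodes the "trust regions based on model uncertainty" described in the abstract.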