Your Group-Relative Advantage Is Biased
This paper identifies, and proves theoretically, a statistical bias in group-based advantage estimation within the Reinforcement Learning from Verifier Rewards (RLVR) algorithms used to post-train large language models on reasoning tasks. To mitigate this bias it proposes History-Aware Adaptive Difficulty Weighting (HA-DW), which the authors report consistently improves LLM performance and training efficiency across benchmarks.

Paper link: https://arxiv.org/pdf/2601.08521

Submitted by /u/This_Ad9834
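For readers unfamiliar with the term, here is a minimal sketch of what a group-relative advantage usually looks like, assuming the common GRPO-style form in which each sampled response's reward is centered on its group's mean and scaled by the group's standard deviation. The function name, the `eps` value, and the 0/1 rewards are illustrative assumptions, not taken from the paper, and the paper's HA-DW correction is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std.

    Assumed GRPO-style estimator; `eps` (hypothetical value) avoids
    division by zero when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1 (verified correct) or 0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline is the mean of a small sampled group rather than the prompt's true expected reward, the estimate depends on which responses happened to be sampled; that group-based estimate is the quantity the paper analyzes.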