Your Group-Relative Advantage Is Biased

This paper identifies, and proves theoretically, a statistical bias in the group-based advantage estimation used by Reinforcement Learning from Verifier Rewards (RLVR) algorithms for post-training large language models on reasoning tasks. To mitigate the bias, it proposes History-Aware Adaptive Difficulty Weighting (HA-DW), which the authors report consistently improves LLM performance and training efficiency across benchmarks.
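
For anyone unfamiliar with the term, here is a minimal sketch of what a group-relative advantage usually looks like in GRPO-style training: each sampled response's reward is standardized against the other responses to the same prompt. This is the generic estimator, not code from the paper, and it does not implement HA-DW. The function name, the `eps` value, the use of population standard deviation, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (GRPO-style sketch).

    `rewards` holds the verifier scores for several responses sampled
    from the same prompt. `eps` is an assumed small constant that avoids
    division by zero when every response gets the same score.
    """
    mu = mean(rewards)       # group mean, used as the baseline
    sigma = pstdev(rewards)  # population std of the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight responses to one prompt, scored 0/1 by a verifier (made-up values).
rewards = [1, 0, 0, 0, 1, 0, 0, 0]
advantages = group_relative_advantages(rewards)
```

The baseline `mu` is computed from the same handful of samples it is applied to, rather than from the true expected reward for the prompt. That small-sample, group-based estimate is the quantity the paper analyzes.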

Paper link: https://arxiv.org/pdf/2601.08521

https://preview.redd.it/2j5xdz35h7pg1.png?width=1720&format=png&auto=webp&s=ec7e34a6f49da2b2c1394a37fa865c8193eee28a

submitted by /u/This_Ad9834