New to RL: why does RLVR work if the reward is so sparse?
Why does RLVR (RL with verifiable rewards) seem to work well for LLMs?
My intuition was that sparse rewards are usually bad: exploration is hard and gradients get noisy. But RLVR papers and blog posts make it look quite effective in practice.
submitted by /u/Parking_Throat_9125