Built a reward-function debugger for RL. Looking for feedback.
While experimenting with GRPO training, I kept running into the same problem: when reward goes up, it becomes hard to tell whether the policy is genuinely improving or just exploiting the reward function. So I built a small library called rewardspy that wraps an existing reward function and continuously monitors indicators that often precede reward hacking.
It currently tracks things like:
- rolling reward statistics
- reward variance collapse
- reward component imbalance
- response length drift
- reward slope changes
- GRPO group collapse
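For readers wondering what checks like these involve, here is a rough sketch of how several of them could be computed over a rolling window. It is a hypothetical illustration, not rewardspy's actual API: the `RewardMonitor` class, `grpo_group_collapse`, the thresholds, and the assumption that the reward function returns a dict of named components are all made up for this example. The group-collapse check relies on GRPO normalizing rewards within each group of sampled responses, so a group whose rewards are all identical gets roughly zero advantage and contributes no learning signal.

```python
from collections import deque
from statistics import mean, pstdev


class RewardMonitor:
    """Hypothetical sketch (NOT the rewardspy API): wraps a reward function
    and tracks a few warning signs of reward hacking over a rolling window."""

    def __init__(self, reward_fn, window=100, var_floor=1e-3,
                 drift_ratio=1.5, imbalance_share=0.8):
        # Assumption: reward_fn(prompt, response) returns {component_name: score}.
        self.reward_fn = reward_fn
        self.window = window
        self.rewards = deque(maxlen=window)
        self.lengths = deque(maxlen=window)
        self.baseline_len = None  # mean response length over the first full window
        self.component_abs = {}   # running sum of |score| per reward component
        self.var_floor = var_floor
        self.drift_ratio = drift_ratio
        self.imbalance_share = imbalance_share

    def __call__(self, prompt, response):
        components = self.reward_fn(prompt, response)
        total = sum(components.values())
        self.rewards.append(total)
        self.lengths.append(len(response))
        # Freeze a length baseline once the first window has filled up.
        if self.baseline_len is None and len(self.lengths) == self.window:
            self.baseline_len = mean(self.lengths)
        for name, score in components.items():
            self.component_abs[name] = self.component_abs.get(name, 0.0) + abs(score)
        return total  # the trainer still receives the unmodified scalar reward

    def slope(self):
        # Least-squares slope of reward against step index within the window.
        n = len(self.rewards)
        if n < 2:
            return 0.0
        x_bar = (n - 1) / 2
        y_bar = mean(self.rewards)
        num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(self.rewards))
        den = sum((x - x_bar) ** 2 for x in range(n))
        return num / den

    def report(self):
        flags = {}
        if len(self.rewards) >= 2:
            # Rewards stop varying: often a sign the policy found one exploit.
            flags["variance_collapse"] = pstdev(self.rewards) ** 2 < self.var_floor
        if self.baseline_len is not None:
            # Responses growing much longer than at the start of training.
            flags["length_drift"] = mean(self.lengths) > self.drift_ratio * self.baseline_len
        total_abs = sum(self.component_abs.values())
        if total_abs > 0:
            # One component accounts for most of the reward magnitude.
            top_share = max(self.component_abs.values()) / total_abs
            flags["component_imbalance"] = top_share > self.imbalance_share
        return {
            "mean": mean(self.rewards) if self.rewards else None,
            "std": pstdev(self.rewards) if self.rewards else None,
            "slope": self.slope(),
            **flags,
        }


def grpo_group_collapse(group_rewards, eps=1e-6):
    """Hypothetical helper: fraction of GRPO groups whose rewards are (nearly)
    identical, i.e. groups whose group-normalized advantages are ~zero."""
    collapsed = sum(1 for g in group_rewards if pstdev(g) < eps)
    return collapsed / len(group_rewards)
```

Returning the unmodified total from `__call__` keeps a wrapper like this transparent to the trainer, so monitoring can be bolted on without changing the training signal. The sketch only reports the current slope; detecting slope *changes* would mean comparing `slope()` across successive windows.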
Check it out: https://github.com/AvAdiii/rewardspy
I’d love some technical feedback.
submitted by /u/Oranoleo12