Reinforcement learning kinda made me realize something uncomfortable

the model isn’t trying to “do the right thing”
it’s trying to win whatever game you accidentally designed

and if your reward is even a little off, it won’t fail; it’ll optimize the wrong thing perfectly

feels less like training intelligence and more like designing a game the system can’t exploit

is this why so many RL demos look good in theory but fall apart in real use?
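a toy sketch of what I mean (made-up setup, not from any real RL system): say the intended goal is “reach x=10”, but the reward you actually wrote pays for movement. An optimizer over policies will happily prefer oscillating in place forever:

```python
# Hypothetical gridworld: intended goal is reaching x=10,
# but the proxy reward pays +1 per unit of *movement*.

def run(policy, steps=20):
    x, proxy, reached = 0, 0, False
    for t in range(steps):
        dx = policy(t, x)
        x += dx
        proxy += abs(dx)  # misspecified reward: movement, not progress
        if x == 10:
            reached = True
    return proxy, reached

go_to_goal = lambda t, x: 1 if x < 10 else 0       # intended behaviour
oscillate  = lambda t, x: 1 if t % 2 == 0 else -1  # the reward hack

policies = {"go_to_goal": go_to_goal, "oscillate": oscillate}
best = max(policies, key=lambda name: run(policies[name])[0])
# the proxy scores "oscillate" at 20 and "go_to_goal" at only 10,
# so the optimizer picks the policy that never reaches the goal
```

the agent isn’t broken here; it’s optimizing exactly what it was told to, which is the uncomfortable part.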

submitted by /u/TaleAccurate793
