Seeking arXiv cs.LG endorsement for paper on probe transfer failure in reward hacking detection
Seeking arXiv cs.LG endorsement for a paper on activation probe transfer failure in reward hacking detection. I test whether probes trained on the School of Reward Hacks dataset (Taylor et al. 2025) transfer to GRPO-induced reward seeking. They don’t. The SFT and RL probe directions are nearly orthogonal (cosine = -0.07). The paper builds on Wilhelm et al. 2026, Taufeeque et al. 2026, and Gupta & Jenner 2025 (NeurIPS MechInterp Workshop).
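For readers unfamiliar with the orthogonality claim, here is a minimal sketch of how a cosine between two linear probe directions can be computed. It is only an illustration: the function name `cosine_similarity` and the toy vectors are hypothetical, they are not the paper's code or probe weights, and the real probes live in the model's activation space with far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy directions standing in for probe weight vectors
# (assumed values for illustration, NOT taken from the paper).
sft_probe = [1.0, 0.0, 0.0, 0.0]
rl_probe = [0.0, 1.0, 0.0, 0.0]

# Identical directions give a cosine near 1; perpendicular ones give 0.
print(cosine_similarity(sft_probe, sft_probe))
print(cosine_similarity(sft_probe, rl_probe))
```

A value close to 0, like the -0.07 reported above, means the two directions share almost no component, so a classifier that reads along one direction carries little information about the other.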
Paper will be visible on arXiv once endorsed and submitted. Happy to answer any questions about the work beforehand.
Endorsement link: https://arxiv.org/auth/endorse?x=OQ3LDW
Endorsement code: OQ3LDW
Thanks in advance!!
submitted by /u/Main_Comparison4455