Seeking arXiv cs.LG endorsement for paper on probe transfer failure in reward hacking detection

I’m seeking an arXiv cs.LG endorsement for a paper on activation probe transfer failure in reward hacking detection. I test whether probes trained on the School of Reward Hacks dataset (Taylor et al. 2025) transfer to GRPO-induced reward seeking. They don’t: the SFT and RL probe directions are nearly orthogonal (cosine = -0.07). The paper builds on Wilhelm et al. 2026, Taufeeque et al. 2026, and Gupta & Jenner 2025 (NeurIPS MechInterp Workshop).
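For anyone unfamiliar with the metric, here is a minimal sketch of how a cosine between two probe directions can be computed, assuming each linear probe is summarized by a single weight vector. The function name and the toy vectors are illustrative assumptions, not the paper's code or probe weights. A cosine near 0 means the two directions share almost no component, so a score read off one direction says little about the other.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy directions (NOT the paper's probe weights):
# two axis-aligned vectors, which are exactly orthogonal.
sft_direction = [1.0, 0.0, 0.0]
rl_direction = [0.0, 1.0, 0.0]

print(cosine_similarity(sft_direction, rl_direction))
```

In practice the inputs would be the learned weight vectors of the two probes at the same layer, which is an assumption about the setup rather than something stated above.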

Paper will be visible on arXiv once endorsed and submitted. Happy to answer any questions about the work beforehand.

Endorsement link: https://arxiv.org/auth/endorse?x=OQ3LDW

Endorsement code: OQ3LDW

Thanks in advance!!

submitted by /u/Main_Comparison4455