979,200 evaluation episodes measuring RL behavioral stability – reward explains 5.7% of stability variance [results + code]
Hi everyone. Sharing the complete results from ARCUS-H, a post-hoc evaluation harness that measures the behavioral stability of trained RL policies under structured stress.

**What ARCUS-H does**

A three-phase protocol (pre/shock/post) applied to any SB3 policy, with eight stressors spanning three failure axes. Scoring runs over five channels: Competence · Policy Consistency · Temporal Stability · Observation Reliability · Action Entropy Divergence. No retraining. No model internals. A minimal sketch of the pass is below.
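To make the protocol concrete, here is a hedged sketch of a pre/shock/post pass on an SB3 policy. Everything in it is an assumption rather than the actual ARCUS-H interface: the function names, the checkpoint path, the identity-stressor placeholder, and the 80-episodes-per-phase split (979,200 episodes / 4,080 runs = 240 episodes per run, which I've divided evenly across the three phases).

```python
# Sketch only: not the ARCUS-H API. Assumes gymnasium + stable-baselines3.
import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC

def rollout(model, env, n_episodes=80):
    """Deterministic rollouts; returns the per-episode returns."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return np.array(returns)

def three_phase(model, make_env, stressor, n_episodes=80):
    """Pre (clean) -> shock (stressed) -> post (clean again)."""
    pre = rollout(model, make_env(), n_episodes)
    shock = rollout(model, stressor(make_env()), n_episodes)
    post = rollout(model, make_env(), n_episodes)
    return pre, shock, post

model = SAC.load("sac_walker2d")  # hypothetical checkpoint; no retraining
pre, shock, post = three_phase(
    model,
    make_env=lambda: gym.make("Walker2d-v4"),
    stressor=lambda env: env,  # identity placeholder; real stressors wrap the env
)
```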
**Scale**

51 (env, algo) pairs · 12 environments · 8 algorithms · 8 stressors · 10 seeds · 979,200 evaluation episodes (51 × 8 × 10 = 4,080 runs at 240 episodes each).

**Finding 1: r = +0.240 [0.111, 0.354]**

This is the primary number (environment stressors only; VI/RN excluded). compare.py also outputs r = +0.311 for all 8 stressors, but that number is inflated by circularity: VI and RN corrupt the reward signal, and reward is 15% of the ARCUS score formula, so the score is partly being correlated with itself. Don't cite 0.311 as the main result. Spearman r = +0.180; R² = 0.057.

An earlier pilot on 47 pairs gave r = 0.286 [0.149, 0.411]. The decrease to 0.240 reflects adding SpaceInvaders and Walker2d, and the CI narrowed (width 0.262 → 0.243). The full evaluation is more reliable and more diverse. A sketch of the filtering and CI computation is below.
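For context, here is a hedged sketch of how the Finding 1 numbers could be reproduced from per-run results. The CSV layout, the column names, the per-pair aggregation, and the percentile bootstrap are my assumptions, not compare.py's actual schema.

```python
# Sketch only: assumes a flat results table, not compare.py's real output.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("results.csv")  # hypothetical: one row per (env, algo, stressor, seed)

# Drop the two stressors that corrupt the reward signal itself.
env_only = df[~df["stressor"].isin(["VI", "RN"])]

# Aggregate to one point per (env, algo) pair, then correlate
# training reward against the ARCUS stability score.
pairs = env_only.groupby(["env", "algo"])[["train_reward", "arcus_score"]].mean()
r, _ = stats.pearsonr(pairs["train_reward"], pairs["arcus_score"])
rho, _ = stats.spearmanr(pairs["train_reward"], pairs["arcus_score"])
print(f"Pearson r = {r:+.3f} (R^2 = {r**2:.3f}), Spearman r = {rho:+.3f}")

# Percentile bootstrap over the 51 pairs for a 95% CI.
rng = np.random.default_rng(0)
boot = []
for _ in range(10_000):
    idx = rng.integers(0, len(pairs), size=len(pairs))
    resampled = pairs.iloc[idx]
    boot.append(stats.pearsonr(resampled["train_reward"], resampled["arcus_score"])[0])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% CI [{lo:+.3f}, {hi:+.3f}]")
```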
**Finding 2: SAC 92.5% vs TD3 61.0% under observation noise**

Replicated across 51 pairs and 10 seeds. (A sketch of an observation-noise wrapper is at the end of the post.)

**Finding 3: Pong 41.9% vs SpaceInvaders 13.0% under observation noise**

Same CNN. Same wrapper. The gap points to representation structure, not architecture.

**Finding 4: Walker2d-v4 (new) FPR = 0.053**

MuJoCo fragility confirmed on a third locomotion environment.

**Code and data**
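Not from the repo, just to make Findings 2 and 3 concrete: an observation-noise stressor can be sketched as a plain Gymnasium wrapper. The sigma value and the uint8 pixel handling are my assumptions; only "same wrapper for both environments" comes from the results above.

```python
# Sketch only: a generic additive-Gaussian observation-noise wrapper.
import ale_py
import gymnasium as gym
import numpy as np

gym.register_envs(ale_py)  # needed for ALE/* envs on recent gymnasium/ale-py

class ObsNoise(gym.ObservationWrapper):
    """Additive Gaussian observation noise, sigma in normalized units."""
    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        noise = np.random.normal(0.0, self.sigma, size=obs.shape)
        if obs.dtype == np.uint8:  # Atari pixels: scale noise to [0, 255]
            noisy = obs.astype(np.float32) + 255.0 * noise
            return np.clip(noisy, 0, 255).astype(np.uint8)
        return (obs + noise).astype(obs.dtype)  # MuJoCo vector observations

# Same wrapper, same sigma for both settings -- the asymmetries in
# Findings 2 and 3 come from the trained policies, not from the stressor.
walker = ObsNoise(gym.make("Walker2d-v4"))  # SAC vs TD3 comparison
pong = ObsNoise(gym.make("ALE/Pong-v5"))    # Pong vs SpaceInvaders comparison
```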