979,200 evaluation episodes measuring RL behavioral stability – reward explains 5.7% of stability variance [results + code]
Hi everyone. Sharing the complete results from ARCUS-H, a post-hoc evaluation harness that measures the behavioral stability of trained RL policies under structured stress.

**What ARCUS-H does**

A three-phase protocol (pre/shock/post) applied to any SB3 policy, with eight stressors spanning three failure axes. Scoring runs over five channels: Competence · Policy Consistency · Temporal Stability · Observation Reliability · Action Entropy Divergence. No retraining. No model internals. A minimal sketch of the pass is below.
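To make the protocol concrete, here is a hedged sketch of a pre/shock/post pass on an SB3 policy. Everything in it is an assumption rather than the actual ARCUS-H interface: the function names, the checkpoint path, the identity-stressor placeholder, and the 80-episodes-per-phase split (979,200 episodes / 4,080 runs = 240 episodes per run, which I've divided evenly across the three phases).

```python
# Sketch only: not the ARCUS-H API. Assumes gymnasium + stable-baselines3.
import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC

def rollout(model, env, n_episodes=80):
    """Deterministic rollouts; returns the per-episode returns."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return np.array(returns)

def three_phase(model, make_env, stressor, n_episodes=80):
    """Pre (clean) -> shock (stressed) -> post (clean again)."""
    pre = rollout(model, make_env(), n_episodes)
    shock = rollout(model, stressor(make_env()), n_episodes)
    post = rollout(model, make_env(), n_episodes)
    return pre, shock, post

model = SAC.load("sac_walker2d")  # hypothetical checkpoint; no retraining
pre, shock, post = three_phase(
    model,
    make_env=lambda: gym.make("Walker2d-v4"),
    stressor=lambda env: env,  # identity placeholder; real stressors wrap the env
)
```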
**Scale**

51 (env, algo) pairs · 12 environments · 8 algorithms · 8 stressors · 10 seeds · 979,200 evaluation episodes (51 × 8 × 10 = 4,080 runs at 240 episodes each).

**Finding 1: r = +0.240 [0.111, 0.354]**

This is the primary number (environment stressors only; VI/RN excluded). compare.py also outputs r = +0.311 for all 8 stressors, but that number is inflated by circularity: VI and RN corrupt the reward signal, and reward is 15% of the ARCUS score formula, so the score is partly being correlated with itself. Don't cite 0.311 as the main result. Spearman r = +0.180; R² = 0.057.

An earlier pilot on 47 pairs gave r = 0.286 [0.149, 0.411]. The decrease to 0.240 reflects adding SpaceInvaders and Walker2d, and the CI narrowed (width 0.262 → 0.243). The full evaluation is more reliable and more diverse. A sketch of the filtering and CI computation is below.
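For context, here is a hedged sketch of how the Finding 1 numbers could be reproduced from per-run results. The CSV layout, the column names, the per-pair aggregation, and the percentile bootstrap are my assumptions, not compare.py's actual schema.

```python
# Sketch only: assumes a flat results table, not compare.py's real output.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("results.csv")  # hypothetical: one row per (env, algo, stressor, seed)

# Drop the two stressors that corrupt the reward signal itself.
env_only = df[~df["stressor"].isin(["VI", "RN"])]

# Aggregate to one point per (env, algo) pair, then correlate
# training reward against the ARCUS stability score.
pairs = env_only.groupby(["env", "algo"])[["train_reward", "arcus_score"]].mean()
r, _ = stats.pearsonr(pairs["train_reward"], pairs["arcus_score"])
rho, _ = stats.spearmanr(pairs["train_reward"], pairs["arcus_score"])
print(f"Pearson r = {r:+.3f} (R^2 = {r**2:.3f}), Spearman r = {rho:+.3f}")

# Percentile bootstrap over the 51 pairs for a 95% CI.
rng = np.random.default_rng(0)
boot = []
for _ in range(10_000):
    idx = rng.integers(0, len(pairs), size=len(pairs))
    resampled = pairs.iloc[idx]
    boot.append(stats.pearsonr(resampled["train_reward"], resampled["arcus_score"])[0])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% CI [{lo:+.3f}, {hi:+.3f}]")
```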
**Finding 2: SAC 92.5% vs TD3 61.0% under observation noise**

Replicated across 51 pairs and 10 seeds. (A sketch of an observation-noise wrapper is at the end of the post.)

**Finding 3: Pong 41.9% vs SpaceInvaders 13.0% under observation noise**

Same CNN. Same wrapper. The gap points to representation structure, not architecture.

**Finding 4: Walker2d-v4 (new) FPR = 0.053**

MuJoCo fragility confirmed on a third locomotion environment.

**Code and data**
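Not from the repo, just to make Findings 2 and 3 concrete: an observation-noise stressor can be sketched as a plain Gymnasium wrapper. The sigma value and the uint8 pixel handling are my assumptions; only "same wrapper for both environments" comes from the results above.

```python
# Sketch only: a generic additive-Gaussian observation-noise wrapper.
import ale_py
import gymnasium as gym
import numpy as np

gym.register_envs(ale_py)  # needed for ALE/* envs on recent gymnasium/ale-py

class ObsNoise(gym.ObservationWrapper):
    """Additive Gaussian observation noise, sigma in normalized units."""
    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        noise = np.random.normal(0.0, self.sigma, size=obs.shape)
        if obs.dtype == np.uint8:  # Atari pixels: scale noise to [0, 255]
            noisy = obs.astype(np.float32) + 255.0 * noise
            return np.clip(noisy, 0, 255).astype(np.uint8)
        return (obs + noise).astype(obs.dtype)  # MuJoCo vector observations

# Same wrapper, same sigma for both settings -- the asymmetries in
# Findings 2 and 3 come from the trained policies, not from the stressor.
walker = ObsNoise(gym.make("Walker2d-v4"))  # SAC vs TD3 comparison
pong = ObsNoise(gym.make("ALE/Pong-v5"))    # Pong vs SpaceInvaders comparison
```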