When Chaos Wins: noisy net eval with noise off gave wildly inconsistent results. Turning it back on fixed everything.

digitado ⋅ May 17, 2026

I'm running a Rainbow DQN ablation on Snake (C51 + dueling + noisy nets). When I evaluated checkpoints with noise off (mean weights, sigma zeroed out, the standard approach), the scores were all over the place. Some checkpoints averaged 78, others averaged 18. The training curve at those same points was perfectly stable.

First instinct was a bug. Checked everything. It wasn’t.

The worst case was at ep450K. Deterministic eval produced a bimodal distribution: ~25% of episodes scored near zero, ~75% scored above 80. The average was 59 but that number is meaningless with two separate peaks and nothing in between.

What’s happening: the mean-weight policy has traps, game states where the Q-values for two actions are nearly identical. Without noise, the agent picks the same action every time it reaches one of those states. If it’s the wrong one, it loops and dies. About 25% of starting states consistently hit these traps.
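A toy sketch of that trap mechanism, not the actual Snake agent. The Q-values, the step budget, and the noise scale are made-up numbers, and the noise is added directly to the Q-values as a stand-in for what weight noise does inside a noisy net. With zero noise the greedy choice never changes, so the episode never leaves the loop. With a small perturbation the near-tie gets broken sooner or later.

```python
import random

# Hypothetical Q-values for one "trap" state: two actions nearly tied.
# "loop" returns to the same state, "exit" leaves the trap.
Q_TRAP = {"loop": 1.000, "exit": 0.999}
MAX_STEPS = 50  # assumed step budget before the episode counts as a death


def run_episode(noise_std, rng):
    """Return True if the agent leaves the trap state within MAX_STEPS."""
    for _ in range(MAX_STEPS):
        # Perturb each Q-value independently, then act greedily.
        perturbed = {a: q + rng.gauss(0.0, noise_std) for a, q in Q_TRAP.items()}
        action = max(perturbed, key=perturbed.get)
        if action == "exit":
            return True
    return False  # stuck repeating "loop" until the budget ran out


def escape_rate(noise_std, episodes=200, seed=0):
    """Fraction of episodes that escape the trap at a given noise scale."""
    rng = random.Random(seed)
    return sum(run_episode(noise_std, rng) for _ in range(episodes)) / episodes
```

Comparing `escape_rate(0.0)` with `escape_rate(0.01)` shows the difference: the noise-free agent repeats "loop" on every step, while the perturbed one only needs a single draw where "exit" comes out on top.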

Same checkpoint, same seeds, noise turned back on: bimodal failure mode vanished entirely. p25 jumped from 2 to 59. Average went from 59 to 73. Std dropped from 42 to 26. This held at every checkpoint from ep50K through ep450K. Stochastic eval beat deterministic eval across the board.
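For anyone who wants to run the same comparison, here is a minimal sketch of the summary step using only the standard library. The function names are mine, and it assumes you already have two lists of per-episode scores collected on the same seeds, one per eval mode. Reporting p25 next to the mean matters here because the mean alone hides a bimodal distribution.

```python
import statistics


def summarize(scores):
    """Mean, population std, and 25th percentile of per-episode scores."""
    quartiles = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),
        "p25": quartiles[0],
    }


def compare(det_scores, sto_scores):
    """Per-metric difference, stochastic minus deterministic, on paired seeds."""
    det, sto = summarize(det_scores), summarize(sto_scores)
    return {key: sto[key] - det[key] for key in det}
```

A positive `p25` and `mean` difference together with a negative `std` difference is the pattern described above.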

The noise isn’t residual exploration overhead. The agent learned a policy where the sigma values are functional. They provide just enough Q-value perturbation to prevent degenerate action loops. Zero them out and you get a policy that’s strictly worse than what the agent actually learned.
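To make "zero them out" concrete, here is a rough pure-Python sketch of a noisy linear layer with an eval-time switch. It uses independent Gaussian noise per weight rather than the factorized variant, the class and function names are hypothetical, and it is not the code from my training run. The effective weight is `mu + sigma * eps`. Deterministic eval drops the `sigma * eps` term, stochastic eval keeps it.

```python
import random


class NoisyLinear:
    """Linear layer with weights mu_w + sigma_w * eps and biases mu_b + sigma_b * eps."""

    def __init__(self, mu_w, sigma_w, mu_b, sigma_b, seed=0):
        self.mu_w, self.sigma_w = mu_w, sigma_w
        self.mu_b, self.sigma_b = mu_b, sigma_b
        self.rng = random.Random(seed)

    def forward(self, x, use_noise):
        """Compute outputs; fresh noise is drawn on every call when use_noise is True."""
        out = []
        for i, row in enumerate(self.mu_w):
            total = self.mu_b[i]
            if use_noise:
                total += self.sigma_b[i] * self.rng.gauss(0.0, 1.0)
            for j, x_j in enumerate(x):
                w = row[j]
                if use_noise:
                    # Learned sigma scales a fresh standard-normal draw.
                    w += self.sigma_w[i][j] * self.rng.gauss(0.0, 1.0)
                total += w * x_j
            out.append(total)
        return out


def greedy_action(q_values):
    """Index of the largest Q-value (first index wins exact ties)."""
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With mean weights that map an input to two nearly tied outputs, `forward(x, use_noise=False)` returns the same values on every call, so `greedy_action` never changes. With `use_noise=True` the chosen action varies from call to call, by an amount that depends on the learned sigma.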

Snake makes this especially acute because a single wrong turn at length 100+ is immediately fatal. The deterministic traps are lethal in a way they wouldn’t be in more forgiving environments.

One caveat: at one very late checkpoint where sigma had grown extremely large, stochastic eval finally dropped below deterministic. There’s a productive zone for noise magnitude, and past it the noise becomes destructive. So it’s not “always evaluate with noise.” It’s “don’t assume deterministic eval is automatically the ground truth.”

Has anyone else seen this kind of eval divergence with noisy nets? Curious whether it’s specific to tight spatial environments like Snake or shows up more broadly.

submitted by /u/statphantom