Reproducible DQN / Double DQN / Dueling comparison with diagnostics and generalization tests (LunarLander-v3)

I wanted to compare vanilla DQN, Double DQN (DDQN), and Dueling DDQN beyond final reward alone, so I built a structured training and evaluation setup around LunarLander-v3.

Instead of tracking only episode return, I monitored:

• activation and gradient distributions

• update-to-data ratios for optimizer diagnostics

• action gap and Q-value dynamics

• win rate with 95% confidence intervals

• generalization via human-prefix rollouts

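For instance, the action gap can be logged each step as the difference between the best and second-best Q-values (a minimal sketch; the function name and the batched Q-value layout are my assumptions, not the post's actual code):

```python
import numpy as np

def action_gap(q_values: np.ndarray) -> np.ndarray:
    """Per-state action gap: best Q minus second-best Q.

    q_values: array of shape (batch, n_actions).
    """
    # Sort along the action axis and take the top two values per state.
    top2 = np.sort(q_values, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

q = np.array([[1.0, 3.0, 2.5, 0.0]])
print(action_gap(q))  # [0.5]
```

A shrinking action gap during training is one sign of overestimation or value collapse, which is why it can be worth tracking alongside raw Q-values.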
The strongest model (<9k parameters) achieves a 98.4% win rate (±0.24%, 95% CI) across 10k seeds.
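As a sanity check on that interval, a normal-approximation 95% CI for a binomial win rate over 10k independent seeds comes out close to the quoted margin (a sketch of the arithmetic only; the exact interval method used in the post, e.g. Wilson, may differ slightly):

```python
import math

def win_rate_ci(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Point estimate and normal-approximation half-width for a binomial proportion."""
    p = wins / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, half

p, half = win_rate_ci(9840, 10_000)
print(f"{p:.1%} ± {half:.2%}")  # roughly 98.4% ± 0.25%
```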

The resulting evaluation framework can be applied to other Gymnasium environments.

I’d appreciate feedback, especially on evaluation methodology.

https://medium.com/towards-artificial-intelligence/apollo-dqn-building-an-rl-agent-for-lunarlander-v3-5040090a7442

submitted by /u/yarchickkkk
