Update: Why Supervised Learning on Q-values Broke My Dueling DDQN Chess Agent

A few weeks ago I posted here asking for advice about a Dueling DDQN chess agent that completely collapsed after I pretrained it with supervised learning.

Several people pointed out that the issue might be the transition from supervised learning to value-based RL, and that actor-critic methods might be a better fit. They were right.

I had been treating the Q-values as logits. Using cross-entropy loss during supervised learning meant that the “correct” Q-value (the expert move) was being pushed to extremely large magnitudes, far beyond the [-1, 1] range dictated by my reward function.

(I was staring at my screen for a while in disbelief when I found out what I’d done, haha. The downside of coding at 2 am, I suppose.)

When I plugged the pre-trained model into my RL pipeline, that mismatch caught up with me: the TD targets were bounded by the [-1, 1] rewards, while the pretrained outputs were orders of magnitude larger, and training collapsed.
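
To make the mismatch concrete, here's a minimal sketch of the two losses side by side. This is not my actual training code: the linear "Q head", the 4672-move action space, and all the tensor shapes are stand-ins just to show how cross-entropy pretraining and the Double DQN TD loss pull the same outputs in incompatible directions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins (hypothetical shapes, not my real network): a linear "Q head"
# over a flattened board encoding, and a small batch of fake positions.
num_moves, batch = 4672, 8           # 4672 is the usual AlphaZero-style chess move count
q_net = nn.Linear(64 * 12, num_moves)
target_net = nn.Linear(64 * 12, num_moves)
state = torch.randn(batch, 64 * 12)
next_state = torch.randn(batch, 64 * 12)
expert_move = torch.randint(0, num_moves, (batch,))
action = expert_move                 # pretend the agent played the expert move
reward = torch.rand(batch) * 2 - 1   # rewards live in [-1, 1]
done = torch.zeros(batch)
gamma = 0.99

q_values = q_net(state)              # shape: (batch, num_moves)

# Supervised pretraining (the bug): the Q-head outputs are treated as logits.
# Cross-entropy keeps pushing the expert move's "Q-value" up and the rest down,
# with no bound on magnitude -- nothing ties the outputs to the reward scale.
sl_loss = F.cross_entropy(q_values, expert_move)

# Value-based RL (what those same outputs feed into next): the TD target is
# built from rewards in [-1, 1], so the pretrained outputs are wildly out of
# scale the moment training switches over.
with torch.no_grad():
    best_next = q_net(next_state).argmax(dim=1, keepdim=True)        # Double DQN: online net picks the action
    next_q = target_net(next_state).gather(1, best_next).squeeze(1)  # target net evaluates it
    td_target = reward + gamma * (1 - done) * next_q
rl_loss = F.smooth_l1_loss(q_values.gather(1, action.unsqueeze(1)).squeeze(1), td_target)
```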

I wrote up a detailed breakdown of what went wrong, what worked (dueling heads, canonical board views), and why I’m switching to an actor–critic approach going forward.
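
For anyone unfamiliar with the dueling part (one of the pieces that did work), the idea is to split the output head into a state-value stream and an advantage stream and recombine them. Here's a minimal sketch, assuming a generic feature extractor upstream; the class and parameter names are just illustrative, not lifted from my code:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling architecture: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, feature_dim: int, num_moves: int):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # state-value stream V(s)
        self.advantage = nn.Linear(feature_dim, num_moves)   # advantage stream A(s, a)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)                             # (batch, 1)
        a = self.advantage(features)                         # (batch, num_moves)
        # Subtract the mean advantage so V and A stay identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```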

If you’re interested, you can read the full article here:

https://knightmareprotocol.hashnode.dev/we-had-a-good-run-dueling-ddqn-and-i

Thanks again to everyone who gave suggestions on the original post; it helped me zero in on the real issue.

submitted by /u/GallantGargoyle25