Improving defense in a PPO agent for a constrained Gomoku variant

digitado ⋅ July 1, 2026

I’ve been working on an RL agent for a constrained variant of Gomoku as a personal project.

The game is played on a 20×20 board. The first move must be in the center, and every subsequent move must be adjacent to an existing stone, so the game develops as a growing cluster rather than over the whole board.
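A minimal sketch of that move constraint as an action mask, under some assumptions the post does not state: "adjacent" is taken to include diagonals (8 neighbours), the board is an integer array with 0 for empty, and the "center" of an even-sized board is taken as cell (10, 10). The function name `legal_move_mask` is hypothetical.

```python
import numpy as np

BOARD_SIZE = 20  # from the post; the rest of this sketch is assumed


def legal_move_mask(board: np.ndarray) -> np.ndarray:
    """Return a boolean mask of legal moves.

    board: square int array, 0 = empty, nonzero = stone (hypothetical encoding).
    """
    n = board.shape[0]
    occupied = board != 0
    mask = np.zeros_like(occupied)
    if not occupied.any():
        # First move: only the center is legal. A 20x20 board has no single
        # central cell, so (n // 2, n // 2) is an arbitrary choice here.
        mask[n // 2, n // 2] = True
        return mask
    # Pad with False so neighbour lookups do not wrap around the edges.
    padded = np.pad(occupied, 1, constant_values=False)
    near = np.zeros_like(occupied)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            # near[r, c] becomes True if occupied[r + dr, c + dc] is True.
            near |= padded[1 + dr : 1 + dr + n, 1 + dc : 1 + dc + n]
    # Legal = empty cell with at least one occupied neighbour.
    return near & ~occupied
```

Flattening this mask gives the kind of per-action boolean vector that a masked policy expects.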

My current setup is:

- Maskable PPO
- Custom Gymnasium environment
- Curriculum learning:
  - Random opponent
  - Minimax (depth 1)
  - Minimax (depth 2)
- Opponent pool: previous versions are added if they achieve at least a 55% win rate against the current pool
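The pool admission rule can be sketched roughly as follows. Only the 55% threshold comes from the setup above; the `play_game` callback, the number of games per opponent, and the function names are hypothetical, and the pool is assumed to be seeded with at least one opponent.

```python
from typing import Callable, List

# play_game(candidate, opponent) -> True if the candidate wins (hypothetical).
PlayGame = Callable[[object, object], bool]


def win_rate_vs_pool(candidate, pool: List[object], play_game: PlayGame,
                     games_per_opponent: int = 20) -> float:
    """Average win rate of `candidate` over every opponent in the pool."""
    if not pool:
        raise ValueError("pool must be seeded with at least one opponent")
    wins = 0
    total = 0
    for opponent in pool:
        for _ in range(games_per_opponent):
            wins += bool(play_game(candidate, opponent))
            total += 1
    return wins / total


def maybe_add_to_pool(candidate, pool: List[object], play_game: PlayGame,
                      threshold: float = 0.55,
                      games_per_opponent: int = 20) -> bool:
    """Append `candidate` to the pool if it meets the win-rate threshold."""
    rate = win_rate_vs_pool(candidate, pool, play_game, games_per_opponent)
    if rate >= threshold:
        pool.append(candidate)
        return True
    return False
```

With few games per opponent the measured win rate is noisy, so the game count is worth treating as a tunable parameter.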

The feature extractor is a custom CNN using four input channels:

- my pieces
- opponent pieces
- valid frontier moves
- threat map (positions where the opponent can win next move)
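A rough sketch of how these four channels could be assembled, assuming the standard Gomoku win condition of five or more in a row (the post does not state the variant's win length) and boolean arrays for each plane. The helper names are hypothetical and the brute-force scan favours clarity over speed.

```python
import numpy as np

WIN_LENGTH = 5  # standard Gomoku; an assumption for this variant
DIRECTIONS = ((0, 1), (1, 0), (1, 1), (1, -1))


def _run_length(stones: np.ndarray, r: int, c: int, dr: int, dc: int) -> int:
    """Count consecutive stones starting one step from (r, c) along (dr, dc)."""
    n = stones.shape[0]
    count = 0
    r, c = r + dr, c + dc
    while 0 <= r < n and 0 <= c < n and stones[r, c]:
        count += 1
        r, c = r + dr, c + dc
    return count


def winning_cells(stones: np.ndarray, empty: np.ndarray) -> np.ndarray:
    """Empty cells where one more stone for `stones` makes WIN_LENGTH in a row."""
    out = np.zeros_like(empty)
    for r, c in zip(*np.nonzero(empty)):
        for dr, dc in DIRECTIONS:
            total = (1 + _run_length(stones, r, c, dr, dc)
                     + _run_length(stones, r, c, -dr, -dc))
            if total >= WIN_LENGTH:
                out[r, c] = True
                break
    return out


def build_observation(mine: np.ndarray, theirs: np.ndarray,
                      frontier: np.ndarray) -> np.ndarray:
    """Stack the four planes in the order listed above: shape (4, N, N)."""
    empty = ~(mine | theirs)
    threat = winning_cells(theirs, empty)
    return np.stack([mine, theirs, frontier, threat]).astype(np.float32)
```

Calling `winning_cells(mine, empty)` instead gives the agent's own immediate wins, which is not one of the four channels described here.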

The agent has never trained against depth-3 minimax. In evaluation against it, the agent scores roughly 12 wins / 8 losses over 20 non-deterministic games, and when evaluated deterministically as the first player it consistently beats depth-3.
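To put the 12 / 8 result in context, here is a quick sketch of a 95% Wilson score interval for a win rate measured over 20 games. It assumes the games are independent and ignores draws.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial win rate (z = 1.96 for ~95%)."""
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half, center + half


low, high = wilson_interval(12, 20)
print(f"win rate 0.60, 95% interval about {low:.2f} to {high:.2f}")
```

The interval runs from about 0.39 to 0.78, so 20 games alone cannot distinguish a slight edge over depth-3 from an even matchup.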

The biggest weakness I’ve observed is defense. The agent often fails to respond correctly to dangerous positions even though the threat map is provided as an input channel.
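One way to quantify this weakness is a "block rate": of the positions where the threat map is non-empty, how often does the chosen move land on a threat cell? The sketch below assumes logged pairs of threat map and chosen `(row, col)` move. The function name and log format are hypothetical.

```python
import numpy as np


def block_rate(threat_maps, actions) -> float:
    """Fraction of threatened positions where the move hits a threat cell.

    threat_maps: iterable of (N, N) boolean arrays (the threat channel).
    actions: iterable of (row, col) moves chosen in those positions.
    """
    threatened = 0
    blocked = 0
    for threat, (r, c) in zip(threat_maps, actions):
        if not threat.any():
            continue  # no immediate threat, nothing to block
        threatened += 1
        blocked += bool(threat[r, c])
    return blocked / threatened if threatened else float("nan")
```

This is a coarse measure. Ignoring a threat is correct when the agent can win immediately, and blocking one cell does not help when two or more threat cells exist, so those cases are worth counting separately.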

I’m looking for suggestions on what direction you would investigate next.

Would you focus on:

- reward shaping?
- improving the curriculum?
- different self-play strategies?
- network architecture?
- something else entirely?

Any papers or similar projects I should look at would also be greatly appreciated.

submitted by /u/Choice_Balance9681