Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO
arXiv:2604.04983v1 Announce Type: new Abstract: We present Territory Paint Wars, a minimal competitive multi-agent reinforcement learning environment implemented in Unity, and use it to systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self-play. A first agent trained for $84{,}000$ episodes achieves only $26.8%$ win rate against a uniformly-random opponent in a symmetric zero-sum game. Through controlled ablations we identify five implementation-level failure modes — reward-scale imbalance, missing terminal signal, ineffective long-horizon credit assignment, unnormalised observations, and […]