We Ran the Largest AI Pokemon Tournament Ever. Now It’s an Open Benchmark.
We built a standardized Pokemon benchmark and ran a NeurIPS 2025 competition to validate it. RL specialists easily beat LLM generalists at battling, but hybrid methods (LLM planning + RL execution) won at speedrunning. The ranking of LLMs in the battling arena differs from standard benchmark leaderboards, and harness design matters as much as model choice. See our paper for full details: https://arxiv.org/abs/2603.15563
submitted by /u/PokeAgentChallenge