Pokemon Showdown AI (ELO 1900+)

I’ve spent some time recently building an RL agent to play competitive Pokémon (Generation 9 Random Battles on Pokémon Showdown). I wanted to share the architecture, the training pipeline, and some thoughts on the MCTS vs. pure-network approaches in this specific environment.

Why Pokémon?

From an RL perspective, a Pokémon battle is a great proxy for real-world, messy decision-making. It combines three massive headaches:

  1. Simultaneous Action: Both agents lock in actions concurrently. You are trying to approximate Nash Equilibria, not just solve an MDP.
  2. Imperfect Information: Opponent sets, stats, and abilities are hidden variables. You have to maintain an implicit belief state.
  3. High Stochasticity: Damage rolls, crits, and secondary effects mean tactically optimal decisions carry non-zero failure probabilities.
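To make point 3 concrete, here's a small sketch of why "optimal" moves still fail: a single attack's outcome is a distribution over the 16 uniform damage rolls (85%–100%) plus the crit branch. The function below is illustrative only (simplified Gen 9 mechanics, no type/STAB modifiers), not part of the agent:

```python
def ko_probability(base_damage, opp_hp, n_rolls=16, crit_rate=1/24, crit_mult=1.5):
    """Probability a single attack KOs the target, marginalizing over the
    16 uniform damage rolls (85%..100%) and the base crit rate.
    Simplified sketch: ignores type effectiveness, STAB, items, etc."""
    rolls = [0.85 + 0.01 * i for i in range(n_rolls)]
    p = 0.0
    for r in rolls:
        dmg = base_damage * r
        # weight each roll by the crit / no-crit branches
        p += (1 / n_rolls) * (
            crit_rate * (dmg * crit_mult >= opp_hp)
            + (1 - crit_rate) * (dmg >= opp_hp)
        )
    return p
```

Even a move that KOs on 11 of 16 rolls leaves a ~30% chance the target survives a non-crit, which is exactly the kind of tail risk the policy has to price in.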

Prior Art: Engine-Assisted Search

If you look at the literature for high-performing Showdown bots (Wang, PokéChamp, Foul Play), they rely heavily on engine-assisted search—usually Expectimax or MCTS.

While they achieve high win rates, they depend on a near-perfect simulation engine to evaluate candidate moves. My goal was instead to probe the performance ceiling of a pure neural-network agent.
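For contrast with the pure-network approach, here is a minimal expectimax skeleton of the kind these engine-assisted bots build on: max over the agent's actions, expectation over chance outcomes (damage rolls, secondary effects). The `actions`/`transitions`/`value` callables are hypothetical stand-ins for a real battle engine, which is exactly the dependency I wanted to avoid:

```python
def expectimax(state, depth, actions, transitions, value):
    """Minimal expectimax over chance nodes. `transitions(state, a)`
    returns [(probability, next_state), ...]; `value` is a leaf
    evaluation. All three callables are placeholders for an engine."""
    if depth == 0:
        return value(state)
    best = float("-inf")
    for a in actions(state):
        # expected value of action a over the chance distribution
        ev = sum(
            p * expectimax(s2, depth - 1, actions, transitions, value)
            for p, s2 in transitions(state, a)
        )
        best = max(best, ev)
    return best
```

Note this sketch omits the simultaneous-action aspect (a full treatment needs something closer to a matrix-game solve at each ply), which is part of why these bots are substantially more complex than plain game-tree search.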

The Approach: PokeTransformer

Flattening 12 Pokémon, their discrete moves, and global field effects into a 1D array destroys the semantic geometry of the state space. To fix this, I moved to a Transformer architecture.

  • Bespoke Representation: Specialized subnets encode move, ability, and Pokémon vectors. The game state is modeled as a sequence of discrete embeddings (1 field token, 12 Pokémon tokens).
  • Training Pipeline:
    1. Imitation Learning: Bootstrapped via cross-entropy loss on a dataset generated by poke-env's SimpleHeuristicsPlayer to learn legal, logically sound moves.
    2. PPO & Self-Play: Transitioned to distributed self-play for policy improvement.
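To make the token-sequence idea concrete, here's a toy sketch of assembling the 13-token state (1 field token + 12 Pokémon tokens, 6 per side) before it would enter the Transformer. Feature layouts, dimensions, and encoder details are all hypothetical stand-ins for the learned subnets described above:

```python
EMBED_DIM = 8  # illustrative width; the real model would be much wider

def field_token(weather_id, terrain_id):
    """Toy field encoder: one-hot weather and terrain in fixed slots.
    The actual model uses a learned subnet instead."""
    vec = [0.0] * EMBED_DIM
    vec[weather_id] = 1.0       # slots 0-3: weather (hypothetical layout)
    vec[4 + terrain_id] = 1.0   # slots 4-6: terrain (hypothetical layout)
    return vec

def pokemon_token(species_id, hp_fraction):
    """Toy Pokémon encoder: hashed species slot plus current HP fraction."""
    vec = [0.0] * EMBED_DIM
    vec[species_id % (EMBED_DIM - 1)] = 1.0
    vec[-1] = hp_fraction
    return vec

def build_state_sequence(field, team, opp_team):
    """Assemble the 1 + 12 token sequence the Transformer attends over."""
    seq = [field_token(*field)]
    seq += [pokemon_token(*p) for p in team + opp_team]
    assert len(seq) == 13  # 1 field token + 12 Pokémon tokens
    return seq
```

The point of the sequence form is that attention can relate any Pokémon token to any other (e.g. active matchup vs. bench threats) without the fixed positional semantics a flat 1D vector would impose.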

Results

The agent peaked at ~1900 Elo (top 25%) on the Gen 9 Random Battle ladder. Inference is entirely search-free: the raw observation tensor is processed and an action is sampled in a single forward pass. While capable of high-level gameplay, it falls short of engine-assisted search agents such as Foul Play, which can achieve Elos exceeding 2300.
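The search-free inference step amounts to masking illegal actions (disabled moves, fainted switch targets) and sampling from the resulting softmax. A minimal, framework-free sketch of that sampling step, with hypothetical names, assuming the network has already produced per-action logits:

```python
import math
import random

def masked_sample(logits, legal_mask, rng=random):
    """Sample one action from policy logits in a single pass.
    Illegal actions are masked to -inf so they get probability zero."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, legal_mask)]
    m = max(masked)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling over the categorical distribution
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return len(probs) - 1, probs
```

Because there is no tree search or rollout, latency is just one network forward pass, which is what makes the pure-network agent cheap to serve on a live ladder.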

Challenge the Bot & Links

For the next couple of weeks, I will have the bot running on the Showdown servers accepting challenges for Gen 9 Random Battle. If you want to test its logic (or break its policy), you can challenge it directly!

submitted by /u/Nebraskinator