WM Arena: Compare world model predictions across 26 Atari games with blind battles and a perception quiz

I built WM Arena (arena.worldflux.ai), an interactive benchmark for visual world models on the Atari 100k suite.

Three modes:

– Visual Explorer: side-by-side real vs predicted frames across 26 games

– Blind Battle: Elo-ranked voting on anonymous model outputs

– Real or Predicted? Quiz: a perception test
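
For readers unfamiliar with Elo ranking, here is a minimal sketch of how a single blind-battle vote could update two models' ratings. It assumes the textbook Elo formula with a 400-point scale and K = 32. The post does not say which variant or constants WM Arena actually uses, so the function names, the starting rating of 1000, and the example vote are all hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one vote.

    score_a is 1.0 if voters preferred A, 0.0 if they preferred B,
    and 0.5 for a tie. K = 32 is an assumed constant, not taken from WM Arena.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical vote: two models start at 1000 and the voter prefers the first.
new_a, new_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)  # 1016.0 984.0
```

Because the update is zero-sum, the total rating across models stays constant, and an upset win over a higher-rated model moves ratings more than an expected win.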

Currently evaluating DIAMOND (NeurIPS ’24 Spotlight), TWISTER (ICLR ’25), IRIS (ICLR ’23), and STORM (NeurIPS ’23).

Every model runs its official code at a pinned commit. No re-implementations.

Try it: arena.worldflux.ai

Would love feedback from this community, especially on which models to add next. DreamerV3, Delta-IRIS, and EDELINE are on the roadmap.

submitted by /u/Confident_Gas_5266