[P] LLM Jigsaw: Benchmarking Spatial Reasoning in VLMs – frontier models hit a wall at 5×5 puzzles

I built a benchmark to test how well frontier multimodal LLMs can solve jigsaw puzzles through iterative reasoning.

The Task
– Shuffle an image into an N×N grid
– LLM receives: shuffled image, reference image, correct piece count, last 3 moves
– Model outputs JSON with swap operations
– Repeat until solved or max turns reached (a rough sketch of this loop is below)
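To make the loop concrete, here is a minimal sketch of how an episode could be driven, assuming pieces are tracked as a flat list of indices and the model replies with JSON like `{"swaps": [[0, 5]]}`. The function names, JSON schema, and the oracle stub are illustrative assumptions, not the benchmark's actual code:

```python
# Minimal sketch of the benchmark loop (assumptions, not the repo's real API).
# query_model() stands in for the VLM call, which would receive the shuffled
# image, the reference image, the piece count, and the last 3 moves.
import json
import random

def query_model(state, target, history):
    # Placeholder "oracle": propose one swap that puts a misplaced piece home.
    # The real benchmark sends images to a VLM and parses its JSON reply.
    for pos, piece in enumerate(state):
        if piece != pos:
            return json.dumps({"swaps": [[pos, state.index(pos)]]})
    return json.dumps({"swaps": []})

def run_episode(n=3, max_turns=20):
    target = list(range(n * n))             # piece i belongs at position i
    state = target[:]
    random.shuffle(state)                   # the shuffled puzzle
    history = []
    for turn in range(1, max_turns + 1):
        reply = query_model(state, target, history[-3:])
        for a, b in json.loads(reply)["swaps"]:
            state[a], state[b] = state[b], state[a]   # apply proposed swaps
            history.append((a, b))
        if state == target:                 # solved: every piece is home
            return True, turn
    return False, max_turns

print(run_episode(n=3))
```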

Results (20 images per config)

| Grid | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 |
|------|---------|--------------|-----------------|
| 3×3 | 95% solve | 85% solve | 20% solve |
| 4×4 | 40% solve | 25% solve | – |
| 5×5 | 0% solve | 10% solve | – |

Key Findings
1. Difficulty scales steeply – solve rates crash from 95% to near 0% between 3×3 and 5×5
2. Piece accuracy plateaus at 50-70% – models get stuck even with hints and higher reasoning effort (a rough metric sketch follows this list)
3. Token costs explode – Gemini uses ~345K tokens on 5×5 (vs ~55K on 3×3)
4. Higher reasoning effort helps marginally – but at 10x cost and frequent timeouts
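On finding 2: "piece accuracy" presumably means the fraction of pieces sitting in their correct cell when a run ends. With the flat-list state from the sketch above, it could be computed as:

```python
def piece_accuracy(state, target):
    # Fraction of pieces already in their correct position (0.0 to 1.0).
    return sum(s == t for s, t in zip(state, target)) / len(target)
```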

Why This Matters
Spatial reasoning is fundamental for robotics, navigation, and real-world AI applications. This benchmark is trivial for humans, yet it reveals a clear capability gap in current VLMs.

Links
– 📊 Results: https://filipbasara0.github.io/llm-jigsaw
– 💻 GitHub: https://github.com/filipbasara0/llm-jigsaw
– 🎮 Try it: https://llm-jigsaw.streamlit.app

Feedback welcome! Curious whether anyone has ideas on why the models plateau, or has run similar experiments.

