[R] Extreme Sudoku as a constraint-satisfaction benchmark, solved natively without tools or CoT or solution backtracking

I came across a writeup from Pathway that I think matters more as a reasoning benchmark than as a puzzle result.

They use “Sudoku Extreme”: about 250,000 very hard Sudoku instances. The appeal is that Sudoku here is treated as a pure constraint-satisfaction problem: solutions are trivial to verify and hard to bluff, and the task isn’t naturally linguistic. According to their numbers, leading LLMs (o3-mini, DeepSeek R1, Claude 3.7 8K) all get 0% accuracy on this benchmark, while their BDH architecture reaches 97.4% accuracy without chain-of-thought traces or explicit solution backtracking.
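
To make "trivial to verify" concrete, here is a minimal sketch of a checker. It assumes a grid is a 9x9 list of lists of ints, with 0 marking a blank in the puzzle; the function name `is_valid_solution` and that representation are my own choices for illustration, not anything taken from the Pathway benchmark.

```python
def is_valid_solution(grid, puzzle=None):
    """Return True if `grid` is a complete, valid Sudoku solution.

    grid:   9x9 list of lists of ints in 1..9 (assumed representation).
    puzzle: optional 9x9 list of lists with 0 for blanks; if given,
            the solution must also agree with every given clue.
    """
    full = set(range(1, 10))

    for i in range(9):
        # Row i and column i must each contain 1..9 exactly once.
        if set(grid[i]) != full:
            return False
        if {grid[r][i] for r in range(9)} != full:
            return False

    # Each 3x3 box must contain 1..9 exactly once.
    for br in range(0, 9, 3):
        for bc in range(0, 9, 3):
            box = {grid[r][c]
                   for r in range(br, br + 3)
                   for c in range(bc, bc + 3)}
            if box != full:
                return False

    # Optionally check that the given clues were not overwritten.
    if puzzle is not None:
        for r in range(9):
            for c in range(9):
                if puzzle[r][c] != 0 and puzzle[r][c] != grid[r][c]:
                    return False

    return True
```

The check is a fixed 27 set comparisons plus an optional clue pass, so a wrong answer cannot pass by being plausible-looking, which is what makes the accuracy numbers hard to game.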

What caught my attention is not just the reported result, but the mechanism claim: transformers do token-by-token continuation with a relatively limited internal state per step, which is a bad fit for search-heavy reasoning where you want to keep multiple candidate worlds in play, revise earlier assumptions, and converge under tight constraints. Writing a Python solver or calling tools “works,” but that’s a different capability from solving the constraint problem natively.
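
For contrast, this is roughly what the tool route looks like: a minimal sketch of a classical backtracking solver that always branches on the cell with the fewest remaining candidates. It is a textbook technique, not the BDH method, and the name `solve` plus the 9x9 list-of-lists representation with 0 for blanks are assumptions for illustration only.

```python
def solve(grid):
    """Solve a Sudoku in place by explicit backtracking search.

    grid: 9x9 list of lists of ints, 0 = empty (assumed representation).
    Returns True if a solution was found (grid is then filled in),
    False if the current partial assignment has no solution.
    """
    def candidates(r, c):
        # Digits not already used in the cell's row, column, or 3x3 box.
        used = set(grid[r]) | {grid[i][c] for i in range(9)}
        br, bc = 3 * (r // 3), 3 * (c // 3)
        used |= {grid[i][j]
                 for i in range(br, br + 3)
                 for j in range(bc, bc + 3)}
        return [d for d in range(1, 10) if d not in used]

    # Pick the empty cell with the fewest candidates.
    best = None
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                opts = candidates(r, c)
                if best is None or len(opts) < len(best[2]):
                    best = (r, c, opts)

    if best is None:
        return True  # no empty cells left: solved

    r, c, opts = best
    for d in opts:
        grid[r][c] = d      # tentatively commit to one candidate world
        if solve(grid):
            return True
        grid[r][c] = 0      # revise the earlier assumption and try the next
    return False            # dead end: caller must backtrack
```

The "keep multiple candidate worlds in play, revise earlier assumptions" part lives entirely in the recursion and the reset to 0, i.e. in explicit external state, which is exactly what a model has to do without when it is asked to solve the puzzle natively.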

Given how much recent work is about scaling up chain-of-thought and longer contexts, I think this raises some uncomfortable questions for transformer-centric reasoning:

1. If a model can’t handle a large, clean constraint-satisfaction benchmark without external tools, how far can language-only reasoning really be pushed?
2. Are we mostly rewarding longer verbalizations of search, instead of building architectures that actually perform search internally?
3. Do we need a different reasoning substrate (e.g., richer latent/continuous reasoning spaces with stronger internal memory) for these tasks, or can transformers realistically get there with enough scaffolding?

Edit: I’ve put the blog link and paper/benchmark details in the comments so it doesn’t clutter the post body.

submitted by /u/THEGAM3CHANG3R