Stress-Testing the Reasoning Competence of LLMs with Proofs Under Minimal Formalism
We introduce PROOFGRID, a challenging benchmark suite for evaluating the reasoning competence of language models through machine-checkable proofs rather than final answers alone. PROOFGRID spans 15 tasks organized around proof writing, proof checking, proof masking, and proof gap-filling. The tasks are expressed in deliberately minimal formal notation, most notably NDL, a stripped-down language for natural deduction that fits in a short prompt and supports precise, auditable checking. As a result, correctness judgments are mechanical, reproducible, and fine-grained rather than dependent on human or LLM judges. PROOFGRID discriminates model capabilities at high resolution and covers a carefully calibrated difficulty spectrum, from foundational reasoning tests to structurally rich challenge tasks that no current model solves, while avoiding reliance on domain knowledge, on problems that can easily be outsourced to solvers, and on long-context artifacts. We develop a detailed comparative framework for reasoning benchmarks and use it to survey and systematize a broad and previously fragmented literature. This analysis clarifies how PROOFGRID differs from existing benchmarks not only in difficulty but also in representational choices, verification guarantees, reasoning depth, and vulnerability to shortcut-based success. A key methodological contribution is an instrumented proof-checking pipeline that tolerates minor surface-level deviations while explicitly identifying the first substantive failure in a stretch of reasoning. This significantly sharpens measurement resolution, separating high-level proof planning from low-level execution noise, and reveals that current frontier models often possess surprisingly strong proof-strategy capabilities even when their formal proof execution is brittle. Using this pipeline, we conduct an extensive evaluation and analysis of a broad set of open and proprietary models released since late 2024.
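The paper's NDL checker is not reproduced here, but the core idea of the pipeline, normalizing away surface-level noise while reporting the index of the first substantive failure, can be sketched in a few lines. All names and the toy and-introduction rule below are illustrative assumptions, not PROOFGRID's actual implementation:

```python
def normalize(line: str) -> str:
    """Collapse surface-level variation: extra whitespace and letter case."""
    return " ".join(line.split()).lower()

def first_substantive_failure(steps, premises):
    """Check each proof step in order. A step passes if it restates a
    premise or an earlier line, or follows from two derived lines by a
    toy and-introduction rule. Returns the index of the first step that
    fails a substantive check, or None if every step passes."""
    derived = {normalize(p) for p in premises}
    for i, step in enumerate(steps):
        s = normalize(step)
        if s in derived:
            continue
        if " and " in s:
            left, _, right = s.partition(" and ")
            if left in derived and right in derived:
                derived.add(s)
                continue
        return i  # first substantive failure, not a surface deviation
    return None

# Surface noise ("P", extra spaces, "AND") is tolerated; the unsupported
# step "r" is flagged as the first substantive failure.
print(first_substantive_failure(["P", "q", "p  AND  q", "r"], ["p", "q"]))
```

The point of returning an index rather than a pass/fail bit is exactly the measurement-resolution argument above: it separates a sound plan with one slip from a proof that goes wrong immediately.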
Our results demonstrate strikingly rapid advances as well as sharp remaining limitations. Frontier systems now perform strongly on several foundational tasks, yet difficult PROOFGRID tasks (especially those requiring global combinatorial reasoning or low-level proof synthesis) remain far from solved. Beyond task-level accuracy, we identify widespread epistemic instability: models often generate proofs containing local logical errors that they can nevertheless recognize and reject when those same inferences are presented in isolation. We formalize this phenomenon via an Epistemic Stability Index (ESI), a quantitative measure of cross-context coherence between model outputs on different but related tasks (such as generated proofs and the corresponding entailment judgments). Finally, we complement raw accuracies with two-parameter logistic (2PL) Item Response Theory (IRT) analyses, Wright maps, and a new normalized measure of task discrimination based on Fisher information and latent model abilities estimated from response data. Taken together, these results position PROOFGRID not merely as a reasoning benchmark, but as a diagnostic framework for studying the emerging structure, limits, and internal coherence of reasoning in contemporary language models.
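The 2PL model and its Fisher information are standard; the paper's normalized discrimination measure builds on them but is not reproduced here. As a minimal sketch, an item with discrimination a and difficulty b, evaluated against a model of latent ability theta, can be scored as:

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability that a model with latent
    ability theta answers an item of discrimination a and difficulty b
    correctly, P(theta) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta:
    I(theta) = a^2 * P(theta) * (1 - P(theta)).
    It peaks at theta = b, where the item discriminates best."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)
```

A model whose ability matches the item's difficulty (theta = b) is answered correctly half the time, which is where the item carries the most information about ability; far from b, the item is nearly uninformative, which motivates weighting discrimination by Fisher information at the abilities the evaluated models actually occupy.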