A Formal Framework for Evaluating Reasoning Integrity in Language Models

Traditional evaluation of language models prioritizes final-answer accuracy, offering limited insight into the reasoning processes that produce those outputs. This paper introduces a formal framework for evaluating reasoning integrity by modeling inference as a trajectory of belief states under uncertainty. We define externally observable belief states that capture hypotheses, uncertainty distributions, and constraints at each reasoning step, enabling analysis without reliance on internal model representations. Building on this formulation, we propose a divergence functional that quantifies sustained disagreement between reasoning trajectories, together with a complexity-regularization term that penalizes excessive or redundant reasoning. These components are combined into a unified scoring function that balances consistency and parsimony. To operationalize the framework, we introduce a multi-stage evaluation protocol that constrains intermediate reasoning, injects minimal adversarial perturbations, and measures both divergence and repair cost. We establish theoretical properties of the proposed metrics, including boundedness, invariance under semantic-preserving transformations, and stability under controlled perturbations. Analytical examples illustrate how the framework distinguishes robust reasoning processes from brittle or superficial ones that maintain correctness without internal consistency. By shifting evaluation from outcomes to the dynamics of reasoning, this framework provides a principled basis for assessing reliability and stability in modern language models.
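As a rough illustration of how the components described above might compose (the abstract does not give the actual definitions, so the symbols S, D, C, λ, and the trajectories τ, τ' are hypothetical), the unified score could take a form along these lines:

```latex
% Hypothetical sketch: unified score as sustained divergence between two
% reasoning trajectories plus a weighted complexity-regularization penalty.
S(\tau, \tau') \;=\; D(\tau, \tau') \;+\; \lambda \, C(\tau),
\qquad \lambda > 0,
% where D measures sustained disagreement between trajectories \tau and \tau',
% and C penalizes excessive or redundant reasoning steps in \tau.
```

Under this reading, lower scores would indicate trajectories that remain consistent under perturbation while avoiding redundant steps, matching the abstract's stated balance between consistency and parsimony.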
