Predictable Compression Failures: Order Sensitivity and Information Budgeting for Evidence-Grounded Binary Adjudication

arXiv:2509.11208v2 Announce Type: replace
Abstract: Transformers used for evidence-grounded question answering with binary adjudication (e.g., support/refute or yes/no) can be highly sensitive to the order in which exchangeable evidence is presented, producing dispersion across permutations and unreliable attempted answers (“hallucinations” under a Bernoulli predicate).
We treat evidence order as a nuisance variable and show that next-token training minimizes expected conditional description length over orderings. This objective can be close to Bayes-optimal in expectation while deviating under any fixed ordering. We quantify this expectation–realization gap via a Quantified Martingale Violation (QMV) bound that predicts $\mathcal{O}(\log n)$ growth in permutation dispersion under harmonic positional sensitivity.
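A minimal numerical sketch of why harmonic positional sensitivity yields logarithmic dispersion: if the perturbation contributed by the evidence item at position $k$ scales like $1/k$, the accumulated dispersion over $n$ items is the harmonic number $H_n \approx \log n$. The $1/k$ sensitivity model here is an illustrative assumption, not the paper's exact construction.

```python
import math

def harmonic_dispersion(n: int) -> float:
    """Accumulated dispersion H_n = sum of 1/k for k = 1..n,
    under the assumed 1/k per-position sensitivity."""
    return sum(1.0 / k for k in range(1, n + 1))

# H_n tracks log n plus the Euler-Mascheroni constant (~0.5772),
# i.e. O(log n) growth in permutation dispersion.
for n in (10, 100, 1000):
    print(n, round(harmonic_dispersion(n), 4),
          round(math.log(n) + 0.5772, 4))
```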
We then derive the Expectation-level Decompression Law (EDFL), relating expected information budget to achievable reliability for Bernoulli predicates, and use it to define \emph{Bits-to-Trust} (B2T), \emph{Risk-of-Hallucination} (RoH), and the \emph{Information Sufficiency Ratio} (ISR), together with a fixed ISR-gating rule for answer/abstain decisions under permutation mixtures.
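A hedged sketch of the answer/abstain gate described above: the names B2T, RoH-style reliability target, and ISR follow the abstract, but using binary KL divergence as the "bits" measure and these specific formulas are illustrative assumptions, not the paper's exact law.

```python
import math

def binary_kl(p: float, q: float) -> float:
    """KL divergence in bits between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def bits_to_trust(prior: float, target: float) -> float:
    """B2T (assumed form): bits needed to move a Bernoulli prior
    to the target reliability level."""
    return binary_kl(target, prior)

def isr_gate(budget_bits: float, prior: float, target: float = 0.95):
    """ISR gate: answer only when the available information budget
    covers B2T, i.e. ISR = budget / B2T >= 1."""
    b2t = bits_to_trust(prior, target)
    isr = budget_bits / b2t if b2t > 0 else float("inf")
    return ("answer" if isr >= 1.0 else "abstain"), isr

# Ample budget -> answer; thin budget -> abstain.
print(isr_gate(budget_bits=3.0, prior=0.5))
print(isr_gate(budget_bits=0.5, prior=0.5))
```

The gate abstains exactly when the evidence supplies fewer bits than the (assumed) B2T requirement, mirroring the fixed ISR $= 1$ rule used in the held-out audit.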
On 3,059 grounded items from a five-benchmark evidence-grounded QA suite (FEVER, HotpotQA, NQ-Open, PopQA, and Controls), we observe logarithmic dispersion and Jensen gains from uniform permutation mixtures. In a pre-specified held-out audit (528 items), an ISR $= 1$ gate attains 0.0–0.7% hallucination with 20.6–27.9% abstention (95% confidence intervals).