Knowing Before Speaking: In-Computation Metacognition Precedes Verbal Confidence in Large Language Models
We propose the Knowledge Landscape hypothesis: the forward pass of a large language model (LLM) encodes whether the model knows the answer before it produces any output token. Well-learned knowledge corresponds to deep convergence valleys in the activation landscape; unlearned queries traverse flat plains where signals disperse. These geometric properties manifest as measurable signals, token-level entropy and layer-wise hidden-state variance, that precede and causally influence the model's output uncertainty. On TriviaQA with Qwen2.5-7B and Mistral-7B, token entropy strongly discriminates known from unknown questions (Mann-Whitney p < 10⁻⁷, rank-biserial r > 0.5 for both architectures). Hidden-state variance localises a metacognitive locus at layer 9 and layers 20–27 (peak p < 10⁻⁴, r = 0.46). Activation patching with monotone interpolation provides causal confirmation: entropy decreases strictly as the known hidden state is progressively substituted, with a Spearman rank correlation of −1 (permutation p < 0.001). A single-pass abstention system built on these signals achieves an area under the ROC curve of 0.804 and a 5.6-percentage-point accuracy gain over the unaided baseline, without any fine-tuning.
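To make the headline statistic concrete, the following is a minimal sketch (not the paper's pipeline) of the two quantities behind the first result: token-level Shannon entropy of a next-token distribution, and the rank-biserial effect size derived from the Mann-Whitney U statistic. The distributions and noise levels below are synthetic placeholders chosen only to illustrate that peaked ("known") distributions yield low entropy and flat ("unknown") ones yield high entropy.

```python
import math
import random

def token_entropy(probs):
    # Shannon entropy (in nats) of a next-token probability distribution
    return -sum(p * math.log(p) for p in probs if p > 0)

def rank_biserial(known, unknown):
    # Effect size from the Mann-Whitney U statistic:
    # r = 2U / (n1 * n2) - 1, where U counts (unknown, known) pairs
    # with unknown > known (ties count half).
    u = sum(1 for a in unknown for b in known if a > b)
    u += 0.5 * sum(1 for a in unknown for b in known if a == b)
    return 2 * u / (len(known) * len(unknown)) - 1

# Synthetic illustration: "known" queries give peaked distributions
# (low entropy); "unknown" queries give near-uniform ones (high entropy).
random.seed(0)
known = [token_entropy([0.9] + [0.1 / 9] * 9) + random.gauss(0, 0.1)
         for _ in range(50)]
unknown = [token_entropy([0.1] * 10) + random.gauss(0, 0.1)
           for _ in range(50)]

print(round(rank_biserial(known, unknown), 2))
```

With a clean separation like this the effect size approaches 1; the abstract's r > 0.5 reflects real, noisier data.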