Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning
arXiv:2602.11201v1 Announce Type: new Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model’s decision-making process. Our approach corrupts individual reasoning steps in the explanation and measures how much the model’s confidence in […]
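
The corrupt-one-step-and-measure idea described in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it defines the metric, so this is not the paper's NLDD formula: the logit-difference definition, the normalization by the clean value, and every name here (`logit_difference`, `normalized_drop`, `per_step_drops`, `toy_score`, `toy_corrupt`) are assumptions made for illustration, and the toy scorer stands in for a real language model.

```python
from typing import Callable, List, Sequence


def logit_difference(logits: Sequence[float], correct: int) -> float:
    """Logit of the correct answer minus the largest competing logit."""
    best_other = max(v for i, v in enumerate(logits) if i != correct)
    return logits[correct] - best_other


def normalized_drop(clean_ld: float, corrupted_ld: float) -> float:
    """Fraction of the clean logit difference lost after corruption (assumed normalization)."""
    if clean_ld == 0:
        raise ValueError("clean logit difference is zero; the drop is undefined")
    return (clean_ld - corrupted_ld) / abs(clean_ld)


def per_step_drops(
    steps: Sequence[str],
    correct: int,
    score_fn: Callable[[Sequence[str]], Sequence[float]],
    corrupt_fn: Callable[[str], str],
) -> List[float]:
    """Corrupt one reasoning step at a time and record the normalized drop."""
    clean_ld = logit_difference(score_fn(steps), correct)
    drops = []
    for i in range(len(steps)):
        corrupted = list(steps)
        corrupted[i] = corrupt_fn(steps[i])
        corrupted_ld = logit_difference(score_fn(corrupted), correct)
        drops.append(normalized_drop(clean_ld, corrupted_ld))
    return drops


# Hypothetical stand-in for a model: each intact step adds a fixed amount
# to the correct answer's logit; the third step contributes nothing.
STEP_WEIGHTS = [2.0, 1.0, 0.0]
CORRUPTED = "<corrupted>"


def toy_score(steps: Sequence[str]) -> List[float]:
    correct_logit = 1.0 + sum(
        w for w, s in zip(STEP_WEIGHTS, steps) if s != CORRUPTED
    )
    return [correct_logit, 1.0]  # index 0 is the correct answer


def toy_corrupt(step: str) -> str:
    return CORRUPTED


drops = per_step_drops(
    ["step one", "step two", "step three"],
    correct=0,
    score_fn=toy_score,
    corrupt_fn=toy_corrupt,
)
```

In this toy setup, a step whose corruption leaves the logit difference unchanged gets a drop of zero, which is the kind of signal one might read as the step not influencing the answer. With a real model, `score_fn` would return answer logits conditioned on the (possibly corrupted) reasoning text.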