MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models
arXiv:2605.28825v1 Announce Type: new
Abstract: Large language models (LLMs) frequently encode factual and reasoning knowledge in their internal representations that is not faithfully reflected in their surface-level outputs, a phenomenon known as latent knowledge. Existing approaches to eliciting latent knowledge, such as Contrast-Consistent Search (CCS), rely on contrastive activation patterns and struggle with complex multi-step reasoning tasks, while mechanistic interpretability tools have primarily been used to understand model behavior rather than to extract hidden knowledge. We present MechELK, a unified three-stage framework that bridges mechanistic interpretability and latent knowledge elicitation. MechELK operates through three stages: (1) Locate, which uses Sparse Autoencoder (SAE) feature analysis and activation patching to identify knowledge-bearing representations; (2) Verify, which employs causal probing to distinguish genuine latent knowledge from spurious correlations; and (3) Elicit, which applies representation engineering to surface hidden knowledge without modifying model weights. Evaluated on TruthfulQA, a curated Deceptive Alignment benchmark, and the Quirky LM dataset, MechELK achieves an average elicitation accuracy of 84.7%, outperforming CCS by 6.2% and direct linear probing by 9.1%. Crucially, MechELK successfully identifies latent knowledge in 78.3% of cases where the model’s surface output is incorrect or evasive, demonstrating its utility for AI safety applications including deceptive alignment detection.
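
As a rough illustration of what "representation engineering without modifying model weights" can look like, the sketch below uses difference-of-means steering, one common technique in that family. It is not taken from the paper and may differ from MechELK's Elicit stage. The activations are synthetic stand-ins for residual-stream vectors, and the names `mean_difference_direction`, `steer`, `d_model`, and `alpha` are hypothetical, chosen only for this example.

```python
import numpy as np


def mean_difference_direction(acts_pos, acts_neg):
    """Unit vector pointing from the mean of acts_neg to the mean of acts_pos."""
    direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return direction / np.linalg.norm(direction)


def steer(hidden, direction, alpha):
    """Shift hidden states along `direction` by `alpha`; no weights are changed."""
    return hidden + alpha * direction


# Synthetic activations (assumption: a real pipeline would read these
# from a chosen layer of the model instead of sampling them).
rng = np.random.default_rng(0)
d_model = 16
true_axis = np.zeros(d_model)
true_axis[0] = 1.0
acts_true = rng.normal(size=(200, d_model)) + 2.0 * true_axis
acts_false = rng.normal(size=(200, d_model)) - 2.0 * true_axis

# Estimate the direction separating the two classes of activations.
direction = mean_difference_direction(acts_true, acts_false)

# Apply the shift to a batch of new hidden states at inference time.
hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, direction, alpha=4.0)

# Projection onto the direction before and after steering.
proj_before = hidden @ direction
proj_after = steered @ direction
```

Because `direction` has unit norm, each projection increases by exactly `alpha`, while the components of `hidden` orthogonal to `direction` are unchanged. In a real model the shifted activations would be written back into the forward pass, for example through a hook at the chosen layer.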