[R] Concept Influence: Training Data Attribution via Interpretability (same performance as influence functions, 20× faster)
TL;DR: We attribute model behavior to interpretable vectors (probes, SAE features) instead of individual test examples. This makes TDA more semantically meaningful and 20× faster than influence functions.
The Problem:
Standard influence functions have two issues:
– They condition on single test examples → biased toward lexical overlap rather than semantic similarity
– Computationally expensive at LLM scale
Our Approach:
Instead of attributing to ∇θ L(z_test), we attribute to ∇θ f_v^ℓ(x_test), where v is a semantic direction (probe/SAE feature) at layer ℓ.
This shifts the question from “which data matches this output?” to “which data causes this behavior?”
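To make the gradient substitution concrete, here's a minimal numpy sketch of the idea (my own toy illustration, not the paper's code): the model is a single linear layer, the "concept" is a probe direction v at its output, and a training point's influence on the concept is the dot product of its loss gradient with the gradient of the concept score — the usual influence-function inner product, but with ∇θ f_v(x_test) in place of ∇θ L(z_test). All names here are hypothetical.

```python
import numpy as np

# Toy model: one linear layer, h = W @ x.
# Concept score for direction v: f_v(x) = v . (W @ x).
# Concept influence of a training point z = (x, y) on v (dot-product form):
#   I(z, v) = <grad_W f_v(x_test), grad_W L(z)>
rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))

v = rng.normal(size=d_out)        # probe / SAE feature direction (assumed given)
x_test = rng.normal(size=d_in)

# grad of f_v(x_test) = v^T W x_test w.r.t. W is the outer product v x_test^T
grad_concept = np.outer(v, x_test)

# Training point with squared loss L = 0.5 * ||W x - y||^2;
# its gradient w.r.t. W is (W x - y) x^T
x_train = rng.normal(size=d_in)
y_train = rng.normal(size=d_out)
grad_train = np.outer(W @ x_train - y_train, x_train)

# Concept influence: alignment of the two parameter gradients
influence = float(np.sum(grad_concept * grad_train))
print(influence)
```

For a real LLM the gradients would come from autograd (and influence functions would additionally apply an inverse-Hessian preconditioner); the point of the sketch is only the swap of the attribution target from a test loss to a concept score.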
Key Results:
– On emergent misalignment: Concept Influence outperforms influence functions across all datasets (Figure 2)
– On OASST1: Using only 5% of data maintains full capability while reducing harm 3× (Figure 5)
– Simple probe methods are 20× faster and work surprisingly well (we prove they’re first-order approximations)
– SAE clustering reveals semantic features driving behaviors (2000× higher influence on relevant concepts, Figure 4)
Paper: https://arxiv.org/abs/2602.14869
Blog: https://www.far.ai/news/concept-data-attribution-02-2026
Interested in feedback on applications beyond safety and comparisons with other TDA methods. Happy to answer questions!
submitted by /u/KellinPelrine