[R] Concept Influence: Training Data Attribution via Interpretability (same performance as influence functions, 20× faster)

TL;DR: We attribute model behavior to interpretable concept vectors (probes, SAE features) rather than conditioning on individual test examples. This makes TDA more semantically meaningful and 20× faster than influence functions.

The Problem:

Standard influence functions have two issues:

– Conditioning on single test examples biases attributions toward lexical overlap rather than semantic similarity

– Computationally expensive at LLM scale

Our Approach:

Instead of attributing to ∇_θ L(z_test), we attribute to ∇_θ f_v^ℓ(x_test), where v is a semantic direction (probe or SAE feature) and f_v^ℓ is its activation at layer ℓ.

This shifts the question from “which data matches this output?” to “which data causes this behavior?”
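To make the swap concrete, here is a minimal numpy sketch of the idea on a toy two-layer model (all names and the model itself are hypothetical, not the paper's code): the test-side vector is the gradient of the concept projection v·h^ℓ(x_test) rather than the loss gradient, dotted against each training example's loss gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W1 = rng.normal(size=(d_h, d_in))   # input -> probed hidden layer
w2 = rng.normal(size=d_h)           # hidden layer -> scalar output

def hidden(x):
    return W1 @ x                   # h^l(x): activations at the probed layer

def pred(x):
    return w2 @ hidden(x)

v = rng.normal(size=d_h)            # semantic direction (probe / SAE feature)
x_test = rng.normal(size=d_in)

# Test-side gradient of the concept f_v(x_test) = v . h(x_test) w.r.t. params.
# Analytically: df/dW1 = v x^T, and df/dw2 = 0 (the concept lives at the
# hidden layer), so only the W1 block contributes to the inner product.
grad_f_W1 = np.outer(v, x_test)

def concept_influence(x, y):
    # Training-side loss gradient for squared error L = (pred(x) - y)^2.
    r = 2.0 * (pred(x) - y)
    grad_L_W1 = r * np.outer(w2, x)
    return float(np.sum(grad_f_W1 * grad_L_W1))  # <grad_theta f_v, grad_theta L>

# Score a handful of (random) training examples by concept influence.
scores = [concept_influence(rng.normal(size=d_in), rng.normal()) for _ in range(5)]
```

Swapping the squared-error gradient for an LLM's token-level loss gradient recovers the usual influence-function setup; only the test-side vector changes.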

Key Results:

– On emergent misalignment: Concept Influence outperforms influence functions across all datasets (Figure 2)

– On OASST1: Using only 5% of data maintains full capability while reducing harm 3× (Figure 5)

– Simple probe methods are 20× faster and work surprisingly well (we prove they’re first-order approximations)

– SAE clustering reveals semantic features driving behaviors (2000× higher influence on relevant concepts, Figure 4)
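The probe direction v used above can come from something as simple as a logistic-regression probe on hidden activations. A minimal sketch with synthetic data (the activations and concept labels here are made up for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: hidden activations H (one row per example) with binary concept
# labels y (e.g. "exhibits the behavior" vs. not). The trained probe weights
# serve as the semantic direction v.
n, d = 200, 8
true_v = rng.normal(size=d)                       # hidden "ground truth" direction
H = rng.normal(size=(n, d))                       # fake hidden activations
y = (H @ true_v + 0.1 * rng.normal(size=n) > 0).astype(float)

v = np.zeros(d)
for _ in range(500):                              # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(H @ v)))            # sigmoid probe output
    v -= 0.1 * (H.T @ (p - y)) / n                # logistic-loss gradient step

acc = float(np.mean((H @ v > 0) == (y > 0.5)))    # probe's training accuracy
```

Fitting a linear probe like this is cheap relative to Hessian-based influence computation, which is consistent with the 20× speedup claimed for the probe-based variant.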

Paper: https://arxiv.org/abs/2602.14869

Blog: https://www.far.ai/news/concept-data-attribution-02-2026

Interested in feedback on applications beyond safety and comparisons with other TDA methods. Happy to answer questions!

submitted by /u/KellinPelrine
