[R] Predicting Edge Importance in GPT-2’s Induction Circuit from Weights Alone (ρ=0.623, 125x speedup)

TL;DR: Two structural properties of virtual weight matrices, spectral concentration and downstream path weight, predict which edges in GPT-2 small’s induction circuit are causally important, without any forward passes, ablations, or training data. Spearman ρ=0.623 with path patching ground truth (p < 10⁻⁷), at 125x speedup. Weight magnitude achieves ρ=0.070. Gradient attribution achieves ρ=−0.262. Two other properties I tested failed to transfer to the residual stream architecture. I report what worked and what didn’t.

– The question –

Can you predict which edges in a transformer circuit matter before you do any causal interventions?

Current methods for measuring edge importance — path patching, activation patching, ablation studies — all require running the model. You perturb something, observe the effect, repeat. The cost scales linearly with the number of edges (one intervention per edge), and it gets expensive fast for large models and dense circuits.

I’ve been developing a scoring method (the “Cheap Anchor” score) that predicts edge importance from weight structure alone. It started in a very different domain (algebraic number theory — I’ll spare you the details, but the short version is that I was studying which local constraints determine global factorization outcomes in non-unique factorization rings, and the structural properties that predicted importance there turned out to generalize). The method worked well on feedforward networks (ρ=0.836–0.931 across scales from 80 to 3,120 edges). This post is about what happened when I tested it on a real transformer.

– Setup –

Model: GPT-2 small (124M parameters, 12 layers, 12 heads).

Target circuit: The induction circuit (Olsson et al., 2022). I chose this because it’s the cleanest, best-characterized circuit in GPT-2. If the method can’t work here, it can’t work anywhere. If it works here, it might or might not generalize to messier circuits — that’s an open question I’m not claiming to have answered.

Edge definition: Following Elhage et al. (2021), I define edges as component-to-component connections mediated by virtual weight matrices. For attention head A writing to the residual stream and attention head B reading from it, the virtual weight is W_O[A] @ W_Q[B] (QK path), W_O[A] @ W_K[B] (KV path), or W_O[A] @ W_V[B] (OV path). These are d_head × d_head matrices (64×64 for GPT-2 small). I identified the induction subgraph (26 attention heads + 10 MLPs = 640 edges across QK, KV, and OV paths) and scored all edges within it.
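Concretely, here’s a minimal sketch of pulling one such virtual weight straight from the weight tensors with TransformerLens (shapes follow TransformerLens conventions; the `virtual_weight` helper and the specific head indices are illustrative, not my actual code):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small: 12 layers, 12 heads, d_head=64

def virtual_weight(src_layer, src_head, dst_layer, dst_head, path="QK"):
    """Virtual weight from head (src_layer, src_head)'s output into head (dst_layer, dst_head)'s input."""
    w_o = model.W_O[src_layer, src_head]                        # [d_head, d_model]
    w_in = {"QK": model.W_Q, "KV": model.W_K, "OV": model.W_V}  # reading-side weights: [layer, head, d_model, d_head]
    return w_o @ w_in[path][dst_layer, dst_head]                # [d_head, d_head] = 64 x 64

vw = virtual_weight(5, 5, 6, 9, path="QK")  # head indices here are just for illustration
print(vw.shape)  # torch.Size([64, 64])
```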

Ground truth: Path patching (Goldowsky-Dill et al., 2023) on 50 repeated-pattern sequences of length 64. 63 of 640 edges had nonzero importance.

Hardware: Everything ran on a stock Windows PC with an RTX 4060 Ti. No cluster access.

– The method: what I scored and why –

I tested four structural properties of virtual weight matrices. The core idea is that an edge matters if it’s selective (transmits specific information rather than noise) and well-positioned (its target has downstream influence on the output). Two of these properties worked. Two didn’t. I’ll describe all four.

Property 1 — Discrimination (WORKED, ρ=0.467): Does this edge transmit information through a few specific channels, or spread it uniformly? Operationalized as 1 minus the normalized spectral entropy of the virtual weight matrix’s singular values. If the singular value spectrum is sharply peaked (one or two dominant singular values), the edge is highly discriminating — information flows through a narrow bottleneck, which typically means it’s doing something specific. If the spectrum is flat, the edge is passing everything through equally, which usually means it’s not doing much of anything interesting.
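In code, the idea is roughly this (a simplified sketch, not my exact implementation; whether the entropy is taken over the raw or squared singular values is a normalization detail I’m glossing over here):

```python
import numpy as np

def discrimination(vw: np.ndarray, eps: float = 1e-12) -> float:
    """1 minus the normalized spectral entropy of the virtual weight's singular values."""
    s = np.linalg.svd(vw, compute_uv=False)
    p = s / (s.sum() + eps)                 # treat the singular values as a distribution
    entropy = -(p * np.log(p + eps)).sum()
    max_entropy = np.log(len(s))            # entropy of a perfectly flat spectrum
    return 1.0 - entropy / max_entropy      # ~1 = sharply peaked spectrum, ~0 = flat
```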

Property 2 — Cascade Depth (WORKED, ρ=0.593): How much downstream influence does this edge’s target have? Computed as the exponentially-decayed sum of path weights from the target component to the model output, through all downstream virtual weight matrices. Early-layer heads feeding into late-layer heads with strong output projections get high cascade depth. Late-layer heads with weak connections to unembedding get low cascade depth. This is purely a graph-theoretic property of the component DAG — no activations involved.
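In sketch form (the decay constant, the use of the spectral norm as the per-hop path weight, and the `downstream` interface are placeholders for illustration, not the exact implementation):

```python
import numpy as np
from functools import lru_cache

DECAY = 0.5          # assumed decay constant per hop
OUTPUT = "unembed"   # sentinel for the model output node

def cascade_depth(component, downstream) -> float:
    """Exponentially decayed sum of path weights from `component` to the output over the component DAG.

    `downstream(node)` is assumed to yield (child, virtual_weight_matrix) pairs for every
    later component that `node` writes into.
    """
    @lru_cache(maxsize=None)
    def rec(node):
        if node == OUTPUT:
            return 1.0
        total = 0.0
        for child, vw in downstream(node):
            hop = np.linalg.norm(vw, ord=2)   # per-hop path weight (here: top singular value)
            total += DECAY * hop * rec(child)
        return total

    return rec(component)
```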

Property 3 — Non-Redundancy (FAILED, ρ=−0.105): Is this edge’s information unique, or could another edge to the same target substitute for it? Measured via pairwise principal angle comparison between each edge’s top singular vectors and its most similar neighbor. Worked in feedforward networks. Failed completely in transformers. I’ll explain why below.

Property 4 — Signal Flow (FAILED, ρ=0.070): Does this edge have enough throughput to influence its target? Measured as the log-normalized Frobenius norm. Also failed. Essentially redundant with discrimination in this context — the spectral structure already captures throughput.

The composite score (Cheap Anchor V2): Discrimination × Cascade Depth. That’s it. Two numbers multiplied together, both derived from SVD of weight matrices and graph structure.
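Putting the two together and checking against ground truth is just an elementwise product and a rank correlation. The arrays below are placeholders standing in for the real per-edge scores:

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholders for the 640 edges in the induction subgraph; in the real pipeline `disc` and
# `cascade` come from the two scorers above and `pp_importance` from path patching.
rng = np.random.default_rng(0)
disc, cascade, pp_importance = rng.random(640), rng.random(640), rng.random(640)

cheap_anchor_v2 = disc * cascade              # the composite: Discrimination x Cascade Depth
rho, p = spearmanr(cheap_anchor_v2, pp_importance)
print(f"Spearman rho={rho:.3f}, p={p:.2e}")   # with the real scores, this is where 0.623 comes from
```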

– Results –

| Method | Spearman ρ | p-value | Time (s) | Speedup vs. path patching |
|---|---|---|---|---|
| Cheap Anchor V2 (Disc × Cascade) | 0.623 | 5.01 × 10⁻⁸ | 2.0 | 125× |
| Cascade Depth only | 0.593 | 3.04 × 10⁻⁷ | — | — |
| Disc × NR × Cascade (3-factor) | 0.598 | 2.30 × 10⁻⁷ | — | — |
| Discrimination only | 0.467 | 1.15 × 10⁻⁴ | — | — |
| Weight Magnitude (Frobenius) | 0.070 | 0.586 | ~0 | — |
| Gradient Attribution | −0.262 | 0.038 | 0.5 | — |
| Random | −0.083 | 0.517 | — | — |

The composition outperforms either property individually (0.623 > 0.593, 0.467), confirming that you need both selectivity and downstream reach. Neither alone is sufficient.

Weight magnitude is essentially random on this task (ρ=0.070, p=0.586). This isn’t surprising — the Frobenius norm of a virtual weight matrix tells you how much total information could flow, but not whether the information that does flow is functionally relevant.

Gradient attribution is negatively correlated (ρ=−0.262). This is likely a softmax saturation effect: when an induction head is working correctly, it places ~99% attention probability on a single token. The softmax is saturated, so the local gradient is near zero, and gradient-based methods conclude the head doesn’t matter. Cheap Anchor doesn’t look at gradients — it looks at the spectral structure of the weight matrix, which reflects what the head can do, not what the gradient says about the current operating point.
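To make the saturation point concrete: the derivative of a softmax output with respect to its own logit is p(1 − p), which collapses as p approaches 1:

```python
# Local gradient factor p * (1 - p) for a softmax output p (diagonal of the softmax Jacobian).
for p in (0.5, 0.9, 0.99, 0.999):
    print(f"attention prob {p}: gradient factor {p * (1 - p):.4g}")
# 0.25, 0.09, 0.0099, 0.000999: a head that's confidently doing its job looks unimportant to gradients
```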

– Why non-redundancy failed (and what this tells us) –

This is the part I think is most informative for the interpretability community.

Non-redundancy measures whether an edge’s information channel is unique or could be substituted by another edge to the same target. In feedforward networks, this is well-defined: each edge is a direct point-to-point connection, and you can compare the subspaces spanned by different edges’ weight vectors.

In transformers, every component reads from and writes to a shared residual stream. The concept of “redundant information channels” that applies to dedicated connections doesn’t straightforwardly apply to broadcast communication through a shared medium. When I measured subspace similarity between virtual weight matrices for different edges targeting the same component, the metric was essentially noise — it couldn’t distinguish functionally redundant from functionally independent edges.

This suggests that the structural properties underlying edge importance have a two-tier structure: architecture-general properties (discrimination and cascade depth work on both direct connections and residual-stream-mediated connections) and architecture-specific properties (non-redundancy needs to be re-derived for each communication topology). I think non-redundancy can be recovered for transformers, likely by measuring similarity along sparse feature directions rather than raw principal angles, but that’s future work — not something I’m claiming to have solved.

– Limitations (please read these) –

I want to be explicit about what this result does and does not show.

What it shows: Two structural properties of virtual weight matrices, computable from weights alone in 2 seconds, predict 39% of the variance (ρ²≈0.39) in causal edge importance within a known circuit.

What it does NOT show:

This is not circuit discovery. I identified the induction heads first (from attention patterns), then scored edges within that known subgraph. The stronger claim — that high-scoring edges under Cheap Anchor cluster around known circuits when you score all edges in the model — has not been tested yet. That experiment is next.

Induction heads are the easiest case. They’re clean, well-structured, and have been studied extensively. Messier circuits (factual recall, reasoning, refusal) involve distributed computation where edge-level analysis may be less informative. Success here is necessary but not sufficient.

The correlation is moderate, not spectacular. ρ=0.623 reliably identifies the most and least important edges, but the middle of the ranking is noisy. This is useful for prioritizing which edges to investigate or for coarse pruning, but it’s not a replacement for path patching when you need precise importance scores.

Virtual weight matrices are a lossy abstraction. They ignore nonlinearities (attention softmax, LayerNorm, MLP activations) between components. The structural analysis captures what the linear pathway could transmit but not what the full nonlinear computation does transmit. The 39% captured variance likely represents the linear-algebraic component of edge importance, with the remaining 61% depending on activation-dependent factors.

Single model, single circuit. Replication on other models and circuits is needed before making general claims.

– What I think this means –

The fact that spectral concentration of virtual weight matrices predicts causal importance at all is, I think, a nontrivial observation. It suggests that the functional role of transformer components is partially encoded in their weight structure in a way that’s accessible without running the model. The weight matrices aren’t just arbitrary parameterizations that happen to produce the right input-output mapping — they carry structural signatures of their function.

The 125x speedup matters because it changes what’s computationally feasible. Path patching every edge in GPT-2 small’s induction circuit took ~250 seconds. Cheap Anchor took 2 seconds. For larger models and denser circuits, this gap widens. Even if the method only serves as a pre-filter — score all edges cheaply, then path-patch only the top 5% — that’s a meaningful reduction in compute for circuit analysis.

– Next steps –

Global percentile test: Score every edge in GPT-2 small (~21,750 edges) and check whether the 63 ground-truth induction edges cluster in the top percentiles. This is the circuit discovery test.

Scale to GPT-2 medium/large: The speedup advantage grows with model size. Demonstrating maintained correlation at larger scales would establish practical utility.

Test on other circuits: Indirect object identification, factual recall. Messier circuits are the real test.

– Reproducing this –

Full paper and reproducible code are available; shoot me an email at [Davensg@gmail.com](mailto:Davensg@gmail.com). I am working on getting the GitHub repo up and running as we speak!

All experiments run on a single consumer GPU (RTX 4060 Ti, 8GB VRAM). No API access, no cluster compute. If you have TransformerLens installed, you can reproduce the core result in under 5 minutes.

I’m an independent researcher (day job: paramedic). I don’t have institutional affiliations or advisors in ML. If you see methodological problems with this work, I genuinely want to hear about them — that’s why I’m posting here rather than just putting the paper on arXiv and hoping for the best. The method either works or it doesn’t, and I’d rather find out from people who know transformers better than I do.
