Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

arXiv:2602.04902v1 Announce Type: new
Abstract: The Mechanistic Interpretability (MI) program has mapped the Transformer as a precise computational graph. We extend this graph with a conservation law and time-varying AC dynamics, viewing it as a physical circuit. We introduce Momentum Attention, a symplectic augmentation embedding physical priors via the kinematic difference operator $p_t = q_t - q_{t-1}$, implementing the symplectic shear $\hat{q}_t = q_t + \gamma p_t$ on queries and keys. We identify a fundamental Symplectic-Filter Duality: the physical shear is mathematically equivalent to a High-Pass Filter. This duality is our cornerstone contribution: by injecting kinematic momentum, we sidestep the topological depth constraint ($L \geq 2$) for induction head formation. While standard architectures require two layers for induction from static positions, our extension grants direct access to velocity, enabling Single-Layer Induction and Spectral Forensics via Bode Plots. We formalize an Orthogonality Theorem proving that DC (semantic) and AC (mechanistic) signals segregate into orthogonal frequency bands when Low-Pass RoPE interacts with High-Pass Momentum. Validated through 5,100+ controlled experiments (documented in Supplementary Appendices A–R and 27 Jupyter notebooks), our 125M Momentum model exceeds expectations on induction-heavy tasks while tracking a 350M baseline to within $\sim$2.9\% validation loss. Dedicated associative recall experiments reveal a scaling law $\gamma^* = 4.17 \times N^{-0.74}$, establishing momentum-depth fungibility. We offer this framework as a complementary analytical toolkit connecting Generative AI, Hamiltonian Physics, and Signal Processing.
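
To make the operator concrete, here is a minimal NumPy sketch (not the paper's code) of the kinematic shear on queries and keys, plus a numerical check of the Symplectic-Filter Duality the abstract names. The zero-momentum boundary at $t = 0$, the sample value of $\gamma$, and all variable names are illustrative assumptions.

```python
import numpy as np

def momentum_shear(x, gamma):
    """Symplectic shear x_hat_t = x_t + gamma * p_t, with p_t = x_t - x_{t-1}.

    x: (seq_len, d) array of queries or keys. The first position has no
    predecessor, so its momentum is set to zero (a boundary assumption of
    this sketch, not a detail stated in the abstract).
    """
    p = np.zeros_like(x)
    p[1:] = x[1:] - x[:-1]        # kinematic difference operator
    return x + gamma * p          # shear applied along the sequence axis

# Duality check: the shear equals the 2-tap FIR filter
#   x_hat_t = (1 + gamma) x_t - gamma x_{t-1},
# whose gain |H(w)| = |(1 + gamma) - gamma exp(-iw)| rises from 1 at DC
# to 1 + 2*gamma at Nyquist, i.e. a high-pass filter.
gamma = 0.5
w = np.linspace(0.0, np.pi, 5)
gain = np.abs((1 + gamma) - gamma * np.exp(-1j * w))
print(gain)  # monotonically increasing -> high-pass

# Sheared queries/keys feed the usual scaled dot-product scores.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
k = rng.normal(size=(8, 16))
scores = momentum_shear(q, gamma) @ momentum_shear(k, gamma).T / np.sqrt(16)

# Reported scaling law gamma* = 4.17 * N^{-0.74}; the abstract does not
# specify N's units here, so the value below is purely illustrative.
print(4.17 * 512 ** -0.74)
```

Read this way, the shear is a causal first-difference filter: each position carries an explicit "velocity" term, which is the abstract's stated mechanism for why a single layer can perform induction that would otherwise require a second layer to compute from static positions.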
