Spectral Edge Dynamics of Training Trajectories: Signal–Noise Geometry Across Scales

arXiv:2603.15678v1 Announce Type: new
Abstract: Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce \emph{Spectral Edge Dynamics} (SED) to measure this structure: rolling-window SVD of parameter updates reveals a sharp boundary — the \emph{spectral edge} — between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular value ratio $\sigma_k/\sigma_{k+1}$. Across a 51M-parameter TinyStories model (4 seeds) and GPT-2 124M under a distribution shift, the spectral edge exhibits a universal three-phase pattern (rise, plateau, collapse), signal rank adjusts with task complexity ($k^* = 2$ at 51M, $k^* = 3$ at 124M), and the directional coupling between spectral geometry and validation loss reverses with window size — a \emph{lag flip} reflecting the timescale of trajectory integration. Johnson–Lindenstrauss projection to $d = 10W$ dimensions (e.g., $d = 100$ for $W = 10$) preserves the spectral gap within 5.7%, making the framework applicable to models of arbitrary size. In companion work, the same spectral geometry provides early-warning signals of grokking — predicting generalization 600–1,700 steps before it occurs across modular arithmetic, Dyck languages, and the SCAN benchmark.
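A minimal sketch of the measurement the abstract describes: JL-project flattened parameter updates to $d = 10W$ dimensions, take the SVD of a rolling window of $W$ updates, and locate the spectral edge at the maximum consecutive singular value ratio. The dense Gaussian projection and the function names (`jl_project`, `spectral_edge`) are assumptions for illustration; the paper may use a different construction.

```python
import numpy as np

def jl_project(updates, d, seed=0):
    """Johnson-Lindenstrauss projection of flattened parameter updates.

    updates: array of shape (T, P) -- T update vectors, P parameters.
    d: target dimension (the abstract suggests d = 10 * W).
    """
    rng = np.random.default_rng(seed)
    # Dense Gaussian projection, scaled so norms are preserved in expectation.
    R = rng.standard_normal((updates.shape[1], d)) / np.sqrt(d)
    return updates @ R

def spectral_edge(window):
    """Locate the spectral edge of a rolling window of updates.

    window: array of shape (W, d) -- W consecutive (projected) updates.
    Returns (k_star, gap): the 1-indexed rank k maximizing sigma_k / sigma_{k+1},
    and that maximum ratio (the spectral gap).
    """
    s = np.linalg.svd(window, compute_uv=False)  # singular values, descending
    ratios = s[:-1] / s[1:]                      # consecutive ratios
    k = int(np.argmax(ratios))
    return k + 1, float(ratios[k])

# Toy usage: W = 10 updates built from 2 coherent directions plus noise,
# so the edge should land at k* = 2.
W, P, d = 10, 5000, 100                          # d = 10 * W, per the abstract
rng = np.random.default_rng(1)
signal = rng.standard_normal((W, 2)) @ rng.standard_normal((2, P)) * 5.0
updates = signal + 0.1 * rng.standard_normal((W, P))
k_star, gap = spectral_edge(jl_project(updates, d))
print(f"spectral edge at k* = {k_star}, gap = {gap:.2f}")
```

On synthetic data like this, the projected window keeps the gap between the second and third singular values essentially intact, consistent with the abstract's claim that the spectral gap survives projection within a few percent.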
