A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs
arXiv:2601.16979v1 Announce Type: cross
Abstract: Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($\lambda_{\max}^H$), the largest eigenvalue of the loss Hessian, determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze $\textit{critical sharpness}$ ($\lambda_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $\Delta \mathbf{\theta}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($\lambda_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.
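
Illustrative sketch (not from the paper): the abstract does not spell out how $\lambda_c$ is computed, so the code below rests on an assumed reading. It takes $\lambda_c = 2/\eta_c$, where $\eta_c$ is the smallest step size along the update direction $\Delta \mathbf{\theta}$ at which the loss stops decreasing, and finds $\eta_c$ using loss evaluations (forward passes) only. The toy quadratic loss, the helper names (`loss`, `step`, `critical_sharpness`), and the doubling-plus-bisection search are all hypothetical choices for illustration. On a quadratic with gradient $g$ and Hessian $H$, this quantity equals the directional curvature $g^\top H g / g^\top g$, which is bounded above by $\lambda_{\max}^H$.

```python
def loss(theta, curvatures):
    """Toy quadratic loss 0.5 * sum_i h_i * theta_i^2 (diagonal Hessian)."""
    return 0.5 * sum(h * t * t for h, t in zip(curvatures, theta))


def step(theta, delta, eta):
    """Return theta - eta * delta without modifying theta."""
    return [t - eta * d for t, d in zip(theta, delta)]


def critical_sharpness(loss_fn, theta, delta, eta_init=1e-3,
                       max_doublings=60, bisections=40):
    """Assumed definition: 2 / eta_c, where eta_c is the smallest step along
    delta at which loss_fn rises above its value at theta.
    Uses only loss evaluations; no gradients or Hessian-vector products."""
    base = loss_fn(theta)
    lo, hi = 0.0, eta_init
    # Phase 1: double the step until the loss exceeds the starting loss.
    for _ in range(max_doublings):
        if loss_fn(step(theta, delta, hi)) > base:
            break
        lo, hi = hi, 2.0 * hi
    else:
        raise ValueError("loss never increased along delta")
    # Phase 2: bisect the bracket [lo, hi] around the crossing point.
    for _ in range(bisections):
        mid = 0.5 * (lo + hi)
        if loss_fn(step(theta, delta, mid)) > base:
            hi = mid
        else:
            lo = mid
    return 2.0 / hi


# Toy problem: diagonal Hessian with largest eigenvalue 4.0.
curvatures = [4.0, 1.0, 0.25]
theta = [1.0, 1.0, 1.0]
grad = [h * t for h, t in zip(curvatures, theta)]  # gradient as update direction

lam_c = critical_sharpness(lambda th: loss(th, curvatures), theta, grad)

# Closed-form directional curvature g^T H g / g^T g for comparison.
rayleigh = (sum(h * g * g for h, g in zip(curvatures, grad))
            / sum(g * g for g in grad))

print(lam_c, rayleigh, max(curvatures))
```

This sketch bisects far more finely than a budget of fewer than $10$ forward passes would allow. A coarser search trades precision in $\eta_c$ for cost, and the search the authors actually use may differ.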