[R] Controlled LLM Training on Spectral Sphere
| |
TL;DR: The paper introduces Spectral Sphere Optimizer, which takes steepest descent under spectral norm (Muon) and forces the weights & updates onto a spectral sphere. Paper: https://www.arxiv.org/pdf/2601.08393 Repo: https://github.com/Unakar/Spectral-Sphere-Optimizer Abstract: Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization ( muP) provides a theoretical safeguard for width-invariant theta(1) activation control, whereas emerging optimizers like Muon are only “half-aligned” with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully muP-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations. Algorithm: Evals: submitted by /u/StartledWatermelon |