Spectral Condition for $\mu$P under Width-Depth Scaling
arXiv:2603.00541v1 Announce Type: cross
Abstract: Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width-depth scaling. Considering residual networks of varying block depths, we first introduce a spectral $\mu$P condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate $\mu$P formulations as special cases. Building on this condition, we then derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations. This approach not only recovers existing $\mu$P formulations (e.g., for SGD and AdamW) but also naturally extends to a wider range of optimizers. Finally, experiments on GPT-2 style language models demonstrate that the proposed spectral $\mu$P condition preserves stable feature learning and enables robust HP transfer under width-depth scaling.
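The abstract describes a condition on the spectral norms of weights and their per-step updates, but gives no formulas. The sketch below illustrates the general idea in the spirit of width-only spectral $\mu$P (target spectral norm $\|W\|_2 \approx \sqrt{n_{\mathrm{out}}/n_{\mathrm{in}}}$), with a hypothetical $1/\mathrm{depth}$ damping for residual-branch updates; the functions `spectral_mup_scales` and `init_to_spectral_norm` and the exact depth exponent are illustrative assumptions, not the paper's actual condition.

```python
import numpy as np

def spectral_mup_scales(n_in, n_out, depth):
    """Illustrative target spectral norms under a width-depth spectral
    condition.  ASSUMPTIONS (not given in the abstract):
      - weight spectral norm ~ sqrt(n_out / n_in)  (width part, as in
        width-only spectral muP)
      - per-step update spectral norm additionally damped by 1/depth
        for residual blocks (hypothetical depth exponent)
    """
    weight_scale = np.sqrt(n_out / n_in)
    update_scale = weight_scale / depth
    return weight_scale, update_scale

def init_to_spectral_norm(rng, n_in, n_out, target):
    """Draw a Gaussian matrix and rescale it so that its largest
    singular value equals the target spectral norm."""
    W = rng.standard_normal((n_out, n_in))
    sigma_max = np.linalg.norm(W, ord=2)  # largest singular value
    return W * (target / sigma_max)

# Example: a residual block mapping width 256 -> 1024 in a depth-8 stack.
rng = np.random.default_rng(0)
w_scale, u_scale = spectral_mup_scales(256, 1024, depth=8)
W = init_to_spectral_norm(rng, 256, 1024, w_scale)
```

Under this sketch, doubling the width ratio or the depth changes only the two scalar targets, which is what lets a single HP sweep at small scale transfer to larger models.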