HiMA: Not All Parameters Need Every Step

Standard deep learning training updates every model parameter at every optimization step, with identical learning rates and update frequencies, regardless of each parameter’s contribution to the current loss landscape. We observe that this uniform treatment wastes significant compute and memory on parameters that are temporarily converged or contribute little to learning. We propose HiMA (Hierarchical Memory-Aware Training), a training strategy that partitions parameters into three dynamically managed tiers (Hot, Warm, and Cold) based on gradient magnitude. Hot parameters receive full-precision updates every step, Warm parameters are updated at reduced frequency with scaled learning rates, and Cold parameters are frozen entirely, with their optimizer states evicted from memory. In experiments on GPT-2 Small (124M parameters) trained on WikiText-103, HiMA provides a tunable trade-off between resource efficiency and model quality, saving 31–56% of parameter-update compute; on GPT-2 Medium (345M) it achieves up to a 1.16× wall-clock speedup. On the memory front, HiMA reduces Adam optimizer state memory by up to 72.8% (from 944 MB to 257 MB) and GPU peak memory by up to 32.4% (from 7,945 MB to 5,369 MB) by evicting the first- and second-moment tensors of Cold parameters entirely and excluding them from the backward pass. Critically, at 50% parameter freezing, HiMA maintains stable training (val loss 5.49) where static random freezing degrades catastrophically (val loss 7.90), a gap of 2.41 points, confirming that gradient-based selection is essential. Ablation experiments reveal that the Warm tier’s continuous optimizer momentum serves as a critical bridge between Hot and Cold: removing it degrades loss by 0.60 points, accounting for 75% of the total quality gap.
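The tier partitioning described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the ranking-by-gradient-norm policy and the `hot_frac`/`warm_frac` split points are assumptions chosen for clarity, and real code would operate on tensors rather than scalar norms.

```python
def assign_tiers(grad_norms, hot_frac=0.3, warm_frac=0.3):
    """Rank parameter groups by gradient magnitude and split into tiers.

    grad_norms: dict mapping parameter-group name -> gradient norm.
    hot_frac / warm_frac: illustrative fractions of groups assigned to
    the Hot and Warm tiers; everything else becomes Cold.
    """
    order = sorted(grad_norms, key=grad_norms.get, reverse=True)
    n_hot = max(1, int(len(order) * hot_frac))
    n_warm = int(len(order) * warm_frac)
    tiers = {}
    for i, name in enumerate(order):
        if i < n_hot:
            tiers[name] = "hot"    # full-precision update every step
        elif i < n_hot + n_warm:
            tiers[name] = "warm"   # updated at reduced frequency, scaled LR
        else:
            tiers[name] = "cold"   # frozen; optimizer state evicted
    return tiers

# Hypothetical parameter groups and gradient norms for illustration.
grads = {"wte": 0.9, "attn.0": 0.5, "mlp.0": 0.2, "ln_f": 0.01}
tiers = assign_tiers(grads, hot_frac=0.25, warm_frac=0.5)
```

Because the tiers are recomputed from current gradient magnitudes, the partition is dynamic: a parameter group whose gradients grow again is promoted back toward Hot, which is what distinguishes this scheme from static random freezing.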
Unlike prior offloading approaches that treat memory tiers as passive storage, HiMA uses the memory hierarchy as an active learning signal: a parameter’s position in the hierarchy directly determines its learning intensity, turning hardware constraints into an adaptive training strategy.
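To make the "position in the hierarchy determines learning intensity" idea concrete, here is a hedged sketch of a tier-aware Adam step: Cold parameters have their moment tensors evicted, Warm parameters update only every `warm_every` steps with a scaled learning rate, and Hot parameters update every step. The class name `TieredAdam` and all hyperparameter values are illustrative assumptions, and scalars stand in for tensors.

```python
import math

class TieredAdam:
    """Illustrative Adam variant keyed by parameter name (not the paper's code)."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, warm_scale=0.5):
        self.lr, self.betas, self.eps = lr, betas, eps
        self.warm_scale = warm_scale        # assumed Warm-tier LR scaling
        self.m, self.v, self.t = {}, {}, {}  # first/second moments, step counts

    def step(self, params, grads, tiers, step_idx, warm_every=4):
        b1, b2 = self.betas
        for name, tier in tiers.items():
            if tier == "cold":
                # Evict optimizer state so Cold parameters hold no memory.
                self.m.pop(name, None)
                self.v.pop(name, None)
                continue
            if tier == "warm" and step_idx % warm_every != 0:
                continue  # Warm parameters update at reduced frequency.
            g = grads[name]
            self.m[name] = b1 * self.m.get(name, 0.0) + (1 - b1) * g
            self.v[name] = b2 * self.v.get(name, 0.0) + (1 - b2) * g * g
            t = self.t[name] = self.t.get(name, 0) + 1
            m_hat = self.m[name] / (1 - b1 ** t)   # bias-corrected moments
            v_hat = self.v[name] / (1 - b2 ** t)
            lr = self.lr * (self.warm_scale if tier == "warm" else 1.0)
            params[name] -= lr * m_hat / (math.sqrt(v_hat) + self.eps)
        return params
```

Note that Warm parameters keep their moment tensors between skipped steps; as the ablation above suggests, that continuous momentum is what bridges the Hot and Cold tiers.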
