Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning

arXiv:2602.17565v1 Announce Type: cross
Abstract: Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions, using the same architecture and training data. Although SD has often been shown empirically to improve generalization, its formal guarantees remain limited. We study SD for ridge regression in an unconstrained setting in which the mixing weight $\xi$ may lie outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher at every regularization level $\lambda > 0$ at which the teacher ridge risk $R(\lambda)$ is nonstationary (i.e., $R'(\lambda) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $\xi^\star(\lambda)$ for any value of $\lambda$ and show that it obeys the sign rule $\operatorname{sign}(\xi^\star(\lambda)) = -\operatorname{sign}(R'(\lambda))$. In particular, $\xi^\star(\lambda)$ can be negative, as is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample size $n$ and feature dimension $p$ both diverge while their aspect ratio $p/n$ converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends the standard second-order ridge deterministic equivalents to their fourth-order analogs via block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method that estimates $\xi^\star$ without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.
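A minimal NumPy sketch of the setup described above, under the assumed mixed-target convention $y_{\mathrm{mix}} = (1-\xi)\,y + \xi\,\hat y_{\mathrm{teacher}}$ with $\xi$ allowed outside $[0,1]$; the grid sweep over $\xi$ is purely illustrative and is not the paper's closed-form $\xi^\star(\lambda)$ or its one-shot tuning method, and all problem sizes and the choice of $\lambda$ are arbitrary.

```python
# Illustrative only: ridge teacher, then a self-distilled student refit on the
# mixed targets y_mix = (1 - xi) * y + xi * y_teacher, with xi unconstrained.
# A held-out set is used here just to visualize that some xi (possibly negative,
# e.g. when lambda is too large) can beat the teacher; the paper instead tunes
# xi in one shot without grid search or sample splitting.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 50, 200.0         # lam deliberately large to mimic an over-regularized teacher
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + 0.5 * rng.standard_normal(n)
X_te = rng.standard_normal((2000, p))
y_te = X_te @ beta                 # noiseless test targets -> squared prediction risk

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

b_teacher = ridge_fit(X, y, lam)
y_teacher = X @ b_teacher          # teacher's in-sample predictions

def sd_risk(xi):
    """Test risk of the student refit (same lam) on the xi-mixture of labels and teacher predictions."""
    y_mix = (1.0 - xi) * y + xi * y_teacher
    b_student = ridge_fit(X, y_mix, lam)
    return np.mean((X_te @ b_student - y_te) ** 2)

xis = np.linspace(-2.0, 2.0, 81)   # unconstrained: xi may leave the unit interval
risks = [sd_risk(xi) for xi in xis]
print(f"teacher risk      : {sd_risk(0.0):.4f}")   # xi = 0 recovers the teacher
print(f"best student risk : {min(risks):.4f} at xi = {xis[int(np.argmin(risks))]:+.2f}")
```

Because the student is refit with the same $\lambda$, its coefficients are a linear reshaping of the teacher's, so sweeping $\xi$ (or, in the paper, solving for $\xi^\star$ in closed form) interpolates or extrapolates the teacher's shrinkage, which is why a negative $\xi$ can compensate for an over-regularized $\lambda$.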
