[R] Teacher-Free Self-Distillation: Fixing the Softmax “Infinite Gap” with Euclidean alignment
Hi everyone,
I recently wrote a blog post describing a fix to a fundamental instability in standard Deep Learning optimization: the “Infinite Gap” problem inherent in the Cross-Entropy loss. I wanted to share the intuition here and get your thoughts.
Geometric Alignment via Teacher-Free Self-Distillation
Standard Softmax with dot-product logits ($z = w \cdot x$) is geometrically flawed because the loss function is asymptotic. To drive the loss to exactly 0, the model must push the logit to infinity. Since $z = \|w\|\|x\|\cos(\theta)$, the optimizer often takes the “lazy” route of exploding the feature norm $\|x\|$ (Radial Explosion) rather than perfecting the alignment.
This mechanism contributes significantly to the training loss spikes seen in LLMs and poor Out-of-Distribution (OOD) detection.
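To make the “lazy route” concrete, here is a minimal NumPy sketch (the weight vectors and feature values are my own illustrative numbers, not from the post): scaling the feature norm alone drives the cross-entropy toward 0 under dot-product logits, whereas a negative-squared-distance logit is capped at 0 no matter how large $\|x\|$ grows.

```python
import numpy as np

w = np.array([[1.0, 0.0], [0.6, 0.8]])   # two hypothetical class weight vectors
x = np.array([0.9, 0.1])                  # feature roughly aligned with class 0

def ce_dot(x, w, label=0):
    """Cross-entropy with dot-product logits z = W @ x."""
    z = w @ x
    z = z - z.max()                       # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

# Scaling the feature norm alone (same angle!) drives the loss toward 0:
losses = [ce_dot(s * x, w) for s in (1, 10, 100)]
assert losses[0] > losses[1] > losses[2]

# With distance logits, the maximum logit is 0 (at zero distance), so
# inflating ||x|| cannot push the true-class logit toward +infinity:
dist_logit = lambda x, c: -np.sum((x - c) ** 2)
assert dist_logit(x, np.array([1.0, 0.0])) <= 0.0
```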
I propose a method called Teacher-Free Self-Distillation (TFSD) that relies on a “Geometric Turn”:
- Metric Regime: Replace the dot product with negative squared Euclidean distance ($z = -\|x - c\|^2$). This naturally bounds the logits (max logit is 0 at zero distance), physically preventing the “infinity” problem.
- Self-Distillation: Instead of using a one-hot target (which still forces infinite separation in standard setups), the model acts as its own teacher:
  - Take the model’s current predicted distances and manually set the distance to the True Class to 0 (the “Zero Anchor”).
  - Keep the distances to all Negative Classes exactly as predicted.
  - Apply Softmax to these constructed logits to form the target distribution, then train via KL Divergence against it.
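The steps above can be sketched in a few lines of NumPy. This is my own reading of the procedure, not the author’s code; the array shapes, the class-center matrix `centers`, and the function names are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tfsd_loss(x, centers, labels):
    """Sketch of the TFSD objective.

    x:       (batch, dim) feature vectors
    centers: (classes, dim) class centers c
    labels:  (batch,) integer true-class indices
    """
    # Metric-regime logits: negative squared Euclidean distance, bounded above by 0.
    diff = x[:, None, :] - centers[None, :, :]          # (batch, classes, dim)
    logits = -np.sum(diff ** 2, axis=-1)                # (batch, classes)

    # Construct the self-distillation target: copy the predicted distances,
    # then set the true-class distance to 0 (the "Zero Anchor").
    target_logits = logits.copy()
    target_logits[np.arange(len(labels)), labels] = 0.0
    target = softmax(target_logits)                     # teacher distribution

    # KL(target || prediction), averaged over the batch.
    pred_log = np.log(softmax(logits) + 1e-12)
    tgt_log = np.log(target + 1e-12)
    return np.mean(np.sum(target * (tgt_log - pred_log), axis=-1))
```

Note that when a feature sits exactly on its class center, the constructed target coincides with the prediction and the KL term vanishes, so a perfectly aligned sample contributes no gradient, which is exactly the bounded behavior the method is after.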
For “easy” samples, the target distribution becomes sharp. For “hard” samples (like synonyms in LLMs), the target distribution stays naturally flat. This prevents the model from “tearing” the manifold to force a binary distinction between semantically similar tokens.
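This sharp-versus-flat behavior falls straight out of the target construction, and can be checked numerically (the distance values below are hypothetical, chosen only to illustrate the two regimes):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical distances from a sample to 3 class centers (true class = 0).
easy = np.array([0.1, 9.0, 8.0])    # negatives are far away
hard = np.array([0.1, 0.3, 0.4])    # negatives are nearby (e.g. synonyms)

def tfsd_target(d, label=0):
    z = -d.copy()
    z[label] = 0.0                  # Zero Anchor on the true class
    return softmax(z)

entropy = lambda p: -np.sum(p * np.log(p + 1e-12))

# The easy sample's target is sharper (lower entropy) than the hard sample's.
assert entropy(tfsd_target(easy)) < entropy(tfsd_target(hard))
```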
It effectively caps the gradients for outliers, which helps prevent the semantic fracturing that occurs during long training runs. It also helps to preserve the “Dark Knowledge” and semantic structure that the model already learned.
Hope you find the method as exciting as I do!
Feedback very welcome!