[D] KL Divergence is not a distance metric. It’s a measure of inefficiency. (Derivations + Variance Reduction)

I recently decided to stop treating KL Divergence as a “black box” distance metric and actually derive it from first principles to understand why it behaves the way it does in optimization.

I found that the standard intuition (“it measures distance between distributions”) often hides the actual geometry of what’s happening during training. I wrote a deep-dive article about this, but I wanted to share the two biggest “Aha!” moments here directly.

The optimization geometry (forward vs. reverse): The asymmetry of KL is not just a mathematical quirk; it dictates whether your model spreads out or collapses.

– Forward KL (D_KL(P||Q)): This is Zero-Avoiding. The expectation is over the true data P. If P(x) > 0 and your model Q(x) → 0, the penalty explodes.

Result: Your model is forced to stretch and cover every mode of the data (Mean-Seeking). This is why MLE works for classification but can lead to blurry images in generation.

– Reverse KL (D_KL(Q||P)): This is Zero-Forcing. The expectation is over your model Q. If P(x) ≈ 0, your model must put Q(x) ≈ 0 there. But if your model ignores a mode of P entirely? Zero penalty.

Result: Your model latches onto the single easiest mode and ignores the rest (Mode-Seeking). This is the core reason behind “Mode Collapse” in GANs/Variational Inference.
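If you want to see this concretely, here is a minimal numerical sketch (my own toy example, not from the article): fit a single Gaussian Q to a bimodal mixture P by brute-force grid search, once minimizing forward KL and once minimizing reverse KL. The mixture, the grid, and all parameter names are assumptions I made for illustration.

```python
# Toy illustration (not from the article): fit a single Gaussian Q to a
# bimodal mixture P by minimizing forward vs. reverse KL over a small grid.
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 4001)          # integration grid
dx = x[1] - x[0]

# "True" data distribution P: mixture of two well-separated Gaussians
p = 0.5 * norm.pdf(x, -3, 0.7) + 0.5 * norm.pdf(x, 3, 0.7)

def kl(a, b):
    """Numerical KL(a || b) on the grid, skipping points where a ~ 0."""
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

results = []
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.5, 5, 46):
        q = norm.pdf(x, mu, sigma)
        results.append((kl(p, q), kl(q, p), mu, sigma))

fwd_best = min(results, key=lambda r: r[0])   # argmin of D_KL(P || Q)
rev_best = min(results, key=lambda r: r[1])   # argmin of D_KL(Q || P)

print("Forward KL picks mu=%.1f sigma=%.1f (wide, covers both modes)" % (fwd_best[2], fwd_best[3]))
print("Reverse KL picks mu=%.1f sigma=%.1f (narrow, sits on one mode)" % (rev_best[2], rev_best[3]))
```

The forward-KL fit should land near mu ≈ 0 with a broad sigma (moment matching across both modes), while the reverse-KL fit should lock onto one of the two modes with a sigma close to 0.7.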

The Variance Trap & The Fix: If you try to estimate KL via naive Monte Carlo sampling, you’ll often get massive variance.

D_KL(P||Q) ≈ (1/N) Σ_i log( P(x_i) / Q(x_i) ),  with x_i sampled from P

The issue is the ratio P/Q. In the tails where Q underestimates P, this ratio explodes, causing gradient spikes that destabilize training.
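Here is a minimal sketch of the problem with a toy setup of my own (not from the article): P = N(0, 1), Q = N(0, 0.8). Q’s tails are too light, so samples from P that land in the tails make log P(x)/Q(x) spike far above the true KL.

```python
# Naive Monte Carlo KL estimate with a light-tailed Q: the per-sample
# log-ratio terms have a std several times the KL itself, and single
# tail samples dominate the average.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
P, Q = norm(0.0, 1.0), norm(0.0, 0.8)

x = rng.normal(0.0, 1.0, size=100_000)        # x_i ~ P
k1 = P.logpdf(x) - Q.logpdf(x)                # naive per-sample terms log P/Q

true_kl = np.log(0.8 / 1.0) + 1.0 / (2 * 0.8**2) - 0.5   # closed form for Gaussians
print(f"true KL             : {true_kl:.4f}")
print(f"naive estimate      : {k1.mean():.4f}")
print(f"per-sample std      : {k1.std():.4f}")   # several times the KL itself
print(f"largest single term : {k1.max():.2f}")   # one tail sample dwarfs the mean
```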

The Fix (Control Variates): It turns out there is a “natural” control variate hiding in the math. Since E_P[Q/P] = 1, the term (Q/P − 1) has an expected value of 0, so you can fold it into the estimator without changing what it converges to. Adding it to each log P(x)/Q(x) term (with r = Q/P, the per-sample term becomes (r − 1) − log r, which is always ≥ 0) cancels out the first-order Taylor expansion of the noise. It stabilizes the gradients without introducing bias.
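A sketch of the control-variate estimator, reusing the same toy P and Q as above (again, my own choices for illustration):

```python
# Control-variate KL estimator: add the zero-mean term (Q/P - 1) to the
# naive log-ratio. Same expectation, noticeably smaller spread.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
P, Q = norm(0.0, 1.0), norm(0.0, 0.8)

x = rng.normal(0.0, 1.0, size=100_000)        # x_i ~ P
log_ratio = P.logpdf(x) - Q.logpdf(x)         # log P/Q
r = np.exp(-log_ratio)                        # Q/P, with E_P[Q/P] = 1

k_naive = log_ratio                           # naive per-sample terms
k_cv = log_ratio + (r - 1.0)                  # control-variate terms (= (r-1) - log r >= 0)

print(f"naive: mean {k_naive.mean():.4f}, std {k_naive.std():.4f}")
print(f"cv   : mean {k_cv.mean():.4f}, std {k_cv.std():.4f}")   # same mean, smaller std
```

Both estimators converge to the same KL; the control-variate version just gets there with a much tighter per-sample spread, which is exactly what you want for stable gradients.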

If you want to see the full derivation and the concepts in more detail, here is the link: https://medium.com/@nomadic_seeker/kl-divergence-from-first-principle-building-intuition-from-maths-3320a7090e37

I would love to get feedback on it.

