[R] Why doubly stochastic matrix idea (using Sinkhorn-Knopp algorithm) only made popular in the DeepSeek’s mHC paper, but not in earlier RNN papers?

After DeepSeek’s mHC paper, the Sinkhorn–Knopp algorithm has attracted a lot of attention because it turns $$mathcal{H}^{mathrm{res}}_{l}$$ at each layer into a doubly stochastic matrix. As a result, the layerwise product remains doubly stochastic, and since the L_2 (spectral) norm of a doubly stochastic matrix is 1, this helps prevent vanishing or exploding gradients.

This makes me wonder why such an apparently straightforward idea wasn’t discussed more during the era of recurrent neural networks, where training dynamics also involve products of many matrices.

submitted by /u/Delicious_Screen_789
[link] [comments]

Liked Liked