Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning
arXiv:2603.01514v1 Announce Type: cross
Abstract: We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression, and show that a simple first-order optimization algorithm can converge to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit, the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection […]
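
To make the setup concrete, here is a minimal sketch in JAX of a single softmax self-attention layer trained by preconditioned gradient descent on in-context linear regression prompts. This is an illustrative reconstruction, not the paper's construction: the token layout [x_i; y_i], the value read-out vector v, the hyperparameters, and the diagonal RMS-style preconditioner (standing in for whatever preconditioning the paper uses) are all assumptions.

```python
# Sketch: one softmax self-attention layer trained with preconditioned
# gradient descent on in-context linear regression. All design choices
# below are illustrative assumptions, not taken from the paper.
import jax
import jax.numpy as jnp

d, n_ctx = 3, 20          # input dimension, number of in-context examples
key = jax.random.PRNGKey(0)

def sample_task(key):
    """One regression prompt: context pairs (x_i, y_i = w^T x_i) and a query."""
    kw, kx, kq = jax.random.split(key, 3)
    w = jax.random.normal(kw, (d,))
    X = jax.random.normal(kx, (n_ctx, d))
    x_q = jax.random.normal(kq, (d,))
    return X, X @ w, x_q, x_q @ w

def attention_predict(params, X, y, x_q):
    """Single softmax attention layer over tokens [x_i; y_i].

    The query token is [x_q; 0]; the prediction is a softmax-weighted
    combination of the context tokens, read out through the value vector v.
    """
    W, v = params["W"], params["v"]
    E = jnp.concatenate([X, y[:, None]], axis=1)   # (n_ctx, d+1) context tokens
    e_q = jnp.concatenate([x_q, jnp.zeros(1)])     # (d+1,) query token
    attn = jax.nn.softmax(E @ (W @ e_q))           # softmax attention weights
    return attn @ (E @ v)                          # scalar prediction

def loss(params, key):
    """Mean squared error over a fresh batch of random regression tasks."""
    keys = jax.random.split(key, 256)
    X, y, x_q, y_q = jax.vmap(sample_task)(keys)
    preds = jax.vmap(attention_predict, in_axes=(None, 0, 0, 0))(params, X, y, x_q)
    return jnp.mean((preds - y_q) ** 2)

k1, k2, key = jax.random.split(key, 3)
params = {"W": 0.1 * jax.random.normal(k1, (d + 1, d + 1)),
          "v": 0.1 * jax.random.normal(k2, (d + 1,))}

# Preconditioned gradient descent: a diagonal RMS-style preconditioner is
# used here as a stand-in for the paper's preconditioning scheme.
lr, eps = 0.05, 1e-8
grad_fn = jax.jit(jax.grad(loss))
second_moment = jax.tree_util.tree_map(jnp.zeros_like, params)

for step in range(500):
    key, sub = jax.random.split(key)
    g = grad_fn(params, sub)
    second_moment = jax.tree_util.tree_map(
        lambda m, gi: 0.9 * m + 0.1 * gi ** 2, second_moment, g)
    params = jax.tree_util.tree_map(
        lambda p, gi, m: p - lr * gi / jnp.sqrt(m + eps),
        params, g, second_moment)
    if step % 100 == 0:
        key, sub = jax.random.split(key)
        print(f"step {step:4d}  loss {loss(params, sub):.4f}")
```

Under these assumptions, the sketch mirrors the abstract's setting only loosely; in particular, the geometric convergence rate the paper proves applies to its specific preconditioned algorithm, not necessarily to the RMS-style update shown here.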