Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
arXiv:2510.11789v2. Abstract: We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size and $\beta$ is the Hölder smoothness of the activation function. Importantly, this rate is independent of the embedding dimension $d$, the number of tokens $N$, and the rank $r$ of the weight matrix, provided […]
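To make the stated rate concrete, the following minimal sketch evaluates the exponent $\frac{2\beta}{2\beta+1}$ and the resulting quantity $M^{-\frac{2\beta}{2\beta+1}}$ for a given smoothness and sample size. The function names `rate_exponent` and `minimax_rate` are hypothetical and chosen only for illustration; the sketch ignores any multiplicative constants and the conditions under which the rate holds, and it is not taken from the paper.

```python
from fractions import Fraction


def rate_exponent(beta):
    """Return 2*beta / (2*beta + 1), the exponent in the stated rate.

    `beta` is the Hölder smoothness of the activation (assumed positive).
    """
    beta = Fraction(beta)
    if beta <= 0:
        raise ValueError("beta must be positive")
    return 2 * beta / (2 * beta + 1)


def minimax_rate(M, beta):
    """Return M ** (-2*beta / (2*beta + 1)), ignoring constants.

    `M` is the sample size. Note that d, N, and r do not appear:
    the abstract states the rate is independent of them.
    """
    if M <= 0:
        raise ValueError("M must be positive")
    return float(M) ** (-float(rate_exponent(beta)))


if __name__ == "__main__":
    for beta in (Fraction(1, 2), 1, 2, 4):
        print(f"beta={beta}: exponent={rate_exponent(beta)}, "
              f"rate at M=1000: {minimax_rate(1000, beta):.4g}")
```

The exponent increases toward 1 as $\beta$ grows, so smoother activations give a rate closer to $M^{-1}$, while the formula itself contains no dependence on $d$, $N$, or $r$.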