YuriiFormer: A Suite of Nesterov-Accelerated Transformers
arXiv:2601.23236v2

Abstract: We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step on an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie–Trotter splitting between the two energy functionals. This perspective enables principled architectural design using classical optimization […]