[D] A mathematical proof from an anonymous Korean forum: The essence of Attention is fundamentally a d^2 problem, not n^2. (PDF included)
Hello, r/MachineLearning. I am just a regular user from a Korean AI community (“The Singularity Gallery”). I recently came across an anonymous post there with a paper attached. The mathematical proof inside seemed too important to stay buried in a local forum, so I used Gemini to help me write this English post and share it with you all.
The author claims they do not work in the LLM industry, but they dropped a paper titled: “The d^2 Pullback Theorem: Why Attention is a d^2-Dimensional Problem”.
They argue that the field has been fundamentally misunderstanding the intrinsic geometry of Attention. Here is the core of their mathematical proof:
The d^2 Pullback Theorem (The Core Proof):
The author mathematically proves that if you combine the forward pass (n × n) with the backward gradient (n × n), the actual optimization landscape the parameters explore is strictly d^2-dimensional. The n × n bottleneck is merely an illusion caused by the choice of softmax normalization.
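As a shape-level sanity check of this framing (my own illustration, not from the paper): the forward pass does materialize an n × n score matrix, but the trainable projections, and therefore their gradients, each live in only d^2 dimensions, independent of sequence length.

```python
import numpy as np

n, d = 1024, 64                      # sequence length >> head dimension
X = np.random.randn(n, d)
W_Q = np.random.randn(d, d)          # trainable parameter in R^{d x d}
W_K = np.random.randn(d, d)

# forward activation: the apparent n x n bottleneck
scores = (X @ W_Q) @ (X @ W_K).T     # shape (1024, 1024)

# the space any gradient dL/dW can occupy: 2 * d^2 = 8192 numbers,
# no matter how large n grows
param_dims = W_Q.size + W_K.size
```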
- Softmax destroys the Euclidean Matching structure:
Previous O(n) linear attention models failed because removing the exp() in softmax destroyed the contrast needed for matching. Softmax creates that “matching”, but it artificially inflates the rank to n, causing the O(n^2) curse.
- O(nd^3) Squared Attention without the instability:
Because the true optimization geometry is d^2, we can swap softmax with a degree-2 polynomial kernel (x^2) and still explore the exact same optimization landscape. The author introduces CSQ (Centered Shifted-Quadratic) Attention with soft penalties. This retains the Euclidean matching property, stabilizes the training, and drops both training AND inference complexity to O(nd^3).
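To make the complexity claim concrete, here is a minimal NumPy sketch (my own illustration, not the author's CSQ formulation) of attention with a plain degree-2 polynomial kernel k(q, k) = (qᵀk)^2. Expanding the kernel through the feature map φ(x) = vec(xxᵀ) ∈ R^{d^2} lets you reorder the matrix products so the n × n matrix is never formed, giving the O(n·d^3) cost:

```python
import numpy as np

def phi(x):
    # degree-2 feature map: phi(x) = vec(x outer x), dimension d^2,
    # so that phi(q) . phi(k) == (q . k)^2
    return np.einsum('nd,ne->nde', x, x).reshape(x.shape[0], -1)

def quadratic_linear_attention(Q, K, V):
    """Kernelized attention in O(n * d^3): the n x n matrix is never built."""
    fq, fk = phi(Q), phi(K)              # (n, d^2) each
    S = fk.T @ V                         # (d^2, d_v) summary of keys/values
    z = fk.sum(axis=0)                   # (d^2,) normalizer accumulator
    return (fq @ S) / (fq @ z)[:, None]  # (n, d_v)

def quadratic_quadratic_cost_attention(Q, K, V):
    """Reference: the same kernel computed the naive O(n^2 * d) way."""
    A = (Q @ K.T) ** 2                   # explicit n x n score matrix
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V
```

Both functions compute identical outputs; only the order of the matrix products differs, which is exactly the associativity trick the O(nd^3) claim relies on.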
The author wrote: “I’m not in the LLM industry, so I have nowhere to share this. I’m just posting it here hoping it reaches the researchers who can build better architectures.”
I strongly believe this math needs to be verified by the experts here. Could this actually be the theoretical foundation for replacing standard Transformers?
Original PDF: https://drive.google.com/file/d/1IhcjxiiHfRH4_1QIxc7QFxZL3_Jb5dOI/view?usp=sharing
Original Korean Forum Post: https://gall.dcinside.com/mgallery/board/view/?id=thesingularity&no=1016197
submitted by /u/Ok-Preparation-3042