[D] A mathematical proof from an anonymous Korean forum: The essence of Attention is fundamentally a d^2 problem, not n^2. (PDF included)
Hello, r/MachineLearning. I am just a regular user from a Korean AI community (“The Singularity Gallery”). I recently came across an anonymous post there with a paper attached. The mathematical proof inside seemed too important to stay buried in a local forum, so I used Gemini to help me write this English post and share it with you all.
The author claims they do not work in the LLM industry, but they dropped a paper titled: “The d^2 Pullback Theorem: Why Attention is a d^2-Dimensional Problem”.
They argue that the field has been fundamentally misunderstanding the intrinsic geometry of Attention. Here is the core of their mathematical proof:
The d^2 Pullback Theorem (The Core Proof):
The author mathematically proves that if you combine the forward pass (n × n) with the backward gradient (n × n), the actual optimization landscape the parameters explore is strictly d^2-dimensional. The n × n bottleneck is merely an illusion caused by the choice of softmax normalization.
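As a shape-level sanity check of this framing (my own illustration, not from the paper): the forward pass does materialize an n × n score matrix, but the trainable projections, and therefore their gradients, each live in only d^2 dimensions, independent of sequence length.

```python
import numpy as np

n, d = 1024, 64                      # sequence length >> head dimension
X = np.random.randn(n, d)
W_Q = np.random.randn(d, d)          # trainable parameter in R^{d x d}
W_K = np.random.randn(d, d)

# forward activation: the apparent n x n bottleneck
scores = (X @ W_Q) @ (X @ W_K).T     # shape (1024, 1024)

# the space any gradient dL/dW can occupy: 2 * d^2 = 8192 numbers,
# no matter how large n grows
param_dims = W_Q.size + W_K.size
```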
- Softmax destroys the Euclidean Matching structure:
Previous O(n) linear attention models failed because removing the exp() in softmax destroyed the contrast needed for matching. Softmax creates that “matching”, but it artificially inflates the rank to n, causing the O(n^2) curse.
- O(nd^3) Squared Attention without the instability:
Because the true optimization geometry is d^2, we can swap softmax with a degree-2 polynomial kernel (x^2) and still explore the exact same optimization landscape. The author introduces CSQ (Centered Shifted-Quadratic) Attention with soft penalties. This retains the Euclidean matching property, stabilizes the training, and drops both training AND inference complexity to O(nd^3).
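To make the complexity claim concrete, here is a minimal NumPy sketch (my own illustration, not the author's CSQ formulation) of attention with a plain degree-2 polynomial kernel k(q, k) = (qᵀk)^2. Expanding the kernel through the feature map φ(x) = vec(xxᵀ) ∈ R^{d^2} lets you reorder the matrix products so the n × n matrix is never formed, giving the O(n·d^3) cost:

```python
import numpy as np

def phi(x):
    # degree-2 feature map: phi(x) = vec(x outer x), dimension d^2,
    # so that phi(q) . phi(k) == (q . k)^2
    return np.einsum('nd,ne->nde', x, x).reshape(x.shape[0], -1)

def quadratic_linear_attention(Q, K, V):
    """Kernelized attention in O(n * d^3): the n x n matrix is never built."""
    fq, fk = phi(Q), phi(K)              # (n, d^2) each
    S = fk.T @ V                         # (d^2, d_v) summary of keys/values
    z = fk.sum(axis=0)                   # (d^2,) normalizer accumulator
    return (fq @ S) / (fq @ z)[:, None]  # (n, d_v)

def quadratic_quadratic_cost_attention(Q, K, V):
    """Reference: the same kernel computed the naive O(n^2 * d) way."""
    A = (Q @ K.T) ** 2                   # explicit n x n score matrix
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V
```

Both functions compute identical outputs; only the order of the matrix products differs, which is exactly the associativity trick the O(nd^3) claim relies on.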
The author wrote: “I’m not in the LLM industry, so I have nowhere to share this. I’m just posting it here hoping it reaches the researchers who can build better architectures.”
I strongly believe this math needs to be verified by the experts here. Could this actually be the theoretical foundation for replacing standard Transformers?
Original PDF: https://drive.google.com/file/d/1IhcjxiiHfRH4_1QIxc7QFxZL3_Jb5dOI/view?usp=sharing
Original Korean Forum Post: https://gall.dcinside.com/mgallery/board/view/?id=thesingularity&no=1016197
submitted by /u/Ok-Preparation-3042