Sparse Projection Attention: A Computationally Efficient Framework for Long Sequence Modeling

The self-attention mechanism has revolutionized sequence modeling but suffers from quadratic computational complexity with respect to sequence length, limiting its applicability to long sequences. We propose Sparse Projection Attention (SPA), a novel attention variant that leverages learnable sparse projections to reduce the effective dimensionality of queries and keys while maintaining expressive power. Our method is grounded in the Johnson-Lindenstrauss lemma and provides theoretical guarantees on distance preservation. We introduce a comprehensive mathematical framework including error bounds, convergence analysis, and gradient dynamics. Experimental results on language modeling, machine translation, and long-range sequence classification demonstrate that SPA achieves up to 8× computational speedup while maintaining competitive performance compared to standard attention and other sparse variants. The proposed approach offers an effective trade-off between computational efficiency and model expressivity for long-sequence tasks, making transformers more accessible for resource-constrained environments and real-time applications.
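To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of what the abstract describes: projecting queries and keys to a lower dimension before computing attention scores, so the score computation costs O(n²k) with k much smaller than the model dimension d. The module name `ProjectedAttention`, the parameter `proj_dim`, and the dense linear projections used here are illustrative assumptions; SPA itself uses learnable sparse projections with Johnson-Lindenstrauss-style distance-preservation guarantees.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedAttention(nn.Module):
    """Attention with queries/keys projected to a lower dimension (sketch only)."""

    def __init__(self, d_model: int, proj_dim: int):
        super().__init__()
        # Learnable projections mapping queries/keys from d_model down to proj_dim.
        # (SPA would constrain these to be sparse; dense layers are used here for brevity.)
        self.q_proj = nn.Linear(d_model, proj_dim, bias=False)
        self.k_proj = nn.Linear(d_model, proj_dim, bias=False)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, d_model)
        q_low = self.q_proj(q)                               # (batch, seq_len, proj_dim)
        k_low = self.k_proj(k)                               # (batch, seq_len, proj_dim)
        scale = q_low.size(-1) ** 0.5
        scores = q_low @ k_low.transpose(-2, -1) / scale     # (batch, seq_len, seq_len)
        weights = F.softmax(scores, dim=-1)
        return weights @ v                                   # (batch, seq_len, d_model)

# Usage: attention over a 1024-token sequence with a 64-dimensional projection.
attn = ProjectedAttention(d_model=512, proj_dim=64)
x = torch.randn(2, 1024, 512)
out = attn(x, x, x)   # shape: (2, 1024, 512)
```

The speedup in this sketch comes from shrinking the inner dimension of the score matmul; the Johnson-Lindenstrauss lemma motivates why such a projection can approximately preserve query-key distances, which is the theoretical grounding the abstract refers to.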
