Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
arXiv:2604.15464v1 Announce Type: new Abstract: Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators such as Google’s Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures, particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, we present Ragged Paged Attention (RPA), a high-performance […]