Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
arXiv:2601.19910v1 Announce Type: new
Abstract: KV cache offloading enables long-context LLM inference by storing caches in CPU DRAM, but PCIe bandwidth limitations create severe bottlenecks. In this paper, we develop an analytical framework that derives $\kappa_{\text{crit}}$, the critical cached-to-prefill token ratio beyond which execution becomes memory-bound, and show that typical workloads exceed this threshold by orders of magnitude. Empirical characterization reveals that 99% of latency is spent on transfers and that GPUs serving offloaded requests consume only 28% of their rated TDP, motivating our proposed optimizations for hardware interconnects, model architectures, and scheduling algorithms.
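To make the idea of a critical cached-to-prefill ratio concrete, here is a rough back-of-envelope sketch (not the paper's actual framework): it estimates the ratio at which PCIe transfer of offloaded KV cache takes as long as the prefill compute on the GPU. All parameter values and function names are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope sketch: estimate the cached-to-prefill token ratio at which
# moving KV cache over PCIe takes as long as the GPU prefill compute, i.e. the
# point where execution becomes transfer-bound. Hardware/model numbers below
# are assumptions for illustration only.

def kappa_crit(
    kv_bytes_per_token: float,       # bytes of KV cache per cached token (assumed)
    flops_per_prefill_token: float,  # FLOPs to prefill one new token (assumed)
    pcie_bw_bytes_per_s: float,      # effective PCIe bandwidth (assumed)
    gpu_flops_per_s: float,          # sustained GPU throughput (assumed)
) -> float:
    """Ratio n_cached / n_prefill where transfer time equals compute time.

    Transfer time ~ n_cached  * kv_bytes_per_token      / pcie_bw
    Compute  time ~ n_prefill * flops_per_prefill_token / gpu_flops
    Setting the two equal and solving for n_cached / n_prefill gives kappa_crit.
    """
    time_per_cached_token = kv_bytes_per_token / pcie_bw_bytes_per_s
    time_per_prefill_token = flops_per_prefill_token / gpu_flops_per_s
    return time_per_prefill_token / time_per_cached_token


if __name__ == "__main__":
    # Illustrative numbers: roughly a 7B-parameter model in FP16 on a
    # PCIe Gen4 x16 link and a datacenter-class GPU.
    ratio = kappa_crit(
        kv_bytes_per_token=0.5e6,      # ~0.5 MB of KV per token (assumed)
        flops_per_prefill_token=14e9,  # ~2 * 7e9 params FLOPs per token (assumed)
        pcie_bw_bytes_per_s=25e9,      # ~25 GB/s effective PCIe Gen4 x16 (assumed)
        gpu_flops_per_s=150e12,        # ~150 TFLOP/s sustained (assumed)
    )
    print(f"kappa_crit ~ {ratio:.1f} cached tokens per prefill token")
```

Under these assumed numbers the crossover sits at a single-digit cached-to-prefill ratio, so a long cached context paired with a short new prompt would sit far above it, consistent with the abstract's claim that typical workloads exceed the threshold by orders of magnitude.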