Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service
arXiv:2603.08739v1

Abstract: The memory-for-computation paradigm of KV caching is essential for accelerating large language model (LLM) inference serving, but limited GPU high-bandwidth memory (HBM) capacity motivates offloading the KV cache to cheaper external storage tiers. While this expands capacity, it introduces the challenge of dynamically managing heterogeneous storage resources to balance cost, throughput, and latency under varying workloads. We formulate this as a multi-objective optimization problem: identifying the Pareto frontier across these metrics within the […]
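
The multi-objective formulation can be made concrete with a minimal sketch: given candidate tiered-storage configurations scored on cost, throughput, and latency, the Pareto frontier is simply the non-dominated subset. The configuration names, objective units, and numbers below are illustrative assumptions, not values from the paper; in practice each placement would be profiled under the observed workload.

from dataclasses import dataclass

@dataclass(frozen=True)
class TierConfig:
    name: str          # hypothetical label for a storage-tier assignment
    cost: float        # cost of the tier mix (minimize)
    throughput: float  # tokens/s served (maximize)
    latency: float     # p99 latency in ms (minimize)

def dominates(a: TierConfig, b: TierConfig) -> bool:
    """True if `a` is at least as good as `b` on every objective and
    strictly better on at least one."""
    no_worse = (a.cost <= b.cost and a.latency <= b.latency
                and a.throughput >= b.throughput)
    strictly_better = (a.cost < b.cost or a.latency < b.latency
                       or a.throughput > b.throughput)
    return no_worse and strictly_better

def pareto_frontier(configs: list[TierConfig]) -> list[TierConfig]:
    """Return the non-dominated subset of candidate configurations."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]

if __name__ == "__main__":
    # Illustrative numbers only (assumed, not from the paper).
    candidates = [
        TierConfig("all-HBM",      cost=9.0, throughput=5200, latency=18),
        TierConfig("HBM+DRAM",     cost=5.5, throughput=4600, latency=25),
        TierConfig("HBM+DRAM+SSD", cost=3.2, throughput=3900, latency=41),
        TierConfig("all-SSD",      cost=1.1, throughput=1200, latency=140),
        TierConfig("DRAM-heavy",   cost=5.8, throughput=4400, latency=30),  # dominated by HBM+DRAM
    ]
    for cfg in pareto_frontier(candidates):
        print(cfg)

An adaptive controller in the spirit of the abstract would re-run such a frontier computation as the workload shifts and pick a point on it according to the operator's cost/latency preference.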