A Universal Load Balancing Principle and Its Application to Large Language Model Serving
arXiv:2601.17855v1 Announce Type: cross Abstract: Load balancing-the allocation of work across parallel resources to reduce delay, energy and cost-is a pervasive challenge in science and engineering, from large-scale simulation and data processing to cloud and manufacturing operations. Motivated by the emerging bottleneck in large language model (LLM) serving, we study a particularly stringent regime of load balancing that arises in barrier-synchronized, stateful systems: work cannot be freely migrated and progress is gated by the slowest participant at each […]