LagTuner: Adaptive Staleness Orchestration for Parameter-Server AI Training
This paper introduces a server-driven orchestration layer for parameter-server training that adaptively bounds iteration skew among workers at runtime, using recent push-timestamp telemetry to minimize straggler waiting while preserving convergence in distributed AI systems. The mechanism selects per-iteration, per-worker staleness allowances within a configurable band, turning gradient exchange into a feedback-controlled service that balances throughput and consistency on both homogeneous and heterogeneous GPU clusters. A formal analysis establishes convergence guarantees comparable to bounded-staleness methods via an O(√T) regret bound, aligning systems-level control with algorithmic stability for large-scale training. A reference implementation in MXNet integrates worker and server procedures with a synchronization controller that simulates near-term iteration timelines, granting extra steps to the currently fastest worker only when doing so minimizes projected wait time. Empirically, on CIFAR-10/100 with AlexNet and ResNet variants across multi-GPU, multi-node deployments, the approach accelerates time-to-accuracy relative to bulk-synchronous and fixed-staleness baselines, while matching the agility of asynchronous execution without its instability risks. The results position adaptive staleness control as a practical distributed-systems primitive: coordinating parameter exchange through runtime telemetry to sustain high iteration throughput with robust convergence in production AI training pipelines.
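As a rough illustration of the controller described above, the following Python sketch (not the paper's MXNet implementation; names such as StalenessController, record_push, and plan_iteration are hypothetical) shows how recent push-timestamp telemetry might drive per-worker staleness grants within a configurable band, awarding extra steps to the current fastest worker only when a simulated near-term timeline predicts lower straggler wait:

```python
# A minimal sketch, assuming the controller sees per-worker iteration durations
# derived from push timestamps and must choose staleness grants in [s_min, s_max].
# All class and method names here are illustrative, not from the paper.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StalenessController:
    s_min: int = 0            # lower bound of allowed iteration skew
    s_max: int = 4            # upper bound of the configurable staleness band
    history: Dict[str, List[float]] = field(default_factory=dict)

    def record_push(self, worker: str, iter_duration: float) -> None:
        """Store the latest per-iteration duration observed from push timestamps."""
        self.history.setdefault(worker, []).append(iter_duration)

    def _avg_duration(self, worker: str) -> float:
        durations = self.history.get(worker, [1.0])[-5:]
        return sum(durations) / len(durations)   # short moving average

    def plan_iteration(self, progress: Dict[str, int]) -> Dict[str, int]:
        """Return extra steps granted to each worker for the next round.

        The fastest worker may run ahead (up to s_max) only if the simulated
        near-term timeline says this lowers the projected wait of the slowest
        worker at the next synchronization point.
        """
        speeds = {w: self._avg_duration(w) for w in progress}
        fastest = min(speeds, key=speeds.get)
        slowest = max(speeds, key=speeds.get)

        grants = {w: self.s_min for w in progress}
        if fastest == slowest:
            return grants

        # Projected wait if every worker synchronizes after one more step.
        baseline_wait = speeds[slowest] - speeds[fastest]

        # Try granting the fastest worker k extra steps; keep the best plan.
        best_k, best_wait = 0, baseline_wait
        for k in range(1, self.s_max + 1):
            skew = progress[fastest] + k - progress[slowest]
            if skew > self.s_max:
                break
            projected_wait = abs(speeds[slowest] - (1 + k) * speeds[fastest])
            if projected_wait < best_wait:
                best_k, best_wait = k, projected_wait
        grants[fastest] = best_k
        return grants


# Example: a heterogeneous two-worker cluster where "w1" is roughly twice as fast.
ctrl = StalenessController(s_max=4)
for d in (0.10, 0.11, 0.10):
    ctrl.record_push("w1", d)
for d in (0.21, 0.20, 0.22):
    ctrl.record_push("w2", d)
print(ctrl.plan_iteration({"w1": 100, "w2": 100}))   # e.g. {'w1': 1, 'w2': 0}
```

In this toy timeline simulation, the grant to the fast worker is accepted only when the predicted idle time at the next synchronization point drops below the bulk-synchronous baseline, which is the wait-minimization criterion the abstract attributes to the controller.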