LagTuner: Adaptive Staleness Orchestration for Parameter-Server AI Training
This paper introduces a server-driven orchestration layer for parameter-server training that adaptively bounds iteration skew among workers at runtime, using recent push-timestamp telemetry to minimize straggler waiting while preserving convergence in distributed AI systems. The mechanism selects per-iteration, per-worker staleness allowances within a configurable band, turning gradient exchange into a feedback-controlled service that balances throughput and consistency on both homogeneous and heterogeneous GPU clusters. A formal analysis establishes convergence guarantees comparable to bounded-staleness methods via an O(√[…]
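The abstract's feedback loop, in which the server reads recent push-timestamp telemetry and picks each worker's staleness allowance inside a configurable band, could be sketched as follows. This is a minimal illustration under stated assumptions: the class name `StalenessOrchestrator`, the rate-ratio control rule, and all parameters are hypothetical, not the paper's actual algorithm.

```python
import time
from collections import deque


class StalenessOrchestrator:
    """Server-side controller (illustrative sketch): chooses a per-worker
    staleness allowance within the band [s_min, s_max], driven by recent
    push-timestamp telemetry."""

    def __init__(self, num_workers, s_min=1, s_max=8, window=16):
        self.s_min, self.s_max = s_min, s_max
        # sliding window of recent push intervals (seconds) per worker
        self.intervals = [deque(maxlen=window) for _ in range(num_workers)]
        self.last_push = [None] * num_workers

    def record_push(self, worker_id, ts=None):
        """Record a gradient push; keeps the interval since the prior push."""
        ts = time.monotonic() if ts is None else ts
        prev = self.last_push[worker_id]
        if prev is not None:
            self.intervals[worker_id].append(ts - prev)
        self.last_push[worker_id] = ts

    def allowance(self, worker_id):
        """Fast workers (short push intervals relative to the slowest
        observed worker) get a larger staleness budget so they avoid
        straggler waiting; slow workers get a tight budget to bound skew."""
        rates = [sum(d) / len(d) if d else None for d in self.intervals]
        known = [r for r in rates if r is not None]
        if not known or rates[worker_id] is None:
            return self.s_min  # no telemetry yet: stay conservative
        slowest = max(known)
        # allowance grows with how much faster this worker is than the slowest
        s = round(self.s_min + (slowest / rates[worker_id] - 1.0))
        return max(self.s_min, min(self.s_max, s))
```

For example, with explicit timestamps, a worker pushing every 1 s alongside one pushing every 4 s would receive allowances of 4 and 1 iterations respectively under this rule, illustrating the intended throughput/consistency trade.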