[P] Adaptive load balancing in Go for LLM traffic – harder than expected
I'm an open source contributor working on load balancing for Bifrost (an LLM gateway), and I ran into some interesting challenges with the Go implementation.
Standard weighted round-robin works fine for static loads, but LLM providers behave weirdly. OpenAI might be fast at 9am, slow at 2pm. Azure rate limits kick in unexpectedly. One region degrades while others stay healthy.
Built adaptive routing that adjusts weights based on live metrics – latency, error rates, throughput. Used EWMAs (exponentially weighted moving averages) to smooth those metrics so the weights don't overreact to short spikes or noise.
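For anyone unfamiliar with EWMAs, here is a minimal sketch of the idea. This is not Bifrost's actual code: the `EWMA` type, the `alpha` value of 0.2, and the sample latencies are all made up for illustration.

```go
package main

import "fmt"

// EWMA is an exponentially weighted moving average (hypothetical type).
// alpha is in (0, 1]: higher values react faster to new samples.
// Not safe for concurrent use; in this sketch a single goroutine owns it.
type EWMA struct {
	alpha  float64
	value  float64
	seeded bool
}

func NewEWMA(alpha float64) *EWMA { return &EWMA{alpha: alpha} }

// Update folds one sample into the average and returns the new value.
// The first sample seeds the average directly.
func (e *EWMA) Update(sample float64) float64 {
	if !e.seeded {
		e.value = sample
		e.seeded = true
		return e.value
	}
	e.value = e.alpha*sample + (1-e.alpha)*e.value
	return e.value
}

func main() {
	latencyMs := NewEWMA(0.2) // alpha chosen arbitrarily for the example
	for _, sample := range []float64{100, 100, 900, 100} {
		fmt.Printf("%.1f\n", latencyMs.Update(sample))
	}
	// Prints 100.0, 100.0, 260.0, 228.0: the 900 ms spike moves the
	// average, but only by a fifth of the jump.
}
```

A smaller `alpha` damps spikes more but also takes longer to notice a real degradation, so the value probably needs tuning per metric.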
The tricky part in Go: tracking per-provider metrics without locks becoming a bottleneck at high RPS. Ended up using atomic operations for the counters and a separate goroutine that periodically reads the metrics and recalculates weights. That keeps the hot path lock-free.
Also had to handle provider health scoring. Not just “up or down” but scoring based on recent performance. A provider recovering from issues should gradually earn traffic back, not get slammed immediately.
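One way to get the "earn traffic back gradually" behavior is to let the score drop immediately but cap how fast it can rise. The sketch below is an assumption about how this could work, not the actual Bifrost scoring; `Health`, `maxRecovery`, and the weight floor are hypothetical.

```go
package main

import "fmt"

// Health tracks a score in [0, 1] that drops at once and recovers slowly.
type Health struct {
	score       float64
	maxRecovery float64 // largest increase allowed per evaluation
}

func NewHealth(maxRecovery float64) *Health {
	return &Health{score: 1, maxRecovery: maxRecovery}
}

// Observe moves the score toward target, the quality measured in the latest
// window (clamped to [0, 1]). Drops apply immediately; gains are capped.
func (h *Health) Observe(target float64) float64 {
	if target < 0 {
		target = 0
	}
	if target > 1 {
		target = 1
	}
	if target-h.score > h.maxRecovery {
		h.score += h.maxRecovery
	} else {
		h.score = target
	}
	return h.score
}

// Weight scales a base weight by the score. The floor keeps a trickle of
// probe traffic flowing so a degraded provider can still be measured.
func (h *Health) Weight(base, floor float64) float64 {
	w := base * h.score
	if w < floor {
		return floor
	}
	return w
}

func main() {
	h := NewHealth(0.25)
	fmt.Printf("%.2f\n", h.Observe(0)) // outage: score drops straight to 0.00
	for i := 0; i < 4; i++ {
		// Provider looks perfect again, but the score climbs in steps:
		// 0.25, 0.50, 0.75, 1.00
		fmt.Printf("%.2f\n", h.Observe(1))
	}
}
```

Without some floor, a provider at weight zero gets no requests and therefore no new measurements, so it could never recover on its own.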
Connection pooling matters more than expected. Go’s http.Transport reuses connections well, but tuning MaxIdleConnsPerHost made a noticeable difference under sustained load.
Running this at 5K RPS now, with sub-microsecond overhead per request. The concurrency primitives in Go made this way easier than Python would’ve been.
Anyone else built adaptive routing in Go? What patterns worked for you?
submitted by /u/dinkinflika0