[P] What we learned building automatic failover for LLM gateways

We've been working on Bifrost, and one thing we kept hearing from users was "OpenAI went down and our entire app stopped working." The same thing happens with Anthropic, Azure, whoever.

So we built automatic failover. The gateway tracks health for each provider – success rates, response times, error patterns. When a provider starts failing, requests automatically route to backup providers within milliseconds. Your app doesn’t even know it happened.
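For anyone curious what that looks like, here's a rough Go sketch of the idea: a per-provider success-rate window plus priority-order routing. The names, the one-minute window, and the 90% cutoff are mine for illustration, not Bifrost's actual internals:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// providerHealth keeps a rolling view of recent outcomes for one provider.
// Field names and the 90% threshold are illustrative, not Bifrost's internals.
type providerHealth struct {
	successes, failures int
	windowStart         time.Time
}

// record notes one request outcome, resetting counters every minute so the
// success rate reflects recent behavior rather than all-time history.
func (h *providerHealth) record(ok bool) {
	if time.Since(h.windowStart) > time.Minute {
		h.successes, h.failures = 0, 0
		h.windowStart = time.Now()
	}
	if ok {
		h.successes++
	} else {
		h.failures++
	}
}

// healthy is true while the recent success rate stays at or above 90%.
func (h *providerHealth) healthy() bool {
	total := h.successes + h.failures
	return total == 0 || float64(h.successes)/float64(total) >= 0.9
}

// route walks providers in priority order and returns the first healthy one,
// so a failing primary is skipped without the caller noticing.
func route(order []string, health map[string]*providerHealth) (string, error) {
	for _, p := range order {
		if health[p].healthy() {
			return p, nil
		}
	}
	return "", errors.New("all providers unhealthy")
}

func main() {
	health := map[string]*providerHealth{
		"openai":    {windowStart: time.Now()},
		"anthropic": {windowStart: time.Now()},
	}
	// Simulate OpenAI failing: ten straight errors tank its success rate.
	for i := 0; i < 10; i++ {
		health["openai"].record(false)
	}
	p, _ := route([]string{"openai", "anthropic"}, health)
	fmt.Println("routing to:", p) // routing to: anthropic
}
```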

The tricky part was the circuit breaker pattern. If a provider is having issues, you don't want to keep hammering it with requests. We put it in a "broken" state, route its traffic to backups, then periodically send a test request to check whether it's recovered before ramping full traffic back up.
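Here's a minimal version of that three-state breaker in Go (closed, open, half-open). The failure threshold and cooldown are placeholder numbers, not what Bifrost actually ships with:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type breakerState int

const (
	stClosed   breakerState = iota // healthy: traffic flows normally
	stOpen                         // "broken": all traffic diverted to backups
	stHalfOpen                     // probing: one trial request allowed through
)

// breaker is a minimal circuit breaker; the thresholds are illustrative.
type breaker struct {
	mu          sync.Mutex
	state       breakerState
	failures    int
	openedAt    time.Time
	maxFailures int           // consecutive failures before tripping
	cooldown    time.Duration // quiet period before probing for recovery
}

func newBreaker() *breaker {
	return &breaker{maxFailures: 5, cooldown: 30 * time.Second}
}

// allow reports whether a request may be sent to this provider right now.
func (b *breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state {
	case stClosed:
		return true
	case stOpen:
		// After the cooldown, let one probe through instead of full traffic.
		if time.Since(b.openedAt) > b.cooldown {
			b.state = stHalfOpen
			return true
		}
		return false
	default: // stHalfOpen: the probe is in flight, hold everything else
		return false
	}
}

// report feeds a request outcome back in: a success closes the breaker,
// a failed probe (or too many consecutive failures) opens it again.
func (b *breaker) report(ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if ok {
		b.state, b.failures = stClosed, 0
		return
	}
	b.failures++
	if b.state == stHalfOpen || b.failures >= b.maxFailures {
		b.state = stOpen
		b.openedAt = time.Now()
	}
}

func main() {
	b := newBreaker()
	for i := 0; i < 5; i++ {
		b.report(false) // five straight failures trip the breaker
	}
	fmt.Println(b.allow()) // false: provider is "broken", use backups
}
```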

We also added weighted load balancing across multiple API keys from the same provider, which helps avoid per-key rate limits and spreads the load more evenly.
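Under the hood that's basically weighted random sampling. A toy sketch, assuming you weight each key by something like its share of the rate limit (the struct, key names, and weights here are made up):

```go
package main

import (
	"fmt"
	"math/rand"
)

// apiKey pairs a key with a relative weight (e.g. proportional to its
// rate limit). This is an illustrative schema, not Bifrost's config format.
type apiKey struct {
	key    string
	weight int
}

// pickKey does a weighted random selection so traffic splits in proportion
// to each key's weight, smoothing out per-key rate limits.
func pickKey(keys []apiKey) string {
	total := 0
	for _, k := range keys {
		total += k.weight
	}
	n := rand.Intn(total)
	for _, k := range keys {
		n -= k.weight
		if n < 0 {
			return k.key
		}
	}
	return keys[len(keys)-1].key // unreachable with positive weights
}

func main() {
	keys := []apiKey{{"sk-primary", 70}, {"sk-secondary", 30}}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickKey(keys)]++
	}
	fmt.Println(counts) // roughly a 70/30 split across the two keys
}
```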

We've been running this in production for a while now and it's pretty solid. We've had OpenAI outages where apps just kept running on Claude automatically.

submitted by /u/dinkinflika0