[P] Semantic caching for LLMs is way harder than it looks – here’s what we learned
I work at Bifrost and wanted to share how we built semantic caching into the gateway.
Architecture:
- Dual-layer: exact hash matching + vector similarity search (a rough lookup sketch follows this list)
- text-embedding-3-small for request embeddings
- Weaviate for vector storage (sub-millisecond retrieval)
- Configurable similarity threshold per use case
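To make the dual-layer idea concrete, here's a rough Go sketch of a lookup path: cheap exact hash first, vector similarity as the fallback. None of these type or function names come from the Bifrost codebase; the vector search is stubbed behind a callback and the Weaviate specifics are left out.

```go
package cache

import (
	"crypto/sha256"
	"encoding/hex"
)

// CachedResponse is a placeholder for whatever the gateway stores per entry.
type CachedResponse struct {
	Body string
}

// SemanticCache is a hypothetical dual-layer cache: an exact-match map keyed
// by a request hash, backed by a vector index for near-match lookups.
type SemanticCache struct {
	exact     map[string]CachedResponse
	threshold float64 // cosine-similarity cutoff, configurable per use case
	search    func(embedding []float32, threshold float64) (CachedResponse, bool)
}

// hashRequest builds the exact-match key from the canonicalized request text.
func hashRequest(canonical string) string {
	sum := sha256.Sum256([]byte(canonical))
	return hex.EncodeToString(sum[:])
}

// Lookup tries the cheap exact layer first, then falls back to vector
// similarity search against previously cached request embeddings.
func (c *SemanticCache) Lookup(canonical string, embedding []float32) (CachedResponse, bool) {
	if resp, ok := c.exact[hashRequest(canonical)]; ok {
		return resp, true
	}
	if c.search == nil || embedding == nil {
		return CachedResponse{}, false
	}
	return c.search(embedding, c.threshold)
}
```

The nice property of the hash layer is that byte-identical repeats never touch the vector index at all.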
Key implementation decisions:
- Conversation-aware bypass – Skip caching when conversation history exceeds a turn threshold. Long contexts drift topics and cause false positives.
- Model/provider isolation – Separate cache namespaces per model and provider. GPT-4 responses shouldn't serve from the Claude cache.
- Per-request overrides – Support custom TTL and similarity threshold via headers. Some queries need strict matching, others benefit from loose thresholds. (A sketch covering the bypass, namespacing, and overrides follows this list.)
- Streaming support – Cache complete streamed responses with proper chunk ordering (reassembly sketched below too). Trickier than it sounds.
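Here's a hypothetical sketch of how the bypass, namespacing, and header overrides could fit together. The header names, defaults, and turn cutoff are illustrative assumptions, not Bifrost's actual configuration.

```go
package cache

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// Options controls caching behavior for a single request. Defaults come from
// the gateway config; headers can override them per request.
type Options struct {
	TTL       time.Duration
	Threshold float64
	Bypass    bool
}

// namespaceKey isolates entries per provider and model so, e.g., a GPT-4
// response is never served for a Claude request with similar text.
func namespaceKey(provider, model, requestHash string) string {
	return fmt.Sprintf("%s:%s:%s", provider, model, requestHash)
}

// resolveOptions applies conversation-aware bypass and per-request header
// overrides. The header names and defaults here are made up for illustration.
func resolveOptions(r *http.Request, historyTurns, maxTurns int) Options {
	opts := Options{TTL: time.Hour, Threshold: 0.9}

	// Long conversations drift in topic; skip caching entirely past a cutoff.
	if historyTurns > maxTurns {
		opts.Bypass = true
		return opts
	}
	if v := r.Header.Get("X-Cache-TTL-Seconds"); v != "" {
		if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
			opts.TTL = time.Duration(secs) * time.Second
		}
	}
	if v := r.Header.Get("X-Cache-Similarity-Threshold"); v != "" {
		if t, err := strconv.ParseFloat(v, 64); err == nil && t > 0 && t <= 1 {
			opts.Threshold = t
		}
	}
	return opts
}
```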
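And for the streaming case, a minimal sketch of reassembling buffered chunks in order before the complete text is written to the cache. The chunk shape is an assumption, not the real wire format.

```go
package cache

import (
	"sort"
	"strings"
)

// chunk is one piece of a streamed response; Index preserves the original
// chunk ordering so the cached text matches what the client actually saw.
type chunk struct {
	Index int
	Delta string
}

// assembleStream sorts buffered chunks by index and joins them into the
// complete response that gets written to the cache once the stream finishes.
func assembleStream(chunks []chunk) string {
	sort.Slice(chunks, func(i, j int) bool { return chunks[i].Index < chunks[j].Index })
	var b strings.Builder
	for _, c := range chunks {
		b.WriteString(c.Delta)
	}
	return b.String()
}
```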
Performance constraints: we had to keep the added overhead under 10µs. Embedding generation happens async after the first request is served, so it doesn't block the response.
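A minimal sketch of what "async after serving" could look like, assuming a goroutine per cacheable response; the embed/store callbacks are placeholders for the real embedding-provider and Weaviate calls, and the timeout is arbitrary.

```go
package cache

import (
	"context"
	"log"
	"time"
)

// embedFn stands in for a call to the embedding provider
// (e.g. text-embedding-3-small); the signature is invented for this sketch.
type embedFn func(ctx context.Context, text string) ([]float32, error)

// storeFn stands in for writing the embedding into the vector store.
type storeFn func(key string, embedding []float32) error

// cacheAsync kicks off embedding generation after the response has already
// been sent, so the hot path only pays for spawning the goroutine.
func cacheAsync(key, text string, embed embedFn, store storeFn) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		emb, err := embed(ctx, text)
		if err != nil {
			log.Printf("semantic cache: embedding failed for %s: %v", key, err)
			return
		}
		if err := store(key, emb); err != nil {
			log.Printf("semantic cache: store failed for %s: %v", key, err)
		}
	}()
}
```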
The trickiest part was handling edge cases – empty messages, system prompt changes, cache invalidation timing. Those details matter more than the happy path.
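As an example of the kind of edge-case handling involved, a hypothetical canonicalization step like the one below is what decides whether an empty message or a system prompt edit changes the cache key; it's a sketch, not how Bifrost actually normalizes requests.

```go
package cache

import "strings"

// message is a minimal chat message shape for illustration.
type message struct {
	Role    string
	Content string
}

// canonicalize folds the system prompt and non-empty messages into one string
// that feeds the exact-match hash. Including the system prompt means prompt
// edits miss the cache instead of serving stale output; dropping empty
// messages keeps them from changing the key.
func canonicalize(system string, msgs []message) string {
	var b strings.Builder
	b.WriteString("system:")
	b.WriteString(strings.TrimSpace(system))
	for _, m := range msgs {
		content := strings.TrimSpace(m.Content)
		if content == "" {
			continue // empty messages shouldn't affect the key
		}
		b.WriteString("\n")
		b.WriteString(m.Role)
		b.WriteString(":")
		b.WriteString(content)
	}
	return b.String()
}
```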
Code is open source if anyone wants to dig into the implementation: https://github.com/maximhq/bifrost
Happy to answer technical questions about the approach.
submitted by /u/dinkinflika0