[P] Semantic caching for LLMs is way harder than it looks – here’s what we learned
I work at Bifrost and wanted to share how we built semantic caching into the gateway.
Architecture:
- Dual-layer: exact hash matching + vector similarity search (a rough lookup sketch follows this list)
- text-embedding-3-small for request embeddings
- Weaviate for vector storage (sub-millisecond retrieval)
- Configurable similarity threshold per use case
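To make the dual-layer idea concrete, here's a rough Go sketch of a lookup path: cheap exact hash first, vector similarity as the fallback. None of these type or function names come from the Bifrost codebase; the vector search is stubbed behind a callback and the Weaviate specifics are left out.

```go
package cache

import (
	"crypto/sha256"
	"encoding/hex"
)

// CachedResponse is a placeholder for whatever the gateway stores per entry.
type CachedResponse struct {
	Body string
}

// SemanticCache is a hypothetical dual-layer cache: an exact-match map keyed
// by a request hash, backed by a vector index for near-match lookups.
type SemanticCache struct {
	exact     map[string]CachedResponse
	threshold float64 // cosine-similarity cutoff, configurable per use case
	search    func(embedding []float32, threshold float64) (CachedResponse, bool)
}

// hashRequest builds the exact-match key from the canonicalized request text.
func hashRequest(canonical string) string {
	sum := sha256.Sum256([]byte(canonical))
	return hex.EncodeToString(sum[:])
}

// Lookup tries the cheap exact layer first, then falls back to vector
// similarity search against previously cached request embeddings.
func (c *SemanticCache) Lookup(canonical string, embedding []float32) (CachedResponse, bool) {
	if resp, ok := c.exact[hashRequest(canonical)]; ok {
		return resp, true
	}
	if c.search == nil || embedding == nil {
		return CachedResponse{}, false
	}
	return c.search(embedding, c.threshold)
}
```

The nice property of the hash layer is that byte-identical repeats never touch the vector index at all.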
Key implementation decisions:
- Conversation-aware bypass – Skip caching when conversation history exceeds a turn threshold. Long contexts drift topics and cause false positives.
- Model/provider isolation – Separate cache namespaces per model and provider. GPT-4 responses shouldn't serve from the Claude cache.
- Per-request overrides – Support custom TTL and similarity threshold via headers. Some queries need strict matching, others benefit from loose thresholds. (A sketch covering the bypass, namespacing, and overrides follows this list.)
- Streaming support – Cache complete streamed responses with proper chunk ordering (reassembly sketched below too). Trickier than it sounds.
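Here's a hypothetical sketch of how the bypass, namespacing, and header overrides could fit together. The header names, defaults, and turn cutoff are illustrative assumptions, not Bifrost's actual configuration.

```go
package cache

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// Options controls caching behavior for a single request. Defaults come from
// the gateway config; headers can override them per request.
type Options struct {
	TTL       time.Duration
	Threshold float64
	Bypass    bool
}

// namespaceKey isolates entries per provider and model so, e.g., a GPT-4
// response is never served for a Claude request with similar text.
func namespaceKey(provider, model, requestHash string) string {
	return fmt.Sprintf("%s:%s:%s", provider, model, requestHash)
}

// resolveOptions applies conversation-aware bypass and per-request header
// overrides. The header names and defaults here are made up for illustration.
func resolveOptions(r *http.Request, historyTurns, maxTurns int) Options {
	opts := Options{TTL: time.Hour, Threshold: 0.9}

	// Long conversations drift in topic; skip caching entirely past a cutoff.
	if historyTurns > maxTurns {
		opts.Bypass = true
		return opts
	}
	if v := r.Header.Get("X-Cache-TTL-Seconds"); v != "" {
		if secs, err := strconv.Atoi(v); err == nil && secs > 0 {
			opts.TTL = time.Duration(secs) * time.Second
		}
	}
	if v := r.Header.Get("X-Cache-Similarity-Threshold"); v != "" {
		if t, err := strconv.ParseFloat(v, 64); err == nil && t > 0 && t <= 1 {
			opts.Threshold = t
		}
	}
	return opts
}
```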
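And for the streaming case, a minimal sketch of reassembling buffered chunks in order before the complete text is written to the cache. The chunk shape is an assumption, not the real wire format.

```go
package cache

import (
	"sort"
	"strings"
)

// chunk is one piece of a streamed response; Index preserves the original
// chunk ordering so the cached text matches what the client actually saw.
type chunk struct {
	Index int
	Delta string
}

// assembleStream sorts buffered chunks by index and joins them into the
// complete response that gets written to the cache once the stream finishes.
func assembleStream(chunks []chunk) string {
	sort.Slice(chunks, func(i, j int) bool { return chunks[i].Index < chunks[j].Index })
	var b strings.Builder
	for _, c := range chunks {
		b.WriteString(c.Delta)
	}
	return b.String()
}
```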
Performance constraints: we had to keep the added overhead under 10µs. Embedding generation happens async after the first request is served, so it doesn't block the response.
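A minimal sketch of what "async after serving" could look like, assuming a goroutine per cacheable response; the embed/store callbacks are placeholders for the real embedding-provider and Weaviate calls, and the timeout is arbitrary.

```go
package cache

import (
	"context"
	"log"
	"time"
)

// embedFn stands in for a call to the embedding provider
// (e.g. text-embedding-3-small); the signature is invented for this sketch.
type embedFn func(ctx context.Context, text string) ([]float32, error)

// storeFn stands in for writing the embedding into the vector store.
type storeFn func(key string, embedding []float32) error

// cacheAsync kicks off embedding generation after the response has already
// been sent, so the hot path only pays for spawning the goroutine.
func cacheAsync(key, text string, embed embedFn, store storeFn) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		emb, err := embed(ctx, text)
		if err != nil {
			log.Printf("semantic cache: embedding failed for %s: %v", key, err)
			return
		}
		if err := store(key, emb); err != nil {
			log.Printf("semantic cache: store failed for %s: %v", key, err)
		}
	}()
}
```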
The trickiest part was handling edge cases – empty messages, system prompt changes, cache invalidation timing. Those details matter more than the happy path.
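As an example of the kind of edge-case handling involved, a hypothetical canonicalization step like the one below is what decides whether an empty message or a system prompt edit changes the cache key; it's a sketch, not how Bifrost actually normalizes requests.

```go
package cache

import "strings"

// message is a minimal chat message shape for illustration.
type message struct {
	Role    string
	Content string
}

// canonicalize folds the system prompt and non-empty messages into one string
// that feeds the exact-match hash. Including the system prompt means prompt
// edits miss the cache instead of serving stale output; dropping empty
// messages keeps them from changing the key.
func canonicalize(system string, msgs []message) string {
	var b strings.Builder
	b.WriteString("system:")
	b.WriteString(strings.TrimSpace(system))
	for _, m := range msgs {
		content := strings.TrimSpace(m.Content)
		if content == "" {
			continue // empty messages shouldn't affect the key
		}
		b.WriteString("\n")
		b.WriteString(m.Role)
		b.WriteString(":")
		b.WriteString(content)
	}
	return b.String()
}
```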
Code is open source if anyone wants to dig into the implementation: https://github.com/maximhq/bifrost
Happy to answer technical questions about the approach.
submitted by /u/dinkinflika0