[P] Semantic caching for LLMs is way harder than it looks – here’s what we learned

I work at Bifrost and wanted to share how we built semantic caching into the gateway.

Architecture:

  • Dual-layer: exact hash matching + vector similarity search (rough lookup sketch after this list)
  • Use text-embedding-3-small for request embeddings
  • Weaviate for vector storage (sub-millisecond retrieval)
  • Configurable similarity threshold per use case
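
To make the dual-layer idea concrete, here's a rough sketch of what the lookup path could look like. The types, the `hashRequest` helper, and the interfaces are simplified stand-ins I'm using for this post, not the actual Bifrost code (that's in the repo linked below):

```go
package cache

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// Illustrative types only; the real gateway's interfaces look different.
type Message struct{ Role, Content string }

type Request struct {
	Provider string
	Model    string
	Messages []Message
}

type Response struct{ Body []byte }

type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

type VectorStore interface {
	// Nearest returns the closest cached entry in a namespace and its similarity score.
	Nearest(ctx context.Context, vec []float32, namespace string) (*Response, float64, bool)
	Put(ctx context.Context, vec []float32, namespace string, resp *Response) error
}

type ExactStore interface {
	Get(key string) (*Response, bool)
	Put(key string, resp *Response)
}

type SemanticCache struct {
	exact     ExactStore
	vectors   VectorStore
	embedder  Embedder // e.g. text-embedding-3-small
	threshold float64  // similarity cutoff, configurable per use case
}

// Lookup tries the cheap exact-hash layer first, then falls back to vector similarity.
func (c *SemanticCache) Lookup(ctx context.Context, req *Request) (*Response, bool) {
	// Layer 1: exact match on a hash of the whole normalized request.
	if resp, ok := c.exact.Get(hashRequest(req)); ok {
		return resp, true
	}

	// Layer 2: embed the prompt text and search the vector store,
	// scoped to this provider+model's namespace.
	vec, err := c.embedder.Embed(ctx, flatten(req.Messages))
	if err != nil {
		return nil, false // fail open: a cache miss is always safe
	}
	resp, score, ok := c.vectors.Nearest(ctx, vec, req.Provider+"/"+req.Model)
	if !ok || score < c.threshold {
		return nil, false
	}
	return resp, true
}

func hashRequest(req *Request) string {
	b, _ := json.Marshal(req)
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func flatten(msgs []Message) string {
	var s string
	for _, m := range msgs {
		s += m.Role + ": " + m.Content + "\n"
	}
	return s
}
```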

Key implementation decisions:

  1. Conversation-aware bypass – Skip caching when conversation history exceeds a length threshold. Long contexts drift across topics and cause false-positive hits (sketch after this list).
  2. Model/provider isolation – Separate cache namespaces per model and provider. GPT-4 responses shouldn’t serve from Claude cache.
  3. Per-request overrides – Support custom TTL and threshold via headers. Some queries need strict matching, others benefit from loose thresholds.
  4. Streaming support – Cache complete streamed responses with proper chunk ordering; trickier than it sounds (see the streaming sketch below).
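
Decisions 1 and 2 boil down to a couple of small checks before you ever touch the vector store. A minimal sketch, reusing the illustrative `Message` type from above; the turn limit and namespace format here are placeholders, not the gateway's real values:

```go
// Continuing the illustrative package above.
const maxCachedTurns = 6 // beyond this, skip the semantic layer entirely (placeholder value)

// shouldCache implements the conversation-aware bypass: long histories drift
// across topics, so a "similar" earlier conversation is often a false positive.
func shouldCache(msgs []Message) bool {
	return len(msgs) <= maxCachedTurns
}

// cacheNamespace keeps every provider+model pair in its own namespace so a
// GPT-4 response can never be served for a Claude request, even when the
// prompts embed very close together.
func cacheNamespace(provider, model string) string {
	return provider + "/" + model
}
```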

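For streaming (decision 4), the gist is to buffer chunks as they arrive and only write to the cache once the stream finishes cleanly. This is a simplified sketch with made-up callback names, not the real implementation:

```go
// Streaming: buffer chunks and cache only a clean, complete response.
type streamRecorder struct {
	chunks []string
	failed bool
}

// OnChunk slots a chunk in by index so ordering survives even if chunks are
// handed to us slightly out of order.
func (r *streamRecorder) OnChunk(index int, data string) {
	for len(r.chunks) <= index {
		r.chunks = append(r.chunks, "")
	}
	r.chunks[index] = data
}

func (r *streamRecorder) OnError(err error) { r.failed = true }

// OnComplete joins the chunks and hands the full body to the cache writer,
// but only if the stream finished without errors.
func (r *streamRecorder) OnComplete(put func(full string)) {
	if r.failed {
		return // never cache a partial or errored stream
	}
	var full string
	for _, chunk := range r.chunks {
		full += chunk
	}
	put(full)
}
```
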
Performance constraints: we had to keep overhead under 10µs. Embedding generation happens asynchronously, after the first request has been served, so it never blocks the response.
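
The async write path looks roughly like this, continuing the illustrative types from the sketches above: the client gets its response first, and the embedding plus vector insert happen in a background goroutine, so the hot path only ever pays for a hash lookup:

```go
// storeAsync populates both cache layers after the miss has already been served.
// Names and structure are illustrative, not the gateway's actual code.
func (c *SemanticCache) storeAsync(req *Request, resp *Response) {
	if !shouldCache(req.Messages) {
		return // conversation-aware bypass applies to writes too
	}
	go func() {
		// A real version would bound this with a timeout and recover from panics.
		ctx := context.Background()

		// The exact layer is cheap; write it immediately.
		c.exact.Put(hashRequest(req), resp)

		// Embedding is the slow part; doing it here keeps it off the hot path.
		vec, err := c.embedder.Embed(ctx, flatten(req.Messages))
		if err != nil {
			return // drop it; the exact layer still covers identical repeats
		}
		_ = c.vectors.Put(ctx, vec, cacheNamespace(req.Provider, req.Model), resp)
	}()
}
```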

The trickiest part was handling edge cases – empty messages, system prompt changes, cache invalidation timing. Those details matter more than the happy path.

Code is open source if anyone wants to dig into the implementation: https://github.com/maximhq/bifrost

Happy to answer technical questions about the approach.

submitted by /u/dinkinflika0