[D] Production gaps in context-window compression for AI agent memory
I've been working on AI memory infrastructure, and I recently spent a few weeks reading through the source code of an open-source context-window compression system — the kind that replaces retrieval entirely by having background LLM agents compress conversation history into structured observations, then prefixes the entire block onto every turn.
The approach just hit 90+% on LongMemEval, which is impressive. But after tracing the full lifecycle (observer prompts, compression thresholds, reflector behavior, cross-conversation scoping), I found some significant production gaps that the benchmark doesn't surface:
- Compression is irreversible. Observations overwrite originals. Importance is decided at write time, not query time. If a detail gets pruned, there’s no fallback to source material.
- The benchmark likely never triggers the destructive compression phase. LongMemEval conversation volumes probably stay below the reflector threshold, so the score reflects only the high-fidelity extraction stage, not the lossy compression that kicks in at scale.
- Cross-conversation memory is either zero or everything. Default config = total amnesia between conversations. The alternative loads ALL prior conversation observations on every turn of every new conversation. No selective retrieval.
- Multimodal and tool-call content gets destroyed. Tool results are capped at 2k tokens for counting, images get a one-pass text description and originals are abandoned. At higher compression levels, entire agentic workflows collapse to single-line summaries.
- Economics depend entirely on prompt caching. A 30k-token prefix on every turn is only viable at 75-90% cache discounts. Async use cases where the cache TTL expires between turns pay full price.
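To make the first bullet concrete, here's a minimal sketch of the difference between write-time-only compression and a design that keeps source material around for query-time fallback. All names here are illustrative, not taken from the system I reviewed:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Hypothetical store that keeps raw originals alongside observations."""
    observations: dict = field(default_factory=dict)  # id -> compressed summary
    sources: dict = field(default_factory=dict)       # id -> raw original text

    def compress(self, mid: str, raw: str, summary: str) -> None:
        # Retaining the raw text is what makes pruning reversible.
        # In the write-time-only design, this line would be `del raw`:
        # the summary overwrites the original and detail is gone for good.
        self.sources[mid] = raw
        self.observations[mid] = summary  # what gets prefixed into the context

    def recall(self, mid: str, need_detail: bool = False) -> str:
        # Query-time fallback: if the summary turns out to be too lossy,
        # go back to the source. Without `self.sources`, importance is
        # decided once, at write time, with no second chance.
        if need_detail and mid in self.sources:
            return self.sources[mid]
        return self.observations.get(mid, "")
```

The point isn't the storage mechanics; it's that "importance" becomes a query-time decision instead of a one-shot write-time one.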
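The caching-economics point is easy to quantify. A rough per-turn cost model, using placeholder prices (not any provider's actual rates) and an assumed 90% cache discount:

```python
# Hypothetical cost model for a 30k-token prefix sent on every turn.
# PRICE_PER_MTOK and CACHE_DISCOUNT are assumed numbers for illustration.
PREFIX_TOKENS = 30_000
PRICE_PER_MTOK = 3.00   # assumed input price, USD per million tokens
CACHE_DISCOUNT = 0.90   # assumed discount on cached prefix tokens

def turn_cost(cache_hit: bool) -> float:
    """Cost of re-sending the prefix on one turn, in USD."""
    rate = PRICE_PER_MTOK * (1 - CACHE_DISCOUNT) if cache_hit else PRICE_PER_MTOK
    return PREFIX_TOKENS / 1_000_000 * rate

print(f"cache hit:  ${turn_cost(True):.4f}/turn")
print(f"cache miss: ${turn_cost(False):.4f}/turn")
```

Under these assumptions a cache miss is 10x the cost of a hit, which is why an async agent whose cache TTL expires between turns ends up paying the uncompressed-context bill anyway.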
I wrote up a longer analysis here.
Curious if others have run into similar tradeoffs with compression-based approaches, or if there are mitigations I’m missing. Also interested in whether people think LongMemEval and LoCoMo are sufficient for evaluating production memory systems; they don’t test contradiction detection, multimodal recall, deletion compliance, or entity disambiguation, which seems like a significant gap.
submitted by /u/Ok_Row9465