[D] Production gaps in context-window compression for AI agent memory
I've been working on AI memory infrastructure, and I recently spent a few weeks reading through the source code of an open-source context-window compression system — the kind that replaces retrieval entirely by having background LLM agents compress conversation history into structured observations, then prefixes the entire block onto every turn.
The approach just hit 90+% on LongMemEval, which is impressive. But after tracing the full lifecycle (observer prompts, compression thresholds, reflector behavior, cross-conversation scoping), I found some significant production gaps that the benchmark doesn't surface:
- Compression is irreversible. Observations overwrite originals. Importance is decided at write time, not query time. If a detail gets pruned, there’s no fallback to source material.
- The benchmark likely never triggers the destructive compression phase. LongMemEval conversation volumes probably stay below the reflector threshold, so the score reflects only the high-fidelity extraction stage, not the lossy compression that kicks in at scale.
- Cross-conversation memory is either zero or everything. Default config = total amnesia between conversations. The alternative loads ALL prior conversation observations on every turn of every new conversation. No selective retrieval.
- Multimodal and tool-call content gets destroyed. Tool results are capped at 2k tokens for counting, images get a one-pass text description and originals are abandoned. At higher compression levels, entire agentic workflows collapse to single-line summaries.
- Economics depend entirely on prompt caching. A 30k-token prefix on every turn is only viable at 75-90% cache discounts. Async use cases where the cache TTL expires between turns pay full price.
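To make the first bullet concrete, here's a minimal sketch of the difference between write-time-only compression and a design that keeps source material around for query-time fallback. All names here are illustrative, not taken from the system I reviewed:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Hypothetical store that keeps raw originals alongside observations."""
    observations: dict = field(default_factory=dict)  # id -> compressed summary
    sources: dict = field(default_factory=dict)       # id -> raw original text

    def compress(self, mid: str, raw: str, summary: str) -> None:
        # Retaining the raw text is what makes pruning reversible.
        # In the write-time-only design, this line would be `del raw`:
        # the summary overwrites the original and detail is gone for good.
        self.sources[mid] = raw
        self.observations[mid] = summary  # what gets prefixed into the context

    def recall(self, mid: str, need_detail: bool = False) -> str:
        # Query-time fallback: if the summary turns out to be too lossy,
        # go back to the source. Without `self.sources`, importance is
        # decided once, at write time, with no second chance.
        if need_detail and mid in self.sources:
            return self.sources[mid]
        return self.observations.get(mid, "")
```

The point isn't the storage mechanics; it's that "importance" becomes a query-time decision instead of a one-shot write-time one.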
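The caching-economics point is easy to quantify. A rough per-turn cost model, using placeholder prices (not any provider's actual rates) and an assumed 90% cache discount:

```python
# Hypothetical cost model for a 30k-token prefix sent on every turn.
# PRICE_PER_MTOK and CACHE_DISCOUNT are assumed numbers for illustration.
PREFIX_TOKENS = 30_000
PRICE_PER_MTOK = 3.00   # assumed input price, USD per million tokens
CACHE_DISCOUNT = 0.90   # assumed discount on cached prefix tokens

def turn_cost(cache_hit: bool) -> float:
    """Cost of re-sending the prefix on one turn, in USD."""
    rate = PRICE_PER_MTOK * (1 - CACHE_DISCOUNT) if cache_hit else PRICE_PER_MTOK
    return PREFIX_TOKENS / 1_000_000 * rate

print(f"cache hit:  ${turn_cost(True):.4f}/turn")
print(f"cache miss: ${turn_cost(False):.4f}/turn")
```

Under these assumptions a cache miss is 10x the cost of a hit, which is why an async agent whose cache TTL expires between turns ends up paying the uncompressed-context bill anyway.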
I wrote up a longer analysis here.
Curious if others have run into similar tradeoffs with compression-based approaches, or if there are mitigations I’m missing. Also interested in whether people think LongMemEval and LoCoMo are sufficient for evaluating production memory systems; they don’t test contradiction detection, multimodal recall, deletion compliance, or entity disambiguation, which seems like a significant gap.
submitted by /u/Ok_Row9465