KIV: 1M token context window on an RTX 4070 (12GB VRAM), no retraining, drop-in HuggingFace cache replacement – works with any model that uses DynamicCache [P]
Been working on this for a bit and figured it was ready to share. KIV (K-Indexed V Materialization) is a middleware layer that replaces the standard KV cache in HuggingFace transformers with a tiered retrieval system. The short version: it keeps recent tokens exact in VRAM, moves old K/V to system RAM, and uses K vectors as a search index to pull back only the ~256 most relevant V entries per decode step.
Results on a 4070 12GB with Gemma 4 E2B (4-bit):
- 1M tokens, 12MB KIV VRAM overhead, ~6.5GB total GPU usage
- 4.1 tok/s at 1M context (8-10 tok/s counting GPU time alone), 12.9 tok/s at 4K
- 70/70 needle-in-haystack tests passed across 4K-32K
- Perfect phonebook lookup (unique names) at 58K tokens
- Prefill at 1M takes about 4.3 minutes (one-time cost)
- Decode is near-constant regardless of context length
The core finding that makes this work: K vectors are smooth and structured, which makes them great search indices. V vectors are high-entropy and chaotic, so don’t try to compress them, just retrieve them on demand. Use K to decide which V entries deserve to exist in VRAM at any given step.
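To make that concrete, here's a toy sketch of the retrieval step in PyTorch. This is my illustration of the idea, not code from the KIV repo, and all names (`retrieve_topk_v`, `k_index`, `v_store`) are made up:

```python
import torch

def retrieve_topk_v(query, k_index, v_store, top_k=256):
    """Toy K-indexed V retrieval (illustrative names, not KIV's API).

    query:   (num_heads, head_dim)       current decode-step query
    k_index: (num_heads, seq, head_dim)  offloaded K vectors, used as the index
    v_store: (num_heads, seq, head_dim)  offloaded V vectors, fetched on demand
    """
    # Score every cached position with the same dot product attention uses,
    # then keep only the top_k best-matching positions per head.
    scores = torch.einsum("hd,hsd->hs", query, k_index)                 # (heads, seq)
    top = scores.topk(min(top_k, scores.shape[-1]), dim=-1).indices     # (heads, top_k)
    # Gather just those V entries -- this is the only data that would
    # actually be copied back into VRAM for this decode step.
    idx = top.unsqueeze(-1).expand(-1, -1, v_store.shape[-1])
    return torch.gather(v_store, 1, idx), top

heads, seq, dim = 4, 1024, 64
q = torch.randn(heads, dim)
K = torch.randn(heads, seq, dim)
V = torch.randn(heads, seq, dim)
v_sel, positions = retrieve_topk_v(q, K, V, top_k=256)
print(v_sel.shape)  # torch.Size([4, 256, 64])
```

The point of the shapes: attention at each decode step only ever sees ~256 V entries per head instead of the full million, which is why decode cost stays near-constant.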
No model weights are modified. No retraining or distillation. It hooks into the HuggingFace cache interface and registers a custom attention function. The model has no idea it’s talking to a tiered memory system. Works with any model that uses DynamicCache. Tested on Gemma 4, Qwen2.5, TinyLlama, and Phi-3.5 across MQA/GQA/MHA.
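For anyone curious what "hooks into the cache interface" means in practice, here's a minimal standalone sketch of the tier logic, mirroring the `update()` contract HuggingFace caches expose (take new K/V, store them, return what attention should see). It's a simplification I wrote for illustration, not KIV's actual implementation, and it deliberately avoids importing transformers:

```python
import torch

class TieredKVCacheSketch:
    """Illustrative two-tier cache (not KIV's real code).

    Recent tokens stay resident and exact in the 'hot' (VRAM) tier;
    older tokens are evicted to a 'cold' (system RAM) tier, from which
    KIV would pull entries back via K-indexed retrieval.
    """

    def __init__(self, window=8):
        self.window = window               # how many recent tokens stay resident
        self.hot_k, self.hot_v = [], []    # exact, "VRAM" tier
        self.cold_k, self.cold_v = [], []  # offloaded, "system RAM" tier

    def update(self, key_states, value_states):
        self.hot_k.append(key_states)
        self.hot_v.append(value_states)
        # Evict the oldest resident token once the hot window is full.
        if len(self.hot_k) > self.window:
            self.cold_k.append(self.hot_k.pop(0).cpu())
            self.cold_v.append(self.hot_v.pop(0).cpu())
        return torch.stack(self.hot_k), torch.stack(self.hot_v)

cache = TieredKVCacheSketch(window=4)
for _ in range(10):                        # simulate 10 decode steps
    k, v = cache.update(torch.randn(64), torch.randn(64))
print(len(cache.cold_k), k.shape)  # 6 torch.Size([4, 64])
```

Because the model only ever calls the cache through this interface, it genuinely can't tell the difference between this and a normal all-in-VRAM cache.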
There are real limitations, and I’m upfront about them in the repo. Bounded prefill loses some information on dense, similar-looking data. Collision disambiguation doesn’t work, but that’s the 4-bit 2B model struggling, not the cache. Two-hop reasoning fails for the same reason. CPU RAM scales linearly (5.8GB at 1M tokens).
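As a sanity check on that last number (derived purely from the figures quoted above, nothing measured by me), the linear scaling pencils out to roughly 6 KiB of offloaded K/V per token:

```python
# Back-of-envelope: CPU RAM per token at the quoted 5.8GB / 1M tokens.
tokens = 1_000_000
cpu_ram_bytes = 5.8 * 1024**3
per_token = cpu_ram_bytes / tokens          # bytes of offloaded K/V per token
print(f"{per_token / 1024:.2f} KiB per token")
# So 2M tokens should need roughly double: ~11.6GB of system RAM.
```

That's in the right ballpark for a small model's full-precision per-token K/V footprint, which is consistent with V not being compressed at all.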
Still actively optimizing decode speed, especially at longer contexts. The current bottleneck is CPU-to-GPU transfer for retrieved tokens, not the model itself. Plenty of room to improve here.
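One generic PyTorch pattern for that kind of bottleneck (my suggestion, not something the repo necessarily does) is pinned host memory plus non-blocking copies, so the CPU→GPU transfer of retrieved V entries can overlap with GPU compute:

```python
import torch

def fetch_to_gpu(v_cpu, device="cuda"):
    """Hedged sketch: cheapen CPU->GPU fetches of retrieved entries.

    Pinned (page-locked) host memory lets the async copy actually run
    asynchronously, overlapping the transfer with GPU compute. Generic
    PyTorch technique, not code from the KIV repo.
    """
    if not torch.cuda.is_available():
        return v_cpu  # fall back: nothing to transfer on CPU-only boxes
    pinned = v_cpu.pin_memory()
    return pinned.to(device, non_blocking=True)

retrieved = torch.randn(256, 8, 64)        # e.g. 256 retrieved V entries
on_gpu = fetch_to_gpu(retrieved)
print(on_gpu.shape)
```

Whether this helps in KIV specifically depends on how the retrieval and transfer are currently scheduled.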
GitHub: github.com/Babyhamsta/KIV (can be installed as a local pip package, no official pip package yet)
Happy to answer questions about the architecture or results. Would love to see what happens on bigger models with more VRAM if anyone wants to try it.
submitted by /u/ThyGreatOof