Is 3-Bit KV Cache the Holy Grail? A Reality Check on Google’s TurboQuant
Author(s): Ravi Yogesh
Originally published on Towards AI.

10 experiments, 3 models, one honest verdict: the quality story is real, the speed story needs a disclaimer, and there's a finding in the entropy data nobody talks about.

⏱ ~14 min read · 🔬 Deep Dive · ⚙️ LLM Inference · 🗜 Quantization · 🚀 Serving

Photo by Logan Voss on Unsplash

When Google published TurboQuant at ICLR 2026, the headline was hard to ignore: compress your LLM's key-value cache to 3 bits, keep quality intact, get up to 6× memory savings. I built a 10-experiment evaluation pipeline, ran it across three models (Gemma-2B base, Gemma-2B-IT, and TinyLlama 1.1B Chat) and measured everything I could: factual accuracy, RAG retrieval quality, multi-task generation fidelity, throughput, memory footprint, layer sensitivity, and something most quantization write-ups skip entirely: what compression does to attention entropy.

Figure 0: Full experiment suite (quality, memory, throughput, and mixed-bit results across all three models)

What Is TurboQuant and Why It's Different

Most KV-cache quantization schemes treat compression as a reconstruction problem: minimize mean squared error on cached vectors. TurboQuant is smarter than that. It targets the thing that actually matters for attention: inner-product preservation.

Stage 1: PolarQuant

Each pair of consecutive vector coordinates gets converted to polar form (radius + angle). The radius stays in full float. The angle, which lives in a bounded, roughly uniform range after an optional random rotation, is quantized at low bitwidth. The rotation is the real trick: it spreads outlier energy so that scalar quantization on the angle isn't wrecked by a few dominant values.

Stage 2: QJL (Quantized Johnson-Lindenstrauss)

The reconstruction error from Stage 1 gets projected into a random subspace, and only the sign of each projection is stored: 1 bit per projection. The Johnson-Lindenstrauss lemma says random projections approximately preserve inner products. So this step specifically corrects for the dot-product error that distorts attention scores, not just MSE. That distinction is what separates TurboQuant from naive quantizers.

Disclaimer: My implementation hooks into the model's forward pass and compresses cached K and V tensors between decode steps using int8 containers with per-vector float16 scale factors. It is not a packed 3-bit kernel and does not use fused CUDA or Triton kernels. This distinction matters a lot for the speed results and somewhat for the memory results. I will be explicit about both throughout.
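To make Stage 1 concrete, here is a minimal sketch of the polar-coordinate idea. This is neither Google's kernel nor the hook-based prototype I just described; it is an illustrative PyTorch snippet where the function names, shapes, and the QR-based rotation are my own choices, kept in float throughout so the geometry is easy to see.

```python
import torch

def random_rotation(dim: int, seed: int = 0) -> torch.Tensor:
    # Random orthogonal matrix (QR of a Gaussian). The rotation spreads
    # outlier energy across coordinates so the angle distribution is
    # roughly uniform and safe to quantize on a uniform scalar grid.
    gen = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(dim, dim, generator=gen))
    return q

def polar_quantize(x: torch.Tensor, bits: int, rot: torch.Tensor):
    # Rotate, then view consecutive coordinate pairs as (radius, angle).
    x_rot = x @ rot
    pairs = x_rot.view(*x_rot.shape[:-1], -1, 2)
    radius = torch.linalg.vector_norm(pairs, dim=-1)        # kept in full precision
    angle = torch.atan2(pairs[..., 1], pairs[..., 0])       # bounded in (-pi, pi]
    step = 2 * torch.pi / (2 ** bits)
    code = ((angle + torch.pi) / step).floor().clamp(0, 2 ** bits - 1)
    return radius, code.to(torch.uint8), step               # the angle costs `bits` bits

def polar_dequantize(radius, code, step, rot):
    # Rebuild each pair from its radius and the quantized angle, undo the rotation.
    angle_hat = (code.float() + 0.5) * step - torch.pi
    pairs_hat = torch.stack((radius * torch.cos(angle_hat),
                             radius * torch.sin(angle_hat)), dim=-1)
    return pairs_hat.flatten(-2) @ rot.T
```

The reconstruction error left over after this stage, x minus polar_dequantize(...), is exactly what Stage 2 goes after.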
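Stage 2 can be sketched the same way. This is my reading of the QJL idea rather than the paper's implementation: project the Stage-1 residual with a shared random Gaussian matrix, keep only the signs (one bit each) plus the residual norm, and use those signs to estimate the residual's contribution to the attention dot product.

```python
import math
import torch

def qjl_encode(residual: torch.Tensor, proj: torch.Tensor):
    # proj: (m, dim) i.i.d. Gaussian projection matrix, shared across vectors.
    # Store the residual norm (small float overhead) and one sign bit per
    # projection dimension.
    r_norm = torch.linalg.vector_norm(residual, dim=-1)
    sign_bits = (residual @ proj.T) >= 0
    return r_norm, sign_bits

def qjl_dot_correction(q: torch.Tensor, r_norm, sign_bits, proj):
    # Estimate <q, residual> from the sign bits alone. For a Gaussian row s,
    # E[sign(<s, r>) * <s, q>] = sqrt(2/pi) * <q, r> / ||r||, so averaging over
    # the m projections and rescaling recovers the dot-product term that
    # Stage 1 lost -- the quantity attention scores actually depend on.
    m = proj.shape[0]
    signs = sign_bits.float() * 2 - 1
    q_proj = q @ proj.T
    return math.sqrt(math.pi / 2) / m * r_norm * (signs * q_proj).sum(dim=-1)

# Usage sketch for one cached key k and one query q (illustrative names):
#   k_hat = polar_dequantize(*polar_quantize(k, bits=3, rot=rot), rot)
#   r_norm, sbits = qjl_encode(k - k_hat, proj)
#   score = q @ k_hat + qjl_dot_correction(q, r_norm, sbits, proj)
```

Note the asymmetry in this sketch: only the cached K/V pay the storage cost (angle codes, radii, norms, sign bits), while the query is computed fresh each decode step in full precision and merely gets projected at score time.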
The Experimental Setup

Ten experiments, two phases. All text evaluations used a fixed external encoder (sentence-transformers/all-MiniLM-L6-v2), separate from the tested models. Throughput used a fixed 80-step greedy decode benchmark with alternating load order to reduce warm/cold GPU bias.

1. 3-Bit Is the Real Operating Point, and Entropy Explains Why

Start with the synthetic bit-depth comparison: PolarQuant alone vs. the full TurboQuant-style path, on attention MSE and KL divergence. At 2 bits, TurboQuant performs worse than PolarQuant alone. At 3 and 4 bits, the combination wins decisively.

Figure 1 (E1): Attention MSE and KL divergence vs. bit depth, PolarQuant vs. TurboQuant-style
Figure 2 (E2): Rate-distortion curve, measured MSE vs. anchored 4^-b reference (log scale)

The MSE table shows that 2-bit is worse, but it doesn't explain why. The attention entropy experiment does.

Entropy measures how focused or diffuse the softmax distribution is: high entropy means the model is attending broadly; low entropy means it has locked onto a few positions. The key distinction is between random key distributions (noise-floor conditions) and structured key distributions, i.e. vectors with dominant directional clusters, which is closer to what real LLM attention heads actually produce.

Structured distributions (closer to real LLM key patterns) show a 65% entropy collapse at 2 bits: the model is forced to hyper-focus on a tiny fraction of context positions.

At 2-bit compression, structured key distributions experience a 65% entropy collapse; the quantizer is actively reshaping the attention geometry in ways that cut off access to broader context. This is not just elevated MSE. It is the underreported reason why 2-bit compression quietly damages multi-hop reasoning, long-document synthesis, and complex RAG chains.

2. Quality Results Are Stronger Than Expected

Factual QA (40 questions, 3 seeds)

RAG-Style Context-Grounded QA (12 contexts, 3 seeds)

Multi-task semantic similarity (factual, reasoning, coding, summarization) between baseline and TQ-3bit generations: 1.0 across all categories and all three models.

Figure 3 (E7): Multi-task semantic similarity radar, TQ-3bit vs. baseline (1.0 across all categories)

3. The Memory Math Has Two Very Different Versions

For Gemma-2B (18 layers, MQA with 1 KV head per group, 256-dim heads), compare the idealized 3-bit target with what the int8 prototype actually stores. The 2.69× gap between ideal and prototype is entirely explained by storage format: int8 costs 8 bits per element instead of 3, plus float16 scale metadata per token vector.

Figure 4 (E4): Left: ideal targets by bitwidth. Right: prototype storage vs. ideal 3-bit target

If someone tells you TurboQuant gives 5× memory savings, ask what storage backend they're using. An int8 prototype gives ~2×. That's still useful (it can extend the context window or increase batch size on constrained hardware), but it's a different product story. The ~5× headline requires truly packed sub-byte storage with custom memory layouts.

4. Speed Is Model-Dependent: Here's the Honest Reason Why

Throughput sweep: 80 fixed greedy decode steps, 5 measured trials, 2 alternating rounds, three prompt lengths.

Headline (512-token prompt)

Full prompt-length sweep (TQ/baseline ratio): values above 1.0 mean TQ-3bit is faster. TinyLlama is nearly 10× faster at baseline, making it proportionally more sensitive to cache I/O savings.

The speed question is not "is TurboQuant fast?" It is "is your model's bottleneck memory bandwidth or arithmetic?" Faster models like TinyLlama spend proportionally more time on cache I/O, so compressing the cache loosens that bottleneck. Compute-heavy models like Gemma-2B are less responsive. Know your bottleneck before optimizing for it.

5. Mixed-Bit Schedules Are an Easy Win Nobody Talks About

Layer sensitivity analysis showed that for both Gemma variants, compression sensitivity peaks in layers 7–10, the middle of the stack. Uniform bit allocation is leaving quality on […]