[P] Implemented TurboQuant in Python

digitado ⋅ March 29, 2026

Spent ~2 days implementing this paper: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Most quantization stuff I’ve worked with usually falls into one of these:

This paper basically says: what if we just… don’t do either?

The main idea is weirdly simple:

No training. No dataset-specific tuning. Same quantizer works everywhere.
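A rough numpy sketch of that data-oblivious recipe as I read it: rotate randomly, then scalar-quantize every coordinate against one fixed codebook. The function names, the separately stored norm, and the codebook (Lloyd-Max levels fitted on synthetic Gaussian samples) are my own stand-ins, not necessarily what the paper uses.

```python
import numpy as np

def random_rotation(d, rng):
    # QR of a Gaussian matrix; folding the signs of R's diagonal into Q
    # makes the result a uniformly random orthogonal matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def gaussian_codebook(bits, rng, n_samples=100_000, iters=30):
    # Lloyd-Max levels for N(0, 1), fitted on synthetic samples only,
    # so nothing here ever looks at the data being compressed.
    samples = rng.standard_normal(n_samples)
    k = 2 ** bits
    levels = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2
        idx = np.searchsorted(edges, samples)
        levels = np.array([samples[idx == j].mean() for j in range(k)])
    return levels

def quantize(x, rot, levels):
    # Assumes x is a nonzero 1-D vector and bits <= 8 (codes fit in uint8).
    d = x.shape[0]
    norm = np.linalg.norm(x)
    # After rotation and rescaling, coordinates are roughly N(0, 1).
    y = (rot @ (x / norm)) * np.sqrt(d)
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, y).astype(np.uint8), norm

def dequantize(codes, norm, rot, levels):
    d = codes.shape[0]
    y_hat = levels[codes] / np.sqrt(d)
    return norm * (rot.T @ y_hat)
```

The rotation and codebook are built once from a seed and reused for every vector, which is why no calibration pass is needed.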

There’s also a nice fix for inner products:

Plain MSE-optimal quantization biases dot products (pretty badly at low bit widths), so they add a 1-bit JL-style (Johnson-Lindenstrauss) correction on the residual, which makes the inner-product estimate unbiased.
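To make the residual trick concrete, here is a minimal sketch. The coarse stage is a deliberately crude 1-bit sign quantizer that I picked only to make the bias visible; it is not the paper's MSE quantizer. The names `coarse_quantize`, `qjl_encode`, `qjl_inner` and `corrected_inner` are hypothetical, and `proj` is assumed to be a Gaussian matrix shared by encoder and decoder.

```python
import numpy as np

def coarse_quantize(x):
    # Stand-in for the MSE stage: sign bits with one shared scale.
    # mean(|x|) is the squared-error-optimal scale for this codebook.
    return np.sign(x) * np.abs(x).mean()

def qjl_encode(residual, proj):
    # Store one sign bit per projection row plus the residual norm.
    return np.sign(proj @ residual), np.linalg.norm(residual)

def qjl_inner(query, signs, res_norm, proj):
    # For a Gaussian row s: E[sign(<s, r>) * <s, q>] = sqrt(2/pi) * <r, q> / ||r||.
    # Rescaling by sqrt(pi/2) * ||r|| and averaging over rows gives an
    # unbiased estimate of <r, q> (unbiased over the randomness of proj).
    m = proj.shape[0]
    return np.sqrt(np.pi / 2) * res_norm / m * ((proj @ query) @ signs)

def corrected_inner(query, x, proj):
    # Coarse reconstruction plus the 1-bit estimate of the residual's share.
    x_hat = coarse_quantize(x)
    signs, res_norm = qjl_encode(x - x_hat, proj)
    return query @ x_hat + qjl_inner(query, signs, res_norm, proj)
```

A single call is noisy. The point is that averaging over fresh draws of `proj` converges to the true dot product, while `query @ coarse_quantize(x)` on its own stays off by a fixed amount.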

Why this is actually useful:

KV cache in transformers: you can’t calibrate because tokens stream in, but this works online
vector DBs / embeddings: compress each vector independently, no preprocessing step

What surprised me:

My implementation notes:

works pretty cleanly in numpy
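rotation is expensive: generating a dense random orthogonal matrix (e.g. via QR) is O(d³), and applying it costs O(d²) per vector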
didn’t implement fractional bits (paper does 2.5 / 3.5-bit with channel splitting)
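On the rotation cost: a common cheaper substitute for a dense random rotation is a randomized Hadamard transform (random sign flips followed by a fast Walsh-Hadamard transform), which costs O(d log d) per vector. This is not what my implementation does, and it is not a uniformly random rotation, so the paper's guarantees don't carry over automatically. The sketch below assumes d is a power of two, and the helper names are mine.

```python
import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform with orthonormal scaling.
    # len(x) must be a power of two.
    x = np.asarray(x, dtype=float).copy()
    d = x.shape[0]
    h = 1
    while h < d:
        blocks = x.reshape(-1, 2, h)          # pairs of half-blocks of size h
        a, b = blocks[:, 0], blocks[:, 1]
        x = np.stack([a + b, a - b], axis=1).reshape(d)
        h *= 2
    return x / np.sqrt(d)

def rotate(x, signs):
    # signs: fixed vector of +/-1 drawn once and shared by all vectors.
    return fwht(signs * x)

def unrotate(y, signs):
    # The orthonormal Hadamard matrix is its own inverse.
    return signs * fwht(y)
```

The transform is orthogonal, so norms and dot products are preserved exactly, and only the d sign bits need to be stored instead of a d×d matrix.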
