[P] Implemented TurboQuant in Python
Spent ~2 days implementing this paper: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Repo: github.com/yashkc2025/turboquant
Most quantization stuff I’ve worked with usually falls into one of these:
- you need calibration data (k-means, clipping ranges, etc.)
- or you go naive (uniform quant) and take the quality hit
This paper basically says: what if we just… don’t do either?
The main idea is weirdly simple:
- take your vector
- hit it with a random rotation
- now suddenly the coordinates behave nicely (each one is ~Gaussian when the dimension is large)
- so you can just do optimal 1D quantization per dimension
No training. No dataset-specific tuning. Same quantizer works everywhere.
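Rough sketch of the rotate-then-quantize idea in numpy. This is not the repo code, just a minimal version under some assumptions: the function names are made up for illustration, the rotation is a dense orthogonal matrix from QR, the codebook is a Monte Carlo approximation of the 1D Lloyd-Max quantizer for N(0,1) (the paper works with the exact coordinate distribution), and the norm is stored separately as a float.

```python
import numpy as np

def random_rotation(d, rng):
    # QR of a Gaussian matrix gives an orthogonal matrix; the sign fix on the
    # columns makes it uniformly (Haar) distributed.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def gaussian_codebook(bits, rng, n_samples=200_000, iters=50):
    # Approximate Lloyd-Max levels for N(0,1) with 1D Lloyd iterations on
    # synthetic samples. No real data involved, so this is built once and reused.
    samples = rng.standard_normal(n_samples)
    levels = 2 ** bits
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2
        idx = np.searchsorted(edges, samples)
        centroids = np.array([samples[idx == k].mean() for k in range(levels)])
    return centroids

def quantize(x, rot, centroids):
    # Assumes x is nonzero. After rotating the unit vector and scaling by
    # sqrt(d), each coordinate is approximately N(0,1).
    norm = np.linalg.norm(x)
    y = rot @ (x / norm) * np.sqrt(x.size)
    edges = (centroids[:-1] + centroids[1:]) / 2
    return np.searchsorted(edges, y).astype(np.uint8), norm

def dequantize(codes, norm, rot, centroids):
    # Look up the levels, undo the scaling, rotate back, restore the norm.
    y_hat = centroids[codes] / np.sqrt(codes.size)
    return norm * (rot.T @ y_hat)

rng = np.random.default_rng(0)
d, bits = 128, 3
rot = random_rotation(d, rng)
centroids = gaussian_codebook(bits, rng)

x = rng.standard_normal(d) ** 3  # deliberately heavy-tailed input
codes, norm = quantize(x, rot, centroids)
x_hat = dequantize(codes, norm, rot, centroids)
print("relative squared error:", np.sum((x - x_hat) ** 2) / np.sum(x ** 2))
```

The same `rot` and `centroids` get reused for every vector, which is the whole point: nothing here ever looks at a dataset.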
There’s also a nice fix for inner products:
- normal MSE quantization biases dot products (pretty badly at low bits)
- so they add a 1-bit JL-style (Johnson-Lindenstrauss) correction on the residual -> makes the dot product estimate unbiased
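Toy demo of the bias and the fix, again a sketch and not the repo code. Assumptions: function names are made up, `x` is a Gaussian vector standing in for an already-rotated input, the MSE stage is the 1-bit Lloyd-Max quantizer for N(0,1) (levels ±sqrt(2/π)), and the correction uses a dense Gaussian projection `S` with sign bits plus the residual norm. It averages over many draws of `S` only to show the expectation; in practice you would draw `S` once.

```python
import numpy as np

def one_bit_mse_quantize(x):
    # 1-bit Lloyd-Max quantizer for N(0,1) coordinates: levels are +/- sqrt(2/pi).
    return np.sign(x) * np.sqrt(2 / np.pi)

def sign_sketch_encode(r, S):
    # Keep one sign bit per projected coordinate plus the residual norm.
    return np.sign(S @ r), np.linalg.norm(r)

def sign_sketch_inner_product(q, signs, r_norm, S):
    # Unbiased estimate of <q, r>: for g ~ N(0, I),
    # E[<g, q> * sign(<g, r>)] = sqrt(2/pi) * <q, r> / ||r||.
    m = S.shape[0]
    return np.sqrt(np.pi / 2) * r_norm / m * ((S @ q) @ signs)

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d)  # stand-in for a rotated vector
q = x.copy()                # query equal to x makes the shrinkage easy to see

x_hat = one_bit_mse_quantize(x)
residual = x - x_hat

true_ip = q @ x
mse_only = q @ x_hat        # systematically too small

estimates = []
for _ in range(1000):
    S = rng.standard_normal((d, d))
    signs, r_norm = sign_sketch_encode(residual, S)
    estimates.append(mse_only + sign_sketch_inner_product(q, signs, r_norm, S))
corrected = float(np.mean(estimates))

print("true:", true_ip, "mse only:", mse_only, "corrected (avg):", corrected)
```

The MSE-only estimate lands well below the true value (roughly a 2/π shrink at 1 bit), while the corrected estimate averages out to the true inner product.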
Why this is actually useful:
- KV cache in transformers: you can’t calibrate because tokens stream in -> this works online
- vector DBs / embeddings: compress each vector independently, no preprocessing step
What surprised me:
- the rotation step is doing all the magic
- after that, everything reduces to a solved 1D problem
- theory is tight: distortion is within a constant factor (~2.7×) of the information-theoretic lower bound
My implementation notes:
- works pretty cleanly in numpy
- rotation is expensive (O(d³) to build a dense random orthogonal matrix, e.g. via QR, then O(d²) per vector to apply it)
- didn’t implement fractional bits (paper does 2.5 / 3.5-bit with channel splitting)
submitted by /u/chhed_wala_kaccha