Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

Author(s): Chew Loong Nian – AI ENGINEER

Originally published on Towards AI.

A 26-billion-parameter model has no business fitting in 15GB of memory and spitting out 193 tokens a second on a single consumer GPU. That is laptop-and-gaming-rig territory, not a datacenter. Yet that is exactly what Google’s new Gemma 4 QAT checkpoints do, and after digging into how they pulled it off, the part that stuck with me is not the speed. It is that the 4-bit version barely loses anything compared to the full-precision original. By every law of quantization I thought I understood, it should be noticeably dumber. It isn’t.

After the lead, the article breaks down why Gemma 4 QAT + Unsloth’s GGUF conversion is unusually effective: it quantizes during training so the model learns to be robust to 4-bit rounding, explains the typical PTQ quality loss, and describes how Unsloth fixes a subtle scale-mismatch bug that otherwise wipes out most of the benefit when converting to llama.cpp formats.

It then provides concrete performance and memory numbers for different Gemma 4 variants (especially the 26B-A4B mixture-of-experts model), compares naive vs dynamic conversion accuracy, and summarizes the practical steps to run the model with llama.cpp, plus other deployment options (API server, Ollama/LM Studio, Unsloth Studio, vLLM/SGLang, MLX, and browser ONNX).

Finally, it offers guidance on which model to choose based on available hardware, notes the remaining caveat that 4-bit is still 4-bit, and concludes that the usual quality-vs-speed tradeoff is collapsing, making the 26B-A4B feel like a near big-model experience on consumer GPUs.

Read the full blog for free on Medium.
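To make the scale-mismatch idea concrete, here is a minimal sketch in plain Python. It is not Unsloth's fix or llama.cpp's quantizer. It assumes a simple symmetric 4-bit grid with a single scale per weight group, and the names `fake_quant`, `train_scale`, and the 1.3x mismatch factor are hypothetical, chosen only for illustration. The point is that weights which already sit on the grid used during training survive re-quantization untouched when the converter reuses the same scale, but pick up fresh rounding error when the converter chooses a different one.

```python
import random

QMAX = 7  # assumed symmetric signed 4-bit grid: integer levels -7..7


def quantize(weights, scale):
    """Round each weight to the nearest integer level, clipped to the 4-bit range."""
    return [max(-QMAX, min(QMAX, round(w / scale))) for w in weights]


def dequantize(levels, scale):
    """Map integer levels back to floating-point weights."""
    return [q * scale for q in levels]


def fake_quant(weights, scale):
    """Quantize then dequantize: the round trip a 4-bit format applies."""
    return dequantize(quantize(weights, scale), scale)


def absmax_scale(weights):
    """Pick the scale so the largest-magnitude weight lands on level QMAX."""
    return max(abs(w) for w in weights) / QMAX


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
raw = [random.gauss(0.0, 0.02) for _ in range(256)]

# Stand-in for QAT output: weights that already sit on the training grid.
train_scale = absmax_scale(raw)
qat_weights = fake_quant(raw, train_scale)

# Converter reuses the training scale: every weight maps back onto itself.
same_grid = fake_quant(qat_weights, train_scale)

# Converter picks a different scale (hypothetical 1.3x mismatch): the grid
# points no longer line up, so the weights get rounded a second time.
other_grid = fake_quant(qat_weights, train_scale * 1.3)

err_same = mean_abs_error(qat_weights, same_grid)
err_other = mean_abs_error(qat_weights, other_grid)

print(f"error with matching scale:   {err_same:.6f}")
print(f"error with mismatched scale: {err_other:.6f}")
```

Real GGUF formats use per-block scales and more elaborate layouts, so treat this only as an intuition for why a conversion step that ignores the training-time grid can give back much of what QAT gained.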
