TAI #199: Gemma 4 Brings a Credible US Open-Weight Contender Back to the Table
Author(s): Towards AI Editorial Team

Originally published on Towards AI.

What happened this week in AI by Louie

This week, Google DeepMind released Gemma 4, and I think this is the most consequential US open-weight release in quite a while. China has been leading the open-weight conversation for months, especially with ever-larger Mixture-of-Experts families and increasingly agentic models. Gemma 4 does not wipe that scoreboard clean. What it does do is bring a strong Apache 2.0 family from a U.S. lab back into the part of the market that actually wants to run models itself, on local hardware or within tighter enterprise boundaries.

That said, the part of the market that insists on self-hosting is shrinking. Anthropic reported today that its run-rate revenue has surpassed $30 billion, up from about $9 billion at the end of 2025 and roughly $1 billion in December 2024. That is approximately 30x in 16 months. We are seeing far more clients comfortable with using LLM APIs or enterprise-tier agents and chatbots than we did six months ago. The security and privacy policies of the major AI labs have also become substantially clearer, which has helped lower the barrier for risk-averse organizations.

Google is launching four variants of Gemma 4: the small E2B and E4B edge models, the 31B dense flagship, and a 26B A4B MoE aimed at higher-throughput reasoning. Gemma has now passed 400 million downloads and more than 100,000 community variants. This generation is built on Gemini 3 research and, for the first time, ships under the Apache 2.0 license.

On Google’s benchmarks, the two larger models are serious. The 31B posts 1,452 on Arena AI text, 84.3% on GPQA Diamond, 89.2% on AIME 2026, 80.0% on LiveCodeBench v6, 76.9% on MMMU Pro, and 86.4% on Tau2-bench retail (versus 6.6% for Gemma 3 27B on the same test). The 26B A4B is close behind: 1,441 Arena AI text, 82.3% GPQA Diamond, 88.3% AIME 2026, 77.1% LiveCodeBench.
Google also reports 19.5% and 8.7% on Humanity’s Last Exam without tools for the 31B and 26B, respectively, rising to 26.5% and 17.2% with search. These are properly competitive open-model results.

The architecture is conservative, and that is part of the appeal: hybrid sliding-window plus global attention, Proportional RoPE for long context, a 512-token local window on the edge models, and 1,024 on the larger ones. The 31B has 30.7B effective parameters; the 26B A4B has 25.2B total, but only 3.8B active per token (8 of 128 experts plus one shared). The capability jump looks to be driven more by reinforcement learning, training recipes, and data than by architectural reinvention.

On the engineering side, Gemma 4 supports a configurable thinking mode, native system-role prompting, native function calling with dedicated tool-call tokens, and text-and-image input across the family, plus video and audio on the smaller models. The prompting docs are unusually concrete, with a clearly defined tool lifecycle, direct guidance on stripping thought traces from multi-turn history, and a recommendation to summarize reasoning back into context for long-running agents rather than replaying raw tokens. Google also explicitly warns developers to validate function names and arguments before execution.

The small models target phones, Raspberry Pi, and Jetson Nano; the 26B and 31B fit on consumer GPUs and workstations, and both larger models can run on a single H100. Important caveat: despite only 3.8B active parameters, the 26B MoE still requires loading the full model into memory. MoE still doesn’t give you a free lunch on deployment.

Ecosystem support is thorough: day-one availability across Hugging Face, Ollama, Kaggle, LM Studio, vLLM, llama.cpp, MLX, NVIDIA NIM, Vertex AI, and Google AI Edge. On Android, Gemma 4 serves as the base for Gemini Nano 4, offering up to 4x faster performance and 60% lower battery use.

The independent picture from Artificial Analysis is nuanced.
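To make the tool-calling guidance concrete: below is a minimal sketch of validating a model-emitted function call before dispatching it, in the spirit of Google’s warning to check function names and arguments first. The registry, schema shape, and JSON call format here are illustrative assumptions, not Gemma 4’s actual wire format.

```python
import json

# Hypothetical tool registry: allowed functions and their argument names.
TOOL_REGISTRY = {
    "get_order_status": {"required": {"order_id"}, "optional": {"verbose"}},
}

def validate_tool_call(raw_call: str):
    """Return (name, args) if the call is safe to dispatch, else raise."""
    call = json.loads(raw_call)
    name, args = call.get("name"), call.get("arguments", {})
    if name not in TOOL_REGISTRY:  # unknown or hallucinated function name
        raise ValueError(f"unknown tool: {name!r}")
    spec = TOOL_REGISTRY[name]
    missing = spec["required"] - args.keys()
    unexpected = args.keys() - spec["required"] - spec["optional"]
    if missing or unexpected:  # reject malformed argument sets
        raise ValueError(f"bad args: missing={missing}, unexpected={unexpected}")
    return name, args

name, args = validate_tool_call(
    '{"name": "get_order_status", "arguments": {"order_id": "A-17"}}'
)
```

In production you would typically go further (type-checking argument values against a JSON schema, sandboxing execution), but the point stands: never hand a raw model-generated call straight to an executor.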
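The deployment caveat is worth quantifying. The back-of-the-envelope sketch below shows why a top-8-of-128 MoE touches only a few billion parameters per token yet must keep all of them resident; the per-expert size (~0.18B) is back-solved here to match the reported 3.8B active / 25.2B total figures and is an assumption, since the real split is not public.

```python
def moe_active_params(total_params, expert_params_each, n_experts,
                      top_k, shared_experts=1):
    """Per-token active parameters vs. parameters that must stay in memory."""
    all_expert_params = expert_params_each * n_experts
    dense_params = total_params - all_expert_params  # attention, embeddings, ...
    # Each token runs the dense layers plus top_k routed experts + shared ones.
    active = dense_params + expert_params_each * (top_k + shared_experts)
    return active, total_params  # (compute per token, memory footprint)

active, resident = moe_active_params(
    total_params=25.2e9, expert_params_each=0.18e9, n_experts=128, top_k=8)
print(f"active per token: {active/1e9:.1f}B, must load: {resident/1e9:.1f}B")
# → active per token: 3.8B, must load: 25.2B
```

The asymmetry is the whole point: compute scales with the 3.8B active path, but VRAM (or offloading bandwidth) scales with the full 25.2B, which is why the 26B A4B is no easier to fit on a GPU than a dense model of the same total size.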
On its Intelligence Index, the 31B scores 39, trailing Qwen 3.5 27B at 42 by only 3 points while using roughly 2.5x fewer output tokens to complete the benchmark suite (39M vs. 98M). The 31B’s main weakness versus Qwen is agentic performance, not general reasoning. On non-agentic evaluations, it is right there: SciCode 43 vs. 40, TerminalBench Hard 36 vs. 33, GPQA Diamond 86 vs. 86, IFBench 76 vs. 76, Humanity’s Last Exam 23 vs. 22. The 26B A4B is a less flattering story, trailing Qwen 3.5 35B A3B more clearly on agentic work (Agentic Index 32 vs. 44). Short version: the 31B is the star, the 26B A4B is useful but not magic, and the small models punch well above their weight.

Why should you care?

Gemma 4 matters because it changes the shape of the open-weight market, not because it takes the crown. The last year of Chinese-lab dominance has produced brilliant models, but many are trillion-parameter MoE systems that are awkward to self-host, expensive to run cleanly, and, for some Western enterprises, uncomfortable from a compliance standpoint. Gemma 4 gives those organizations a credible alternative: US-origin, Apache 2.0, practical to deploy on a single GPU. For regulated sectors, air-gapped environments, edge devices, and teams that need control over data retention and customization, it is an actual option, not a toy.

At the same time, Anthropic’s $30 billion run-rate is strong evidence that the broader market is moving toward hosted APIs and enterprise-tier products rather than self-hosting. I think that narrows the role of open weights, but it also sharpens it. Open models no longer need to serve everyone. They need to own the use cases where locality, inspectability, and tuning flexibility matter more than the capability frontier.

It is also worth noting that the AI engineering space has continued to drift away from fine-tuning.
Most production teams rely entirely on prompting, retrieval, and context engineering, and the frontier closed models are generally not available for fine-tuning at the weight level anyway. The bar for fine-tuning a smaller open model to outperform the out-of-the-box capabilities of a frontier model with strong tools and good context is extremely high. But Gemma 4 matters here precisely because it keeps […]