NVIDIA Gave Away a 550B AI Model. A Chip Company Doesn’t Do That by Accident.

digitado ⋅ June 6, 2026

Nemotron 3 Ultra is the most capable open model a US lab has released, and you can download the whole thing: weights, training data, recipes. That raises an obvious question. Why is the company that sells the chips also giving away the software that runs on them? The answer explains a lot about how the AI race actually works.

On June 4, NVIDIA released Nemotron 3 Ultra, a 550-billion-parameter language model, and did something that still surprises people the first time they hear it: it gave the whole thing away. Not an API you rent. The actual model weights, the training data, and the recipes used to build it, all published openly under a license that lets you use it commercially, fine-tune it, and run it yourself.

That prompts two reasonable questions, and this piece answers both. First, what actually is this thing, and is it any good? Second, and more interesting, why would NVIDIA, a company that makes its money selling chips, spend enormous resources building a frontier-class model and then hand it out for free? The first question is a tour of some genuinely clever engineering. The second tells you something real about the strategy underneath the entire AI boom.

What Nemotron 3 Ultra actually is

Start with the basics, then the parts that make it interesting.

Nemotron 3 Ultra is the largest and most capable model in NVIDIA’s Nemotron 3 family, which also includes a small Nano model and a mid-sized Super model of around 100 billion parameters. Ultra is built specifically for agentic work, the long-running, multi-step tasks where a model has to reason, use tools, and keep going across many turns, rather than just answer a single question. That focus shapes everything about its design.

The headline number, 550 billion parameters, comes with an important asterisk, and the asterisk is the first clever bit. Ultra is a Mixture-of-Experts model, which means that although it contains 550 billion parameters in total, only about 55 billion of them are active for any given token. The model routes each piece of input through a small subset of specialized “expert” networks rather than firing all 550 billion every time. The sparsity ratio is roughly ten to one, so about 90% of the model sits dormant on any single pass. The payoff is large: you get the quality that comes from a huge model at roughly a tenth of the per-token compute that a dense 550-billion-parameter model would demand. For agent systems that make calls constantly, that efficiency is the difference between practical and unaffordable.
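To make the routing idea concrete, here is a minimal Python sketch of top-k expert selection, assuming a softmax over the chosen router scores. The function name, the four example scores, and k = 2 are made up for illustration. Only the 550-billion and 55-billion figures come from the article, and none of this should be read as NVIDIA's actual router.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and return (expert_index, weight) pairs.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical router scores for one token across four experts.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)

# Figures from the article: total vs. active parameters per token.
TOTAL_PARAMS = 550e9
ACTIVE_PARAMS = 55e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # about 0.1
```

Only the experts in `routes` do any work for that token. Every other expert is skipped, which is where the roughly tenfold saving in per-token compute comes from.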

There is more under the hood worth knowing in plain terms. The architecture is a hybrid, mixing Mamba layers, an efficient alternative to the standard Transformer attention that handles long contexts cheaply, with attention layers and the expert routing. It uses something called multi-token prediction, where the model is trained to guess several upcoming tokens at once instead of one at a time, which is part of how it hits over 300 tokens per second in output speed. And it was trained using NVFP4, NVIDIA’s own four-bit number format native to its newest Blackwell chips, which shrinks the memory footprint substantially while keeping accuracy essentially intact. The throughput result NVIDIA reports is the real selling point: up to roughly six times the inference speed of comparable open models at similar accuracy. It also carries a one-million-token context window and, notably, posted the highest non-hallucination score in its comparison set, meaning it makes things up less than its peers.
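For a feel of what a four-bit number format means in practice, here is a simplified sketch of block-scaled quantization: each small block of values shares one scale factor, and every value is snapped to the nearest point on a tiny grid. The grid below is the standard set of magnitudes a 4-bit float with two exponent bits and one mantissa bit can represent. The function names and the way the scale is chosen are assumptions for illustration, not NVIDIA's NVFP4 implementation.

```python
# Magnitudes representable by a 4-bit float (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    """Quantize one block of floats; returns (shared_scale, list_of_grid_values)."""
    peak = max(abs(v) for v in values)
    # Illustrative choice: map the largest magnitude onto the top of the grid.
    scale = peak / FP4_GRID[-1] if peak > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(FP4_GRID, key=lambda g: abs(g - target))
        codes.append(-nearest if v < 0 else nearest)
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from the shared scale and the grid values."""
    return [c * scale for c in codes]

scale, codes = quantize_block([12.0, 1.0, -3.0, 0.2])
approx = dequantize_block(scale, codes)
```

Values that land on the grid survive exactly, while others (here `0.2`) are rounded to the nearest representable point. A real format stores a four-bit code per value rather than the float itself, plus the shared scale for each block.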

In short: this is a big, fast, efficient, agent-focused open model, and the engineering is genuinely strong. Which makes the next part more interesting, not less.

It is the best open model from a US lab, and it still trails China

Here is the honest framing that the launch-day enthusiasm tends to soften.

By the most-cited independent benchmark, the Artificial Analysis Intelligence Index, Nemotron 3 Ultra scores about 48, and that is the highest score any open-weight model from a US lab has achieved. It sits well clear of the previous American open contenders: Google’s Gemma 4 at around 39, NVIDIA’s own mid-sized Super at 36, and the open model from OpenAI at around 33. So as a statement about American open models specifically, this is a real milestone. It is the new leader.

But widen the frame and the picture is more sobering. China’s best open model, Kimi K2.6 from Moonshot AI, scores about 54 on the same index, comfortably ahead of Ultra, and DeepSeek’s frontier open model also outranks it. So the most capable open model America just produced still trails the best open models coming out of China. And against the closed proprietary flagships from Anthropic, Google, and OpenAI, which cluster around 57, Ultra is roughly nine points behind.

This is the genuinely important context. There is a real race in open-weight AI, and it is not one the US is currently winning. China has been shipping strong open models at a fast clip, and Nemotron 3 Ultra, impressive as it is, pulls ahead of its domestic competition without closing the gap to the global open frontier. Efficiency is where Ultra genuinely leads: it is built to be faster and cheaper to run than those Chinese models, which matters a lot for real deployment. On raw intelligence, though, it is chasing, not leading. Anyone telling you America just retook the open-model lead is skipping the most relevant comparison.

So why does a chip company give away a frontier model?

This is the question that confuses people, and the answer is the most clarifying thing in this whole story. NVIDIA is not an AI-model company in the way OpenAI or Anthropic are. NVIDIA is a chip company. It makes its money, an enormous amount of it, selling the GPUs that AI runs on. Once you hold that fact in mind, giving away Nemotron stops looking strange and starts looking obvious.

The logic is straightforward. NVIDIA’s business grows when the world runs more AI, because running AI requires its chips. Anything that makes AI more useful, more widespread, and more compute-hungry is good for NVIDIA, regardless of who builds the models. So NVIDIA has a direct incentive to push the entire field forward, including by releasing excellent open models that anyone can use, because every company that picks up Nemotron and builds on it becomes, sooner or later, a buyer of the hardware to run it. The model is free. The chips it runs best on are not. Nemotron is, in effect, a loss-leader for silicon.

Look at the specifics and the strategy is unmistakable. Ultra was trained in NVFP4, a number format native to NVIDIA’s own Blackwell chips, and it runs fastest on NVIDIA hardware. The better and more popular the open model, the more reason developers have to buy the GPUs it was tuned for. Giving away the model is a way of selling the platform. It is the same playbook as giving away the razor to sell the blades, except the razor here happens to be a state-of-the-art AI model.

There is a second, geopolitical layer to it as well. With China shipping strong open models, there is a real contest over whose open models the world’s developers build on, because the ecosystem that forms around the leading open models shapes standards, tooling, and mindshare. By releasing a top US open model, weights, data, and recipes all in the open, NVIDIA is planting a flag in that contest on the American side. That serves its commercial interest and the broader strategic interest of US open-weight AI staying competitive at the same time.

So the free model is not charity and it is not a puzzle. It is a chip company using an excellent open model to sell more chips and to keep its hardware at the center of how AI gets built. Once you see that, every “why would they do this” question about NVIDIA and open models answers itself.

Can you actually run it?

A natural follow-up, especially after all the talk of efficiency: can you run this yourself? The honest answer is mostly no, not on your own machine, and it is worth being clear about why.

The efficiency gains are real, but they are relative. NVFP4 quantization cuts the memory needed for the weights by roughly half compared with an 8-bit format and by roughly three-quarters compared with a 16-bit one, which is a big saving. But a Mixture-of-Experts model has to keep every expert in memory, not just the 55 billion parameters active on a given token, and 550 billion parameters at four bits each still come to roughly 275 GB of weights. Running Ultra yourself means datacenter-grade GPUs, the kind of multi-GPU server setup that most individuals and many companies do not have sitting around. One of NVIDIA’s own efficiency wins with this model is that its reduced footprint lets it fit on a single eight-GPU server node rather than needing two, which tells you the scale we are talking about. This is not a laptop model.
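The memory arithmetic is easy to check yourself. The sketch below counts weight storage only, assuming every parameter is stored at the stated width. It ignores per-block scale factors, the KV cache, and activations, all of which push the real figure higher. The function name is just for illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 550e9  # every expert has to be resident, not just the active ones

mem_16bit = weight_memory_gb(TOTAL_PARAMS, 16)  # about 1100 GB
mem_8bit = weight_memory_gb(TOTAL_PARAMS, 8)    # about 550 GB
mem_4bit = weight_memory_gb(TOTAL_PARAMS, 4)    # about 275 GB

saving_vs_16bit = 1 - mem_4bit / mem_16bit  # about 0.75
saving_vs_8bit = 1 - mem_4bit / mem_8bit    # about 0.50
```

Even the four-bit figure is far beyond any single consumer GPU, which is why the realistic unit of deployment is a multi-GPU server.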

For almost everyone, then, the practical way to use Nemotron 3 Ultra is the same way you would use a closed model: through an API. NVIDIA hosts it, and so do third-party inference providers, where you pay per token and someone else owns the hardware. The self-hosted path makes sense only for organizations that already have the GPU infrastructure, or that have a specific reason, like data that legally cannot leave their own servers, to run it in-house. The fact that it is open and free to download does not mean it is free or easy to run. Those are different things, and the gap between them is a rack of expensive chips, which, not coincidentally, NVIDIA would be happy to sell you.
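Per-token billing is worth a quick back-of-the-envelope estimate before committing to an agent workload, since agents make many calls with long inputs. The sketch below is a plain cost calculator. The prices and token counts in the example are placeholders, not real rates from NVIDIA or any provider, so substitute the figures from whichever service you use.

```python
def run_cost_usd(calls, input_tokens_per_call, output_tokens_per_call,
                 usd_per_million_input, usd_per_million_output):
    """Estimate the cost of a workload billed per million input and output tokens."""
    input_cost = calls * input_tokens_per_call * usd_per_million_input / 1e6
    output_cost = calls * output_tokens_per_call * usd_per_million_output / 1e6
    return input_cost + output_cost

# Placeholder numbers for one hypothetical agent run: 200 calls,
# 8,000 input tokens and 500 output tokens each, at made-up prices.
example_cost = run_cost_usd(
    calls=200,
    input_tokens_per_call=8000,
    output_tokens_per_call=500,
    usd_per_million_input=1.0,
    usd_per_million_output=4.0,
)
```

With these placeholder inputs the run works out to about two dollars, most of it from input tokens. Long contexts, not long answers, tend to dominate the bill for agents.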

What to take from this

A few things worth carrying away.

Nemotron 3 Ultra is a genuinely strong piece of engineering: a 550-billion-parameter Mixture-of-Experts model that activates only a tenth of itself per token, built for agents, fast, efficient, and fully open down to its training data. As a US open model it is the new leader, which is a real achievement. As a global open model it still trails China’s best, which is the context that matters and the part the hype tends to mute.

And the reason it exists in open form is the cleanest illustration you will find of NVIDIA’s position in the AI economy. A company that sells chips wins when the world runs more AI, so it gives away an excellent model, tuned for its own hardware, to make AI more capable and more widespread, and to keep its silicon at the center of all of it. The model is the gift. The chips are the business. Understanding that one relationship explains not just Nemotron, but a great deal of why the AI race looks the way it does.

If you have tried Nemotron 3 Ultra through one of the hosted APIs, drop a comment on how its agentic performance and speed compare to what you have been using. The most useful question to ask about any new open model: not just how smart it is, but how much it costs to actually run.

Resources

NVIDIA Nemotron 3 Ultra technical blog (architecture and throughput details): https://developer.nvidia.com/blog/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running-agents/
NVIDIA Nemotron 3 Ultra technical report (full architecture and training): https://research.nvidia.com/labs/nemotron/Nemotron-3-Ultra/
Artificial Analysis evaluation of Nemotron 3 Ultra (independent benchmark scores): https://artificialanalysis.ai/articles/nvidia-nemotron-3-ultra-released
The model weights on Hugging Face (Base, post-trained, and NVFP4 checkpoints): https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
Background on Mixture-of-Experts and why sparsity reduces inference cost: https://huggingface.co/blog/moe

Originally published in Towards AI on Medium.