Hybrid by Design: Inside the Mamba-MoE Engine of Nemotron 3

TL;DR
The Models: The family includes Nano, Super, and Ultra.
The Architecture: A Hybrid Mamba-Transformer Mixture-of-Experts (MoE) design that replaces most attention layers with Mamba-2 layers for high throughput.
Key Innovations:
- LatentMoE: A new expert routing mechanism in Super/Ultra that projects tokens into a smaller latent space to improve accuracy-per-byte.
- MTP (Multi-Token Prediction): Enables faster generation via native speculative decoding.
- NVFP4: Native 4-bit floating-point training for the larger models.
Capabilities: Supports 1M token context windows and granular inference-time reasoning budget control.

Paper-explained Series 5
Nemotron 3, a new family of open models, introduces radical architectural shifts, including a specialized Mixture-of-Experts (MoE) design, native NVFP4 training, and a massive 1-million-token context window.
1. The Core Architecture: Hybrid Mamba-Transformer MoE

The defining feature of the Nemotron 3 family is its Hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture.
Breaking the Attention Bottleneck
Standard Transformer models rely on self-attention layers, which require a Key-Value (KV) cache that grows linearly during generation. This growth creates a memory bottleneck that hampers inference throughput, especially for long-context reasoning.
To solve this, Nemotron 3 predominantly interleaves MoE layers with Mamba-2 layers.
- Mamba-2 Layers: These layers process sequences with a constant-size recurrent state during generation, avoiding the memory footprint of an ever-growing KV cache (a back-of-the-envelope comparison follows this list).
- Sparse Attention: The models retain only a “select few” self-attention layers to handle high-fidelity, all-to-all information routing where absolutely necessary.
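To get a feel for why this matters, here is a tiny back-of-the-envelope comparison in plain Python. The layer dimensions are made-up examples, not Nemotron 3's actual configuration; the point is only the scaling behavior.

```python
# Toy comparison of per-layer inference state: a KV cache grows with every
# generated token, while an SSM-style recurrent state stays fixed.
# All dimensions below are made-up examples, not Nemotron 3's real config.

def kv_cache_bytes(seq_len: int, n_heads: int = 32, head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Attention layer: K and V are cached for every token seen so far."""
    return 2 * seq_len * n_heads * head_dim * dtype_bytes

def ssm_state_bytes(n_heads: int = 32, head_dim: int = 128,
                    state_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Mamba-2-style layer: generation carries only a fixed-size state."""
    return n_heads * head_dim * state_dim * dtype_bytes

for seq_len in (4_096, 131_072, 1_000_000):
    print(f"{seq_len:>9} tokens | KV cache: {kv_cache_bytes(seq_len) / 2**20:9.1f} MiB"
          f" | SSM state: {ssm_state_bytes() / 2**20:5.1f} MiB")
```

The attention-layer state scales with sequence length while the Mamba-2 state does not, which is exactly why keeping only a handful of attention layers pays off at long context.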
As an aside, the Jet-Nemotron work compared Gated DeltaNet against Mamba and found Gated DeltaNet stronger, yet Nemotron 3 sticks with Mamba-2 here; perhaps Mamba simply pairs better with MoE.
The Result
This design delivers best-in-class throughput. For example, the Nemotron-3-Nano-30B-A3B (30B total parameters, ~3B active) achieves 3.3x higher throughput compared to the similarly sized Qwen3-30B-A3B model.
2. LatentMoE: Compressing the Router (Super & Ultra)
While Nano uses a standard hybrid MoE design, the larger Super and Ultra models introduce a novel architecture called LatentMoE.
The Problem with Standard MoE
In large-scale deployments, MoE layers face distinct bottlenecks depending on the workload:
- Latency-focused: Even though an MoE model only uses a few “active” experts per token, the GPU still has to fetch those specific expert weights from GPU memory (VRAM) into the compute cores. These weight matrices are massive; their size is determined by the model’s hidden dimension d and the expert’s internal size m. Because the batch size is small, the GPU spends more time waiting for these weights to arrive from memory than it spends actually doing the math.
- Throughput-focused: In a large-scale MoE, experts are often distributed across different GPUs or chips. For every layer, the model must check which expert is needed for which token and “dispatch” that token to the correct GPU. This creates a massive traffic jam of data moving between GPUs. If the “lanes” between GPUs (interconnect bandwidth) get clogged, the powerful compute cores sit idle waiting for data to arrive.
The LatentMoE Solution
LatentMoE addresses these issues by compressing the routing mechanism. Instead of performing routing and computation in the full model hidden dimension d, the model:
- Projects the token embedding into a smaller latent dimension l.
- Routes and Computes entirely within this compressed latent space.
- Projects back to the original hidden dimension.
This compression reduces parameter loads and communication payloads by a factor of roughly 4x. NVIDIA reinvests these savings by scaling up the total number of experts N and the active experts per token K by that same factor. The result is improved accuracy per byte without sacrificing inference throughput or latency.
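As a rough mental model, here is a minimal PyTorch sketch of that pattern, assuming toy dimensions and a simple top-k softmax router. The real LatentMoE layer will differ in many implementation details; the part to notice is the down-project, route-and-compute-in-latent, up-project structure.

```python
# Minimal sketch of the LatentMoE pattern described above: project tokens from
# hidden dim d into a smaller latent dim l, route and run the experts entirely
# in that latent space, then project back. Toy sizes, illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoESketch(nn.Module):
    def __init__(self, d=1024, l=256, n_experts=32, top_k=4, expert_dim=256):
        super().__init__()
        self.down = nn.Linear(d, l, bias=False)            # d -> latent l
        self.up = nn.Linear(l, d, bias=False)              # latent l -> d
        self.router = nn.Linear(l, n_experts, bias=False)  # routing happens in latent space
        self.w_in = nn.Parameter(torch.randn(n_experts, l, expert_dim) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, expert_dim, l) * 0.02)
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, d)
        z = self.down(x)                                   # (tokens, l)
        gates = F.softmax(self.router(z), dim=-1)
        topv, topi = gates.topk(self.top_k, dim=-1)        # pick K experts per token
        out = torch.zeros_like(z)
        for k in range(self.top_k):                        # plain loop for readability
            e = topi[:, k]                                 # expert id per token
            h = F.silu(torch.einsum('tl,tlf->tf', z, self.w_in[e]))
            out += topv[:, k, None] * torch.einsum('tf,tfl->tl', h, self.w_out[e])
        return self.up(out)                                # back to hidden dim d

x = torch.randn(8, 1024)
print(LatentMoESketch()(x).shape)   # torch.Size([8, 1024])
```

Because the router, the dispatch payloads, and the expert matrices all live in the smaller l instead of d, the bytes moved per token shrink, and that is precisely the budget the paper reinvests into more (and more active) experts.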

During training, MoE models also face the “expert imbalance” problem, often called router collapse. The router (the part of the network that decides which expert gets which token) may discover early on that one or two experts are slightly better than the rest and start sending nearly all tokens to those few favored experts. Only those experts then keep getting trained and improving, while the others are starved of data and remain “dumb.” The usual fix is to add load-balancing auxiliary losses that penalize the model for distributing tokens unevenly. Tuning these balancers is difficult: force it too hard and the router ignores the actual data; force it too little and the experts collapse. A minimal sketch of such an auxiliary loss follows.
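The version below is the classic Switch-Transformer-style balancing loss, shown as a generic example rather than Nemotron 3's exact recipe: it sits near 1.0 when routing is even and grows as the router concentrates tokens on a few experts.

```python
# Generic load-balancing auxiliary loss sketch (Switch-Transformer style),
# not Nemotron 3's specific training recipe.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts)."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                  # soft routing probabilities
    chosen = probs.topk(top_k, dim=-1).indices                # hard top-k assignments
    dispatch = F.one_hot(chosen, num_experts).float().sum(1)  # (tokens, experts) 0/1 mask
    frac_tokens = dispatch.mean(0) / top_k                    # share of tokens per expert
    frac_probs = probs.mean(0)                                # mean router prob per expert
    return num_experts * (frac_tokens * frac_probs).sum()     # ~1.0 when perfectly balanced

aux = load_balancing_loss(torch.randn(512, 8))
print(aux)   # typically added to the main loss with a small coefficient
```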
3. Multi-Token Prediction (MTP)
To further accelerate generation, the Super and Ultra models incorporate Multi-Token Prediction (MTP) layers.
Rather than predicting only the single next token, the model is trained to predict multiple future tokens simultaneously. This serves two critical functions:
- Richer Training Signal: It forces the model to plan several steps ahead, improving reasoning capabilities.
- Native Speculative Decoding: The auxiliary predictions serve naturally as “draft tokens”. In ablation studies, the MTP module achieved a 97% acceptance rate on the first two predicted tokens, enabling substantial speedups without requiring a separate draft model.
Read more about Speculative Decoding here
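To make the draft-and-verify idea concrete, here is a toy simulation of the loop in plain Python. The stand-in functions merely play the roles of the MTP heads and the main model; none of this is Nemotron 3's actual decoding code.

```python
# Toy draft-and-verify loop in the spirit of MTP-based speculative decoding.
import random

VOCAB = 100

def main_model_next(ctx):
    # Toy stand-in for the full model's next-token choice.
    return (ctx[-1] * 7 + 3) % VOCAB

def mtp_draft(ctx, n_draft=2):
    # Stand-in for the MTP heads: cheaply guess the next n_draft tokens.
    # They agree with the main model ~97% of the time here, mirroring the
    # first-two-token acceptance rate reported in the paper.
    drafts, c = [], list(ctx)
    for _ in range(n_draft):
        tok = main_model_next(c) if random.random() < 0.97 else random.randrange(VOCAB)
        drafts.append(tok)
        c.append(tok)
    return drafts

def verify_and_accept(ctx, drafts):
    # The main model checks the drafts (in the real system, in one parallel
    # pass): keep the agreeing prefix, substitute its own token at the first
    # mismatch, and stop there.
    accepted, c = [], list(ctx)
    for tok in drafts:
        target = main_model_next(c)
        accepted.append(target)
        c.append(target)
        if tok != target:
            break
    return accepted

random.seed(0)
context, steps = [1], 0
while len(context) < 64:
    context += verify_and_accept(context, mtp_draft(context))
    steps += 1
print(f"{len(context)} tokens in {steps} verification steps")
```

When most drafts are accepted, each verification step yields more than one token, which is where the speedup comes from.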
4. NVFP4 Training: Pushing Hardware Limits
Nemotron 3 pushes training efficiency to the limit by utilizing NVFP4 (NVIDIA 4-bit Floating Point) for the Super and Ultra models.
Unlike previous works that simulated low-precision training, Nemotron 3 uses native NVFP4 GEMMs for forward propagation, gradient calculation, and weight updates. To maintain stability, the team developed a specific mixed-precision recipe:
- Quantized: Weight, activation, and gradient tensors are quantized to NVFP4.
- High Precision: Sensitive layers — specifically Mamba output projections (which are prone to flushing to zero), QKV projections, and Attention projections — are kept in higher precision (BF16 or MXFP8).
This approach resulted in a training loss difference of <1% compared to standard BF16 training, with comparable downstream task accuracy.
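For intuition about what 4-bit floats actually look like, here is a toy NumPy simulation of block-scaled E2M1 quantization. It only mimics the numerics; the block size, scale handling, and everything else about the real NVFP4 format and its GEMM kernels are simplifications or assumptions here.

```python
# Toy simulation of block-scaled 4-bit float (E2M1) quantize-dequantize.
# Illustrative numerics only; not NVIDIA's NVFP4 spec or kernels.
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable magnitudes
BLOCK = 16                                                       # assumed scaling-block size

def fake_nvfp4(x: np.ndarray) -> np.ndarray:
    """Quantize-dequantize a 1-D tensor in blocks: each block gets one scale
    so its largest value maps onto the largest 4-bit magnitude (6.0)."""
    out = np.empty_like(x)
    for start in range(0, len(x), BLOCK):
        block = x[start:start + BLOCK]
        scale = np.abs(block).max() / E2M1_GRID[-1] + 1e-12      # per-block scale factor
        scaled = block / scale
        # snap each magnitude to the nearest representable E2M1 value
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
        out[start:start + BLOCK] = np.sign(scaled) * E2M1_GRID[idx] * scale
    return out

x = np.random.randn(64).astype(np.float32)
print(f"mean abs quantization error: {np.abs(fake_nvfp4(x) - x).mean():.4f}")
```

Even this toy version shows the core trade: every value is snapped to one of a handful of magnitudes per block, which is why the sensitive projections listed above are kept in BF16 or MXFP8.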
5. 1M Context & Agentic Capabilities
Extreme Context Length
Nemotron 3 supports a context length of up to 1 million tokens, enabling the processing of large codebases and extensive documents. Notably, because the Mamba layers provide implicit positional information, the attention layers do not require Rotary Position Embeddings (RoPE). This eliminates the out-of-distribution issues often seen when extending Transformer context windows.
Multi-Environment RL
Post-training involves Multi-environment Reinforcement Learning (RL). Instead of a staged approach (e.g., learning coding, then math), Nemotron 3 is trained on diverse environments simultaneously. This method was found to be more stable and less prone to “reward hacking” than staged training.
Reward hacking occurs when an AI model discovers a “loophole” or unintended strategy to maximize its reward score without actually achieving the desired goal or behaving correctly. This happens because the reward function is often an imperfect proxy for the true objective, leading the model to exploit flaws in how success is measured rather than learning the actual task.
Granular “Thinking” Control
Similar to other recent reasoning models, Nemotron 3 allows for inference-time reasoning budget control. Users can set a specific token budget for the model’s “thinking trace.” When the model reaches this limit, a </think> token is appended, forcing the model to conclude its reasoning and generate a response.
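A minimal sketch of what such budget enforcement can look like on the serving side is below, assuming a placeholder model interface and token ids; this is not Nemotron 3's (or any particular framework's) real API.

```python
# Sketch of inference-time reasoning budget control: count tokens emitted
# inside the thinking trace and, once the budget is spent, force the
# end-of-thinking token so the model must produce its final answer.
# `DummyModel`, `next_token`, and the token ids are all placeholders.

def generate_with_budget(model, prompt_ids, think_end_id, think_budget,
                         max_new_tokens=64):
    ids, thinking, spent = list(prompt_ids), True, 0
    for _ in range(max_new_tokens):
        if thinking and spent >= think_budget:
            ids.append(think_end_id)     # budget exhausted: inject </think>
            thinking = False
            continue
        tok = model.next_token(ids)      # one ordinary decoding step
        ids.append(tok)
        if tok == think_end_id:
            thinking = False             # model closed its reasoning on its own
        elif thinking:
            spent += 1
        if tok == model.eos_id:
            break
    return ids

class DummyModel:
    """Stand-in model so the loop above can actually run."""
    eos_id = 0
    def next_token(self, ids):
        return (ids[-1] * 31 + 7) % 50 + 1   # arbitrary tokens, never eos

print(generate_with_budget(DummyModel(), prompt_ids=[5],
                           think_end_id=42, think_budget=8))
```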
Link to original paper: https://arxiv.org/abs/2512.20856
A massive congratulations as always to the research team at NVIDIA, specifically the leadership team including Andrew Tao, Bita Darvish Rouhani, Boris Ginsburg, Bryan Catanzaro, Carlo del Mundo, Eileen Long, Eric Chung, Jane Polak Scowcroft, Jan Kautz, Jian Zhang, Joey Conway, Jonathan Cohen, Kari Briski, Mohammad Shoeybi, Mostofa Patwary, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pavlo Molchanov, Ran El-Yaniv, Ran Zilberstein, Yonatan Geifman, and Yejin Choi, alongside the extensive teams across Data, Architecture, Pretraining, and Infrastructure.
Until next time folks…
El Psy Congroo
