[D]NVIDIA Rubin proves that Inference is now a System Problem, not a Chip Problem.

Everyone is focusing on the FLOPs, but looking at the Rubin specs released at CES, it’s clear the bottleneck has completely shifted.

The Specs:

• 1.6 TB/s scale-out bandwidth per GPU (ConnectX-9).

• 72 GPUs operating as a single NVLink domain.

• HBM Capacity is only up 1.5x, while Bandwidth is up 2.8x and Compute is up 5x.
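A quick back-of-envelope on those ratios (the 1.5x / 2.8x / 5x figures are from the list above; everything else is just arithmetic):

```python
# Generation-over-generation multipliers quoted in the spec list above.
hbm_capacity_x = 1.5   # HBM capacity growth
hbm_bw_x = 2.8         # HBM bandwidth growth
compute_x = 5.0        # compute (FLOPs) growth

# Compute grows much faster than the memory feeding it, so every byte
# of resident weights has to justify ~3.3x more FLOPs than before.
compute_per_capacity = compute_x / hbm_capacity_x   # ~3.33x
compute_per_bw = compute_x / hbm_bw_x               # ~1.79x

print(f"FLOPs per byte of HBM capacity: {compute_per_capacity:.2f}x")
print(f"FLOPs per byte/s of HBM bandwidth: {compute_per_bw:.2f}x")
```

That widening gap is the whole argument: the chip got 5x faster, but the memory holding the model barely grew.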

The Thesis:

We have officially hit the point where the “Chip” is no longer the limiting factor. The limiting factor is feeding the chip.

Jensen explicitly said: “The future is orchestrating multiple great models at every step of the reasoning chain.”

If you look at the HBM-to-Compute ratio, it’s clear we can’t just “load bigger models” statically. We have to use that massive 1.6 TB/s bandwidth to stream and swap experts dynamically.
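To see why dynamic expert streaming is even plausible, here's a minimal sketch of the latency math. The 1.6 TB/s figure is from the post; the 10 GB expert size is a purely illustrative assumption:

```python
# Back-of-envelope: how long does it take to stream one MoE expert's
# weights over Rubin's scale-out link?
SCALE_OUT_BW_TBPS = 1.6  # TB/s per GPU (quoted in the post)

def stream_time_ms(expert_size_gb: float,
                   bw_tbps: float = SCALE_OUT_BW_TBPS) -> float:
    """Milliseconds to move expert_size_gb gigabytes at bw_tbps TB/s."""
    seconds = expert_size_gb / (bw_tbps * 1000)  # 1 TB = 1000 GB
    return seconds * 1000

# Hypothetical 10 GB expert shard (an assumption, not a Rubin spec):
print(f"{stream_time_ms(10):.2f} ms")  # 10 GB / 1.6 TB/s = 6.25 ms
```

Single-digit milliseconds per expert is in the same ballpark as a decode step, which is what makes swap-while-you-generate orchestration a systems problem rather than a fantasy.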

We are moving from “Static Inference” (loading weights and waiting) to “System Orchestration” (managing state across 72 GPUs in real-time).

If your software stack isn’t built for orchestration, a Rubin Pod is just a very expensive space heater.

submitted by /u/pmv143
