[D] NVIDIA Rubin proves that Inference is now a System Problem, not a Chip Problem.
Everyone is focusing on the FLOPs, but the Rubin specs released at CES make it clear the bottleneck has completely shifted.
The Specs:
• 1.6 Tb/s scale-out bandwidth per GPU (ConnectX-9).
• 72 GPUs operating as a single NVLink domain.
• HBM Capacity is only up 1.5x, while Bandwidth is up 2.8x and Compute is up 5x.
The Thesis:
We have officially hit the point where the “Chip” is no longer the limiting factor. The limiting factor is feeding the chip.
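A quick back-of-envelope using the multipliers listed above (ratios only, not absolute specs; the exact baselines depend on which Blackwell part you compare against):

```python
# Back-of-envelope on the generation-over-generation multipliers quoted above.
compute_gain   = 5.0   # FLOPs
bandwidth_gain = 2.8   # HBM bandwidth
capacity_gain  = 1.5   # HBM capacity

# To stay compute-bound, arithmetic intensity (FLOPs per byte moved)
# has to grow by roughly compute / bandwidth:
print(round(compute_gain / bandwidth_gain, 2))  # ~1.79x

# And every byte of resident weights has to justify more compute:
print(round(compute_gain / capacity_gain, 2))   # ~3.33x
```

In other words, the silicon got faster far more quickly than it got easier to feed.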
Jensen explicitly said: “The future is orchestrating multiple great models at every step of the reasoning chain.”
If you look at the HBM-to-Compute ratio, it’s clear we can’t just “load bigger models” statically. We have to use that 1.6 Tb/s of scale-out bandwidth to stream and swap experts dynamically.
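To make “stream and swap experts” concrete, here is a minimal double-buffering sketch in PyTorch: prefetch the next expert’s weights on a side CUDA stream while the current expert computes. Everything here (expert count, shapes, pinned-host staging, the ReLU stand-in for an FFN) is a placeholder; a real Rubin-class runtime would be pulling weights over NVLink or the scale-out fabric with far more bookkeeping.

```python
# Minimal sketch of double-buffered expert streaming: copy the next expert's
# weights on a side CUDA stream while the current expert runs on the default
# stream. Requires a CUDA GPU; all sizes and names are illustrative.
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

# Pretend the experts live off-device (pinned host memory here; over the
# fabric in the scenario the post describes).
experts_cpu = [torch.randn(2048, 2048).pin_memory() for _ in range(4)]

def prefetch(idx):
    """Queue an async host->device copy of expert idx on the copy stream."""
    with torch.cuda.stream(copy_stream):
        return experts_cpu[idx].to(device, non_blocking=True)

x = torch.randn(256, 2048, device=device)
next_w = prefetch(0)

for i in range(len(experts_cpu)):
    # Don't touch the weights until their copy has actually landed.
    torch.cuda.current_stream().wait_stream(copy_stream)
    w = next_w
    w.record_stream(torch.cuda.current_stream())  # keep the allocator honest
    if i + 1 < len(experts_cpu):
        next_w = prefetch(i + 1)      # overlap the next copy with compute
    x = torch.relu(x @ w)             # stand-in for the expert's FFN

torch.cuda.synchronize()
print(x.shape)
```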
We are moving from “Static Inference” (loading weights and waiting) to “System Orchestration” (managing state across 72 GPUs in real time).
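At the serving layer, “System Orchestration” looks less like one big forward pass and more like an async scheduler routing each step of a request to a different model or pool. A toy sketch, with stub coroutines standing in for real inference endpoints:

```python
# Toy sketch of per-step routing across multiple models. The "models" are
# stub coroutines; in practice these would be calls into an inference runtime.
import asyncio

async def draft_model(prompt: str) -> str:
    await asyncio.sleep(0.01)              # stand-in for a small, fast model
    return f"draft({prompt})"

async def verifier_model(prompt: str) -> str:
    await asyncio.sleep(0.05)              # stand-in for a large, slow model
    return f"verified({prompt})"

ROUTES = {"draft": draft_model, "verify": verifier_model}

async def run_chain(prompt: str, steps: list[str]) -> str:
    state = prompt
    for step in steps:
        state = await ROUTES[step](state)  # per-step routing decision
    return state

async def main():
    # Many requests in flight at once: the scheduler's job is keeping every
    # GPU in the domain fed, not running a single static model.
    results = await asyncio.gather(
        *(run_chain(f"q{i}", ["draft", "draft", "verify"]) for i in range(4))
    )
    print(results)

asyncio.run(main())
```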
If your software stack isn’t built for orchestration, a Rubin Pod is just a very expensive space heater.