Warps, Memory Hierarchy, and Why Bandwidth Beats FLOPS: How GPUs Actually Work, Part 1
A working mental model of GPU hardware for ML engineers who use these chips daily but have never traced what happens below the CUDA API

Generating a single token from a 70-billion-parameter model on an H100 requires reading roughly 140 GB of weights from memory (2 bytes per parameter at 16-bit precision) and performing about 140 billion arithmetic operations on them (one multiply and one add per parameter). That works out to about one operation per byte loaded. Memory bandwidth, not compute throughput, determines how fast that token comes out. By the […]
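The arithmetic in the opening paragraph can be checked with a few lines of Python. This is a rough sketch, not a benchmark: the function name `decode_step_estimate` and its parameters are hypothetical, and the default bandwidth (about 3.35 TB/s) and peak throughput (about 990 TFLOPS dense 16-bit) are approximate published figures for an H100 SXM, assumed here for illustration. It also assumes 16-bit weights, one multiply and one add per parameter, and ignores everything except streaming the weights once.

```python
def decode_step_estimate(
    params,
    bytes_per_param=2,       # assumption: 16-bit weights
    ops_per_param=2,         # assumption: one multiply + one add per weight
    mem_bandwidth=3.35e12,   # assumption: ~3.35 TB/s HBM bandwidth, in bytes/s
    peak_ops=9.9e14,         # assumption: ~990 TFLOPS dense 16-bit, in ops/s
):
    """Idealized cost of one decode step that streams every weight once."""
    weight_bytes = params * bytes_per_param
    ops = params * ops_per_param

    # Arithmetic intensity: operations performed per byte loaded.
    intensity = ops / weight_bytes

    # Lower bounds on time if only memory, or only compute, were the limit.
    t_memory = weight_bytes / mem_bandwidth
    t_compute = ops / peak_ops

    return {
        "weight_bytes": weight_bytes,
        "ops": ops,
        "intensity": intensity,
        "t_memory_s": t_memory,
        "t_compute_s": t_compute,
        "bound": "memory" if t_memory > t_compute else "compute",
    }


est = decode_step_estimate(70_000_000_000)
for key, value in est.items():
    print(f"{key}: {value}")
```

Under these assumed figures, moving the weights takes on the order of tens of milliseconds while the arithmetic takes a fraction of a millisecond, so the memory term dominates by a couple of orders of magnitude. Changing `bytes_per_param` (for example, to 1 for 8-bit weights) shows how the intensity and the memory-bound time shift.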