A 2-hour blackboard session watched at 1.25x speed

If you are like me and spend most of your time thinking about what happens inside the model, and not much on the hardware side of things, this video will definitely fascinate you. Dwarkesh and Reiner Pope spent two hours at a blackboard going through the actual hardware economics of training and running LLMs, and I got to learn a lot of things I previously didn't know.

One of my biggest takeaways was the 6ND formula for calculating training FLOPs, where N is the parameter count and D is the number of training tokens (please be familiar with FLOPs; here is a post that helped me learn more about them: https://todatabeyond.substack.com/p/a-gentle-introduction-to-flops-and). I knew the number, but I did not completely understand where it came from. The forward pass is 2ND. The backward pass is 4ND because you compute gradients with respect to both input matrices of each matmul, the weights and the activations, at 2ND each. That is it. 2 + 4 = 6. They talk about this in depth; I just summarized it for this post along with other things.
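To make the 2 + 4 = 6 bookkeeping concrete, here is a minimal sketch in Python. The function name `training_flops` and the example values (100B parameters, 2T tokens) are my own choices for illustration, not something from the video.

```python
def training_flops(n_params: float, n_tokens: float) -> dict:
    """Approximate training FLOPs using the 6ND rule of thumb.

    n_params: N, the number of (active) parameters.
    n_tokens: D, the number of training tokens.
    """
    # Forward pass: one multiply and one add per parameter per token -> 2ND.
    forward = 2 * n_params * n_tokens
    # Backward pass: a gradient w.r.t. the weights and a gradient w.r.t.
    # the activations, each costing about as much as the forward -> 4ND.
    backward = 4 * n_params * n_tokens
    return {"forward": forward, "backward": backward, "total": forward + backward}


if __name__ == "__main__":
    # Hypothetical example: 100B parameters trained on 2T tokens.
    flops = training_flops(100e9, 2e12)
    print(f"forward:  {flops['forward']:.2e}")
    print(f"backward: {flops['backward']:.2e}")
    print(f"total:    {flops['total']:.2e}")
```

The total is just 6 × N × D, so for these example numbers it comes out to about 1.2e24 FLOPs.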

They also showed that if you set pretraining, RL, and inference costs equal to each other (the heuristic optimum, since they trade off), and account for the fact that decode runs at roughly ⅕ the MFU of prefill, you get D_pretrain ≈ D_inference. A frontier model serving 50M tokens per second globally for two months accumulates ~200T inference tokens, so it should also be pretrained on ~200T tokens. Chinchilla optimal for a 100B active parameter model is about 2T tokens. That means frontier models are roughly 100× over Chinchilla optimal, almost entirely because of inference and RL economics, not because pretraining is wasteful in isolation.
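The arithmetic here is easy to check yourself. The sketch below is only a sanity check under my own assumptions: "two months" is taken as 60 days, and Chinchilla optimal is taken as the commonly cited ~20 tokens per parameter. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def inference_tokens(tokens_per_second: float, days: float) -> float:
    """Total tokens served at a constant global rate over a number of days."""
    return tokens_per_second * days * SECONDS_PER_DAY


def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb Chinchilla-optimal token count (assumes ~20 tokens/param)."""
    return n_params * tokens_per_param


if __name__ == "__main__":
    # 50M tokens/s for an assumed 60 days.
    served = inference_tokens(50e6, 60)
    # 100B active parameters.
    optimal = chinchilla_tokens(100e9)
    print(f"inference tokens:   {served / 1e12:.0f}T")
    print(f"Chinchilla optimal: {optimal / 1e12:.0f}T")
    # Using the post's rounded ~200T figure for the pretraining budget.
    print(f"over-training ratio: {200e12 / optimal:.0f}x")
```

With a 60-day window the served total lands around 260T, which is the same ballpark as the ~200T quoted above, and 200T over 2T gives the 100× figure.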

Finally, you get to see the API pricing analysis, accompanied by some good graphs. Gemini charges ~50% more above 200K tokens because that is the crossover where KV cache fetch time overtakes compute time and cost starts rising linearly with context. Below it you are compute-bound and cost per token is flat. From that one pricing datapoint, Reiner backs out that KV cache is roughly 1.7 KB per token on Gemini at that scale. Output tokens are 3–5× more expensive than input tokens because during decode you load all the weights just to produce one token, while during prefill you amortize that fetch across the whole sequence in parallel. The bottleneck for long context is not compute; it is memory bandwidth, and there is no clean hardware fix on the horizon. Sparse attention helps, but not infinitely.
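Here is a toy version of that crossover argument, as I understand it. This is not Reiner's actual calculation: it assumes a large batch so the weight fetch is amortized away, ignores attention FLOPs, and every hardware number in the example is made up for illustration. All function names are hypothetical.

```python
def decode_step_times(n_active_params, context_len, flops_per_s,
                      mem_bytes_per_s, kv_bytes_per_token):
    """Per-sequence compute time and KV-cache fetch time for one decoded token."""
    # Compute: ~2 FLOPs per active parameter per generated token.
    compute_s = 2 * n_active_params / flops_per_s
    # Memory: the whole KV cache for this sequence is read once per token.
    kv_fetch_s = context_len * kv_bytes_per_token / mem_bytes_per_s
    return compute_s, kv_fetch_s


def crossover_context(n_active_params, flops_per_s, mem_bytes_per_s,
                      kv_bytes_per_token):
    """Context length at which KV fetch time equals compute time."""
    return 2 * n_active_params * mem_bytes_per_s / (flops_per_s * kv_bytes_per_token)


def kv_bytes_from_crossover(n_active_params, flops_per_s, mem_bytes_per_s,
                            crossover_len):
    """Invert the relation: infer KV bytes per token from an observed crossover."""
    return 2 * n_active_params * mem_bytes_per_s / (flops_per_s * crossover_len)


if __name__ == "__main__":
    # Made-up accelerator and model: 100B active params, 1e15 FLOP/s,
    # 3e12 bytes/s of memory bandwidth, 2000 bytes of KV cache per token.
    n, flops, bw, kv = 100e9, 1e15, 3e12, 2000.0
    length = crossover_context(n, flops, bw, kv)
    print(f"crossover context: {length:,.0f} tokens")
    print(f"recovered KV bytes/token: {kv_bytes_from_crossover(n, flops, bw, length):,.0f}")
```

Below the crossover the compute term dominates and per-token cost is flat; above it the KV fetch term dominates and grows linearly with context, which is the shape the pricing tiers suggest.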

The last thing Dwarkesh and Reiner debate is whether 1M context would be prohibitively expensive at scale. DeepSeekV4 has since accomplished this. Would love to see them reconvene.

Here is the video: https://www.youtube.com/watch?v=xmkSf5IS-zw

There are also flashcards you can use to follow along, and obviously I couldn't compress all 2 hours here.

Also if you are out there and have GPUs that need to go brrr, reach out. And big shout out to Reiner Pope for making this accessible.

submitted by /u/Public_Expression_92