Learning CUDA From First Principles

Author(s): Ayoub Nainia

Originally published on Towards AI.

As a PhD student working on AI and NLP, I’ve spent quite some time using PyTorch and other high-level frameworks that abstract away the GPU. But recent discussions about whether I should learn CUDA pushed me to step back and revisit the basics: where all of this started, and why it changed. This isn’t an official “learning guide” or a course roadmap. It’s a return to first principles, sharing the insights I’m picking up as I go.

Photo by Anya Chernykh on Unsplash

TL;DR

- We stopped getting “free speedups” when CPU clock scaling hit power and heat limits in the early 2000s.
- CPUs tried to preserve the sequential world with complexity (caches, prediction, reordering). GPUs accepted the new world and leaned into parallelism.
- CUDA is explicit by design: you don’t move your whole program to the GPU. You carve out the data-parallel parts and pay attention to memory movement.
- GPU performance is mostly about data reuse and access patterns, not just “more threads.”
- The big constraints that keep coming up: latency hiding (warps/occupancy), memory traffic, divergence, and coalescing.

The day “free performance” ended

For a long time, computers got faster in a very simple way. Each new CPU ran at a higher clock speed, so the same program just finished sooner. We didn’t need to change our code. We didn’t even need to think about it. We just had to wait for next year’s CPU to show up.

That stopped in the early 2000s. Not because innovation slowed down, but because physics got in the way. Faster clocks meant too much heat and too much power. So chip makers changed direction: instead of one faster brain, they gave us several slower ones on the same chip.

Here’s the catch: most software was written as a single sequence of steps. That kind of program could only use one core, and one core wasn’t getting much faster anymore. So the old promise of “your software will run faster next year” quietly disappeared.

From that point on, performance became something we design for. If your work can be split, it speeds up. If it can’t, it stalls. A lot of today’s complexity traces back to this moment. Nothing went wrong; the rules just changed.

CPUs protected the illusion. GPUs embraced reality

Once the rules changed, hardware didn’t respond with a single solution. It split. One path tried to preserve the old world, while the other accepted that it was gone.

[Figure: How GPU acceleration works]

CPUs: preserve the sequential world

CPUs mostly took the first path. They became multicore, but each core stayed large and complex. The goal was clear: keep single-threaded programs fast. To do that, CPUs use prediction, reordering, and large caches to hide waiting. When memory is slow, the hardware works hard behind the scenes to keep one stream of instructions moving.

This works extremely well for many workloads, but it doesn’t scale indefinitely. All that complexity doesn’t increase raw computation. It mostly exists to protect the illusion that programs are still sequential.

GPUs: don’t hide waiting, outrun it

GPUs made a different choice. They stopped trying to hide waiting. Instead, GPUs run so many threads that when some are stalled on memory, others are ready to run. Waiting still happens, but it stops being the bottleneck.

That’s the core idea. GPUs aren’t faster because they’re smarter. They’re faster because they always have more work lined up.
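To make “more work lined up” concrete, here is a minimal CUDA sketch (jumping slightly ahead to the CUDA model introduced below). It’s my own illustration, not code from the original post: a vector-add kernel launched over roughly a million elements, far more threads than any GPU has cores, so the scheduler always has warps ready to run while others wait on memory. The names and sizes (vector_add, 1 << 20 elements, 256 threads per block) are arbitrary choices for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element. With ~1M elements we launch ~1M threads:
// far more than the GPU has cores, which is what lets the hardware swap in
// ready warps while others are stalled on memory.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard the tail block
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;                     // ~1M elements (arbitrary size)
    const size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);              // unified memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 4096 blocks
    vector_add<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();                   // wait for the GPU to finish

    printf("c[0] = %f\n", c[0]);               // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The interesting part isn’t the arithmetic; it’s the launch configuration. Thousands of blocks of 256 threads give the hardware enough independent work that memory stalls stop being the bottleneck.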
CUDA: acknowledging what GPUs actually are

At first, using GPUs for general computation was awkward. You had to pretend your problem was a graphics problem. CUDA changed that by acknowledging reality: GPUs weren’t graphics accelerators anymore. They were parallel computers.

CUDA didn’t only add an API. It aligned the programming model with the hardware model: lots of threads, predictable structure, explicit control over what runs where, and explicit control over data movement. Which leads to the most practical realization I’ve had so far.

A CUDA program isn’t “running on the GPU”

It starts on the CPU. It stays there for the parts that don’t parallelize well. And it jumps to the GPU only when there’s real data-parallel work to do. That jump happens through a kernel launch: create a large number of lightweight threads, do the work, return control back to the CPU.

You don’t move the whole program to the GPU. You carve out the parts where parallelism is actually worth it. The GPU doesn’t replace the CPU; it’s a tool that you invoke deliberately.

Parallel models are not competing

Revisiting parallelism made something click: parallel programming didn’t “fail” because we lacked tools. It’s hard because every approach is a tradeoff:

- MPI / message passing scales to huge clusters, but nothing is shared. Every piece of data must be sent, received, and synchronized deliberately. It scales, but at the cost of programmer effort.
- OpenMP / shared memory feels natural because threads see the same memory, but coordination and cache coherence become bottlenecks as thread counts grow. It’s convenient at the cost of scalability.
- GPU programming sits in between: inside the GPU, thousands of threads cooperate with low overhead. Between CPU and GPU, data movement is explicit and must be managed carefully.

The same theme runs through all of them: parallel models choose where to pay the cost, whether in hardware, in software, or in human effort. Once you look through that lens, the question changes. It’s no longer “which model should I learn?” It becomes “what constraints does my problem actually have?” That’s the frame I’m keeping.

GPUs exploit Data Parallelism

A big unlock for me was that many problems are already parallel before we touch any hardware. Images are a good example. Each pixel captures a separate physical event at the same moment. One pixel doesn’t wait for another to be processed. We just usually pretend it does when we write sequential code.

[Figure: Dataflow combinations for matrix multiplication]

Matrix multiplication is the cleanest illustration: each output cell is a dot product, and each dot product is independent. A 1000×1000 […]
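Picking up the matrix-multiplication example, here is a minimal kernel sketch of that independence: one thread per output cell, each computing its own dot product. This is my own naive illustration (the names and the 16×16 block size are arbitrary), not tuned code; launched this way, a 1000×1000 multiply creates a million independent threads.

```cuda
// Naive CUDA matrix multiply: one thread per output cell of C = A * B,
// with square N x N matrices stored row-major.
// Each thread computes a single dot product and depends on no other thread.
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k) {
            sum += A[row * N + k] * B[k * N + col];  // dot product of row and column
        }
        C[row * N + col] = sum;
    }
}

// Launch: a 2D grid of 16x16 blocks covering the whole N x N output.
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matmul<<<grid, block>>>(dA, dB, dC, N);
```

This naive version also shows why the TL;DR points at data reuse and access patterns: every thread re-reads rows of A and columns of B from global memory, which is exactly where concerns like reuse (for example, tiling through shared memory) and coalesced access start to dominate performance.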
