Learning CUDA From First Principles
Parallelism, kernels, and why memory matters most
As a PhD student working on AI and NLP, I’ve spent a lot of time using PyTorch and other high-level frameworks that abstract away the GPU.
But recent discussions about whether I should learn CUDA pushed me to step back and revisit the basics: where all of this started, and why it changed.
This isn’t an official “learning guide” or a course roadmap. It’s going back to first principles and sharing the insights I’m learning as I go.
TL;DR
- We stopped getting “free speedups” when CPU clock scaling hit power/heat limits in the early 2000s.
- CPUs tried to preserve the sequential world with complexity (caches, prediction, reordering). GPUs accepted the new world and leaned into parallelism.
- CUDA is explicit by design: you don’t move your whole program to the GPU. You carve out the data-parallel parts and pay attention to memory movement.
- GPU performance is mostly about data reuse and access patterns, not just “more threads.”
- The big constraints that keep coming up: latency hiding (warps/occupancy), memory traffic, divergence, and coalescing.
The day “free performance” ended
For a long time, computers got faster in a very simple way.
Each new CPU ran at a higher clock speed, so the same program just finished sooner.
We didn’t need to change our code. We didn’t even need to think about it. We just had to wait for next year’s CPU to show up.
That stopped in the early 2000s.
Not because innovation slowed down, but because physics got in the way. Faster clocks meant too much heat and too much power.
So chip makers changed direction: instead of one faster brain, they gave us several slower ones on the same chip.
Here’s the catch:
Most software was written as a single sequence of steps. That kind of program could only use one core. And one core wasn’t getting much faster anymore.
So the old promise of “your software will run faster next year” quietly disappeared.
From that point on, performance became something we design for.
If your work can be split, it speeds up. If it can’t, it stalls.
A lot of today’s complexity traces back to this moment. Nothing went wrong; the rules just changed.
CPUs protected the illusion. GPUs embraced reality
Once the rules changed, hardware didn’t respond with a single solution. It split. One path tried to preserve the old world, while the other accepted that it was gone.

CPUs: preserve the sequential world
CPUs mostly took the first path. They became multicore, but each core stayed large and complex. The goal was clear: keep single-threaded programs fast.
To do that, CPUs use prediction, reordering, and large caches to hide waiting. When memory is slow, the hardware works hard behind the scenes to make one stream of instructions keep moving.
This works extremely well for many workloads, but it doesn’t scale indefinitely. All that complexity doesn’t increase raw computation. It mostly exists to protect the illusion that programs are still sequential.
GPUs: don’t hide waiting, outrun it
GPUs made a different choice. They stopped trying to hide waiting. Instead, they run so many threads that when some are stalled on memory, others are ready to run. Waiting still happens; it just stops being the bottleneck.
That’s the core idea. GPUs aren’t faster because they’re smarter. They’re faster because they always have more work lined up.
CUDA: acknowledging what GPUs actually are
At first, using GPUs for general computation was awkward: you had to pretend your problem was a graphics problem. CUDA changed that by acknowledging reality:
GPUs weren’t graphics accelerators anymore.
They were parallel computers.
CUDA didn’t only add an API. It aligned the programming model with the hardware model:
- lots of threads,
- predictable structure,
- explicit control over what runs where,
- explicit control over data movement.
Which leads to the most practical realization I’ve had so far:
A CUDA program isn’t “running on the GPU”
It starts on the CPU. It stays there for the parts that don’t parallelize well.
And it jumps to the GPU only when there’s real data-parallel work to do.
That jump happens through a kernel launch: create a large number of lightweight threads, do the work, return control back to the CPU.
You don’t move the whole program to the GPU. You carve out the parts where parallelism is actually worth it.
The GPU doesn’t replace the CPU; it’s a tool that you invoke deliberately.
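To make this concrete, here’s a minimal sketch (the kernel and all names are mine, purely for illustration): a trivial kernel that doubles every element of an array, and the host-side function that launches it. How the data gets onto the GPU in the first place is the memory part, which comes up below.

#include <cuda_runtime.h>

// Illustrative kernel: each lightweight thread handles one element.
__global__ void doubleElements(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// The CPU side: it only "jumps" to the GPU for this one data-parallel step.
void doubleOnGpu(float* deviceX, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // enough threads to cover all n elements
    doubleElements<<<blocks, threadsPerBlock>>>(deviceX, n);   // the kernel launch
    cudaDeviceSynchronize();                                   // after this, control is back on the CPU
}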
Parallel models are not competing
Revisiting parallelism made something click: parallel programming didn’t “fail” because we lacked tools. In fact, it’s hard because every approach is a tradeoff:
- MPI/message passing scales to huge clusters, but nothing is shared. Every piece of data must be sent, received, synchronized deliberately. It basically scales at the cost of programmer effort.
- OpenMP/shared memory feels natural because threads see the same memory, but coordination and cache coherence become bottlenecks as thread counts grow. It’s convenient at the cost of scalability.
- GPU programming sits in between: inside the GPU, thousands of threads cooperate with low overhead. Between CPU and GPU, data movement is explicit and must be managed carefully.
Same theme across all of them:
Parallel models are choosing where to pay the cost:
- in hardware,
- in software,
- or in human effort.
Once you look through that lens, the question changes. It’s no longer “which model should I learn?”. It becomes “what constraints does my problem actually have?”
That’s the frame I’m keeping.
GPUs exploit data parallelism
A big unlock for me was that many problems are already parallel before we touch any hardware.
Images are a good example. Each pixel captures a separate physical event at the same moment. One pixel doesn’t wait for another to be processed. We just usually pretend it does when we write sequential code.

Matrix multiplication is the cleanest illustration:
- each output cell is a dot product,
- each dot product is independent.
A 1000×1000 multiplication isn’t one big computation; it’s a million small ones that don’t need to talk to each other.
In short, GPUs don’t create parallelism. They exploit independence that’s already in the data. This also explains why GPUs aren’t a silver bullet: as soon as computations depend tightly on each other, parallelism becomes expensive and fragile.
Kernels and indexing
Understanding kernels got much easier once I thought about where the loops went.
In a CPU matrix multiplication, you write nested loops over i and j. In CUDA, those loops often disappear. Instead, the grid of threads takes their place.
Each thread represents one logical (i, j) iteration.
All threads run the same kernel code.
What differentiates them is their index.
The simplest mental model is:
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
Instead of incrementing loop counters in software, the hardware generates all indices and launches all iterations at once.
Grids and blocks stop feeling abstract when you see them as what they are:
a structured way to assign work across a massive number of threads while keeping the kernel logic simple.
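Here’s what that looks like for matrix multiplication, as a deliberately naive sketch (square, row-major matrices; the names are mine). The i and j loops from the CPU version are gone; each thread derives its own (row, col) from its block and thread indices and computes exactly one output cell.

#include <cuda_runtime.h>

// Naive matrix multiply, C = A * B, all N x N and row-major.
// One thread per output cell: the i and j loops are replaced by the grid.
__global__ void matMulNaive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // plays the role of the old "i"
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // plays the role of the old "j"
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)                    // only the dot-product loop remains
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}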
CUDA makes memory boundaries impossible to ignore
Another thing CUDA forces you to face is that the CPU and GPU do not share memory by default. If the GPU needs data:
- you allocate space on the device,
- you copy it over,
- you run the kernel,
- you copy the result back.
At first it feels like boilerplate. Then you realize it’s the point.
Computation on the GPU is cheap, but data movement is expensive. Every cudaMemcpy is a deliberate cost. Whether CUDA helps or hurts often comes down to how much useful parallel work you do between transfers.
So CUDA is less about “running code on the GPU” and more about deciding when the cost of moving data is justified by the parallelism you get back. That makes performance a design decision, rather than something you hope the compiler figures out.
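As a sketch, here’s the host side for the matMulNaive kernel from earlier (variable names are mine). All four steps are explicit, and every cudaMemcpy is a cost you can see in the code.

#include <cuda_runtime.h>

// Host-side flow for the naive matmul: allocate, copy in, run, copy back.
void matMulOnGpu(const float* hostA, const float* hostB, float* hostC, int N) {
    size_t bytes = (size_t)N * N * sizeof(float);
    float *dA, *dB, *dC;

    cudaMalloc(&dA, bytes);                                // 1. allocate space on the device
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    cudaMemcpy(dA, hostA, bytes, cudaMemcpyHostToDevice);  // 2. copy the inputs over
    cudaMemcpy(dB, hostB, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                                    // 3. run the kernel
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matMulNaive<<<grid, block>>>(dA, dB, dC, N);

    cudaMemcpy(hostC, dC, bytes, cudaMemcpyDeviceToHost);  // 4. copy the result back
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}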
Latency hiding
GPUs deal with latency by having something else to do.
Threads execute in groups called warps. Many warps are kept resident on a streaming multiprocessor (SM).

If a warp stalls (memory, pipeline, branch), the hardware simply stops issuing instructions for it and switches to another warp that’s ready.
There is no heavyweight context switch. The swap is effectively “free.” This design choice explains a lot:
- why GPUs want many threads,
- why block size matters,
- and why under-occupying an SM hurts performance.
With this, we are not trying to make one thread fast. We are trying to make sure the machine never runs out of work.
Compute per memory access
In a naïve matrix multiplication kernel, each inner iteration loads values from global memory and does a multiply + add.
That can mean roughly “not much work” per global memory access.
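A back-of-the-envelope count on the naive kernel’s inner loop makes this concrete (assuming 4-byte floats and no cache reuse):

// Inner loop of the naive matmul kernel:
for (int k = 0; k < N; ++k)
    sum += A[row * N + k] * B[k * N + col];
// Per iteration: two global loads (8 bytes) for one multiply and one add (2 FLOPs).
// That is about 0.25 FLOP per byte of global traffic, far below what a modern GPU
// needs to keep its compute units busy, so the kernel is memory bound.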
This ratio, of how much computation you get per memory access, matters more than raw peak compute.
A GPU might advertise huge FLOPS, but if each operation requires pulling data from slow global memory, you’ll use only a small fraction of that peak.
The takeaway that keeps repeating is that GPU performance comes from reusing data, not just from parallelizing math.
If you don’t increase work per memory access, adding more threads won’t help.
Memory types: declaring a variable is a performance decision
CUDA memory types are really about scope, lifetime, and cost.
- Registers: private per thread, extremely fast, limited.
- Shared memory: visible to all threads in a block, fast, small.
- Constant memory: read-only, grid-wide, efficient under the right access patterns.
- Global memory: visible everywhere, persistent, large, slow.
What clicks here is that “where a variable lives” is not an implementation detail, but rather a performance decision.
That’s why GPU optimization often looks like “restructuring the algorithm” more than “micro-optimizing the math.”
You move frequently used data closer to the threads that need it.
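Here’s an illustrative (made-up) kernel just to show how those choices appear in code; every declaration is also a placement decision:

#include <cuda_runtime.h>

__constant__ float coeffs[16];                 // constant memory: read-only, visible to the whole grid
                                               // (filled from the host with cudaMemcpyToSymbol)

// Assumes it is launched with exactly 256 threads per block.
__global__ void whereVariablesLive(const float* input, float* output, int n) {
    __shared__ float tile[256];                // shared memory: one copy per block, fast, small
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // input and output point into global memory: visible everywhere, large, slow.
    tile[threadIdx.x] = (i < n) ? input[i] : 0.0f;
    __syncthreads();                           // every thread in the block reaches this barrier

    float x = tile[threadIdx.x] * coeffs[0];   // x lives in a register: private to the thread, fastest
    if (i < n) output[i] = x;
}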
Tiling and cooperation: the shared memory pattern
Reducing global memory traffic is often about cooperation. Global memory is large but slow. Shared memory is small but fast.
The classic pattern is tiling:
- split data into tiles that fit in shared memory,
- threads in a block cooperatively load a tile once,
- synchronize,
- reuse that tile for multiple computations,
- move to the next tile.
The key idea is reuse: load once, use many times.
This increases compute per memory access, which is what unlocks performance. Once you see this, performance stops being about “more threads” and becomes about locality + reuse + coordination.
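Here’s a sketch of tiled matrix multiplication under simplifying assumptions (square matrices with N a multiple of the tile size; names are mine). Each block cooperatively stages one tile of A and one tile of B in shared memory, then every thread reuses those values many times before the next tile is loaded.

#include <cuda_runtime.h>

#define TILE 16

// Tiled matmul: each block computes a TILE x TILE patch of C.
// Launched with dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE).
__global__ void matMulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float tileA[TILE][TILE];        // one tile of A, loaded cooperatively
    __shared__ float tileB[TILE][TILE];        // one tile of B, loaded cooperatively

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each tile: load once...
        tileA[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                       // wait until the whole tile is in shared memory

        // ...then reuse it TILE times from fast shared memory.
        for (int k = 0; k < TILE; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                       // don't overwrite tiles other threads still need
    }
    C[row * N + col] = sum;
}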
Occupancy: memory usage limits parallelism
Fast memory is limited. Registers and shared memory live on-chip, and every thread/block consumes some of them.
So there’s a tradeoff. Use more registers per thread or more shared memory per block, and you get:
- fewer threads/blocks running simultaneously,
- fewer warps available,
- weaker latency hiding.
This makes GPU performance feel different from CPU optimization. You’re constantly trading per-thread efficiency against global parallelism.
Using “too much” fast memory can slow down the whole kernel, even if each thread is doing “better” work.
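You can ask the runtime how these resource choices play out for a specific kernel. A small sketch using the tiled kernel above (the exact numbers depend on your GPU, the compiler, and the kernel’s register and shared memory usage):

#include <cuda_runtime.h>
#include <cstdio>

// Query how many blocks of a given size can be resident on one SM.
void reportOccupancy() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;                       // 16 x 16 threads, matching the tiled kernel above
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, matMulTiled, blockSize, 0);

    int activeWarps = blocksPerSm * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Resident blocks per SM: %d, active warps: %d of %d possible\n",
           blocksPerSm, activeWarps, maxWarps);
}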
Divergence: execution is collective, not individual
Threads don’t execute independently. Warps do.
All threads in a warp run the same instruction at the same time. This is extremely efficient when they follow the same control flow.
Problems start when threads in the same warp take different branches.
The hardware then runs each branch separately while disabling threads that don’t belong to it. Correct result, slower execution.
So divergence means that branching is expensive when threads in the same warp disagree, and that’s why many GPU algorithms are written to align work along warp boundaries.
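Two toy kernels (made up for illustration) show the difference. In the first, even and odd threads of the same warp disagree, so the warp runs both branches; in the second, the branch condition is constant within each 32-thread warp, so every warp takes a single path.

// Divergent: even and odd lanes of the SAME warp take different branches,
// so the warp executes both paths, masking half its threads each time.
__global__ void divergentBranch(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;
    else            x[i] = x[i] + 1.0f;
}

// Warp-aligned: the condition is the same for all 32 lanes of a warp
// (assuming blockDim.x is a multiple of 32), so no path is masked off.
__global__ void warpAlignedBranch(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] = x[i] * 2.0f;
    else                   x[i] = x[i] + 1.0f;
}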
Coalescing: bandwidth depends on how you access memory
Global memory bandwidth isn’t only about how much data you access. It depends heavily on access patterns.
DRAM prefers reading chunks of consecutive addresses. GPUs exploit this by checking how memory requests line up within a warp.
- If threads in a warp load consecutive addresses, the hardware combines requests into a small number of efficient transactions.
- If accesses are scattered, the same instruction can turn into many transactions and bandwidth collapses.
Same math, but very different memory traffic.
This ties back to shared memory: even if your algorithm “wants” a bad access pattern, threads can cooperatively load data in a coalesced way, store it in shared memory, and then reuse it freely.
Once the data is on-chip, access order largely stops being a performance concern. Rule of thumb: align thread access with how memory wants to be read.
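Two more toy kernels illustrate the contrast. Each thread moves one float in both versions, but in the first a warp’s 32 loads fall on consecutive addresses and combine into a few wide transactions, while in the second they scatter across memory.

// Coalesced: thread i reads element i, so neighbouring threads read
// neighbouring addresses and the hardware merges the requests.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, so a warp's loads are spread
// far apart and each one may need its own memory transaction.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}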
The lens I’m keeping
The more I revisit these fundamentals, the more one pattern keeps showing up:
Parallel programming models aren’t competing to be “the best.” They’re each choosing where to pay the cost:
- in hardware,
- in software,
- or in human effort.
CUDA just makes that trade explicit, especially the memory boundary and the resource tradeoffs. So the question I’m carrying forward isn’t “should I learn CUDA?” It’s:
What constraints does my problem actually have?
And where am I willing to pay the cost?
What’s next
I’m continuing this first-principles walk with a more “hands-on performance” phase:
- profiling (Nsight),
- roofline thinking (compute vs bandwidth bound),
- occupancy calculators and real resource tradeoffs,
- shared memory tiling patterns beyond matmul,
- and warp-level primitives.
If you’re also learning CUDA, feel free to steal this structure. It made everything feel less mystical and more mechanical, in a good way.
I’m also posting daily CUDA notes on X as a thread. If you prefer bite-sized updates, you can follow along there.