What Happens When You Put “n” Billion Weights in Your RAM

I was in full vibe-coding mode, headphones on, letting Copilot autocomplete half my thoughts. Prompt here, tab there. Confidence at an all-time high. It honestly felt like I had rented extra IQ from the cloud (I'd had that feeling for a while).

Then the power cut happened.

The Wi-Fi dropped and Copilot went silent. The editor stopped suggesting things. And I suddenly felt like a very ordinary human being again. Slightly underpowered. Slightly exposed. Like someone had quietly downgraded my brain to the free tier (felt like the me of a year ago).

That's when I remembered I had downloaded a 7B model and could run it locally. The first time I loaded it, I was not staring at the output, I was staring at Activity Monitor. Memory jumped by 18GB in about six seconds. GPU usage shot up to numbers I usually associate with rendering 4K video. The CPU briefly considered resigning. And I just sat there, black coffee next to me, watching Ollama calmly allocate seven billion parameters into RAM like this was a perfectly reasonable thing to do on a Tuesday night. No API, no endpoint, no dependency on someone else’s data center. Under the hood, llama.cpp had loaded a quantized model into memory, and the GPU was quietly running billions of matrix multiplications per second, all happening inside the same machine I carry in my bag.

No internet. That was the part that hit harder than I expected: nothing crashed. It just worked. Tokens appeared at 30 per second, like my laptop had quietly decided it was time to become a transformer host. And that felt more unsettling than if it had broken.

The Model File Is Frozen Learning. Not Code.

When you download a 7B model, you are not downloading logic. You are downloading seven billion numbers. Not functions, just numbers!! Each of those numbers is called a parameter, more specifically, a weight. These weights live inside matrices. And those matrices are arranged into layers: attention layers, feedforward layers, normalization layers, stacked on top of each other like floors in a building.

A typical 7B transformer has around 32 such layers. Every token you generate passes through all 32 of them, top to bottom, every single time. No shortcuts.

A transformer is the architecture behind all of this. Introduced in 2017, it replaced older sequential models with something much more parallel and much more scalable. Instead of reading text word by word, it lets every token attend to every other token through self-attention.

During training, these weights start as random noise. The model is shown trillions of tokens. It predicts the next one. It is wrong. The error is measured. Backpropagation computes gradients, small signals that tell each weight how it should change to reduce that error next time. Each update is tiny. Almost insignificant on its own. But doing that billions of times across trillions of tokens, those random numbers slowly start encoding patterns in how language behaves.

Not meaning in the human sense. Not understanding in the philosophical sense. But statistical structure. Probability distributions over what tends to come next.
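That training loop can be shrunk down to a toy you can run. This is a sketch, not a transformer: one made-up weight learning one made-up fact (y = 2x) by exactly the predict, measure, nudge cycle described above.

```python
# One gradient step in miniature: a single weight learning that y = 2x.
w = 0.0            # starts as "noise" (here: zero)
lr = 0.01          # learning rate: how tiny each nudge is
for _ in range(1000):
    x, y = 3.0, 6.0
    pred = w * x                 # the model's guess at the next value
    grad = 2 * (pred - y) * x    # d(error^2)/dw: which way to nudge w
    w -= lr * grad               # a tiny, almost insignificant update
print(round(w, 3))  # → 2.0
```

Each individual update barely moves the weight. A thousand of them quietly carve out the right value. Now imagine seven billion weights and trillions of tokens.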

Now about those numbers. Each weight is stored as a floating-point value. By default, this is 32-bit floating point, often written as FP32. That format gives you roughly 7 decimal digits of precision and a very wide numeric range. I mean, yes: why 32-bit? Because during training, gradient updates are extremely small. If your number representation is too coarse, those tiny updates get rounded away. Over billions of steps, that rounding error accumulates and the model simply fails to learn properly. Precision matters when you are shaping billions of parameters gradually.

32 bits means 4 bytes per weight.

Seven billion weights × 4 bytes ≈ 28GB. That is 28GB just to exist in memory.

Before inference. Before a single token appears and before anything intelligent happens. Twenty-eight gigabytes of compressed statistical history, sitting in RAM, waiting to be multiplied.

Your laptop, meanwhile, is pretending this is normal.
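The arithmetic is worth doing yourself. A sketch of the raw weight storage at a few precisions (real GGUF files add per-block scales and metadata on top, which is why a Q4 file lands closer to 4–5GB than the bare number below):

```python
# Back-of-envelope memory footprint of a 7B model at different precisions.
params = 7_000_000_000

for name, bytes_per_weight in [("FP32", 4), ("FP16", 2), ("Q8", 1), ("Q4", 0.5)]:
    gb = params * bytes_per_weight / 1e9
    print(f"{name}: {gb:.1f} GB")   # FP32 comes out at 28.0 GB
```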

FLOPs

A FLOP is a floating-point operation. One multiplication or addition between two decimal numbers. Now do it seven billion times in a fraction of a second. That is what your hardware is doing every time a token appears on screen.

When you send a prompt, your text gets converted into vectors, lists of numbers, one per token. Think of each token as a point floating in a very high-dimensional space. For a 7B model, that space has 4096 dimensions. Every number in that vector needs to interact with every weight in the model to produce the next token. That interaction is a matrix multiplication. You are taking your token vector and multiplying it against a giant grid of weights. Then doing it again with a different grid. Then again. Across 32 layers, each with multiple steps.

To make it concrete: in each layer, there are two main blocks doing the heavy work.

The attention block figures out which tokens in your conversation should influence each other. When I see the word bank, should I be thinking about rivers or money? It does this by running your token representations through four separate matrix multiplications: the Q, K, and V projections, plus an output projection. Each one is a massive grid of numbers being multiplied against your input. The feedforward block then takes that result, stretches it into an even higher-dimensional space, from 4096 up to 11008 dimensions, applies a mathematical squeeze, and projects it back down. Two more large matrix multiplications.
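Here is the attention block stripped to its skeleton in NumPy, with toy sizes (4 tokens, 8 dimensions instead of 4096) so you can see the four matmuls. `Wq`, `Wk`, `Wv`, `Wo` are random stand-ins for the trained projection matrices, nothing more.

```python
import numpy as np

# Minimal single-head self-attention with toy dimensions.
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))                              # 4 token vectors
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))

Q, K, V = x @ Wq, x @ Wk, x @ Wv          # three of the four matmuls
scores = Q @ K.T / np.sqrt(d)             # every token scored against every other
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: rows become probabilities
out = (weights @ V) @ Wo                  # weighted mix, then the output projection
print(out.shape)  # (4, 8)
```

Swap d=8 for 4096 and run it for every token, in every one of 32 layers, and you have the workload your GPU is chewing on.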

So that is roughly six large matrix multiplications per layer. Times 32 layers. For every single token you generate. Generating one token costs around 14 billion floating point operations. A 500 token response means your laptop just did 7 trillion operations. Silently. In about 15 seconds.
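Those totals come from the standard rule of thumb for transformer decoding: roughly 2 FLOPs (a multiply and an add) per parameter, per generated token.

```python
# FLOP arithmetic for one generated token, using the ~2 FLOPs/parameter rule.
params = 7e9
flops_per_token = 2 * params            # ≈ 14 billion
total = flops_per_token * 500           # a 500-token response
print(f"{flops_per_token:.1e} FLOPs per token, {total:.1e} FLOPs total")
```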

Now, the GPU is built for exactly this. Not because of AI, but because graphics rendering is also just matrix math at scale. Games, 3D rendering, physics simulations: all linear algebra. LLMs showed up and said, same. So GPU manufacturers accidentally built the perfect hardware for this years before transformers existed (a golden period if you are in the semiconductor business). But here is the catch: raw compute power only matters if the weights can reach the GPU cores fast enough. And for every token generated, the GPU needs to read the entire model from memory, all 4.5GB of it at Q4 quantization, to do those multiplications.

At 30 tokens per second, that means your hardware is moving roughly 135GB of data per second just to keep words appearing on your screen. If the model lives entirely in VRAM, that movement is fast and local. Generation feels fluid. Tokens arrive steadily. If the model spills into system RAM, that data has to travel across a physical connection between your RAM and GPU called the memory bus. And that bus is slow compared to what the GPU can handle internally. Suddenly you’re not limited by how fast your GPU can calculate. You are limited by how fast data can travel to it.
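You can turn this into a ceiling calculation: every token requires reading the whole model once, so tokens per second is bounded by bandwidth divided by model size. The bandwidth figures below are approximate, illustrative numbers for real hardware, not measurements.

```python
# Memory bandwidth, not compute, sets the ceiling on decode speed.
model_gb = 4.5                              # 7B model at Q4
print(model_gb * 30, "GB/s of weight traffic at 30 tok/s")   # 135.0

# Approximate peak bandwidths (illustrative):
for name, bw_gbs in [("PCIe 4.0 x16", 32),
                     ("Apple M2 Pro unified memory", 200),
                     ("RTX 4090 VRAM", 1008)]:
    print(f"{name}: ~{bw_gbs / model_gb:.0f} tokens/s ceiling")
```

Notice the first line: if the weights have to cross a PCIe bus every token, the ceiling collapses to single digits regardless of how fast the GPU itself is.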

The model is not thinking slowly. It is stuck in traffic. And that traffic is the distance between your weights and your compute measured in gigabytes per second, felt in tokens per second, experienced as you watching a progress bar and quietly reconsidering your life choices.

Quantization

So how does any of this run on consumer hardware? Approximation!! During training, weights use 32-bit precision because gradients are fragile and tiny updates need to be exact. During inference, you are not learning anymore. You are just running. So you can relax.

Quantization reduces each weight from 32 bits to 8, 4, sometimes even 2. A 7B model drops from ~28GB to roughly 4–5GB at 4-bit. Suddenly it fits inside a mid-range GPU. Bus traffic disappears. Tokens-per-second jump. Your RAM stops sending you passive-aggressive signals.

The formats you’ll live in:

  • Q4_K_M — 4-bit, medium quality.
  • Q5_K_M — Better quality, slightly bigger. For when Q4 makes you nervous.
  • Q8_0 — Nearly full quality, nearly full size. Pick your battles.
  • Q2_K — Tiny. Aggressive. The cognitive equivalent of three hours of sleep. Emergency use only.

The reason this works at all comes down to something called outlier sensitivity. Most weights in a trained model cluster in a small range; they are not uniformly distributed across all possible values. Quantization exploits this. You are not losing random precision. You are losing precision in regions where the model barely lived anyway.
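You can watch this work on fake weights. A sketch of symmetric 4-bit quantization on values that cluster near zero, the way trained weights do. This is not the actual Q4_K_M scheme (which adds per-block scales and mixed precision); it is the core idea.

```python
import numpy as np

# Simulate quantizing clustered weights down to 4 bits.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)   # clustered near zero

scale = np.abs(w).max() / 7                 # map the observed range onto [-7, 7]
q = np.round(w / scale).astype(np.int8)     # 4-bit integers (stored in int8 here)
w_hat = q.astype(np.float32) * scale        # dequantize at inference time

print(f"max |error|: {np.abs(w - w_hat).max():.4f}")   # small vs the weight range
print(f"{32 // 4}x smaller storage")
```

Because the weights hug a narrow band, the 15 available levels sit exactly where the values actually live, so the rounding error stays tiny.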

There is also mixed precision quantization, where attention layers and embeddings, which are more sensitive, stay at higher bit-width while feedforward layers get compressed harder. Tools like llama.cpp do this automatically. You don’t see it. But it is why Q4_K_M feels better than a naive 4-bit implementation would suggest.

What’s fascinating is how resilient these systems are to all of this. They don’t need every decimal to be exact. They need the statistical shape of what they learned to remain intact. Which is uncomfortable, because that is also how humans operate. We don’t remember exact details. We remember patterns. We approximate confidently and fill in the rest. Models and humans, both just vibing on approximations. Deeply reassuring.

Quantization (or lower precision) leads to higher effective throughput: the same number of operations, fed by far less memory traffic.

Why Long Context Hurts (Attention Is O(n²))

Self-attention scales as O(n²). If your context has n tokens, each token attends to every other token. Pairwise comparisons across the entire sequence. Double your context length and you don’t double the work. You roughly quadruple it!
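The quadratic claim is just counting pairs:

```python
# Attention builds an n×n score matrix: every token attends to every other.
def attention_pairs(n):
    return n * n

print(attention_pairs(4096))                           # 16,777,216 comparisons
print(attention_pairs(8192) / attention_pairs(4096))   # 4.0: double context, 4x work
```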

On top of that, the KV cache, the stored key/value tensors from previous tokens, grows linearly with context size. More context means more memory, more compute, and slower inference. I have had a model forget the beginning of a long conversation because the context window filled up. So when you see 128K context window, that’s capability. It’s also a quadratic compute curve waiting to show up in your tokens-per-second graph at the worst possible moment. The cloud abstracts this away cleanly. Your laptop does not. Your laptop will demonstrate exactly what quadratic scaling feels like, in real time, with no warning.
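The KV cache cost is also simple arithmetic. A sketch for a LLaMA-7B-shaped model, assuming 32 layers, hidden size 4096, FP16 cache values, and no grouped-query attention:

```python
# KV cache memory: one key and one value vector per layer, per token.
layers, hidden, bytes_per_val = 32, 4096, 2
per_token = 2 * layers * hidden * bytes_per_val
print(per_token / 1e6, "MB per token")        # ~0.52 MB

for ctx in [2048, 8192, 32768]:
    print(f"{ctx:6d} tokens -> {per_token * ctx / 1e9:.1f} GB of cache")
```

That is memory on top of the weights themselves, which is why a long conversation can choke a machine that loaded the model comfortably.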

It Stops Being Magic

Calling an API keeps everything comfortable. You send text, you get text back, and somewhere far away a GPU does the actual work. You never see it. You never feel it. It is just a response that arrives, and you move on. Running local removes that distance. You watch 18GB disappear in six seconds. You make a decision about quantization not because someone recommended it but because your hardware has a ceiling and you have personally met it. You realise context length is not a setting, it is a cost that accumulates quietly until it is not quiet anymore.

And somewhere in that process, the mental model shifts. It stops being a service you consume and starts being a system you understand. Not because you read the documentation, but because you watched it run out of memory at 2am and had to figure out why. That is a different kind of knowing.

We are in a phase where most people interact with AI the way they interact with electricity. It exists. It works. Someone else manages the infrastructure. You just use it. And that is fine. It is convenient. But there is something specific that happens when you run the model yourself. When the weights are on your drive and the matrix multiplications are happening on your silicon and the only thing between you and the output is your own hardware making decisions in real time. You stop guessing what is happening inside the system. You start knowing!! And in 2026, where so much of what we build sits on top of models we do not control, on infrastructure we do not own, that feels like a useful thing to have.

Run something local this week. Watch how the internal architecture behaves, and you will understand the real potential of these models.


What Happens When You Put “n” Billion Weights in Your RAM was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
