ML Internals: The Week I Stopped Treating Embeddings as a Black Box

I spent years calling models via APIs: Bedrock, SageMaker, Anthropic. This week was about understanding what’s actually happening inside the box, and why that matters for building systems around it.

Coming into this project, I had built production AI systems on AWS Bedrock using flows and agents. I knew how to wire models into applications. What I didn’t know was what the models were actually doing with the text I gave them, why RAG works at all, or how to reason about a model’s behavior from the inside rather than treating it as a black box. This week was about opening the box.

:::tip
GOAL OF THE WEEK

Understand how a model fits into an ML system. Specifically, the foundations that make RAG work: tokenization, embeddings, and semantic similarity.

Secondary goal: run a real training loop and develop the habit of reading model outputs critically rather than just reporting accuracy numbers.

:::

1. The Reading That Reframed Everything

Before writing any code, I spent a day reading. Two pieces in particular changed how I thought about the whole project.

Jay Alammar’s The Illustrated Word2Vec was the first time I understood, visually, how meaning becomes a number. The core idea: if you train a model to predict what words appear near other words across a large corpus, similar words end up with similar numerical representations. “Server” and “instance” land near each other in vector space because they appear in similar contexts. The meaning of a word emerges from its company, not from any explicit definition.
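The distributional idea can be seen without training anything. Below is a toy sketch using raw co-occurrence counts over a six-sentence corpus — not Word2Vec’s learned dense vectors, so treat it as an illustration of “meaning from company” only:

```python
import numpy as np

corpus = [
    "the server crashed overnight",
    "the instance crashed overnight",
    "the server restarted quickly",
    "the instance restarted quickly",
    "i ate pizza yesterday",
    "i ate pasta yesterday",
]

vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

# Count co-occurrences within a +/-2 word window.
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if j != i:
                counts[idx[w], idx[words[j]]] += 1

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(counts[idx["server"]], counts[idx["instance"]]))  # high: shared contexts
print(cos(counts[idx["server"]], counts[idx["pizza"]]))     # low: disjoint contexts
```

“Server” and “instance” appear in identical contexts here, so their count vectors are nearly identical; “pizza” shares no contexts with either. Word2Vec gets to the same place by prediction rather than counting, which is what makes its vectors dense and generalizable.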

Chip Huyen’s “Real-time Machine Learning: Challenges and Solutions” reframed what I was actually trying to learn. The article is about the tradeoff between batch prediction and online prediction: whether you precompute predictions ahead of time or compute them at request time. It also works through the infrastructure, latency, and freshness implications of each choice. Coming from Bedrock, I had been treating inference as a single thing: you send a request, you get a response. The article made clear that how and when predictions are generated is itself an architectural decision with significant downstream consequences.

The insight that landed hardest: the same model can behave very differently depending on whether it’s serving batch or online predictions, because the data it sees, the latency it must hit, and the feedback loop it operates within are all different. ML engineering is as much about those surrounding decisions as about the model itself. The model is one component.

> How and when predictions are generated is an architectural decision. The model is one component. Serving strategy, latency, and data freshness are the others.
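The batch-vs-online distinction is easy to sketch. A toy contrast, where `predict` is a placeholder standing in for any model call (nothing here is a real serving API):

```python
def predict(user_id):
    """Placeholder for a real model call."""
    return hash(user_id) % 100

# Batch: precompute on a schedule for known entities, serve by lookup.
batch_cache = {user: predict(user) for user in ["alice", "bob"]}

def serve_batch(user_id):
    # Fast and cheap, but stale between runs and missing for new users.
    return batch_cache.get(user_id)

# Online: compute at request time.
def serve_online(user_id):
    # Always fresh, but every request pays the inference latency.
    return predict(user_id)

print(serve_batch("alice"))   # precomputed score
print(serve_batch("carol"))   # None: new user, no batch prediction yet
print(serve_online("carol"))  # computed on the spot
```

The `None` for an unseen user is the whole tradeoff in miniature: batch serving is only as fresh and as complete as the last precompute run.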

2. What I Built — Day by Day

The week had four building blocks. Each one turned out to be directly relevant to RAG, though I didn’t frame it that way until the end of the week.

Embeddings and Semantic Similarity

The first thing I built was an embedding explorer. Load a sentence encoder model, convert text to vectors, measure similarity between them.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

model_name = "sentence-transformers/all-MiniLM-L6-v2"
# ~22M-parameter model, 384-dim output, runs on CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pool over token embeddings (excluding padding)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    embedding = summed / mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(embedding, p=2, dim=1).squeeze()

e1 = get_embedding("The server crashed at 3am")
e2 = get_embedding("The system went down overnight")
e3 = get_embedding("I love pizza")

print(F.cosine_similarity(e1, e2, dim=0))  # 0.83 — similar meaning
print(F.cosine_similarity(e1, e3, dim=0))  # 0.12 — unrelated

The scores aren’t the interesting part. What’s interesting is running it on domain-specific text — AWS error messages, runbook entries, incident reports — and seeing what the model considers semantically similar. “OOMKilled pod” and “container killed due to memory” score high. “OOMKilled pod” and “S3 access denied” score low. The model has never seen your specific runbook, but it learned these relationships from the broader corpus it was trained on.

Three conceptual questions came up while building this that turned out to be the most important things I learned all week.

:::info
→ WHY DOES EVERY SENTENCE PRODUCE THE SAME 384 NUMBERS?

The 384 dimensions aren’t counting words or tokens. Each embedding is a coordinate in a learned space, like GPS coordinates representing a location. Every sentence, whether two words or two hundred, gets mapped to a position in that 384-dimensional space. Similar meanings land close together. The number of words in the input determines how much information goes in; it doesn’t determine the size of the output representation.


→ DOES “THE” HAVE THE SAME EMBEDDING IN “THE SERVER CRASHED” AND “THE NICE PERSON”?

No, completely different vectors. This is the entire point of transformers. At each attention layer, every token looks at every other token and updates its own representation based on what it sees. “The” in “The server crashed” absorbs context from “server” and “crashed”. Its 384-dim vector encodes the whole sentence’s influence. Static word embeddings (Word2Vec) give “The” the same vector everywhere. Transformer contextual embeddings give it a different vector in every sentence.


→ WHAT’S ACTUALLY BEING TRAINED VS. WHAT’S BEING COMPUTED?

The attention weights (how much token A pays attention to token B) are computed fresh for every sentence. What gets trained are the Query, Key, and Value projection matrices that produce those weights. This distinction matters for understanding what fine-tuning actually changes: you’re adjusting the matrices that determine how tokens relate to each other, not storing sentence-specific lookup tables.

:::
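To make the trained-vs-computed distinction concrete, here is a minimal scaled dot-product attention sketch in NumPy. The random matrices stand in for the learned Q/K/V projections; a real transformer adds multiple heads, output projections, layer norms, and masking:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

# These three matrices are what training adjusts (random stand-ins here).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attention(x):
    """x: (seq_len, d) token vectors -> context-mixed token vectors."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d)       # recomputed fresh for every input
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                  # every token absorbs its context

tokens = rng.normal(size=(4, d))        # a 4-token "sentence"
out = attention(tokens)
print(out.shape)  # (4, 8): one updated vector per token
```

The same `W_q`, `W_k`, `W_v` are reused for every sentence; only `scores` and `weights` change per input. That is exactly why a token like “the” ends up with a different vector in every sentence.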

A Tokenization Correction Worth Documenting

I wrote in my notes that “unbelievable” tokenizes as ["un", "##believable"] under BERT’s WordPiece tokenizer. A reader caught this and ran the code themselves:

tokenizer.tokenize("unbelievable") → ['unbelievable']  # single token — common enough to be in the 30k vocab

tokenizer.tokenize("OOMKilled") → ['oo', '##mk', '##illed']  # technical term — splits into noise

tokenizer.tokenize("CloudFormation") → ['cloud', '##formation']  # AWS service — partially recognized

tokenizer.tokenize("DynamoDB") → ['dynamo', '##db']  # brand name — splits into subwords

WordPiece uses a greedy longest-match strategy. Words that appeared frequently enough in BERT’s training corpus get their own vocabulary slot and stay intact. Words that didn’t — AWS service names, Kubernetes terms, internal jargon — get split into whatever subwords are closest. Those subwords often carry no meaningful signal.
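The greedy longest-match strategy itself fits in a few lines. A sketch with a tiny made-up vocabulary — not BERT’s real 30k vocab, and omitting details like lowercasing and maximum word length:

```python
def wordpiece(word, vocab):
    """Greedy longest-match subword split, WordPiece-style."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:                  # try the longest span first
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand          # continuation marker
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]                # no piece fits: whole word unknown
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##able", "##believ", "unbelievable",
         "cloud", "##formation"}
print(wordpiece("unbelievable", vocab))    # ['unbelievable']: whole word in vocab
print(wordpiece("unable", vocab))          # ['un', '##able']: split into subwords
print(wordpiece("cloudformation", vocab))  # ['cloud', '##formation']
```

A frequent word survives intact because it has its own vocabulary slot; anything else gets whatever longest prefixes happen to match, which is why domain terms fragment.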

This has a direct production consequence: off-the-shelf embedding models have degraded representations for domain-specific vocabulary. A query for “OOMKilled pod” might not reliably retrieve a chunk that mentions “OOMKilled” because both are tokenized into noise before the model ever processes them. It’s one of the real failure modes of RAG systems on technical corpora, and you can observe it directly by running your domain terms through the tokenizer before building anything.

:::info
PRODUCTION IMPLICATION

Before deploying a RAG system on a specialized corpus — medical records, legal contracts, internal engineering docs — run your most important domain terms through the embedding model’s tokenizer. If critical terms split into meaningless subwords, the model has weak representations for them. You’ll need either a domain-fine-tuned embedding model or a workaround like term boosting at retrieval time.


:::

The First Training Loop

Day 3 was about running a real training loop — not to build a classifier I cared about, but to internalize the structure of the loop itself. The task: fine-tune DistilBERT on AG News, a four-class news classification dataset.

from transformers import (AutoModelForSequenceClassification,
    AutoTokenizer, TrainingArguments, Trainer)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score

dataset  = load_dataset("ag_news", split="train[:2000]")
test_ds  = load_dataset("ag_news", split="test[:500]")

model_name = "distilbert-base-uncased"
tokenizer  = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)

dataset = dataset.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=4)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=2,
        per_device_train_batch_size=16,
        eval_strategy="epoch",  # renamed from evaluation_strategy in v4.36+
        logging_steps=50),
    train_dataset=dataset,
    eval_dataset=test_ds,
    compute_metrics=lambda p: {
        "accuracy": accuracy_score(
            p.label_ids, np.argmax(p.predictions, axis=-1))})

trainer.train()

The training output after two epochs:

epoch 1: loss=0.496, eval_loss=0.336, accuracy=89.6%
epoch 2: loss=0.275, eval_loss=0.364, accuracy=88.6%

The eval loss went up at epoch 2 while training loss continued down. That’s overfitting — the model started memorizing specific training examples rather than learning generalizable patterns. Epoch 1 was the optimal stopping point. I would not have noticed this without watching both metrics simultaneously, which is exactly why logging both matters.

:::info
THE HABIT THAT MATTERS

Always track training loss and eval loss separately. Training loss will almost always trend down; it’s the quantity the optimizer is directly minimizing. Eval loss tells you whether the model is learning generalizable patterns or memorizing the training set. When eval loss starts rising while training loss falls, you’ve crossed from learning into memorization. That’s your stopping signal.

:::
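The stopping signal can be turned into a mechanical rule. A minimal patience-based check — a sketch of the idea, not the transformers `EarlyStoppingCallback`; the loss numbers are from this week’s run:

```python
def best_epoch(eval_losses, patience=1):
    """Return the 1-indexed epoch with the lowest eval loss,
    stopping once it has failed to improve `patience` times."""
    best, best_i, waited = float("inf"), 0, 0
    for i, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best, best_i, waited = loss, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i

# This week's run: eval loss rose at epoch 2, so epoch 1 is the stop.
print(best_epoch([0.336, 0.364]))  # 1
```

In practice you’d also keep the checkpoint from the best epoch rather than the last one, which is what `load_best_model_at_end=True` does in the Trainer.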

Evaluation Beyond Accuracy

Day 4 was about building the habit of inspecting failures: the cases where the model’s predictions didn’t match the true labels.

from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

predictions = trainer.predict(test_ds)
preds  = np.argmax(predictions.predictions, axis=-1)
labels = predictions.label_ids

label_names = ["World", "Sports", "Business", "Sci/Tech"]
print(classification_report(labels, preds, target_names=label_names))

# Find and inspect failures manually
test_texts = load_dataset("ag_news", split="test[:500]")["text"]
wrong = [(test_texts[i], labels[i], preds[i])
         for i in range(len(labels)) if labels[i] != preds[i]]

for text, true, pred in wrong[:10]:
    print(f"TRUE: {label_names[true]}")
    print(f"PRED: {label_names[pred]}")
    print(f"TEXT: {text[:150]}\n")

The manual failure inspection turned out to be the most valuable part of the day. Reading ten wrong predictions reveals systematic patterns that the confusion matrix only hints at. Business articles with technology themes were classified as Sci/Tech. Sports articles about sports business were classified as Business. These aren’t random errors — they’re the model’s blind spots, and they’re interpretable.

:::tip
THE HABIT TO BUILD FROM DAY ONE

Always read your model’s failures, not just its metrics. Metrics compress information. Failures contain it. Ten manually inspected wrong predictions will tell you more about what your model has and hasn’t learned than any aggregate number. This habit applies equally to classifiers, language models, and RAG systems — where “failures” means retrieved chunks that didn’t actually answer the question.

:::

3. How This Connects to RAG

At the end of the week, I built a small semantic search system over a corpus of AWS ops-related sentences. It wasn’t framed as RAG — it was framed as an experiment to understand embeddings. But it’s exactly how RAG works under the hood:

import faiss
from sentence_transformers import SentenceTransformer

corpus = [
    "EC2 instance failed health check",
    "RDS database connection pool exhausted",
    "Lambda function cold start latency spike",
    "S3 bucket access denied error",
    "ECS task stopped unexpectedly",
    "DynamoDB throttling exceptions detected",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(corpus)

index = faiss.IndexFlatIP(embeddings.shape[1])
faiss.normalize_L2(embeddings)
index.add(embeddings.astype("float32"))

def search(query, k=3):
    q_emb = model.encode([query])
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb.astype("float32"), k)
    return [(corpus[i], scores[0][j]) for j, i in enumerate(indices[0])]

results = search("server running out of memory")
# → ("Lambda function cold start latency spike", 0.61)
# → ("ECS task stopped unexpectedly", 0.58)
# → ("RDS database connection pool exhausted", 0.54)

The query “server running out of memory” retrieved Lambda cold start and ECS task stopped — neither contains the exact words “server,” “running,” or “memory.” The model retrieved them because they’re semantically related. That’s the core capability that makes RAG more powerful than keyword search.

The pipeline for what I built — and what every RAG system does — is the same five steps:

  1. Chunk

    Split document into pieces

  2. Embed

    Text → 384-dim vector

  3. Index

    Store in vector DB (FAISS)

  4. Retrieve

    Query → find similar chunks

  5. Generate

    LLM answers with context

Bedrock Knowledge Bases, Pinecone, LlamaIndex, LangChain’s retrieval chain — they all implement these five steps. Understanding steps 2 and 4 deeply — what an embedding is, what similarity means, what causes retrieval to fail — is what separates an engineer who can debug a RAG system from one who can only configure it.
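Steps 1–4 are what the FAISS snippet above implements; step 5 is the only missing piece. A sketch of how retrieved chunks become a grounded prompt — `search` is any retriever with the signature used above, and `call_llm` is a hypothetical stand-in for any completion API:

```python
def answer(query, search, call_llm, k=3):
    """Retrieve top-k chunks, pack them into a prompt, generate."""
    chunks = [text for text, _score in search(query, k)]
    prompt = (
        "Answer using only the context below.\n\n"
        "Context:\n"
        + "\n".join(f"- {c}" for c in chunks)
        + f"\n\nQuestion: {query}\nAnswer:"
    )
    return call_llm(prompt)

# Wiring it up with stand-ins (any retriever and any completion API fit):
fake_search = lambda q, k: [("ECS task stopped unexpectedly", 0.58)][:k]
fake_llm = lambda prompt: f"<completion for {len(prompt)}-char prompt>"
print(answer("why did my service die?", fake_search, fake_llm, k=1))
```

The generation step is deliberately dumb here; the quality of the answer is bounded by step 4, which is why retrieval failures (like the tokenization issue above) dominate RAG debugging.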

4. What I Learned About ML Systems

The Chip Huyen framing held up across the whole week. Batch vs. online prediction is one dimension of the decision — but the broader point is that the pieces I built — tokenizer, embedding model, vector index, training loop, evaluation harness — are all components of larger systems. The model is one of them.

ML SYSTEM COMPONENTS — WHAT EACH ONE BREAKS ON

| COMPONENT | WHAT CAN GO WRONG | WEEK 1 SIGNAL |
|---|---|---|
| Tokenizer | Domain terms split into noise | OOMKilled → ['oo', '##mk', '##illed'] |
| Embedding model | Wrong model for domain; stale representations | Technical corpus needs domain-tuned model |
| Chunking strategy | Too large = dilution, too small = lost context | Scaling policy fact diluted in full-doc embed |
| Training data | Noisy labels, distribution mismatch | Week 2 discovers this the hard way |
| Evaluation | Wrong metrics, measuring memorization not generalization | Eval loss rising at epoch 2 — overfitting signal |
| The model | Capacity limits, wrong architecture | Often not the failure point |

The model appears last in the table intentionally. Of the failure modes I encountered across the week, only one was directly about the model itself. The others were about the system around it — how text gets prepared before the model sees it, how the training data was constructed, how outputs get evaluated. Chip Huyen’s article frames this in terms of serving architecture — batch vs. online, latency vs. freshness — but the underlying point holds at every level: the model is one component, and often not the one that breaks.

5. What’s Next

This week gave me a clearer picture of how a model fits into a larger ML system — and how much happens before and after the model that determines whether the system actually works. Tokenization, embedding quality, chunking strategy, evaluation discipline — any of these can be the failure point, and the model often isn’t.

I’m reading Designing Machine Learning Systems by Chip Huyen alongside these experiments. The book covers the full lifecycle — data, training, deployment, monitoring — and I find it grounds the practical work in a way that pure hands-on experimentation doesn’t. Reading without building feels abstract. Building without reading feels like guessing. Both together is the pace I’m trying to keep.

Writing these posts is the third piece. Explaining something clearly enough to publish forces a precision that running code doesn’t. Several things I thought I understood this week turned out to be fuzzier than they appeared when I had to write them down.

Next week: fine-tuning a model on a real domain problem, and figuring out when that’s actually worth doing over just prompting a larger model. More soon.
