Building a Transformer From Scratch in Annotated PyTorch
A complete walkthrough of implementing the original Attention Is All You Need encoder-decoder Transformer—no torch.nn.Transformer, no shortcuts.
The 2017 paper “Attention Is All You Need” by Vaswani et al. fundamentally changed natural language processing by replacing recurrent networks with a purely attention-based architecture. Yet for most practitioners, the Transformer remains a black box—something you import from Hugging Face and fine-tune. This article walks through a full from-scratch PyTorch implementation of the original encoder-decoder Transformer, inspired by the Harvard NLP “Annotated Transformer” pedagogical style, covering every component from masking to multi-head attention to the Noam learning rate schedule.
The full runnable code is available as a Kaggle notebook:
Why Build It From Scratch?
Using torch.nn.Transformer or nn.MultiheadAttention abstracts away the most pedagogically important details: how masks are shaped, why we scale dot products, what teacher forcing really does at the tensor level, and how the Noam scheduler warms up. Building it from scratch forces you to confront every design decision in the original paper—and in the process, you gain an intuition that no API reference can give you. This notebook uses no high-level transformer wrappers and implements the full encoder-decoder stack manually.
Environment and Reproducibility
Before writing a single model layer, the notebook sets up full reproducibility — a practice often skipped in tutorials but critical in production ML.
```python
def set_seed(seed: int, deterministic: bool = False):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    if deterministic:
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.benchmark = False

set_seed(SEED, deterministic=DETERMINISTIC_DEBUG)
```
Seeds are set across Python’s random, NumPy, and all PyTorch CUDA streams. The notebook runs on a Tesla P100-PCIE-16GB GPU with PyTorch 2.9.0+cu126 and Python 3.12. A CONFIG dict centralizes all hyperparameters—model size, batch size, training steps, and visualization settings—making it easy to switch between a small_model (2 layers, d_model=128) for quick iteration and the paper_like_model (6 layers, d_model=512) from the original paper.
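A plausible shape for such a centralized dict, with values drawn from elsewhere in the article (the exact key names in the notebook may differ; these are illustrative):

```python
# Hypothetical sketch of the CONFIG dict described above; key names are
# illustrative, values come from the hyperparameters quoted in this article.
CONFIG = {
    "model": {
        "n_layers": 2,       # small_model; paper_like_model uses 6
        "d_model": 128,      # paper_like_model uses 512
        "num_heads": 4,      # paper_like_model uses 8
        "dropout": 0.1,
    },
    "train": {
        "batch_size": 128,
        "max_steps": 2000,
        "warmup_steps": 400,
        "grad_clip": 1.0,
    },
}

# Sanity check used throughout multi-head attention: d_model must split evenly.
assert CONFIG["model"]["d_model"] % CONFIG["model"]["num_heads"] == 0
```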
Dataset: Synthetic Reverse-Sequence Task
The notebook auto-detects whether Multi30K-style bilingual files are attached and falls back to a synthetic sequence-reversal task. While this sounds trivial, it is deliberately chosen: reversing a sequence requires the model to learn positional order and attend across the full source — exactly the skills needed for real translation.
```python
def make_synthetic_reverse_pairs(n_pairs=25000, vocab_size=40, ...):
    vocab = [f'w{i}' for i in range(1, vocab_size + 1)]
    for _ in range(n_pairs):
        L = rng.randint(min_len, max_len)
        src_tokens = [rng.choice(vocab) for _ in range(L)]
        tgt_tokens = list(reversed(src_tokens))
        pairs.append((' '.join(src_tokens), ' '.join(tgt_tokens)))
```
With 25,000 pairs split 90/5/5 across train/val/test, the task produces 22,500 training examples with sequences of length 3–10. The vocabulary is intentionally tiny (40 tokens + 4 special tokens = 44 total), which makes the model fast to train and the attention maps easy to interpret visually.
Tokenization and Vocabulary
The Vocab class is a clean, minimal implementation: it stores a stoi (string-to-index) dict and itos (index-to-string) list, and exposes encode_tokens and decode_ids methods.
```python
class Vocab:
    def __init__(self, stoi, itos):
        self.stoi = stoi
        self.itos = itos

    def encode_tokens(self, tokens):
        unk_id = self.stoi[UNK_TOKEN]
        return [self.stoi.get(tok, unk_id) for tok in tokens]
```
Vocabulary is built from training data only, with min_freq=2 to suppress rare tokens and deterministic ordering by (-frequency, token_asc) to guarantee identical vocabularies across runs. Special tokens <PAD>, <BOS>, <EOS>, <UNK> occupy indices 0–3 consistently in both source and target vocabularies — a requirement enforced by assertions, since the collate function uses a single PAD_ID for both sides.
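The construction described above can be sketched in a few lines of stdlib Python (the function name `build_vocab` is illustrative, not the notebook’s exact API):

```python
from collections import Counter

SPECIALS = ['<PAD>', '<BOS>', '<EOS>', '<UNK>']  # pinned to indices 0-3

def build_vocab(token_lists, min_freq=2):
    """Illustrative sketch: frequency filter plus deterministic ordering."""
    freq = Counter(tok for toks in token_lists for tok in toks)
    # Sort by (-frequency, token) so ties break alphabetically -> same
    # vocabulary on every run, independent of dict iteration order.
    kept = sorted((t for t, c in freq.items() if c >= min_freq),
                  key=lambda t: (-freq[t], t))
    itos = SPECIALS + kept
    return {t: i for i, t in enumerate(itos)}, itos

stoi, itos = build_vocab([['a', 'b', 'a'], ['b', 'c']])
assert itos[:4] == SPECIALS and stoi['<PAD>'] == 0
assert 'c' not in stoi  # frequency 1 < min_freq, mapped to <UNK> at encode time
```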
Masking: The Most Misunderstood Part
Masking is where most from-scratch implementations go wrong. The notebook implements three distinct mask types with clear shape annotations, all using the True = can attend convention:
| Mask | Shape | Purpose |
| --- | --- | --- |
| src_key_padding_mask | (B, 1, 1, S) | Prevents encoder self-attention from attending to <PAD> source positions |
| tgt_causal_mask | (B, 1, T, T) | Lower-triangular mask + target padding for decoder self-attention |
| tgt_cross_mask | (B, 1, T, S) | Prevents decoder cross-attention from attending to source pads |
The causal mask is particularly important. It combines three boolean tensors—a key padding mask, a query padding mask, and a lower-triangular future-blocking mask—via bitwise AND:
```python
def make_tgt_self_attention_mask(tgt_in_ids, pad_id=0):
    B, T = tgt_in_ids.shape
    key_mask = (tgt_in_ids != pad_id).unsqueeze(1).unsqueeze(2)    # (B, 1, 1, T)
    query_mask = (tgt_in_ids != pad_id).unsqueeze(1).unsqueeze(3)  # (B, 1, T, 1)
    causal = torch.tril(torch.ones((T, T), dtype=torch.bool,
                                   device=tgt_in_ids.device)).unsqueeze(0).unsqueeze(1)
    return key_mask & query_mask & causal
```
The shapes broadcast cleanly across the heads dimension during attention computation. A mask visualization shows three panels: the source padding bar, the characteristic triangular target causal mask, and the rectangular cross-attention mask.
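The no-future-leakage property of the combined mask can be checked with a tiny pure-Python model of the same AND logic (lists standing in for the boolean tensors, one sequence, no batch or head dims):

```python
PAD = 0
tgt_in = [5, 7, 9, PAD]   # one target sequence, T = 4, last position is padding
T = len(tgt_in)

key_mask = [tok != PAD for tok in tgt_in]     # which keys may be attended to
query_mask = [tok != PAD for tok in tgt_in]   # which queries are real tokens
causal = [[s <= t for s in range(T)] for t in range(T)]  # lower-triangular

# Bitwise AND of the three masks, as in make_tgt_self_attention_mask.
mask = [[query_mask[t] and key_mask[s] and causal[t][s] for s in range(T)]
        for t in range(T)]

# No query position may see a future key ...
assert all(not mask[t][s] for t in range(T) for s in range(T) if s > t)
# ... and the padded query row is fully blocked.
assert mask[3] == [False, False, False, False]
```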
Teacher Forcing and the DataLoader
In the collate_fn, teacher forcing is implemented at the tensor level with a single index shift:
```python
tgt_in = tgt_padded[:, :-1]   # [BOS, y1, y2, ...]
tgt_out = tgt_padded[:, 1:]   # [y1, y2, ..., EOS]
```
tgt_in is the decoder input (what the model sees at each step), and tgt_out is the prediction target (what the model must output). This shift is verified at runtime with assertions that check the BOS token is always in position 0 of tgt_in and that the shift relationship holds across all batch elements. Example batch shapes with batch_size=128 and max_len=12 look like:
```text
src_ids:              (128, 12)
tgt_in_ids:           (128, 11)
tgt_out_ids:          (128, 11)
src_key_padding_mask: (128, 1, 1, 12)
tgt_causal_mask:      (128, 1, 11, 11)
tgt_cross_mask:       (128, 1, 11, 12)
```
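The shift and its runtime assertions are easy to reproduce on a single plain-Python sequence (a sketch of the logic, not the notebook’s batched collate_fn):

```python
BOS, EOS, PAD = 1, 2, 0
tgt_padded = [BOS, 11, 12, 13, EOS, PAD]  # one padded target sequence

tgt_in = tgt_padded[:-1]    # what the decoder sees: [BOS, y1, y2, y3, EOS]
tgt_out = tgt_padded[1:]    # what it must predict: [y1, y2, y3, EOS, PAD]

assert tgt_in[0] == BOS            # decoder input always starts from BOS
assert tgt_in[1:] == tgt_out[:-1]  # the one-step shift relationship holds
```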
Scaled Dot-Product Attention
The heart of the Transformer is scaled dot-product attention. The formula is Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V.
The scaling factor 1/√d_k is not cosmetic — without it, dot products grow large in magnitude as d_k increases, pushing softmax into regions with near-zero gradients. The implementation applies the mask with masked_fill(~mask, -1e9) before softmax, so blocked positions get effectively zero weight after the exponential.
```python
def scaled_dot_product_attention(Q, K, V, mask=None, dropout=None):
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)  # (B, H, T, S)
    if mask is not None:
        scores = scores.masked_fill(~mask, -1e9)
    attn_weights = F.softmax(scores, dim=-1)
    if dropout is not None:
        attn_weights = dropout(attn_weights)
    return attn_weights @ V, attn_weights
```
Returning both the output and the raw attention weights enables the per-head, per-layer attention heatmaps produced later in the notebook.
Multi-Head Attention
Multi-head attention runs h parallel attention functions on different linear projections of Q, K, V:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
The key implementation detail is the reshape trick: instead of instantiating h separate linear layers, you project with a single (d_model, d_model) matrix and split into heads via .view() + .transpose() to reach shape (B, H, T, d_k). This keeps the full projection as a single GPU matrix multiply, then splits logically into heads. The small model uses num_heads=4 with d_model=128, giving d_k = 32 per head; the paper model uses 8 heads with d_model=512 (d_k = 64).
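The index arithmetic behind that view/transpose can be modeled with plain nested lists (a sketch of the shape logic only — the notebook does this on tensors):

```python
B, T, d_model, H = 2, 3, 8, 4
d_k = d_model // H

# Toy activations: x[b][t][c] encodes position t and channel c.
x = [[[t * d_model + c for c in range(d_model)] for t in range(T)]
     for _ in range(B)]

# Equivalent of x.view(B, T, H, d_k).transpose(1, 2) -> (B, H, T, d_k):
# head h owns the contiguous channel slice [h*d_k, (h+1)*d_k).
heads = [[[x[b][t][h * d_k:(h + 1) * d_k] for t in range(T)]
          for h in range(H)] for b in range(B)]

assert (len(heads), len(heads[0]), len(heads[0][0]), len(heads[0][0][0])) \
    == (B, H, T, d_k)
# Head 1 at position 0 sees channels 2..3 of the original vector.
assert heads[0][1][0] == x[0][0][d_k:2 * d_k]
```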
Sinusoidal Positional Encoding
Since the Transformer has no recurrence, it needs a way to inject position information. The original paper uses fixed sinusoidal encodings:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
These are computed once and stored as a non-trainable buffer. Even and odd dimensions use sine and cosine respectively at different frequencies, so nearby positions have similar encodings but every absolute position is unique. The notebook registers this via register_buffer, meaning it moves to GPU with the model but is excluded from gradient updates and optimizer state.
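The table itself is a few lines of stdlib math (a sketch of the computation; the notebook builds the same values as a tensor buffer):

```python
import math

d_model, max_len = 8, 16
pe = [[0.0] * d_model for _ in range(max_len)]
for pos in range(max_len):
    for i in range(0, d_model, 2):  # i is the even dimension index 2i
        angle = pos / (10000 ** (i / d_model))
        pe[pos][i] = math.sin(angle)       # even dims: sine
        pe[pos][i + 1] = math.cos(angle)   # odd dims: cosine

# Position 0 encodes as (0, 1, 0, 1, ...) since sin(0)=0 and cos(0)=1.
assert all(v == 0.0 for v in pe[0][0::2])
assert all(v == 1.0 for v in pe[0][1::2])
# Every absolute position gets a distinct encoding.
assert pe[1] != pe[2]
```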
Encoder and Decoder Layers
A single EncoderLayer consists of: multi-head self-attention → Add & Norm → feed-forward → Add & Norm. The feed-forward sublayer is a two-layer MLP with ReLU and an inner dimension of d_ff (typically 4× d_model):
```python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))
```
A DecoderLayer has three sublayers: masked self-attention (causal), cross-attention over encoder output, and feed-forward. Each sublayer is wrapped in a residual connection with layer normalization (the original paper uses the post-norm form, LayerNorm(x + Sublayer(x))). The full encoder stacks N encoder layers; the full decoder stacks N decoder layers. The small model uses N=2 for fast iteration, while the paper model uses N=6.
The Noam Learning Rate Schedule
The original paper introduced a custom learning rate schedule that warms up linearly then decays proportionally to the inverse square root of the step: lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)).
This is implemented as a custom LambdaLR scheduler. The notebook uses warmup_steps=400 (reduced from the paper’s 4000 to fit within the 2000-step training budget on Kaggle) and Adam with betas=(0.9, 0.98) and eps=1e-9 — exactly as specified in the paper. Label smoothing with ε=0.1 is applied via CrossEntropyLoss(label_smoothing=0.1), which prevents the model from becoming overconfident on the training labels.
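The schedule itself is pure arithmetic and easy to sanity-check in isolation (a sketch with the article’s d_model=128 and warmup_steps=400; the notebook wraps an equivalent lambda in LambdaLR):

```python
def noam_lr(step, d_model=128, warmup=400, factor=1.0):
    step = max(step, 1)  # avoid 0 ** -0.5 on the first scheduler call
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup ** -1.5)

# Linear warmup up to `warmup`, inverse-sqrt decay afterwards:
# the peak learning rate sits exactly at step == warmup.
assert noam_lr(100) < noam_lr(400)
assert noam_lr(2000) < noam_lr(400)
```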
Training Loop
The training loop runs for 8 epochs with a maximum of 2000 gradient steps. After each epoch, validation loss is computed and a model checkpoint is saved if the validation loss improves. Gradient clipping with max_norm=1.0 prevents exploding gradients, which can occur in early training before the warmup schedule has stabilized the learning rate.
```python
for step, batch in enumerate(train_loader):
    model.train()
    logits = model(src_ids, tgt_in_ids, ...)
    loss = criterion(logits.view(-1, tgt_vocab_size), tgt_out_ids.view(-1))
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), CONFIG['train']['grad_clip'])
    optimizer.step()
    scheduler.step()
```
Greedy Decoding and Attention Visualization
At inference time, the notebook uses greedy decoding: the decoder generates tokens one step at a time by taking the argmax of the output distribution, feeding each predicted token back as input for the next step. The encoder is run once to produce memory, then the decoder iterates until it emits <EOS> or reaches max_tgt_len.
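The loop structure can be sketched independently of the model: here `step_fn` is a hypothetical stand-in for one decoder forward pass over the growing prefix followed by argmax (the notebook calls the real model instead):

```python
BOS, EOS = 1, 2

def greedy_decode(step_fn, max_tgt_len=12):
    """step_fn(prefix) -> next token id (stand-in for decoder + argmax)."""
    ys = [BOS]
    for _ in range(max_tgt_len):
        nxt = step_fn(ys)
        ys.append(nxt)      # feed the prediction back in as input
        if nxt == EOS:      # stop as soon as end-of-sequence is emitted
            break
    return ys

# Toy "model" that emits a fixed answer, then EOS.
answer = iter([7, 5, 3, EOS])
assert greedy_decode(lambda ys: next(answer)) == [BOS, 7, 5, 3, EOS]
```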
After training, per-head and per-layer attention heatmaps are plotted. These visualizations are the most powerful tool for inspecting what the model has learned: encoder self-attention heatmaps reveal which source tokens attend to each other, while cross-attention heatmaps show exactly which source positions the decoder relies on when generating each target token. With the synthetic reversal task, cross-attention maps display a clear anti-diagonal pattern — position t in the target attending most strongly to position S-t in the source.
Baselines and Benchmarking
The notebook benchmarks the from-scratch Transformer against two baselines:
- GRU Seq2Seq: A bidirectional GRU encoder with a unidirectional GRU decoder and Bahdanau-style attention.
- Framework Transformer: A model built using PyTorch’s built-in torch.nn.Transformer, used as a sanity check that the from-scratch implementation converges to similar quality.
Both baselines are trained with the same step budget and the same optimizer settings for a fair comparison. The from-scratch Transformer typically achieves lower validation loss than the GRU baseline on the reversal task because the task demands global positional awareness — exactly what self-attention is designed for.
Error Analysis and Common Bug Checklist
The final notebook cell runs a systematic bug-check suite that covers the most common from-scratch mistakes:
- Mask convention mismatch: verify True = attend vs True = block is consistent throughout.
- BOS/EOS position correctness: check the first token of tgt_in is always BOS_ID.
- Causal mask leakage: confirm no future token is visible during decoder self-attention.
- Embedding scale: the notebook applies √d_model scaling to embeddings before adding positional encodings, matching the paper’s specification.
- Parameter count: print total trainable parameters to confirm the model is neither too small (underfitting) nor too large (GPU OOM).
Key Takeaways for Hackernoon Readers
Building a Transformer from scratch with annotated PyTorch teaches lessons that are invisible when using high-level APIs. The masking system is more nuanced than most tutorials suggest — three distinct mask types with different broadcast shapes, all derived from the same padding information. The Noam scheduler is not optional; without warmup, Adam with the paper’s eps=1e-9 diverges early in training. And teacher forcing, while simple as a one-line index shift, has profound implications: at inference time the model sees its own (possibly incorrect) predictions, creating a train-test mismatch that greedy decoding partially mitigates.
The notebook is fully self-contained, runs in under 3 minutes on a Kaggle P100 GPU, and produces interpretable attention visualizations that make the model’s reasoning transparent. Whether you are studying for an ML exam, writing a research paper, or simply want to understand what nn.MultiheadAttention actually does under the hood, building the Transformer from scratch in annotated PyTorch remains one of the most rewarding exercises in modern deep learning.
Check out the full notebook on Kaggle: