The Transformer Family Version 2.0

Many new Transformer architecture improvements have been proposed since my last post on “The Transformer Family” about three years ago. Here I did a big refactoring and enrichment of that 2020 post: I restructured the hierarchy of sections and enriched many of them with more recent papers. Version 2.0 is a superset of the old version and about twice its length.

Notations

Symbol Meaning
$d$ The model size / hidden state dimension / positional encoding size.
$h$ The number of heads in multi-head attention layer.
$L$ The segment length of input sequence.
$N$ The total number of attention layers in the model; not considering MoE.
$\mathbf{X} \in \mathbb{R}^{L \times d}$ The input sequence where each element has been mapped into an embedding vector of shape $d$, same as the model size.
$\mathbf{W}^k \in \mathbb{R}^{d \times d_k}$ The key weight matrix.
$\mathbf{W}^q \in \mathbb{R}^{d \times d_k}$ The query weight matrix.
$\mathbf{W}^v \in \mathbb{R}^{d \times d_v}$ The value weight matrix. Often we have $d_k = d_v = d$.
$\mathbf{W}^k_i, \mathbf{W}^q_i \in \mathbb{R}^{d \times d_k/h}; \mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$ The weight matrices per head.
$\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$ The output weight matrix.
$\mathbf{Q} = \mathbf{X}\mathbf{W}^q \in \mathbb{R}^{L \times d_k}$ The query embedding inputs.
$\mathbf{K} = \mathbf{X}\mathbf{W}^k \in \mathbb{R}^{L \times d_k}$ The key embedding inputs.
$\mathbf{V} = \mathbf{X}\mathbf{W}^v \in \mathbb{R}^{L \times d_v}$ The value embedding inputs.
$\mathbf{q}_i, \mathbf{k}_i \in \mathbb{R}^{d_k}, \mathbf{v}_i \in \mathbb{R}^{d_v}$ Row vectors in query, key, value matrices, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$.
$S_i$ A collection of key positions for the $i$-th query $\mathbf{q}_i$ to attend to.
$\mathbf{A} \in \mathbb{R}^{L \times L}$ The self-attention matrix between an input sequence of length $L$ and itself. $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$.
$a_{ij} \in \mathbf{A}$ The scalar attention score between query $\mathbf{q}_i$ and key $\mathbf{k}_j$.
$\mathbf{P} \in \mathbb{R}^{L \times d}$ The position encoding matrix, where the $i$-th row $\mathbf{p}_i$ is the positional encoding for input $\mathbf{x}_i$.
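
To make the notation concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention following the definitions above; the function names and toy dimensions are illustrative choices, not part of the original formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    X:   (L, d)   input sequence of L embedding vectors
    W_q: (d, d_k) query weight matrix W^q
    W_k: (d, d_k) key weight matrix W^k
    W_v: (d, d_v) value weight matrix W^v
    Returns the attended output (L, d_v) and the attention matrix A (L, L).
    """
    Q = X @ W_q                                   # (L, d_k) query embeddings
    K = X @ W_k                                   # (L, d_k) key embeddings
    V = X @ W_v                                   # (L, d_v) value embeddings
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (L, L); a_ij = score of q_i on k_j
    return A @ V, A

# Toy usage with small, made-up dimensions.
L, d, d_k, d_v = 6, 16, 16, 16
rng = np.random.default_rng(0)
out, A = self_attention(
    rng.normal(size=(L, d)),
    rng.normal(size=(d, d_k)),
    rng.normal(size=(d, d_k)),
    rng.normal(size=(d, d_v)),
)
assert out.shape == (L, d_v) and np.allclose(A.sum(axis=-1), 1.0)
```

Each row of $\mathbf{A}$ sums to 1, so the output for position $i$ is a convex combination of the value vectors, weighted by how strongly $\mathbf{q}_i$ matches each key.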

Transformer Basics

The Transformer (which will be referred to as the “vanilla Transformer” to distinguish it from other enhanced versions; Vaswani, et al., 2017) model has an encoder-decoder architecture, as commonly used in many NMT models. Later, simplified Transformer variants were shown to achieve great performance on language modeling tasks, such as the encoder-only BERT and the decoder-only GPT.
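
Both the encoder and decoder stack multi-head attention layers built from the per-head weights $\mathbf{W}^q_i, \mathbf{W}^k_i, \mathbf{W}^v_i$ and the output projection $\mathbf{W}^o$ defined in the notation table. The sketch below shows one way to assemble them, again as an illustrative NumPy approximation rather than a faithful reference implementation.

```python
import numpy as np
from scipy.special import softmax

def multi_head_self_attention(X, W_q_heads, W_k_heads, W_v_heads, W_o):
    """Multi-head self-attention from the per-head weights in the notation table.

    X:         (L, d) input sequence
    W_q_heads: list of h matrices, each (d, d_k // h)
    W_k_heads: list of h matrices, each (d, d_k // h)
    W_v_heads: list of h matrices, each (d, d_v // h)
    W_o:       (d_v, d) output weight matrix W^o
    Returns the layer output of shape (L, d).
    """
    head_outputs = []
    for W_q, W_k, W_v in zip(W_q_heads, W_k_heads, W_v_heads):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v                    # per-head projections
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (L, L) per-head attention
        head_outputs.append(A @ V)                             # (L, d_v / h)
    # Concatenate the h heads back to (L, d_v), then project with W^o to (L, d).
    return np.concatenate(head_outputs, axis=-1) @ W_o
```

In practice the per-head projections are usually implemented as one big $(d, d_k)$ matrix that is reshaped into $h$ heads, but the explicit loop above mirrors the per-head notation directly.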
