Summary: Walks through a single head of the attention mechanism — queries, keys, values, softmax, masking — then generalises to multi-head attention, using GPT-3’s numbers throughout.
Why attention exists
- Token embeddings out of the lookup table are context-free. The word mole has one vector whether it means a burrowing animal, a unit of substance, or a skin growth.
- Attention’s job: let surrounding tokens push information into each token’s vector so its final representation encodes its contextual meaning.
- This matters because only the last vector’s state is used to predict the next token (see src-3b1b-llms-ch2-transformers). That last vector must have absorbed everything relevant from the whole context window.
Single-head attention — the running example
Running example: “A fluffy blue creature roamed the verdant forest.” The illustrative behaviour: adjectives pushing information into their corresponding nouns.
- Queries. Each embedding is multiplied by a query matrix to produce a query vector in a smaller space (128-dim in GPT-3, vs 12,288-dim embeddings). Conceptually, “what am I looking for?”
- Keys. Each embedding is multiplied by a key matrix to produce a key in the same small space. Conceptually, “what do I offer?”
- Scores. For each (query, key) pair, take the dot product. High dot product ⇒ key strongly answers this query. This produces a context_size × context_size grid of raw scores.
- Scale and softmax. Divide by √d_k (√128 in GPT-3) for numerical stability, then apply softmax column-by-column so each column is a probability distribution. This grid is the attention pattern.
- Masking. For autoregressive training/inference, set all entries where later tokens would influence earlier ones to −∞ before softmax → zero after. Prevents later tokens from leaking backwards during parallel training.
- Values. Each embedding is multiplied by a value matrix W_V to produce a value vector in embedding space. Each token’s refinement ΔE is the attention-weighted sum of the value vectors, and it is added to the original embedding to produce E′ = E + ΔE.
In the compact notation from the paper: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V.
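A minimal NumPy sketch of one causal self-attention head, following the steps above. Toy dimensions rather than GPT-3 scale, a plain square W_V for now (the low-rank factoring comes in the next section), and the grid oriented with one row per query (the transpose of the column-wise view described above, which is the usual convention in code):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(E, W_Q, W_K, W_V):
    """One head of causal self-attention.

    E        : (context_size, d_embed) token embeddings
    W_Q, W_K : (d_embed, d_head) query / key projections
    W_V      : (d_embed, d_embed) value projection (square here; see the
               low-rank factoring in the next section)
    Returns the per-token refinements, shape (context_size, d_embed).
    """
    Q = E @ W_Q                          # "what am I looking for?"
    K = E @ W_K                          # "what do I offer?"
    d_head = Q.shape[-1]

    scores = Q @ K.T / np.sqrt(d_head)   # (context, context) raw scores

    # Causal mask: each token may attend only to itself and earlier tokens.
    # Forbidden entries are set to -inf so softmax turns them into exact zeros.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    attn = softmax(scores, axis=-1)      # attention pattern, each row sums to 1

    V = E @ W_V                          # value vectors in embedding space
    return attn @ V                      # refinements, added to E by the caller

# Toy usage (GPT-3 would use d_embed = 12,288 and d_head = 128).
rng = np.random.default_rng(0)
ctx, d_embed, d_head = 8, 64, 16
E = rng.normal(size=(ctx, d_embed))
W_Q = rng.normal(size=(d_embed, d_head))
W_K = rng.normal(size=(d_embed, d_head))
W_V = rng.normal(size=(d_embed, d_embed))
E_refined = E + single_head_attention(E, W_Q, W_K, W_V)
```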
Self- vs cross-attention (side note)
Everything above is technically self-attention: Q, K, V all drawn from the same sequence. A variant called cross-attention appears in encoder–decoder models (e.g. translation): queries come from one sequence, keys and values from another, and there is typically no causal mask because there is no notion of “future” between the two sequences. See self-attention-vs-cross-attention.
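For contrast, a small sketch of the cross-attention variant (reusing softmax from the sketch above; the sequence names are illustrative). The only changes are where Q versus K and V come from, and the absence of the causal mask:

```python
def cross_attention(E_decoder, E_encoder, W_Q, W_K, W_V):
    # Queries come from one sequence (e.g. the partial translation),
    # keys and values from the other (e.g. the source sentence).
    Q = E_decoder @ W_Q
    K = E_encoder @ W_K
    V = E_encoder @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (len_decoder, len_encoder)
    # No causal mask: neither sequence is "in the future" of the other.
    return softmax(scores, axis=-1) @ V
```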
Low-rank value projection
- Naïvely, W_V would be square (d_embed × d_embed = 12,288 × 12,288 ≈ 151M params per head). Too many.
- In practice, W_V is factored into two matrices of rank 128: a down-projection W_V↓ (d_embed → 128) and an up-projection W_V↑ (128 → d_embed). Same parameter budget as W_Q and W_K (the arithmetic sketch after this list makes the savings concrete).
- This constrains the value map to be low-rank, which is what gets called “low-rank value transformation” in the literature.
- Terminology note. In real implementations and papers, the per-head value matrix refers only to the down-projection W_V↓; all the up-projections W_V↑ per head are stapled together into a single output matrix W_O that belongs to the whole multi-head block. Same computation as 3b1b’s framing, different bookkeeping — see Terminology gotcha value matrix vs output matrix.
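The parameter arithmetic behind the factoring, as a small self-contained check (the GPT-3 numbers are those used throughout these notes):

```python
d_embed, d_head = 12_288, 128

# Naïve square value matrix, one per head.
naive_value_params = d_embed * d_embed                        # 150,994,944 ≈ 151M

# Factored: down-projection (d_embed → d_head) then up-projection (d_head → d_embed).
# Their product is a d_embed × d_embed map of rank at most d_head = 128,
# i.e. the value transformation is constrained to be low-rank.
factored_value_params = d_embed * d_head + d_head * d_embed   # 3,145,728 ≈ 3.1M

print(f"naive / factored = {naive_value_params / factored_value_params:.0f}x")  # 48x
```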
Multi-head attention
- A full attention block runs many single heads in parallel, each with its own W_Q, W_K, W_V↓, W_V↑ (sketched in code after this list). GPT-3 uses 96 heads per block.
- Each head produces its own contribution; all are summed and added to the original embedding.
- Parameter count (GPT-3):
- W_Q, W_K, W_V↓, W_V↑, each: 12,288 × 128 ≈ 1.57M params per head
- ~6.3M params per head × 96 heads ≈ 600M per attention block
- × 96 layers ≈ 58B params across all attention (about a third of GPT-3’s 175B)
- Context size scales quadratically. Attention patterns are context_size × context_size — doubling the context quadruples this cost. The main structural reason scaling context is hard.
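A sketch of the multi-head block in the same style, reusing single_head_attention from the first sketch (the per-head loop is for clarity; real implementations batch all heads into a few large matrix multiplications), followed by the parameter arithmetic from the list above:

```python
def multi_head_attention(E, heads):
    """heads: list of (W_Q, W_K, W_V_down, W_V_up) tuples, one per head."""
    delta = np.zeros_like(E)
    for W_Q, W_K, W_V_down, W_V_up in heads:
        # Each head uses its low-rank value map W_V_down @ W_V_up and
        # contributes its own refinement; all refinements are summed onto E.
        delta += single_head_attention(E, W_Q, W_K, W_V_down @ W_V_up)
    return E + delta

# GPT-3 parameter arithmetic for the attention layers.
d_embed, d_head, n_heads, n_layers = 12_288, 128, 96, 96
per_matrix = d_embed * d_head          # ≈ 1.57M (each of W_Q, W_K, W_V_down, W_V_up)
per_head   = 4 * per_matrix            # ≈ 6.3M
per_block  = n_heads * per_head        # ≈ 600M
total      = n_layers * per_block      # ≈ 58B of GPT-3's 175B
```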
Why this wins
Attention is the core of transformers not because of any one behaviour, but because it is extremely parallelisable. All token interactions happen as matrix multiplications, so GPUs can chew through whole sequences at once — unlike RNNs/LSTMs that processed one token at a time.