Summary: A full attention block in a transformer is not one attention head but many, each with its own independent $W_Q$, $W_K$, $W_V$, running in parallel. Their outputs all get added to the residual stream, giving the model the capacity to learn many distinct ways that context changes meaning.
Why many heads (run in parallel)?
A single attention head can learn one pattern of “which tokens should influence which, and how?”. But language context works in many ways at once:
- Adjectives update their nouns.
- Pronouns need to find their antecedents.
- “they crashed the” changes what kind of car is expected (structural damage, broken glass).
- The word “Harry” near “wizard” vs. near “Queen” and “Sussex” picks out different people (Harry Potter vs. Prince Harry).
Forcing a single head to learn all of these simultaneously would be a nightmare. Running many heads in parallel, each with its own small parameter budget, lets different heads specialise — and summing their contributions gives the token vector access to all of them at once.
The model is given the capacity to learn many distinct ways that context changes meaning.
How it’s wired
For each block and each head $h$:
- Independent $W_Q$, $W_K$, $W_V$ (the last factored into $W_{V\downarrow}$ and $W_{V\uparrow}$).
- Each head computes its own attention pattern $A^{(h)}$ and its own refinement $\Delta E_i^{(h)}$ for every token position $i$.
- The block’s output for position $i$ is the sum over heads, added to the incoming embedding:

$E_i' = E_i + \sum_h \Delta E_i^{(h)}$
Dimensionality example with GPT-3:
- Original embedding: $E_i \in \mathbb{R}^{12{,}288}$
- Refinement from head $h$: $\Delta E_i^{(h)} \in \mathbb{R}^{12{,}288}$ — each head proposes a nudge in the full embedding space
- Number of heads: 96
- Updated embedding: $E_i' \in \mathbb{R}^{12{,}288}$ — same dimensionality, moved to encode context
Summing 96 proposed changes — every head independently produces a 12,288-dim refinement; they all get added together and then added to the original:

$E_i' = E_i + \Delta E_i^{(1)} + \Delta E_i^{(2)} + \cdots + \Delta E_i^{(96)}$
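A minimal NumPy sketch of this wiring, with toy dimensions assumed for speed ($d_{model}=48$, $d_{head}=8$, 6 heads; GPT-3 would be 12,288 / 128 / 96) — the names `W_Q`, `W_Vd`, `W_Vu` are my own labels, not standard identifiers:

```python
# Toy multi-head attention in the 3b1b framing: each head owns its own
# W_Q, W_K, and a factored value map (W_Vd down, W_Vu up); every head
# emits a full-width refinement and they are all summed into the
# residual stream.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, seq_len = 48, 8, 6, 10

E = rng.normal(size=(seq_len, d_model))  # incoming token embeddings

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

heads = [{
    "W_Q":  rng.normal(size=(d_head, d_model)) / np.sqrt(d_model),
    "W_K":  rng.normal(size=(d_head, d_model)) / np.sqrt(d_model),
    "W_Vd": rng.normal(size=(d_head, d_model)) / np.sqrt(d_model),  # value "down"
    "W_Vu": rng.normal(size=(d_model, d_head)) / np.sqrt(d_head),   # value "up"
} for _ in range(n_heads)]

def refinement(E, h):
    """One head's full-width nudge for every position (causal mask omitted)."""
    Q, K = E @ h["W_Q"].T, E @ h["W_K"].T      # (seq, d_head) each
    V = E @ h["W_Vd"].T                        # small value vectors
    A = softmax(Q @ K.T / np.sqrt(d_head))     # (seq, seq) attention pattern
    return (A @ V) @ h["W_Vu"].T               # back up to (seq, d_model)

delta = sum(refinement(E, h) for h in heads)   # sum of per-head nudges
E_new = E + delta                              # added to the residual stream
print(E_new.shape)                             # (10, 48): same space, new meaning
```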
Equivalently, as a matrix multiply (the $W_O$ view) — concatenate all heads’ small outputs $o_i^{(h)} \in \mathbb{R}^{128}$ and multiply by the output matrix:

$\Delta E_i = W_O\,\mathrm{concat}\!\left(o_i^{(1)}, o_i^{(2)}, \dots, o_i^{(96)}\right)$

The concatenated vector on the right has $96 \times 128 = 12{,}288$ entries — one 128-dim chunk per head. $W_O$ is square ($12{,}288 \times 12{,}288$) but is really 96 stacked $W_{V\uparrow}$ matrices (each $12{,}288 \times 128$) placed side by side. See terminology gotcha.
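Continuing the toy sketch above, the same numbers through the concat-and-$W_O$ bookkeeping — building $W_O$ by stacking the heads’ up-projections side by side and checking that the two views agree:

```python
# Each head's small output, kept small instead of projected up per head.
small_outs = [softmax((E @ h["W_Q"].T) @ (E @ h["W_K"].T).T / np.sqrt(d_head))
              @ (E @ h["W_Vd"].T) for h in heads]         # (seq, d_head) each

concat = np.concatenate(small_outs, axis=1)               # (seq, n_heads*d_head)
W_O = np.concatenate([h["W_Vu"] for h in heads], axis=1)  # (d_model, n_heads*d_head)

delta_via_WO = concat @ W_O.T                             # one big matmul
assert np.allclose(delta_via_WO, delta)                   # same as summing heads
```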
GPT-3: $96$ heads per block, $96$ blocks. So the model runs $96 \times 96 = 9{,}216$ distinct attention heads in total.
Parameter count (GPT-3)
Per head:
- $W_Q$, $W_K$: each $128 \times 12{,}288 \approx 1.57$M
- $W_{V\downarrow}$: $128 \times 12{,}288$, same ~1.57M
- $W_{V\uparrow}$: $12{,}288 \times 128$, same ~1.57M
- Total per head: $4 \times 1.57\text{M} \approx 6.3$M
Per block: $96 \times 6.3\text{M} \approx 604$M. Across all 96 layers: ~58B parameters devoted to attention. About a third of GPT-3’s 175B.
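The same arithmetic as a quick sanity check in Python (exact counts, no toy dimensions):

```python
# GPT-3 attention parameter count, recomputed from the figures above.
d_model, d_head, n_heads, n_layers = 12_288, 128, 96, 96

per_matrix = d_head * d_model      # 1,572,864 — W_Q, W_K, W_Vdown (W_Vup is the transposed shape)
per_head   = 4 * per_matrix        # 6,291,456 ≈ 6.3M
per_block  = n_heads * per_head    # 603,979,776 ≈ 604M
total      = n_layers * per_block  # 57,982,058,496 ≈ 58B

print(f"{total:,} attention params, {total / 175e9:.0%} of 175B")  # ~33%
```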
Low-rank values (why factor $W_V$?)
Without factoring, $W_V$ would be square ($12{,}288 \times 12{,}288 \approx 151$M parameters) per head — roughly 48× the parameter budget of that head’s $W_Q$ and $W_K$ combined. Multiply by 96 heads × 96 layers and the arithmetic doesn’t close.
Factoring through the small $128$-dimensional space:
- Drops per-head value params to match $W_Q$ and $W_K$ (~3.1M split across the two factors).
- Across all heads in a block, the up-projection matrices $W_{V\uparrow}$ can be stacked and viewed as a single big output projection (often written $W_O$ in papers and textbooks) — same total operation count, different bookkeeping.
- The overall per-head linear value map $W_{V\uparrow} W_{V\downarrow}$ has rank at most $128$, which is still much less than the full $12{,}288$ — hence “low-rank.”
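A quick demonstration of the rank cap on the toy matrices from the sketch above — the composed value map fills a full $d_{model} \times d_{model}$ matrix on paper, but cannot exceed rank $d_{head}$:

```python
# Everything is squeezed through the d_head-dim bottleneck, so the rank
# of the composed map is capped at d_head (8 in the toy setup, 128 in GPT-3).
full_value_map = heads[0]["W_Vu"] @ heads[0]["W_Vd"]  # (48, 48)
print(np.linalg.matrix_rank(full_value_map))          # 8, not 48
```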
Terminology gotcha: “value matrix” vs output matrix
Theory vs practice
3b1b’s “low-rank value transformation” framing — each head owning its own $W_{V\downarrow}$ and $W_{V\uparrow}$ — is conceptually clean, but it is not how papers and real implementations write it. The standard convention is:
- The per-head value matrix is only the “down-projection” (maps down from embedding space → the smaller $128$-dim space in GPT-3).
- When a paper says “the value matrix of head $h$,” this is what it means; often denoted $W_V^{(h)}$ (Attention Is All You Need writes $W_i^V$).
- All “up-projections” are concatenated into one giant matrix, the output matrix $W_O$, which belongs to the multi-head block as a whole, not to any single head.
- $W_O$ maps the concatenated head outputs back up to $12{,}288$-dim embedding space.
The two framings compute exactly the same function — $\sum_h W_{V\uparrow}^{(h)} o_i^{(h)}$ vs $W_O\,\mathrm{concat}_h\!\left(o_i^{(h)}\right)$ — but with the per-head/per-block split drawn in different places. If you only read 3b1b and then open the original Attention Is All You Need paper, this renaming is the single most confusing thing. Same operation, different bookkeeping.
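Concretely, on the toy matrices from the first sketch: a paper-convention implementation would store one fused value matrix and one fused output matrix, and slicing them per head recovers 3b1b’s matrices exactly (a sketch of the bookkeeping only, not any particular library’s layout):

```python
# Paper convention: one fused W_V (all down-projections stacked) and one
# fused W_O (all up-projections side by side), owned by the block.
W_V_fused = np.concatenate([h["W_Vd"] for h in heads], axis=0)  # (n_heads*d_head, d_model)
W_O_fused = np.concatenate([h["W_Vu"] for h in heads], axis=1)  # (d_model, n_heads*d_head)

# Slicing by head recovers the 3b1b per-head matrices: same numbers,
# different place to draw the per-head boundary.
h0_down = W_V_fused[:d_head, :]    # "value matrix of head 0" (paper sense)
h0_up   = W_O_fused[:, :d_head]    # head 0's column slice of W_O
assert np.array_equal(h0_down, heads[0]["W_Vd"])
assert np.array_equal(h0_up,   heads[0]["W_Vu"])
```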
What you actually get
- Parallelism within a block. All heads can run simultaneously on a GPU — this is one reason attention is GPU-friendly (see the batched sketch after this list).
- Specialisation. Different heads learn to attend to different relationships. Interpretability work has found heads that do things like “attend to the previous token”, “attend to the same token earlier in the sequence” (induction heads), “attend to the subject of the current sentence”, etc.
- Redundancy and ablation. Many heads can be pruned after training with little loss — evidence that the model over-provisions capacity during training and then settles into a smaller effective set.
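A sketch of the parallelism point, continuing the toy setup above: stacking the per-head weights into one tensor turns the Python loop into batched einsum matmuls — the shape of work GPUs are good at:

```python
# All heads at once: stack weights along a leading head axis and let
# batched matmuls (einsum) do every head in parallel.
W_Q  = np.stack([h["W_Q"]  for h in heads])   # (n_heads, d_head, d_model)
W_K  = np.stack([h["W_K"]  for h in heads])
W_Vd = np.stack([h["W_Vd"] for h in heads])
W_Vu = np.stack([h["W_Vu"] for h in heads])   # (n_heads, d_model, d_head)

Q = np.einsum("hdm,sm->hsd", W_Q, E)          # every head's queries at once
K = np.einsum("hdm,sm->hsd", W_K, E)
V = np.einsum("hdm,sm->hsd", W_Vd, E)
A = softmax(np.einsum("hsd,htd->hst", Q, K) / np.sqrt(d_head))  # all patterns
AV = np.einsum("hst,htd->hsd", A, V)                # all small outputs
delta_batched = np.einsum("hmd,hsd->sm", W_Vu, AV)  # project up AND sum heads
assert np.allclose(delta_batched, delta)            # matches the per-head loop
```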
See also
- attention-mechanism — a single head in detail
- transformer-architecture — where multi-head attention blocks sit in the full stack
- src-3b1b-llms-ch3-attention