Summary: A full attention block in a transformer is not one attention head but many, each with its own independent $W_Q$, $W_K$, $W_V$, running in parallel. Their outputs all get added to the residual stream, giving the model the capacity to learn many distinct ways that context changes meaning.
Why many heads (run in parallel)?
A single attention head can learn one pattern of “which tokens should influence which, and how?”. But language context works in many ways at once:
- Adjectives update their nouns.
- Pronouns need to find their antecedents.
- “they crashed the” changes what kind of car is expected (structural damage, broken glass).
- The word “Harry” near “wizard” vs. near “Queen” and “Sussex” picks out different people (Harry Potter vs. Prince Harry).
Forcing a single head to learn all of these simultaneously would be a nightmare. Running many heads in parallel, each with its own small parameter budget, lets different heads specialise — and summing their contributions gives the token vector access to all of them at once.
The model is given the capacity to learn many distinct ways that context changes meaning.
How it’s wired
For each block and each head $h$:
- Independent $W_Q$, $W_K$, $W_V$ (the last factored into $W_{V\downarrow}$ and $W_{V\uparrow}$).
- Each head computes its own attention pattern $A^{(h)}$ and its own refinement $\Delta E_i^{(h)}$ for every token position $i$.
- The block’s output for position $i$ is the sum over heads, added to the incoming embedding:

$E_i' = E_i + \sum_h \Delta E_i^{(h)}$
Dimensionality example with GPT-3:
- Original embedding: $E_i \in \mathbb{R}^{12{,}288}$
- Refinement from head $h$: $\Delta E_i^{(h)} \in \mathbb{R}^{12{,}288}$ — each head proposes a nudge in the full embedding space
- Number of heads: 96
- Updated embedding: $E_i' \in \mathbb{R}^{12{,}288}$ — same dimensionality, moved to encode context
Summing 96 proposed changes — every head independently produces a 12,288-dim refinement; they all get added together and then added to the original:

$E_i' = E_i + \Delta E_i^{(1)} + \Delta E_i^{(2)} + \cdots + \Delta E_i^{(96)}$
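A minimal NumPy sketch of this wiring, with toy dimensions assumed for speed ($d_{model}=48$, $d_{head}=8$, 6 heads; GPT-3 would be 12,288 / 128 / 96) — the names `W_Q`, `W_Vd`, `W_Vu` are my own labels, not standard identifiers:

```python
# Toy multi-head attention in the 3b1b framing: each head owns its own
# W_Q, W_K, and a factored value map (W_Vd down, W_Vu up); every head
# emits a full-width refinement and they are all summed into the
# residual stream.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, seq_len = 48, 8, 6, 10

E = rng.normal(size=(seq_len, d_model))  # incoming token embeddings

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

heads = [{
    "W_Q":  rng.normal(size=(d_head, d_model)) / np.sqrt(d_model),
    "W_K":  rng.normal(size=(d_head, d_model)) / np.sqrt(d_model),
    "W_Vd": rng.normal(size=(d_head, d_model)) / np.sqrt(d_model),  # value "down"
    "W_Vu": rng.normal(size=(d_model, d_head)) / np.sqrt(d_head),   # value "up"
} for _ in range(n_heads)]

def refinement(E, h):
    """One head's full-width nudge for every position (causal mask omitted)."""
    Q, K = E @ h["W_Q"].T, E @ h["W_K"].T      # (seq, d_head) each
    V = E @ h["W_Vd"].T                        # small value vectors
    A = softmax(Q @ K.T / np.sqrt(d_head))     # (seq, seq) attention pattern
    return (A @ V) @ h["W_Vu"].T               # back up to (seq, d_model)

delta = sum(refinement(E, h) for h in heads)   # sum of per-head nudges
E_new = E + delta                              # added to the residual stream
print(E_new.shape)                             # (10, 48): same space, new meaning
```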
Equivalently, as a matrix multiply (the $W_O$ view) — concatenate all heads’ small outputs $o_i^{(h)} \in \mathbb{R}^{128}$ and multiply by the output matrix:

$\Delta E_i = W_O\,\mathrm{concat}\!\left(o_i^{(1)}, o_i^{(2)}, \dots, o_i^{(96)}\right)$

The concatenated vector on the right has $96 \times 128 = 12{,}288$ entries — one 128-dim chunk per head. $W_O$ is square ($12{,}288 \times 12{,}288$) but is really 96 stacked $W_{V\uparrow}$ matrices (each $12{,}288 \times 128$) placed side by side. See terminology gotcha.
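Continuing the toy sketch above, the same numbers through the concat-and-$W_O$ bookkeeping — building $W_O$ by stacking the heads’ up-projections side by side and checking that the two views agree:

```python
# Each head's small output, kept small instead of projected up per head.
small_outs = [softmax((E @ h["W_Q"].T) @ (E @ h["W_K"].T).T / np.sqrt(d_head))
              @ (E @ h["W_Vd"].T) for h in heads]         # (seq, d_head) each

concat = np.concatenate(small_outs, axis=1)               # (seq, n_heads*d_head)
W_O = np.concatenate([h["W_Vu"] for h in heads], axis=1)  # (d_model, n_heads*d_head)

delta_via_WO = concat @ W_O.T                             # one big matmul
assert np.allclose(delta_via_WO, delta)                   # same as summing heads
```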
GPT-3: $96$ heads per block, $96$ blocks. So the model runs $96 \times 96 = 9{,}216$ distinct attention heads in total.
Parameter count (GPT-3)
Per head:
- $W_Q$, $W_K$: each $128 \times 12{,}288 \approx 1.57$M
- $W_{V\downarrow}$: $128 \times 12{,}288$, same ~1.57M
- $W_{V\uparrow}$: $12{,}288 \times 128$, same ~1.57M
- Total per head: $4 \times 1.57\text{M} \approx 6.3$M
Per block: $96 \times 6.3\text{M} \approx 604$M. Across all 96 layers: ~58B parameters devoted to attention. About a third of GPT-3’s 175B.
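The same arithmetic as a quick sanity check in Python (exact counts, no toy dimensions):

```python
# GPT-3 attention parameter count, recomputed from the figures above.
d_model, d_head, n_heads, n_layers = 12_288, 128, 96, 96

per_matrix = d_head * d_model      # 1,572,864 — W_Q, W_K, W_Vdown (W_Vup is the transposed shape)
per_head   = 4 * per_matrix        # 6,291,456 ≈ 6.3M
per_block  = n_heads * per_head    # 603,979,776 ≈ 604M
total      = n_layers * per_block  # 57,982,058,496 ≈ 58B

print(f"{total:,} attention params, {total / 175e9:.0%} of 175B")  # ~33%
```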
Low-rank values (why factor $W_V$?)
Without factoring, $W_V$ would be square ($12{,}288 \times 12{,}288 \approx 151$M parameters) per head — roughly 48× the parameter budget of that head’s $W_Q$ and $W_K$ combined. Multiply by 96 heads × 96 layers and the arithmetic doesn’t close.
Factoring through the small $128$-dimensional space:
- Drops per-head value params to match $W_Q$ and $W_K$ (~3.1M split across the two factors).
- Across all heads in a block, the up-projection matrices $W_{V\uparrow}$ can be stacked and viewed as a single big output projection (often written $W_O$ in papers and textbooks) — same total operation count, different bookkeeping.
- The overall per-head linear value map $W_{V\uparrow} W_{V\downarrow}$ has rank at most $128$, which is still much less than the full $12{,}288$ — hence “low-rank.”
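A quick demonstration of the rank cap on the toy matrices from the sketch above — the composed value map fills a full $d_{model} \times d_{model}$ matrix on paper, but cannot exceed rank $d_{head}$:

```python
# Everything is squeezed through the d_head-dim bottleneck, so the rank
# of the composed map is capped at d_head (8 in the toy setup, 128 in GPT-3).
full_value_map = heads[0]["W_Vu"] @ heads[0]["W_Vd"]  # (48, 48)
print(np.linalg.matrix_rank(full_value_map))          # 8, not 48
```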
Terminology gotcha: “value matrix” vs output matrix
Theory vs practice
3b1b’s “low-rank value transformation” framing — each head owning its own $W_{V\downarrow}$ and $W_{V\uparrow}$ — is conceptually clean, but it is not how papers and real implementations write it. The standard convention is:
- The per-head value matrix is only the “down-projection” (maps down from embedding space → the smaller $128$-dim space in GPT-3).
- When a paper says “the value matrix of head $h$,” this is what it means; often denoted $W_V^{(h)}$ (Attention Is All You Need writes $W_i^V$).
- All “up-projections” are concatenated into one giant matrix, the output matrix $W_O$, which belongs to the multi-head block as a whole, not to any single head.
- $W_O$ maps the concatenated head outputs back up to $12{,}288$-dim embedding space.
The two framings compute exactly the same function — $\sum_h W_{V\uparrow}^{(h)} o_i^{(h)}$ vs $W_O\,\mathrm{concat}_h\!\left(o_i^{(h)}\right)$ — but with the per-head/per-block split drawn in different places. If you only read 3b1b and then open the original Attention Is All You Need paper, this renaming is the single most confusing thing. Same operation, different bookkeeping.
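Concretely, on the toy matrices from the first sketch: a paper-convention implementation would store one fused value matrix and one fused output matrix, and slicing them per head recovers 3b1b’s matrices exactly (a sketch of the bookkeeping only, not any particular library’s layout):

```python
# Paper convention: one fused W_V (all down-projections stacked) and one
# fused W_O (all up-projections side by side), owned by the block.
W_V_fused = np.concatenate([h["W_Vd"] for h in heads], axis=0)  # (n_heads*d_head, d_model)
W_O_fused = np.concatenate([h["W_Vu"] for h in heads], axis=1)  # (d_model, n_heads*d_head)

# Slicing by head recovers the 3b1b per-head matrices: same numbers,
# different place to draw the per-head boundary.
h0_down = W_V_fused[:d_head, :]    # "value matrix of head 0" (paper sense)
h0_up   = W_O_fused[:, :d_head]    # head 0's column slice of W_O
assert np.array_equal(h0_down, heads[0]["W_Vd"])
assert np.array_equal(h0_up,   heads[0]["W_Vu"])
```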
What you actually get
- Parallelism within a block. All heads can run simultaneously on a GPU — this is one reason attention is GPU-friendly (see the batched sketch after this list).
- Specialisation. Different heads learn to attend to different relationships. Interpretability work has found heads that do things like “attend to the previous token”, “attend to the same token earlier in the sequence” (induction heads), “attend to the subject of the current sentence”, etc.
- Redundancy and ablation. Many heads can be pruned after training with little loss — evidence that the model over-provisions capacity during training and then settles into a smaller effective set.
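A sketch of the parallelism point, continuing the toy setup above: stacking the per-head weights into one tensor turns the Python loop into batched einsum matmuls — the shape of work GPUs are good at:

```python
# All heads at once: stack weights along a leading head axis and let
# batched matmuls (einsum) do every head in parallel.
W_Q  = np.stack([h["W_Q"]  for h in heads])   # (n_heads, d_head, d_model)
W_K  = np.stack([h["W_K"]  for h in heads])
W_Vd = np.stack([h["W_Vd"] for h in heads])
W_Vu = np.stack([h["W_Vu"] for h in heads])   # (n_heads, d_model, d_head)

Q = np.einsum("hdm,sm->hsd", W_Q, E)          # every head's queries at once
K = np.einsum("hdm,sm->hsd", W_K, E)
V = np.einsum("hdm,sm->hsd", W_Vd, E)
A = softmax(np.einsum("hsd,htd->hst", Q, K) / np.sqrt(d_head))  # all patterns
AV = np.einsum("hst,htd->hsd", A, V)                # all small outputs
delta_batched = np.einsum("hmd,hsd->sm", W_Vu, AV)  # project up AND sum heads
assert np.allclose(delta_batched, delta)            # matches the per-head loop
```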
See also
- attention-mechanism — a single head in detail
- transformer-architecture — where multi-head attention blocks sit in the full stack
- src-3b1b-llms-ch3-attention