Summary: The operation inside a transformer that lets token embeddings communicate with each other, so that each token’s vector can be refined based on the context it appears in.
Mechanically: for every (query, key) pair, compute a dot-product alignment score, softmax the scores into weights, and use those weights to mix value vectors into the output.
Geometrically: allow the surrounding tokens to update a given token’s word-embedding vector (of dimensionality 12,288 in GPT-3), moving it from its original orientation to a different part of the high-dimensional embedding space that better captures its meaning in context (“fluffy” in “fluffy creature” augments the meaning that “creature” carries on its own)
💡 The problem attention solves
After word-embedding, every instance of “mole” has the same vector, regardless of whether it means a burrowing animal, a unit of substance, or a skin growth. Only context can disambiguate. Attention is the mechanism by which surrounding tokens inject information into each token’s vector so that downstream layers see a context-aware representation.
Crucially, this matters not just for the “interesting” tokens — it matters for the last token, because only the last vector in the sequence is used to produce the next-token prediction (see unembedding). That last vector has to absorb everything from the full context.
A single head of attention
A single “head” is parameterised by three learned matrices: $W_Q$, $W_K$, $W_V$. Given the sequence of token embeddings $\vec{E}_1, \dots, \vec{E}_n$ flowing into the block:
This describes self-attention
Everything on this page — queries, keys, and values all drawn from the same input token sequence — is technically the self-attention variant. GPT-style decoder-only LLMs use only self-attention.
A sibling variant, cross-attention, feeds queries from one sequence and keys/values from another. In the decoder of an encoder–decoder translation model, for example, queries come from the French being generated while keys/values come from the English source (likewise a text-transcript decoder attending to a speech-audio encoder). The machinery is otherwise identical. See self-attention-vs-cross-attention.
Step 1 — Queries and keys
Each token’s embedding, $\vec{E}_i$, is projected into two small vectors, $\vec{Q}_i$ and $\vec{K}_i$. These projections are done through the query-projection matrix ($W_Q$) and the key-projection matrix ($W_K$) respectively. Both matrices are of shape $d_{qk} \times d_{embed}$ (typically $d_{qk} \ll d_{embed}$; GPT-3 uses $d_{qk} = 128$ vs $d_{embed} = 12{,}288$):

$$\vec{Q}_i = W_Q \vec{E}_i \qquad\qquad \vec{K}_i = W_K \vec{E}_i$$
Dimensionality example with GPT-3:
- Query and Key projection matrices $W_Q$, $W_K$: $128 \times 12{,}288$ — wide matrices, which compress embedding vectors down to key-query-space
- Token embedding vector $\vec{E}_i$: $12{,}288 \times 1$ (column vector)
- Query and Key vectors $\vec{Q}_i$, $\vec{K}_i$: $128 \times 1$ — much shorter than the embedding vector (i.e. queries and keys exist in a lower-dimensional space than embeddings)
Query projection — a very wide matrix compresses a tall embedding into a short query vector:
Key projection — identical shape, different learned weights:
The output vector (128-dim) is ~96× shorter than the input (12,288-dim). The projection matrix has $128 \times 12{,}288 = 1{,}572{,}864$ (~1.6M) parameters.
Conceptual framing:
- Query ≈ “what am I looking for?” (e.g. the noun creature looks for preceding adjectives)
- Key ≈ “what do I offer?” (e.g. the adjective fluffy advertises itself as an adjective-in-the-preceding-position)
- Alternatively: $W_Q$ causes each token to ask “questions” (via its query vector), and $W_K$ makes all other tokens attempt to “answer” that question (with their key vectors).
Keep in mind this is an interpretation — what $W_Q$ and $W_K$ actually do is learned, and for most heads it is nowhere near this clean.
Note of convenience: $\vec{Q}_i$ and $\vec{K}_i$ are defined as column vectors (instead of row vectors)
We have chosen to define the individual token’s query and key vectors, $\vec{Q}_i$ and $\vec{K}_i$, as column vectors to look pretty 🙂. Literature commonly defines them as row vectors. A lot of the content of Steps 1–5 may appear transposed in the literature.
As a consequence of defining $\vec{Q}_i$ and $\vec{K}_i$ as column vectors,
- Concatenating all tokens’ query vectors as columns gives the matrix $Q$ (shape $128 \times n$):
- Concatenating all tokens’ key vectors as columns gives the matrix $K$ (shape $128 \times n$)
- In the literature (Attention is All You Need), $\vec{Q}_i$ and $\vec{K}_i$ are actually defined as row vectors, so the $Q$ and $K$ matrices in the literature are transposed relative to ours:
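A minimal NumPy sketch of Step 1, using GPT-3’s sizes and the column-vector convention above. The weights here are random stand-ins, not learned values; `W_Q`, `W_K`, and `E` are illustrative names:

```python
import numpy as np

d_embed, d_qk, n = 12_288, 128, 5            # GPT-3 sizes; n = a 5-token context
rng = np.random.default_rng(0)

W_Q = rng.standard_normal((d_qk, d_embed)) * 0.02   # wide 128 x 12,288 matrix
W_K = rng.standard_normal((d_qk, d_embed)) * 0.02   # same shape, different weights
E   = rng.standard_normal((d_embed, n))             # one 12,288-dim column per token

Q = W_Q @ E   # 128 x 5: each column is a token's query vector
K = W_K @ E   # 128 x 5: each column is a token's key vector

print(W_Q.size)           # 1,572,864 parameters per projection matrix
print(Q.shape, K.shape)   # (128, 5) (128, 5)
```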
Step 2 — Alignment scores
For every (query, key) pair (i.e. all tokens’ queries against all tokens’ keys), take the dot product. The score $\vec{K}_j \cdot \vec{Q}_i$ tells us how much token $i$ (with embedding vector $\vec{E}_i$) attends to token $j$ (with embedding vector $\vec{E}_j$):
Dimensionality example with GPT-3:
Query vector of token $i$’s embedding: $\vec{Q}_i$ ($128 \times 1$)
- Concatenating all Query vectors (columns) gives $Q$ ($128 \times n$), the matrix of all tokens’ Query vectors
Key vector of token $j$’s embedding: $\vec{K}_j$ ($128 \times 1$)
- Concatenating all Key vectors (columns) gives $K$ ($128 \times n$), the matrix of all tokens’ Key vectors
Score: $\vec{K}_j \cdot \vec{Q}_i$ — a single scalar
Full score grid: $K^\top Q$ ($n \times n$), where $n$ = context length
Single dot product (how much token $i$ attends to token $j$) — two 128-dim vectors collapse to one number:
- Full score grid — for an $n$-token context, the full score grid is $n \times n$. The attention pattern matrix will be the same shape:
- Notes:
- Column $i$ holds all scores for token $i$’s query against every key. This grid scales quadratically with context length — a 2,048-token context produces a $2{,}048 \times 2{,}048 \approx 4.2$M-entry grid.
- Column and row labels ($\vec{Q}_i$ and $\vec{K}_j$) refer to the definitions in Step 1 above
This produces an $n \times n$ grid. Column $i$ contains the scores of every token as a potential source of information for token $i$. Dot product as an alignment measure: large positive → strongly aligned, zero → unrelated, negative → anti-aligned.
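A sketch of Step 2 with toy sizes (random values, $d_{qk} = 4$ instead of 128 so the matrices stay readable): in the column convention, the whole score grid is a single matrix product.

```python
import numpy as np

d_qk, n = 4, 5                        # toy sizes for readability (GPT-3: 128, up to 2,048)
rng = np.random.default_rng(0)
Q = rng.standard_normal((d_qk, n))    # columns = query vectors (from Step 1)
K = rng.standard_normal((d_qk, n))    # columns = key vectors   (from Step 1)

scores = K.T @ Q                      # n x n score grid; entry (j, i) = K_j . Q_i
print(scores.shape)                   # (5, 5)
print(scores[1, 3])                   # how much token 3 attends to token 1 (0-indexed)
```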
Step 3 — Scale and softmax
Divide by $\sqrt{d_{qk}}$ (this is $\sqrt{128} \approx 11.3$ for GPT-3) for numerical stability (keeps the softmax out of saturated regions) and then apply softmax column by column so each column is a probability distribution:
Dimensionality example with GPT-3:
- Scale factor: $\sqrt{128} \approx 11.3$
- Scaled score grid: $\frac{K^\top Q}{\sqrt{128}}$ — same shape as the raw scores, just divided by 11.3
- Attention pattern: $A = \text{softmax}\!\left(\frac{K^\top Q}{\sqrt{128}}\right)$, column-wise — same shape, but each column now sums to 1
Take the 5-token phrase: “a big red ball bounced…”
Before softmax — raw scores divided by $\sqrt{128}$ (a 5-token example):
After column-wise softmax — each column becomes a probability distribution (sums to 1):
The scaling by $\sqrt{d_{qk}}$ keeps the scores moderate — without it, large dot products push softmax into near-one-hot territory, killing gradients.
The resulting grid is called the attention pattern.
Attention Pattern after the scaling and softmax process (without masking):
5-token phrase: “a big red ball bounced…”
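A sketch of Step 3 on a made-up 5×5 score grid (the numbers are illustrative, not the real values for “a big red ball bounced”): scale by $\sqrt{d_{qk}}$, then softmax each column so it sums to 1.

```python
import numpy as np

d_qk = 128
rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 5)) * 20        # made-up raw scores for 5 tokens

scaled = scores / np.sqrt(d_qk)                  # divide by sqrt(128), roughly 11.3
exp = np.exp(scaled - scaled.max(axis=0))        # subtract per-column max for stability
attention = exp / exp.sum(axis=0)                # column-wise softmax

print(attention.sum(axis=0))                     # every column sums to 1.0
```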
Step 4 — Masking (autoregressive only)
This step mostly relates to training. Also see the masking section below.
For a next-token-prediction transformer, tokens must not see the future — otherwise training would leak the answer. Before the softmax, set every entry whose key position comes after its query position (future influencing past) to $-\infty$; after softmax those entries become 0. This is called a causal mask.
Masking is what makes it safe to use every position as a training example simultaneously (parallel training). Without it, every position would be “told” what came next.
Attention Pattern after the scaling and softmax process (WITH masking):
5-token phrase: “a big red ball bounced…”
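A sketch of the causal mask in the column convention (rows are key positions $j$, columns are query positions $i$; toy score values): entries with $j > i$ are set to $-\infty$ before the softmax, so they become exactly 0 after it.

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
scaled = rng.standard_normal((n, n))     # already-scaled scores (toy values)

rows, cols = np.indices((n, n))          # rows = key position j, cols = query position i
scaled[rows > cols] = -np.inf            # mask: a key later than the query is off-limits

exp = np.exp(scaled - scaled.max(axis=0))
attention = exp / exp.sum(axis=0)        # column-wise softmax; masked entries become 0

print(np.round(attention, 2))            # upper-triangular: column i only uses tokens j <= i
```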
We now have the attention pattern, describing how much each token attends to (is relevant to updating) each other token. Now, let’s show how to update all the word embeddings with this new, richer, semantic meaning. It’s what allows “fluffy” in “fluffy creature” to augment the meaning that “creature” carries on its own.
Step 5 — Values and output
Project every token embedding once more, through the value projection matrix $W_V$, staying within embedding space:

$$\vec{V}_i = W_V \vec{E}_i$$

An interpretation of the value vector, $\vec{V}_i$: if the token is relevant to adjusting the meaning of some other token, what exactly should be added to the embedding of that other token to reflect this?
Dimensionality example with GPT-3 (for one head of attention)
Value matrix (conceptual): $W_V$, $12{,}288 \times 12{,}288$ — maps embedding space → embedding space
Value matrix (in practice, factored): $W_V = W_{V\uparrow} W_{V\downarrow}$, where
- $W_{V\downarrow}$ ($128 \times 12{,}288$) — maps down: embedding space → 128-dim, and
- $W_{V\uparrow}$ ($12{,}288 \times 128$) — maps up: 128-dim → embedding space
- For multiple attention heads, all “up-projections” are concatenated into one giant matrix, the output matrix $W_O$
- (see low-rank factoring and Terminology gotcha value matrix vs output matrix)
Token embedding vector $\vec{E}_i$: $12{,}288 \times 1$
Value vector $\vec{V}_i$: $12{,}288 \times 1$ — same dimensionality as the embedding
Conceptual (un-factored) — a square matrix maps a tall vector to an equally tall vector:
- Factored (actual) — down-project to 128-dim, then up-project back:
- The bottleneck through 128 dims means $W_V$ is low-rank — it can’t represent arbitrary transformations, only ones that pass through a 128-dim intermediate.
The refinement $\Delta\vec{E}_i$ added to token $i$ is the weighted sum of values ($\vec{V}_j$), weighted by the attention column for token $i$ (see Step 3):

$$\Delta\vec{E}_i = \sum_{j} A_{j,i}\,\vec{V}_j$$
Dimensionality example with GPT-3:
- Attention weights for token $i$: $A_{1,i}, \dots, A_{n,i}$ — one scalar per token in context, from column $i$ of the attention pattern
- Value vectors: $\vec{V}_1, \dots, \vec{V}_n$, each $12{,}288 \times 1$
- Refinement vector: $\Delta\vec{E}_i$, $12{,}288 \times 1$ — same shape as the embedding
- Weighted sum — for token $i$ in an $n$-token-long context, $n$ scalars (from the masked (upper triangular) attention pattern) weight $n$ value vectors, producing one vector $\Delta\vec{E}_i$:
Example: Take the 5-token context “a big red ball bounced”.
- Let $i = 4$, meaning we are interested in updating the token embedding for “ball”
- 5 scalars from the (masked) attention pattern will weight 5 value vectors (one per token), producing one vector $\Delta\vec{E}_4$:
- Recall the Attention Pattern (“with masking”) above; let’s take its 4th column ($i = 4$):
- Then compute $\Delta\vec{E}_4 = \sum_{j=1}^{4} A_{j,4}\,\vec{V}_j$, the update to “ball“‘s original embedding. Once updated, the new “ball” will include the richer meaning that comes from “big” and “red”, which “ball” attends to
- Tokens whose attention weight is near 0 contribute almost nothing; tokens near 1 dominate the update. The result is a 12,288-dim vector — a direction in embedding space that gets added to token $i$’s embedding.
This is added (residual-stream style) to the original embedding:

$$\vec{E}_i' = \vec{E}_i + \Delta\vec{E}_i$$
This attention-weighted update is done for all token embeddings in the context window (not just the “ball” token in our example!).
Dimensionality example with GPT-3:
- Original embedding: $\vec{E}_i$, $12{,}288 \times 1$
- Refinement from this head: $\Delta\vec{E}_i$, $12{,}288 \times 1$
- Updated embedding: $\vec{E}_i' = \vec{E}_i + \Delta\vec{E}_i$ — same shape, different (updated, richer, more nuanced) direction
Elementwise addition — the original embedding vector gets nudged:
This is a residual connection — the identity path means the model can learn to make $\Delta\vec{E}_i$ small (or zero) for positions where this head has nothing useful to add.
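A sketch of Step 5 end-to-end for one head (toy sizes, random stand-in weights, and a made-up attention column): project values, mix them with the masked attention column for “ball”, and add the result back to the original embedding.

```python
import numpy as np

d_embed, n = 8, 5                        # toy sizes (GPT-3: 12,288-dim, up to 2,048 tokens)
rng = np.random.default_rng(0)

E   = rng.standard_normal((d_embed, n))                # token embeddings as columns
W_V = rng.standard_normal((d_embed, d_embed)) * 0.1    # conceptual (un-factored) value matrix
V   = W_V @ E                                          # value vectors as columns

# Made-up 4th column (i = 4, "ball") of a masked attention pattern: sums to 1, future token = 0
attn_col = np.array([0.05, 0.40, 0.45, 0.10, 0.00])

delta_E   = V @ attn_col                 # weighted sum of value vectors: the refinement vector
E_updated = E[:, 3] + delta_E            # residual add: nudge "ball"'s original embedding
print(delta_E.shape, E_updated.shape)    # (8,) (8,)
```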
The compact form
Important
Note: Everything in the literature appears transposed, due to our earlier decision to define a single token’s $\vec{Q}_i$ and $\vec{K}_i$ as column vectors (see note above), which makes the literature’s definition of the $Q$ and $K$ matrices (i.e. for ALL tokens) differ from ours. Where the literature says $QK^\top$, our notes above say $K^\top Q$.
The compact notation below, from the paper, is an example of this. In our notation it would instead read $V\,\text{softmax}\!\left(\frac{K^\top Q}{\sqrt{d_k}}\right)$, per our definitions.
The whole thing, from the original paper Attention is All You Need:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, $V$ are stacks of all query, key, and value vectors across positions (one per row, in the paper’s convention). The softmax is understood to apply per query: per row here, per column in our earlier notation.
Dimensionality example with GPT-3:
Here each token’s vector is a row (the batched convention from the paper). For a context of $n$ tokens:
- Query matrix $Q$: $n \times 128$ — each row is one token’s 128-dim query
- Key matrix $K$: $n \times 128$ — each row is one token’s 128-dim key
- Value matrix $V$: $n \times 12{,}288$ — each row is one token’s 12,288-dim value
Step 1 — $QK^\top$ gives the $n \times n$ score grid:
Step 2 — softmax (per row, after scaling by $\sqrt{d_k}$) + multiply by $V$ gives the output:
The output is $n \times 12{,}288$ — one row per token, each row a 12,288-dim refinement vector. The inner dimension (128) cancels in the $QK^\top$ product; the inner dimension ($n$) cancels when the attention weights multiply $V$.
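The compact form as a sketch in the paper’s row convention (toy random inputs; `attention` is a hypothetical helper, not a library function), with the optional causal mask folded in:

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, with the softmax applied per row (one query per row)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # n x n score grid
    if causal:
        n = scores.shape[0]
        scores[np.triu_indices(n, k=1)] = -np.inf   # row i must not see columns j > i
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # n x d_v: one refinement row per token

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 128, 12_288
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
print(attention(Q, K, V, causal=True).shape)        # (5, 12288)
```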
Masking and efficient (parallelised) training
Every position in a sequence simultaneously predicts its own next token. So a sequence of length $n$ yields $n$ training signals at no extra cost — provided masking prevents position $i$ from cheating by looking at positions $> i$. This is a huge efficiency multiplier over training on “predict the last token only” one example at a time.
Example: A 12-token-long sequence contains 12 parallel training examples!
- 12-token-long sequence: To| date|,| the| cle|ve|rest| thinker| of| all| time| was…
- Contains the following 12 training examples:
- To…
- To| date…
- To| date|,…
- To| date|,| the…
- To| date|,| the| cle…
- To| date|,| the| cle|ve…
- To| date|,| the| cle|ve|rest…
- To| date|,| the| cle|ve|rest| thinker…
- To| date|,| the| cle|ve|rest| thinker| of…
- To| date|,| the| cle|ve|rest| thinker| of| all…
- To| date|,| the| cle|ve|rest| thinker| of| all| time…
- To| date|,| the| cle|ve|rest| thinker| of| all| time| was…
Also see the masking step above (mostly relevant to training only).
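A sketch of how these parallel examples are assembled in practice (the token ids below are made up, not a real tokenizer’s output): inputs are the sequence as-is, targets are the sequence shifted by one, and the causal mask guarantees the prediction at each position only saw earlier positions.

```python
import numpy as np

# Made-up token ids standing in for "To| date|,| the| cle|ve|rest| thinker| of| all| time| was"
seq = np.array([17, 204, 11, 262, 3021, 303, 4079, 5229, 286, 477, 640, 373])

inputs  = seq[:-1]   # what the model is shown
targets = seq[1:]    # what each position must predict (the sequence shifted by one)

for i in range(len(inputs)):
    print(f"given {inputs[:i + 1]} -> predict {targets[i]}")
# 11 (prefix, next-token) pairs live inside this 12-token snippet; the target for the
# full 12-token prefix is simply whatever token follows "was" in the training corpus.
```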
Context size and quadratic cost
The attention pattern has size $n \times n$, where $n$ is the context length. Doubling context quadruples attention compute and memory. This is the reason long-context models are nontrivial, and a major research area; approaches include Linear Attention, Sliding-Window Attention, State-Space Models, Sparse Attention mechanisms, Blockwise Attention, Linformer, Reformer, Ring Attention, Longformer, Adaptive Attention Span, etc.
Low-rank value factoring
Naïvely, $W_V$ is square ($12{,}288 \times 12{,}288$) and would have ~151M parameters in GPT-3 — more than the rest of a single head combined. In practice $W_V$ is factored into two matrices of rank 128: a down-projection into the small key-query space, then an up-projection back to the embedding space. This matches the parameter budget of $W_Q$ and $W_K$ combined and constrains the value map to be low-rank — see multi-head-attention for why this matters across many heads.
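A sketch of the factoring at GPT-3’s sizes (random stand-in weights; `W_V_down` and `W_V_up` are illustrative names for the two factors): the parameter count drops from ~151M for a square $W_V$ to ~3.1M for the factored pair, the same budget as $W_Q$ plus $W_K$.

```python
import numpy as np

d_embed, d_qk = 12_288, 128
rng = np.random.default_rng(0)

W_V_down = rng.standard_normal((d_qk, d_embed)) * 0.02   # 128 x 12,288: embedding -> 128-dim
W_V_up   = rng.standard_normal((d_embed, d_qk)) * 0.02   # 12,288 x 128: 128-dim -> embedding

square_params   = d_embed * d_embed              # 150,994,944 (~151M) for an un-factored W_V
factored_params = W_V_down.size + W_V_up.size    # 3,145,728 (~3.1M), same as W_Q + W_K
print(square_params, factored_params)

E_i = rng.standard_normal(d_embed)
V_i = W_V_up @ (W_V_down @ E_i)   # value vector forced through a 128-dim bottleneck (low rank)
print(V_i.shape)                  # (12288,)
```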
What heads actually do
In the 3b1b running example, one head “has adjectives update nouns”. Real heads do much weirder things. Documented patterns from interpretability work include:
- Previous-token heads (attend strictly to the token immediately before).
- Induction heads (copy patterns — if `AB...A` appears, predict `B`).
- Positional heads, duplicate-token heads, anaphora resolution heads, and many more.
Most heads don’t have a clean interpretation at all — they’re just linear algebra that happened to be useful during training.
See also
- self-attention-vs-cross-attention — the two variants of a single attention head
- multi-head-attention — parallel heads in one block
- transformer-architecture — how attention slots into the stack
- softmax — the normalisation used here
- word-embedding, unembedding — the data attention operates on