Summary: The operation inside a transformer that lets token embeddings communicate with each other, so that each token’s vector can be refined based on the context it appears in.

Mechanically: for every (query, key) pair, compute a dot-product alignment score, softmax the scores into weights, and use those weights to mix value vectors into the output.

Geometrically: allow surrounding tokens to update a given token’s word-embedding vector (of dimensionality $d_{\text{model}} = 12{,}288$ in GPT-3), moving it from its original orientation to a different part of the high-dimensional embedding space that better captures its semantic meaning (“fluffy” in “fluffy creature” augments the meaning that “creature” carries on its own)

💡 The problem attention solves

After word-embedding, every instance of “mole” has the same vector, regardless of whether it means a burrowing animal, a unit of substance, or a skin growth. Only context can disambiguate. Attention is the mechanism by which surrounding tokens inject information into each token’s vector so that downstream layers see a context-aware representation.

Crucially, this matters not just for the “interesting” tokens — it matters for the last token, because only the last vector in the sequence is used to produce the next-token prediction (see unembedding). That last vector has to absorb everything from the full context.

A single head of attention

A single “head” is parameterised by three learned matrices: $W_Q$, $W_K$, $W_V$. Given the sequence of token embeddings $\vec e_1, \dots, \vec e_n$ flowing into the block, the head proceeds through the five steps below.

This describes self-attention

Everything on this page — $Q$, $K$, $V$ all drawn from the same input token sequence — is technically the self-attention variant. GPT-style decoder-only LLMs use only self-attention.

A sibling variant, cross-attention, feeds queries from one sequence (e.g. English, or speech audio) and keys/values from another (e.g. French, or a text transcript), as in the decoder of an encoder–decoder translation model. The machinery is otherwise identical. See self-attention-vs-cross-attention.

Step 1 — Queries and keys

Each token’s embedding, $\vec e_i$, is projected into two small vectors, $\vec q_i$ and $\vec k_i$. These projections are done through the query-projection matrix ($W_Q$) and the key-projection matrix ($W_K$) respectively. Both matrices are of dimension $d_k \times d_{\text{model}}$ (typically $d_k \ll d_{\text{model}}$; GPT-3 uses $d_k = 128$ vs $d_{\text{model}} = 12{,}288$):

$$\vec q_i = W_Q\, \vec e_i \qquad \vec k_i = W_K\, \vec e_i$$
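As a concrete sketch, here is Step 1 in NumPy with toy dimensions (all sizes and weights are illustrative placeholders, not GPT-3’s):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 2            # toy sizes; GPT-3 uses d_model=12_288, d_k=128

E   = rng.standard_normal((d_model, n))    # embeddings as columns, one per token
W_Q = rng.standard_normal((d_k, d_model))  # learned in a real model, random here
W_K = rng.standard_normal((d_k, d_model))

Q = W_Q @ E    # (d_k, n): column i is q_i, token i's query
K = W_K @ E    # (d_k, n): column i is k_i, token i's key
```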

Conceptual framing:

  • Query ≈ “what am I looking for?” (e.g. the noun creature looks for preceding adjectives)
  • Key ≈ “what do I offer?” (e.g. the adjective fluffy advertises itself as an adjective-in-the-preceding-position)
  • Alternatively: $W_Q$ causes each token to ask “questions” (via its query vector) and $W_K$ makes all other tokens attempt to “answer” that question (with their key vectors).

Keep in mind this is an interpretation — what $W_Q$ and $W_K$ actually do is learned, and for most heads is nowhere near this clean.

Note of convenience: $\vec q_i$ and $\vec k_i$ are defined as column vectors (instead of row vectors)

We have chosen to define the individual token’s query and key vectors, $\vec q_i$ and $\vec k_i$, as column vectors to look pretty 🙂. Literature commonly defines them as row vectors. A lot of the content of Steps 1–5 may appear transposed in the literature.

As a consequence of defining $\vec q_i$ and $\vec k_i$ as column vectors,

  • Concatenating all tokens’ query vectors as columns gives the $d_k \times n$ matrix $Q = W_Q E = [\,\vec q_1 \;\; \vec q_2 \;\; \cdots \;\; \vec q_n\,]$
  • Concatenating all tokens’ key vectors as columns gives the $d_k \times n$ matrix $K = W_K E = [\,\vec k_1 \;\; \vec k_2 \;\; \cdots \;\; \vec k_n\,]$
  • In the literature (Attention is All You Need), $\vec q_i$ and $\vec k_i$ are actually defined as row vectors, so the $Q$ and $K$ matrices in the literature are transposed relative to ours: $Q_{\text{lit}} = Q^\top$ and $K_{\text{lit}} = K^\top$ (each of dimension $n \times d_k$)

Step 2 — Alignment scores

For every (query, key) pair (i.e. all words’ queries against all words’ keys), take the dot product. This tells us how much token $i$ (with embedding $\vec e_i$) attends to token $j$ (with embedding $\vec e_j$):

$$s_{ji} = \vec k_j \cdot \vec q_i$$

This produces an $n \times n$ grid $S = K^\top Q$. Column $i$ contains the scores of every token $j$ as a potential source of information for token $i$. Dot product as an alignment measure: large positive → strongly aligned, zero → unrelated, negative → anti-aligned.
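Continuing the NumPy sketch from Step 1, the whole grid is a single matrix product:

```python
S = K.T @ Q    # (n, n) alignment scores: S[j, i] = dot(k_j, q_i),
               # so column i scores every token j as a source for token i
```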

Step 3 — Scale and softmax

Divide by $\sqrt{d_k}$ (this is $\sqrt{128}$ for GPT-3) for numerical stability (keeps the softmax out of saturated regions) and then apply softmax column by column so each column is a probability distribution:

$$A = \operatorname{softmax}\!\left(\frac{K^\top Q}{\sqrt{d_k}}\right)$$

The resulting grid is called the attention pattern.
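In the running sketch, the scaling and column-wise softmax might look like this (`softmax_columns` is a hypothetical helper, not a library call):

```python
def softmax_columns(S):
    """Turn each column of S into a probability distribution."""
    S = S - S.max(axis=0, keepdims=True)   # subtract column max for stability
    P = np.exp(S)
    return P / P.sum(axis=0, keepdims=True)

A = softmax_columns(S / np.sqrt(d_k))      # the attention pattern; columns sum to 1
```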

Step 4 — Masking (autoregressive only)

This step mostly relates to training. Also see the masking section below.

For a next-token-prediction transformer, tokens must not see the future — otherwise training would leak the answer. Before the softmax, set every entry $s_{ji}$ with $j > i$ (a future token influencing a past one) to $-\infty$; after the softmax those entries become 0. This is called a causal mask.

Masking is what makes it safe to use every position as a training example simultaneously (parallel training). Without it, every position would be “told” what came next.
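In the sketch, the mask goes in before the softmax (following the column convention above, where entry $(j, i)$ scores source token $j$ for target token $i$):

```python
j, i = np.indices((n, n))                      # row j = source, column i = target
S_masked = np.where(j > i, -np.inf, S)         # future sources get -infinity...
A = softmax_columns(S_masked / np.sqrt(d_k))   # ...and become exactly 0 after softmax
```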


We have the attention pattern, describing which words are relevant to (attend to) which other words. Now let’s show how to use it to update all the word embeddings with this new, richer semantic meaning. This is what allows “fluffy” in “fluffy creature” to augment the meaning that “creature” carries on its own.

Step 5 — Values and output

Project every token embedding once more, through the value-projection matrix $W_V$, into the embedding space:

$$\vec v_i = W_V\, \vec e_i$$

An interpretation of the value vector, $\vec v_i$: If the token is relevant to adjusting the meaning of some other token, what exactly should be added to the embedding of that other token to reflect this?

The refinement added to token $i$ is the weighted sum of values ($\vec v_j$), weighted by the attention column for token $i$ (see Step 3):

$$\Delta\vec e_i = \sum_j A_{ji}\, \vec v_j$$

This is added (residual-stream style) to the original embedding:

$$\vec e_i{}' = \vec e_i + \Delta\vec e_i$$

This attention-weighted update is done for all token embeddings in the context window (not just the last token!).
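Completing the NumPy sketch ($W_V$ is kept square here for clarity; see the low-rank factoring note below):

```python
W_V = rng.standard_normal((d_model, d_model))  # square for clarity; factored in practice

V       = W_V @ E      # (d_model, n): column j is v_j, token j's value vector
Delta_E = V @ A        # column i is sum_j A[j, i] * v_j, the refinement for token i
E_new   = E + Delta_E  # residual-style update applied to every token at once
```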

The compact form

Important

Note: Everything in the literature appears transposed, due to our earlier decision to define a single token’s $\vec q_i$ and $\vec k_i$ as column vectors (see note above), which makes the literature’s definition of the $Q$ and $K$ matrices (i.e. for ALL tokens) differ from ours. Where the literature says $Q K^\top$, our notes above will say $K^\top Q$.

The compact notation below, from the paper, is an example of this. In our notation it would instead say $\operatorname{Attention}(Q, K, V) = V\,\operatorname{softmax}\!\left(\frac{K^\top Q}{\sqrt{d_k}}\right)$ per our definitions.

The whole thing, from the original paper Attention is All You Need:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $Q$, $K$, $V$ are stacks of all query, key, and value vectors across positions. The softmax is understood to apply per column in our convention (per row in the paper’s row-vector convention).
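Putting the five steps together, a self-contained single-head function in our column-vector notation might look like this (a sketch, not production code; real implementations batch this across heads and sequences):

```python
import numpy as np

def attention_head(E, W_Q, W_K, W_V, causal=True):
    """One head of self-attention. E has shape (d_model, n), embeddings as columns.
    W_V must map back to d_model (square or factored) so the residual add works."""
    d_k = W_Q.shape[0]
    Q, K, V = W_Q @ E, W_K @ E, W_V @ E
    S = (K.T @ Q) / np.sqrt(d_k)            # scaled alignment scores, shape (n, n)
    if causal:
        j, i = np.indices(S.shape)
        S = np.where(j > i, -np.inf, S)     # block future -> past information flow
    S -= S.max(axis=0, keepdims=True)       # numerically stable per-column softmax
    A = np.exp(S)
    A /= A.sum(axis=0, keepdims=True)
    return E + V @ A                        # residual update for every token
```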

Masking and efficient (parallelised) training

Every position in a sequence simultaneously predicts its own next token. So a sequence of length $n$ yields $n$ training signals at no extra cost, provided masking prevents position $i$ from cheating by looking at positions $j > i$. This is a huge efficiency multiplier over training on “predict the last token only”, one example at a time.
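A small illustration of how those training signals line up (the token IDs are made up; only the one-position offset matters):

```python
tokens  = [17, 4, 42, 8, 23, 5]   # a tokenised training sequence (made-up IDs)
inputs  = tokens[:-1]             # the model reads positions 0..n-2
targets = tokens[1:]              # position i is scored on predicting token i+1
# Thanks to the causal mask, the prediction at every position is honest,
# so one forward pass yields len(inputs) loss terms instead of one.
```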

Also see the masking step above (mostly relevant to training only).

Context size and quadratic cost

The attention pattern has size $n \times n$, where $n$ is the context length. Doubling context quadruples attention compute and memory. This is the reason long-context models are nontrivial, and a major research area: approaches include Linear Attention, Sliding-Window Attention, State-Space Models, Sparse Attention Mechanisms, Blockwise Attention, Linformer, Reformer, Ring Attention, Longformer, Adaptive Attention Span, etc.

Low-rank value factoring

Naïvely, $W_V$ is square ($d_{\text{model}} \times d_{\text{model}}$, i.e. $12{,}288^2 \approx 150$M parameters in GPT-3) — more than the rest of a single head combined. In practice $W_V$ is factored into two matrices of rank $d_k$: a down-projection into the small key-query space, then an up-projection back to the embedding space. This matches the parameter budget of $W_Q$ and $W_K$ and constrains the value map to be low-rank — see multi-head-attention for why this matters across many heads.
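A sketch of the factoring in the running NumPy example (the down/up names are illustrative):

```python
W_V_down = rng.standard_normal((d_k, d_model))  # same shape budget as W_Q and W_K
W_V_up   = rng.standard_normal((d_model, d_k))  # maps back up to embedding space

W_V_factored = W_V_up @ W_V_down  # rank <= d_k, a small fraction of d_model^2 params
V = W_V_factored @ E              # used exactly like the square W_V above
```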

What heads actually do

In the 3b1b running example, one head “has adjectives update nouns”. Real heads do much weirder things. Documented patterns from interpretability work include:

  • Previous-token heads (attend strictly to the token immediately before).
  • Induction heads (copy patterns — if AB...A appears, predict B).
  • Positional heads, duplicate-token heads, anaphora resolution heads, and many more.

Most heads don’t have a clean interpretation at all — they’re just linear algebra that happened to be useful during training.

See also