Summary: The operation inside a transformer that lets token embeddings communicate with each other, so that each token’s vector can be refined based on the context it appears in.
Mechanically: for every (query, key) pair, compute a dot-product alignment score, softmax the scores into weights, and use those weights to mix value vectors into the output.
Geometrically: allow the surrounding tokens to update a given token’s word-embedding vector (of dimensionality 12,288 in GPT-3), moving it from its original orientation to a different part of the high-dimensional embedding space that better captures its meaning in context (“fluffy” in “fluffy creature” augments the meaning that “creature” carries on its own)
💡 The problem attention solves
After word-embedding, every instance of “mole” has the same vector, regardless of whether it means a burrowing animal, a unit of substance, or a skin growth. Only context can disambiguate. Attention is the mechanism by which surrounding tokens inject information into each token’s vector so that downstream layers see a context-aware representation.
Crucially, this matters not just for the “interesting” tokens — it matters for the last token, because only the last vector in the sequence is used to produce the next-token prediction (see unembedding). That last vector has to absorb everything from the full context.
A single head of attention
A single “head” is parameterised by three learned matrices: $W_Q$, $W_K$, $W_V$. Given the sequence of token embeddings $\vec{E}_1, \dots, \vec{E}_n$ flowing into the block:
This describes self-attention
Everything on this page — queries, keys, and values all drawn from the same input token sequence — is technically the self-attention variant. GPT-style decoder-only LLMs use only self-attention.
A sibling variant, cross-attention, feeds queries from one sequence and keys/values from another. In the decoder of an encoder–decoder translation model, for example, queries come from the French being generated while keys/values come from the English source (likewise a text-transcript decoder attending to a speech-audio encoder). The machinery is otherwise identical. See self-attention-vs-cross-attention.
Step 1 — Queries and keys
Each token’s embedding, $\vec{E}_i$, is projected into two small vectors, $\vec{Q}_i$ and $\vec{K}_i$. These projections are done through the query-projection matrix ($W_Q$) and the key-projection matrix ($W_K$) respectively. Both matrices are of shape $d_{qk} \times d_{embed}$ (typically $d_{qk} \ll d_{embed}$; GPT-3 uses $d_{qk} = 128$ vs $d_{embed} = 12{,}288$):

$$\vec{Q}_i = W_Q \vec{E}_i \qquad\qquad \vec{K}_i = W_K \vec{E}_i$$
Dimensionality example with GPT-3:
- Query and Key projection matrices $W_Q$, $W_K$: $128 \times 12{,}288$ — wide matrices, which compress embedding vectors down to key-query-space
- Token embedding vector $\vec{E}_i$: $12{,}288 \times 1$ (column vector)
- Query and Key vectors $\vec{Q}_i$, $\vec{K}_i$: $128 \times 1$ — much shorter than the embedding vector (i.e. queries and keys exist in a lower-dimensional space than embeddings)
Query projection — a very wide matrix compresses a tall embedding into a short query vector:
Key projection — identical shape, different learned weights:
The output vector (128-dim) is ~96× shorter than the input (12,288-dim). The projection matrix has $128 \times 12{,}288 = 1{,}572{,}864$ (~1.6M) parameters.
Conceptual framing:
- Query ≈ “what am I looking for?” (e.g. the noun creature looks for preceding adjectives)
- Key ≈ “what do I offer?” (e.g. the adjective fluffy advertises itself as an adjective-in-the-preceding-position)
- Alternatively: $W_Q$ causes each token to ask “questions” (via its query vector), and $W_K$ makes all other tokens attempt to “answer” that question (with their key vectors).
Keep in mind this is an interpretation — what $W_Q$ and $W_K$ actually do is learned, and for most heads it is nowhere near this clean.
Note of convenience: $\vec{Q}_i$ and $\vec{K}_i$ are defined as column vectors (instead of row vectors)
We have chosen to define the individual token’s query and key vectors, $\vec{Q}_i$ and $\vec{K}_i$, as column vectors to look pretty 🙂. Literature commonly defines them as row vectors. A lot of the content of Steps 1–5 may appear transposed in the literature.
As a consequence of defining $\vec{Q}_i$ and $\vec{K}_i$ as column vectors,
- Concatenating all tokens’ query vectors as columns gives the matrix $Q$ (shape $128 \times n$):
- Concatenating all tokens’ key vectors as columns gives the matrix $K$ (shape $128 \times n$)
- In the literature (Attention is All You Need), $\vec{Q}_i$ and $\vec{K}_i$ are actually defined as row vectors, so the $Q$ and $K$ matrices in the literature are transposed relative to ours:
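A minimal NumPy sketch of Step 1, using GPT-3’s sizes and the column-vector convention above. The weights here are random stand-ins, not learned values; `W_Q`, `W_K`, and `E` are illustrative names:

```python
import numpy as np

d_embed, d_qk, n = 12_288, 128, 5            # GPT-3 sizes; n = a 5-token context
rng = np.random.default_rng(0)

W_Q = rng.standard_normal((d_qk, d_embed)) * 0.02   # wide 128 x 12,288 matrix
W_K = rng.standard_normal((d_qk, d_embed)) * 0.02   # same shape, different weights
E   = rng.standard_normal((d_embed, n))             # one 12,288-dim column per token

Q = W_Q @ E   # 128 x 5: each column is a token's query vector
K = W_K @ E   # 128 x 5: each column is a token's key vector

print(W_Q.size)           # 1,572,864 parameters per projection matrix
print(Q.shape, K.shape)   # (128, 5) (128, 5)
```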
Step 2 — Alignment scores
For every (query, key) pair (i.e. all tokens’ queries against all tokens’ keys), take the dot product. The score $\vec{K}_j \cdot \vec{Q}_i$ tells us how much token $i$ (with embedding vector $\vec{E}_i$) attends to token $j$ (with embedding vector $\vec{E}_j$):
Dimensionality example with GPT-3:
Query vector of token $i$’s embedding: $\vec{Q}_i$ ($128 \times 1$)
- Concatenating all Query vectors (columns) gives $Q$ ($128 \times n$), the matrix of all tokens’ Query vectors
Key vector of token $j$’s embedding: $\vec{K}_j$ ($128 \times 1$)
- Concatenating all Key vectors (columns) gives $K$ ($128 \times n$), the matrix of all tokens’ Key vectors
Score: $\vec{K}_j \cdot \vec{Q}_i$ — a single scalar
Full score grid: $K^\top Q$ ($n \times n$), where $n$ = context length
Single dot product (how much token $i$ attends to token $j$) — two 128-dim vectors collapse to one number:
- Full score grid — for an $n$-token context, the full score grid is $n \times n$. The attention pattern matrix will be the same shape:
- Notes:
- Column $i$ holds all scores for token $i$’s query against every key. This grid scales quadratically with context length — a 2,048-token context produces a $2{,}048 \times 2{,}048 \approx 4.2$M-entry grid.
- Column and row labels ($\vec{Q}_i$ and $\vec{K}_j$) refer to the definitions in Step 1 above
This produces an $n \times n$ grid. Column $i$ contains the scores of every token as a potential source of information for token $i$. Dot product as an alignment measure: large positive → strongly aligned, zero → unrelated, negative → anti-aligned.
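A sketch of Step 2 with toy sizes (random values, $d_{qk} = 4$ instead of 128 so the matrices stay readable): in the column convention, the whole score grid is a single matrix product.

```python
import numpy as np

d_qk, n = 4, 5                        # toy sizes for readability (GPT-3: 128, up to 2,048)
rng = np.random.default_rng(0)
Q = rng.standard_normal((d_qk, n))    # columns = query vectors (from Step 1)
K = rng.standard_normal((d_qk, n))    # columns = key vectors   (from Step 1)

scores = K.T @ Q                      # n x n score grid; entry (j, i) = K_j . Q_i
print(scores.shape)                   # (5, 5)
print(scores[1, 3])                   # how much token 3 attends to token 1 (0-indexed)
```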
Step 3 — Scale and softmax
Divide by $\sqrt{d_{qk}}$ (this is $\sqrt{128} \approx 11.3$ for GPT-3) for numerical stability (keeps the softmax out of saturated regions) and then apply softmax column by column so each column is a probability distribution:
Dimensionality example with GPT-3:
- Scale factor: $\sqrt{128} \approx 11.3$
- Scaled score grid: $\frac{K^\top Q}{\sqrt{128}}$ — same shape as the raw scores, just divided by 11.3
- Attention pattern: $A = \text{softmax}\!\left(\frac{K^\top Q}{\sqrt{128}}\right)$, column-wise — same shape, but each column now sums to 1
Take the 5-token phrase: “a big red ball bounced…”
Before softmax — raw scores divided by $\sqrt{128}$ (a 5-token example):
After column-wise softmax — each column becomes a probability distribution (sums to 1):
The scaling by $\sqrt{d_{qk}}$ keeps the scores moderate — without it, large dot products push softmax into near-one-hot territory, killing gradients.
The resulting grid is called the attention pattern.
Attention Pattern after the scaling and softmax process (without masking):
5-token phrase: “a big red ball bounced…”
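A sketch of Step 3 on a made-up 5×5 score grid (the numbers are illustrative, not the real values for “a big red ball bounced”): scale by $\sqrt{d_{qk}}$, then softmax each column so it sums to 1.

```python
import numpy as np

d_qk = 128
rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 5)) * 20        # made-up raw scores for 5 tokens

scaled = scores / np.sqrt(d_qk)                  # divide by sqrt(128), roughly 11.3
exp = np.exp(scaled - scaled.max(axis=0))        # subtract per-column max for stability
attention = exp / exp.sum(axis=0)                # column-wise softmax

print(attention.sum(axis=0))                     # every column sums to 1.0
```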
Step 4 — Masking (autoregressive only)
This step mostly relates to training. Also see the masking section below.
For a next-token-prediction transformer, tokens must not see the future — otherwise training would leak the answer. Before the softmax, set every entry whose key position comes after its query position (future influencing past) to $-\infty$; after softmax those entries become 0. This is called a causal mask.
Masking is what makes it safe to use every position as a training example simultaneously (parallel training). Without it, every position would be “told” what came next.
Attention Pattern after the scaling and softmax process (WITH masking):
5-token phrase: “a big red ball bounced…”
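A sketch of the causal mask in the column convention (rows are key positions $j$, columns are query positions $i$; toy score values): entries with $j > i$ are set to $-\infty$ before the softmax, so they become exactly 0 after it.

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
scaled = rng.standard_normal((n, n))     # already-scaled scores (toy values)

rows, cols = np.indices((n, n))          # rows = key position j, cols = query position i
scaled[rows > cols] = -np.inf            # mask: a key later than the query is off-limits

exp = np.exp(scaled - scaled.max(axis=0))
attention = exp / exp.sum(axis=0)        # column-wise softmax; masked entries become 0

print(np.round(attention, 2))            # upper-triangular: column i only uses tokens j <= i
```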
We now have the attention pattern, describing how much each token attends to (is relevant to updating) each other token. Now, let’s show how to update all the word embeddings with this new, richer, semantic meaning. It’s what allows “fluffy” in “fluffy creature” to augment the meaning that “creature” carries on its own.
Step 5 — Values and output
Project every token embedding once more, through the value projection matrix $W_V$, staying within embedding space:

$$\vec{V}_i = W_V \vec{E}_i$$

An interpretation of the value vector, $\vec{V}_i$: if the token is relevant to adjusting the meaning of some other token, what exactly should be added to the embedding of that other token to reflect this?
Dimensionality example with GPT-3 (for one head of attention)
Value matrix (conceptual): $W_V$, $12{,}288 \times 12{,}288$ — maps embedding space → embedding space
Value matrix (in practice, factored): $W_V = W_{V\uparrow} W_{V\downarrow}$, where
- $W_{V\downarrow}$ ($128 \times 12{,}288$) — maps down: embedding space → 128-dim, and
- $W_{V\uparrow}$ ($12{,}288 \times 128$) — maps up: 128-dim → embedding space
- For multiple attention heads, all “up-projections” are concatenated into one giant matrix, the output matrix $W_O$
- (see low-rank factoring and Terminology gotcha value matrix vs output matrix)
Token embedding vector $\vec{E}_i$: $12{,}288 \times 1$
Value vector $\vec{V}_i$: $12{,}288 \times 1$ — same dimensionality as the embedding
Conceptual (un-factored) — a square matrix maps a tall vector to an equally tall vector:
- Factored (actual) — down-project to 128-dim, then up-project back:
- The bottleneck through 128 dims means $W_V$ is low-rank — it can’t represent arbitrary transformations, only ones that pass through a 128-dim intermediate.
The refinement $\Delta\vec{E}_i$ added to token $i$ is the weighted sum of values ($\vec{V}_j$), weighted by the attention column for token $i$ (see Step 3):

$$\Delta\vec{E}_i = \sum_{j} A_{j,i}\,\vec{V}_j$$
Dimensionality example with GPT-3:
- Attention weights for token $i$: $A_{1,i}, \dots, A_{n,i}$ — one scalar per token in context, from column $i$ of the attention pattern
- Value vectors: $\vec{V}_1, \dots, \vec{V}_n$, each $12{,}288 \times 1$
- Refinement vector: $\Delta\vec{E}_i$, $12{,}288 \times 1$ — same shape as the embedding
- Weighted sum — for token $i$ in an $n$-token-long context, $n$ scalars (from the masked (upper triangular) attention pattern) weight $n$ value vectors, producing one vector $\Delta\vec{E}_i$:
Example: Take the 5-token context “a big red ball bounced”.
- Let $i = 4$, meaning we are interested in updating the token embedding for “ball”
- 5 scalars from the (masked) attention pattern will weight 5 value vectors (one per token), producing one vector $\Delta\vec{E}_4$:
- Recall the Attention Pattern (“with masking”) above; let’s take its 4th column ($i = 4$):
- Then compute $\Delta\vec{E}_4 = \sum_{j=1}^{4} A_{j,4}\,\vec{V}_j$, the update to “ball“‘s original embedding. Once updated, the new “ball” will include the richer meaning that comes from “big” and “red”, which “ball” attends to
- Tokens whose attention weight is near 0 contribute almost nothing; tokens near 1 dominate the update. The result is a 12,288-dim vector — a direction in embedding space that gets added to token $i$’s embedding.
This is added (residual-stream style) to the original embedding:

$$\vec{E}_i' = \vec{E}_i + \Delta\vec{E}_i$$
This attention-weighted update is done for all token embeddings in the context window (not just the “ball” token in our example!).
Dimensionality example with GPT-3:
- Original embedding: $\vec{E}_i$, $12{,}288 \times 1$
- Refinement from this head: $\Delta\vec{E}_i$, $12{,}288 \times 1$
- Updated embedding: $\vec{E}_i' = \vec{E}_i + \Delta\vec{E}_i$ — same shape, different (updated, richer, more nuanced) direction
Elementwise addition — the original embedding vector gets nudged:
This is a residual connection — the identity path means the model can learn to make $\Delta\vec{E}_i$ small (or zero) for positions where this head has nothing useful to add.
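A sketch of Step 5 end-to-end for one head (toy sizes, random stand-in weights, and a made-up attention column): project values, mix them with the masked attention column for “ball”, and add the result back to the original embedding.

```python
import numpy as np

d_embed, n = 8, 5                        # toy sizes (GPT-3: 12,288-dim, up to 2,048 tokens)
rng = np.random.default_rng(0)

E   = rng.standard_normal((d_embed, n))                # token embeddings as columns
W_V = rng.standard_normal((d_embed, d_embed)) * 0.1    # conceptual (un-factored) value matrix
V   = W_V @ E                                          # value vectors as columns

# Made-up 4th column (i = 4, "ball") of a masked attention pattern: sums to 1, future token = 0
attn_col = np.array([0.05, 0.40, 0.45, 0.10, 0.00])

delta_E   = V @ attn_col                 # weighted sum of value vectors: the refinement vector
E_updated = E[:, 3] + delta_E            # residual add: nudge "ball"'s original embedding
print(delta_E.shape, E_updated.shape)    # (8,) (8,)
```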
The compact form
Important
Note: Everything in the literature appears transposed, due to our earlier decision to define a single token’s $\vec{Q}_i$ and $\vec{K}_i$ as column vectors (see note above), which makes the literature’s definition of the $Q$ and $K$ matrices (i.e. for ALL tokens) differ from ours. Where the literature says $QK^\top$, our notes above say $K^\top Q$.
The compact notation below, from the paper, is an example of this. In our notation it would instead read $V\,\text{softmax}\!\left(\frac{K^\top Q}{\sqrt{d_k}}\right)$, per our definitions.
The whole thing, from the original paper Attention is All You Need:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, $V$ are stacks of all query, key, and value vectors across positions (one per row, in the paper’s convention). The softmax is understood to apply per query: per row here, per column in our earlier notation.
Dimensionality example with GPT-3:
Here each token’s vector is a row (the batched convention from the paper). For a context of $n$ tokens:
- Query matrix $Q$: $n \times 128$ — each row is one token’s 128-dim query
- Key matrix $K$: $n \times 128$ — each row is one token’s 128-dim key
- Value matrix $V$: $n \times 12{,}288$ — each row is one token’s 12,288-dim value
Step 1 — $QK^\top$ gives the $n \times n$ score grid:
Step 2 — softmax (per row, after scaling by $\sqrt{d_k}$) + multiply by $V$ gives the output:
The output is $n \times 12{,}288$ — one row per token, each row a 12,288-dim refinement vector. The inner dimension (128) cancels in the $QK^\top$ product; the inner dimension ($n$) cancels when the attention weights multiply $V$.
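The compact form as a sketch in the paper’s row convention (toy random inputs; `attention` is a hypothetical helper, not a library function), with the optional causal mask folded in:

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, with the softmax applied per row (one query per row)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # n x n score grid
    if causal:
        n = scores.shape[0]
        scores[np.triu_indices(n, k=1)] = -np.inf   # row i must not see columns j > i
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # n x d_v: one refinement row per token

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 128, 12_288
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
print(attention(Q, K, V, causal=True).shape)        # (5, 12288)
```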
Masking and efficient (parallelised) training
Every position in a sequence simultaneously predicts its own next token. So a sequence of length $n$ yields $n$ training signals at no extra cost — provided masking prevents position $i$ from cheating by looking at positions $> i$. This is a huge efficiency multiplier over training on “predict the last token only” one example at a time.
Example: A 12-token-long sequence contains 12 parallel training examples!
- 12-token-long sequence: To| date|,| the| cle|ve|rest| thinker| of| all| time| was…
- Contains the following 12 training examples:
- To…
- To| date…
- To| date|,…
- To| date|,| the…
- To| date|,| the| cle…
- To| date|,| the| cle|ve…
- To| date|,| the| cle|ve|rest…
- To| date|,| the| cle|ve|rest| thinker…
- To| date|,| the| cle|ve|rest| thinker| of…
- To| date|,| the| cle|ve|rest| thinker| of| all…
- To| date|,| the| cle|ve|rest| thinker| of| all| time…
- To| date|,| the| cle|ve|rest| thinker| of| all| time| was…
Also see the masking step above (mostly relevant to training only).
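A sketch of how these parallel examples are assembled in practice (the token ids below are made up, not a real tokenizer’s output): inputs are the sequence as-is, targets are the sequence shifted by one, and the causal mask guarantees the prediction at each position only saw earlier positions.

```python
import numpy as np

# Made-up token ids standing in for "To| date|,| the| cle|ve|rest| thinker| of| all| time| was"
seq = np.array([17, 204, 11, 262, 3021, 303, 4079, 5229, 286, 477, 640, 373])

inputs  = seq[:-1]   # what the model is shown
targets = seq[1:]    # what each position must predict (the sequence shifted by one)

for i in range(len(inputs)):
    print(f"given {inputs[:i + 1]} -> predict {targets[i]}")
# 11 (prefix, next-token) pairs live inside this 12-token snippet; the target for the
# full 12-token prefix is simply whatever token follows "was" in the training corpus.
```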
Context size and quadratic cost
The attention pattern has size $n \times n$, where $n$ is the context length. Doubling context quadruples attention compute and memory. This is the reason long-context models are nontrivial, and a major research area; approaches include Linear Attention, Sliding-Window Attention, State-Space Models, Sparse Attention mechanisms, Blockwise Attention, Linformer, Reformer, Ring Attention, Longformer, Adaptive Attention Span, etc.
Low-rank value factoring
Naïvely, $W_V$ is square ($12{,}288 \times 12{,}288$) and would have ~151M parameters in GPT-3 — more than the rest of a single head combined. In practice $W_V$ is factored into two matrices of rank 128: a down-projection into the small key-query space, then an up-projection back to the embedding space. This matches the parameter budget of $W_Q$ and $W_K$ combined and constrains the value map to be low-rank — see multi-head-attention for why this matters across many heads.
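A sketch of the factoring at GPT-3’s sizes (random stand-in weights; `W_V_down` and `W_V_up` are illustrative names for the two factors): the parameter count drops from ~151M for a square $W_V$ to ~3.1M for the factored pair, the same budget as $W_Q$ plus $W_K$.

```python
import numpy as np

d_embed, d_qk = 12_288, 128
rng = np.random.default_rng(0)

W_V_down = rng.standard_normal((d_qk, d_embed)) * 0.02   # 128 x 12,288: embedding -> 128-dim
W_V_up   = rng.standard_normal((d_embed, d_qk)) * 0.02   # 12,288 x 128: 128-dim -> embedding

square_params   = d_embed * d_embed              # 150,994,944 (~151M) for an un-factored W_V
factored_params = W_V_down.size + W_V_up.size    # 3,145,728 (~3.1M), same as W_Q + W_K
print(square_params, factored_params)

E_i = rng.standard_normal(d_embed)
V_i = W_V_up @ (W_V_down @ E_i)   # value vector forced through a 128-dim bottleneck (low rank)
print(V_i.shape)                  # (12288,)
```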
What heads actually do
In the 3b1b running example, one head “has adjectives update nouns”. Real heads do much weirder things. Documented patterns from interpretability work include:
- Previous-token heads (attend strictly to the token immediately before).
- Induction heads (copy patterns — if `AB...A` appears, predict `B`).
- Positional heads, duplicate-token heads, anaphora resolution heads, and many more.
Most heads don’t have a clean interpretation at all — they’re just linear algebra that happened to be useful during training.
See also
- self-attention-vs-cross-attention — the two variants of a single attention head
- multi-head-attention — parallel heads in one block
- transformer-architecture — how attention slots into the stack
- softmax — the normalisation used here
- word-embedding, unembedding — the data attention operates on