Summary: Two variants of the attention head that differ in where the queries, keys, and values come from: self-attention draws Q, K, V from a single sequence; cross-attention draws queries from one sequence and keys/values from another.
Everything covered in attention-mechanism and multi-head-attention is — strictly speaking — self-attention. GPT-style decoder-only LLMs use self-attention exclusively. Cross-attention appears in models that have to relate two sequences, the canonical example being an encoder–decoder translation model.
## The only real difference
A single head computes, for each token pair $(i, j)$:

$$\alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right), \qquad \mathrm{output}_i = \sum_j \alpha_{ij}\, v_j$$
What changes between the two variants is which embeddings get fed into the Q, K, V projections:
| | Self-attention | Cross-attention |
|---|---|---|
| Queries come from | sequence $X$ | sequence $X$ (the “reader”) |
| Keys come from | sequence $X$ | sequence $Y$ (the “source”) |
| Values come from | sequence $X$ | sequence $Y$ |
| Typical role of $X$, $Y$ | one sequence refines itself | $X$ pulls information out of $Y$ |
| Causal masking? | yes (for autoregressive LMs) | usually no — $X$ and $Y$ are different modalities/languages, so there is no “future” to leak |
| Used in | GPT, BERT, decoder self-attention of encoder-decoder models | encoder-decoder attention (translation), multimodal models (text querying image features), speech recognition |
The projection matrices $W_Q$, $W_K$, $W_V$ and the whole dot-product-softmax-weighted-sum machinery are identical. Only the inputs differ.
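That symmetry is easy to see in code. Below is a minimal NumPy sketch of a single untrained head (the function name, the toy dimensions, and the random weights are illustrative, not from any library): self-attention is simply the call where the query sequence and the key/value sequence are the same array.

```python
import numpy as np

def attention(x_q, x_kv, W_q, W_k, W_v):
    """One attention head. x_q supplies queries; x_kv supplies keys and values.
    Self-attention is the special case where x_q and x_kv are the same sequence."""
    Q = x_q @ W_q                                   # (len_q,  d_k)
    K = x_kv @ W_k                                  # (len_kv, d_k)
    V = x_kv @ W_v                                  # (len_kv, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (len_q, len_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (len_q, d_k)

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
english = rng.normal(size=(5, d_model))   # 5 "target" tokens
french  = rng.normal(size=(7, d_model))   # 7 "source" tokens

self_out  = attention(english, english, W_q, W_k, W_v)  # shape (5, 8)
cross_out = attention(english, french,  W_q, W_k, W_v)  # shape (5, 8)
```

Note that the output length is always the query length: cross-attention produces one vector per target token, no matter how long the source is.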
## Concrete example: translation
In a French→English translation model:
- The encoder runs self-attention over the French sentence to produce context-aware French embeddings.
- The decoder generates English tokens. Inside each decoder block, cross-attention has:
  - Queries from the (partially-generated) English sequence — “what English word am I trying to produce next?”
  - Keys and values from the encoder’s French embeddings — “which French tokens are relevant to that English word?”
- The resulting attention pattern is a soft alignment between the two languages, which is why early visualisations of attention in translation models looked like fuzzy word-alignment matrices.
The decoder also contains a self-attention sub-layer (causal, over the English tokens generated so far), so an encoder-decoder block typically has both a self-attention and a cross-attention head per layer.
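The sub-layer ordering inside a decoder block can be sketched as a self-contained toy (identity projections stand in for learned weight matrices, residual connections are kept, layer norm and feed-forward are omitted; all names here are illustrative):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(x_q, x_kv, mask=None):
    # Identity projections keep the sketch short; real blocks learn W_q, W_k, W_v.
    scores = x_q @ x_kv.T / np.sqrt(x_kv.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # blocked positions get ~zero weight
    return softmax(scores) @ x_kv

def decoder_block(english, french_enc):
    n = len(english)
    causal = np.tril(np.ones((n, n), dtype=bool))            # position i sees j <= i
    english = english + head(english, english, mask=causal)  # causal self-attention
    english = english + head(english, french_enc)            # cross-attention, no mask
    return english

english = np.random.default_rng(1).normal(size=(4, 8))   # 4 target tokens so far
french  = np.random.default_rng(2).normal(size=(6, 8))   # 6 encoded source tokens
out = decoder_block(english, french)                     # shape (4, 8)
```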
## Why there’s no mask in cross-attention
Causal masking in self-attention exists to stop position $i$ from cheating by peeking at positions $j > i$ during parallel training — the tokens $i$ is attending to are the same tokens it’s trying to predict.
In cross-attention, $X$ is attending to a different sequence $Y$ (e.g. the full source sentence, already encoded). There is no notion of “future” in $Y$ relative to $X$, so no masking is needed. The whole of $Y$ is always visible to every position in $X$.
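The asymmetry is easy to verify numerically. In this small NumPy check (illustrative code, not from a library), the causal self-attention weight matrix carries zero weight above the diagonal, while the cross-attention weights over the source are dense:

```python
import numpy as np

def attn_weights(x_q, x_kv, causal=False):
    scores = x_q @ x_kv.T / np.sqrt(x_kv.shape[-1])
    if causal:
        # Only valid when x_q and x_kv are the same sequence (square score matrix).
        scores = np.where(np.tril(np.ones(scores.shape, dtype=bool)), scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
tgt, src = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))

w_self  = attn_weights(tgt, tgt, causal=True)   # (4, 4), upper triangle is zero
w_cross = attn_weights(tgt, src)                # (4, 6), every entry positive

assert np.allclose(np.triu(w_self, k=1), 0)     # no weight on "future" target tokens
assert (w_cross > 0).all()                      # full source always visible
```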
## When you’d reach for each
- Self-attention only → decoder-only LLMs (GPT family), encoder-only models (BERT). Good when there is exactly one sequence and the task is either next-token prediction or classification over that sequence.
- Self + cross-attention → encoder-decoder models (original Transformer, T5, translation models, Whisper). Good when input and output are distinct sequences and you need to align them.
- Cross-attention as a bridge across modalities → vision-language models (text queries, image keys/values), speech models, retrieval-augmented setups. The “two sequences” don’t have to be two languages — they can be any two streams of embeddings.
## See also
- attention-mechanism — single-head attention mechanics (Q/K/V, softmax, masking)
- multi-head-attention — running many heads in parallel per block
- transformer-architecture — where attention blocks sit in the overall stack