Summary: Two variants of the attention head that differ in where the queries and keys come from: self-attention draws Q, K, V from a single sequence; cross-attention draws queries from one sequence and keys/values from another.

Everything covered in attention-mechanism and multi-head-attention is — strictly speaking — self-attention. GPT-style decoder-only LLMs use self-attention exclusively. Cross-attention appears in models that have to relate two sequences, the canonical example being an encoder–decoder translation model.

The only real difference

A single head computes, for each token pair $(i, j)$, an attention weight and then a weighted sum of values:

$$\alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right), \qquad o_i = \sum_j \alpha_{ij}\, v_j$$

What changes between the two variants is which embeddings get fed into the Q, K, V projections:

|  | Self-attention | Cross-attention |
| --- | --- | --- |
| Queries come from | sequence $A$ | sequence $A$ (the “reader”) |
| Keys come from | sequence $A$ | sequence $B$ (the “source”) |
| Values come from | sequence $A$ | sequence $B$ |
| Typical role | one sequence refines itself | $A$ pulls information out of $B$ |
| Causal masking? | yes (for autoregressive LMs) | usually no: $A$ and $B$ are different modalities/languages, so there is no “future” to leak |
| Used in | GPT, BERT, decoder self-attention of encoder-decoder models | encoder-decoder attention (translation), multimodal models (text querying image features), speech recognition |

The projection matrices $W_Q$, $W_K$, $W_V$ and the whole dot-product-softmax-weighted-sum machinery are identical. Only the inputs differ.
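To make that symmetry concrete, here is a minimal single-head sketch (PyTorch assumed; the class name `AttentionHead` and all shapes are illustrative, not a reference implementation). The head takes a query source and a separate key/value source; self-attention is simply the call where both arguments are the same tensor.

```python
import math
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """One attention head; works for self- and cross-attention alike."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_head, bias=False)  # W_Q
        self.w_k = nn.Linear(d_model, d_head, bias=False)  # W_K
        self.w_v = nn.Linear(d_model, d_head, bias=False)  # W_V

    def forward(self, q_input, kv_input, mask=None):
        # q_input:  (batch, len_A, d_model), the "reader" sequence A
        # kv_input: (batch, len_B, d_model), the "source" sequence B
        q = self.w_q(q_input)
        k = self.w_k(kv_input)
        v = self.w_v(kv_input)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:
            scores = scores.masked_fill(mask, float("-inf"))
        return scores.softmax(dim=-1) @ v  # (batch, len_A, d_head)

head = AttentionHead(d_model=512, d_head=64)
x = torch.randn(2, 10, 512)   # sequence A
y = torch.randn(2, 7, 512)    # sequence B
self_out = head(x, x)         # self-attention: Q, K, V all from x
cross_out = head(x, y)        # cross-attention: Q from x, K/V from y
```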

Concrete example: translation

In a French→English translation model:

  • The encoder runs self-attention over the French sentence to produce context-aware French embeddings.
  • The decoder generates English tokens. Inside each decoder block, cross-attention has:
    • Queries from the (partially-generated) English sequence — “what English word am I trying to produce next?”
    • Keys and values from the encoder’s French embeddings — “which French tokens are relevant to that English word?”
  • The resulting attention pattern is a soft alignment between the two languages, which is why early visualisations of attention in translation models looked like fuzzy word-alignment matrices.

The decoder also contains a self-attention sub-layer (causal, over the English tokens generated so far), so an encoder-decoder block typically has both a self-attention and a cross-attention head per layer.
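As a sketch of how those two sub-layers could sit together in one block (PyTorch assumed; `DecoderBlock`, the shapes, and the omitted feed-forward sub-layer are illustrative simplifications of a real Transformer decoder layer):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Causal self-attention over the target, then cross-attention to the encoder."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tgt, enc_out):
        # 1) Causal self-attention over the English tokens generated so far.
        L = tgt.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        sa, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        tgt = self.norm1(tgt + sa)
        # 2) Cross-attention: English queries against French keys/values; no mask.
        ca, _ = self.cross_attn(tgt, enc_out, enc_out)
        return self.norm2(tgt + ca)

block = DecoderBlock()
english = torch.randn(1, 5, 512)  # partially generated target sequence
french = torch.randn(1, 9, 512)   # encoder output over the source sentence
out = block(english, french)      # (1, 5, 512)
```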

Why there’s no mask in cross-attention

Causal masking in self-attention exists to stop position $i$ from cheating by peeking at positions $j > i$ during parallel training — the tokens $i$ is attending to are the same tokens it’s trying to predict.

In cross-attention, $A$ is attending to a different sequence $B$ (e.g. the full source sentence, already encoded). There is no notion of “future” in $B$ relative to $A$, so no masking is needed. The whole of $B$ is always visible to every position in $A$.
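A small illustration of the shape argument (PyTorch assumed, lengths illustrative): a causal mask is a square matrix over one sequence’s own positions, whereas cross-attention scores form a rectangular matrix with no position-to-position correspondence to mask against.

```python
import torch

len_A, len_B = 5, 9

# Self-attention: square (len_A, len_A) mask; True = "not allowed to attend".
causal = torch.triu(torch.ones(len_A, len_A, dtype=torch.bool), diagonal=1)

# Cross-attention: scores are (len_A, len_B); every target row sees the
# whole source, so the softmax runs over all of B with nothing masked out.
scores = torch.randn(len_A, len_B)
weights = scores.softmax(dim=-1)
assert torch.allclose(weights.sum(dim=-1), torch.ones(len_A))
```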

When you’d reach for each

  • Self-attention only → decoder-only LLMs (GPT family), encoder-only models (BERT). Good when there is exactly one sequence and the task is either next-token prediction or classification over that sequence.
  • Self + cross-attention → encoder-decoder models (original Transformer, T5, translation models, Whisper). Good when input and output are distinct sequences and you need to align them.
  • Cross-attention as a bridge across modalities → vision-language models (text queries, image keys/values), speech models, retrieval-augmented setups. The “two sequences” don’t have to be two languages — they can be any two streams of embeddings.

See also