Summary: Two variants of the attention head that differ in where the queries, keys, and values come from: self-attention draws Q, K, V from a single sequence; cross-attention draws queries from one sequence and keys/values from another.
Everything covered in attention-mechanism and multi-head-attention is — strictly speaking — self-attention. GPT-style decoder-only LLMs use self-attention exclusively. Cross-attention appears in models that have to relate two sequences, the canonical example being an encoder–decoder translation model.
## The only real difference
A single head computes, for each token pair $(i, j)$:

$$\alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right), \qquad \mathrm{output}_i = \sum_j \alpha_{ij}\, v_j$$
What changes between the two variants is which embeddings get fed into the Q, K, V projections:
| | Self-attention | Cross-attention |
|---|---|---|
| Queries come from | sequence $X$ | sequence $X$ (the “reader”) |
| Keys come from | sequence $X$ | sequence $Y$ (the “source”) |
| Values come from | sequence $X$ | sequence $Y$ |
| Typical role of $X$, $Y$ | one sequence refines itself | $X$ pulls information out of $Y$ |
| Causal masking? | yes (for autoregressive LMs) | usually no — $X$ and $Y$ are different modalities/languages, so there is no “future” to leak |
| Used in | GPT, BERT, decoder self-attention of encoder-decoder models | encoder-decoder attention (translation), multimodal models (text querying image features), speech recognition |
The projection matrices $W_Q$, $W_K$, $W_V$ and the whole dot-product-softmax-weighted-sum machinery are identical. Only the inputs differ.
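That symmetry is easy to see in code. Below is a minimal NumPy sketch of a single untrained head (the function name, the toy dimensions, and the random weights are illustrative, not from any library): self-attention is simply the call where the query sequence and the key/value sequence are the same array.

```python
import numpy as np

def attention(x_q, x_kv, W_q, W_k, W_v):
    """One attention head. x_q supplies queries; x_kv supplies keys and values.
    Self-attention is the special case where x_q and x_kv are the same sequence."""
    Q = x_q @ W_q                                   # (len_q,  d_k)
    K = x_kv @ W_k                                  # (len_kv, d_k)
    V = x_kv @ W_v                                  # (len_kv, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (len_q, len_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (len_q, d_k)

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
english = rng.normal(size=(5, d_model))   # 5 "target" tokens
french  = rng.normal(size=(7, d_model))   # 7 "source" tokens

self_out  = attention(english, english, W_q, W_k, W_v)  # shape (5, 8)
cross_out = attention(english, french,  W_q, W_k, W_v)  # shape (5, 8)
```

Note that the output length is always the query length: cross-attention produces one vector per target token, no matter how long the source is.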
## Concrete example: translation
In a French→English translation model:
- The encoder runs self-attention over the French sentence to produce context-aware French embeddings.
- The decoder generates English tokens. Inside each decoder block, cross-attention has:
  - Queries from the (partially-generated) English sequence — “what English word am I trying to produce next?”
  - Keys and values from the encoder’s French embeddings — “which French tokens are relevant to that English word?”
- The resulting attention pattern is a soft alignment between the two languages, which is why early visualisations of attention in translation models looked like fuzzy word-alignment matrices.
The decoder also contains a self-attention sub-layer (causal, over the English tokens generated so far), so an encoder-decoder block typically has both a self-attention and a cross-attention head per layer.
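The sub-layer ordering inside a decoder block can be sketched as a self-contained toy (identity projections stand in for learned weight matrices, residual connections are kept, layer norm and feed-forward are omitted; all names here are illustrative):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(x_q, x_kv, mask=None):
    # Identity projections keep the sketch short; real blocks learn W_q, W_k, W_v.
    scores = x_q @ x_kv.T / np.sqrt(x_kv.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # blocked positions get ~zero weight
    return softmax(scores) @ x_kv

def decoder_block(english, french_enc):
    n = len(english)
    causal = np.tril(np.ones((n, n), dtype=bool))            # position i sees j <= i
    english = english + head(english, english, mask=causal)  # causal self-attention
    english = english + head(english, french_enc)            # cross-attention, no mask
    return english

english = np.random.default_rng(1).normal(size=(4, 8))   # 4 target tokens so far
french  = np.random.default_rng(2).normal(size=(6, 8))   # 6 encoded source tokens
out = decoder_block(english, french)                     # shape (4, 8)
```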
## Why there’s no mask in cross-attention
Causal masking in self-attention exists to stop position $i$ from cheating by peeking at positions $j > i$ during parallel training — the tokens $i$ is attending to are the same tokens it’s trying to predict.
In cross-attention, $X$ is attending to a different sequence $Y$ (e.g. the full source sentence, already encoded). There is no notion of “future” in $Y$ relative to $X$, so no masking is needed. The whole of $Y$ is always visible to every position in $X$.
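The asymmetry is easy to verify numerically. In this small NumPy check (illustrative code, not from a library), the causal self-attention weight matrix carries zero weight above the diagonal, while the cross-attention weights over the source are dense:

```python
import numpy as np

def attn_weights(x_q, x_kv, causal=False):
    scores = x_q @ x_kv.T / np.sqrt(x_kv.shape[-1])
    if causal:
        # Only valid when x_q and x_kv are the same sequence (square score matrix).
        scores = np.where(np.tril(np.ones(scores.shape, dtype=bool)), scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
tgt, src = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))

w_self  = attn_weights(tgt, tgt, causal=True)   # (4, 4), upper triangle is zero
w_cross = attn_weights(tgt, src)                # (4, 6), every entry positive

assert np.allclose(np.triu(w_self, k=1), 0)     # no weight on "future" target tokens
assert (w_cross > 0).all()                      # full source always visible
```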
## When you’d reach for each
- Self-attention only → decoder-only LLMs (GPT family), encoder-only models (BERT). Good when there is exactly one sequence and the task is either next-token prediction or classification over that sequence.
- Self + cross-attention → encoder-decoder models (original Transformer, T5, translation models, Whisper). Good when input and output are distinct sequences and you need to align them.
- Cross-attention as a bridge across modalities → vision-language models (text queries, image keys/values), speech models, retrieval-augmented setups. The “two sequences” don’t have to be two languages — they can be any two streams of embeddings.
## See also
- attention-mechanism — single-head attention mechanics (Q/K/V, softmax, masking)
- multi-head-attention — running many heads in parallel per block
- transformer-architecture — where attention blocks sit in the overall stack