Summary: The final step of a transformer that takes the refined last-token vector out of the residual stream and projects it into a score (logits) for every possible next token, ready for softmax to produce a probability distribution.

The operation

After all attention and MLP blocks have run, the last vector in the sequence, $x_{\text{last}}$, is multiplied by an unembedding matrix $W_U$:

$$\text{logits} = W_U \, x_{\text{last}}$$

$W_U$ has one row per vocabulary token and one column per embedding dimension (unlike $W_E$, which has one column per token — it’s the transposed shape):

$$W_U \in \mathbb{R}^{n_{\text{vocab}} \times d_{\text{embed}}}$$

For GPT-3: $50{,}257 \times 12{,}288 \approx 617{,}558{,}016$ parameters, symmetric in size to $W_E$.

Each entry of the output is the dot product of one row of $W_U$ with $x_{\text{last}}$ — i.e. how aligned the final vector is with each token’s “unembedding direction”. These raw alignment scores are the logits, which then get passed through softmax to produce the next-token probability distribution.
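The whole step fits in a few lines of NumPy. A toy-scale sketch — the matrix and vector here are random stand-ins, and GPT-3's actual shapes are 50,257 × 12,288:

```python
import numpy as np

rng = np.random.default_rng(0)

d_embed, n_vocab = 8, 20                   # toy sizes; GPT-3 uses 12,288 and 50,257
W_U = rng.normal(size=(n_vocab, d_embed))  # unembedding: one row per vocabulary token
x_last = rng.normal(size=d_embed)          # final residual-stream vector

logits = W_U @ x_last                      # dot product of each row with x_last
probs = np.exp(logits - logits.max())      # numerically stable softmax
probs /= probs.sum()                       # probs now sums to 1
```

Each entry of `logits` is one row's alignment score; softmax then turns the whole vector into a distribution over the vocabulary.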

*Image: unembedding matrix and softmax operation in GPT-3*

  • $W_U$ maps a single residual stream vector (i.e. of length $d_{\text{embed}} = 12{,}288$)
  • to a logits vector of length $n_{\text{vocab}} = 50{,}257$ (i.e. one score per vocabulary token)
  • via a matrix multiply.

💡 Why only the last vector?

The transformer computes a full vector for every position in the context, but at inference time we only care about the prediction at the final position (the next token after the input). So we only unembed the last position’s vector.

During training things are different: every position simultaneously predicts its next token. So training runs the unembedding step at every position in the sequence, producing $n$ loss signals per forward pass rather than one. This is what makes training efficient — a length-$n$ sequence yields $n$ training examples for the cost of one forward pass through the stack.
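Under toy assumptions (random activations and targets standing in for real ones), unembedding every position and computing each position's cross-entropy loss looks like:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_embed, n_vocab = 5, 8, 20             # toy sizes
X = rng.normal(size=(n, d_embed))          # one output vector per position
W_U = rng.normal(size=(n_vocab, d_embed))  # shared unembedding matrix
targets = rng.integers(n_vocab, size=n)    # each position's actual next token

logits = X @ W_U.T                         # (n, n_vocab): unembed every position
m = logits.max(axis=1, keepdims=True)      # numerically stable log-softmax
log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
losses = -log_probs[np.arange(n), targets]  # n loss signals from one forward pass
```

One matrix multiply unembeds all $n$ positions at once, which is why the extra supervision is nearly free.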

At inference, the other positions’ output vectors are computed but discarded. They did their job earlier, when they contributed context to the positions after them through attention.

Tied vs untied weights

In some implementations, $W_U$ is tied to $W_E$ — they share parameters (one stored as the transpose of the other). This:

  • Halves the parameter count spent on embeddings/unembeddings.
  • Encodes the prior that “the direction for embedding a token” and “the direction for predicting it” should be the same.

Other implementations (GPT-3 included, per 3b1b’s tally) keep them untied — two independent matrices of the same shape, ~1.2B total parameters between them.
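A minimal sketch of what tying means: the two names refer to one array, so there is nothing extra to store or train. (In PyTorch this is commonly done by assigning the language-model head's weight to be the embedding's weight tensor; the NumPy version below just aliases the array.)

```python
import numpy as np

d_embed, n_vocab = 8, 20            # toy sizes
W_E = np.zeros((n_vocab, d_embed))  # embedding matrix: one row per token

W_U = W_E                           # tied: same storage, zero extra parameters
W_E[3, 0] = 1.0                     # an update to the embedding...
assert W_U[3, 0] == 1.0             # ...is immediately visible via the unembedding
```

Untied weights would instead allocate a second, independent `(n_vocab, d_embed)` array — doubling the parameter count spent on this part of the model.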

Intuition via dot products

Recall from above:

Each entry of the output is the dot product of one row of $W_U$ with $x_{\text{last}}$ — i.e. how aligned the final vector is with each token’s “unembedding direction”.

Because unembedding is literally “dot product against each row”, you can read $W_U$ as a bank of learned “if you see this direction in the residual stream, it’s likely to be followed by this token” detectors. The final-layer vector is a weighted cocktail of meanings; the row for “Snape” gives a high logit when that cocktail points toward “villainous Hogwarts potions master who is the hero’s least favourite professor”.

This also means any concept the model wants to make predictable must be a direction the unembedding can read — a nontrivial constraint on how the model uses its embedding space.

See also