Summary: The operation that turns a discrete token ID into a high-dimensional real-valued vector, where directions in the space tend to encode semantic features that the model has found useful during training.

The embedding matrix

Every transformer begins with an embedding matrix W_E, with one column per vocabulary token and one row per embedding dimension:

W_E ∈ ℝ^(d_embed × n_vocab)

Embedding a token is literally a column lookup: token ID i becomes column i of W_E. No computation, just indexing. The values are learned via backpropagation like any other parameter.

For GPT-3: 12,288 rows and 50,257 columns, so W_E holds ~617M parameters — already more than most pre-transformer models had in total.
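A minimal sketch of the lookup in NumPy (toy sizes and made-up token IDs; the GPT-3 shape is noted in the comment):

```python
import numpy as np

# Toy sizes; GPT-3 uses d_embed = 12,288 and n_vocab = 50,257 (~617M parameters).
d_embed, n_vocab = 8, 1000
W_E = 0.02 * np.random.randn(d_embed, n_vocab)  # stand-in for the learned weights

token_ids = np.array([42, 7, 900])  # hypothetical token IDs from a tokenizer
embeddings = W_E[:, token_ids]      # the "embedding layer" is pure column indexing
print(embeddings.shape)             # (8, 3): one d_embed-dimensional column per token
```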

Directions encode meaning

This is the empirically remarkable fact about learned embeddings: after training, directions in the embedding space correspond to semantic features.

The classic examples (from word2vec-era models but also present in transformer embeddings):

  • woman − man ≈ queen − king — a “gender” direction exists.
  • cats − cat is approximately a “plurality” direction; its dot product with one, two, three, four increases monotonically.
  • Italy − Germany + Hitler ≈ Mussolini — the model has learned “country of origin” as a direction it can translate along.

These are not designed. They emerge because representing related concepts along a shared axis lets downstream layers generalise cheaply — the model learns to exploit linearity because linearity is what the rest of the network is good at.

The classic examples are approximate

The “queen” example is famous but imperfect — in real models the nearest neighbour to king − man + woman is often still king. Family relations (father/mother, uncle/aunt) illustrate the idea more cleanly. The underlying claim is directional, not exact-point-arithmetic.
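One way to poke at these directional claims yourself, assuming you have a pretrained word-vector file such as the classic GoogleNews word2vec binary on disk (the file name below is the conventional one, not something this note provides):

```python
import numpy as np
from gensim.models import KeyedVectors

# Multi-gigabyte download; any word2vec-format vector file will do.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Gender direction: nearest neighbours of king − man + woman by cosine similarity.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Plurality direction: its dot product with number words should increase.
plural = wv["cats"] - wv["cat"]
for word in ["one", "two", "three", "four"]:
    print(word, float(np.dot(plural, wv[word])))
```

Note that most_similar excludes the query words from its answers, which is part of why the textbook demo comes back with queen; scoring every vector directly often still ranks king first, exactly the caveat above.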

Dot product as alignment

The dot product is the workhorse similarity measure throughout transformers: for two vectors u and v, it is written u · v.

Element-wise, it looks like:

u · v = u₁v₁ + u₂v₂ + … + uₙvₙ

  • Positive → vectors point in a similar direction (aligned).
  • Zero → perpendicular (unrelated).
  • Negative → opposing directions.
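A small numeric illustration of the three cases (toy vectors, nothing model-specific):

```python
import numpy as np

u = np.array([1.0, 2.0, 0.0])

print(np.dot(u, np.array([2.0, 4.0, 0.0])))    # 10.0 → positive: aligned
print(np.dot(u, np.array([-2.0, 1.0, 5.0])))   #  0.0 → perpendicular
print(np.dot(u, np.array([-1.0, -2.0, 0.0])))  # -5.0 → negative: opposing
```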

Every time the transformer asks “how much does X relate to Y?”, the answer is computed as a dot product somewhere:

  • Attention uses qᵢ · kⱼ to score query–key alignment.
  • The MLP up-projection computes W_up x to ask “how much does this vector align with a particular learned direction?” (each row of W_up is one such direction).
  • The unembedding step uses W_U x, i.e. dot products against one row per vocabulary token, to produce logits.
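A sketch of where those dot products live, with made-up shapes and random matrices standing in for learned weights:

```python
import numpy as np

d_embed, n_vocab, seq_len = 64, 1000, 5
x = np.random.randn(seq_len, d_embed)  # residual-stream vectors, one per position

# Attention: every score is a query–key dot product.
W_Q = 0.02 * np.random.randn(d_embed, d_embed)
W_K = 0.02 * np.random.randn(d_embed, d_embed)
scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_embed)  # scores[i, j] = q_i · k_j (scaled)

# MLP up-projection: each row of W_up is a learned direction; the matmul is a batch of dot products.
W_up = 0.02 * np.random.randn(4 * d_embed, d_embed)
activations = x @ W_up.T  # activations[i, n] = x_i · (row n of W_up)

# Unembedding: one dot product per vocabulary token gives the logits.
W_U = 0.02 * np.random.randn(n_vocab, d_embed)
logits = x @ W_U.T  # logits[i, v] = x_i · (row v of W_U)

print(scores.shape, activations.shape, logits.shape)  # (5, 5) (5, 256) (5, 1000)
```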

Beyond words

Inside a transformer, an embedding is not static. As it flows through attention and MLP blocks, the residual-stream vector at a given position gets progressively refined:

  • The vector that started as “king” might end up pointing in a direction that encodes “a king, in Scotland, who murdered the previous king, described in Shakespearean language”.
  • The vector for “mole” starts identical in “American shrew mole”, “one mole of CO₂”, and “biopsy the mole” — only context-blending in later layers separates the three meanings.

In this sense the first-layer embedding is only a seed. The interesting representations live later in the stack.
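In code form the refinement is just repeated additive updates to one vector per position (a schematic only; attention_block and mlp_block are placeholder functions, not any particular implementation):

```python
def forward(residual, blocks):
    """Schematic residual stream: each block reads the current vectors and adds a refinement."""
    for attention_block, mlp_block in blocks:
        residual = residual + attention_block(residual)  # blend in context from other positions
        residual = residual + mlp_block(residual)        # refine each position on its own
    return residual  # started as per-token embeddings, ends as context-laden representations
```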

Positional information

Embeddings also have to carry where a token is, not just what it is. The 3b1b series treats positional encoding as “part of the embedding” without detailing the specific scheme — different transformer variants use learned position embeddings, sinusoidal encodings, rotary (RoPE), or ALiBi. Flag for a later source that covers them.
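For later reference, a minimal sketch of just one of those schemes, the fixed sinusoidal encoding from the original Transformer paper, which is simply added to the token embedding before the first block (the other schemes work differently and are not covered here):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_embed):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_embed, 2)[None, :]  # (1, d_embed / 2), i.e. the even indices 2i
    angles = positions / np.power(10_000.0, dims / d_embed)
    pe = np.zeros((seq_len, d_embed))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# x = token_embeddings + sinusoidal_positions(seq_len, d_embed)  # shapes: (seq_len, d_embed)
```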

See also