Summary: The operation that turns a discrete token ID into a high-dimensional real-valued vector, where directions in the space tend to encode semantic features that the model has found useful during training.

The embedding matrix

Every transformer begins with an embedding matrix W_E, with one column per vocabulary token and one row per embedding dimension:

W_E ∈ ℝ^(d_embed × n_vocab)

Embedding a token is literally a column lookup: token ID i becomes column i of W_E. No computation, just indexing. The values are learned via backpropagation like any other parameter.

For GPT-3: 12,288 rows and 50,257 columns, so W_E holds ~617M parameters — already more than most pre-transformer models had in total.
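A minimal sketch of the lookup in NumPy (toy sizes and made-up token IDs; the GPT-3 shape is noted in the comment):

```python
import numpy as np

# Toy sizes; GPT-3 uses d_embed = 12,288 and n_vocab = 50,257 (~617M parameters).
d_embed, n_vocab = 8, 1000
W_E = 0.02 * np.random.randn(d_embed, n_vocab)  # stand-in for the learned weights

token_ids = np.array([42, 7, 900])  # hypothetical token IDs from a tokenizer
embeddings = W_E[:, token_ids]      # the "embedding layer" is pure column indexing
print(embeddings.shape)             # (8, 3): one d_embed-dimensional column per token
```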

Directions encode meaning

This is the empirically remarkable fact about learned embeddings: after training, directions in the embedding space correspond to semantic features.

The classic examples (from word2vec-era models but also present in transformer embeddings):

  • woman − man ≈ queen − king — a “gender” direction exists.
  • cats − cat is approximately a “plurality” direction; its dot product with one, two, three, four increases monotonically.
  • Italy − Germany + Hitler ≈ Mussolini — the model has learned “country of origin” as a direction it can translate along.

These are not designed. They emerge because representing related concepts along a shared axis lets downstream layers generalise cheaply — the model learns to exploit linearity because linearity is what the rest of the network is good at.

The classic examples are approximate

The “queen” example is famous but imperfect — in real models the nearest neighbour to king − man + woman is often still king. Family relations (father/mother, uncle/aunt) illustrate the idea more cleanly. The underlying claim is directional, not exact-point-arithmetic.
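One way to poke at these directional claims yourself, assuming you have a pretrained word-vector file such as the classic GoogleNews word2vec binary on disk (the file name below is the conventional one, not something this note provides):

```python
import numpy as np
from gensim.models import KeyedVectors

# Multi-gigabyte download; any word2vec-format vector file will do.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Gender direction: nearest neighbours of king − man + woman by cosine similarity.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Plurality direction: its dot product with number words should increase.
plural = wv["cats"] - wv["cat"]
for word in ["one", "two", "three", "four"]:
    print(word, float(np.dot(plural, wv[word])))
```

Note that most_similar excludes the query words from its answers, which is part of why the textbook demo comes back with queen; scoring every vector directly often still ranks king first, exactly the caveat above.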

Dot product as alignment

The dot product is the workhorse similarity measure throughout transformers: for two vectors u and v, it is written u · v.

Element-wise, it looks like:

u · v = u₁v₁ + u₂v₂ + … + uₙvₙ

  • Positive → vectors point in a similar direction (aligned).
  • Zero → perpendicular (unrelated).
  • Negative → opposing directions.
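A small numeric illustration of the three cases (toy vectors, nothing model-specific):

```python
import numpy as np

u = np.array([1.0, 2.0, 0.0])

print(np.dot(u, np.array([2.0, 4.0, 0.0])))    # 10.0 → positive: aligned
print(np.dot(u, np.array([-2.0, 1.0, 5.0])))   #  0.0 → perpendicular
print(np.dot(u, np.array([-1.0, -2.0, 0.0])))  # -5.0 → negative: opposing
```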

Every time the transformer asks “how much does X relate to Y?”, the answer is computed as a dot product somewhere:

  • Attention uses qᵢ · kⱼ to score query–key alignment.
  • The MLP up-projection computes W_up x to ask “how much does this vector align with a particular learned direction?” (each row of W_up is one such direction).
  • The unembedding step uses W_U x, i.e. dot products against one row per vocabulary token, to produce logits.
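A sketch of where those dot products live, with made-up shapes and random matrices standing in for learned weights:

```python
import numpy as np

d_embed, n_vocab, seq_len = 64, 1000, 5
x = np.random.randn(seq_len, d_embed)  # residual-stream vectors, one per position

# Attention: every score is a query–key dot product.
W_Q = 0.02 * np.random.randn(d_embed, d_embed)
W_K = 0.02 * np.random.randn(d_embed, d_embed)
scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_embed)  # scores[i, j] = q_i · k_j (scaled)

# MLP up-projection: each row of W_up is a learned direction; the matmul is a batch of dot products.
W_up = 0.02 * np.random.randn(4 * d_embed, d_embed)
activations = x @ W_up.T  # activations[i, n] = x_i · (row n of W_up)

# Unembedding: one dot product per vocabulary token gives the logits.
W_U = 0.02 * np.random.randn(n_vocab, d_embed)
logits = x @ W_U.T  # logits[i, v] = x_i · (row v of W_U)

print(scores.shape, activations.shape, logits.shape)  # (5, 5) (5, 256) (5, 1000)
```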

Beyond words

Inside a transformer, an embedding is not static. As it flows through attention and MLP blocks, the residual-stream vector at a given position gets progressively refined:

  • The vector that started as “king” might end up pointing in a direction that encodes “a king, in Scotland, who murdered the previous king, described in Shakespearean language”.
  • The vector for “mole” starts identical in “American shrew mole”, “one mole of CO₂”, and “biopsy the mole” — only context-blending in later layers separates the three meanings.

In this sense the first-layer embedding is only a seed. The interesting representations live later in the stack.
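In code form the refinement is just repeated additive updates to one vector per position (a schematic only; attention_block and mlp_block are placeholder functions, not any particular implementation):

```python
def forward(residual, blocks):
    """Schematic residual stream: each block reads the current vectors and adds a refinement."""
    for attention_block, mlp_block in blocks:
        residual = residual + attention_block(residual)  # blend in context from other positions
        residual = residual + mlp_block(residual)        # refine each position on its own
    return residual  # started as per-token embeddings, ends as context-laden representations
```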

Positional information

Embeddings also have to carry where a token is, not just what it is. The 3b1b series treats positional encoding as “part of the embedding” without detailing the specific scheme — different transformer variants use learned position embeddings, sinusoidal encodings, rotary (RoPE), or ALiBi. Flag for a later source that covers them.
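For later reference, a minimal sketch of just one of those schemes, the fixed sinusoidal encoding from the original Transformer paper, which is simply added to the token embedding before the first block (the other schemes work differently and are not covered here):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_embed):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_embed, 2)[None, :]  # (1, d_embed / 2), i.e. the even indices 2i
    angles = positions / np.power(10_000.0, dims / d_embed)
    pe = np.zeros((seq_len, d_embed))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# x = token_embeddings + sinusoidal_positions(seq_len, d_embed)  # shapes: (seq_len, d_embed)
```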

See also