Summary: The operation that turns a discrete token ID into a high-dimensional real-valued vector, where directions in the space tend to encode semantic features that the model has found useful during training.
The embedding matrix
Every transformer begins with an embedding matrix $W_E$, with one column per vocabulary token and one row per embedding dimension:

$$W_E \in \mathbb{R}^{d_{\text{model}} \times n_{\text{vocab}}}$$
Embedding a token is literally a column lookup: token ID $i$ becomes column $i$ of $W_E$. No computation, just indexing. The values are learned via backpropagation like any other parameter.
For GPT-3: $d_{\text{model}} = 12{,}288$ rows and $n_{\text{vocab}} = 50{,}257$ columns, so $W_E$ holds ~617M parameters — already more than most pre-transformer models had in total.
Abbreviated example of $W_E$ as used in GPT-3.
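A minimal sketch of the lookup in numpy, with toy sizes instead of GPT-3's and random values standing in for learned weights:

```python
import numpy as np

# Toy sizes, not GPT-3's (which would be d_model=12288, vocab=50257).
d_model, vocab_size = 8, 100

rng = np.random.default_rng(0)
# One column per vocabulary token, one row per embedding dimension,
# matching the layout above. In a real model these values are learned.
W_E = rng.normal(size=(d_model, vocab_size))

token_id = 42
embedding = W_E[:, token_id]   # the "embedding" is literally column 42
print(embedding.shape)         # (8,)
```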
Directions encode meaning
This is the empirically remarkable fact about learned embeddings: after training, directions in the embedding space correspond to semantic features.
The classic examples (from word2vec-era models but also present in transformer embeddings):
- $E(\text{woman}) - E(\text{man}) \approx E(\text{queen}) - E(\text{king})$ — a “gender” direction exists.
- $E(\text{cats}) - E(\text{cat})$ is approximately a “plurality” direction; its dot product with one, two, three, four increases monotonically.
- $E(\text{Hitler}) + E(\text{Italy}) - E(\text{Germany}) \approx E(\text{Mussolini})$ — the model has learned “country of origin” as a direction it can translate along.
These are not designed. They emerge because representing related concepts along a shared axis lets downstream layers generalise cheaply — the model learns to exploit linearity because linearity is what the rest of the network is good at.
The classic examples are approximate
The “queen” example is famous but imperfect — in real models the nearest neighbour to *king − man + woman* is often still *king*. Family relations (father/mother, uncle/aunt) illustrate the idea more cleanly. The underlying claim is directional, not exact point arithmetic.
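A toy illustration of both the arithmetic and the caveat, using hypothetical hand-built 4-dimensional vectors (real embeddings are learned and thousands of dimensions wide):

```python
import numpy as np

# Hypothetical embeddings, hand-built so that axis 0 acts as a crude
# "gender" direction. Purely illustrative values, not learned weights.
E = {
    "king":  np.array([ 1.0, 0.9, 0.1, 0.0]),
    "queen": np.array([-1.0, 0.9, 0.1, 0.0]),
    "man":   np.array([ 1.0, 0.1, 0.0, 0.2]),
    "woman": np.array([-1.0, 0.1, 0.0, 0.2]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = E["king"] - E["man"] + E["woman"]

# Rank the whole toy vocabulary by cosine similarity to the target.
ranked = sorted(E, key=lambda w: cosine(E[w], target), reverse=True)
print(ranked[0])  # "queen" wins in this hand-built toy...

# ...but per the caveat above, in real models "king" itself often wins
# unless the query words are excluded from the candidates.
candidates = [w for w in ranked if w not in ("king", "man", "woman")]
print(candidates[0])  # "queen"
```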
Dot product as alignment
The dot product is the workhorse similarity measure throughout transformers. Element-wise, it looks like:

$$\vec{u} \cdot \vec{v} = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n$$
- Positive → vectors point in a similar direction (aligned).
- Zero → perpendicular (unrelated).
- Negative → opposing directions.
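A quick numeric check of those three cases, with arbitrary example vectors:

```python
import numpy as np

u = np.array([1.0, 2.0, 0.0])

print(u @ np.array([ 2.0,  1.0, 0.0]))  #  4.0 -> aligned
print(u @ np.array([-2.0,  1.0, 0.0]))  #  0.0 -> perpendicular
print(u @ np.array([-1.0, -2.0, 0.0]))  # -5.0 -> opposing
```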
Every time the transformer asks “how much does X relate to Y?”, the answer is computed as a dot product somewhere:
- Attention uses $\vec{q} \cdot \vec{k}$ to score query–key alignment.
- The MLP up-projection computes $\vec{r} \cdot \vec{x}$ for each learned row $\vec{r}$, asking “how much does this vector align with a particular learned direction?”
- The unembedding step uses $W_U \vec{x}$, i.e. dot products against one row per vocabulary token, to produce logits. All three are sketched below.
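A sketch of all three, assuming toy sizes and random stand-in matrices (`W_Q`, `W_K`, `W_up`, `W_U` are placeholders for learned weights, not any real model's):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_mlp, seq_len, vocab_size = 16, 64, 4, 50

# Residual-stream vectors for a short toy sequence.
X = rng.normal(size=(seq_len, d_model))

# 1. Attention scoring: every query dotted with every key.
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
Q, K = X @ W_Q, X @ W_K
scores = Q @ K.T / np.sqrt(d_model)   # (seq_len, seq_len) alignment grid

# 2. MLP up-projection: each row of W_up is a learned direction, and
#    each output entry is a dot product against that direction.
W_up = rng.normal(size=(d_mlp, d_model))
pre_activations = W_up @ X[-1]        # (d_mlp,)

# 3. Unembedding: one dot product per vocabulary token gives logits.
W_U = rng.normal(size=(vocab_size, d_model))  # one row per token
logits = W_U @ X[-1]                  # (vocab_size,)

print(scores.shape, pre_activations.shape, logits.shape)
```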
Beyond words
Inside a transformer, an embedding is not static. As it flows through attention and MLP blocks, the residual-stream vector at a given position gets progressively refined:
- The vector that started as “king” might end up pointing in a direction that encodes “a king, in Scotland, who murdered the previous king, described in Shakespearean language”.
- The vector for “mole” starts identical in “American shrew mole”, “one mole of CO₂”, and “biopsy the mole” — only context-blending in later layers separates the three meanings.
In this sense the first-layer embedding is only a seed. The interesting representations live later in the stack.
Positional information
Embeddings also have to carry where a token is, not just what it is. The 3b1b series treats positional encoding as “part of the embedding” without detailing the specific scheme — different transformer variants use learned position embeddings, sinusoidal encodings, rotary (RoPE), or ALiBi. Flag for a later source that covers them.
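As a stopgap sketch until that source is added, here is one of the variants named above, the original Transformer's sinusoidal scheme (toy sizes, assumes an even d_model):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding from the original Transformer.
    Learned, rotary (RoPE), and ALiBi schemes all differ from this."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings so position travels with meaning.
# (Toy sizes; the token embeddings would come from the W_E lookup above.)
pe = sinusoidal_positions(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8)
```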
See also
- transformer-architecture — where embeddings sit in the pipeline
- unembedding — the inverse operation at the end
- attention-mechanism — how embeddings get refined
- superposition — why “one direction per concept” isn’t quite the full story