Summary: A high-level walkthrough of the transformer architecture — tokens, embeddings, repeated attention + MLP blocks, and the final unembedding + softmax — using GPT-3’s numbers throughout.

Key ideas

  • GPT = Generative Pre-trained Transformer. The chapter focuses on the decoder-only, next-token-prediction variant that underlies ChatGPT.
  • Pipeline. Input text → tokens → embedding vectors → many layers of attention + MLP → final vector → unembedding matrix → softmax → probability distribution over the next token (see the sketch after this list).
  • Deep learning premise. Models are layered transformations of real-valued tensors, parameterised by weight matrices. Without nonlinearities between matrix multiplications the whole model collapses to a single affine map.
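
A minimal sketch of this pipeline, with made-up toy dimensions and placeholder blocks (the names and shapes here are illustrative assumptions, not GPT-3’s actual implementation):

```python
import numpy as np

# Toy dimensions (hypothetical; far smaller than GPT-3's 50,257 / 12,288)
vocab_size, d_model, n_layers, seq_len = 1000, 64, 4, 8

rng = np.random.default_rng(0)
W_E = rng.normal(scale=0.02, size=(vocab_size, d_model))  # embedding matrix
W_U = rng.normal(scale=0.02, size=(vocab_size, d_model))  # unembedding matrix

def attention_block(x):
    # Placeholder for a real attention layer (covered in later chapters)
    return np.zeros_like(x)

def mlp_block(x):
    # Toy MLP; the ReLU nonlinearity is what keeps the stacked matrix
    # multiplications from collapsing into a single affine map
    W1 = rng.normal(scale=0.02, size=(d_model, 4 * d_model))
    W2 = rng.normal(scale=0.02, size=(4 * d_model, d_model))
    return np.maximum(x @ W1, 0) @ W2

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Input text -> token ids (the tokeniser itself is not shown)
tokens = rng.integers(0, vocab_size, size=seq_len)

x = W_E[tokens]                       # look up embeddings: (seq_len, d_model)
for _ in range(n_layers):             # repeated attention + MLP blocks
    x = x + attention_block(x)
    x = x + mlp_block(x)

logits = x[-1] @ W_U.T                # last vector -> one raw score per token
probs = softmax(logits)               # probability distribution over next token
```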

Embeddings

  • The embedding matrix has one column per vocabulary token. In GPT-3: 50,257 tokens × 12,288 dims ≈ 617M parameters just for this first step.
  • Token embeddings are looked up, not computed. At this stage each vector only encodes the identity of the token (plus positional info), with no context (see the lookup sketch after this list).
  • Directions (in the vector space) encode meaning. Training tends to settle on embeddings where directions correspond to semantic features — e.g. king − man + woman ≈ queen, or a learned “gender axis”, “plurality axis”, etc. This is empirical, not designed.
  • Dot product measures alignment. Positive ⇒ similar direction, zero ⇒ perpendicular, negative ⇒ opposite. Used throughout transformers as the “how related are these two vectors?” primitive (see the second sketch after this list).
  • Context size is the number of token vectors the network processes simultaneously. GPT-3: 2,048. Limits how much text can influence any single prediction.
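
A rough numpy illustration of the lookup and the parameter count from the bullets above; the matrix used for the lookup is toy-sized and randomly filled, purely for shape-checking:

```python
import numpy as np

# GPT-3's embedding dimensions give the ~617M figure directly
vocab_size, d_model = 50_257, 12_288
print(vocab_size * d_model)        # 617,558,016 parameters for the embedding alone

# Toy-sized matrix for the lookup itself (same idea, smaller numbers)
rng = np.random.default_rng(0)
W_E = rng.normal(size=(1000, 64))  # one row per token here, 64-dim embeddings

token_ids = np.array([17, 42, 7])  # hypothetical token ids from a tokeniser
x = W_E[token_ids]                 # a lookup, not a computation: shape (3, 64)
# At this stage each row encodes only the token's identity (plus positional
# information, added separately) - no surrounding context yet.
```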
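
The dot-product and king − man + woman ideas can be checked with a few toy vectors; these are hand-picked stand-ins, not real learned embeddings:

```python
import numpy as np

def cosine(u, v):
    # Normalised dot product: +1 same direction, 0 perpendicular, -1 opposite
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 4-d embeddings where the second coordinate acts as a "gender axis"
king  = np.array([0.9,  0.8, 0.1, 0.0])
queen = np.array([0.9, -0.8, 0.1, 0.0])
man   = np.array([0.1,  0.8, 0.7, 0.2])
woman = np.array([0.1, -0.8, 0.7, 0.2])

candidate = king - man + woman
print(cosine(candidate, queen))   # ~1.0: essentially the same direction
print(cosine(candidate, man))     # clearly lower (negative here)
```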

Unembedding and softmax

  • After the last block, the last vector in the sequence is multiplied by the unembedding matrix (vocab_size × embed_dim) to produce one raw score per vocabulary token. These raw scores are called logits.
  • Why only the last vector? Training is more efficient if every position simultaneously predicts its own next token — so every position acts as a training example. At inference time we just look at the last one.
  • Softmax turns logits into a probability distribution. A temperature T can be added by dividing each logit by T before exponentiating, i.e. probabilities ∝ exp(z_i / T): small T concentrates the distribution on the argmax; large T flattens it toward uniform (see the sketch after this list).
  • The unembedding matrix in GPT-3 is another ~617M parameters (the same shape as the embedding matrix, transposed).
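
A sketch of the unembedding step and softmax-with-temperature using toy sizes; the temperature symbol T follows common convention rather than anything specific to GPT-3:

```python
import numpy as np

d_model, vocab_size = 64, 1000
rng = np.random.default_rng(0)

W_U = rng.normal(size=(vocab_size, d_model))  # unembedding: vocab_size x embed_dim
x_last = rng.normal(size=d_model)             # final vector after the last block

logits = W_U @ x_last                         # one raw score per vocabulary token

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()                           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p_sharp = softmax(logits, T=0.1)   # small T: mass concentrates on the argmax
p_flat  = softmax(logits, T=10.0)  # large T: distribution flattens toward uniform
print(p_sharp.max(), p_flat.max())
```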

Counting parameters (running tally)

  • Embedding matrix: ~617M
  • Unembedding matrix: ~617M
  • So far: ~1.2B of the 175B total in GPT-3. The rest live in the attention and MLP blocks, covered in later chapters.
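
As a quick check of the tally, the arithmetic behind these numbers:

```python
vocab_size, d_model = 50_257, 12_288

embedding   = vocab_size * d_model       # 617,558,016
unembedding = vocab_size * d_model       # same shape, transposed
total_so_far = embedding + unembedding   # 1,235,116,032, i.e. ~1.2B

print(total_so_far / 175e9)              # roughly 0.7% of GPT-3's 175B parameters
```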