Summary: A high-level walkthrough of the transformer architecture — tokens, embeddings, repeated attention + MLP blocks, and the final unembedding + softmax — using GPT-3’s numbers throughout.
Key ideas
- GPT = Generative Pre-trained Transformer. The chapter focuses on the decoder-only, next-token-prediction variant that underlies ChatGPT.
- Pipeline. Input text → tokens → embedding vectors → many layers of attention + MLP → final vector → unembedding matrix → softmax → probability distribution over next token.
- Deep learning premise. Models are layered transformations of real-valued tensors, parameterised by weight matrices. Without nonlinearities between matrix multiplications, the whole model collapses to a single affine map.
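The collapse claim in the last bullet can be checked directly: two stacked matrix multiplications are always equal to one multiplication by the product matrix, and only a nonlinearity in between breaks this. A minimal NumPy sketch (all shapes and values here are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # toy input vector
W1 = rng.standard_normal((8, 4))    # first "layer"
W2 = rng.standard_normal((3, 8))    # second "layer"

# Two stacked linear layers...
two_layers = W2 @ (W1 @ x)
# ...are exactly one linear layer with the combined matrix W2 @ W1.
one_layer = (W2 @ W1) @ x
assert np.allclose(two_layers, one_layer)

# A nonlinearity (here ReLU) between the layers breaks the collapse,
# which is what makes depth add expressive power.
with_relu = W2 @ np.maximum(W1 @ x, 0.0)
assert not np.allclose(with_relu, one_layer)
```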
Embeddings
- The embedding matrix has one column per vocabulary token. In GPT-3: 50,257 tokens × 12,288 dims ≈ 617M parameters just for this first step.
- Token embeddings are looked up, not computed. At this stage each vector only encodes the identity of the token (plus positional info), with no context.
- Directions (in the vector space) encode meaning. Training tends to settle on embeddings where directions correspond to semantic features — e.g. king − man + woman ≈ queen, or a learned “gender axis”, “plurality axis”, etc. This is empirical, not designed.
- Dot product measures alignment. Positive ⇒ similar direction, zero ⇒ perpendicular, negative ⇒ opposite. Used throughout transformers as the “how related are these two vectors?” primitive.
- Context size is the number of token vectors the network processes simultaneously. GPT-3: 2,048. Limits how much text can influence any single prediction.
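The lookup and dot-product primitives above can be sketched in a few lines of NumPy. The shapes are shrunk from GPT-3’s 12,288 × 50,257, the matrix is random rather than trained, and the token ids are hypothetical:

```python
import numpy as np

# Toy stand-in for the embedding matrix: one column per vocabulary token.
# GPT-3's real shape is 12,288 (embed_dim) x 50,257 (vocab_size).
vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
W_E = rng.standard_normal((embed_dim, vocab_size))

# Embedding a token is a column lookup, not a computation.
token_ids = [3, 7, 3]            # hypothetical ids for a 3-token input
embeddings = W_E[:, token_ids]   # shape (embed_dim, 3)
# A repeated token id yields an identical vector: at this stage the vector
# encodes only token identity, with no context.

# Dot product as the similarity primitive:
# positive = aligned, zero = perpendicular, negative = opposite.
similarity = W_E[:, 3] @ W_E[:, 7]
```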
Unembedding and softmax
- After the last block, the last vector in the sequence is multiplied by the unembedding matrix (vocab_size × embed_dim) to produce one raw score per vocabulary token. These raw scores are called logits.
- Why only the last vector? Training is more efficient if every position simultaneously predicts its own next token — so every position acts as a training example. At inference time we just look at the last one.
- Softmax turns logits into a probability distribution: pᵢ = exp(lᵢ/T) / Σⱼ exp(lⱼ/T). A temperature T is added by dividing each logit by T before exponentiating: small T concentrates probability on the argmax; large T flattens the distribution toward uniform.
- The unembedding matrix in GPT-3 is another ~617M parameters (the same shape as the embedding matrix, transposed).
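The unembedding-plus-softmax step can be sketched end to end. This is a toy version: the shapes are shrunk, the matrices are random rather than trained, and `softmax` is a plain implementation of the formula with the standard max-subtraction trick for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4   # toy stand-ins for 50,257 and 12,288

# The last vector after the final block, and the unembedding matrix.
last_vector = rng.standard_normal(embed_dim)
W_U = rng.standard_normal((vocab_size, embed_dim))

# One raw score (logit) per vocabulary token.
logits = W_U @ last_vector      # shape (vocab_size,)

def softmax(logits, temperature=1.0):
    # Divide by T, subtract the max for numerical stability, exponentiate.
    z = (logits - logits.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)                     # valid probability distribution
cold = softmax(logits, temperature=0.1)     # concentrates on the argmax
hot = softmax(logits, temperature=10.0)     # flattens toward uniform
```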
Counting parameters (running tally)
- Embedding matrix: 617M
- Unembedding matrix: 617M
- So far: ~1.2B of the 175B total in GPT-3. The rest live in the attention and MLP blocks, covered in later chapters.
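The tally above is just multiplication, and checks out exactly:

```python
# Parameter tally for GPT-3's embedding and unembedding steps.
vocab_size, embed_dim = 50_257, 12_288

embedding_params = vocab_size * embed_dim     # 617,558,016  (~617M)
unembedding_params = vocab_size * embed_dim   # same shape transposed, ~617M
tally = embedding_params + unembedding_params # ~1.235B

total = 175_000_000_000  # GPT-3's headline parameter count
# Under 1% accounted for; the rest live in the attention and MLP blocks.
fraction = tally / total
```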