Summary: A high-level walkthrough of the transformer architecture — tokens, embeddings, repeated attention + MLP blocks, and the final unembedding + softmax — using GPT-3’s numbers throughout.

Key ideas

  • GPT = Generative Pre-trained Transformer. The chapter focuses on the decoder-only, next-token-prediction variant that underlies ChatGPT.
  • Pipeline. Input text → tokens → embedding vectors → many layers of attention + MLP → final vector → unembedding matrix → softmax → probability distribution over the next token (see the sketch after this list).
  • Deep learning premise. Models are layered transformations of real-valued tensors, parameterised by weight matrices. Without nonlinearities between matrix multiplications the whole model collapses to a single affine map.
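
A minimal sketch of this pipeline, with made-up toy dimensions and placeholder blocks (the names and shapes here are illustrative assumptions, not GPT-3’s actual implementation):

```python
import numpy as np

# Toy dimensions (hypothetical; far smaller than GPT-3's 50,257 / 12,288)
vocab_size, d_model, n_layers, seq_len = 1000, 64, 4, 8

rng = np.random.default_rng(0)
W_E = rng.normal(scale=0.02, size=(vocab_size, d_model))  # embedding matrix
W_U = rng.normal(scale=0.02, size=(vocab_size, d_model))  # unembedding matrix

def attention_block(x):
    # Placeholder for a real attention layer (covered in later chapters)
    return np.zeros_like(x)

def mlp_block(x):
    # Toy MLP; the ReLU nonlinearity is what keeps the stacked matrix
    # multiplications from collapsing into a single affine map
    W1 = rng.normal(scale=0.02, size=(d_model, 4 * d_model))
    W2 = rng.normal(scale=0.02, size=(4 * d_model, d_model))
    return np.maximum(x @ W1, 0) @ W2

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Input text -> token ids (the tokeniser itself is not shown)
tokens = rng.integers(0, vocab_size, size=seq_len)

x = W_E[tokens]                       # look up embeddings: (seq_len, d_model)
for _ in range(n_layers):             # repeated attention + MLP blocks
    x = x + attention_block(x)
    x = x + mlp_block(x)

logits = x[-1] @ W_U.T                # last vector -> one raw score per token
probs = softmax(logits)               # probability distribution over next token
```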

Embeddings

  • The embedding matrix has one column per vocabulary token. In GPT-3: 50,257 tokens × 12,288 dims ≈ 617M parameters just for this first step.
  • Token embeddings are looked up, not computed. At this stage each vector only encodes the identity of the token (plus positional info), with no context (see the lookup sketch after this list).
  • Directions (in the vector space) encode meaning. Training tends to settle on embeddings where directions correspond to semantic features — e.g. king − man + woman ≈ queen, or a learned “gender axis”, “plurality axis”, etc. This is empirical, not designed.
  • Dot product measures alignment. Positive ⇒ similar direction, zero ⇒ perpendicular, negative ⇒ opposite. Used throughout transformers as the “how related are these two vectors?” primitive (see the second sketch after this list).
  • Context size is the number of token vectors the network processes simultaneously. GPT-3: 2,048. Limits how much text can influence any single prediction.
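
A rough numpy illustration of the lookup and the parameter count from the bullets above; the matrix used for the lookup is toy-sized and randomly filled, purely for shape-checking:

```python
import numpy as np

# GPT-3's embedding dimensions give the ~617M figure directly
vocab_size, d_model = 50_257, 12_288
print(vocab_size * d_model)        # 617,558,016 parameters for the embedding alone

# Toy-sized matrix for the lookup itself (same idea, smaller numbers)
rng = np.random.default_rng(0)
W_E = rng.normal(size=(1000, 64))  # one row per token here, 64-dim embeddings

token_ids = np.array([17, 42, 7])  # hypothetical token ids from a tokeniser
x = W_E[token_ids]                 # a lookup, not a computation: shape (3, 64)
# At this stage each row encodes only the token's identity (plus positional
# information, added separately) - no surrounding context yet.
```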
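
The dot-product and king − man + woman ideas can be checked with a few toy vectors; these are hand-picked stand-ins, not real learned embeddings:

```python
import numpy as np

def cosine(u, v):
    # Normalised dot product: +1 same direction, 0 perpendicular, -1 opposite
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 4-d embeddings where the second coordinate acts as a "gender axis"
king  = np.array([0.9,  0.8, 0.1, 0.0])
queen = np.array([0.9, -0.8, 0.1, 0.0])
man   = np.array([0.1,  0.8, 0.7, 0.2])
woman = np.array([0.1, -0.8, 0.7, 0.2])

candidate = king - man + woman
print(cosine(candidate, queen))   # ~1.0: essentially the same direction
print(cosine(candidate, man))     # clearly lower (negative here)
```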

Unembedding and softmax

  • After the last block, the last vector in the sequence is multiplied by the unembedding matrix (vocab_size × embed_dim) to produce one raw score per vocabulary token. These raw scores are called logits.
  • Why only the last vector? Training is more efficient if every position simultaneously predicts its own next token — so every position acts as a training example. At inference time we just look at the last one.
  • Softmax turns logits into a probability distribution. A temperature T can be added by dividing each logit by T before exponentiating, i.e. probabilities ∝ exp(z_i / T): small T concentrates the distribution on the argmax; large T flattens it toward uniform (see the sketch after this list).
  • The unembedding matrix in GPT-3 is another ~617M parameters (the same shape as the embedding matrix, transposed).
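
A sketch of the unembedding step and softmax-with-temperature using toy sizes; the temperature symbol T follows common convention rather than anything specific to GPT-3:

```python
import numpy as np

d_model, vocab_size = 64, 1000
rng = np.random.default_rng(0)

W_U = rng.normal(size=(vocab_size, d_model))  # unembedding: vocab_size x embed_dim
x_last = rng.normal(size=d_model)             # final vector after the last block

logits = W_U @ x_last                         # one raw score per vocabulary token

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()                           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p_sharp = softmax(logits, T=0.1)   # small T: mass concentrates on the argmax
p_flat  = softmax(logits, T=10.0)  # large T: distribution flattens toward uniform
print(p_sharp.max(), p_flat.max())
```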

Counting parameters (running tally)

  • Embedding matrix: ~617M
  • Unembedding matrix: ~617M
  • So far: ~1.2B of the 175B total in GPT-3. The rest live in the attention and MLP blocks, covered in later chapters.
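
As a quick check of the tally, the arithmetic behind these numbers:

```python
vocab_size, d_model = 50_257, 12_288

embedding   = vocab_size * d_model       # 617,558,016
unembedding = vocab_size * d_model       # same shape, transposed
total_so_far = embedding + unembedding   # 1,235,116,032, i.e. ~1.2B

print(total_so_far / 175e9)              # roughly 0.7% of GPT-3's 175B parameters
```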