Summary: OpenAI’s 175-billion-parameter decoder-only transformer language model, released in 2020. Historically important as the first model to make “scale alone produces qualitative capability jumps” widely credible, and the running numerical example in the 3Blue1Brown LLM series.

The numbers

All pulled from 3b1b’s walkthrough. These are the reference points behind every “how many parameters does X contribute?” table in this wiki.

Quantity                    Value
Total parameters            ~175B
Embedding dimension         12,288
Vocabulary size             50,257
Context size (tokens)       2,048
Number of layers            96
Attention heads per layer   96
Key/query dimension         128
MLP hidden size             49,152 (= 4 × 12,288)
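The dimensions are not independent: the attention heads tile the embedding space exactly, and the MLP hidden size is the conventional 4× expansion. A quick sanity check (variable names are mine, not from the source):

```python
d_embed, n_heads, d_head = 12_288, 96, 128

# The 96 heads of 128 dimensions each tile the embedding space exactly.
print(n_heads * d_head)  # 12288

# The MLP hidden layer is the standard 4x expansion of the embedding dimension.
print(4 * d_embed)       # 49152
```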

Where the parameters live

Component            Dimension breakdown        Total parameters    Share
Embedding            12,288 × 50,257            617,558,016         0.4%
Key                  128 × 12,288 × 96 × 96     14,495,514,624      8.3%
Query                128 × 12,288 × 96 × 96     14,495,514,624      8.3%
Value                128 × 12,288 × 96 × 96     14,495,514,624      8.3%
Output               12,288 × 128 × 96 × 96     14,495,514,624      8.3%
Up-projection        49,152 × 12,288 × 96       57,982,058,496      33%
Down-projection      12,288 × 49,152 × 96       57,982,058,496      33%
Unembedding          50,257 × 12,288            617,558,016         0.4%
LayerNorm + biases   —                          (not itemized)      ~0%
Total                                           175,181,291,520     100%
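Every entry in the table follows directly from the numbers in the previous section, and summing them recovers the ~175B headline figure. A sketch of the arithmetic (variable names are mine, not from the source):

```python
d_embed, n_vocab = 12_288, 50_257
n_layers, n_heads, d_head = 96, 96, 128
mlp_hidden = 4 * d_embed  # 49,152

embedding   = d_embed * n_vocab                      # 617,558,016
key         = d_head * d_embed * n_heads * n_layers  # 14,495,514,624
query       = key                                    # same shape as key
value       = key                                    # same shape as key
output      = key                                    # per-head output map, same size
up_proj     = mlp_hidden * d_embed * n_layers        # 57,982,058,496
down_proj   = up_proj                                # same shape, transposed
unembedding = n_vocab * d_embed                      # 617,558,016

total = (embedding + key + query + value + output
         + up_proj + down_proj + unembedding)
print(f"{total:,}")  # 175,181,291,520
```

Note the total comes in just over 175B before counting LayerNorm parameters and biases, which is why the headline number is written as "~175B".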

Two takeaways:

  1. Most of a transformer’s weights live in the MLPs, not attention. Attention gets the conceptual spotlight but accounts for only about a third of the parameters.
  2. Embedding and unembedding are relatively tiny (<1% combined). Widening the residual stream has much bigger downstream costs than widening the vocabulary.

Training

  • Data. Roughly 500B tokens of filtered internet text (Common Crawl, WebText, Wikipedia, books). A human reading nonstop would need ~2,600 years to finish.
  • Compute. Estimated at single-digit thousands of petaflop/s-days — feasible only because transformers parallelise across sequence positions during training.
  • Objective. Self-supervised next-token prediction (pretraining). GPT-3 itself was released as a base model; the Instruct / Chat variants came from subsequent RLHF fine-tuning.

Historical significance

  • Demonstrated in-context learning — the model could perform novel tasks just from examples in the prompt, with no gradient updates. This wasn’t obvious from GPT-2 and kicked off the entire prompt-engineering era.
  • Validated scaling laws: loss fell predictably as parameters, data, and compute were scaled up together, suggesting that further scale would keep paying off. It did.
  • Catalysed the LLM product boom: the OpenAI API, GitHub Copilot, ChatGPT (a fine-tuned descendant), and the entire “LLM platform” ecosystem all trace back to GPT-3.
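The in-context learning point is easiest to see concretely: the model infers the task from the pattern in the prompt alone, with no gradient updates. A hypothetical few-shot prompt (my illustration; the translation setup echoes the kind of task GPT-3 was evaluated on):

```python
# Few-shot prompt: three English->French pairs establish the pattern;
# the model is expected to complete the final line in the same format.
prompt = """\
English: cheese
French: fromage

English: house
French: maison

English: book
French:"""
# A capable model would typically continue with " livre".
```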

See also