Summary: OpenAI’s 175-billion-parameter decoder-only transformer language model, released in 2020. Historically important as the first model to make “scale alone produces qualitative capability jumps” widely credible, and the running numerical example in the 3Blue1Brown LLM series.
The numbers
All pulled from 3b1b’s walkthrough. These are the reference points behind every “how many parameters does X contribute?” table in this wiki.
| Quantity | Value |
|---|---|
| Total parameters | ~175B |
| Embedding dimension | 12,288 |
| Vocabulary size | 50,257 |
| Context size (tokens) | 2,048 |
| Number of layers | 96 |
| Attention heads per layer | 96 |
| Key/query dimension | 128 |
| MLP hidden size | 49,152 (= 4 × 12,288) |
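A quick check that these numbers hang together; a minimal sketch, variable names mine:

```python
# GPT-3 hyperparameters from the table above.
d_model  = 12_288   # embedding / residual-stream dimension
n_vocab  = 50_257
n_ctx    = 2_048    # context window in tokens
n_layers = 96
n_heads  = 96
d_head   = 128      # key/query dimension per head

# The heads tile the residual stream exactly, and the MLP hidden layer
# is the conventional 4x expansion of the embedding dimension.
assert n_heads * d_head == d_model   # 96 * 128 == 12,288
d_mlp = 4 * d_model
assert d_mlp == 49_152
```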
Where the parameters live
| Component | Dimension breakdown | Total parameters | Share |
|---|---|---|---|
| Embedding | 12,288 × 50,257 | 617,558,016 | 0.4% |
| Key | 128 × 12,288 × 96 heads × 96 layers | 14,495,514,624 | 8.3% |
| Query | 128 × 12,288 × 96 heads × 96 layers | 14,495,514,624 | 8.3% |
| Value | 128 × 12,288 × 96 heads × 96 layers | 14,495,514,624 | 8.3% |
| Output | 12,288 × 128 × 96 heads × 96 layers | 14,495,514,624 | 8.3% |
| Up-projection | 49,152 × 12,288 × 96 layers | 57,982,058,496 | 33% |
| Down-projection | 12,288 × 49,152 × 96 layers | 57,982,058,496 | 33% |
| Unembedding | 50,257 × 12,288 | 617,558,016 | 0.4% |
| LayerNorm + biases | (not broken out in 3b1b's count) | | ~0% |
| Total | | 175,181,291,520 | 100% |
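The whole table falls out of the hyperparameters. A minimal sketch of the recount, following 3b1b's weight factorization (biases and LayerNorms excluded, variable names mine):

```python
d_model, n_vocab, n_layers, n_heads, d_head = 12_288, 50_257, 96, 96, 128
d_mlp = 4 * d_model  # 49,152

embedding   = n_vocab * d_model                      # 617,558,016
key         = d_head * d_model * n_heads * n_layers  # 14,495,514,624
query       = d_head * d_model * n_heads * n_layers
value       = d_head * d_model * n_heads * n_layers  # "value-down" map in 3b1b's framing
output      = d_model * d_head * n_heads * n_layers  # per-head output ("value-up") map
up_proj     = d_mlp * d_model * n_layers             # 57,982,058,496
down_proj   = d_model * d_mlp * n_layers
unembedding = n_vocab * d_model

total = (embedding + key + query + value + output
         + up_proj + down_proj + unembedding)
print(f"{total:,}")                                                # 175,181,291,520
print(f"MLP:       {(up_proj + down_proj) / total:.1%}")           # 66.2%
print(f"attention: {(key + query + value + output) / total:.1%}")  # 33.1%
```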
Two takeaways:
- Most of a transformer’s weights live in the MLPs, not attention. Attention gets the conceptual spotlight but accounts for only about a third of the parameters.
- Embedding and unembedding are relatively tiny (<1% combined). Widening the residual stream is far more expensive than enlarging the vocabulary, since the embedding dimension multiplies into every attention and MLP matrix in all 96 layers.
Training
- Data. Roughly 500B tokens of filtered internet text (Common Crawl, WebText, Wikipedia, books). A human reading nonstop would need ~2,600 years to finish.
- Compute. On the order of a few thousand petaflop/s-days (the GPT-3 paper’s own estimate is ~3,640), feasible only because transformers parallelise training across all sequence positions at once.
- Objective. Self-supervised next-token prediction (pretraining). GPT-3 itself was released as a base model; the Instruct / Chat variants came from subsequent rlhf fine-tuning.
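The objective is compact enough to sketch. A minimal PyTorch version of the next-token loss; the `model` call stands in for any decoder-only transformer, not OpenAI's actual code:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Self-supervised pretraining loss: predict token t+1 from tokens <= t.

    tokens: (batch, seq_len) integer token ids.
    model:  any decoder-only transformer returning (batch, seq_len, vocab) logits.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift targets by one position
    logits = model(inputs)                            # causal mask lives inside the model
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # (batch*seq, vocab)
        targets.reshape(-1),                          # (batch*seq,)
    )
```

Every position in every sequence supplies a training signal simultaneously; that per-position parallelism is what the compute bullet above refers to.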
Historical significance
- Demonstrated in-context learning: the model could perform novel tasks purely from examples in the prompt, with no gradient updates (see the example prompt after this list). This wasn’t obvious from GPT-2 and kicked off the entire prompt-engineering era.
- Validated scaling laws: loss, and with it downstream capability, improved smoothly and predictably as parameters, data, and compute were scaled up together, suggesting that further scale would keep paying off. It did.
- Catalysed the LLM product boom: the OpenAI API, GitHub Copilot, ChatGPT (a fine-tuned descendant), and the entire “LLM platform” ecosystem all trace back to GPT-3.
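For concreteness, a hypothetical few-shot prompt in the style of the paper’s translation demos (my example, not a quote from the source):

```python
# In-context learning: the "training examples" live entirely in the prompt.
# The model infers the task (English -> French) and completes the pattern;
# its weights are never updated.
prompt = (
    "English: cheese\nFrench: fromage\n\n"
    "English: house\nFrench: maison\n\n"
    "English: tree\nFrench:"
)
# A base model continuing this text should produce " arbre".
```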
See also
- transformer-architecture — the architecture GPT-3 instantiates
- large-language-model — the broader concept
- pretraining, rlhf — the training phases
- src-3b1b-llms-ch2-transformers, src-3b1b-llms-ch3-attention, src-3b1b-llms-ch4-mlps-store-facts — where the numbers come from