Summary: OpenAI’s 175-billion-parameter decoder-only transformer language model, released in 2020. Historically important as the first model to make “scale alone produces qualitative capability jumps” widely credible, and the running numerical example in the 3Blue1Brown LLM series.
The numbers
All pulled from 3b1b’s walkthrough. These are the reference points behind every “how many parameters does X contribute?” table in this wiki.
| Quantity | Value |
|---|---|
| Total parameters | ~175B |
| Embedding dimension | 12,288 |
| Vocabulary size | 50,257 |
| Context size (tokens) | 2,048 |
| Number of layers | 96 |
| Attention heads per layer | 96 |
| Key/query dimension | 128 |
| MLP hidden size | 49,152 (= 4 × 12,288) |
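A quick check that these numbers hang together; a minimal sketch, variable names mine:

```python
# GPT-3 hyperparameters from the table above.
d_model  = 12_288   # embedding / residual-stream dimension
n_vocab  = 50_257
n_ctx    = 2_048    # context window in tokens
n_layers = 96
n_heads  = 96
d_head   = 128      # key/query dimension per head

# The heads tile the residual stream exactly, and the MLP hidden layer
# is the conventional 4x expansion of the embedding dimension.
assert n_heads * d_head == d_model   # 96 * 128 == 12,288
d_mlp = 4 * d_model
assert d_mlp == 49_152
```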
Where the parameters live
| Component | Dimension breakdown | Total parameters | Share |
|---|---|---|---|
| Embedding | 12,288 × 50,257 | 617,558,016 | 0.4% |
| Key | 128 × 12,288 × 96 heads × 96 layers | 14,495,514,624 | 8.3% |
| Query | 128 × 12,288 × 96 heads × 96 layers | 14,495,514,624 | 8.3% |
| Value | 128 × 12,288 × 96 heads × 96 layers | 14,495,514,624 | 8.3% |
| Output | 12,288 × 128 × 96 heads × 96 layers | 14,495,514,624 | 8.3% |
| Up-projection | 49,152 × 12,288 × 96 layers | 57,982,058,496 | 33% |
| Down-projection | 12,288 × 49,152 × 96 layers | 57,982,058,496 | 33% |
| Unembedding | 50,257 × 12,288 | 617,558,016 | 0.4% |
| LayerNorm + biases | (not broken out in 3b1b's count) | | ~0% |
| Total | | 175,181,291,520 | 100% |
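The whole table falls out of the hyperparameters. A minimal sketch of the recount, following 3b1b's weight factorization (biases and LayerNorms excluded, variable names mine):

```python
d_model, n_vocab, n_layers, n_heads, d_head = 12_288, 50_257, 96, 96, 128
d_mlp = 4 * d_model  # 49,152

embedding   = n_vocab * d_model                      # 617,558,016
key         = d_head * d_model * n_heads * n_layers  # 14,495,514,624
query       = d_head * d_model * n_heads * n_layers
value       = d_head * d_model * n_heads * n_layers  # "value-down" map in 3b1b's framing
output      = d_model * d_head * n_heads * n_layers  # per-head output ("value-up") map
up_proj     = d_mlp * d_model * n_layers             # 57,982,058,496
down_proj   = d_model * d_mlp * n_layers
unembedding = n_vocab * d_model

total = (embedding + key + query + value + output
         + up_proj + down_proj + unembedding)
print(f"{total:,}")                                                # 175,181,291,520
print(f"MLP:       {(up_proj + down_proj) / total:.1%}")           # 66.2%
print(f"attention: {(key + query + value + output) / total:.1%}")  # 33.1%
```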
Two takeaways:
- Most of a transformer’s weights live in the MLPs, not attention. Attention gets the conceptual spotlight but accounts for only about a third of the parameters.
- Embedding and unembedding are relatively tiny (<1% combined). Widening the residual stream is far more expensive than enlarging the vocabulary, since the embedding dimension multiplies into every attention and MLP matrix in all 96 layers.
Training
- Data. Roughly 500B tokens of filtered internet text (Common Crawl, WebText, Wikipedia, books). A human reading nonstop would need ~2,600 years to finish.
- Compute. On the order of a few thousand petaflop/s-days (the GPT-3 paper’s own estimate is ~3,640), feasible only because transformers parallelise training across all sequence positions at once.
- Objective. Self-supervised next-token prediction (pretraining). GPT-3 itself was released as a base model; the Instruct / Chat variants came from subsequent rlhf fine-tuning.
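The objective is compact enough to sketch. A minimal PyTorch version of the next-token loss; the `model` call stands in for any decoder-only transformer, not OpenAI's actual code:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Self-supervised pretraining loss: predict token t+1 from tokens <= t.

    tokens: (batch, seq_len) integer token ids.
    model:  any decoder-only transformer returning (batch, seq_len, vocab) logits.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift targets by one position
    logits = model(inputs)                            # causal mask lives inside the model
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # (batch*seq, vocab)
        targets.reshape(-1),                          # (batch*seq,)
    )
```

Every position in every sequence supplies a training signal simultaneously; that per-position parallelism is what the compute bullet above refers to.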
Historical significance
- Demonstrated in-context learning: the model could perform novel tasks purely from examples in the prompt, with no gradient updates (see the example prompt after this list). This wasn’t obvious from GPT-2 and kicked off the entire prompt-engineering era.
- Validated scaling laws: loss, and with it downstream capability, improved smoothly and predictably as parameters, data, and compute were scaled up together, suggesting that further scale would keep paying off. It did.
- Catalysed the LLM product boom: the OpenAI API, GitHub Copilot, ChatGPT (a fine-tuned descendant), and the entire “LLM platform” ecosystem all trace back to GPT-3.
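For concreteness, a hypothetical few-shot prompt in the style of the paper’s translation demos (my example, not a quote from the source):

```python
# In-context learning: the "training examples" live entirely in the prompt.
# The model infers the task (English -> French) and completes the pattern;
# its weights are never updated.
prompt = (
    "English: cheese\nFrench: fromage\n\n"
    "English: house\nFrench: maison\n\n"
    "English: tree\nFrench:"
)
# A base model continuing this text should produce " arbre".
```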
See also
- transformer-architecture — the architecture GPT-3 instantiates
- large-language-model — the broader concept
- pretraining, rlhf — the training phases
- src-3b1b-llms-ch2-transformers, src-3b1b-llms-ch3-attention, src-3b1b-llms-ch4-mlps-store-facts — where the numbers come from