Summary: The neural network architecture underlying all modern large language models — alternating blocks of attention and MLP operating on a sequence of token embeddings, with residual connections throughout.
Also see word embeddings
The pipeline
For a decoder-only, next-token-prediction transformer (the GPT family), data flows as:
```
text
 └─► tokenize → sequence of token IDs (length ≤ context size)
     └─► embed (W_E) → sequence of vectors [n × d_embed]
         └─► + positional info
             └─► ┌─ attention block ──┐
                 │    (multi-head)    │
                 └── residual add ────┘
             └─► ┌─ MLP block ────────┐
                 │ (up → ReLU → down) │
                 └── residual add ────┘
             ... repeat L times ...
             └─► final last-token vector
                 └─► unembed (W_U) → logits over vocab [|V|]
                     └─► softmax → probability distribution
                         └─► sample → next token
```
For GPT-3: d_embed = 12,288, context size 2,048, vocab size 50,257, number of layers L = 96, attention heads per layer 96.
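The whole pipeline can be sketched in a few dozen lines of NumPy. This is a toy: one layer, one attention head, tiny made-up dimensions, random weights, and a crude stand-in for positional encoding — real models add multi-head attention, GELU, and LayerNorm — but the data flow matches the diagram step for step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_mlp, vocab = 5, 16, 64, 100     # toy sizes, nothing like GPT-3's

ids = rng.integers(0, vocab, size=n)            # tokenize → token IDs
W_E = rng.normal(0, 0.02, (vocab, d))
x = W_E[ids]                                    # embed → [n × d]
x = x + 0.01 * np.sin(np.arange(n))[:, None]    # crude positional info

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# attention block (single head here), with residual add
W_Q, W_K, W_V, W_O = (rng.normal(0, 0.02, (d, d)) for _ in range(4))
Q, K, V = x @ W_Q, x @ W_K, x @ W_V
scores = Q @ K.T / np.sqrt(d)
scores = np.where(np.tril(np.ones((n, n), bool)), scores, -np.inf)  # causal mask
x = x + softmax(scores) @ V @ W_O

# MLP block (up → ReLU → down), with residual add
W_up = rng.normal(0, 0.02, (d, d_mlp))
W_dn = rng.normal(0, 0.02, (d_mlp, d))
x = x + np.maximum(x @ W_up, 0) @ W_dn

# unembed only the last position → softmax → next-token distribution
W_U = rng.normal(0, 0.02, (d, vocab))
probs = softmax(x[-1] @ W_U)
next_token = int(probs.argmax())                # greedy stand-in for sampling
```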
Key design choices
1. Everything is a tensor
Inputs are embedded into real-valued vectors. All intermediate state is a sequence of vectors. Weights are packed into matrices, and the core operation everywhere is matrix-vector multiplication (interpreted as a weighted sum). Nonlinearities (softmax in attention, ReLU/GELU in MLPs) are sprinkled in to prevent the whole model from collapsing to a single affine map.
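Both halves of this point — matrix-vector multiply as a weighted sum, and the collapse of stacked affine maps — can be checked directly with toy matrices (illustrative values, not model weights):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)

# W @ x is a weighted sum of W's columns, with x's entries as the weights
weighted_sum = sum(x[i] * W[:, i] for i in range(3))
assert np.allclose(W @ x, weighted_sum)

# Two stacked linear maps collapse into one linear map...
A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
assert np.allclose(B @ (A @ x), (B @ A) @ x)

# ...which is why a nonlinearity like ReLU goes in between:
# it changes any vector with a negative entry, breaking the collapse
v = np.array([1.0, -2.0, 3.0])
relu = np.maximum(v, 0)
assert np.any(relu != v)
```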
2. Parallelism over sequence position
Unlike RNNs/LSTMs, a transformer processes all tokens simultaneously. All cross-token information transfer happens inside attention blocks via matrix multiplications that GPUs eat for breakfast. This parallelism is the main reason transformers scaled when previous architectures didn’t.
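The contrast can be made concrete: an RNN-style loop touches positions one at a time, while a single matmul produces every query–key score at once — the same numbers, in one GPU-friendly operation (toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 8
Q = rng.normal(size=(n, d))   # one query vector per position
K = rng.normal(size=(n, d))   # one key vector per position

# sequential style: one dot product at a time
loop_scores = np.empty((n, n))
for i in range(n):
    for j in range(n):
        loop_scores[i, j] = Q[i] @ K[j]

# transformer style: all n×n scores in a single matrix multiplication
par_scores = Q @ K.T
assert np.allclose(loop_scores, par_scores)
```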
3. Residual stream
Each block’s output is added to its input, not replacing it. This means every embedding is a running accumulation: it starts as the bare lookup from W_E and gets progressively refined by each block. This is essential for gradients to flow cleanly through deep stacks and for interpretability — you can read the residual stream as the model’s “working memory” for that token position.
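The accumulation property falls straight out of the `x = x + block(x)` update rule — the final residual-stream state is exactly the original embedding plus the sum of every block's contribution (sketched here with arbitrary ReLU blocks standing in for attention/MLP):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
x0 = rng.normal(size=d)     # the bare embedding lookup
blocks = [lambda v, W=rng.normal(0, 0.1, (d, d)): np.maximum(v @ W, 0)
          for _ in range(4)]

x, deltas = x0.copy(), []
for block in blocks:
    delta = block(x)        # each block computes an *update*...
    deltas.append(delta)
    x = x + delta           # ...added onto the stream, never replacing it

# final state = original embedding + every block's contribution
assert np.allclose(x, x0 + sum(deltas))
```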
4. Only the last vector predicts
At inference time, only the final vector in the sequence (the one for the last input token) is multiplied by W_U to produce the next-token distribution. All the other vectors in the last layer are ignored at inference — but during training every position is used to predict its own next token, which makes each sequence yield n training signals (one per position) instead of one.
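The training side of this asymmetry looks like the following: each position gets its own cross-entropy term against the token that actually came next, so a sequence of n positions produces n loss terms, while inference reads off only the last row (random logits stand in for real model output):

```python
import numpy as np

rng = np.random.default_rng(4)
n, vocab = 8, 50
ids = rng.integers(0, vocab, size=n + 1)   # a training sequence of n+1 tokens
logits = rng.normal(size=(n, vocab))       # model output at each of n positions

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# position t predicts token t+1 → one cross-entropy term per position
probs = softmax(logits)
losses = -np.log(probs[np.arange(n), ids[1:]])
assert losses.shape == (n,)    # n training signals from one sequence

# at inference, only the last position's distribution is used
next_dist = probs[-1]
```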
The two block types
| Block | What it does | Cross-token? | Where facts live |
|---|---|---|---|
| attention-mechanism | Lets tokens share information with each other based on context | Yes | Mostly no |
| multilayer-perceptron | Transforms each token vector independently through a large hidden layer | No (per-position) | Yes — ~2/3 of GPT-3’s params live here |
Attention answers “who should influence whom?”; MLPs answer “given what this token now represents, what more do I know about it?”. The interleaving lets each token gather context (attention) and then react to that context (MLP) repeatedly as data flows through the depth.
Emergence from training
Researchers specify the architecture. Everything else — what directions mean, what each head attends to, what facts each MLP neuron gates on — is discovered by gradient descent minimising the next-token cross-entropy loss over trillions of training tokens. No human sets a single weight.
Parameter count (GPT-3)
| Component | Params |
|---|---|
| Embedding | 617M |
| Unembedding | 617M |
| Attention (96 layers × 96 heads × 4 matrices) | ~58B |
| MLP (96 layers × 2 matrices) | ~116B |
| LayerNorm + biases | ~49K |
| Total | ~175B |
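The table's line items follow from GPT-3's published dimensions. A quick arithmetic check (using the standard GPT-3 values of 128 for the per-head key/query dimension and 4×d_embed for the MLP hidden width):

```python
# Reproducing the GPT-3 parameter budget from its dimensions
d, vocab, layers, heads, d_head = 12_288, 50_257, 96, 96, 128

embed   = vocab * d                          # W_E: ~617M
unembed = vocab * d                          # W_U: ~617M
attn    = layers * heads * 4 * (d * d_head)  # Q, K, V, O per head: ~58B
mlp     = layers * 2 * (d * 4 * d)           # up + down projections: ~116B

total = embed + unembed + attn + mlp
print(f"{total / 1e9:.1f}B")  # → 175.2B (ignoring LayerNorms and biases)
```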
MLPs hold roughly two-thirds of the weights; attention holds a third. Attention gets most of the attention in explanations, but most of the memory lives elsewhere.
See also
- word-embedding, tokenization — the front door
- attention-mechanism, multi-head-attention — the cross-token operation
- multilayer-perceptron — the per-token operation
- unembedding, softmax — the back door
- superposition — why the residual stream’s “directions encode meaning” picture still leaves interpretability hard