Summary: The neural network architecture underlying all modern large language models — alternating blocks of attention and MLP operating on a sequence of token embeddings, with residual connections throughout.

Also see word embeddings

The pipeline

For a decoder-only, next-token-prediction transformer (the GPT family), data flows as:

text
 └─► tokenize            → sequence of token IDs          (length ≤ context size)
 └─► embed (W_E)         → sequence of vectors             [n × d_embed]
 └─► + positional info
 └─► ┌─ attention block ─┐
     │   (multi-head)    │
     └── residual add ───┘
 └─► ┌─ MLP block ───────┐
     │   (up → ReLU → dn)│
     └── residual add ───┘
 ... repeat L times ...
 └─► final last-token vector
 └─► unembed (W_U)       → logits over vocab               [|V|]
 └─► softmax             → probability distribution
 └─► sample              → next token

For GPT-3: d_embed = 12,288, context size 2,048, vocab size ~50,257, number of layers 96, attention heads per layer 96.
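
A toy numpy sketch of this pipeline (random weights, tiny dimensions, single-head attention, positional encoding and layer norm omitted), just to pin down the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; GPT-3 would be d_embed=12288, n_layers=96, n_heads=96, vocab~50257.
vocab, d_embed, d_head, n_layers = 1000, 64, 16, 2

W_E = rng.normal(0, 0.02, (vocab, d_embed))      # embedding matrix
W_U = rng.normal(0, 0.02, (d_embed, vocab))      # unembedding matrix

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(x):
    # single head shown; a real block runs many heads in parallel and sums their outputs
    W_Q, W_K, W_V = (rng.normal(0, 0.02, (d_embed, d_head)) for _ in range(3))
    W_O = rng.normal(0, 0.02, (d_head, d_embed))
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / np.sqrt(d_head)
    causal_mask = np.triu(np.full((len(x), len(x)), -np.inf), k=1)   # no peeking ahead
    return softmax(scores + causal_mask) @ V @ W_O

def mlp_block(x):
    W_up = rng.normal(0, 0.02, (d_embed, 4 * d_embed))
    W_down = rng.normal(0, 0.02, (4 * d_embed, d_embed))
    return np.maximum(x @ W_up, 0) @ W_down      # up -> ReLU -> down

token_ids = np.array([3, 17, 42])                # pretend output of the tokenizer
x = W_E[token_ids]                               # [n, d_embed]  (positional info omitted)
for _ in range(n_layers):
    x = x + attention_block(x)                   # residual add
    x = x + mlp_block(x)                         # residual add
logits = x[-1] @ W_U                             # only the last position predicts
next_token = rng.choice(vocab, p=softmax(logits))
```

(In a real model the weights are learned and fixed per layer rather than sampled on each call; the sketch only traces the data flow.)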

Key design choices

1. Everything is a tensor

Inputs are embedded into real-valued vectors. All intermediate state is a sequence of vectors. Weights are packed into matrices, and the core operation everywhere is matrix-vector multiplication (interpreted as a weighted sum). Nonlinearities (softmax in attention, ReLU/GELU in MLPs) are sprinkled in to prevent the whole model from collapsing to a single affine map.
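
Two quick numpy checks of that claim: a matrix-vector product really is a weighted sum of the matrix's columns, and stacked linear maps collapse into a single one unless a nonlinearity sits between them:

```python
import numpy as np
rng = np.random.default_rng(0)

W = rng.normal(size=(4, 3))       # a weight matrix: 3 columns of length 4
v = np.array([2.0, -1.0, 0.5])    # an input vector

# Matrix-vector product = weighted sum of W's columns, with weights taken from v.
assert np.allclose(W @ v, 2.0 * W[:, 0] - 1.0 * W[:, 1] + 0.5 * W[:, 2])

# Two stacked linear maps are just one linear map; a ReLU in between breaks that.
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
x = rng.normal(size=4)
assert np.allclose(A @ (B @ x), (A @ B) @ x)     # collapses to a single matrix
relu = lambda z: np.maximum(z, 0)
y = A @ relu(B @ x)                              # no single matrix reproduces this
```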

2. Parallelism over sequence position

Unlike RNNs/LSTMs, a transformer processes all tokens simultaneously. All cross-token information transfer happens inside attention blocks via matrix multiplications that GPUs eat for breakfast. This parallelism is the main reason transformers scaled when previous architectures didn’t.
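
A rough sketch of the contrast (toy dimensions, random weights): the RNN update is a loop whose step t cannot start before step t-1 finishes, while a transformer layer's work is a couple of big matmuls with no step-to-step dependency:

```python
import numpy as np
rng = np.random.default_rng(0)

n, d = 6, 8
x = rng.normal(size=(n, d))
W_h = rng.normal(size=(d, d)) * 0.1
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# RNN-style: inherently sequential; each hidden state depends on the previous one.
h = np.zeros(d)
states = []
for t in range(n):
    h = np.tanh(x[t] + h @ W_h)
    states.append(h)

# Transformer-style: all n positions are transformed by one matmul, and all
# n*n token-pair scores by another, so the whole sequence is processed at once.
scores = (x @ W_Q) @ (x @ W_K).T
```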

3. Residual stream

Each block’s output is added to its input, not replacing it. This means every embedding is a running accumulation: it starts as the bare lookup from W_E and gets progressively refined by each block. This is essential for gradients to flow cleanly through deep stacks and for interpretability — you can read the residual stream as the model’s “working memory” for that token position.
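
A minimal numpy sketch of that accumulation, with tanh standing in for an arbitrary block:

```python
import numpy as np
rng = np.random.default_rng(0)

d, n_layers = 8, 4
x0 = rng.normal(size=d)                        # bare embedding lookup
blocks = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_layers)]

x = x0
contributions = []
for W in blocks:
    out = np.tanh(x @ W)                       # stand-in for an attention/MLP block
    contributions.append(out)
    x = x + out                                # residual add, never a replacement

# The final residual stream is literally the original embedding plus every block's output.
assert np.allclose(x, x0 + sum(contributions))
```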

4. Only the last vector predicts

At inference time, only the final vector in the sequence (the one for the last input token) is multiplied by W_U to produce the next-token distribution. All the other vectors in the last layer are ignored at inference — but during training every position is used to predict its own next token, which makes each sequence yield one training signal per position instead of just one.
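
A sketch of that training-time trick, with random vectors standing in for the final-layer outputs: every position gets its own next-token loss, and inference only reads the last row:

```python
import numpy as np
rng = np.random.default_rng(0)

n, d, vocab = 5, 8, 50
tokens = rng.integers(0, vocab, size=n)      # one training sequence of token IDs
x = rng.normal(size=(n, d))                  # stand-in for the final-layer vectors
W_U = rng.normal(size=(d, vocab)) * 0.1

logits = x @ W_U                             # [n, vocab]: a prediction at EVERY position
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Position i predicts token i+1, so a sequence of n tokens gives n-1 loss terms.
targets = tokens[1:]
losses = -np.log(probs[np.arange(n - 1), targets])
print(losses.mean())                         # next-token cross-entropy, averaged

# At inference, only the last row matters:
next_token_probs = probs[-1]
```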

The two block types

| Block | What it does | Cross-token? | Where facts live |
|---|---|---|---|
| attention-mechanism | Lets tokens share information with each other based on context | Yes | Mostly no |
| multilayer-perceptron | Transforms each token vector independently through a large hidden layer | No (per-position) | Yes; ~2/3 of GPT-3’s params live here |

Attention answers "who should influence whom?"; MLPs answer "given what this token now represents, what more do I know about it?". The interleaving lets each token gather context (attention) and then react to that context (MLP) repeatedly as data flows through the depth.
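
One way to see the "No (per-position)" entry concretely: perturb a single token's vector and check that the MLP output at every other position is untouched (toy numpy sketch, random weights):

```python
import numpy as np
rng = np.random.default_rng(0)

n, d = 5, 8
x = rng.normal(size=(n, d))
W_up, W_down = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

def mlp(x):
    return np.maximum(x @ W_up, 0) @ W_down    # same weights applied to each row separately

x_perturbed = x.copy()
x_perturbed[0] += 1.0                           # change only token 0's vector

# MLP outputs at positions 1..n-1 are unchanged: no cross-token flow.
print(np.allclose(mlp(x)[1:], mlp(x_perturbed)[1:]))   # True
# The same test on an attention block would print False, because later positions'
# outputs are weighted mixes over earlier positions, including token 0.
```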

Emergence from training

Researchers specify the architecture. Everything else — what directions in embedding space mean, what each head attends to, what facts each MLP neuron gates on — is discovered by gradient descent minimising the next-token cross-entropy loss over trillions of training tokens. No human sets a single weight.

Parameter count (GPT-3)

| Component | Params |
|---|---|
| Embedding (W_E) | 617M |
| Unembedding (W_U) | 617M |
| Attention (96 layers × 96 heads × 4 matrices) | ~58B |
| MLP (96 layers × 2 matrices) | ~116B |
| LayerNorm + biases | ~49K |
| Total | ~175B |

MLPs hold roughly two-thirds of the weights; attention holds a third. Attention gets most of the attention in explanations, but most of the memory lives elsewhere.
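
The table's totals can be re-derived from the hyperparameters (rough sketch; assumes the standard GPT-3 head size d_head = 128 and ignores LayerNorm and bias parameters):

```python
# Recompute GPT-3's parameter count from its hyperparameters.
d_embed, n_layers, n_heads, d_head, vocab = 12288, 96, 96, 128, 50257

embedding   = vocab * d_embed                          # ~617M
unembedding = d_embed * vocab                          # ~617M
attention   = n_layers * 4 * d_embed * d_embed         # W_Q, W_K, W_V, W_O: ~58B
mlp         = n_layers * 2 * d_embed * (4 * d_embed)   # up and down projections: ~116B

total = embedding + unembedding + attention + mlp
print(f"{total / 1e9:.1f}B")                           # ~175B (LayerNorm/biases excluded)
```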

See also