Summary: The neural network architecture underlying all modern large language models — alternating blocks of attention and MLP operating on a sequence of token embeddings, with residual connections throughout.
Also see word embeddings
The pipeline
For a decoder-only, next-token-prediction transformer (the GPT family), data flows as:
```
text
 └─► tokenize → sequence of token IDs (length ≤ context size)
     └─► embed (W_E) → sequence of vectors [n × d_embed]
         └─► + positional info
             └─► ┌─ attention block ──┐
                 │    (multi-head)    │
                 └── residual add ────┘
             └─► ┌─ MLP block ────────┐
                 │ (up → ReLU → down) │
                 └── residual add ────┘
             ... repeat L times ...
             └─► final last-token vector
                 └─► unembed (W_U) → logits over vocab [|V|]
                     └─► softmax → probability distribution
                         └─► sample → next token
```
For GPT-3: d_embed = 12,288, context size 2,048, vocab size 50,257, number of layers L = 96, attention heads per layer 96.
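The whole pipeline can be sketched in a few dozen lines of NumPy. This is a toy: one layer, one attention head, tiny made-up dimensions, random weights, and a crude stand-in for positional encoding — real models add multi-head attention, GELU, and LayerNorm — but the data flow matches the diagram step for step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_mlp, vocab = 5, 16, 64, 100     # toy sizes, nothing like GPT-3's

ids = rng.integers(0, vocab, size=n)            # tokenize → token IDs
W_E = rng.normal(0, 0.02, (vocab, d))
x = W_E[ids]                                    # embed → [n × d]
x = x + 0.01 * np.sin(np.arange(n))[:, None]    # crude positional info

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# attention block (single head here), with residual add
W_Q, W_K, W_V, W_O = (rng.normal(0, 0.02, (d, d)) for _ in range(4))
Q, K, V = x @ W_Q, x @ W_K, x @ W_V
scores = Q @ K.T / np.sqrt(d)
scores = np.where(np.tril(np.ones((n, n), bool)), scores, -np.inf)  # causal mask
x = x + softmax(scores) @ V @ W_O

# MLP block (up → ReLU → down), with residual add
W_up = rng.normal(0, 0.02, (d, d_mlp))
W_dn = rng.normal(0, 0.02, (d_mlp, d))
x = x + np.maximum(x @ W_up, 0) @ W_dn

# unembed only the last position → softmax → next-token distribution
W_U = rng.normal(0, 0.02, (d, vocab))
probs = softmax(x[-1] @ W_U)
next_token = int(probs.argmax())                # greedy stand-in for sampling
```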
Key design choices
1. Everything is a tensor
Inputs are embedded into real-valued vectors. All intermediate state is a sequence of vectors. Weights are packed into matrices, and the core operation everywhere is matrix-vector multiplication (interpreted as a weighted sum). Nonlinearities (softmax in attention, ReLU/GELU in MLPs) are sprinkled in to prevent the whole model from collapsing to a single affine map.
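Both halves of this point — matrix-vector multiply as a weighted sum, and the collapse of stacked affine maps — can be checked directly with toy matrices (illustrative values, not model weights):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)

# W @ x is a weighted sum of W's columns, with x's entries as the weights
weighted_sum = sum(x[i] * W[:, i] for i in range(3))
assert np.allclose(W @ x, weighted_sum)

# Two stacked linear maps collapse into one linear map...
A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
assert np.allclose(B @ (A @ x), (B @ A) @ x)

# ...which is why a nonlinearity like ReLU goes in between:
# it changes any vector with a negative entry, breaking the collapse
v = np.array([1.0, -2.0, 3.0])
relu = np.maximum(v, 0)
assert np.any(relu != v)
```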
2. Parallelism over sequence position
Unlike RNNs/LSTMs, a transformer processes all tokens simultaneously. All cross-token information transfer happens inside attention blocks via matrix multiplications that GPUs eat for breakfast. This parallelism is the main reason transformers scaled when previous architectures didn’t.
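The contrast can be made concrete: an RNN-style loop touches positions one at a time, while a single matmul produces every query–key score at once — the same numbers, in one GPU-friendly operation (toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 8
Q = rng.normal(size=(n, d))   # one query vector per position
K = rng.normal(size=(n, d))   # one key vector per position

# sequential style: one dot product at a time
loop_scores = np.empty((n, n))
for i in range(n):
    for j in range(n):
        loop_scores[i, j] = Q[i] @ K[j]

# transformer style: all n×n scores in a single matrix multiplication
par_scores = Q @ K.T
assert np.allclose(loop_scores, par_scores)
```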
3. Residual stream
Each block’s output is added to its input, not replacing it. This means every embedding is a running accumulation: it starts as the bare lookup from W_E and gets progressively refined by each block. This is essential for gradients to flow cleanly through deep stacks and for interpretability — you can read the residual stream as the model’s “working memory” for that token position.
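The accumulation property falls straight out of the `x = x + block(x)` update rule — the final residual-stream state is exactly the original embedding plus the sum of every block's contribution (sketched here with arbitrary ReLU blocks standing in for attention/MLP):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
x0 = rng.normal(size=d)     # the bare embedding lookup
blocks = [lambda v, W=rng.normal(0, 0.1, (d, d)): np.maximum(v @ W, 0)
          for _ in range(4)]

x, deltas = x0.copy(), []
for block in blocks:
    delta = block(x)        # each block computes an *update*...
    deltas.append(delta)
    x = x + delta           # ...added onto the stream, never replacing it

# final state = original embedding + every block's contribution
assert np.allclose(x, x0 + sum(deltas))
```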
4. Only the last vector predicts
At inference time, only the final vector in the sequence (the one for the last input token) is multiplied by W_U to produce the next-token distribution. All the other vectors in the last layer are ignored at inference — but during training every position is used to predict its own next token, which makes each sequence yield n training signals (one per position) instead of one.
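The training side of this asymmetry looks like the following: each position gets its own cross-entropy term against the token that actually came next, so a sequence of n positions produces n loss terms, while inference reads off only the last row (random logits stand in for real model output):

```python
import numpy as np

rng = np.random.default_rng(4)
n, vocab = 8, 50
ids = rng.integers(0, vocab, size=n + 1)   # a training sequence of n+1 tokens
logits = rng.normal(size=(n, vocab))       # model output at each of n positions

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# position t predicts token t+1 → one cross-entropy term per position
probs = softmax(logits)
losses = -np.log(probs[np.arange(n), ids[1:]])
assert losses.shape == (n,)    # n training signals from one sequence

# at inference, only the last position's distribution is used
next_dist = probs[-1]
```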
The two block types
| Block | What it does | Cross-token? | Where facts live |
|---|---|---|---|
| attention-mechanism | Lets tokens share information with each other based on context | Yes | Mostly no |
| multilayer-perceptron | Transforms each token vector independently through a large hidden layer | No (per-position) | Yes — ~2/3 of GPT-3’s params live here |
Attention answers “who should influence whom?”; MLPs answer “given what this token now represents, what more do I know about it?”. The interleaving lets each token gather context (attention) and then react to that context (MLP) repeatedly as data flows through the depth.
Emergence from training
Researchers specify the architecture. Everything else — what directions mean, what each head attends to, what facts each MLP neuron gates on — is discovered by gradient descent minimising the next-token cross-entropy loss over trillions of training tokens. No human sets a single weight.
Parameter count (GPT-3)
| Component | Params |
|---|---|
| Embedding | 617M |
| Unembedding | 617M |
| Attention (96 layers × 96 heads × 4 matrices) | ~58B |
| MLP (96 layers × 2 matrices) | ~116B |
| LayerNorm + biases | ~49K |
| Total | ~175B |
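The table's line items follow from GPT-3's published dimensions. A quick arithmetic check (using the standard GPT-3 values of 128 for the per-head key/query dimension and 4×d_embed for the MLP hidden width):

```python
# Reproducing the GPT-3 parameter budget from its dimensions
d, vocab, layers, heads, d_head = 12_288, 50_257, 96, 96, 128

embed   = vocab * d                          # W_E: ~617M
unembed = vocab * d                          # W_U: ~617M
attn    = layers * heads * 4 * (d * d_head)  # Q, K, V, O per head: ~58B
mlp     = layers * 2 * (d * 4 * d)           # up + down projections: ~116B

total = embed + unembed + attn + mlp
print(f"{total / 1e9:.1f}B")  # → 175.2B (ignoring LayerNorms and biases)
```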
MLPs hold roughly two-thirds of the weights; attention holds a third. Attention gets most of the attention in explanations, but most of the memory lives elsewhere.
See also
- word-embedding, tokenization — the front door
- attention-mechanism, multi-head-attention — the cross-token operation
- multilayer-perceptron — the per-token operation
- unembedding, softmax — the back door
- superposition — why the residual stream’s “directions encode meaning” picture still leaves interpretability hard