Summary: Unpacks the MLP blocks inside a transformer, shows how a single fact like "Michael Jordan plays basketball" could in principle be stored as directions + neurons, and introduces superposition as the reason real models aren’t so clean.

Where facts live

  • A 2023 Google DeepMind post found that factual recall localises to the MLP blocks of a transformer, not the attention blocks. The mechanism is not fully understood, but the evidence pins down the location.
  • In GPT-3, MLP blocks hold about two-thirds of the 175B total parameters (~116B). Attention holds about a third. So “most of the model” is MLP.

Single MLP block structure

Each token vector passes through the block independently and in parallel (no cross-token communication; that’s attention’s job). The sequence of operations, with a numpy sketch after the list:

  1. Up-projection: multiply the incoming embedding $\vec{E}$ by a matrix $W_\uparrow$ with ~4× as many rows as the embedding dimension (GPT-3: 49,152 rows × 12,288 cols), then add a bias $\vec{b}_\uparrow$.
  2. Nonlinearity: apply ReLU (or GELU in practice) element-wise. The outputs of this step are called the neurons of the MLP. A neuron is “active” if its value is > 0.
  3. Down-projection: multiply by $W_\downarrow$ (same shape as $W_\uparrow$ transposed) and add a bias $\vec{b}_\downarrow$, bringing the vector back to embedding dimension.
  4. Residual add: the result is added to the original $\vec{E}$ (residual connection).
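
A minimal numpy sketch of those four steps, at toy dimensions so it runs quickly (GPT-3’s real sizes are 12,288 and 49,152; the weights here are random stand-ins, not learned values):

    import numpy as np

    d_embed, d_hidden = 128, 4 * 128  # hidden layer is ~4x the embedding dim

    rng = np.random.default_rng(0)
    W_up = rng.normal(0.0, 0.02, (d_hidden, d_embed))    # up-projection
    b_up = np.zeros(d_hidden)
    W_down = rng.normal(0.0, 0.02, (d_embed, d_hidden))  # down-projection
    b_down = np.zeros(d_embed)

    def mlp_block(E):
        """Steps 1-4 above, for a single token vector E."""
        neurons = np.maximum(0.0, W_up @ E + b_up)   # up-project + ReLU
        return E + (W_down @ neurons + b_down)       # down-project + residual

    E = rng.normal(size=d_embed)
    assert mlp_block(E).shape == (d_embed,)          # back at embedding size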

The Michael Jordan example

Assume three nearly-orthogonal directions exist in the embedding space: $\vec{M}$ (first name Michael), $\vec{J}$ (last name Jordan), $\vec{B}$ (basketball). A numeric sketch follows the list below.

  • Suppose row 0 of $W_\uparrow$ equals $\vec{M} + \vec{J}$. Then that row’s dot product with the token is $(\vec{M} + \vec{J}) \cdot \vec{E}$, which evaluates to 2 only when $\vec{E}$ encodes both names, otherwise ≤ 1.
  • Suppose the bias at row 0 is $-1$. Now the pre-ReLU value is positive iff the full name is encoded: a clean AND gate. ReLU kills the negative cases.
  • Suppose column 0 of $W_\downarrow$ is the $\vec{B}$ direction. Then when neuron 0 is active, we add $\vec{B}$ to the output, i.e. inject “basketball” into the embedding.
  • Summary: the row of $W_\uparrow$ = question being asked; the column of $W_\downarrow$ = answer written back if the neuron fires. Bias terms let rows encode AND-like thresholds.
  • An earlier attention layer must first consolidate Michael + Jordan onto a single token — this example assumes that’s already happened.
    • i.e. attention in a previous layer has folded the full meaning, Michael Jordan, into the vector at the second token, Jordan. That vector is now much richer and more nuanced than the vanilla Jordan embedding looked up from the word-embedding matrix $W_E$.
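
A numeric version of the same story (the three directions are random stand-ins; in a real model they would be learned):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 12_288  # GPT-3 embedding dimension
    # Random unit vectors in high dimension are nearly orthogonal.
    M, J, B = (v / np.linalg.norm(v) for v in rng.normal(size=(3, d)))

    row0, bias0 = M + J, -1.0              # row 0 of W_up and its bias

    def neuron0(E):
        return max(0.0, row0 @ E + bias0)  # ReLU(dot product + bias)

    print(neuron0(M + J))   # ~1.0: both names present, the neuron fires
    print(neuron0(J))       # ~0.0: last name alone stays below threshold
    # When neuron 0 fires, column 0 of W_down (set to B) writes "basketball":
    delta = neuron0(M + J) * B   # the vector added back into the residual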

Parameter count (GPT-3, finishing the tally)

  • $W_\uparrow$: 49,152 × 12,288 ≈ 604M params per block
  • $W_\downarrow$: same shape transposed, ≈ 604M per block
  • ~1.2B per MLP block × 96 layers ≈ 116B — roughly two-thirds of GPT-3’s 175B total
  • Together with attention (~58B) + embedding/unembedding (~1.2B) → 175B total. LayerNorm contributes a trivial ~49K per block.
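
A quick sanity check of the arithmetic (vocab size 50,257 is GPT-3’s published figure; LayerNorm and attention biases omitted as negligible):

    # Back-of-the-envelope GPT-3 parameter tally
    d_embed, d_hidden, n_layers, vocab = 12_288, 49_152, 96, 50_257

    mlp = n_layers * (2 * d_hidden * d_embed + d_hidden + d_embed)  # weights + biases
    attn = n_layers * 4 * d_embed ** 2        # Q, K, V, output projections
    embed = 2 * vocab * d_embed               # embedding + unembedding

    print(f"MLP {mlp/1e9:.0f}B, attention {attn/1e9:.0f}B, "
          f"total {(mlp + attn + embed)/1e9:.0f}B")
    # -> MLP 116B, attention 58B, total 175B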

Superposition — why the clean story is wrong

  • In a strictly perpendicular world, an $n$-dimensional space holds at most $n$ distinct features. The neuron-as-feature picture assumes this.
  • If you relax “perpendicular” to “nearly perpendicular” (say, 85°–95°), the Johnson–Lindenstrauss lemma says the number of near-orthogonal vectors you can cram in grows exponentially with dimension.
  • For an 85° tolerance and 12,288 dims (GPT-3 embedding), vastly more near-orthogonal directions fit than there are neurons; at ~116K dims (GPT-4 scale) the count grows larger still.
  • Consequence: a real feature almost never corresponds to a single neuron firing — it corresponds to a pattern across many neurons. This is superposition, and it’s a major reason interpretability is hard and why larger models gain capability faster than a linear count would suggest.
  • 3b1b notes an error in the video’s Python demo: at 100 dims, the (89°, 91°) tolerance is too tight for JL to really kick in. Tolerance matters a lot: 85° works much sooner (see the sketch below).
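
A sketch in the spirit of that demo (my own re-implementation, not the video’s code; the optimisation details are ad hoc): start with far more random vectors than dimensions, then nudge any pair whose angle drifts outside the tolerance band.

    import numpy as np

    dim, n_vectors = 100, 1_000               # 10x more vectors than dims
    lo, hi = np.cos(np.radians(95)), np.cos(np.radians(85))  # cosine band

    rng = np.random.default_rng(0)
    V = rng.normal(size=(n_vectors, dim))
    V /= np.linalg.norm(V, axis=1, keepdims=True)

    for _ in range(2_000):
        G = V @ V.T                            # pairwise cosines (unit rows)
        np.fill_diagonal(G, 0.0)
        # Penalise only the part of each cosine outside [lo, hi].
        excess = np.clip(G - hi, 0, None) + np.clip(G - lo, None, 0)
        V -= 0.02 * (excess @ V)               # gradient step on sum(excess^2)
        V /= np.linalg.norm(V, axis=1, keepdims=True)

    G = V @ V.T
    off_diag = G[~np.eye(n_vectors, dtype=bool)]
    angles = np.degrees(np.arccos(np.clip(off_diag, -1, 1)))
    print(f"pairwise angles: {angles.min():.1f} to {angles.max():.1f} deg")
    # The looser 85-95 deg band pulls the angles into the window far more
    # easily than (89, 91) does, matching the correction noted above.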

Follow-on reading