Summary: The hypothesis — backed by a growing body of interpretability work — that neural networks store far more features than they have neurons by assigning each feature to a direction that is nearly, rather than strictly, orthogonal to every other feature's direction. This lets an n-dimensional space carry exponentially many features, but means individual neurons rarely correspond to single clean concepts.
The pigeonhole, and the escape from it
If you want a set of vectors in an n-dimensional space to be strictly orthogonal to one another, you can fit at most n of them. That’s essentially the mathematical definition of dimension. By this standard, GPT-3’s 12,288-dimensional embedding space could encode at most 12,288 distinct features.
But real models appear to track far more features than that. How?
Relax “orthogonal” to “nearly orthogonal”. Allow feature directions to sit at angles somewhere in the neighbourhood of 90° (a few degrees either side) instead of exactly 90°. The Johnson–Lindenstrauss lemma then implies that the maximum number of such vectors grows exponentially with the dimension n.
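Roughly how exponential? A heuristic packing bound (a sketch based on the standard concentration-of-measure argument behind JL, not something derived in this note; c is an unspecified constant of order one): if every pairwise angle must stay within some tolerance of 90°, so that every pairwise cosine satisfies |cos| ≤ ε (for an 85°–95° tolerance, ε = cos 85° ≈ 0.087), then the number of directions that fit in n dimensions scales roughly as

$$
N(n, \varepsilon) \;\approx\; \exp\!\big(c \, n \, \varepsilon^{2}\big).
$$

The exponent n·ε² is the quantity to watch: it is negligible for tight tolerances in low dimensions and enormous at LLM widths.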
For GPT-3’s 12,288 dims with an 85° tolerance (pairwise angles anywhere between 85° and 95°), this allows an astronomically large number of near-orthogonal directions — far more than the number of neurons in the model. For GPT-4-scale models at ~100K dims, the headroom is larger still.
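A quick sanity check on that claim (a back-of-the-envelope sketch; it relies only on the fact that the inner product of two independent random unit directions in d dimensions is approximately Normal(0, 1/d)):

```python
from math import cos, erfc, radians, sqrt

d = 12288                  # GPT-3 embedding width
eps = cos(radians(85))     # largest |cosine| allowed if all pairwise angles stay within 85°-95°

# Inner products of independent random unit directions in d dimensions concentrate
# around 0 (i.e. around 90°) with standard deviation roughly 1/sqrt(d).
sigma = 1 / sqrt(d)
print(f"tolerance |cos| <= {eps:.3f}; typical |cos| of a random pair ~ {sigma:.4f}")

# Probability that one random pair falls outside the tolerance (two-sided Gaussian tail):
p_bad = erfc(eps / (sigma * sqrt(2)))
print(f"P(a random pair violates the tolerance) ~ {p_bad:.1e}")

# Union bound over ~k^2/2 pairs: k random directions stay simultaneously near-orthogonal
# as long as k^2 * p_bad stays well below 1, i.e. up to roughly:
k_safe = 1 / sqrt(p_bad)
print(f"roughly {k_safe:.1e} random directions fit before any pair is expected to clash")
```

Even plain random sampling gives tens of billions of usable directions at this width and tolerance; deliberately optimising the directions can pack in more still.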
Why neural networks exploit this
Training pressures favour representing as many distinct features as possible with as little interference as possible. If feature directions are nearly perpendicular, adding a component in one direction barely perturbs the others — so the interference cost is low. With exponential headroom, a model can cram a huge number of concepts into the same embedding space, at the cost of a small but nonzero interference budget.
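A toy version of that trade-off (a minimal sketch; the sizes, strengths, and variable names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 4096          # 8x more features than dimensions

# One random unit direction per feature; in high dimensions these are all nearly orthogonal.
W = rng.standard_normal((n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Superpose a sparse handful of active features, each with a known strength, into one d-dim vector.
active = rng.choice(n_features, size=5, replace=False)
strengths = rng.uniform(1.0, 2.0, size=5)
x = strengths @ W[active]

# Read each feature back by projecting onto its direction: active features come back close to
# their true strengths, and every inactive feature picks up only a little interference noise.
readout = W @ x
print("true strengths      :", np.round(strengths, 2))
print("recovered strengths :", np.round(readout[active], 2))
print("max |interference| across inactive features:",
      round(float(np.abs(np.delete(readout, active)).max()), 3))
```

The recovered strengths are close but not exact, and the inactive features are not exactly zero: that residue is the interference budget superposition pays for the extra capacity.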
This is the key mechanism behind the (empirical) observation that capability scales faster than linearly with width: a model with 10× more dimensions can store vastly more than 10× as many features.
Consequences for interpretability
- Single neurons usually don’t mean anything clean. If features are superimposed, any individual neuron fires for some specific combination of features rather than one recognisable concept. The naïve “one feature per neuron” intuition from small image classifiers is misleading at LLM scale.
- Features are visible as patterns across many neurons. To recover interpretable features you have to find the right basis of the activation space, not stare at individual neurons.
- Sparse autoencoders (SAEs) are the current mainstream tool for this — a wide, sparsity-penalised autoencoder is trained on a model’s activations to extract a large dictionary of near-orthogonal “feature directions” that are individually interpretable (a minimal sketch follows this list).
- It’s partly why LLMs are hard to debug. You can’t just find “the Michael Jordan neuron” — the Michael Jordan feature is some specific direction in the activation space, spread across many neurons, and you need machinery to decode it.
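For concreteness, here is roughly what that SAE recipe looks like (a bare-bones PyTorch sketch; the widths, L1 coefficient, learning rate, and random batch are placeholders, and real setups add details such as decoder-norm constraints and train on activations captured from an actual model):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: map d-dimensional activations onto an overcomplete, sparse feature dictionary."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction term keeps the dictionary faithful; the L1 term pushes most feature
    # activations to zero, so each input is explained by only a few dictionary directions.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Toy training loop on random "activations"; real runs use activations recorded from an MLP
# layer or the residual stream of the model being studied.
sae = SparseAutoencoder(d_model=768, n_features=8 * 768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
batch = torch.randn(64, 768)
for _ in range(100):
    feats, recon = sae(batch)
    loss = sae_loss(batch, feats, recon)
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, each column of sae.decoder.weight is a candidate "feature direction".
```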
Caveats
- Tolerance matters a lot. JL’s exponential scaling only kicks in once the allowed angle is loose enough relative to the dimension. As 3b1b noted, at a tight tolerance 100 dimensions aren’t enough to fit more than ~100 near-perpendicular vectors; with a looser tolerance you get huge headroom already at GPT-3 scale (the back-of-the-envelope after this list puts numbers on this).
- “Superposition exists” ≠ “every feature is superposed”. Some features in real models do appear to be nearly dedicated to a single neuron. Superposition is a mechanism the model can use, not a universal rule.
- Mechanistic interpretability is new. The hypothesis is well-supported but far from a closed book.
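To put numbers on the first caveat, here is the heuristic exponent n·ε² from earlier evaluated at a few (dimension, tolerance) pairs; the specific tolerances are illustrative choices, not figures quoted from the 3b1b video:

```python
from math import cos, radians

# Exponential packing headroom needs n * eps^2 (the exponent in the heuristic bound above)
# to be comfortably bigger than 1, where eps is the cosine of the tightest allowed angle.
for n, theta_min in [(100, 89), (100, 85), (12288, 89), (12288, 85)]:
    eps = cos(radians(theta_min))
    print(f"n = {n:>6}, angles kept within {theta_min}°-{180 - theta_min}°:  n*eps^2 = {n * eps**2:7.2f}")
```

At 100 dimensions the exponent stays below 1 for both tolerances, so there is essentially no room beyond the ~100 strictly orthogonal directions; at 12,288 dimensions the looser tolerance pushes it above 90, which is where the exponential headroom comes from.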
See also
- johnson-lindenstrauss-lemma — the math that makes superposition possible
- multilayer-perceptron — where the MLP neurons that sparse autoencoders usually target live
- word-embedding — the residual stream also exhibits superposition
- src-3b1b-llms-ch4-mlps-store-facts — the 3b1b walkthrough
Further reading
- Anthropic, Toy Models of Superposition
- Anthropic, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning