Summary: The hypothesis — backed by a growing body of interpretability work — that neural networks store far more features than they have neurons by assigning each feature to a direction that is nearly, rather than strictly, orthogonal to every other feature's direction. This lets an n-dimensional space carry exponentially many features, but means individual neurons rarely correspond to single clean concepts.
The pigeonhole, and the escape from it
If you want a set of vectors in an n-dimensional space to be strictly orthogonal to one another, you can fit at most n of them. That’s essentially the mathematical definition of dimension. By this standard, GPT-3’s 12,288-dimensional embedding space could encode at most 12,288 distinct features.
But real models appear to track far more features than that. How?
Relax “orthogonal” to “nearly orthogonal”. Allow feature directions to sit at angles somewhere in the neighbourhood of 90° (a few degrees either side) instead of exactly 90°. The Johnson–Lindenstrauss lemma then implies that the maximum number of such vectors grows exponentially with the dimension n.
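Roughly how exponential? A heuristic packing bound (a sketch based on the standard concentration-of-measure argument behind JL, not something derived in this note; c is an unspecified constant of order one): if every pairwise angle must stay within some tolerance of 90°, so that every pairwise cosine satisfies |cos| ≤ ε (for an 85°–95° tolerance, ε = cos 85° ≈ 0.087), then the number of directions that fit in n dimensions scales roughly as

$$
N(n, \varepsilon) \;\approx\; \exp\!\big(c \, n \, \varepsilon^{2}\big).
$$

The exponent n·ε² is the quantity to watch: it is negligible for tight tolerances in low dimensions and enormous at LLM widths.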
For GPT-3’s 12,288 dims with an 85° tolerance (pairwise angles anywhere between 85° and 95°), this allows an astronomically large number of near-orthogonal directions — far more than the number of neurons in the model. For GPT-4-scale models at ~100K dims, the headroom is larger still.
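A quick sanity check on that claim (a back-of-the-envelope sketch; it relies only on the fact that the inner product of two independent random unit directions in d dimensions is approximately Normal(0, 1/d)):

```python
from math import cos, erfc, radians, sqrt

d = 12288                  # GPT-3 embedding width
eps = cos(radians(85))     # largest |cosine| allowed if all pairwise angles stay within 85°-95°

# Inner products of independent random unit directions in d dimensions concentrate
# around 0 (i.e. around 90°) with standard deviation roughly 1/sqrt(d).
sigma = 1 / sqrt(d)
print(f"tolerance |cos| <= {eps:.3f}; typical |cos| of a random pair ~ {sigma:.4f}")

# Probability that one random pair falls outside the tolerance (two-sided Gaussian tail):
p_bad = erfc(eps / (sigma * sqrt(2)))
print(f"P(a random pair violates the tolerance) ~ {p_bad:.1e}")

# Union bound over ~k^2/2 pairs: k random directions stay simultaneously near-orthogonal
# as long as k^2 * p_bad stays well below 1, i.e. up to roughly:
k_safe = 1 / sqrt(p_bad)
print(f"roughly {k_safe:.1e} random directions fit before any pair is expected to clash")
```

Even plain random sampling gives tens of billions of usable directions at this width and tolerance; deliberately optimising the directions can pack in more still.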
Why neural networks exploit this
Training pressures favour representing as many distinct features as possible with as little interference as possible. If feature directions are nearly perpendicular, adding a component in one direction barely perturbs the others — so the interference cost is low. With exponential headroom, a model can cram a huge number of concepts into the same embedding space, at the cost of a small but nonzero interference budget.
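A toy version of that trade-off (a minimal sketch; the sizes, strengths, and variable names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 4096          # 8x more features than dimensions

# One random unit direction per feature; in high dimensions these are all nearly orthogonal.
W = rng.standard_normal((n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Superpose a sparse handful of active features, each with a known strength, into one d-dim vector.
active = rng.choice(n_features, size=5, replace=False)
strengths = rng.uniform(1.0, 2.0, size=5)
x = strengths @ W[active]

# Read each feature back by projecting onto its direction: active features come back close to
# their true strengths, and every inactive feature picks up only a little interference noise.
readout = W @ x
print("true strengths      :", np.round(strengths, 2))
print("recovered strengths :", np.round(readout[active], 2))
print("max |interference| across inactive features:",
      round(float(np.abs(np.delete(readout, active)).max()), 3))
```

The recovered strengths are close but not exact, and the inactive features are not exactly zero: that residue is the interference budget superposition pays for the extra capacity.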
This is the key mechanism behind the (empirical) observation that capability scales faster than linearly with width: a model with 10× more dimensions can store vastly more than 10× as many features.
Consequences for interpretability
- Single neurons usually don’t mean anything clean. If features are superimposed, any individual neuron fires for some specific combination of features rather than one recognisable concept. The naïve “one feature per neuron” intuition from small image classifiers is misleading at LLM scale.
- Features are visible as patterns across many neurons. To recover interpretable features you have to find the right basis of the activation space, not stare at individual neurons.
- Sparse autoencoders (SAEs) are the current mainstream tool for this — a wide, sparsity-penalised autoencoder is trained on a model’s activations to extract a large dictionary of near-orthogonal “feature directions” that are individually interpretable (a minimal sketch follows this list).
- It’s partly why LLMs are hard to debug. You can’t just find “the Michael Jordan neuron” — the Michael Jordan feature is some specific direction in the activation space, spread across many neurons, and you need machinery to decode it.
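For concreteness, here is roughly what that SAE recipe looks like (a bare-bones PyTorch sketch; the widths, L1 coefficient, learning rate, and random batch are placeholders, and real setups add details such as decoder-norm constraints and train on activations captured from an actual model):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: map d-dimensional activations onto an overcomplete, sparse feature dictionary."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction term keeps the dictionary faithful; the L1 term pushes most feature
    # activations to zero, so each input is explained by only a few dictionary directions.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Toy training loop on random "activations"; real runs use activations recorded from an MLP
# layer or the residual stream of the model being studied.
sae = SparseAutoencoder(d_model=768, n_features=8 * 768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
batch = torch.randn(64, 768)
for _ in range(100):
    feats, recon = sae(batch)
    loss = sae_loss(batch, feats, recon)
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, each column of sae.decoder.weight is a candidate "feature direction".
```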
Caveats
- Tolerance matters a lot. JL’s exponential scaling only kicks in once the allowed angle is loose enough relative to the dimension. As 3b1b noted, at a tight tolerance 100 dimensions aren’t enough to fit more than ~100 near-perpendicular vectors; with a looser tolerance you get huge headroom already at GPT-3 scale (the back-of-the-envelope after this list puts numbers on this).
- “Superposition exists” ≠ “every feature is superposed”. Some features in real models do appear to be nearly dedicated to a single neuron. Superposition is a mechanism the model can use, not a universal rule.
- Mechanistic interpretability is new. The hypothesis is well-supported but far from a closed book.
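To put numbers on the first caveat, here is the heuristic exponent n·ε² from earlier evaluated at a few (dimension, tolerance) pairs; the specific tolerances are illustrative choices, not figures quoted from the 3b1b video:

```python
from math import cos, radians

# Exponential packing headroom needs n * eps^2 (the exponent in the heuristic bound above)
# to be comfortably bigger than 1, where eps is the cosine of the tightest allowed angle.
for n, theta_min in [(100, 89), (100, 85), (12288, 89), (12288, 85)]:
    eps = cos(radians(theta_min))
    print(f"n = {n:>6}, angles kept within {theta_min}°-{180 - theta_min}°:  n*eps^2 = {n * eps**2:7.2f}")
```

At 100 dimensions the exponent stays below 1 for both tolerances, so there is essentially no room beyond the ~100 strictly orthogonal directions; at 12,288 dimensions the looser tolerance pushes it above 90, which is where the exponential headroom comes from.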
See also
- johnson-lindenstrauss-lemma — the math that makes superposition possible
- multilayer-perceptron — where the MLP neurons that sparse autoencoders usually target live
- word-embedding — the residual stream also exhibits superposition
- src-3b1b-llms-ch4-mlps-store-facts — the 3b1b walkthrough
Further reading
- Anthropic, Toy Models of Superposition
- Anthropic, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning