Summary: A result in high-dimensional geometry that says you can project a set of $n$ points from very high dimensions down to surprisingly few dimensions while approximately preserving all pairwise distances, and (equivalently) that the number of vectors you can fit in $\mathbb{R}^d$ that are nearly orthogonal to one another grows exponentially in $d$. This second framing is the one that matters for superposition in neural networks.
The classic statement
Given $n$ points in $\mathbb{R}^d$ (think: high-dimensional feature vectors) and an error tolerance $\varepsilon \in (0, 1)$, there exists a linear map $f : \mathbb{R}^d \to \mathbb{R}^k$ with

$$k = O\!\left(\frac{\log n}{\varepsilon^2}\right)$$

such that for every pair of input points $u, v$,

$$(1 - \varepsilon)\,\lVert u - v \rVert^2 \;\le\; \lVert f(u) - f(v) \rVert^2 \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2.$$
In words: you can project $n$ points down to about $(\log n)/\varepsilon^2$ dimensions and still have every pairwise distance preserved up to a $1 \pm \varepsilon$ multiplicative error. Remarkably, the needed dimension $k$ depends only on $n$ and $\varepsilon$, not on the original dimension $d$.
A standard proof uses a random Gaussian projection: pick $f(x) = \tfrac{1}{\sqrt{k}} A x$ with the entries of $A \in \mathbb{R}^{k \times d}$ drawn i.i.d. from $\mathcal{N}(0, 1)$, and with high probability this map preserves the distances for your fixed point set.
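A minimal numerical sketch of that construction, assuming NumPy (the point set and the values of $n$, $d$, $\varepsilon$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, eps = 1_000, 10_000, 0.2               # illustrative sizes, not canonical values
k = int(np.ceil(8 * np.log(n) / eps**2))     # target dimension ~ O(log n / eps^2)

X = rng.normal(size=(n, d))                  # n points in R^d
A = rng.normal(size=(k, d)) / np.sqrt(k)     # random Gaussian projection, scaled by 1/sqrt(k)
Y = X @ A.T                                  # projected points in R^k

# Compare distances before and after projection on a random sample of pairs.
i, j = rng.integers(0, n, size=(2, 2_000))
keep = i != j
orig = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1)
proj = np.linalg.norm(Y[i[keep]] - Y[j[keep]], axis=1)
ratio = proj / orig
print(k, ratio.min(), ratio.max())           # ratios stay within roughly 1 +/- eps
```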
The “near-orthogonal vectors” framing
The version that shows up in superposition discussions is a dual consequence:
In $\mathbb{R}^d$, the maximum number of unit vectors you can fit so that every pair has dot product at most $\varepsilon$ in absolute value (i.e. all pairs within some fixed angle of perpendicular) grows exponentially in $d$.
Intuitively: as $d$ grows, high-dimensional space is so “spacious” that random unit vectors are already close to perpendicular to each other, and you don’t need many extra dimensions to fit a lot of such vectors.
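A quick way to see that spaciousness, assuming NumPy (the sample count and dimensions are arbitrary): draw a few hundred random unit vectors and check how close to orthogonal the worst pair is.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (64, 1_024, 12_288):                      # increasing ambient dimension
    V = rng.normal(size=(500, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # 500 random unit vectors in R^d
    dots = np.abs(V @ V.T)                         # absolute pairwise dot products
    np.fill_diagonal(dots, 0.0)                    # ignore self dot products
    print(d, float(dots.max()))                    # worst-case |cosine| shrinks ~ 1/sqrt(d)
```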
Scaling intuition
The packing count grows roughly like $\exp(c\,\varepsilon^2 d)$ for a modest constant $c$, so the scaling is sharply nonlinear in the tolerance $\varepsilon$ (worked numbers in the sketch after this list):
- Tight tolerance (say $\varepsilon \approx 0.01$) → the exponential regime doesn’t really kick in until $d$ is in the hundreds of thousands.
- Looser tolerance (say $\varepsilon \approx 0.1$, i.e. pairwise dot products within $\pm 0.1$) → headroom explodes much earlier.
- At $\varepsilon = 0.1$ and $d = 12{,}288$ (GPT-3’s embedding width), the $\exp(\varepsilon^2 d / 4)$ lower bound already allows on the order of $10^{13}$ near-orthogonal unit vectors.
- At a still looser tolerance (say $\varepsilon \approx 0.25$) and frontier-model widths ($d \gtrsim 16{,}000$), the count is somewhere past $10^{100}$: a googol of room.
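A back-of-the-envelope version of these numbers, using the standard probabilistic lower bound of roughly $\exp(\varepsilon^2 d / 4)$ vectors (the constant $1/4$ depends on the exact argument; the $(\varepsilon, d)$ pairs are the illustrative ones from the list above):

```python
import math

def log10_capacity(eps: float, d: int, c: float = 0.25) -> float:
    """log10 of the rough lower bound exp(c * eps^2 * d) on how many unit vectors
    fit in R^d with every pairwise |dot product| below eps."""
    return c * eps**2 * d / math.log(10)

for eps, d in [(0.01, 12_288), (0.01, 500_000), (0.1, 12_288), (0.25, 16_384)]:
    print(f"eps={eps:<5} d={d:<8} -> ~10^{log10_capacity(eps, d):.1f} vectors")
```

With a tight tolerance the exponent barely clears zero at $d = 12{,}288$, while the looser settings give the $10^{13}$ and googol-scale figures quoted above.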
This tolerance sensitivity matters: the “superposition gives exponential capacity” story needs a reasonably loose angular budget before the exponential is genuinely load-bearing.
Why it matters for LLMs
The embedding/residual-stream vectors of a transformer live in a space with $d_{\text{model}}$ dimensions, typically in the thousands to tens of thousands. Because JL lets the model pack exponentially many near-orthogonal feature directions into that space, it can represent vastly more distinct concepts than it has dimensions (or neurons). See superposition for the interpretability consequences.
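To make that concrete, here is a toy superposition sketch, assuming NumPy (the sizes, the unit coefficients, and the 0.5 readout threshold are illustrative choices, not taken from any real model): pack 4,000 sparse “features” into 1,000 dimensions as random near-orthogonal directions and read them back with dot products.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, k_active = 1_000, 4_000, 5                # 4x more features than dimensions
F = rng.normal(size=(m, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # one near-orthogonal direction per feature

active = rng.choice(m, size=k_active, replace=False)
x = F[active].sum(axis=0)                       # residual-stream-like vector: sum of active directions

readout = F @ x                                 # dot x against every feature direction
recovered = np.flatnonzero(readout > 0.5)       # active features read near 1, inactive near 0
print(sorted(active.tolist()) == sorted(recovered.tolist()))   # True: exact recovery
```

Because all pairwise dot products are small, the cross-terms in the readout stay well below the 0.5 threshold, so a handful of active features can be decoded exactly even though there are far more features than dimensions.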
See also
- superposition — the main application here
- word-embedding — the space these vectors live in
- transformer-architecture — where high-dimensional embedding spaces come from