Summary: A result in high-dimensional geometry that says you can project a set of $n$ points from very high dimensions down to surprisingly few dimensions while approximately preserving all pairwise distances, and (equivalently) that the number of vectors you can fit in $\mathbb{R}^d$ that are nearly orthogonal to one another grows exponentially in $d$. This second framing is the one that matters for superposition in neural networks.
The classic statement
Given $n$ points in $\mathbb{R}^d$ (think: high-dimensional feature vectors) and an error tolerance $\varepsilon \in (0, 1)$, there exists a linear map $f : \mathbb{R}^d \to \mathbb{R}^k$ with

$$k = O\!\left(\frac{\log n}{\varepsilon^2}\right)$$

such that for every pair of input points $u, v$,

$$(1 - \varepsilon)\,\lVert u - v \rVert^2 \;\le\; \lVert f(u) - f(v) \rVert^2 \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2.$$
In words: you can project $n$ points down to about $(\log n)/\varepsilon^2$ dimensions and still have every pairwise distance preserved up to a $1 \pm \varepsilon$ multiplicative error. Remarkably, the needed dimension $k$ depends only on $n$ and $\varepsilon$, not on the original dimension $d$.
A standard proof uses a random Gaussian projection: pick $f(x) = \tfrac{1}{\sqrt{k}} A x$ with the entries of $A \in \mathbb{R}^{k \times d}$ drawn i.i.d. from $\mathcal{N}(0, 1)$, and with high probability this map preserves the distances for your fixed point set.
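A minimal numerical sketch of that construction, assuming NumPy (the point set and the values of $n$, $d$, $\varepsilon$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, eps = 1_000, 10_000, 0.2               # illustrative sizes, not canonical values
k = int(np.ceil(8 * np.log(n) / eps**2))     # target dimension ~ O(log n / eps^2)

X = rng.normal(size=(n, d))                  # n points in R^d
A = rng.normal(size=(k, d)) / np.sqrt(k)     # random Gaussian projection, scaled by 1/sqrt(k)
Y = X @ A.T                                  # projected points in R^k

# Compare distances before and after projection on a random sample of pairs.
i, j = rng.integers(0, n, size=(2, 2_000))
keep = i != j
orig = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1)
proj = np.linalg.norm(Y[i[keep]] - Y[j[keep]], axis=1)
ratio = proj / orig
print(k, ratio.min(), ratio.max())           # ratios stay within roughly 1 +/- eps
```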
The “near-orthogonal vectors” framing
The version that shows up in superposition discussions is a dual consequence:
In $\mathbb{R}^d$, the maximum number of unit vectors you can fit so that every pair has dot product at most $\varepsilon$ in absolute value (i.e. all pairs within some fixed angle of perpendicular) grows exponentially in $d$.
Intuitively: as $d$ grows, high-dimensional space is so “spacious” that random unit vectors are already close to perpendicular to each other, and you don’t need many extra dimensions to fit a lot of such vectors.
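A quick way to see that spaciousness, assuming NumPy (the sample count and dimensions are arbitrary): draw a few hundred random unit vectors and check how close to orthogonal the worst pair is.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (64, 1_024, 12_288):                      # increasing ambient dimension
    V = rng.normal(size=(500, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # 500 random unit vectors in R^d
    dots = np.abs(V @ V.T)                         # absolute pairwise dot products
    np.fill_diagonal(dots, 0.0)                    # ignore self dot products
    print(d, float(dots.max()))                    # worst-case |cosine| shrinks ~ 1/sqrt(d)
```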
Scaling intuition
The packing count grows roughly like $\exp(c\,\varepsilon^2 d)$ for a modest constant $c$, so the scaling is sharply nonlinear in the tolerance $\varepsilon$ (worked numbers in the sketch after this list):
- Tight tolerance (say $\varepsilon \approx 0.01$) → the exponential regime doesn’t really kick in until $d$ is in the hundreds of thousands.
- Looser tolerance (say $\varepsilon \approx 0.1$, i.e. pairwise dot products within $\pm 0.1$) → headroom explodes much earlier.
- At $\varepsilon = 0.1$ and $d = 12{,}288$ (GPT-3’s embedding width), the $\exp(\varepsilon^2 d / 4)$ lower bound already allows on the order of $10^{13}$ near-orthogonal unit vectors.
- At a still looser tolerance (say $\varepsilon \approx 0.25$) and frontier-model widths ($d \gtrsim 16{,}000$), the count is somewhere past $10^{100}$: a googol of room.
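A back-of-the-envelope version of these numbers, using the standard probabilistic lower bound of roughly $\exp(\varepsilon^2 d / 4)$ vectors (the constant $1/4$ depends on the exact argument; the $(\varepsilon, d)$ pairs are the illustrative ones from the list above):

```python
import math

def log10_capacity(eps: float, d: int, c: float = 0.25) -> float:
    """log10 of the rough lower bound exp(c * eps^2 * d) on how many unit vectors
    fit in R^d with every pairwise |dot product| below eps."""
    return c * eps**2 * d / math.log(10)

for eps, d in [(0.01, 12_288), (0.01, 500_000), (0.1, 12_288), (0.25, 16_384)]:
    print(f"eps={eps:<5} d={d:<8} -> ~10^{log10_capacity(eps, d):.1f} vectors")
```

With a tight tolerance the exponent barely clears zero at $d = 12{,}288$, while the looser settings give the $10^{13}$ and googol-scale figures quoted above.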
This tolerance sensitivity matters: the “superposition gives exponential capacity” story needs a reasonably loose angular budget before the exponential is genuinely load-bearing.
Why it matters for LLMs
The embedding/residual-stream vectors of a transformer live in a space with $d_{\text{model}}$ dimensions, typically in the thousands to tens of thousands. Because JL lets the model pack exponentially many near-orthogonal feature directions into that space, it can represent vastly more distinct concepts than it has dimensions (or neurons). See superposition for the interpretability consequences.
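To make that concrete, here is a toy superposition sketch, assuming NumPy (the sizes, the unit coefficients, and the 0.5 readout threshold are illustrative choices, not taken from any real model): pack 4,000 sparse “features” into 1,000 dimensions as random near-orthogonal directions and read them back with dot products.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, k_active = 1_000, 4_000, 5                # 4x more features than dimensions
F = rng.normal(size=(m, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # one near-orthogonal direction per feature

active = rng.choice(m, size=k_active, replace=False)
x = F[active].sum(axis=0)                       # residual-stream-like vector: sum of active directions

readout = F @ x                                 # dot x against every feature direction
recovered = np.flatnonzero(readout > 0.5)       # active features read near 1, inactive near 0
print(sorted(active.tolist()) == sorted(recovered.tolist()))   # True: exact recovery
```

Because all pairwise dot products are small, the cross-terms in the readout stay well below the 0.5 threshold, so a handful of active features can be decoded exactly even though there are far more features than dimensions.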
See also
- superposition — the main application here
- word-embedding — the space these vectors live in
- transformer-architecture — where high-dimensional embedding spaces come from