Summary: The unnormalised real-valued scores that a neural network produces before softmax turns them into a probability distribution. In an LLM, “the logits” specifically means the output of the unembedding step — one score per vocabulary token — before softmax is applied for sampling.

What they are

For a vocabulary of size |V|, the logits are a vector z ∈ ℝ^{|V|}:

    z = W_U h

Each entry z_i is the dot product of row i of the unembedding matrix W_U with the last layer's final-token residual vector h. Logits can be any real number (positive, negative, unbounded), which is why they need softmax before they can be interpreted as probabilities.
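A minimal sketch of this step, with made-up numbers (a 3-dimensional residual vector and a 4-token vocabulary; the names h, W_U, and logits are illustrative, not any particular library's API):

```python
# Final-token residual vector from the last layer (made-up values).
h = [0.5, -1.0, 2.0]

# Unembedding matrix: one row per vocabulary token (made-up values).
W_U = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]

# Each logit is the dot product of one unembedding row with h,
# giving one unnormalised score per vocabulary token.
logits = [sum(w * x for w, x in zip(row, h)) for row in W_U]
print(logits)  # [0.5, -1.0, 2.0, 0.75]
```

In a real LLM this is a single matrix-vector (or matrix-matrix) multiply, but the per-entry arithmetic is exactly this dot product.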

Why the name

“Logit” comes from the log-odds of a binary outcome: logit(p) = log(p / (1 − p)). The multi-class generalisation is log-probabilities up to an additive constant, because softmax of a log-prob vector recovers the probabilities:

    softmax(log p + c)_i = exp(log p_i + c) / Σ_j exp(log p_j + c) = p_i

In deep learning, “logits” is used more loosely to mean any pre-softmax raw output, even when the connection to log-odds is informal.
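The "log-probabilities up to an additive constant" claim can be checked numerically. A small self-contained sketch (the distribution p and constant c are made up for illustration):

```python
import math

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

p = [0.1, 0.2, 0.7]                  # an arbitrary probability distribution
c = 3.0                              # an arbitrary additive constant

# Treat log-probs shifted by c as "logits"; softmax recovers p exactly.
z = [math.log(pi) + c for pi in p]
recovered = softmax(z)
print(recovered)
```

The recovered values match p up to floating-point error, for any choice of c.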

Things to remember

  • Only differences matter. Adding the same constant to every logit leaves the softmax output unchanged, so absolute logit values carry no meaning in isolation.
  • Cross-entropy loss is computed from logits directly. It’s almost always more numerically stable to combine log-softmax + NLL into a single operation (F.cross_entropy in PyTorch) than to softmax first and log second.
  • Sampling tricks live here. Temperature scaling, top-k and top-p (nucleus) sampling, repetition penalties, and logit biases all work by modifying the logits before softmax.
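The first three points above can be demonstrated in a few lines of plain Python (logits, target index, and temperature are made-up values; the log-sum-exp trick shown is the same stabilisation that fused implementations like F.cross_entropy rely on):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def log_softmax(z):
    # Stable log-softmax via the log-sum-exp trick: one pass on the logits,
    # rather than the lossy two-step log(softmax(z)).
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

logits = [2.0, 1.0, -0.5]  # made-up logits for a 3-token vocabulary

# 1. Only differences matter: a uniform shift leaves softmax unchanged.
shifted = [v + 100.0 for v in logits]
assert softmax(logits) == softmax(shifted)

# 2. Cross-entropy straight from logits: -log_softmax(logits)[target].
target = 0
nll = -log_softmax(logits)[target]

# 3. Temperature scaling divides the logits before softmax;
#    T < 1 sharpens the distribution, T > 1 flattens it.
T = 0.5
sharpened = softmax([v / T for v in logits])
print(nll, sharpened)
```

Note that the shift-invariance check holds exactly here, not just approximately, because softmax already subtracts max(z) internally and the +100 cancels term by term.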

See also

  • softmax — the function that normalises logits into probabilities
  • unembedding — the layer that produces logits in an LLM
  • word-embedding — the mirror operation at the input side