Summary: The unnormalised real-valued scores that a neural network produces before softmax turns them into a probability distribution. In an LLM, “the logits” specifically means the output of the unembedding step — one score per vocabulary token — before softmax is applied for sampling.
What they are
For a vocabulary of size $V$, the logits are a vector $z \in \mathbb{R}^V$:

$$z = W_U\, h$$

where $W_U \in \mathbb{R}^{V \times d}$ is the unembedding matrix and $h \in \mathbb{R}^d$ is the last layer's final-token residual vector. Each entry $z_i$ is the dot product of one row of $W_U$ with $h$. Logits can be any real number (positive, negative, unbounded), which is why they need softmax before they can be interpreted as probabilities.
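A minimal sketch of this step in PyTorch, with made-up dimensions $d$ and $V$ (the shapes and tensor names here are illustrative, not from any particular model):

```python
import torch

# Hypothetical sizes: hidden dimension d and vocabulary size V.
d, V = 512, 32000

h = torch.randn(d)        # final-token residual vector from the last layer
W_U = torch.randn(V, d)   # unembedding matrix: one row per vocabulary token

logits = W_U @ h          # z = W_U h, shape (V,): one raw score per token
probs = torch.softmax(logits, dim=-1)  # only now is it a distribution

print(logits.min(), logits.max())  # unbounded real values
print(probs.sum())                 # ~1.0 after softmax
```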
Why the name
“Logit” comes from the log-odds of a binary outcome:

$$\operatorname{logit}(p) = \log \frac{p}{1-p}$$

The multi-class generalisation is log-probabilities up to an additive constant, because softmax of a log-prob vector recovers the probabilities:

$$\operatorname{softmax}(\log p + c)_i = \frac{e^{\log p_i + c}}{\sum_j e^{\log p_j + c}} = \frac{e^{c}\, p_i}{e^{c} \sum_j p_j} = p_i$$
In deep learning, “logits” is used more loosely to mean any pre-softmax raw output, even when the connection to log-odds is informal.
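A quick numerical check of this identity, sketched in PyTorch (the distribution and the additive constant are arbitrary):

```python
import torch

p = torch.tensor([0.7, 0.2, 0.1])  # some probability distribution
logits = torch.log(p) + 3.0        # log-probs shifted by an arbitrary constant

recovered = torch.softmax(logits, dim=-1)
print(torch.allclose(recovered, p))  # True: the constant cancels out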
Things to remember
- Only differences matter. Adding a constant to every logit leaves the softmax output unchanged, so absolute logit values are not meaningful in isolation — only differences are.
- Cross-entropy loss is computed from logits directly. It's almost always more numerically stable to combine log-softmax + NLL into a single operation (`F.cross_entropy` in PyTorch) than to softmax first and log second.
- Sampling tricks live here. Temperature scaling, top-$k$ and top-$p$ (nucleus) sampling, repetition penalties, and logit biases all work by modifying the logits before softmax. A short sketch of all three points follows this list.
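A compact sketch of these points in PyTorch, using toy logits rather than real model output (the temperature and $k$ values are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5, 3.0]])  # toy logits, 4-token vocab
target = torch.tensor([3])

# 1. Only differences matter: adding a constant leaves softmax unchanged.
print(torch.allclose(F.softmax(logits, dim=-1),
                     F.softmax(logits + 100.0, dim=-1)))  # True

# 2. Cross-entropy straight from logits: log-softmax + NLL are fused
#    internally, avoiding the underflow of softmax-then-log.
loss = F.cross_entropy(logits, target)

# 3. Sampling tricks modify the logits before softmax.
temperature = 0.8
scaled = logits / temperature                 # temperature scaling

k = 2                                         # top-k: keep the k largest logits
topk_vals, _ = torch.topk(scaled, k, dim=-1)
masked = scaled.masked_fill(scaled < topk_vals[..., -1:], float("-inf"))

next_token = torch.multinomial(F.softmax(masked, dim=-1), num_samples=1)
```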
See also
- softmax — the function that normalises logits into probabilities
- unembedding — the layer that produces logits in an LLM
- word-embedding — the mirror operation at the input side