Summary: The unnormalised real-valued scores that a neural network produces before softmax turns them into a probability distribution. In an LLM, “the logits” specifically means the output of the unembedding step — one score per vocabulary token — before softmax is applied for sampling.

What they are

For a vocabulary of size |V|, the logits are a vector z ∈ ℝ^{|V|}:

    z = W_U h

Each entry z_i is the dot product of row i of the unembedding matrix W_U with the last layer's final-token residual vector h. Logits can be any real number (positive, negative, unbounded), which is why they need softmax before they can be interpreted as probabilities.
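A minimal sketch of this step, with made-up numbers (a 3-dimensional residual vector and a 4-token vocabulary; the names h, W_U, and logits are illustrative, not any particular library's API):

```python
# Final-token residual vector from the last layer (made-up values).
h = [0.5, -1.0, 2.0]

# Unembedding matrix: one row per vocabulary token (made-up values).
W_U = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]

# Each logit is the dot product of one unembedding row with h,
# giving one unnormalised score per vocabulary token.
logits = [sum(w * x for w, x in zip(row, h)) for row in W_U]
print(logits)  # [0.5, -1.0, 2.0, 0.75]
```

In a real LLM this is a single matrix-vector (or matrix-matrix) multiply, but the per-entry arithmetic is exactly this dot product.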

Why the name

“Logit” comes from the log-odds of a binary outcome: logit(p) = log(p / (1 − p)). The multi-class generalisation is log-probabilities up to an additive constant, because softmax of a log-prob vector recovers the probabilities:

    softmax(log p + c)_i = exp(log p_i + c) / Σ_j exp(log p_j + c) = p_i

In deep learning, “logits” is used more loosely to mean any pre-softmax raw output, even when the connection to log-odds is informal.
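The "log-probabilities up to an additive constant" claim can be checked numerically. A small self-contained sketch (the distribution p and constant c are made up for illustration):

```python
import math

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

p = [0.1, 0.2, 0.7]                  # an arbitrary probability distribution
c = 3.0                              # an arbitrary additive constant

# Treat log-probs shifted by c as "logits"; softmax recovers p exactly.
z = [math.log(pi) + c for pi in p]
recovered = softmax(z)
print(recovered)
```

The recovered values match p up to floating-point error, for any choice of c.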

Things to remember

  • Only differences matter. Adding the same constant to every logit leaves the softmax output unchanged, so absolute logit values carry no meaning in isolation.
  • Cross-entropy loss is computed from logits directly. It’s almost always more numerically stable to combine log-softmax + NLL into a single operation (F.cross_entropy in PyTorch) than to softmax first and log second.
  • Sampling tricks live here. Temperature scaling, top-k and top-p (nucleus) sampling, repetition penalties, and logit biases all work by modifying the logits before softmax.
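The first three points above can be demonstrated in a few lines of plain Python (logits, target index, and temperature are made-up values; the log-sum-exp trick shown is the same stabilisation that fused implementations like F.cross_entropy rely on):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def log_softmax(z):
    # Stable log-softmax via the log-sum-exp trick: one pass on the logits,
    # rather than the lossy two-step log(softmax(z)).
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

logits = [2.0, 1.0, -0.5]  # made-up logits for a 3-token vocabulary

# 1. Only differences matter: a uniform shift leaves softmax unchanged.
shifted = [v + 100.0 for v in logits]
assert softmax(logits) == softmax(shifted)

# 2. Cross-entropy straight from logits: -log_softmax(logits)[target].
target = 0
nll = -log_softmax(logits)[target]

# 3. Temperature scaling divides the logits before softmax;
#    T < 1 sharpens the distribution, T > 1 flattens it.
T = 0.5
sharpened = softmax([v / T for v in logits])
print(nll, sharpened)
```

Note that the shift-invariance check holds exactly here, not just approximately, because softmax already subtracts max(z) internally and the +100 cancels term by term.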

See also

  • softmax — the function that normalises logits into probabilities
  • unembedding — the layer that produces logits in an LLM
  • word-embedding — the mirror operation at the input side