Summary: A function that turns an arbitrary vector of real numbers into a probability distribution — all entries in $(0, 1)$ and summing to 1 — by exponentiating and normalising. It’s the standard way deep learning models produce categorical outputs.

Definition

Given a vector $x \in \mathbb{R}^n$, softmax produces:

$$\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

  • Numerator $e^{x_i}$ is always positive, so every output is $> 0$.
  • Denominator is the sum of the numerators, so outputs sum to 1.
  • Monotonic — larger input → larger output.
  • Invariant to adding a constant: $\operatorname{softmax}(x + c) = \operatorname{softmax}(x)$ for any scalar $c$. In practice implementations subtract $\max(x)$ from every entry before exponentiating to avoid numerical overflow (see the sketch below).
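A minimal sketch of the stable formulation in PyTorch (the function name and example vector are mine, for illustration):

```python
import torch

def softmax(x: torch.Tensor) -> torch.Tensor:
    # Subtracting the max changes nothing mathematically (shift-invariance)
    # but keeps exp() from overflowing on large inputs.
    x = x - x.max()
    counts = x.exp()
    return counts / counts.sum()

x = torch.tensor([2.0, 1.0, 0.1])
print(softmax(x))          # tensor([0.6590, 0.2424, 0.0986])
print(softmax(x + 100.0))  # identical: adding a constant doesn't change the output
```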

Example of a vector, $x$, run through softmax:

$$p = \operatorname{softmax}(x)$$

where

  • $x$ is the vector of logits, of length $|V|$ = number of vocabulary tokens (see image in unembedding note)
  • $p$ is the vector of probabilities (same length as $x$, and elements sum to $1$)
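For instance, with PyTorch’s built-in (the logit values here are made up):

```python
import torch

x = torch.tensor([3.2, -1.0, 0.5])  # logits, one per vocabulary token
p = torch.softmax(x, dim=0)         # probabilities

print(p)        # tensor([0.9240, 0.0139, 0.0621]) (approximately)
print(p.sum())  # tensor(1.)
```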

But why does softmax work?

See notebook 04_from_bigrams_to_nns where I figured this out, and activation-function for the generalisation

But why do elements in xenc @ W "behave like" log-counts?

Softmax assumes its inputs are log-counts (that’s why it exponentiates them).

  • I.e. by feeding xenc @ W into softmax, we are declaring those values to be log-counts by convention.
  • The network then learns W such that xenc @ W actually produces useful log-counts — ones that after softmax give good probability distributions.
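In code, that convention is just two lines after the matrix multiply (a sketch with makemore-style names; shapes are illustrative):

```python
import torch

vocab_size = 27                            # e.g. 26 letters + '.' in the bigram setup
xenc = torch.eye(vocab_size)[:5]           # 5 one-hot example inputs
W = torch.randn(vocab_size, vocab_size)    # randomly initialised weights

logits = xenc @ W                          # declared, by convention, to be log-counts
counts = logits.exp()                      # exp() turns log-counts into positive "counts"
probs = counts / counts.sum(1, keepdim=True)  # normalise each row into a distribution
```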

Here’s the flow

  1. Randomly initialised W → xenc @ W is garbage numbers
  2. Softmax is applied anyway, treating those garbage numbers as log-counts
  3. We measure how bad the resulting probabilities are (loss)
  4. Gradient descent nudges W so that xenc @ W produces log-counts that give better probabilities
    • Works because every step is differentiable (xenc @ W, .exp(), normalisation)
  5. Repeat until xenc @ W genuinely behaves like log-counts
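Condensed into code, the loop looks something like this (a sketch; the data here are random stand-ins and the learning rate is arbitrary):

```python
import torch
import torch.nn.functional as F

vocab_size = 27
xs = torch.randint(0, vocab_size, (32,))   # input token indices (stand-ins)
ys = torch.randint(0, vocab_size, (32,))   # target token indices (stand-ins)

W = torch.randn(vocab_size, vocab_size, requires_grad=True)  # step 1: garbage

for _ in range(100):
    xenc = F.one_hot(xs, num_classes=vocab_size).float()
    logits = xenc @ W                              # treated as log-counts anyway
    counts = logits.exp()                          # step 2: softmax regardless
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(len(ys)), ys].log().mean()  # step 3: how bad is it?
    W.grad = None
    loss.backward()                                # step 4: every step differentiable
    W.data += -0.5 * W.grad                        # nudge W, then repeat (step 5)
```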

The label “log-counts” is the contract softmax imposes on its input — not a property the matrix multiply xenc @ W produces on its own.

Why “soft” max?

The hard version — pick the argmax — is discrete and non-differentiable. Softmax is a smooth approximation that still concentrates mass on the largest entry. If one input dominates, its output is close to 1 and the others are close to 0; if the inputs are similar, the distribution is close to uniform.
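A quick illustration of both regimes (inputs picked arbitrarily):

```python
import torch

dominant = torch.tensor([10.0, 1.0, 0.0])
similar  = torch.tensor([1.0, 1.1, 0.9])

print(torch.softmax(dominant, dim=0))  # ~[0.9998, 0.0001, 0.0000]: near one-hot
print(torch.softmax(similar, dim=0))   # ~[0.33, 0.37, 0.30]: near uniform
```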

Unlike argmax, softmax is differentiable end-to-end, which is what makes it usable anywhere you want a probability distribution inside a neural network.

Where it shows up in transformers

Softmax is the standard nonlinearity wherever a neural network needs to output or use a probability distribution:

  1. Attention patterns — applied column-by-column to the raw scores in attention, turning alignment scores into mixing weights over source tokens.
  2. Next-token prediction — applied to the logits produced by the unembedding step, turning them into a probability distribution over the vocabulary.
  3. Classification heads generally — last-layer outputs of any categorical classifier.
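All three reduce to the same call; a sketch (note the dim argument here follows the common row-per-query layout, not the column convention above):

```python
import torch
import torch.nn.functional as F

# 1. Attention: raw scores -> mixing weights over source tokens.
scores = torch.randn(8, 8)           # hypothetical 8-token sequence
attn = F.softmax(scores, dim=-1)     # each query's weights now sum to 1

# 2./3. Next-token prediction / classification: logits -> class distribution.
logits = torch.randn(27)             # one logit per vocabulary token / class
probs = F.softmax(logits, dim=-1)
```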

Temperature

You can generalise softmax with a positive scalar $T$ called temperature:

$$\operatorname{softmax}_T(x)_i = \frac{e^{x_i/T}}{\sum_j e^{x_j/T}}$$

  • $T = 1$ — standard softmax.
  • $T \to 0$ — the largest input dominates completely; distribution collapses to argmax (one-hot on the max).
  • $T \to \infty$ — all exponents shrink to 0; distribution approaches uniform.
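A quick check of the three regimes (temperatures chosen arbitrarily):

```python
import torch

x = torch.tensor([2.0, 1.0, 0.1])

for T in [0.1, 1.0, 100.0]:
    print(T, torch.softmax(x / T, dim=0))

# T=0.1   -> ~[1.00, 0.00, 0.00]  (collapses toward argmax)
# T=1.0   -> ~[0.66, 0.24, 0.10]  (standard softmax)
# T=100.0 -> ~[0.34, 0.33, 0.33]  (approaches uniform)
```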

Temperature has two main uses in LLMs:

  1. Sampling. Chatbots sample from the next-token distribution with some $T$. Low $T$ → more conservative and repetitive; high $T$ → more diverse and riskier (see the sketch after this list).
  2. Attention scaling. The division by $\sqrt{d_k}$ inside the attention softmax is a temperature-like scaling chosen for numerical stability, not diversity control — without it, the softmax would saturate as $d_k$ grows.
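The sampling use looks like this in code (a sketch; the logits are random placeholders and the temperature value is arbitrary):

```python
import torch

logits = torch.randn(27)       # hypothetical next-token logits
T = 0.8                        # the sampling knob

probs = torch.softmax(logits / T, dim=0)
next_token = torch.multinomial(probs, num_samples=1)  # sample, don't argmax
```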

The name comes from an analogy to thermodynamic temperature — higher $T$ makes the distribution more “noisy” in a loose Boltzmann-distribution sense.

A common gotcha

Logits (the inputs to softmax) are not calibrated probabilities on their own — comparing raw logit values across different contexts is meaningless because the softmax’s shift-invariance means the model is free to add a constant to everything. Only differences between logits carry probabilistic meaning.

See also