Summary: A function that turns an arbitrary vector of real numbers into a probability distribution — all entries in $[0, 1]$ and summing to 1 — by exponentiating and normalising. It’s the standard way deep learning models produce categorical outputs.
Definition
Given a vector $x = (x_1, \dots, x_n)$, softmax produces:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$
- Numerator is always positive, so every output is $> 0$.
- Denominator is the sum of the numerators, so outputs sum to 1.
- Monotonic — larger input → larger output.
- Invariant to constants — $\text{softmax}(x + c) = \text{softmax}(x)$ for any scalar $c$. In practice implementations subtract $\max_i x_i$ from every entry before exponentiating to avoid numerical overflow.
Example of a vector, $z$, run through softmax:

$$p = \text{softmax}(z)$$

where

- $z$ is the vector of logits, of length = number of vocabulary tokens (see image in unembedding note)
- $p$ is the vector of probabilities (same length as $z$, and elements sum to $1$)
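A minimal sketch in Python/NumPy, including the max-subtraction trick from the definition above (the function name and example logits are just illustrative):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability: shift-invariance means
    # this does not change the result, but it prevents exp() overflow.
    shifted = z - np.max(z)
    exps = np.exp(shifted)        # numerators: always positive
    return exps / exps.sum()      # normalise so the outputs sum to 1

logits = np.array([2.0, 1.0, 0.1])   # hypothetical logits over a 3-token vocabulary
probs = softmax(logits)
print(probs, probs.sum())            # [0.659 0.242 0.099] 1.0 (rounded)
```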
But why does softmax work?
See notebook 04_from_bigrams_to_nns where I figured this out, and activation-function for the generalisation
But why do elements in `xenc @ W` “behave like” log-counts?

Softmax assumes its inputs are log-counts (that’s why it exponentiates them).

- I.e. by feeding `xenc @ W` into softmax, we are declaring those values to be log-counts by convention.
- The network then learns `W` such that `xenc @ W` actually produces useful log-counts — ones that after softmax give good probability distributions.

Here’s the flow

- Randomly initialised `W` → `xenc @ W` is garbage numbers
- Softmax is applied anyway, treating those garbage numbers as log-counts
- We measure how bad the resulting probabilities are (loss)
- Gradient descent nudges `W` so that `xenc @ W` produces log-counts that give better probabilities
- Works because every step is differentiable (`xenc @ W`, `.exp()`, normalisation)
- Repeat until `xenc @ W` genuinely behaves like log-counts

The label “log-counts” is the contract softmax imposes on its input — not a property the matrix multiply `xenc @ W` produces on its own.
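A minimal sketch of that flow in PyTorch, assuming a toy bigram-style setup like the one in notebook 04_from_bigrams_to_nns (27 character classes, one-hot inputs, a single weight matrix `W`); the data and shapes here are made up:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(42)

# Toy bigram-style data: each input character index should predict the next one.
xs = torch.tensor([0, 5, 13, 13, 1])   # hypothetical input indices
ys = torch.tensor([5, 13, 13, 1, 0])   # hypothetical target indices

W = torch.randn((27, 27), generator=g, requires_grad=True)  # random init: xenc @ W is garbage at first

for step in range(100):
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W                              # declared, by convention, to be log-counts
    counts = logits.exp()                          # softmax step 1: exponentiate
    probs = counts / counts.sum(1, keepdim=True)   # softmax step 2: normalise
    loss = -probs[torch.arange(len(xs)), ys].log().mean()  # how bad are the probabilities?

    W.grad = None
    loss.backward()          # possible because every step above is differentiable
    W.data += -1.0 * W.grad  # nudge W towards more useful log-counts

print(loss.item())  # falls as W learns
```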
Why “soft” max?
The hard version — pick the argmax — is discrete and non-differentiable. Softmax is a smooth approximation that still concentrates mass on the largest entry. If one input dominates, its output is close to 1 and the others are close to 0; if the inputs are similar, the distribution is close to uniform.
Unlike argmax, softmax is differentiable end-to-end, which is what makes it usable anywhere you want a probability distribution inside a neural network.
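A quick illustration, reusing the `softmax` helper sketched earlier (values are arbitrary):

```python
# One input dominates: mass concentrates on it (close to argmax / one-hot)
print(softmax(np.array([8.0, 1.0, 0.0])))   # ≈ [0.999, 0.001, 0.000]

# Inputs are similar: the distribution is close to uniform
print(softmax(np.array([1.0, 1.1, 0.9])))   # ≈ [0.33, 0.37, 0.30]
```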
Where it shows up in transformers
Softmax is the standard nonlinearity wherever a neural network needs to output or use a probability distribution:
- Attention patterns — applied column-by-column to the raw scores in attention, turning alignment scores into mixing weights over source tokens (see the sketch after this list).
- Next-token prediction — applied to the logits produced by the unembedding step, turning them into a probability distribution over the vocabulary.
- Classification heads generally — last-layer outputs of any categorical classifier.
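A single-query sketch of the attention case, reusing the `softmax` helper from above; the dimensions and random data are just placeholders:

```python
import numpy as np

d_k = 4
rng = np.random.default_rng(0)
q = rng.standard_normal(d_k)        # one query vector
K = rng.standard_normal((3, d_k))   # keys for 3 source tokens
V = rng.standard_normal((3, d_k))   # values for 3 source tokens

scores = K @ q / np.sqrt(d_k)   # raw alignment scores, scaled by sqrt(d_k)
weights = softmax(scores)       # mixing weights over the 3 source tokens (sum to 1)
mixed = weights @ V             # weighted mixture of the value vectors
```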
Temperature
You can generalise softmax with a positive scalar $T$ called temperature (sketched in code after the list below):

$$\text{softmax}_T(x)_i = \frac{e^{x_i / T}}{\sum_j e^{x_j / T}}$$
- $T = 1$ — standard softmax.
- $T \to 0$ — the largest input dominates completely; distribution collapses to argmax (one-hot on the max).
- $T \to \infty$ — all exponents shrink to 0; distribution approaches uniform.
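A sketch of the temperature-generalised version, building on the earlier NumPy helper (example values are arbitrary):

```python
def softmax_T(z, T=1.0):
    # Divide by the temperature before the usual exponentiate-and-normalise.
    return softmax(z / T)

z = np.array([2.0, 1.0, 0.1])
print(softmax_T(z, T=1.0))    # standard softmax: ≈ [0.66, 0.24, 0.10]
print(softmax_T(z, T=0.1))    # near one-hot on the largest entry
print(softmax_T(z, T=100.0))  # close to uniform: ≈ [0.34, 0.33, 0.33]
```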
Temperature has two main uses in LLMs:
- Sampling. Chatbots sample from the next-token distribution with some $T$. Low $T$ → more conservative and repetitive; high $T$ → more diverse and riskier (see the sampling sketch after this list).
- Attention scaling. The division by $\sqrt{d_k}$ inside the attention softmax is a temperature-like scaling chosen for numerical stability, not diversity control — without it, the softmax would saturate as $d_k$ grows.
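A sketch of temperature sampling from a next-token distribution, reusing `softmax_T` from above (the logits are hypothetical):

```python
rng = np.random.default_rng(0)

logits = np.array([3.2, 2.9, 0.5, -1.0])   # hypothetical logits over a 4-token vocabulary
probs = softmax_T(logits, T=0.7)           # T < 1 sharpens; T > 1 flattens
next_token = rng.choice(len(probs), p=probs)
```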
The name comes from an analogy to thermodynamic temperature — higher $T$ makes the distribution more “noisy” in a loose Boltzmann-distribution sense.
A common gotcha
Logits (the inputs to softmax) are not calibrated probabilities on their own — comparing raw logit values across different contexts is meaningless because the softmax’s shift-invariance means the model is free to add a constant to everything. Only differences between logits carry probabilistic meaning.
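A quick check of that shift-invariance with the earlier `softmax` helper (numbers are arbitrary):

```python
z = np.array([2.0, 1.0, 0.1])
print(softmax(z))          # ≈ [0.659, 0.242, 0.099]
print(softmax(z + 100.0))  # identical: absolute logit values carry no probabilistic meaning
```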
See also
- logits — the raw values softmax consumes
- unembedding — the layer that produces the logits softmax is applied to
- attention-mechanism — where softmax normalises attention patterns
- activation-function — general family