Summary: A function that turns an arbitrary vector of real numbers into a probability distribution — all entries in $(0, 1)$ and summing to 1 — by exponentiating and normalising. It’s the standard way deep learning models produce categorical outputs.

Definition

Given a vector $x \in \mathbb{R}^n$, softmax produces:

$$\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

  • Numerator $e^{x_i}$ is always positive, so every output is $> 0$.
  • Denominator is the sum of the numerators, so outputs sum to 1.
  • Monotonic — larger input → larger output.
  • Invariant to adding a constant: $\operatorname{softmax}(x + c) = \operatorname{softmax}(x)$ for any scalar $c$. In practice implementations subtract $\max(x)$ from every entry before exponentiating to avoid numerical overflow (see the sketch below).
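A minimal sketch of the stable formulation in PyTorch (the function name and example vector are mine, for illustration):

```python
import torch

def softmax(x: torch.Tensor) -> torch.Tensor:
    # Subtracting the max changes nothing mathematically (shift-invariance)
    # but keeps exp() from overflowing on large inputs.
    x = x - x.max()
    counts = x.exp()
    return counts / counts.sum()

x = torch.tensor([2.0, 1.0, 0.1])
print(softmax(x))          # tensor([0.6590, 0.2424, 0.0986])
print(softmax(x + 100.0))  # identical: adding a constant doesn't change the output
```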

Example of a vector, $x$, run through softmax:

$$p = \operatorname{softmax}(x)$$

where

  • $x$ is the vector of logits, of length $|V|$ = number of vocabulary tokens (see image in unembedding note)
  • $p$ is the vector of probabilities (same length as $x$, and elements sum to $1$)
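For instance, with PyTorch’s built-in (the logit values here are made up):

```python
import torch

x = torch.tensor([3.2, -1.0, 0.5])  # logits, one per vocabulary token
p = torch.softmax(x, dim=0)         # probabilities

print(p)        # tensor([0.9240, 0.0139, 0.0621]) (approximately)
print(p.sum())  # tensor(1.)
```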

But why does softmax work?

See notebook 04_from_bigrams_to_nns where I figured this out, and activation-function for the generalisation

But why do elements in xenc @ W "behave like" log-counts?

Softmax assumes its inputs are log-counts (that’s why it exponentiates them).

  • I.e. by feeding xenc @ W into softmax, we are declaring those values to be log-counts by convention.
  • The network then learns W such that xenc @ W actually produces useful log-counts — ones that after softmax give good probability distributions.
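In code, that convention is just two lines after the matrix multiply (a sketch with makemore-style names; shapes are illustrative):

```python
import torch

vocab_size = 27                            # e.g. 26 letters + '.' in the bigram setup
xenc = torch.eye(vocab_size)[:5]           # 5 one-hot example inputs
W = torch.randn(vocab_size, vocab_size)    # randomly initialised weights

logits = xenc @ W                          # declared, by convention, to be log-counts
counts = logits.exp()                      # exp() turns log-counts into positive "counts"
probs = counts / counts.sum(1, keepdim=True)  # normalise each row into a distribution
```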

Here’s the flow

  1. Randomly initialised W → xenc @ W is garbage numbers
  2. Softmax is applied anyway, treating those garbage numbers as log-counts
  3. We measure how bad the resulting probabilities are (loss)
  4. Gradient descent nudges W so that xenc @ W produces log-counts that give better probabilities
    • Works because every step is differentiable (xenc @ W, .exp(), normalisation)
  5. Repeat until xenc @ W genuinely behaves like log-counts
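Condensed into code, the loop looks something like this (a sketch; the data here are random stand-ins and the learning rate is arbitrary):

```python
import torch
import torch.nn.functional as F

vocab_size = 27
xs = torch.randint(0, vocab_size, (32,))   # input token indices (stand-ins)
ys = torch.randint(0, vocab_size, (32,))   # target token indices (stand-ins)

W = torch.randn(vocab_size, vocab_size, requires_grad=True)  # step 1: garbage

for _ in range(100):
    xenc = F.one_hot(xs, num_classes=vocab_size).float()
    logits = xenc @ W                              # treated as log-counts anyway
    counts = logits.exp()                          # step 2: softmax regardless
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(len(ys)), ys].log().mean()  # step 3: how bad is it?
    W.grad = None
    loss.backward()                                # step 4: every step differentiable
    W.data += -0.5 * W.grad                        # nudge W, then repeat (step 5)
```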

The label “log-counts” is the contract softmax imposes on its input — not a property the matrix multiply xenc @ W produces on its own.

Why “soft” max?

The hard version — pick the argmax — is discrete and non-differentiable. Softmax is a smooth approximation that still concentrates mass on the largest entry. If one input dominates, its output is close to 1 and the others are close to 0; if the inputs are similar, the distribution is close to uniform.
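A quick illustration of both regimes (inputs picked arbitrarily):

```python
import torch

dominant = torch.tensor([10.0, 1.0, 0.0])
similar  = torch.tensor([1.0, 1.1, 0.9])

print(torch.softmax(dominant, dim=0))  # ~[0.9998, 0.0001, 0.0000]: near one-hot
print(torch.softmax(similar, dim=0))   # ~[0.33, 0.37, 0.30]: near uniform
```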

Unlike argmax, softmax is differentiable end-to-end, which is what makes it usable anywhere you want a probability distribution inside a neural network.

Where it shows up in transformers

Softmax is the standard nonlinearity wherever a neural network needs to output or use a probability distribution:

  1. Attention patterns — applied column-by-column to the raw scores in attention, turning alignment scores into mixing weights over source tokens.
  2. Next-token prediction — applied to the logits produced by the unembedding step, turning them into a probability distribution over the vocabulary.
  3. Classification heads generally — last-layer outputs of any categorical classifier.
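All three reduce to the same call; a sketch (note the dim argument here follows the common row-per-query layout, not the column convention above):

```python
import torch
import torch.nn.functional as F

# 1. Attention: raw scores -> mixing weights over source tokens.
scores = torch.randn(8, 8)           # hypothetical 8-token sequence
attn = F.softmax(scores, dim=-1)     # each query's weights now sum to 1

# 2./3. Next-token prediction / classification: logits -> class distribution.
logits = torch.randn(27)             # one logit per vocabulary token / class
probs = F.softmax(logits, dim=-1)
```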

Temperature

You can generalise softmax with a positive scalar $T$ called temperature:

$$\operatorname{softmax}_T(x)_i = \frac{e^{x_i/T}}{\sum_j e^{x_j/T}}$$

  • $T = 1$ — standard softmax.
  • $T \to 0$ — the largest input dominates completely; distribution collapses to argmax (one-hot on the max).
  • $T \to \infty$ — all exponents shrink to 0; distribution approaches uniform.
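A quick check of the three regimes (temperatures chosen arbitrarily):

```python
import torch

x = torch.tensor([2.0, 1.0, 0.1])

for T in [0.1, 1.0, 100.0]:
    print(T, torch.softmax(x / T, dim=0))

# T=0.1   -> ~[1.00, 0.00, 0.00]  (collapses toward argmax)
# T=1.0   -> ~[0.66, 0.24, 0.10]  (standard softmax)
# T=100.0 -> ~[0.34, 0.33, 0.33]  (approaches uniform)
```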

Temperature has two main uses in LLMs:

  1. Sampling. Chatbots sample from the next-token distribution with some $T$. Low $T$ → more conservative and repetitive; high $T$ → more diverse and riskier (see the sketch after this list).
  2. Attention scaling. The division by $\sqrt{d_k}$ inside the attention softmax is a temperature-like scaling chosen for numerical stability, not diversity control — without it, the softmax would saturate as $d_k$ grows.
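The sampling use looks like this in code (a sketch; the logits are random placeholders and the temperature value is arbitrary):

```python
import torch

logits = torch.randn(27)       # hypothetical next-token logits
T = 0.8                        # the sampling knob

probs = torch.softmax(logits / T, dim=0)
next_token = torch.multinomial(probs, num_samples=1)  # sample, don't argmax
```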

The name comes from an analogy to thermodynamic temperature — higher $T$ makes the distribution more “noisy” in a loose Boltzmann-distribution sense.

A common gotcha

Logits (the inputs to softmax) are not calibrated probabilities on their own — comparing raw logit values across different contexts is meaningless because the softmax’s shift-invariance means the model is free to add a constant to everything. Only differences between logits carry probabilistic meaning.

See also