Summary: The nonlinear function applied after each layer’s weighted sum — without it, a neural-network collapses to a single linear transformation no matter how many layers it has.

What an activation function does

Each neuron in a neural-network holds a single number — its activation — which is the output of its activation function. The neuron first computes a pre-activation (a weighted sum of incoming activations plus a bias), then applies the activation function: a = f(z), where z = w·x + b. This output is the value that flows forward along outgoing edges to the next layer.

The choice of f determines the neuron’s output range and how the gradient flows during backpropagation.
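
A minimal sketch of that forward step in numpy; the names x, w, b and f are illustrative, not from the notes:

```python
import numpy as np

def neuron_forward(x, w, b, f):
    """One neuron: pre-activation z = w.x + b, then activation a = f(z)."""
    z = np.dot(w, x) + b   # weighted sum of incoming activations plus a bias
    return f(z)            # the activation that flows on to the next layer

# Illustrative values: three incoming activations feeding one neuron.
x = np.array([0.2, -1.0, 0.5])
w = np.array([0.7, 0.3, -0.4])
b = 0.1

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(neuron_forward(x, w, b, sigmoid))   # a single number in (0, 1)
```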

Why nonlinearity is essential

Each layer computes a = f(Wx + b). If you apply no function (or a linear one), stacking layers just gives W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2) — a single affine map. You gain no expressiveness from depth. A nonlinear activation is what makes the network able to learn complex, non-linear decision boundaries.
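
A quick numerical check of that collapse, using small random matrices as a stand-in for two layers:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked "layers" with no activation function...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...collapse to a single affine map with W' = W2 W1 and b' = W2 b1 + b2.
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))   # True: depth bought nothing
```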

But why do activation functions work?

See notebook 04_from_bigrams_to_nns, where I figured this out, and softmax for a specific instance.

Why do random, arbitrary initial parameter values "behave" usefully, even meaningfully, when fed through a non-linearity?

Every activation function imposes a contract on its input, declaring what those raw numbers are “supposed to represent”:

  • Sigmoid: 1 / (1 + exp(-x)) — squashes input to (0, 1), declaring the raw pre-activation as a log-odds ratio.
    • A raw pre-activation of 0 → 0.5 probability, large positive → near 1, large negative → near 0.
    • Used in binary classification.
  • Tanh — squashes to (-1, 1), declaring outputs as signed activations. Similar contract to sigmoid but centred at 0, which is better behaved for gradient flow.
  • ReLU: max(0, x) — declares inputs as “signals where only positive values matter.” Negative values are treated as no activation. The network learns weights that produce meaningful positive signals and uses negatives as an off-switch.
  • GELU / SiLU — smoother versions of ReLU used in modern transformers. Same gating idea, but smooth and differentiable everywhere (a rough code sketch of all of these follows this list).
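
A rough sketch of those contracts in code; the tanh approximation is assumed for GELU, and exact constants differ between libraries:

```python
import numpy as np

def sigmoid(z):
    # Reads z as log-odds, returns a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Zero-centred squash to (-1, 1).
    return np.tanh(z)

def relu(z):
    # Negative pre-activations count as "no activation".
    return np.maximum(0.0, z)

def gelu(z):
    # Smooth gate, tanh approximation.
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def silu(z):
    # z * sigmoid(z): another smooth gate.
    return z * sigmoid(z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
for f in (sigmoid, tanh, relu, gelu, silu):
    print(f.__name__.ljust(7), np.round(f(z), 3))
```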

The broader point: Architecture is a set of contracts

  • Every choice (activation function, normalisation, residual connections) declares something about what the raw values (activations, pre-activations) flowing through the network are supposed to represent: what they ought to do.
  • gradient-descent simply finds weights that honour those contracts well enough to minimise loss.
    • It does this by locally differentiating each operation during backpropagation (via the chain rule); a minimal worked sketch follows the list.
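
A minimal worked example of that local differentiation, assuming a single sigmoid neuron and a squared-error loss with made-up values:

```python
import numpy as np

# Forward pass: z = w*x + b, a = sigmoid(z), loss = (a - y)^2.
x, y = 1.5, 1.0
w, b = 0.3, -0.2
z = w * x + b
a = 1.0 / (1.0 + np.exp(-z))
loss = (a - y) ** 2

# Backward pass: one local derivative per operation, chained together.
dloss_da = 2.0 * (a - y)     # derivative of (a - y)^2 w.r.t. a
da_dz = a * (1.0 - a)        # sigmoid'(z), written in terms of its own output
dz_dw = x                    # derivative of w*x + b w.r.t. w
dloss_dw = dloss_da * da_dz * dz_dw

# One gradient-descent step nudges w toward honouring the contract better.
learning_rate = 0.1
w -= learning_rate * dloss_dw
print(round(loss, 4), round(dloss_dw, 4), round(w, 4))
```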

Sigmoid

  • Maps any real number to (0, 1). Concretely: sigmoid(0) = 0.5, sigmoid(4) ≈ 0.98, sigmoid(−4) ≈ 0.02.
  • Intuition: “how positive is the weighted sum?” — very negative → ~0, very positive → ~1.
  • Problem: At extremes, the curve is nearly flat (derivative ≈ 0). During backpropagation, the chain rule multiplies through sigmoid's local derivative at each layer. When this factor is near zero, the gradient signal effectively vanishes and weights barely update. This vanishing gradient problem makes training slow, especially in deep networks (see the sketch below).
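
A small numerical sketch of the vanishing effect, assuming a chain of layers whose pre-activations sit in the saturated region:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # peaks at 0.25 when z = 0, nearly 0 at the extremes

print(np.round(sigmoid_grad(np.array([0.0, 2.0, 5.0, 10.0])), 6))
# -> [0.25, 0.104994, 0.006648, 0.000045]

# Chain rule through 10 saturated layers: ten small factors multiplied together.
per_layer = sigmoid_grad(5.0)
print(per_layer ** 10)   # ~1.7e-22: the gradient signal has effectively vanished
```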

Sigmoid plot

  • S-curve from large negative x to large positive x, output axis from 0 to 1.
  • Mark the flat saturation regions and the steep transition around x = 0.

ReLU (Rectified Linear Unit)

  • Outputs 0 for negative inputs, passes positive inputs unchanged.
  • Key advantage: The gradient is exactly 1 for all x > 0 — no saturation, so the gradient signal passes through undiminished regardless of the magnitude of x. This makes training significantly faster.
  • The output is unbounded above, which breaks the biological “neuron is either on or off” analogy — but that analogy was never necessary for the math to work. What matters at the output layer is which neurons are more active than others, not the absolute scale.

ReLU plot

  • Piecewise linear: flat at 0 for x < 0, identity line for x > 0.
  • Note the gradient = 0 and gradient = 1 regions (compared against sigmoid in the sketch below).
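
For contrast, a rough sketch of ReLU's two gradient regions under the same ten-layer chain-rule product as the sigmoid sketch above:

```python
import numpy as np

def relu_grad(z):
    # Exactly 1 for positive pre-activations, exactly 0 for negative ones.
    return (z > 0).astype(float)

z = np.array([-30.0, -0.1, 0.1, 3.0, 30.0])
print(relu_grad(z))    # [0. 0. 1. 1. 1.]: no saturation however large z gets

# Through 10 active (z > 0) layers the chain-rule factor stays exactly 1,
# so the gradient passes through undiminished (contrast ~1.7e-22 for sigmoid).
print(1.0 ** 10)       # 1.0

# But a neuron whose pre-activation is negative for every input contributes
# zero gradient everywhere: the "dead neuron" failure mode noted in the table below.
```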

GELU plot

When to use which

| Function | Pros | Cons | Typical use |
| --- | --- | --- | --- |
| Sigmoid | Output in (0, 1); smooth | Vanishing gradients; slow training | Historically common; now mainly for output layers needing probabilities |
| ReLU | Fast training; no saturation for x > 0 | "Dead neurons" if z < 0 always | Default hidden-layer activation in modern networks |

Variants like Leaky ReLU, GELU, and Swish address ReLU’s dead-neuron problem, but the core principle is the same: inject nonlinearity without killing gradients.
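
A minimal sketch of the Leaky ReLU fix, with an illustrative negative-side slope of 0.01:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # A small slope on the negative side keeps a nonzero gradient, so a "dead"
    # neuron can still receive updates and recover.
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-5.0, -1.0, 0.5, 2.0])
print(leaky_relu(z))        # [-0.05 -0.01  0.5   2.  ]
print(leaky_relu_grad(z))   # [0.01  0.01  1.    1.  ]: never exactly zero
```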