Summary: Backpropagation is the algorithm that efficiently computes the gradient, $\nabla C$, of the cost function with respect to every weight and bias in a neural network, enabling gradient descent.

Notes on indexes:

  • Superscripts on each neuron indicate the layer it’s in: $a^{(L)}_j$ is the activation of neuron $j$ in layer $L$. The subscript indicates which neuron it is.
  • Superscripts on each parameter (weight or bias) indicate the layer it is feeding INTO. Subscripts indicate which neuron it came from and which it is going to (indexing is backwards, so it’s going-to coming-from: $w^{(L)}_{jk}$ connects neuron $k$ in layer $L-1$ to neuron $j$ in layer $L$).

Intuition

For a single training example, consider the output neuron that should be active. Three levers can increase its activation:

  1. Increase its bias — constant, predictable shift to the weighted sum.
  2. Increase weights connected to high-activation input neurons — because the weight’s gradient is proportional to $a^{(L-1)}$, nudging a weight has more effect when the input neuron’s activation is large. This is the Hebbian echo: “neurons that fire together wire together.”
  3. Change previous-layer activations — you can’t adjust them directly (they’re determined by earlier weights/biases), but you record the desired changes and propagate them backward to the parameters that do control those activations.

Repeat this layer by layer, from output back to input — that’s backpropagation.
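Lever 2 can be checked numerically. A minimal sketch (the weights, activations, and target below are assumed toy values, not from the source): the same small nudge applied to the weight on a highly active input moves the cost far more than on a near-silent input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w1, w2):
    # One output neuron fed by two inputs with very different activations.
    a1, a2 = 0.95, 0.05   # assumed toy activations: one "hot" input, one quiet
    y = 1.0               # this output neuron should be active
    a = sigmoid(w1 * a1 + w2 * a2)
    return (a - y) ** 2

base = cost(0.1, 0.1)
eps = 1e-3
gain_hot = base - cost(0.1 + eps, 0.1)   # nudge the weight on the active input
gain_cold = base - cost(0.1, 0.1 + eps)  # same nudge on the quiet input

# "Fire together, wire together": the active input's weight moves the cost
# roughly a1/a2 = 19x more than the quiet input's weight.
assert gain_hot > gain_cold > 0
```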

Backpropagation computation tree for a single training example

  • Tree diagram showing $w^{(L)}, a^{(L-1)}, b^{(L)} \to z^{(L)} \to a^{(L)} \to C_0$, with arrows labelled by the partial derivatives along each edge (assumes a simple network with only 1 neuron per layer).
  • The cost of this one training example is $C_0 = (a^{(L)} - y)^2$
    • where $a^{(L)}$ is the actual output of the last layer, $L$, of the network.
    • and $y$ is the desired output from that layer

  • As a reminder, this last activation is determined by a weight, a bias, and the previous neuron’s activation, all pumped through some special nonlinear function like a sigmoid or a ReLU.

  • For convenience, the weighted sum is called a weighted input, $z^{(L)}$, with the same superscript as the activation: $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$

  • The weight, $w^{(L)}$, the previous activation, $a^{(L-1)}$, and the bias, $b^{(L)}$, together let us compute $z^{(L)}$
    • This lets us compute $a^{(L)} = \sigma(z^{(L)})$
      • This, along with the constant $y$, lets us compute the cost, $C_0$.
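The chain above can be sketched in a few lines of Python (the values of the weight, bias, previous activation, and target are assumed toy numbers):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy values (assumptions, not from the source) for the last layer
# of a 1-neuron-per-layer network.
w, b = 0.5, -0.3   # weight w^(L) and bias b^(L)
a_prev = 0.8       # previous activation a^(L-1)
y = 1.0            # desired output

z = w * a_prev + b      # weighted input z^(L)
a = sigmoid(z)          # activation a^(L) = sigma(z^(L))
cost = (a - y) ** 2     # cost C_0 for this one training example
```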

The calculus

Single path (one neuron per layer)

Per the example above, define: $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$, $a^{(L)} = \sigma(z^{(L)})$, $C_0 = (a^{(L)} - y)^2$.

By the chain rule (see original source for detailed derivation), the sensitivity of $C_0$ to small changes in the weight $w^{(L)}$ is:

$$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_0}{\partial a^{(L)}} = a^{(L-1)} \, \sigma'(z^{(L)}) \, 2(a^{(L)} - y)$$

Each factor has a meaning:

  • $a^{(L-1)}$ — Weight matters more when the input neuron is active
  • $\sigma'(z^{(L)})$ — Sigmoid saturation kills gradients; ReLU avoids this for positive $z^{(L)}$
  • $2(a^{(L)} - y)$ — The further off the prediction, the stronger the signal

For the bias (sensitivity of $C_0$ to $b^{(L)}$): same chain, but $\partial z^{(L)} / \partial b^{(L)} = 1$, so one fewer factor: $\frac{\partial C_0}{\partial b^{(L)}} = \sigma'(z^{(L)}) \, 2(a^{(L)} - y)$.
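A sketch of both formulas, verified against a centred finite difference (all numbers are assumed toy values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w, b, a_prev=0.8, y=1.0):
    return (sigmoid(w * a_prev + b) - y) ** 2

w, b, a_prev, y = 0.5, -0.3, 0.8, 1.0   # assumed toy values
z = w * a_prev + b
a = sigmoid(z)

# Three factors for the weight; dz/db = 1 drops the first factor for the bias.
dC_dw = a_prev * dsigmoid(z) * 2 * (a - y)
dC_db = dsigmoid(z) * 2 * (a - y)

# Sanity check against a centred finite difference.
eps = 1e-6
num_dw = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)
num_db = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)
assert abs(dC_dw - num_dw) < 1e-8
assert abs(dC_db - num_db) < 1e-8
```

Both gradients are negative here, as expected: the output is below the target $y = 1$, so increasing either parameter would lower the cost.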

Propagating backward

The derivative w.r.t. $a^{(L-1)}$ swaps the first factor $a^{(L-1)}$ for $w^{(L)}$ (see original source): $\frac{\partial C_0}{\partial a^{(L-1)}} = w^{(L)} \, \sigma'(z^{(L)}) \, 2(a^{(L)} - y)$. This gives the cost’s sensitivity to the previous activation, which is then used to compute derivatives for the weights/biases in that earlier layer — the recursive step.
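A sketch of the swapped-factor formula, checked by perturbing the previous activation directly (toy values assumed):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

w, b, a_prev, y = 0.5, -0.3, 0.8, 1.0   # assumed toy values
z = w * a_prev + b
a = sigmoid(z)

# First factor is now w^(L) instead of a^(L-1).
dC_da_prev = w * dsigmoid(z) * 2 * (a - y)

# Numerical check: treat the previous activation as a free variable.
def cost(ap):
    return (sigmoid(w * ap + b) - y) ** 2

eps = 1e-6
num = (cost(a_prev + eps) - cost(a_prev - eps)) / (2 * eps)
assert abs(dC_da_prev - num) < 1e-8
```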

Multiple neurons per layer

The single-neuron derivation above generalises cleanly. What changes is the notation — there are more indices to track — and one genuinely new idea at the end.

Notation

Use $j$ to index neurons in layer $L$ and $k$ to index neurons in layer $L-1$. The weighted sum feeding neuron $j$ is now a sum over all source neurons:

$$z^{(L)}_j = \sum_k w^{(L)}_{jk} a^{(L-1)}_k + b^{(L)}_j$$

The cost for one training example sums over all output neurons:

$$C_0 = \sum_j \left(a^{(L)}_j - y_j\right)^2$$

Weight derivative (structurally identical)

The chain rule for a specific weight $w^{(L)}_{jk}$ follows the exact same three-factor pattern as the single-neuron case — because $w^{(L)}_{jk}$ only affects neuron $j$, not any other neuron in the layer:

$$\frac{\partial C_0}{\partial w^{(L)}_{jk}} = \frac{\partial z^{(L)}_j}{\partial w^{(L)}_{jk}} \frac{\partial a^{(L)}_j}{\partial z^{(L)}_j} \frac{\partial C_0}{\partial a^{(L)}_j} = a^{(L-1)}_k \, \sigma'(z^{(L)}_j) \, 2(a^{(L)}_j - y_j)$$

Same structure, just with subscripts tracking which weight we mean.
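A sketch with 2 destination neurons ($j$) fed by 3 source neurons ($k$); every number below is an assumed toy value:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# W[j][k] = w^(L)_{jk}: row j is the destination neuron, column k the source.
W = [[0.1, -0.2,  0.3],
     [0.4,  0.5, -0.6]]
b = [0.05, -0.05]
a_prev = [0.9, 0.1, 0.5]
y = [1.0, 0.0]

z = [sum(W[j][k] * a_prev[k] for k in range(3)) + b[j] for j in range(2)]
a = [sigmoid(zj) for zj in z]

# Same three factors, with subscripts: only neuron j's path involves w_{jk}.
dW = [[a_prev[k] * dsigmoid(z[j]) * 2 * (a[j] - y[j]) for k in range(3)]
      for j in range(2)]

# Numerical spot-check on w_{01}.
def cost(w01):
    z0 = W[0][0] * a_prev[0] + w01 * a_prev[1] + W[0][2] * a_prev[2] + b[0]
    c = (sigmoid(z0) - y[0]) ** 2
    c += (sigmoid(z[1]) - y[1]) ** 2   # neuron 1 is unaffected by w_{01}
    return c

eps = 1e-6
num = (cost(W[0][1] + eps) - cost(W[0][1] - eps)) / (2 * eps)
assert abs(dW[0][1] - num) < 1e-8
```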

Activation derivative (the new idea)

Here is where multi-neuron networks differ. In the single-neuron case, $a^{(L-1)}$ influenced $C_0$ through exactly one path ($a^{(L-1)} \to z^{(L)} \to a^{(L)} \to C_0$). Now, $a^{(L-1)}_k$ feeds into every neuron in layer $L$, creating multiple paths to the cost:

To capture the total sensitivity, sum the chain-rule contribution from each path:

$$\frac{\partial C_0}{\partial a^{(L-1)}_k} = \sum_j \frac{\partial z^{(L)}_j}{\partial a^{(L-1)}_k} \frac{\partial a^{(L)}_j}{\partial z^{(L)}_j} \frac{\partial C_0}{\partial a^{(L)}_j} = \sum_j w^{(L)}_{jk} \, \sigma'(z^{(L)}_j) \, 2(a^{(L)}_j - y_j)$$

This is the only genuinely new equation. Once you have $\partial C_0 / \partial a^{(L-1)}_k$ for every neuron $k$ in layer $L-1$, you repeat the same process to get derivatives for the weights and biases in layer $L-1$, and so on backward through the network.
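The sum over paths can be spot-checked numerically. A self-contained sketch (all layer values below are assumed toy numbers):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

W = [[0.1, -0.2,  0.3],
     [0.4,  0.5, -0.6]]     # W[j][k] = w^(L)_{jk}, assumed toy values
b = [0.05, -0.05]
a_prev = [0.9, 0.1, 0.5]
y = [1.0, 0.0]

def cost(ap):
    z = [sum(W[j][k] * ap[k] for k in range(3)) + b[j] for j in range(2)]
    return sum((sigmoid(z[j]) - y[j]) ** 2 for j in range(2))

z = [sum(W[j][k] * a_prev[k] for k in range(3)) + b[j] for j in range(2)]
a = [sigmoid(zj) for zj in z]

# a^(L-1)_k reaches the cost through every neuron j: sum the contributions.
dA_prev = [sum(W[j][k] * dsigmoid(z[j]) * 2 * (a[j] - y[j]) for j in range(2))
           for k in range(3)]

# Numerical check on k = 0: perturb that one activation, recompute the cost.
eps = 1e-6
bumped = a_prev[:]; bumped[0] += eps
dipped = a_prev[:]; dipped[0] -= eps
num = (cost(bumped) - cost(dipped)) / (2 * eps)
assert abs(dA_prev[0] - num) < 1e-8
```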

Full cost

Average over all training examples: $C = \frac{1}{n} \sum_{i=0}^{n-1} C_i$, so the full gradient is the average of the per-example gradients.
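A sketch of the averaging step for the single-path network, reusing the three-factor weight formula (the parameters and the tiny batch of $(a^{(L-1)}, y)$ pairs are assumed toy values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

w, b = 0.5, -0.3                              # assumed toy parameters
batch = [(0.8, 1.0), (0.2, 0.0), (0.6, 1.0)]  # assumed (a_prev, y) pairs

# Full-cost gradient = average of the per-example gradients.
grads = []
for a_prev, y in batch:
    z = w * a_prev + b
    a = sigmoid(z)
    grads.append(a_prev * dsigmoid(z) * 2 * (a - y))
dC_dw = sum(grads) / len(grads)

# One gradient-descent step on the weight.
lr = 0.1
w_new = w - lr * dC_dw
```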

Sources