Summary: Formalises the backpropagation intuition from Ch4 using the chain rule, first on a toy one-neuron-per-layer network, then generalising.
Setup (one neuron per layer)
Define intermediate variables for the last layer (superscript $(L)$ marks the last layer):

$$z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}, \qquad a^{(L)} = \sigma\big(z^{(L)}\big), \qquad C_0 = \big(a^{(L)} - y\big)^2$$
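A minimal numeric sketch of these definitions, using a sigmoid activation and made-up placeholder values for $w^{(L)}$, $b^{(L)}$, $a^{(L-1)}$, and $y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron per layer; all values are made-up placeholders.
a_prev = 0.7          # a^(L-1): activation of the single neuron in the previous layer
w, b = 1.5, -0.3      # w^(L), b^(L)
y = 1.0               # desired output for this training example

z = w * a_prev + b    # z^(L) = w^(L) a^(L-1) + b^(L)
a = sigmoid(z)        # a^(L) = sigma(z^(L))
C0 = (a - y) ** 2     # C_0  = (a^(L) - y)^2
print(f"z={z:.4f}  a={a:.4f}  C0={C0:.4f}")
```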
Chain rule decomposition
$$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \, \frac{\partial a^{(L)}}{\partial z^{(L)}} \, \frac{\partial C_0}{\partial a^{(L)}} = a^{(L-1)} \cdot \sigma'\big(z^{(L)}\big) \cdot 2\big(a^{(L)} - y\big)$$

Each factor has a clear meaning:
- $\partial z^{(L)} / \partial w^{(L)} = a^{(L-1)}$: the weight matters more when the input neuron is active (Hebbian echo).
- $\partial a^{(L)} / \partial z^{(L)} = \sigma'\big(z^{(L)}\big)$: the activation function’s slope gates how much the signal passes through (sigmoid saturation kills gradients here).
- $\partial C_0 / \partial a^{(L)} = 2\big(a^{(L)} - y\big)$: the further off the prediction, the stronger the gradient signal.
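A sketch that evaluates the three factors at made-up placeholder values and checks their product against a finite-difference estimate of $\partial C_0 / \partial w^{(L)}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Made-up placeholder values for the toy one-neuron-per-layer network.
a_prev, w, b, y = 0.7, 1.5, -0.3, 1.0

def cost(w_val):
    a = sigmoid(w_val * a_prev + b)
    return (a - y) ** 2

z = w * a_prev + b
a = sigmoid(z)

dz_dw = a_prev               # dz/dw  = a^(L-1)        (input activation)
da_dz = sigmoid_prime(z)     # da/dz  = sigma'(z^(L))  (slope of the activation)
dC_da = 2 * (a - y)          # dC0/da = 2(a^(L) - y)   (how far off the prediction is)

analytic = dz_dw * da_dz * dC_da
eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)   # central difference
print(analytic, numeric)     # the two values should agree to several decimal places
```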
Bias derivative
Same chain but $\partial z^{(L)} / \partial b^{(L)} = 1$, so it’s simply $\frac{\partial C_0}{\partial b^{(L)}} = \sigma'\big(z^{(L)}\big) \cdot 2\big(a^{(L)} - y\big)$.
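The same kind of check for the bias, again with made-up values; only the first factor changes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a_prev, w, b, y = 0.7, 1.5, -0.3, 1.0   # same made-up placeholders as above

def cost(b_val):
    a = sigmoid(w * a_prev + b_val)
    return (a - y) ** 2

z = w * a_prev + b
a = sigmoid(z)

dC_db = 1.0 * a * (1 - a) * 2 * (a - y)   # dz/db = 1, then the same two factors
eps = 1e-6
print(dC_db, (cost(b + eps) - cost(b - eps)) / (2 * eps))
```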
Propagating to earlier layers
The derivative w.r.t. $a^{(L-1)}$ replaces the first factor with $w^{(L)}$, giving $\frac{\partial C_0}{\partial a^{(L-1)}} = w^{(L)} \, \sigma'\big(z^{(L)}\big) \cdot 2\big(a^{(L)} - y\big)$. This gives the sensitivity of the cost to the previous activation, which you then use to repeat the process one layer back — the recursive heart of backprop.
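A sketch of that recursion on a chain of one-neuron layers with made-up weights and biases: the backward loop emits each layer’s weight and bias gradients, then hands $\partial C_0 / \partial a$ to the layer before it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# A chain of one-neuron layers with made-up weights and biases.
ws = [0.8, -1.2, 1.5]      # w^(1), w^(2), w^(3)
bs = [0.1, 0.4, -0.3]      # b^(1), b^(2), b^(3)
a0, y = 0.5, 1.0           # network input and desired output

# Forward pass: remember every z and a; the backward pass reuses them.
activations, zs = [a0], []
for w, b in zip(ws, bs):
    z = w * activations[-1] + b
    zs.append(z)
    activations.append(sigmoid(z))

# Backward pass: start from dC0/da^(L) and recurse toward the input.
dC_da = 2 * (activations[-1] - y)
grads_w, grads_b = [], []
for l in reversed(range(len(ws))):
    da_dz = sigmoid_prime(zs[l])
    grads_w.append(activations[l] * da_dz * dC_da)   # dC0/dw^(l)
    grads_b.append(1.0 * da_dz * dC_da)              # dC0/db^(l)
    dC_da = ws[l] * da_dz * dC_da                    # dC0/da^(l-1), handed one layer back

grads_w.reverse()
grads_b.reverse()
print(grads_w)
print(grads_b)
```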
Multi-neuron generalisation
The video argues that “not much changes” — and structurally it doesn’t. The derivation builds up in three steps:
Step 1 — New notation
With multiple neurons per layer, every variable picks up a subscript. Use $j$ for neurons in layer $L$, $k$ for neurons in layer $L-1$. The weighted sum, activation, and cost become:

$$z_j^{(L)} = \sum_k w_{jk}^{(L)} a_k^{(L-1)} + b_j^{(L)}, \qquad a_j^{(L)} = \sigma\big(z_j^{(L)}\big), \qquad C_0 = \sum_j \big(a_j^{(L)} - y_j\big)^2$$
The weight subscript $jk$ means “going to neuron $j$, coming from neuron $k$” — row $j$, column $k$ of the weight matrix.
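A vectorised sketch of this notation with made-up sizes (three neurons in layer $L-1$, two in layer $L$); `W[j, k]` holds the weight going to neuron $j$ from neuron $k$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

n_prev, n_last = 3, 2                        # made-up layer sizes
a_prev = rng.random(n_prev)                  # a_k^(L-1)
W = rng.standard_normal((n_last, n_prev))    # W[j, k]: to neuron j, from neuron k
b = rng.standard_normal(n_last)              # b_j^(L)
y = np.array([1.0, 0.0])                     # desired outputs y_j

z = W @ a_prev + b            # z_j^(L) = sum_k W[j, k] a_k^(L-1) + b_j^(L)
a = sigmoid(z)                # a_j^(L) = sigma(z_j^(L))
C0 = np.sum((a - y) ** 2)     # C_0 = sum_j (a_j^(L) - y_j)^2
print(z, a, C0)
```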
Step 2 — Weight derivative (same structure)
The chain rule for a specific weight $w_{jk}^{(L)}$ is the same three-factor product as before, just with indices:

$$\frac{\partial C_0}{\partial w_{jk}^{(L)}} = \frac{\partial z_j^{(L)}}{\partial w_{jk}^{(L)}} \, \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \, \frac{\partial C_0}{\partial a_j^{(L)}} = a_k^{(L-1)} \, \sigma'\big(z_j^{(L)}\big) \cdot 2\big(a_j^{(L)} - y_j\big)$$
This works unchanged because $w_{jk}^{(L)}$ only affects neuron $j$ — it touches exactly one path through the network.
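A sketch of the indexed weight derivative on the same made-up shapes; the element-wise formula also collapses into a single outer product:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a_prev = rng.random(3)              # a_k^(L-1)
W = rng.standard_normal((2, 3))     # W[j, k]
b = rng.standard_normal(2)
y = np.array([1.0, 0.0])

z = W @ a_prev + b
a = sigmoid(z)
sp = a * (1 - a)                    # sigma'(z_j^(L)), written via a_j = sigmoid(z_j)

# dC0/dW[j, k] = a_k^(L-1) * sigma'(z_j) * 2(a_j - y_j): one path per weight.
dC_dW = np.empty_like(W)
for j in range(W.shape[0]):
    for k in range(W.shape[1]):
        dC_dW[j, k] = a_prev[k] * sp[j] * 2 * (a[j] - y[j])

# The same matrix as an outer product of the per-neuron error with the input activations.
delta = sp * 2 * (a - y)
print(np.allclose(dC_dW, np.outer(delta, a_prev)))   # True
```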
Step 3 — Activation derivative (the new idea)
In the one-neuron-per-layer case, $a^{(L-1)}$ influenced the cost through a single path. Now, $a_k^{(L-1)}$ feeds into every neuron $j$ in the next layer, so it influences $C_0$ along multiple paths. The total sensitivity is the sum of each path’s chain-rule contribution:

$$\frac{\partial C_0}{\partial a_k^{(L-1)}} = \sum_j \frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}} \, \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \, \frac{\partial C_0}{\partial a_j^{(L)}} = \sum_j w_{jk}^{(L)} \, \sigma'\big(z_j^{(L)}\big) \cdot 2\big(a_j^{(L)} - y_j\big)$$
This is the only genuinely new equation in the multi-neuron case. Once computed, it feeds backward into the next layer’s weight/bias derivatives — the same recursive step as before.
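A sketch of the sum-over-paths formula, with a finite-difference check on one previous-layer activation (same made-up shapes as before):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a_prev = rng.random(3)
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)
y = np.array([1.0, 0.0])

def cost(a_prev_val):
    a = sigmoid(W @ a_prev_val + b)
    return np.sum((a - y) ** 2)

z = W @ a_prev + b
a = sigmoid(z)
delta = a * (1 - a) * 2 * (a - y)   # sigma'(z_j) * 2(a_j - y_j), one entry per output neuron

# dC0/da_k^(L-1) = sum_j W[j, k] * delta_j -- every path through layer L contributes.
dC_da_prev = W.T @ delta

# Finite-difference check on the first previous-layer activation.
eps = 1e-6
bumped = a_prev.copy()
bumped[0] += eps
print(dC_da_prev[0], (cost(bumped) - cost(a_prev)) / eps)
```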
Full cost
Average over all $n$ training examples: $C = \frac{1}{n} \sum_i C_i$.
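A small sketch of the averaging step for a made-up four-example training set; by linearity, the gradient of $C$ averages the same way as the cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy one-neuron-per-layer model with a made-up training set of n = 4 examples.
w, b = 1.5, -0.3
xs = np.array([0.2, 0.7, 0.9, 0.4])   # inputs a^(L-1), one per example
ys = np.array([0.0, 1.0, 1.0, 0.0])   # desired outputs

z = w * xs + b
a = sigmoid(z)
per_example_cost = (a - ys) ** 2
per_example_dw = xs * a * (1 - a) * 2 * (a - ys)   # dC_i/dw for each example

C = per_example_cost.mean()       # C = (1/n) sum_i C_i
dC_dw = per_example_dw.mean()     # the full-cost gradient is the mean of per-example gradients
print(C, dC_dw)
```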