Summary: Formalises the backpropagation intuition from Ch4 using the chain rule, first on a toy one-neuron-per-layer network, then generalising.
Setup (one neuron per layer)
Define intermediate variables for the last layer (superscript $(L)$ marks the last layer):

$$z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}, \qquad a^{(L)} = \sigma\big(z^{(L)}\big), \qquad C_0 = \big(a^{(L)} - y\big)^2$$
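A minimal numeric sketch of these definitions, using a sigmoid activation and made-up placeholder values for $w^{(L)}$, $b^{(L)}$, $a^{(L-1)}$, and $y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron per layer; all values are made-up placeholders.
a_prev = 0.7          # a^(L-1): activation of the single neuron in the previous layer
w, b = 1.5, -0.3      # w^(L), b^(L)
y = 1.0               # desired output for this training example

z = w * a_prev + b    # z^(L) = w^(L) a^(L-1) + b^(L)
a = sigmoid(z)        # a^(L) = sigma(z^(L))
C0 = (a - y) ** 2     # C_0  = (a^(L) - y)^2
print(f"z={z:.4f}  a={a:.4f}  C0={C0:.4f}")
```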
Chain rule decomposition
$$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \, \frac{\partial a^{(L)}}{\partial z^{(L)}} \, \frac{\partial C_0}{\partial a^{(L)}} = a^{(L-1)} \cdot \sigma'\big(z^{(L)}\big) \cdot 2\big(a^{(L)} - y\big)$$

Each factor has a clear meaning:
- $\partial z^{(L)} / \partial w^{(L)} = a^{(L-1)}$: the weight matters more when the input neuron is active (Hebbian echo).
- $\partial a^{(L)} / \partial z^{(L)} = \sigma'\big(z^{(L)}\big)$: the activation function’s slope gates how much the signal passes through (sigmoid saturation kills gradients here).
- $\partial C_0 / \partial a^{(L)} = 2\big(a^{(L)} - y\big)$: the further off the prediction, the stronger the gradient signal.
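A sketch that evaluates the three factors at made-up placeholder values and checks their product against a finite-difference estimate of $\partial C_0 / \partial w^{(L)}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Made-up placeholder values for the toy one-neuron-per-layer network.
a_prev, w, b, y = 0.7, 1.5, -0.3, 1.0

def cost(w_val):
    a = sigmoid(w_val * a_prev + b)
    return (a - y) ** 2

z = w * a_prev + b
a = sigmoid(z)

dz_dw = a_prev               # dz/dw  = a^(L-1)        (input activation)
da_dz = sigmoid_prime(z)     # da/dz  = sigma'(z^(L))  (slope of the activation)
dC_da = 2 * (a - y)          # dC0/da = 2(a^(L) - y)   (how far off the prediction is)

analytic = dz_dw * da_dz * dC_da
eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)   # central difference
print(analytic, numeric)     # the two values should agree to several decimal places
```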
Bias derivative
Same chain but $\partial z^{(L)} / \partial b^{(L)} = 1$, so it’s simply $\frac{\partial C_0}{\partial b^{(L)}} = \sigma'\big(z^{(L)}\big) \cdot 2\big(a^{(L)} - y\big)$.
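The same kind of check for the bias, again with made-up values; only the first factor changes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a_prev, w, b, y = 0.7, 1.5, -0.3, 1.0   # same made-up placeholders as above

def cost(b_val):
    a = sigmoid(w * a_prev + b_val)
    return (a - y) ** 2

z = w * a_prev + b
a = sigmoid(z)

dC_db = 1.0 * a * (1 - a) * 2 * (a - y)   # dz/db = 1, then the same two factors
eps = 1e-6
print(dC_db, (cost(b + eps) - cost(b - eps)) / (2 * eps))
```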
Propagating to earlier layers
The derivative w.r.t. $a^{(L-1)}$ replaces the first factor with $w^{(L)}$, giving $\frac{\partial C_0}{\partial a^{(L-1)}} = w^{(L)} \, \sigma'\big(z^{(L)}\big) \cdot 2\big(a^{(L)} - y\big)$. This gives the sensitivity of the cost to the previous activation, which you then use to repeat the process one layer back — the recursive heart of backprop.
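A sketch of that recursion on a chain of one-neuron layers with made-up weights and biases: the backward loop emits each layer’s weight and bias gradients, then hands $\partial C_0 / \partial a$ to the layer before it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# A chain of one-neuron layers with made-up weights and biases.
ws = [0.8, -1.2, 1.5]      # w^(1), w^(2), w^(3)
bs = [0.1, 0.4, -0.3]      # b^(1), b^(2), b^(3)
a0, y = 0.5, 1.0           # network input and desired output

# Forward pass: remember every z and a; the backward pass reuses them.
activations, zs = [a0], []
for w, b in zip(ws, bs):
    z = w * activations[-1] + b
    zs.append(z)
    activations.append(sigmoid(z))

# Backward pass: start from dC0/da^(L) and recurse toward the input.
dC_da = 2 * (activations[-1] - y)
grads_w, grads_b = [], []
for l in reversed(range(len(ws))):
    da_dz = sigmoid_prime(zs[l])
    grads_w.append(activations[l] * da_dz * dC_da)   # dC0/dw^(l)
    grads_b.append(1.0 * da_dz * dC_da)              # dC0/db^(l)
    dC_da = ws[l] * da_dz * dC_da                    # dC0/da^(l-1), handed one layer back

grads_w.reverse()
grads_b.reverse()
print(grads_w)
print(grads_b)
```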
Multi-neuron generalisation
The video argues that “not much changes” — and structurally it doesn’t. The derivation builds up in three steps:
Step 1 — New notation
With multiple neurons per layer, every variable picks up a subscript. Use $j$ for neurons in layer $L$, $k$ for neurons in layer $L-1$. The weighted sum, activation, and cost become:

$$z_j^{(L)} = \sum_k w_{jk}^{(L)} a_k^{(L-1)} + b_j^{(L)}, \qquad a_j^{(L)} = \sigma\big(z_j^{(L)}\big), \qquad C_0 = \sum_j \big(a_j^{(L)} - y_j\big)^2$$
The weight subscript $jk$ means “going to neuron $j$, coming from neuron $k$” — row $j$, column $k$ of the weight matrix.
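A vectorised sketch of this notation with made-up sizes (three neurons in layer $L-1$, two in layer $L$); `W[j, k]` holds the weight going to neuron $j$ from neuron $k$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

n_prev, n_last = 3, 2                        # made-up layer sizes
a_prev = rng.random(n_prev)                  # a_k^(L-1)
W = rng.standard_normal((n_last, n_prev))    # W[j, k]: to neuron j, from neuron k
b = rng.standard_normal(n_last)              # b_j^(L)
y = np.array([1.0, 0.0])                     # desired outputs y_j

z = W @ a_prev + b            # z_j^(L) = sum_k W[j, k] a_k^(L-1) + b_j^(L)
a = sigmoid(z)                # a_j^(L) = sigma(z_j^(L))
C0 = np.sum((a - y) ** 2)     # C_0 = sum_j (a_j^(L) - y_j)^2
print(z, a, C0)
```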
Step 2 — Weight derivative (same structure)
The chain rule for a specific weight $w_{jk}^{(L)}$ is the same three-factor product as before, just with indices:

$$\frac{\partial C_0}{\partial w_{jk}^{(L)}} = \frac{\partial z_j^{(L)}}{\partial w_{jk}^{(L)}} \, \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \, \frac{\partial C_0}{\partial a_j^{(L)}} = a_k^{(L-1)} \, \sigma'\big(z_j^{(L)}\big) \cdot 2\big(a_j^{(L)} - y_j\big)$$
This works unchanged because $w_{jk}^{(L)}$ only affects neuron $j$ — it touches exactly one path through the network.
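A sketch of the indexed weight derivative on the same made-up shapes; the element-wise formula also collapses into a single outer product:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a_prev = rng.random(3)              # a_k^(L-1)
W = rng.standard_normal((2, 3))     # W[j, k]
b = rng.standard_normal(2)
y = np.array([1.0, 0.0])

z = W @ a_prev + b
a = sigmoid(z)
sp = a * (1 - a)                    # sigma'(z_j^(L)), written via a_j = sigmoid(z_j)

# dC0/dW[j, k] = a_k^(L-1) * sigma'(z_j) * 2(a_j - y_j): one path per weight.
dC_dW = np.empty_like(W)
for j in range(W.shape[0]):
    for k in range(W.shape[1]):
        dC_dW[j, k] = a_prev[k] * sp[j] * 2 * (a[j] - y[j])

# The same matrix as an outer product of the per-neuron error with the input activations.
delta = sp * 2 * (a - y)
print(np.allclose(dC_dW, np.outer(delta, a_prev)))   # True
```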
Step 3 — Activation derivative (the new idea)
In the one-neuron-per-layer case, $a^{(L-1)}$ influenced the cost through a single path. Now, $a_k^{(L-1)}$ feeds into every neuron $j$ in the next layer, so it influences $C_0$ along multiple paths. The total sensitivity is the sum of each path’s chain-rule contribution:

$$\frac{\partial C_0}{\partial a_k^{(L-1)}} = \sum_j \frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}} \, \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \, \frac{\partial C_0}{\partial a_j^{(L)}} = \sum_j w_{jk}^{(L)} \, \sigma'\big(z_j^{(L)}\big) \cdot 2\big(a_j^{(L)} - y_j\big)$$
This is the only genuinely new equation in the multi-neuron case. Once computed, it feeds backward into the next layer’s weight/bias derivatives — the same recursive step as before.
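A sketch of the sum-over-paths formula, with a finite-difference check on one previous-layer activation (same made-up shapes as before):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a_prev = rng.random(3)
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)
y = np.array([1.0, 0.0])

def cost(a_prev_val):
    a = sigmoid(W @ a_prev_val + b)
    return np.sum((a - y) ** 2)

z = W @ a_prev + b
a = sigmoid(z)
delta = a * (1 - a) * 2 * (a - y)   # sigma'(z_j) * 2(a_j - y_j), one entry per output neuron

# dC0/da_k^(L-1) = sum_j W[j, k] * delta_j -- every path through layer L contributes.
dC_da_prev = W.T @ delta

# Finite-difference check on the first previous-layer activation.
eps = 1e-6
bumped = a_prev.copy()
bumped[0] += eps
print(dC_da_prev[0], (cost(bumped) - cost(a_prev)) / eps)
```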
Full cost
Average over all $n$ training examples: $C = \frac{1}{n} \sum_i C_i$.
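A small sketch of the averaging step for a made-up four-example training set; by linearity, the gradient of $C$ averages the same way as the cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy one-neuron-per-layer model with a made-up training set of n = 4 examples.
w, b = 1.5, -0.3
xs = np.array([0.2, 0.7, 0.9, 0.4])   # inputs a^(L-1), one per example
ys = np.array([0.0, 1.0, 1.0, 0.0])   # desired outputs

z = w * xs + b
a = sigmoid(z)
per_example_cost = (a - ys) ** 2
per_example_dw = xs * a * (1 - a) * 2 * (a - ys)   # dC_i/dw for each example

C = per_example_cost.mean()       # C = (1/n) sum_i C_i
dC_dw = per_example_dw.mean()     # the full-cost gradient is the mean of per-example gradients
print(C, dC_dw)
```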