Summary: The algorithm that efficiently computes the gradient, $\nabla C$, of the cost function with respect to every weight and bias in a neural network, enabling gradient descent.
Notes on indexes:
- Superscripts on each neuron indicate the layer it is in. Subscripts indicate which neuron it is.
- Superscripts on each parameter (weight or bias) indicate the layer it feeds INTO. Subscripts indicate which neuron it comes from and which it goes to (indexing is backwards, so it's going-to ← coming-from).
Intuition
For a single training example, consider the output neuron that should be active. Three levers can increase its activation:
- Increase its bias — constant, predictable shift to the weighted sum.
- Increase weights connected to high-activation input neurons — because the gradient is proportional to $a^{(L-1)}$, nudging a weight has more effect when the input neuron's activation is large. This is the Hebbian echo: "neurons that fire together wire together."
- Change previous-layer activations — you can’t adjust them directly (they’re determined by earlier weights/biases), but you record the desired changes and propagate them backward to the parameters that do control those activations.
Repeat this layer by layer, from output back to input — that’s backpropagation.
Backpropagation computation tree for a single training example
- Tree diagram showing $w^{(L)}, a^{(L-1)}, b^{(L)} \to z^{(L)} \to a^{(L)} \to C_0$, with arrows labelled by the partial derivatives along each edge (assumes a simple network with only 1 neuron per layer).
- The cost of this one training example is $C_0 = (a^{(L)} - y)^2$
- where $a^{(L)}$ is the actual output of the last layer, $L$, of the network,
- and $y$ is the desired output from that layer.
- As a reminder, this last activation is determined by a weight, a bias, and the previous neuron’s activation, all pumped through some special nonlinear function like a sigmoid or a ReLU.
- For convenience, the weighted sum is called a weighted input, $z^{(L)}$, with the same superscript as the activation: $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$
- The weight, $w^{(L)}$, the previous activation, $a^{(L-1)}$, and the bias, $b^{(L)}$, together let us compute $z^{(L)}$.
- This lets us compute $a^{(L)} = \sigma(z^{(L)})$.
- This, along with the constant $y$, lets us compute the cost, $C_0$.
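The chain above can be traced numerically. A minimal sketch with hypothetical values for the last layer (all numbers are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values (made up for illustration).
w_L = 0.8      # weight w^(L)
b_L = -0.2     # bias b^(L)
a_prev = 0.6   # previous activation a^(L-1)
y = 1.0        # desired output

z_L = w_L * a_prev + b_L   # weighted input z^(L)
a_L = sigmoid(z_L)         # activation a^(L)
C0 = (a_L - y) ** 2        # cost for this one training example
```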
The calculus
Single path (one neuron per layer)
Per the example above, define: $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$, $a^{(L)} = \sigma(z^{(L)})$, $C_0 = (a^{(L)} - y)^2$.
Recall the chain rule: for the composition $C_0(a^{(L)}(z^{(L)}(w^{(L)})))$, the derivative is the product of the derivatives along the chain.
By the chain rule (see original source for detailed derivation), the sensitivity of $C_0$ to small changes in the weight $w^{(L)}$ is:

$$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \cdot \frac{\partial C_0}{\partial a^{(L)}} = a^{(L-1)} \cdot \sigma'(z^{(L)}) \cdot 2(a^{(L)} - y)$$
| Factor | Meaning |
|---|---|
| $\frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)}$ | Weight matters more when the input neuron is active |
| $\frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'(z^{(L)})$ | Sigmoid saturation kills gradients; ReLU avoids this for positive $z^{(L)}$ |
| $\frac{\partial C_0}{\partial a^{(L)}} = 2(a^{(L)} - y)$ | The further off the prediction, the stronger the signal |
For the bias ($C_0$'s sensitivity to $b^{(L)}$): same chain but $\frac{\partial z^{(L)}}{\partial b^{(L)}} = 1$, so one fewer factor:

$$\frac{\partial C_0}{\partial b^{(L)}} = \sigma'(z^{(L)}) \cdot 2(a^{(L)} - y)$$
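A quick sanity check of the three-factor product against a finite-difference estimate (hypothetical single-path values, nothing from the source):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Hypothetical values (made up for illustration).
w, b, a_prev, y = 0.8, -0.2, 0.6, 1.0

def cost(w, b):
    return (sigmoid(w * a_prev + b) - y) ** 2

z = w * a_prev + b
a = sigmoid(z)

# The three chain-rule factors.
dz_dw = a_prev         # dz/dw
da_dz = dsigmoid(z)    # da/dz
dC_da = 2 * (a - y)    # dC0/da

dC_dw = dz_dw * da_dz * dC_da
dC_db = da_dz * dC_da  # dz/db = 1, so one fewer factor

# Finite-difference sanity check for the weight gradient.
eps = 1e-6
num_dw = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)
assert abs(dC_dw - num_dw) < 1e-8
```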
Propagating backward
The derivative w.r.t. $a^{(L-1)}$ swaps the first factor for $w^{(L)}$, since $\frac{\partial z^{(L)}}{\partial a^{(L-1)}} = w^{(L)}$ (see original source). This gives the cost's sensitivity to the previous activation, which is then used to compute derivatives for the weights/biases in that earlier layer: the recursive step.
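The same finite-difference check works for the previous activation; only the first factor changes (hypothetical values again):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Hypothetical values (made up for illustration).
w, b, a_prev, y = 0.8, -0.2, 0.6, 1.0
z = w * a_prev + b
a = sigmoid(z)

# dC0/da^(L-1): same chain, but the first factor is dz/da_prev = w.
dC_da_prev = w * dsigmoid(z) * 2 * (a - y)

# Finite-difference check w.r.t. the previous activation.
def cost(ap):
    return (sigmoid(w * ap + b) - y) ** 2

eps = 1e-6
num = (cost(a_prev + eps) - cost(a_prev - eps)) / (2 * eps)
assert abs(dC_da_prev - num) < 1e-8
```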
Multiple neurons per layer
The single-neuron derivation above generalises cleanly. What changes is the notation — there are more indices to track — and one genuinely new idea at the end.
Notation
Use $j$ to index neurons in layer $L$ and $k$ to index neurons in layer $L-1$. The weighted sum feeding neuron $j$ is now a sum over all source neurons:

$$z_j^{(L)} = \sum_k w_{jk}^{(L)} a_k^{(L-1)} + b_j^{(L)}$$
The cost for one training example sums over all output neurons:

$$C_0 = \sum_j \left( a_j^{(L)} - y_j \right)^2$$
Weight derivative (structurally identical)
The chain rule for a specific weight $w_{jk}^{(L)}$ follows the exact same three-factor pattern as the single-neuron case, because $w_{jk}^{(L)}$ only affects neuron $j$, not any other neuron in the layer:

$$\frac{\partial C_0}{\partial w_{jk}^{(L)}} = \frac{\partial z_j^{(L)}}{\partial w_{jk}^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \cdot \frac{\partial C_0}{\partial a_j^{(L)}} = a_k^{(L-1)} \cdot \sigma'(z_j^{(L)}) \cdot 2(a_j^{(L)} - y_j)$$
Same structure, just with subscripts tracking which weight we mean.
Activation derivative (the new idea)
Here is where multi-neuron networks differ. In the single-neuron case, $a^{(L-1)}$ influenced $C_0$ through exactly one path ($a^{(L-1)} \to z^{(L)} \to a^{(L)} \to C_0$). Now, $a_k^{(L-1)}$ feeds into every neuron $j$ in layer $L$, creating multiple paths to the cost:
To capture the total sensitivity, sum the chain-rule contribution from each path:

$$\frac{\partial C_0}{\partial a_k^{(L-1)}} = \sum_j \frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \cdot \frac{\partial C_0}{\partial a_j^{(L)}} = \sum_j w_{jk}^{(L)} \cdot \sigma'(z_j^{(L)}) \cdot 2(a_j^{(L)} - y_j)$$
This is the only genuinely new equation. Once you have $\frac{\partial C_0}{\partial a_k^{(L-1)}}$ for every neuron $k$ in layer $L-1$, you repeat the same process to get derivatives for the weights and biases in layer $L-1$, and so on backward through the network.
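The sum over paths is exactly a transposed matrix-vector product. A small NumPy sketch (shapes and values are hypothetical) checking the matrix form against an explicit loop over paths:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 neurons in layer L-1, 2 neurons in layer L.
W = rng.normal(size=(2, 3))    # W[j, k] = w^(L)_{jk}
b = rng.normal(size=2)         # b^(L)_j
a_prev = rng.uniform(size=3)   # a^(L-1)_k
y = np.array([1.0, 0.0])       # desired outputs y_j

z = W @ a_prev + b             # z^(L)_j = sum_k w_{jk} a_k + b_j
a = sigmoid(z)
da_dz = a * (1 - a)            # sigma'(z^(L)_j)
dC_da = 2 * (a - y)            # dC0/da^(L)_j

# dC0/da^(L-1)_k = sum_j w_{jk} * sigma'(z_j) * dC0/da_j
dC_da_prev = W.T @ (da_dz * dC_da)

# Explicit loop over paths gives the same vector.
loop = np.array([sum(W[j, k] * da_dz[j] * dC_da[j] for j in range(2))
                 for k in range(3)])
assert np.allclose(dC_da_prev, loop)
```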
Full cost
Average over all training examples: $\nabla C = \frac{1}{n} \sum_{i=0}^{n-1} \nabla C_i$, where $C_i$ is the cost of training example $i$ and $n$ is the number of examples.
Sources
- src-3b1b-neural-networks-ch4
- src-3b1b-neural-networks-ch5
- Also see: Neural networks and deep learning - Chapter 2: How the backpropagation algorithm works
TODO: Deep dive into backpropagation algorithm
- Input $x$: Set the corresponding activation $a^1$ for the input layer.
- Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
- Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
- Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.
- Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l$ and $\frac{\partial C}{\partial b_j^l} = \delta_j^l$.
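The steps above can be sketched in NumPy. A minimal illustration, assuming a quadratic cost $C = \frac{1}{2}\|a^L - y\|^2$ (so $\nabla_a C = a^L - y$) and 0-indexed Python lists in place of the 1-indexed layers in the formulas:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(weights, biases, x, y):
    """Return (dC/dw, dC/db) for one training example (x, y).

    weights[l] and biases[l] map layer l's activations to layer l+1's
    (0-indexed), so a net with sizes [3, 4, 2] has two weight matrices.
    """
    # Feedforward: store every z^l and a^l for reuse in the backward pass.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Output error: delta^L = grad_a C (*) sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [np.outer(delta, activations[-2])]
    grad_b = [delta]

    # Backpropagate: delta^l = ((w^{l+1})^T delta^{l+1}) (*) sigma'(z^l).
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grad_w.insert(0, np.outer(delta, activations[l]))
        grad_b.insert(0, delta)
    return grad_w, grad_b
```

Stochastic gradient descent then averages these per-example gradients over a mini-batch and steps each parameter downhill.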
