Summary: Backpropagation is the algorithm that efficiently computes the gradient, $\nabla C$, of the cost function with respect to every weight and bias in a neural network, enabling gradient descent.

Notes on indexes:

  • Superscripts on each neuron indicate the layer it’s in: $a^{(L)}_j$ is the activation of neuron $j$ in layer $L$. The subscript indicates which neuron it is.
  • Superscripts on each parameter (weight or bias) indicate the layer it is feeding INTO. Subscripts indicate which neuron it came from and which it is going to (indexing is backwards, so it’s going-to coming-from: $w^{(L)}_{jk}$ connects neuron $k$ in layer $L-1$ to neuron $j$ in layer $L$).

Intuition

For a single training example, consider the output neuron that should be active. Three levers can increase its activation:

  1. Increase its bias — constant, predictable shift to the weighted sum.
  2. Increase weights connected to high-activation input neurons — because the weight’s gradient is proportional to $a^{(L-1)}$, nudging a weight has more effect when the input neuron’s activation is large. This is the Hebbian echo: “neurons that fire together wire together.”
  3. Change previous-layer activations — you can’t adjust them directly (they’re determined by earlier weights/biases), but you record the desired changes and propagate them backward to the parameters that do control those activations.

Repeat this layer by layer, from output back to input — that’s backpropagation.
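Lever 2 can be checked numerically. A minimal sketch (the weights, activations, and target below are assumed toy values, not from the source): the same small nudge applied to the weight on a highly active input moves the cost far more than on a near-silent input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w1, w2):
    # One output neuron fed by two inputs with very different activations.
    a1, a2 = 0.95, 0.05   # assumed toy activations: one "hot" input, one quiet
    y = 1.0               # this output neuron should be active
    a = sigmoid(w1 * a1 + w2 * a2)
    return (a - y) ** 2

base = cost(0.1, 0.1)
eps = 1e-3
gain_hot = base - cost(0.1 + eps, 0.1)   # nudge the weight on the active input
gain_cold = base - cost(0.1, 0.1 + eps)  # same nudge on the quiet input

# "Fire together, wire together": the active input's weight moves the cost
# roughly a1/a2 = 19x more than the quiet input's weight.
assert gain_hot > gain_cold > 0
```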

Backpropagation computation tree for a single training example

  • Tree diagram showing $w^{(L)}, a^{(L-1)}, b^{(L)} \to z^{(L)} \to a^{(L)} \to C_0$, with arrows labelled by the partial derivatives along each edge (assumes a simple network with only 1 neuron per layer).
  • The cost of this one training example is $C_0 = (a^{(L)} - y)^2$
    • where $a^{(L)}$ is the actual output of the last layer, $L$, of the network.
    • and $y$ is the desired output from that layer

  • As a reminder, this last activation is determined by a weight, a bias, and the previous neuron’s activation, all pumped through some special nonlinear function like a sigmoid or a ReLU.

  • For convenience, the weighted sum is called a weighted input, $z^{(L)}$, with the same superscript as the activation: $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$

  • The weight, $w^{(L)}$, the previous activation, $a^{(L-1)}$, and the bias, $b^{(L)}$, together let us compute $z^{(L)}$
    • This lets us compute $a^{(L)} = \sigma(z^{(L)})$
      • This, along with the constant $y$, lets us compute the cost, $C_0$.
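The chain above can be sketched in a few lines of Python (the values of the weight, bias, previous activation, and target are assumed toy numbers):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy values (assumptions, not from the source) for the last layer
# of a 1-neuron-per-layer network.
w, b = 0.5, -0.3   # weight w^(L) and bias b^(L)
a_prev = 0.8       # previous activation a^(L-1)
y = 1.0            # desired output

z = w * a_prev + b      # weighted input z^(L)
a = sigmoid(z)          # activation a^(L) = sigma(z^(L))
cost = (a - y) ** 2     # cost C_0 for this one training example
```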

The calculus

Single path (one neuron per layer)

Per the example above, define: $z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}$, $a^{(L)} = \sigma(z^{(L)})$, $C_0 = (a^{(L)} - y)^2$.

By the chain rule (see original source for detailed derivation), the sensitivity of $C_0$ to small changes in the weight $w^{(L)}$ is:

$$\frac{\partial C_0}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_0}{\partial a^{(L)}} = a^{(L-1)} \, \sigma'(z^{(L)}) \, 2(a^{(L)} - y)$$

Each factor has a meaning:

  • $a^{(L-1)}$ — Weight matters more when the input neuron is active
  • $\sigma'(z^{(L)})$ — Sigmoid saturation kills gradients; ReLU avoids this for positive $z^{(L)}$
  • $2(a^{(L)} - y)$ — The further off the prediction, the stronger the signal

For the bias (sensitivity of $C_0$ to $b^{(L)}$): same chain, but $\partial z^{(L)} / \partial b^{(L)} = 1$, so one fewer factor: $\frac{\partial C_0}{\partial b^{(L)}} = \sigma'(z^{(L)}) \, 2(a^{(L)} - y)$.
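A sketch of both formulas, verified against a centred finite difference (all numbers are assumed toy values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(w, b, a_prev=0.8, y=1.0):
    return (sigmoid(w * a_prev + b) - y) ** 2

w, b, a_prev, y = 0.5, -0.3, 0.8, 1.0   # assumed toy values
z = w * a_prev + b
a = sigmoid(z)

# Three factors for the weight; dz/db = 1 drops the first factor for the bias.
dC_dw = a_prev * dsigmoid(z) * 2 * (a - y)
dC_db = dsigmoid(z) * 2 * (a - y)

# Sanity check against a centred finite difference.
eps = 1e-6
num_dw = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)
num_db = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)
assert abs(dC_dw - num_dw) < 1e-8
assert abs(dC_db - num_db) < 1e-8
```

Both gradients are negative here, as expected: the output is below the target $y = 1$, so increasing either parameter would lower the cost.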

Propagating backward

The derivative w.r.t. $a^{(L-1)}$ swaps the first factor $a^{(L-1)}$ for $w^{(L)}$ (see original source): $\frac{\partial C_0}{\partial a^{(L-1)}} = w^{(L)} \, \sigma'(z^{(L)}) \, 2(a^{(L)} - y)$. This gives the cost’s sensitivity to the previous activation, which is then used to compute derivatives for the weights/biases in that earlier layer — the recursive step.
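A sketch of the swapped-factor formula, checked by perturbing the previous activation directly (toy values assumed):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

w, b, a_prev, y = 0.5, -0.3, 0.8, 1.0   # assumed toy values
z = w * a_prev + b
a = sigmoid(z)

# First factor is now w^(L) instead of a^(L-1).
dC_da_prev = w * dsigmoid(z) * 2 * (a - y)

# Numerical check: treat the previous activation as a free variable.
def cost(ap):
    return (sigmoid(w * ap + b) - y) ** 2

eps = 1e-6
num = (cost(a_prev + eps) - cost(a_prev - eps)) / (2 * eps)
assert abs(dC_da_prev - num) < 1e-8
```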

Multiple neurons per layer

The single-neuron derivation above generalises cleanly. What changes is the notation — there are more indices to track — and one genuinely new idea at the end.

Notation

Use $j$ to index neurons in layer $L$ and $k$ to index neurons in layer $L-1$. The weighted sum feeding neuron $j$ is now a sum over all source neurons:

$$z^{(L)}_j = \sum_k w^{(L)}_{jk} a^{(L-1)}_k + b^{(L)}_j$$

The cost for one training example sums over all output neurons:

$$C_0 = \sum_j \left(a^{(L)}_j - y_j\right)^2$$

Weight derivative (structurally identical)

The chain rule for a specific weight $w^{(L)}_{jk}$ follows the exact same three-factor pattern as the single-neuron case — because $w^{(L)}_{jk}$ only affects neuron $j$, not any other neuron in the layer:

$$\frac{\partial C_0}{\partial w^{(L)}_{jk}} = \frac{\partial z^{(L)}_j}{\partial w^{(L)}_{jk}} \frac{\partial a^{(L)}_j}{\partial z^{(L)}_j} \frac{\partial C_0}{\partial a^{(L)}_j} = a^{(L-1)}_k \, \sigma'(z^{(L)}_j) \, 2(a^{(L)}_j - y_j)$$

Same structure, just with subscripts tracking which weight we mean.
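A sketch with 2 destination neurons ($j$) fed by 3 source neurons ($k$); every number below is an assumed toy value:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# W[j][k] = w^(L)_{jk}: row j is the destination neuron, column k the source.
W = [[0.1, -0.2,  0.3],
     [0.4,  0.5, -0.6]]
b = [0.05, -0.05]
a_prev = [0.9, 0.1, 0.5]
y = [1.0, 0.0]

z = [sum(W[j][k] * a_prev[k] for k in range(3)) + b[j] for j in range(2)]
a = [sigmoid(zj) for zj in z]

# Same three factors, with subscripts: only neuron j's path involves w_{jk}.
dW = [[a_prev[k] * dsigmoid(z[j]) * 2 * (a[j] - y[j]) for k in range(3)]
      for j in range(2)]

# Numerical spot-check on w_{01}.
def cost(w01):
    z0 = W[0][0] * a_prev[0] + w01 * a_prev[1] + W[0][2] * a_prev[2] + b[0]
    c = (sigmoid(z0) - y[0]) ** 2
    c += (sigmoid(z[1]) - y[1]) ** 2   # neuron 1 is unaffected by w_{01}
    return c

eps = 1e-6
num = (cost(W[0][1] + eps) - cost(W[0][1] - eps)) / (2 * eps)
assert abs(dW[0][1] - num) < 1e-8
```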

Activation derivative (the new idea)

Here is where multi-neuron networks differ. In the single-neuron case, $a^{(L-1)}$ influenced $C_0$ through exactly one path ($a^{(L-1)} \to z^{(L)} \to a^{(L)} \to C_0$). Now, $a^{(L-1)}_k$ feeds into every neuron in layer $L$, creating multiple paths to the cost:

To capture the total sensitivity, sum the chain-rule contribution from each path:

$$\frac{\partial C_0}{\partial a^{(L-1)}_k} = \sum_j \frac{\partial z^{(L)}_j}{\partial a^{(L-1)}_k} \frac{\partial a^{(L)}_j}{\partial z^{(L)}_j} \frac{\partial C_0}{\partial a^{(L)}_j} = \sum_j w^{(L)}_{jk} \, \sigma'(z^{(L)}_j) \, 2(a^{(L)}_j - y_j)$$

This is the only genuinely new equation. Once you have $\partial C_0 / \partial a^{(L-1)}_k$ for every neuron $k$ in layer $L-1$, you repeat the same process to get derivatives for the weights and biases in layer $L-1$, and so on backward through the network.
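The sum over paths can be spot-checked numerically. A self-contained sketch (all layer values below are assumed toy numbers):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

W = [[0.1, -0.2,  0.3],
     [0.4,  0.5, -0.6]]     # W[j][k] = w^(L)_{jk}, assumed toy values
b = [0.05, -0.05]
a_prev = [0.9, 0.1, 0.5]
y = [1.0, 0.0]

def cost(ap):
    z = [sum(W[j][k] * ap[k] for k in range(3)) + b[j] for j in range(2)]
    return sum((sigmoid(z[j]) - y[j]) ** 2 for j in range(2))

z = [sum(W[j][k] * a_prev[k] for k in range(3)) + b[j] for j in range(2)]
a = [sigmoid(zj) for zj in z]

# a^(L-1)_k reaches the cost through every neuron j: sum the contributions.
dA_prev = [sum(W[j][k] * dsigmoid(z[j]) * 2 * (a[j] - y[j]) for j in range(2))
           for k in range(3)]

# Numerical check on k = 0: perturb that one activation, recompute the cost.
eps = 1e-6
bumped = a_prev[:]; bumped[0] += eps
dipped = a_prev[:]; dipped[0] -= eps
num = (cost(bumped) - cost(dipped)) / (2 * eps)
assert abs(dA_prev[0] - num) < 1e-8
```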

Full cost

Average over all training examples: $C = \frac{1}{n} \sum_{i=0}^{n-1} C_i$, so the full gradient is the average of the per-example gradients.
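A sketch of the averaging step for the single-path network, reusing the three-factor weight formula (the parameters and the tiny batch of $(a^{(L-1)}, y)$ pairs are assumed toy values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

w, b = 0.5, -0.3                              # assumed toy parameters
batch = [(0.8, 1.0), (0.2, 0.0), (0.6, 1.0)]  # assumed (a_prev, y) pairs

# Full-cost gradient = average of the per-example gradients.
grads = []
for a_prev, y in batch:
    z = w * a_prev + b
    a = sigmoid(z)
    grads.append(a_prev * dsigmoid(z) * 2 * (a - y))
dC_dw = sum(grads) / len(grads)

# One gradient-descent step on the weight.
lr = 0.1
w_new = w - lr * dC_dw
```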

Sources