Summary: Builds intuition for backpropagation — the algorithm that computes the gradient — before touching any calculus.

Core intuition

For a single training example, there are three levers for nudging any given neuron’s activation toward its desired value:

  1. Adjust the bias — simplest; shifts the weighted sum by a constant.
  2. Adjust the weights — most effective when the corresponding input activation is large, because the weight update is proportional to that activation.
  3. Wish for different previous-layer activations — you can’t change them directly, but you can record what you’d want, then propagate that desire backward to the weights/biases that do control those activations.

This backward propagation of desired changes is what gives the algorithm its name.
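
A minimal numerical sketch of the three levers above, assuming a single sigmoid neuron and a squared-error cost (neither is specified in these notes); all names and values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical numbers for one neuron on one training example.
a_prev = np.array([0.1, 0.9, 0.4])   # previous-layer activations (lever 3 "wishes" about these)
w = np.array([0.5, -0.3, 0.8])       # weights into this neuron (lever 2)
b = 0.2                              # bias (lever 1)
y = 1.0                              # activation we wish this neuron had

z = w @ a_prev + b                   # weighted sum
a = sigmoid(z)                       # actual activation
dC_da = 2 * (a - y)                  # cost = (a - y)**2, so dC/da = 2(a - y)
da_dz = a * (1 - a)                  # derivative of the sigmoid at z

grad_b      = dC_da * da_dz              # lever 1: shifts the weighted sum by a constant
grad_w      = dC_da * da_dz * a_prev     # lever 2: each component scaled by its input activation
grad_a_prev = dC_da * da_dz * w          # lever 3: the "wish" handed back to the previous layer

print(grad_b, grad_w, grad_a_prev)
```

Note how grad_w is largest where a_prev is largest, and grad_a_prev is exactly the quantity that gets propagated backward.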

Hebbian connection

“Neurons that fire together, wire together.” The largest weight updates land on connections between neurons that are both highly active, so the math naturally mirrors this biological principle.
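
As a rough illustration (with made-up numbers, not anything from the notes), the full-layer version of lever 2 is an outer product, so the biggest entries of the weight gradient pair the most active neuron on each side of a connection:

```python
import numpy as np

delta = np.array([0.9, 0.05])        # error signal of this layer's neurons ("post-synaptic" activity)
a_prev = np.array([0.8, 0.1, 0.7])   # previous-layer activations ("pre-synaptic" activity)

grad_w = np.outer(delta, a_prev)     # one gradient entry per weight between the two layers
print(grad_w)                        # the largest entry links the most active pair of neurons
```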

Averaging over all examples

Each training example produces its own set of desired nudges. The actual gradient step is the average of all those nudges — no single example dominates.
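
A sketch of that averaging step, assuming some per-example gradient routine; grad_for_example below is a made-up placeholder standing in for backpropagation, not a real implementation:

```python
import numpy as np

def grad_for_example(params, x, y):
    # placeholder: stands in for whatever backpropagation returns for one example
    return 2 * (params @ x - y) * x

params = np.array([0.5, -0.2])
dataset = [(np.array([1.0, 0.0]), 0.3),
           (np.array([0.0, 1.0]), -0.1),
           (np.array([1.0, 1.0]), 0.4)]

per_example = [grad_for_example(params, x, y) for x, y in dataset]
avg_grad = np.mean(per_example, axis=0)   # every example gets equal weight
params = params - 0.1 * avg_grad          # one full-batch gradient descent step
```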

Stochastic gradient descent (SGD)

  • Computing the gradient over the full dataset is expensive.
  • Mini-batches (e.g. 100 examples each) approximate the true gradient cheaply.
  • Each mini-batch gradient is an unbiased but noisy estimate of the true gradient — the path zigzags, but each step is far cheaper, so wall-clock convergence is faster.
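
A minimal mini-batch SGD loop under the same assumptions (placeholder gradient, synthetic data, arbitrary learning rate); the batch size of 100 matches the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=4)
data_x = rng.normal(size=(10_000, 4))    # synthetic inputs
data_y = rng.normal(size=10_000)         # synthetic targets

def grad_for_example(params, x, y):
    # placeholder per-example gradient (illustrative, not real backprop)
    return 2 * (params @ x - y) * x

batch_size = 100
learning_rate = 0.01

for epoch in range(3):
    order = rng.permutation(len(data_x))            # shuffle, then slice into mini-batches
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        grads = [grad_for_example(params, data_x[i], data_y[i]) for i in batch]
        params = params - learning_rate * np.mean(grads, axis=0)   # noisy but cheap step
```

Each inner step averages over only 100 examples rather than the full 10,000, which is where the speedup (and the zigzag) comes from.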