Summary: Builds intuition for backpropagation — the algorithm that computes the gradient — before touching any calculus.

Core intuition

For a single training example, there are three levers for nudging any given neuron’s activation toward its desired value:

  1. Adjust the bias — simplest; shifts the weighted sum by a constant.
  2. Adjust the weights — most effective when the corresponding input activation is large, because the weight update is proportional to that activation.
  3. Wish for different previous-layer activations — you can’t change them directly, but you can record what you’d want, then propagate that desire backward to the weights/biases that do control those activations.

This backward propagation of desired changes is what gives the algorithm its name.
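
A minimal numerical sketch of the three levers above, assuming a single sigmoid neuron and a squared-error cost (neither is specified in these notes); all names and values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical numbers for one neuron on one training example.
a_prev = np.array([0.1, 0.9, 0.4])   # previous-layer activations (lever 3 "wishes" about these)
w = np.array([0.5, -0.3, 0.8])       # weights into this neuron (lever 2)
b = 0.2                              # bias (lever 1)
y = 1.0                              # activation we wish this neuron had

z = w @ a_prev + b                   # weighted sum
a = sigmoid(z)                       # actual activation
dC_da = 2 * (a - y)                  # cost = (a - y)**2, so dC/da = 2(a - y)
da_dz = a * (1 - a)                  # derivative of the sigmoid at z

grad_b      = dC_da * da_dz              # lever 1: shifts the weighted sum by a constant
grad_w      = dC_da * da_dz * a_prev     # lever 2: each component scaled by its input activation
grad_a_prev = dC_da * da_dz * w          # lever 3: the "wish" handed back to the previous layer

print(grad_b, grad_w, grad_a_prev)
```

Note how grad_w is largest where a_prev is largest, and grad_a_prev is exactly the quantity that gets propagated backward.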

Hebbian connection

“Neurons that fire together, wire together.” The largest weight updates land on connections between neurons that are both highly active, so the math naturally mirrors this biological principle.
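
As a rough illustration (with made-up numbers, not anything from the notes), the full-layer version of lever 2 is an outer product, so the biggest entries of the weight gradient pair the most active neuron on each side of a connection:

```python
import numpy as np

delta = np.array([0.9, 0.05])        # error signal of this layer's neurons ("post-synaptic" activity)
a_prev = np.array([0.8, 0.1, 0.7])   # previous-layer activations ("pre-synaptic" activity)

grad_w = np.outer(delta, a_prev)     # one gradient entry per weight between the two layers
print(grad_w)                        # the largest entry links the most active pair of neurons
```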

Averaging over all examples

Each training example produces its own set of desired nudges. The actual gradient step is the average of all those nudges — no single example dominates.
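
A sketch of that averaging step, assuming some per-example gradient routine; grad_for_example below is a made-up placeholder standing in for backpropagation, not a real implementation:

```python
import numpy as np

def grad_for_example(params, x, y):
    # placeholder: stands in for whatever backpropagation returns for one example
    return 2 * (params @ x - y) * x

params = np.array([0.5, -0.2])
dataset = [(np.array([1.0, 0.0]), 0.3),
           (np.array([0.0, 1.0]), -0.1),
           (np.array([1.0, 1.0]), 0.4)]

per_example = [grad_for_example(params, x, y) for x, y in dataset]
avg_grad = np.mean(per_example, axis=0)   # every example gets equal weight
params = params - 0.1 * avg_grad          # one full-batch gradient descent step
```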

Stochastic gradient descent (SGD)

  • Computing the gradient over the full dataset is expensive.
  • Mini-batches (e.g. 100 examples each) approximate the true gradient cheaply.
  • Each mini-batch gradient is an unbiased but noisy estimate of the true gradient — the path zigzags, but each step is far cheaper, so wall-clock convergence is faster.
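
A minimal mini-batch SGD loop under the same assumptions (placeholder gradient, synthetic data, arbitrary learning rate); the batch size of 100 matches the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=4)
data_x = rng.normal(size=(10_000, 4))    # synthetic inputs
data_y = rng.normal(size=10_000)         # synthetic targets

def grad_for_example(params, x, y):
    # placeholder per-example gradient (illustrative, not real backprop)
    return 2 * (params @ x - y) * x

batch_size = 100
learning_rate = 0.01

for epoch in range(3):
    order = rng.permutation(len(data_x))            # shuffle, then slice into mini-batches
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        grads = [grad_for_example(params, data_x[i], data_y[i]) for i in batch]
        params = params - learning_rate * np.mean(grads, axis=0)   # noisy but cheap step
```

Each inner step averages over only 100 examples rather than the full 10,000, which is where the speedup (and the zigzag) comes from.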