Summary: Explains how a neural network learns by minimising a cost function via gradient descent.
Key ideas
- Learning = minimising a cost function. The cost measures how badly the network performs across all training examples.
- The cost function for a single example is the sum of squared differences between the network’s output activations and the desired activations. The overall cost averages this over all training examples.
- You can’t solve for the minimum analytically (e.g. 13,002 parameters in the MNIST example network), so you use gradient descent: start at a random point, compute the gradient, step in the negative gradient direction, repeat — see the sketch after this list.
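A minimal sketch of these two ideas in Python, with a hand-made two-parameter toy cost standing in for the 13,002-parameter MNIST network (the function names `example_cost`, `overall_cost`, `numerical_gradient`, and `gradient_descent` are my own, not from the source):

```python
import numpy as np

def example_cost(output_activations, desired_activations):
    """Cost of one training example: sum of squared differences."""
    return np.sum((output_activations - desired_activations) ** 2)

def overall_cost(all_outputs, all_desired):
    """Overall cost: the per-example cost averaged over the training set."""
    return np.mean([example_cost(o, d) for o, d in zip(all_outputs, all_desired)])

def numerical_gradient(cost_fn, params, eps=1e-6):
    """One gradient entry per parameter, estimated by finite differences."""
    grad = np.zeros_like(params)
    base = cost_fn(params)
    for i in range(params.size):
        bumped = params.copy()
        bumped[i] += eps
        grad[i] = (cost_fn(bumped) - base) / eps
    return grad

def gradient_descent(cost_fn, params, learning_rate=0.1, steps=200):
    """Start somewhere, step against the gradient, repeat."""
    for _ in range(steps):
        params = params - learning_rate * numerical_gradient(cost_fn, params)
    return params

# Toy stand-in for a network's cost surface: minimum is at [3, -2].
toy_cost = lambda p: np.sum((p - np.array([3.0, -2.0])) ** 2)
print(gradient_descent(toy_cost, params=np.random.randn(2)))  # -> approx [3, -2]
```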
The gradient as relative importance
- The gradient is a vector with one entry per weight/bias. Each entry’s sign says which direction to nudge; its magnitude says how much that nudge matters relative to the others.
- Think of it as encoding “bang for your buck” — which weight changes will reduce cost fastest.
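A toy illustration of the “bang for your buck” reading, using a made-up two-weight cost whose gradient entries differ wildly in size (the cost and numbers are purely illustrative):

```python
import numpy as np

# Hypothetical cost surface: the first weight matters far more than the second.
def cost(w):
    return 10.0 * w[0] ** 2 + 0.1 * w[1] ** 2

w = np.array([1.0, 1.0])
grad = np.array([20.0 * w[0], 0.2 * w[1]])  # analytic gradient of the toy cost

# grad == [20.0, 0.2]: both signs say "decrease this weight", but a small nudge
# to w[0] cuts the cost about 100x faster than the same nudge to w[1].
print(grad)
```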
Learning rate
- Each step is $\theta \leftarrow \theta - \eta \, \nabla C(\theta)$, where $\eta$ is the learning rate.
- Too large → overshoot and oscillate. Too small → converge painfully slowly.
- Making step size proportional to slope naturally produces smaller steps near minima.
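A one-dimensional sketch of the trade-off on $C(x) = x^2$, whose slope is $2x$ (the specific learning rates below are arbitrary picks for illustration):

```python
def descend(learning_rate, steps=20, x=5.0):
    """Minimise C(x) = x^2 (slope 2x) starting from x = 5."""
    for _ in range(steps):
        # Step length is learning_rate * slope, so steps shrink near the minimum.
        x = x - learning_rate * 2.0 * x
    return x

print(descend(0.01))  # too small: still ~3.3 after 20 steps
print(descend(0.4))   # reasonable: essentially at the minimum (x ~ 0)
print(descend(1.1))   # too large: every step overshoots, so x oscillates and blows up
```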
Local minima caveat
- Gradient descent finds a local minimum, not necessarily the global one. Which minimum you land in depends on the random initialisation.
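A sketch of the same effect in one dimension, with a made-up cost that has two valleys; which one you end up in depends only on where you start (the function, seed, and step sizes are arbitrary choices for illustration):

```python
import numpy as np

def cost(x):
    # Toy cost with two local minima, roughly at x ~ -1.04 (deeper) and x ~ +0.96.
    return x ** 4 - 2.0 * x ** 2 + 0.3 * x

def slope(x):
    return 4.0 * x ** 3 - 4.0 * x + 0.3

def descend(x, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        x = x - learning_rate * slope(x)
    return x

rng = np.random.default_rng(0)
for start in rng.uniform(-2.0, 2.0, size=5):
    end = descend(start)
    print(f"start {start:+.2f} -> settles near {end:+.2f}, cost {cost(end):+.3f}")
```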