Summary: Explains how a neural network learns by minimising a cost function via gradient descent.
Key ideas
- Learning = minimising a cost function. The cost measures how badly the network performs across all training examples.
- The cost function for a single example is the sum of squared differences between the network’s output activations and the desired activations. The overall cost averages this over all training examples.
- You can’t solve for the minimum analytically (e.g. 13,002 parameters in the MNIST example network), so you use gradient descent: start at a random point, compute the gradient, step in the negative gradient direction, repeat — see the sketch after this list.
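A minimal sketch of these two ideas in Python, with a hand-made two-parameter toy cost standing in for the 13,002-parameter MNIST network (the function names `example_cost`, `overall_cost`, `numerical_gradient`, and `gradient_descent` are my own, not from the source):

```python
import numpy as np

def example_cost(output_activations, desired_activations):
    """Cost of one training example: sum of squared differences."""
    return np.sum((output_activations - desired_activations) ** 2)

def overall_cost(all_outputs, all_desired):
    """Overall cost: the per-example cost averaged over the training set."""
    return np.mean([example_cost(o, d) for o, d in zip(all_outputs, all_desired)])

def numerical_gradient(cost_fn, params, eps=1e-6):
    """One gradient entry per parameter, estimated by finite differences."""
    grad = np.zeros_like(params)
    base = cost_fn(params)
    for i in range(params.size):
        bumped = params.copy()
        bumped[i] += eps
        grad[i] = (cost_fn(bumped) - base) / eps
    return grad

def gradient_descent(cost_fn, params, learning_rate=0.1, steps=200):
    """Start somewhere, step against the gradient, repeat."""
    for _ in range(steps):
        params = params - learning_rate * numerical_gradient(cost_fn, params)
    return params

# Toy stand-in for a network's cost surface: minimum is at [3, -2].
toy_cost = lambda p: np.sum((p - np.array([3.0, -2.0])) ** 2)
print(gradient_descent(toy_cost, params=np.random.randn(2)))  # -> approx [3, -2]
```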
The gradient as relative importance
- The gradient is a vector with one entry per weight/bias. Each entry’s sign says which direction to nudge; its magnitude says how much that nudge matters relative to the others.
- Think of it as encoding “bang for your buck” — which weight changes will reduce cost fastest.
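A toy illustration of the “bang for your buck” reading, using a made-up two-weight cost whose gradient entries differ wildly in size (the cost and numbers are purely illustrative):

```python
import numpy as np

# Hypothetical cost surface: the first weight matters far more than the second.
def cost(w):
    return 10.0 * w[0] ** 2 + 0.1 * w[1] ** 2

w = np.array([1.0, 1.0])
grad = np.array([20.0 * w[0], 0.2 * w[1]])  # analytic gradient of the toy cost

# grad == [20.0, 0.2]: both signs say "decrease this weight", but a small nudge
# to w[0] cuts the cost about 100x faster than the same nudge to w[1].
print(grad)
```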
Learning rate
- Each step is $\theta \leftarrow \theta - \eta \, \nabla C(\theta)$, where $\eta$ is the learning rate.
- Too large → overshoot and oscillate. Too small → converge painfully slowly.
- Making step size proportional to slope naturally produces smaller steps near minima.
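A one-dimensional sketch of the trade-off on $C(x) = x^2$, whose slope is $2x$ (the specific learning rates below are arbitrary picks for illustration):

```python
def descend(learning_rate, steps=20, x=5.0):
    """Minimise C(x) = x^2 (slope 2x) starting from x = 5."""
    for _ in range(steps):
        # Step length is learning_rate * slope, so steps shrink near the minimum.
        x = x - learning_rate * 2.0 * x
    return x

print(descend(0.01))  # too small: still ~3.3 after 20 steps
print(descend(0.4))   # reasonable: essentially at the minimum (x ~ 0)
print(descend(1.1))   # too large: every step overshoots, so x oscillates and blows up
```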
Local minima caveat
- Gradient descent finds a local minimum, not necessarily the global one. Which minimum you land in depends on the random initialisation.
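A sketch of the same effect in one dimension, with a made-up cost that has two valleys; which one you end up in depends only on where you start (the function, seed, and step sizes are arbitrary choices for illustration):

```python
import numpy as np

def cost(x):
    # Toy cost with two local minima, roughly at x ~ -1.04 (deeper) and x ~ +0.96.
    return x ** 4 - 2.0 * x ** 2 + 0.3 * x

def slope(x):
    return 4.0 * x ** 3 - 4.0 * x + 0.3

def descend(x, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        x = x - learning_rate * slope(x)
    return x

rng = np.random.default_rng(0)
for start in rng.uniform(-2.0, 2.0, size=5):
    end = descend(start)
    print(f"start {start:+.2f} -> settles near {end:+.2f}, cost {cost(end):+.3f}")
```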