Summary: Explains how a neural network learns by minimising a cost function via gradient descent.

Key ideas

  • Learning = minimising a cost function. The cost measures how badly the network performs across all training examples.
  • The cost function for a single example is the sum of squared differences between the network’s output activations and the desired activations. The overall cost averages this over all training examples.
  • You can’t solve for the minimum analytically (e.g. 13,002 parameters in the MNIST example network), so you use gradient descent: start at a random point, compute the gradient, step in the negative-gradient direction, repeat (see the sketch after this list).
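
A minimal sketch of both ideas in Python/NumPy, assuming a toy one-neuron “network” and a finite-difference gradient standing in for backpropagation; the function names and the tiny training set are invented for illustration, not the 13,002-parameter MNIST network.

    import numpy as np

    def example_cost(output_activations, desired):
        # Cost for one training example: sum of squared differences between
        # the network's output activations and the desired activations.
        return np.sum((output_activations - desired) ** 2)

    def total_cost(params, forward, examples):
        # Overall cost: the per-example cost averaged over the training set.
        return np.mean([example_cost(forward(params, x), y) for x, y in examples])

    def numerical_gradient(f, params, eps=1e-6):
        # Finite-difference gradient estimate, just to keep the sketch
        # self-contained (real networks use backpropagation instead).
        grad = np.zeros_like(params)
        for i in range(len(params)):
            bump = np.zeros_like(params)
            bump[i] = eps
            grad[i] = (f(params + bump) - f(params - bump)) / (2 * eps)
        return grad

    def gradient_descent(f, params, learning_rate=0.1, steps=500):
        # Start at the given (random) point, then repeatedly step in the
        # negative-gradient direction.
        for _ in range(steps):
            params = params - learning_rate * numerical_gradient(f, params)
        return params

    # Toy usage: a one-neuron "network" with two weights and a bias (3 parameters).
    forward = lambda p, x: np.array([p[0] * x[0] + p[1] * x[1] + p[2]])
    examples = [(np.array([0.0, 1.0]), np.array([1.0])),
                (np.array([1.0, 0.0]), np.array([0.0]))]
    trained = gradient_descent(lambda p: total_cost(p, forward, examples),
                               params=np.random.randn(3))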

The gradient as relative importance

  • The gradient is a vector with one entry per weight/bias. Each entry’s sign says which direction to nudge; its magnitude says how much that nudge matters relative to the others.
  • Think of it as encoding “bang for your buck” — which weight changes will reduce cost fastest (a numeric illustration follows this list).
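
A small numeric illustration of that reading, using a made-up cost over two weights (the coefficients are invented): the gradient entry with the larger magnitude is the one where a nudge buys the biggest drop in cost.

    import numpy as np

    # Made-up cost that is far more sensitive to w1 than to w2.
    cost = lambda w: 3.0 * w[0] ** 2 + 0.1 * w[1] ** 2

    w = np.array([1.0, 1.0])
    gradient = np.array([6.0 * w[0], 0.2 * w[1]])    # analytic gradient at w
    print(gradient)                                  # [6.  0.2]: the w1 entry dominates

    # Nudge each weight downhill by the same small amount and compare the payoff.
    eps = 0.01
    print(cost(w) - cost(w - np.array([eps, 0.0])))  # ~0.060  (big drop in cost)
    print(cost(w) - cost(w - np.array([0.0, eps])))  # ~0.002  (tiny drop in cost)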

Learning rate

  • Each step is w ← w − η∇C(w), where η is the learning rate (see the sketch after this list).
  • Too large → overshoot and oscillate. Too small → converge painfully slowly.
  • Making step size proportional to slope naturally produces smaller steps near minima.
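
A sketch of that step rule on a one-dimensional quadratic cost (an invented example, with the gradient written out by hand) showing all three behaviours: a sensible learning rate converges, a large one overshoots, and a tiny one barely moves. Because each step is proportional to the slope, the converging run automatically slows down as it nears the minimum.

    cost = lambda w: (w - 3.0) ** 2          # minimum at w = 3
    grad = lambda w: 2.0 * (w - 3.0)         # slope of the cost

    def descend(learning_rate, steps=20, w=0.0):
        for _ in range(steps):
            w = w - learning_rate * grad(w)  # each step: w <- w - eta * dC/dw
        return w

    print(descend(0.1))      # ~2.97   converges towards 3
    print(descend(1.1))      # ~-112   too large: overshoots and diverges
    print(descend(0.001))    # ~0.12   too small: barely off the starting point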

Local minima caveat

  • Gradient descent finds a local minimum, not necessarily the global one. Which minimum you land in depends on the random initialisation (see the sketch below).
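
A small sketch using an invented one-dimensional cost with two valleys: gradient descent from different starting points settles into different minima, and only some starts find the deeper (global) one.

    # Invented cost with two local minima, near w = -1.75 and w = +1.71;
    # the left-hand valley is slightly deeper, so it is the global minimum.
    cost = lambda w: w ** 4 - 6.0 * w ** 2 + 0.5 * w
    grad = lambda w: 4.0 * w ** 3 - 12.0 * w + 0.5

    def descend(w, learning_rate=0.01, steps=2000):
        for _ in range(steps):
            w = w - learning_rate * grad(w)
        return w

    for start in (-2.0, -0.5, 0.5, 2.0):
        print(start, "->", round(descend(start), 2))
    # Starts to the left of the central hump fall into the deeper left valley;
    # starts to the right fall into the shallower right one. Which minimum you
    # reach depends entirely on where you begin.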