Summary: An iterative optimisation algorithm that minimises a cost function by repeatedly stepping in the direction of steepest descent, i.e. along the negative gradient $-\nabla C$ (an $n$-dimensional vector).

How it works (geometric interpretation)

The negative gradient tells us how to change the weights and biases to decrease the cost most effectively.

  1. Initialise all parameters (i.e. weights and biases) randomly. This is a point in the parameter space with an associated cost function value.
  2. Compute the gradient of the cost function with respect to the parameter vector $\theta$, written $\nabla C(\theta)$ — a vector pointing in the direction of steepest ascent from that point.
  3. Update parameters: $\theta \leftarrow \theta - \eta\,\nabla C(\theta)$, where $\eta$ is the learning rate.
  4. Repeat until cost is acceptably low (or stops decreasing).
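A minimal sketch of this loop in Python (illustrative only: the quadratic toy cost, starting point, learning rate and stopping threshold are assumptions, not from the source):

```python
import numpy as np

def cost(theta):
    # Toy cost: a quadratic bowl with its minimum at (3, -1).
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def gradient(theta):
    # Analytic gradient of the toy cost above.
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

theta = np.random.randn(2)   # 1. random initial point in parameter space
eta = 0.1                    # learning rate

for step in range(100):
    grad = gradient(theta)           # 2. direction of steepest ascent
    theta = theta - eta * grad       # 3. step the other way (steepest descent)
    if np.linalg.norm(grad) < 1e-6:  # 4. stop once the slope has flattened out
        break

print(theta)  # ends up near [3, -1], the minimum of the toy cost
```

In a real network the gradient would come from backpropagation rather than a hand-written formula, but the update loop is the same.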

The gradient as a vector of nudges (vector interpretation)

For a network with $n$ parameters (e.g. $n = 13{,}002$ for the 3Blue1Brown MNIST example), the gradient, $\nabla C$, is an $n$-dimensional vector (see backpropagation):

$$\nabla C(\theta) = \left( \frac{\partial C}{\partial \theta_1}, \frac{\partial C}{\partial \theta_2}, \ldots, \frac{\partial C}{\partial \theta_n} \right)$$

Each component of $\nabla C$ tells you how to nudge the corresponding component of $\theta$:

  • Sign: nudge this parameter (weight or bias) up or down?
  • Magnitude: how sensitive is the cost to this parameter, relative to others?

This encodes “bang for your buck” — which changes matter most.
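A small worked example (illustrative numbers, not from the source): if, for three parameters,

$$\nabla C = (3.2,\ 0.1,\ -1.5),$$

then stepping along $-\nabla C$ nudges the first parameter down by a lot (the cost is most sensitive to it), the second down only slightly, and the third up.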

Learning rate ($\eta$)

Controls step size. There’s a tension:

| Too large | Too small |
| --- | --- |
| Overshoots minima, oscillates | Converges painfully slowly |

Making steps proportional to the gradient magnitude naturally shrinks steps near minima (where the slope flattens).
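A hand-worked illustration of this (the one-parameter cost is an assumption, not from the source): for $C(w) = w^2$, the update becomes

$$w \leftarrow w - \eta \frac{dC}{dw} = w - 2\eta w = (1 - 2\eta)\,w,$$

so each step is proportional to the current distance from the minimum at $w = 0$ and shrinks automatically as the slope flattens. If $\eta$ is too large ($\eta > 1$ here), $|1 - 2\eta| > 1$ and the iterates overshoot and oscillate outward instead of converging.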

Local minima

Gradient descent finds a local minimum, which may not be the global one. Which minimum you reach depends on the random starting point. In high-dimensional spaces (thousands of parameters), the landscape has many local minima and saddle points.

Stochastic gradient descent (SGD)

Computing the exact gradient requires processing every training example — expensive. SGD approximates it:

  1. Shuffle the training data.
  2. Split into mini-batches (e.g. 100 examples).
  3. Compute the gradient on each mini-batch and take a step.

Each mini-batch gradient is an unbiased but high-variance estimate of the true gradient — it points roughly downhill but zigzags due to sampling noise. The trade-off: each step is much cheaper (1/100th of the full-batch compute when the data is split into 100 mini-batches), so you take many more steps in the same wall-clock time, and the noise averages out over a full pass (epoch) through the data.
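A sketch of the mini-batch loop in Python (the linear least-squares model, synthetic data, batch size of 100 and learning rate are assumptions for illustration; a real network would get the mini-batch gradient from backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X @ w_true + noise (stand-in for a training set).
n_samples, n_features = 10_000, 5
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.01 * rng.normal(size=n_samples)

w = np.zeros(n_features)   # parameters to learn
eta = 0.01                 # learning rate
batch_size = 100

for epoch in range(5):
    order = rng.permutation(n_samples)          # 1. shuffle the training data
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]   # 2. take one mini-batch
        Xb, yb = X[idx], y[idx]
        # 3. gradient of the mean squared error on this mini-batch only
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= eta * grad                          # noisy but cheap step downhill

print(np.allclose(w, w_true, atol=0.05))  # True: w reaches the right region
```

Each inner step uses only 100 of the 10,000 examples, so it is roughly 1/100th the cost of a full-batch gradient, and one epoch still visits every example once.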

Geometric interpretation: Gradient descent vs SGD path

  • Gradient descent (full batch): smooth curve slowly descending to a minimum
  • SGD: jagged zigzag path quickly reaching the same region