Summary: An iterative optimisation algorithm that minimises a cost function by repeatedly stepping in the direction of steepest descent, i.e. along the negative of the $n$-dimensional gradient vector, $-\nabla C$.
How it works (geometric interpretation)
The negative gradient tells us how to change the weights and biases to decrease the cost most effectively.
- Initialise all parameters (i.e. weights and biases) randomly. This is a point in the parameter space with an associated cost function value.
- Compute the gradient of the cost function with respect to the parameters, $\nabla C(\theta)$: a vector pointing in the direction of steepest ascent from that point.
- Update parameters: $\theta \leftarrow \theta - \eta\,\nabla C(\theta)$, where $\eta$ is the learning rate.
- Repeat until the cost is acceptably low (or stops decreasing). A minimal sketch of this loop follows.
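A minimal sketch of the loop above, using an invented two-parameter quadratic cost; the `target`, `eta`, and stopping threshold are illustrative choices, not from the source:

```python
import numpy as np

# Toy cost C(theta) = ||theta - target||^2, with gradient 2 * (theta - target).
# 'target', 'eta', and the stopping threshold are made up for illustration.

def cost(theta, target):
    return np.sum((theta - target) ** 2)

def grad(theta, target):
    return 2.0 * (theta - target)

rng = np.random.default_rng(0)
target = np.array([3.0, -1.0])
theta = rng.normal(size=2)        # random start: a point in parameter space
eta = 0.1                         # learning rate

for step in range(1000):
    g = grad(theta, target)       # direction of steepest ascent
    theta = theta - eta * g       # step against the gradient
    if cost(theta, target) < 1e-10:
        break                     # stop once the cost is acceptably low

print(theta)                      # ends up close to target
```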


The gradient as a vector of nudges (vector interpretation)
For a network with $n$ parameters (e.g. $n = 13{,}002$ for the 3Blue1Brown MNIST example), the gradient $\nabla C(\theta)$ is an $n$-dimensional vector (see backpropagation):

$$\nabla C(\theta) = \begin{bmatrix} \partial C / \partial \theta_1 \\ \partial C / \partial \theta_2 \\ \vdots \\ \partial C / \partial \theta_n \end{bmatrix}$$

Each component of $\nabla C(\theta)$ tells you how to nudge the corresponding component of $\theta$:
- Sign: nudge this parameter (weight or bias) up or down?
- Magnitude: how sensitive is the cost to this parameter, relative to others?
This encodes “bang for your buck” — which changes matter most.
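As a small illustration of reading these components, here is an invented two-parameter cost in which the weight matters far more than the bias; the function and values are assumptions for illustration, not from the source:

```python
import numpy as np

# Invented cost C(w, b) = (2*w + 0.1*b - 1)^2: the weight w influences C
# about 20x more strongly than the bias b.

def grad(w, b):
    common = 2.0 * (2.0 * w + 0.1 * b - 1.0)
    return np.array([2.0 * common, 0.1 * common])   # [dC/dw, dC/db]

g = grad(w=1.0, b=1.0)               # g = [4.4, 0.22]
print(np.sign(-g))                   # sign: nudge each parameter up (+1) or down (-1)
print(np.abs(g) / np.abs(g).max())   # relative magnitude: "bang for your buck" -> [1. 0.05]
```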
Learning rate ($\eta$)
Controls step size. There’s a tension:
| Too large | Too small |
|---|---|
| Overshoots minima, oscillates | Converges painfully slowly |
Making steps proportional to the gradient magnitude naturally shrinks steps near minima (where the slope flattens).
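A toy illustration of this tension on the one-dimensional cost $C(x) = x^2$; the specific learning-rate values are arbitrary examples, not from the source:

```python
# C(x) = x^2 has gradient 2x and its minimum at x = 0.
# Because each step is eta * 2x, steps shrink automatically as x approaches 0.

def run(eta, steps=50, x0=1.0):
    x = x0
    for _ in range(steps):
        x = x - eta * 2.0 * x
    return x

print(run(eta=1.1))     # too large: overshoots the minimum and diverges
print(run(eta=0.001))   # too small: barely moves in 50 steps (~0.90)
print(run(eta=0.1))     # reasonable: essentially at the minimum (~1e-5)
```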
Local minima
Gradient descent finds a local minimum, which may not be the global one. Which minimum you reach depends on the random starting point. In high-dimensional spaces (thousands of parameters), the landscape has many local minima and saddle points.
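A small sketch of this dependence on the starting point, using an invented non-convex one-dimensional cost with one global and one local minimum; the function, learning rate, and step count are assumptions:

```python
# Invented cost C(x) = x^4 - 2*x^2 + 0.5*x, with minima near x ≈ -1.06 (lower
# cost, the global minimum here) and x ≈ +0.93 (a higher local minimum).

def grad(x):
    return 4.0 * x**3 - 4.0 * x + 0.5   # dC/dx

def descend(x0, eta=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

print(descend(x0=-1.5))   # settles near x ≈ -1.06 (the global minimum)
print(descend(x0=+1.5))   # settles near x ≈ +0.93 (a local minimum)
```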
Stochastic gradient descent (SGD)
Computing the exact gradient requires processing every training example — expensive. SGD approximates it:
- Shuffle the training data.
- Split into mini-batches (e.g. 100 examples).
- Compute the gradient on each mini-batch and take a step.
Each mini-batch gradient is an unbiased but high-variance estimate of the true gradient — it points roughly downhill but zigzags due to sampling noise. The trade-off: each step is much cheaper (e.g. 1/100th the compute per step when the data is split into 100 mini-batches), so you take many more steps in the same wall-clock time, and the noise averages out over a full pass (epoch) through the data.
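A mini-batch SGD sketch on a toy linear-regression problem; the data, model, batch size, and learning rate are all illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                   # toy training inputs
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)     # toy targets

w = np.zeros(5)
eta, batch_size = 0.05, 100

for epoch in range(20):                          # one epoch = full pass over the data
    idx = rng.permutation(len(X))                # 1. shuffle the training data
    for start in range(0, len(X), batch_size):   # 2. split into mini-batches
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        g = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)   # 3. mini-batch gradient (MSE)
        w = w - eta * g                               #    take a (noisy) step

print(w)   # close to true_w: the zigzag steps average out over the epochs
```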
Geometric interpretation: Gradient descent vs SGD path
- Gradient descent (full batch): smooth curve slowly descending to a minimum
- SGD: jagged zigzag path quickly reaching the same region
