Summary: An iterative optimisation algorithm that minimises a cost function by repeatedly stepping in the direction of steepest descent, i.e. along the negative gradient $-\nabla C$ (an $n$-dimensional vector).

How it works (geometric interpretation)

The negative gradient tells us how to change the weights and biases to decrease the cost most effectively.

  1. Initialise all parameters (i.e. weights and biases) randomly. This is a point in the parameter space with an associated cost function value.
  2. Compute the gradient of the cost function with respect to the parameter vector $\theta$, written $\nabla C(\theta)$ — a vector pointing in the direction of steepest ascent from that point.
  3. Update parameters: $\theta \leftarrow \theta - \eta\,\nabla C(\theta)$, where $\eta$ is the learning rate.
  4. Repeat until cost is acceptably low (or stops decreasing).
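A minimal sketch of this loop in Python (illustrative only: the quadratic toy cost, starting point, learning rate and stopping threshold are assumptions, not from the source):

```python
import numpy as np

def cost(theta):
    # Toy cost: a quadratic bowl with its minimum at (3, -1).
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def gradient(theta):
    # Analytic gradient of the toy cost above.
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

theta = np.random.randn(2)   # 1. random initial point in parameter space
eta = 0.1                    # learning rate

for step in range(100):
    grad = gradient(theta)           # 2. direction of steepest ascent
    theta = theta - eta * grad       # 3. step the other way (steepest descent)
    if np.linalg.norm(grad) < 1e-6:  # 4. stop once the slope has flattened out
        break

print(theta)  # ends up near [3, -1], the minimum of the toy cost
```

In a real network the gradient would come from backpropagation rather than a hand-written formula, but the update loop is the same.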

The gradient as a vector of nudges (vector interpretation)

For a network with $n$ parameters (e.g. $n = 13{,}002$ for the 3Blue1Brown MNIST example), the gradient, $\nabla C$, is an $n$-dimensional vector (see backpropagation):

$$\nabla C(\theta) = \left( \frac{\partial C}{\partial \theta_1}, \frac{\partial C}{\partial \theta_2}, \ldots, \frac{\partial C}{\partial \theta_n} \right)$$

Each component of $\nabla C$ tells you how to nudge the corresponding component of $\theta$:

  • Sign: nudge this parameter (weight or bias) up or down?
  • Magnitude: how sensitive is the cost to this parameter, relative to others?

This encodes “bang for your buck” — which changes matter most.
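A small worked example (illustrative numbers, not from the source): if, for three parameters,

$$\nabla C = (3.2,\ 0.1,\ -1.5),$$

then stepping along $-\nabla C$ nudges the first parameter down by a lot (the cost is most sensitive to it), the second down only slightly, and the third up.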

Learning rate ($\eta$)

Controls step size. There’s a tension:

| Too large | Too small |
| --- | --- |
| Overshoots minima, oscillates | Converges painfully slowly |

Making steps proportional to the gradient magnitude naturally shrinks steps near minima (where the slope flattens).
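A hand-worked illustration of this (the one-parameter cost is an assumption, not from the source): for $C(w) = w^2$, the update becomes

$$w \leftarrow w - \eta \frac{dC}{dw} = w - 2\eta w = (1 - 2\eta)\,w,$$

so each step is proportional to the current distance from the minimum at $w = 0$ and shrinks automatically as the slope flattens. If $\eta$ is too large ($\eta > 1$ here), $|1 - 2\eta| > 1$ and the iterates overshoot and oscillate outward instead of converging.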

Local minima

Gradient descent finds a local minimum, which may not be the global one. Which minimum you reach depends on the random starting point. In high-dimensional spaces (thousands of parameters), the landscape has many local minima and saddle points.

Stochastic gradient descent (SGD)

Computing the exact gradient requires processing every training example — expensive. SGD approximates it:

  1. Shuffle the training data.
  2. Split into mini-batches (e.g. 100 examples).
  3. Compute the gradient on each mini-batch and take a step.

Each mini-batch gradient is an unbiased but high-variance estimate of the true gradient — it points roughly downhill but zigzags due to sampling noise. The trade-off: each step is much cheaper (1/100th of the full-batch compute when the data is split into 100 mini-batches), so you take many more steps in the same wall-clock time, and the noise averages out over a full pass (epoch) through the data.
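A sketch of the mini-batch loop in Python (the linear least-squares model, synthetic data, batch size of 100 and learning rate are assumptions for illustration; a real network would get the mini-batch gradient from backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X @ w_true + noise (stand-in for a training set).
n_samples, n_features = 10_000, 5
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.01 * rng.normal(size=n_samples)

w = np.zeros(n_features)   # parameters to learn
eta = 0.01                 # learning rate
batch_size = 100

for epoch in range(5):
    order = rng.permutation(n_samples)          # 1. shuffle the training data
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]   # 2. take one mini-batch
        Xb, yb = X[idx], y[idx]
        # 3. gradient of the mean squared error on this mini-batch only
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= eta * grad                          # noisy but cheap step downhill

print(np.allclose(w, w_true, atol=0.05))  # True: w reaches the right region
```

Each inner step uses only 100 of the 10,000 examples, so it is roughly 1/100th the cost of a full-batch gradient, and one epoch still visits every example once.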

Geometric interpretation: Gradient descent vs SGD path

  • Gradient descent (full batch): smooth curve slowly descending to a minimum
  • SGD: jagged zigzag path quickly reaching the same region