Summary: A scalar measure of how badly a neural network (with a given setting of all its weights and biases) performs — the quantity that gradient descent minimises.
Definition
For a single training example with desired output $y$ and actual output $a^{(L)}$, where $L$ is the final (output) layer:

$$C_0 = \sum_j \left(a_j^{(L)} - y_j\right)^2$$

The full cost averages over all $n$ training examples:

$$C = \frac{1}{n}\sum_{k=0}^{n-1} C_k$$
This is the mean squared error (MSE) variant. Other cost functions exist (cross-entropy, etc.), but MSE is the simplest to reason about.
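These formulas translate almost line-for-line into code. A minimal NumPy sketch of the per-example and averaged MSE cost; the toy 10-way output and array shapes below are illustrative assumptions, not part of the notes:

```python
import numpy as np

def example_cost(a_L, y):
    """Cost of one training example: sum of squared differences between
    the output-layer activations a_L and the desired output y."""
    return np.sum((a_L - y) ** 2)

def total_cost(outputs, targets):
    """Mean of the per-example costs over the whole training set.
    `outputs` and `targets` have shape (n_examples, n_outputs)."""
    return np.mean([example_cost(a, y) for a, y in zip(outputs, targets)])

# Toy 10-class output where the network leans towards class 3.
a = np.array([0.1, 0.0, 0.2, 0.8, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0])
y = np.zeros(10); y[3] = 1.0       # desired output: a "3"
print(example_cost(a, y))          # small-ish: mostly right, slightly fuzzy
```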
What it encodes
- Inputs: all the weights and biases (13,002 for the MNIST example network).
- Process: the labelled training data is not an input but a fixed set of parameters baked into the function. Running the network on a training example produces a prediction, which is compared against the corresponding label to compute that example's cost (see the sketch after this list).
- Output: a single number — lower is better.
- Intuition: The cost is small when the network confidently gives the right answer and large when it’s confused or wrong.
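One way to picture this "parameters in, single number out" view is to capture the training data in a closure. The sketch below assumes a hypothetical `forward(params, x)` routine and a made-up tiny network purely for illustration:

```python
import numpy as np

def make_cost_function(training_inputs, training_labels, forward):
    """Bake the labelled training data in as fixed parameters; the only
    remaining input is the vector of weights and biases."""
    def cost(params):
        total = 0.0
        for x, y in zip(training_inputs, training_labels):
            a_L = forward(params, x)          # prediction for this example
            total += np.sum((a_L - y) ** 2)   # squared error for this example
        return total / len(training_inputs)   # one number: lower is better
    return cost

# Illustrative stand-in for a real network: a single linear layer.
def tiny_forward(params, x):
    W = params[:20].reshape(2, 10)            # 2 inputs -> 10 outputs
    b = params[20:30]
    return x @ W + b

xs = np.random.rand(5, 2)                     # 5 fake training inputs
ys = np.eye(10)[np.random.randint(0, 10, 5)]  # 5 fake one-hot labels
C = make_cost_function(xs, ys, tiny_forward)
print(C(np.random.randn(30)))                 # cost of one random parameter setting
```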
Why squared differences?
- Squaring ensures all errors are positive (no cancellation between over- and under-predictions).
- Large errors are penalised disproportionately — a prediction that’s off by 0.9 costs 81× more than one off by 0.1.
- The function is smooth and differentiable everywhere, which is essential for gradient descent to work (a quick numeric check of these points follows this list).
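A quick numeric check of the 81× claim and of the well-behaved derivative; the specific error values are just illustrative:

```python
errors = [0.1, 0.9]
squared = [e ** 2 for e in errors]     # [0.01, 0.81] -- always positive
print(squared[1] / squared[0])         # 81.0: the larger error costs 81x more
# The derivative 2*e exists for every e, unlike |e|, which has a kink at 0.
print([2 * e for e in errors])         # gradient grows linearly with the error
```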
Relationship to the gradient
The backpropagation algorithm computes $\nabla C$ — the gradient of the cost with respect to every weight and bias. The factor $2\,(a^{(L)} - y)$ appears at the end of every chain-rule expression, meaning the gradient signal is strongest when the prediction is furthest from the target.
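A one-neuron sketch of where that factor enters the chain rule; the sigmoid activation and the specific numbers are illustrative assumptions rather than anything fixed by the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One output neuron: a = sigmoid(w*x + b), cost C = (a - y)^2
w, b, x, y = 0.5, 0.1, 1.0, 1.0

z = w * x + b
a = sigmoid(z)

dC_da = 2 * (a - y)        # the factor contributed by the squared-error cost
da_dz = a * (1 - a)        # derivative of the sigmoid
dC_dw = dC_da * da_dz * x  # chain rule: cost -> activation -> z -> weight
dC_db = dC_da * da_dz      # chain rule: cost -> activation -> z -> bias

print(dC_dw, dC_db)        # both scale with how far a is from the target y
```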