Summary: A scalar measure of how badly a neural network (with a given setting of all its weights and biases) performs — the quantity that gradient descent minimises.
Definition
For a single training example with desired output $y$ and actual output $a^{(L)}$, where $L$ is the final (output) layer:

$$C_0 = \sum_j \left(a_j^{(L)} - y_j\right)^2$$

The full cost averages over all $n$ training examples:

$$C = \frac{1}{n}\sum_{k=0}^{n-1} C_k$$
This is the mean squared error (MSE) variant. Other cost functions exist (cross-entropy, etc.), but MSE is the simplest to reason about.
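These formulas translate almost line-for-line into code. A minimal NumPy sketch of the per-example and averaged MSE cost; the toy 10-way output and array shapes below are illustrative assumptions, not part of the notes:

```python
import numpy as np

def example_cost(a_L, y):
    """Cost of one training example: sum of squared differences between
    the output-layer activations a_L and the desired output y."""
    return np.sum((a_L - y) ** 2)

def total_cost(outputs, targets):
    """Mean of the per-example costs over the whole training set.
    `outputs` and `targets` have shape (n_examples, n_outputs)."""
    return np.mean([example_cost(a, y) for a, y in zip(outputs, targets)])

# Toy 10-class output where the network leans towards class 3.
a = np.array([0.1, 0.0, 0.2, 0.8, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0])
y = np.zeros(10); y[3] = 1.0       # desired output: a "3"
print(example_cost(a, y))          # small-ish: mostly right, slightly fuzzy
```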
What it encodes
- Inputs: all the weights and biases (13,002 for the MNIST example network).
- Process: the labelled training data is not an input but a fixed set of parameters baked into the function. Running the network on a training example produces a prediction, which is compared against the corresponding label to compute that example's cost (see the sketch after this list).
- Output: a single number — lower is better.
- Intuition: The cost is small when the network confidently gives the right answer and large when it’s confused or wrong.
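One way to picture this "parameters in, single number out" view is to capture the training data in a closure. The sketch below assumes a hypothetical `forward(params, x)` routine and a made-up tiny network purely for illustration:

```python
import numpy as np

def make_cost_function(training_inputs, training_labels, forward):
    """Bake the labelled training data in as fixed parameters; the only
    remaining input is the vector of weights and biases."""
    def cost(params):
        total = 0.0
        for x, y in zip(training_inputs, training_labels):
            a_L = forward(params, x)          # prediction for this example
            total += np.sum((a_L - y) ** 2)   # squared error for this example
        return total / len(training_inputs)   # one number: lower is better
    return cost

# Illustrative stand-in for a real network: a single linear layer.
def tiny_forward(params, x):
    W = params[:20].reshape(2, 10)            # 2 inputs -> 10 outputs
    b = params[20:30]
    return x @ W + b

xs = np.random.rand(5, 2)                     # 5 fake training inputs
ys = np.eye(10)[np.random.randint(0, 10, 5)]  # 5 fake one-hot labels
C = make_cost_function(xs, ys, tiny_forward)
print(C(np.random.randn(30)))                 # cost of one random parameter setting
```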
Why squared differences?
- Squaring ensures all errors are positive (no cancellation between over- and under-predictions).
- Large errors are penalised disproportionately — a prediction that’s off by 0.9 costs 81× more than one off by 0.1.
- The function is smooth and differentiable everywhere, which is essential for gradient descent to work (a quick numeric check of these points follows this list).
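A quick numeric check of the 81× claim and of the well-behaved derivative; the specific error values are just illustrative:

```python
errors = [0.1, 0.9]
squared = [e ** 2 for e in errors]     # [0.01, 0.81] -- always positive
print(squared[1] / squared[0])         # 81.0: the larger error costs 81x more
# The derivative 2*e exists for every e, unlike |e|, which has a kink at 0.
print([2 * e for e in errors])         # gradient grows linearly with the error
```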
Relationship to the gradient
The backpropagation algorithm computes $\nabla C$ — the gradient of the cost with respect to every weight and bias. The factor $2\,(a^{(L)} - y)$ appears at the end of every chain-rule expression, meaning the gradient signal is strongest when the prediction is furthest from the target.
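A one-neuron sketch of where that factor enters the chain rule; the sigmoid activation and the specific numbers are illustrative assumptions rather than anything fixed by the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One output neuron: a = sigmoid(w*x + b), cost C = (a - y)^2
w, b, x, y = 0.5, 0.1, 1.0, 1.0

z = w * x + b
a = sigmoid(z)

dC_da = 2 * (a - y)        # the factor contributed by the squared-error cost
da_dz = a * (1 - a)        # derivative of the sigmoid
dC_dw = dC_da * da_dz * x  # chain rule: cost -> activation -> z -> weight
dC_db = dC_da * da_dz      # chain rule: cost -> activation -> z -> bias

print(dC_dw, dC_db)        # both scale with how far a is from the target y
```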