Summary: Introduces the structure of a feed-forward neural network through the lens of handwritten-digit recognition (MNIST).

Key ideas

  • A neural network is ultimately just a function: 784 inputs (pixel brightnesses) → 10 outputs (digit scores), parameterised by ~13,000 weights and biases (the count is worked out in the sketch after this list).
  • A neuron holds a single number (its activation, between 0 and 1). The activation of each neuron in a layer is a weighted sum of the previous layer’s activations, plus a bias, passed through an activation function (e.g. sigmoid maps the sum into (0, 1); ReLU zeroes out negatives and passes positives unchanged).
  • Layers decompose hard problems into sub-problems. The hope is that early layers detect edges, middle layers detect patterns (loops, lines), and the final layer maps patterns to digits. Whether the trained network actually does this is a separate question (see src-3b1b-neural-networks-ch3).
  • Weights encode what pattern a neuron looks for; the bias encodes how strong the signal must be before the neuron fires.
  • Negative weights in the surround let a neuron detect edges (bright pixels flanked by dark ones) rather than just blobs; see the single-neuron sketch after this list.
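
Where the ~13,000 comes from, assuming the 784-16-16-10 architecture from the source video (two hidden layers of 16 neurons each):

    weights: 784×16 + 16×16 + 16×10 = 12,544 + 256 + 160 = 12,960
    biases:  16 + 16 + 10 = 42
    total:   13,002 parameters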
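
A minimal single-neuron sketch in Python (NumPy assumed; the weights and bias are hypothetical, not taken from the source): the weight pattern is positive in the centre and negative in the surround, so a bright band flanked by dark pixels drives the weighted sum up, while a uniformly bright blob cancels out.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical 1-D edge detector: negative surround, positive centre.
    w = np.array([-1.0, -1.0, 2.0, 2.0, -1.0, -1.0])
    b = -1.0  # bias: the weighted sum must clear 1 before the neuron meaningfully fires

    edge = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # bright band on a dark background
    blob = np.ones(6)                                # uniformly bright patch

    print(sigmoid(w @ edge + b))  # ~0.95 (weighted sum 4, well past the bias)
    print(sigmoid(w @ blob + b))  # ~0.27 (weighted sum 0, below the bias)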

Activation functions

  • The sigmoid, σ(x) = 1/(1 + e^(−x)), maps any real number into (0, 1): large negative inputs → ~0, large positive → ~1. Problem: at the extremes the derivative is near zero, so during backpropagation the gradient vanishes and weights barely update (illustrated in the sketch after this list).
  • ReLU, ReLU(x) = max(0, x), avoids saturation for positive inputs, making training faster; it is the practical default today.
  • Some nonlinearity is essential; without it the entire network collapses to a single linear map, which is far less expressive.
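
A quick numeric illustration of the saturation argument (a sketch assuming NumPy; the test points are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)  # peaks at 0.25 when z = 0, vanishes at the extremes

    def relu_grad(z):
        return 1.0 if z > 0 else 0.0  # constant 1 for any positive input

    for z in [0.0, 5.0, 10.0]:
        print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.6f}  ReLU'={relu_grad(z):.0f}")

    # At z = 10, sigmoid' ≈ 0.000045: gradients multiplied through a saturated
    # sigmoid all but vanish, while ReLU' stays 1 for any positive input.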

Compact notation

The full layer transition is one matrix-vector operation:

    a′ = σ(W a + b)

where W is the weight matrix (rows = next-layer neurons, cols = current-layer neurons), a is the current layer’s activation vector, b is the bias vector, and σ is applied element-wise.
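
A minimal NumPy sketch chaining this transition into a full forward pass (the 784-16-16-10 shapes follow the architecture above; the random weights are stand-ins for trained values):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    sizes = [784, 16, 16, 10]

    # Each W has rows = next-layer neurons, cols = current-layer neurons, as above.
    Ws = [rng.normal(0.0, 0.1, (n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
    bs = [np.zeros(n_out) for n_out in sizes[1:]]

    def forward(a):
        for W, b in zip(Ws, bs):
            a = sigmoid(W @ a + b)  # one layer transition: a′ = σ(W a + b)
        return a

    scores = forward(rng.random(784))     # a fake 28×28 image flattened to 784 values
    print(scores.shape, scores.argmax())  # (10,), index of the highest digit score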