Summary: Introduces the structure of a feed-forward neural network through the lens of handwritten-digit recognition (MNIST).

Key ideas

  • A neural network is ultimately just a function: 784 inputs (pixel brightnesses) → 10 outputs (digit scores), parameterised by ~13,000 weights and biases (the count is worked out in the sketch after this list).
  • A neuron holds a single number (its activation, between 0 and 1). The activation of each neuron in a layer is a weighted sum of the previous layer’s activations, plus a bias, passed through an activation function (e.g. sigmoid maps the sum into (0, 1); ReLU zeroes out negatives and passes positives unchanged).
  • Layers decompose hard problems into sub-problems. The hope is that early layers detect edges, middle layers detect patterns (loops, lines), and the final layer maps patterns to digits. Whether the trained network actually does this is a separate question (see src-3b1b-neural-networks-ch3).
  • Weights encode what pattern a neuron looks for; the bias encodes how strong the signal must be before the neuron fires.
  • Negative weights in the surround let a neuron detect edges (bright pixels flanked by dark ones) rather than just blobs; see the single-neuron sketch after this list.
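
Where the ~13,000 comes from, assuming the 784-16-16-10 architecture from the source video (two hidden layers of 16 neurons each):

    weights: 784×16 + 16×16 + 16×10 = 12,544 + 256 + 160 = 12,960
    biases:  16 + 16 + 10 = 42
    total:   13,002 parameters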
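
A minimal single-neuron sketch in Python (NumPy assumed; the weights and bias are hypothetical, not taken from the source): the weight pattern is positive in the centre and negative in the surround, so a bright band flanked by dark pixels drives the weighted sum up, while a uniformly bright blob cancels out.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical 1-D edge detector: negative surround, positive centre.
    w = np.array([-1.0, -1.0, 2.0, 2.0, -1.0, -1.0])
    b = -1.0  # bias: the weighted sum must clear 1 before the neuron meaningfully fires

    edge = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # bright band on a dark background
    blob = np.ones(6)                                # uniformly bright patch

    print(sigmoid(w @ edge + b))  # ~0.95 (weighted sum 4, well past the bias)
    print(sigmoid(w @ blob + b))  # ~0.27 (weighted sum 0, below the bias)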

Activation functions

  • The sigmoid, σ(x) = 1/(1 + e^(−x)), maps any real number into (0, 1): large negative inputs → ~0, large positive → ~1. Problem: at the extremes the derivative is near zero, so during backpropagation the gradient vanishes and weights barely update (illustrated in the sketch after this list).
  • ReLU, ReLU(x) = max(0, x), avoids saturation for positive inputs, making training faster; it is the practical default today.
  • Some nonlinearity is essential; without it the entire network collapses to a single linear map, which is far less expressive.
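
A quick numeric illustration of the saturation argument (a sketch assuming NumPy; the test points are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        s = sigmoid(z)
        return s * (1.0 - s)  # peaks at 0.25 when z = 0, vanishes at the extremes

    def relu_grad(z):
        return 1.0 if z > 0 else 0.0  # constant 1 for any positive input

    for z in [0.0, 5.0, 10.0]:
        print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.6f}  ReLU'={relu_grad(z):.0f}")

    # At z = 10, sigmoid' ≈ 0.000045: gradients multiplied through a saturated
    # sigmoid all but vanish, while ReLU' stays 1 for any positive input.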

Compact notation

The full layer transition is one matrix-vector operation:

    a′ = σ(W a + b)

where W is the weight matrix (rows = next-layer neurons, cols = current-layer neurons), a is the current layer’s activation vector, b is the bias vector, and σ is applied element-wise.
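
A minimal NumPy sketch chaining this transition into a full forward pass (the 784-16-16-10 shapes follow the architecture above; the random weights are stand-ins for trained values):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    sizes = [784, 16, 16, 10]

    # Each W has rows = next-layer neurons, cols = current-layer neurons, as above.
    Ws = [rng.normal(0.0, 0.1, (n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
    bs = [np.zeros(n_out) for n_out in sizes[1:]]

    def forward(a):
        for W, b in zip(Ws, bs):
            a = sigmoid(W @ a + b)  # one layer transition: a′ = σ(W a + b)
        return a

    scores = forward(rng.random(784))     # a fake 28×28 image flattened to 784 values
    print(scores.shape, scores.argmax())  # (10,), index of the highest digit score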