Summary: A computational graph of nodes (neurons) connected by weighted edges, trained to approximate functions by adjusting its weights and biases.
Mental model
The network as a whole is a parameterised function $f_\theta(x)$, where $\theta$ is the collection of all weights and biases. Training adjusts $\theta$ to minimise a cost-function via gradient-descent, with gradients computed by backpropagation.
Primer: Example network (image and specification)
Each neuron in the network uses a non-linear activation function (e.g. sigmoid, ReLU) to make the network highly expressive, i.e. able to approximate essentially any function, even from messy data. See Neural networks and deep learning - Chapter 4: A visual proof that neural nets can compute any function
*(figure: example network diagram)*
For the example network above (notation from Michael Nielsen, Ch. 2 warm-up):
- Index convention:
  - $w^l_{jk}$ = weight into neuron $j$ of layer $l$ from neuron $k$ of layer $l-1$.
  - $b^l_j$ = bias on neuron $j$ in layer $l$.
  - $a^l_j$ = activation of neuron $j$ in layer $l$.
- Per-neuron update: $a^l_j = \sigma\!\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right)$
  - The activation $a^l_j$ ($j$-th neuron, layer $l$) is the weighted sum of all neuron activations in the previous layer, plus the neuron's bias, passed through the activation function $\sigma$.
- Matrix form (vectorise over all neurons in layer $l$ at once): $a^l = \sigma(W^l a^{l-1} + b^l)$
  - where
    - $W^l$ is the (incoming) weight matrix for layer $l$ — row $j$ contains all weights feeding into neuron $j$.
    - $a^{l-1}$ is the vector of activations of the previous layer,
    - $b^l$ is the bias vector for the current layer.
- Shape check: if layer $l$ has $n$ neurons and layer $l-1$ has $m$ neurons, then
  - $W^l \in \mathbb{R}^{n \times m}$,
  - $a^{l-1} \in \mathbb{R}^{m}$,
  - $b^l \in \mathbb{R}^{n}$,
  - so $a^l = \sigma(W^l a^{l-1} + b^l) \in \mathbb{R}^{n}$. ✓
- Pre-activation (weighted input): Define $z^l = W^l a^{l-1} + b^l$, so $a^l = \sigma(z^l)$. The notation $z^l$ is used heavily in backpropagation — it’s what you differentiate through before applying the nonlinearity (the activation-function).
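A minimal sketch of one layer's forward pass in PyTorch, following the column-vector convention above (the layer sizes are made up for illustration):

```python
import torch

torch.manual_seed(0)

m, n = 3, 2                # layer l-1 has m neurons, layer l has n (made-up sizes)
W = torch.randn(n, m)      # W^l: row j holds all weights feeding into neuron j
b = torch.randn(n, 1)      # b^l: one bias per neuron in layer l
a_prev = torch.rand(m, 1)  # a^{l-1}: previous layer's activations (column vector)

z = W @ a_prev + b         # pre-activation z^l = W^l a^{l-1} + b^l, shape (n, 1)
a = torch.sigmoid(z)       # activation a^l = sigma(z^l), shape (n, 1)
assert a.shape == (n, 1)
```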
Questions about network structure
- Weights live on incoming arcs: Each connection into a neuron carries a weight $w^l_{jk}$
- Biases live on destination neurons: Each neuron after the input layer has a single bias $b^l_j$ added once to its weighted sum (i.e. one bias per neuron, not per incoming arc).
  - Together, the weights and the bias form a neuron’s “weighted input” $z^l_j = \sum_k w^l_{jk}\, a^{l-1}_k + b^l_j$ (the pre-activation value, before the activation-function)
- I/O layers:
- Input layer: pass-through. No weights, no biases, no activation function. Only holds raw data values (e.g. pixel brightness)
- Output layer: has incoming weights, biases, and an activation function — usually chosen to match the task (sigmoid/softmax for classification, linear for regression).
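A small sketch of matching the output activation to the task (the shapes and values here are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)       # hypothetical output pre-activations: 4 examples, 10 neurons

probs = F.softmax(logits, dim=1)  # classification: each row becomes a probability distribution
scores = torch.sigmoid(logits)    # binary / multi-label classification: per-neuron values in (0, 1)
y_hat = logits                    # regression: linear output, no activation applied
```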
The above is the standard MLP convention. CNNs, RNNs, and Transformers have variations — see architecture-bias-and-weight-conventions.
PyTorch convention: Why `xenc @ W` and not `W @ xenc`?
See use in notebook: 04_from_bigrams_to_nns
In theory, the activation formula is $a = \sigma(Wx + b)$, where
- $x$ is a column vector (in our example a 27-dimensional one-hot encoding) and
- $W$ left-multiplies it.
- See “pre-activation” in neural-network (dropdowns), and multilayer-perceptron
In practice, inputs are stacked as a batch of row vectors: `xenc` $\in \mathbb{R}^{N \times 27}$, one training example per row. So the order flips to $A = \sigma(XW + b)$ (i.e. `xenc @ W`) to keep the inner dimensions compatible, where:
- $X$ (= `xenc`) $\in \mathbb{R}^{N \times 27}$,
- $W \in \mathbb{R}^{27 \times 27}$, and
- $b \in \mathbb{R}^{1 \times 27}$ (broadcast across rows → stretching to $N \times 27$)
- Resulting in an output activation matrix $A \in \mathbb{R}^{N \times 27}$ containing:
  - one row per training example and
  - one column per neuron
Each row of the output is one training example’s activations across all 27 neurons. This row-major batch convention is the default across PyTorch, TensorFlow, and NumPy.
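A sketch of this batch convention (variable names follow the notebook; the batch of character indices and the use of $\sigma$ here are made up for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

xs = torch.tensor([0, 5, 13, 1])              # made-up batch of N=4 character indices
xenc = F.one_hot(xs, num_classes=27).float()  # (4, 27): one one-hot row vector per example

W = torch.randn(27, 27)                       # column j holds the weights into neuron j
b = torch.randn(1, 27)                        # bias row, broadcast down the batch dimension

A = torch.sigmoid(xenc @ W + b)               # (4, 27) @ (27, 27) + (1, 27) -> (4, 27)
assert A.shape == (4, 27)                     # one row per example, one column per neuron
```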
Core abstraction
A neural network is a directed graph where:
- Nodes (neurons) hold scalar values called activations — the output of the neuron’s activation-function, and the number that gets passed forward along outgoing edges.
- Edges carry weights — each weight scales the activation flowing along that connection.
- Each neuron computes a weighted sum of its inputs, adds a bias, and passes the result through a nonlinear activation-function to produce its activation.
The network as a whole is a parameterised function $f_\theta(x)$, where $\theta$ is the collection of all weights and biases. Training adjusts $\theta$ to minimise a cost-function via gradient-descent, with gradients computed by backpropagation.
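A minimal sketch of that training loop using PyTorch autograd (the tiny 1-3-1 network, synthetic data, and learning rate are all invented for illustration):

```python
import torch

torch.manual_seed(0)

# Synthetic regression data: learn y = 2x from 16 noisy points (made up)
x = torch.randn(16, 1)
y = 2 * x + 0.01 * torch.randn(16, 1)

# theta = all weights and biases of a tiny 1-3-1 MLP
W1 = torch.randn(1, 3, requires_grad=True)
b1 = torch.zeros(3, requires_grad=True)
W2 = torch.randn(3, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

lr = 0.1
for step in range(200):
    a1 = torch.sigmoid(x @ W1 + b1)   # hidden-layer activations
    y_hat = a1 @ W2 + b2              # linear output (regression)
    cost = ((y_hat - y) ** 2).mean()  # mean-squared-error cost-function

    cost.backward()                   # backpropagation: d(cost)/d(theta)
    with torch.no_grad():             # gradient-descent step on every parameter
        for p in (W1, b1, W2, b2):
            p -= lr * p.grad
            p.grad = None             # reset for the next iteration
```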
Single neuron diagram
Historically, two types of artificial neuron are foundational (Nielsen, Ch. 1): the perceptron, which uses a step function (output is 0 or 1 depending on whether the weighted sum exceeds a threshold), and the sigmoid neuron, which uses the sigmoid function to produce a continuous output in $(0, 1)$. The sigmoid neuron’s smooth, differentiable output is what makes gradient-descent and backpropagation possible — you can’t take useful gradients through a hard step. Modern networks generalise this idea with other activation functions (ReLU, GELU, etc.), but the sigmoid neuron is the historical bridge from the perceptron to trainable deep networks.
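A tiny sketch contrasting the two neuron types on the same weighted input (all numbers are arbitrary):

```python
import torch

w = torch.tensor([0.7, -0.3])  # made-up weights
x = torch.tensor([1.0, 1.0])   # made-up inputs
b = torch.tensor(-0.2)         # bias (acts as a negative threshold)

z = w @ x + b                  # weighted input z = w . x + b

step = (z > 0).float()         # perceptron: hard 0/1 output; zero gradient almost everywhere
smooth = torch.sigmoid(z)      # sigmoid neuron: output in (0, 1), differentiable everywhere
```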
Terminology
| Term | Meaning |
|---|---|
| Activation | The scalar value a neuron holds |
| Weight | Scalar multiplier on a connection between two neurons |
| Bias | Additive offset — shifts how large the weighted sum must be before the neuron activates |
| Layer | A group of neurons at the same depth in the graph |
| Hidden layer | Any layer between input and output |
Architectures
Different ways of wiring neurons give rise to different architectures, each suited to different data and tasks:
- multilayer-perceptron — Fully-connected feed-forward network. Every neuron in one layer connects to every neuron in the next. The simplest architecture and the foundation for understanding all others.
- CNN (convolutional neural network) — Weight-sharing and local connectivity for spatial data (images). (page TBD)
- RNN (recurrent neural network) — Connections that loop back, giving the network memory over sequences. (page TBD)
- transformer-architecture — Attention-based architecture; no recurrence, processes sequences in parallel.