Summary: A computational graph of nodes (neurons) connected by weighted edges, trained to approximate functions by adjusting its weights and biases.
Mental model
The network as a whole is a parameterised function $f_\theta(x)$, where $\theta$ is the collection of all weights and biases. Training adjusts $\theta$ to minimise a cost-function via gradient-descent, with gradients computed by backpropagation.
Primer: Example network (image and specification)
Each neuron in the network uses a non-linear activation function (e.g. sigmoid, ReLU) to make the network highly expressive, i.e. able to approximate essentially any function, even from messy data. See Neural networks and deep learning - Chapter 4: A visual proof that neural nets can compute any function
*(figure: example network diagram)*
For the example network above (notation from Michael Nielsen, Ch. 2 warm-up):
- Index convention:
  - $w^l_{jk}$ = weight into neuron $j$ of layer $l$ from neuron $k$ of layer $l-1$.
  - $b^l_j$ = bias on neuron $j$ in layer $l$.
  - $a^l_j$ = activation of neuron $j$ in layer $l$.
- Per-neuron update: $a^l_j = \sigma\!\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right)$
  - The activation $a^l_j$ ($j$-th neuron, layer $l$) is the weighted sum of all neuron activations in the previous layer, plus the neuron's bias, passed through the activation function $\sigma$.
- Matrix form (vectorise over all neurons in layer $l$ at once): $a^l = \sigma(W^l a^{l-1} + b^l)$
  - where
    - $W^l$ is the (incoming) weight matrix for layer $l$ — row $j$ contains all weights feeding into neuron $j$.
    - $a^{l-1}$ is the vector of activations of the previous layer,
    - $b^l$ is the bias vector for the current layer.
- Shape check: if layer $l$ has $n$ neurons and layer $l-1$ has $m$ neurons, then
  - $W^l \in \mathbb{R}^{n \times m}$,
  - $a^{l-1} \in \mathbb{R}^{m}$,
  - $b^l \in \mathbb{R}^{n}$,
  - so $a^l = \sigma(W^l a^{l-1} + b^l) \in \mathbb{R}^{n}$. ✓
- Pre-activation (weighted input): Define $z^l = W^l a^{l-1} + b^l$, so $a^l = \sigma(z^l)$. The notation $z^l$ is used heavily in backpropagation — it’s what you differentiate through before applying the nonlinearity (the activation-function).
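A minimal sketch of one layer's forward pass in PyTorch, following the column-vector convention above (the layer sizes are made up for illustration):

```python
import torch

torch.manual_seed(0)

m, n = 3, 2                # layer l-1 has m neurons, layer l has n (made-up sizes)
W = torch.randn(n, m)      # W^l: row j holds all weights feeding into neuron j
b = torch.randn(n, 1)      # b^l: one bias per neuron in layer l
a_prev = torch.rand(m, 1)  # a^{l-1}: previous layer's activations (column vector)

z = W @ a_prev + b         # pre-activation z^l = W^l a^{l-1} + b^l, shape (n, 1)
a = torch.sigmoid(z)       # activation a^l = sigma(z^l), shape (n, 1)
assert a.shape == (n, 1)
```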
Questions about network structure
- Weights live on incoming arcs: Each connection into a neuron carries a weight $w^l_{jk}$
- Biases live on destination neurons: Each neuron after the input layer has a single bias $b^l_j$ added once to its weighted sum (i.e. one bias per neuron, not per incoming arc).
  - Together, the weights and the bias form a neuron’s “weighted input” $z^l_j = \sum_k w^l_{jk}\, a^{l-1}_k + b^l_j$ (the pre-activation value, before the activation-function)
- I/O layers:
- Input layer: pass-through. No weights, no biases, no activation function. Only holds raw data values (e.g. pixel brightness)
- Output layer: has incoming weights, biases, and an activation function — usually chosen to match the task (sigmoid/softmax for classification, linear for regression).
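A small sketch of matching the output activation to the task (the shapes and values here are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)       # hypothetical output pre-activations: 4 examples, 10 neurons

probs = F.softmax(logits, dim=1)  # classification: each row becomes a probability distribution
scores = torch.sigmoid(logits)    # binary / multi-label classification: per-neuron values in (0, 1)
y_hat = logits                    # regression: linear output, no activation applied
```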
The above is the standard MLP convention. CNNs, RNNs, and Transformers have variations — see architecture-bias-and-weight-conventions.
PyTorch convention: Why `xenc @ W` and not `W @ xenc`?
See use in notebook: 04_from_bigrams_to_nns
In theory, the activation formula is $a = \sigma(Wx + b)$, where
- $x$ is a column vector (in our example a 27-dimensional one-hot encoding) and
- $W$ left-multiplies it.
- See “pre-activation” in neural-network (dropdowns), and multilayer-perceptron
In practice, inputs are stacked as a batch of row vectors: `xenc` $\in \mathbb{R}^{N \times 27}$, one training example per row. So the order flips to $A = \sigma(XW + b)$ (i.e. `xenc @ W`) to keep the inner dimensions compatible, where:
- $X$ (= `xenc`) $\in \mathbb{R}^{N \times 27}$,
- $W \in \mathbb{R}^{27 \times 27}$, and
- $b \in \mathbb{R}^{1 \times 27}$ (broadcast across rows → stretching to $N \times 27$)
- Resulting in an output activation matrix $A \in \mathbb{R}^{N \times 27}$ containing:
  - one row per training example and
  - one column per neuron
Each row of the output is one training example’s activations across all 27 neurons. This row-major batch convention is the default across PyTorch, TensorFlow, and NumPy.
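A sketch of this batch convention (variable names follow the notebook; the batch of character indices and the use of $\sigma$ here are made up for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

xs = torch.tensor([0, 5, 13, 1])              # made-up batch of N=4 character indices
xenc = F.one_hot(xs, num_classes=27).float()  # (4, 27): one one-hot row vector per example

W = torch.randn(27, 27)                       # column j holds the weights into neuron j
b = torch.randn(1, 27)                        # bias row, broadcast down the batch dimension

A = torch.sigmoid(xenc @ W + b)               # (4, 27) @ (27, 27) + (1, 27) -> (4, 27)
assert A.shape == (4, 27)                     # one row per example, one column per neuron
```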
Core abstraction
A neural network is a directed graph where:
- Nodes (neurons) hold scalar values called activations — the output of the neuron’s activation-function, and the number that gets passed forward along outgoing edges.
- Edges carry weights — each weight scales the activation flowing along that connection.
- Each neuron computes a weighted sum of its inputs, adds a bias, and passes the result through a nonlinear activation-function to produce its activation.
The network as a whole is a parameterised function $f_\theta(x)$, where $\theta$ is the collection of all weights and biases. Training adjusts $\theta$ to minimise a cost-function via gradient-descent, with gradients computed by backpropagation.
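A minimal sketch of that training loop using PyTorch autograd (the tiny 1-3-1 network, synthetic data, and learning rate are all invented for illustration):

```python
import torch

torch.manual_seed(0)

# Synthetic regression data: learn y = 2x from 16 noisy points (made up)
x = torch.randn(16, 1)
y = 2 * x + 0.01 * torch.randn(16, 1)

# theta = all weights and biases of a tiny 1-3-1 MLP
W1 = torch.randn(1, 3, requires_grad=True)
b1 = torch.zeros(3, requires_grad=True)
W2 = torch.randn(3, 1, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

lr = 0.1
for step in range(200):
    a1 = torch.sigmoid(x @ W1 + b1)   # hidden-layer activations
    y_hat = a1 @ W2 + b2              # linear output (regression)
    cost = ((y_hat - y) ** 2).mean()  # mean-squared-error cost-function

    cost.backward()                   # backpropagation: d(cost)/d(theta)
    with torch.no_grad():             # gradient-descent step on every parameter
        for p in (W1, b1, W2, b2):
            p -= lr * p.grad
            p.grad = None             # reset for the next iteration
```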
Single neuron diagram
Historically, two types of artificial neuron are foundational (Nielsen, Ch. 1): the perceptron, which uses a step function (output is 0 or 1 depending on whether the weighted sum exceeds a threshold), and the sigmoid neuron, which uses the sigmoid function to produce a continuous output in $(0, 1)$. The sigmoid neuron’s smooth, differentiable output is what makes gradient-descent and backpropagation possible — you can’t take useful gradients through a hard step. Modern networks generalise this idea with other activation functions (ReLU, GELU, etc.), but the sigmoid neuron is the historical bridge from the perceptron to trainable deep networks.
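A tiny sketch contrasting the two neuron types on the same weighted input (all numbers are arbitrary):

```python
import torch

w = torch.tensor([0.7, -0.3])  # made-up weights
x = torch.tensor([1.0, 1.0])   # made-up inputs
b = torch.tensor(-0.2)         # bias (acts as a negative threshold)

z = w @ x + b                  # weighted input z = w . x + b

step = (z > 0).float()         # perceptron: hard 0/1 output; zero gradient almost everywhere
smooth = torch.sigmoid(z)      # sigmoid neuron: output in (0, 1), differentiable everywhere
```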
Terminology
| Term | Meaning |
|---|---|
| Activation | The scalar value a neuron holds |
| Weight | Scalar multiplier on a connection between two neurons |
| Bias | Additive offset — shifts how large the weighted sum must be before the neuron activates |
| Layer | A group of neurons at the same depth in the graph |
| Hidden layer | Any layer between input and output |
Architectures
Different ways of wiring neurons give rise to different architectures, each suited to different data and tasks:
- multilayer-perceptron — Fully-connected feed-forward network. Every neuron in one layer connects to every neuron in the next. The simplest architecture and the foundation for understanding all others.
- CNN (convolutional neural network) — Weight-sharing and local connectivity for spatial data (images). (page TBD)
- RNN (recurrent neural network) — Connections that loop back, giving the network memory over sequences. (page TBD)
- transformer-architecture — Attention-based architecture; no recurrence, processes sequences in parallel.