Summary: The simplest neural-network architecture — fully-connected layers stacked in sequence, where information flows strictly from input to output with no cycles.
MLP vs "feed-forward network"
At this level of abstraction, “multilayer perceptron” and “feed-forward neural network” refer to the same thing: layers of neurons where every neuron in layer $l$ connects to every neuron in layer $l+1$, and information flows in one direction only. The name “perceptron” is historical — an MLP has little to do with Rosenblatt’s single-layer perceptron beyond the lineage.
Layer computation
Single neuron
A single neuron $j$ in layer $l$ computes two steps. First, the pre-activation $z_j^{(l)}$ — a weighted sum of all inputs from the previous layer, plus a bias:

$$z_j^{(l)} = \sum_{k} w_{jk}^{(l)} \, a_k^{(l-1)} + b_j^{(l)}$$

Then the activation $a_j^{(l)}$ — pass $z_j^{(l)}$ through a nonlinear activation-function $\sigma$ (e.g. ReLU or sigmoid):

$$a_j^{(l)} = \sigma\!\left(z_j^{(l)}\right)$$
Index conventions:
- Superscript $(l)$ — which layer’s parameters are being used. $W^{(l)}$ and $b^{(l)}$ are the weights and biases that transform layer $l-1$ into layer $l$. The superscript on the parameters labels the transition — a network has one weight matrix per transition, and by convention $W^{(l)}$ carries the index of the layer it feeds into.
- Subscripts on $w_{jk}^{(l)}$ — $j$ indexes the destination neuron (in layer $l$), $k$ indexes the source neuron (in layer $l-1$). Row $j$ of the weight matrix contains all the weights feeding into neuron $j$.
- Subscript $j$ on $z_j^{(l)}$ and $a_j^{(l)}$ — which neuron within layer $l$.
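A minimal sketch of those two steps for one neuron in NumPy (the helper name `single_neuron` and the toy values are mine, not from the source; ReLU stands in for the generic $\sigma$):

```python
import numpy as np

def single_neuron(a_prev, w_j, b_j):
    """One neuron j in layer l: weighted sum of the previous layer's
    activations plus a bias, then a nonlinearity (ReLU here)."""
    z_j = np.dot(w_j, a_prev) + b_j   # pre-activation: sum_k w_jk * a_k + b_j
    a_j = max(z_j, 0.0)               # activation: ReLU(z_j)
    return a_j

# Example: 3 inputs feeding one neuron
a_prev = np.array([0.2, 0.8, 0.5])   # activations from layer l-1
w_j = np.array([1.0, -2.0, 0.5])     # row j of W^(l): incoming weights of neuron j
b_j = 0.1
print(single_neuron(a_prev, w_j, b_j))  # ReLU(-1.05) = 0.0
```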
Matrix form (full layer at once)
The same computation for all neurons in a layer, written as a matrix-vector operation. Separate the two steps:
Step 1 — Pre-activation (linear):

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$$

Step 2 — Activation (nonlinear, element-wise):

$$a^{(l)} = \sigma\!\left(z^{(l)}\right)$$

Or collapsed into one line: $a^{(l)} = \sigma\!\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)$.
- $a^{(l)}$ — activation vector of layer $l$ (length = number of neurons in layer $l$)
- $W^{(l)}$ — weight matrix for the layer $l-1$ → layer $l$ transition. Rows = neurons in layer $l$, columns = neurons in layer $l-1$. Each row is one neuron’s full set of incoming weights.
- $b^{(l)}$ — bias vector (one entry per neuron in layer $l$)
- $\sigma$ — applied element-wise (activation-function: typically ReLU for hidden layers)
Expanded with explicit dimensions
For a layer $l-1$ with $n$ neurons connecting to a layer $l$ with $m$ neurons:
Step 1 — Pre-activation:

$$\underbrace{z^{(l)}}_{m \times 1} = \underbrace{W^{(l)}}_{m \times n}\,\underbrace{a^{(l-1)}}_{n \times 1} + \underbrace{b^{(l)}}_{m \times 1}$$

Inner dimensions cancel ($(m \times n)(n \times 1)$), output is $m \times 1$ as expected.
Step 2 — Activation: $a_j^{(l)} = \sigma\!\left(z_j^{(l)}\right)$ for each $j = 1, \dots, m$.
The entire network is this two-step operation composed $L$ times (once per layer) — a chain of matrix multiplies, bias additions, and nonlinearities.
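As a hedged sketch of that composition: one vectorised function per layer, folded over a list of $(W, b)$ pairs. The helper names and the random toy parameters are illustrative only, and ReLU is used for every layer purely to keep the sketch short (a real output layer would usually use softmax or sigmoid instead):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer(a_prev, W, b, sigma=relu):
    """One layer: z = W a_prev + b, then an element-wise nonlinearity."""
    z = W @ a_prev + b          # (m, n) @ (n,) + (m,)  ->  (m,)
    return sigma(z)

def mlp_forward(x, params):
    """Compose the layer computation once per (W, b) pair."""
    a = x
    for W, b in params:
        a = layer(a, W, b)
    return a

# Toy parameters mirroring the 784 -> 16 -> 16 -> 10 MNIST shape
rng = np.random.default_rng(0)
sizes = [784, 16, 16, 10]
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
out = mlp_forward(rng.standard_normal(784), params)
print(out.shape)  # (10,)
```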
Image: Example MLP architecture diagram
A fully connected MLP, with 16 neurons in the input layer, two hidden layers with 12 neurons each, and 4 neurons in the output layer
Why layers work
Layers decompose hard problems into sub-problems of increasing abstraction. For image recognition the hope is: pixels → edges → shapes → digits. Each layer performs a relatively simple transformation; the composition is what produces complex behaviour.
Whether the trained network actually learns this neat decomposition is a separate question — in practice, the hidden layers of a simple MLP often learn blobby, hard-to-interpret patterns rather than clean edges and loops (see src-3b1b-neural-networks-ch3).
Key properties
- Expressiveness comes from the nonlinearity. Without an activation-function, stacking layers just produces a single affine map ($W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, still of the form $W'x + b'$) — no better than one layer. The nonlinearity is what lets the network represent non-linear decision boundaries.
- Weights are interpretable (in principle). For image inputs, each neuron’s weight vector can be reshaped into the input dimensions and visualised as the pattern the neuron responds to (e.g. positive weights in a strip with negative surround = edge detector).
- Local minima are real. The learned parameters depend on random initialisation and may not correspond to the interpretable decomposition we’d hope for (see src-3b1b-neural-networks-ch3).
- Parameter count grows fast. A fully-connected layer from $n$ to $m$ neurons has $m \times n$ weights + $m$ biases. For the 3Blue1Brown MNIST example (784→16→16→10), that’s 13,002 total parameters. This density is why CNNs use weight-sharing for large spatial inputs.
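A quick check of that count, using the layer sizes from the 3Blue1Brown example (the helper name is mine):

```python
def mlp_param_count(sizes):
    """Weights (m*n) plus biases (m) for each consecutive pair of layer sizes."""
    return sum(m * n + m for n, m in zip(sizes[:-1], sizes[1:]))

print(mlp_param_count([784, 16, 16, 10]))   # 13002
```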
How it learns
The weights and biases are adjusted by gradient-descent to minimise a cost-function over labelled data. The gradient is computed efficiently via backpropagation.
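As a minimal sketch, one gradient-descent step then looks like this, assuming the per-parameter gradients have already been computed by backpropagation (the function name and learning-rate value are placeholders):

```python
def gradient_descent_step(params, grads, lr=0.01):
    """params and grads are parallel lists of (W, b) and (dW, db) arrays.
    Each parameter moves a small step against its own gradient."""
    return [(W - lr * dW, b - lr * db)
            for (W, b), (dW, db) in zip(params, grads)]
```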
Limitations
- No spatial awareness. An MLP treats each input dimension independently — it doesn’t know that pixel (3,4) is adjacent to pixel (3,5). A pattern learned in one region of the input doesn’t transfer to another. This is the gap that convolutional neural networks fill.
- No memory. Each input is processed independently. For sequential data (text, audio, time series), recurrent architectures or transformers are needed.
- Scales poorly to high-dimensional inputs. Full connectivity means the parameter count is $n_{\text{in}} \times n_{\text{out}}$ per layer. For a 224×224 RGB image (150,528 inputs), even one hidden layer of 1,000 neurons would have ~150M parameters — most of them redundant.
MLP blocks inside a transformer
A transformer interleaves attention blocks and MLP blocks. The MLP block is a two-layer MLP applied independently and in parallel at every token position (no cross-token information flow — that’s attention’s job).
For each token’s residual-stream vector $x \in \mathbb{R}^{d_{\text{model}}}$:
- Up-projection (Linear): $h = W_{\text{up}}\,x + b_{\text{up}}$, where $W_{\text{up}} \in \mathbb{R}^{d_{\text{mlp}} \times d_{\text{model}}}$ and $d_{\text{mlp}} > d_{\text{model}}$. For GPT-3, $d_{\text{mlp}} = 4\,d_{\text{model}}$, so $W_{\text{up}}$ has ~4× as many rows as the embedding dimension, i.e. $49{,}152$ rows against $d_{\text{model}} = 12{,}288$.
- Nonlinearity: $n = \sigma(h)$, element-wise. The standard choice is ReLU or the smoother GELU (Gaussian Error Linear Unit), which looks like ReLU but with a soft knee near zero.
- This ensures cleaner yes/no triggering for when a vector cleanly “answers” a question (a row of $W_{\text{up}}$).
- Down-projection (Linear): $m = W_{\text{down}}\,n + b_{\text{down}}$, with $W_{\text{down}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{mlp}}}$, bringing the vector back to the embedding dimension.
- Residual add: output $= x + m$.
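A sketch of one MLP block’s forward pass for a single token vector, following the steps above. The tanh approximation of GELU and the toy dimensions are my choices; GPT-3’s actual sizes would be $d_{\text{model}} = 12{,}288$ and $d_{\text{mlp}} = 49{,}152$:

```python
import numpy as np

def gelu(h):
    """Tanh approximation of GELU: like ReLU but with a soft knee near zero."""
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def mlp_block(x, W_up, b_up, W_down, b_down):
    """Applied independently at every token position; no cross-token flow."""
    h = W_up @ x + b_up        # up-projection: (d_mlp, d_model) @ (d_model,)
    n = gelu(h)                # element-wise nonlinearity
    m = W_down @ n + b_down    # down-projection back to (d_model,)
    return x + m               # residual add

# Toy dimensions (GPT-3 would be d_model=12288, d_mlp=49152)
d_model, d_mlp = 8, 32
rng = np.random.default_rng(0)
out = mlp_block(rng.standard_normal(d_model),
                rng.standard_normal((d_mlp, d_model)), np.zeros(d_mlp),
                rng.standard_normal((d_model, d_mlp)), np.zeros(d_model))
print(out.shape)  # (8,)
```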
Rows as questions, columns as answers
Two useful lenses on what the two matrices do:
- Rows of $W_{\text{up}}$ = “questions being asked”.
- Computing $W_{\text{up}}\,x$ is a pile of dot products between the row vectors of $W_{\text{up}}$ and $x$.
- Each dot product measures how much $x$ aligns with some learned direction (analogous to a “question”).
- With the right bias and ReLU, the $i$-th “neuron” fires if and only if $x$ contains some specific combination of features — an AND gate over directions in the embedding space.
- Columns of $W_{\text{down}}$ = “answers written back”.
- Computing $W_{\text{down}}\,n$ can be read as summing the columns of $W_{\text{down}}$ weighted by the neuron activations $n_i$ (i.e. rescaling then adding columns)
- Each column is a direction in embedding space that gets added to $x$ whenever its corresponding neuron is active.
Together, a single neuron implements: “if the residual stream contains direction X (plus any threshold features encoded in the bias), add direction Y to it”. This is how MLPs are often said to store facts — they’re lookup structures keyed on directions in the embedding space. Worked example: Michael Jordan → basketball. See src-3b1b-llms-ch4-mlps-store-facts for the full walkthrough.
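A toy illustration of that “if direction X, add direction Y” reading, with made-up 4-dimensional directions (nothing here comes from a real model):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Toy embedding space: one "question" direction and one "answer" direction
question = np.array([1.0, 1.0, 0.0, 0.0])   # e.g. "Michael" AND "Jordan" present?
answer   = np.array([0.0, 0.0, 0.0, 1.0])   # e.g. the "basketball" direction
bias = -1.5   # threshold: both features must be present before the neuron fires

def one_neuron_mlp(x):
    activation = relu(question @ x + bias)   # fires only if x aligns with the question
    return x + activation * answer           # writes the answer direction back

x_both    = np.array([1.0, 1.0, 0.0, 0.0])  # contains both features -> neuron fires
x_partial = np.array([1.0, 0.0, 0.0, 0.0])  # only one feature       -> stays silent
print(one_neuron_mlp(x_both))     # last component becomes 0.5
print(one_neuron_mlp(x_partial))  # unchanged
```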
Where most of the parameters live
In GPT-3, each MLP block contributes:
- $W_{\text{up}}$ ($49{,}152 \times 12{,}288$): ~604M params
- $W_{\text{down}}$ ($12{,}288 \times 49{,}152$): ~604M params
- Biases: trivial
~1.2B per block × 96 blocks ≈ 116B params devoted to MLPs — about two-thirds of GPT-3’s total 175B. Attention grabs the spotlight; MLPs hold most of the memory.
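The arithmetic behind those figures, using the GPT-3 sizes quoted above:

```python
d_model, n_blocks = 12_288, 96
d_mlp = 4 * d_model                  # 49,152

per_matrix = d_model * d_mlp         # ~604M weights each for W_up and W_down
per_block  = 2 * per_matrix          # ~1.21B per MLP block (biases are negligible)
total      = per_block * n_blocks    # ~116B of GPT-3's 175B parameters
print(per_matrix, per_block, total)  # 603979776 1207959552 115964116992
```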
Superposition complicates the story
The “rows ask clean questions, columns write clean answers” picture is an idealisation. In practice, individual MLP neurons rarely correspond to single interpretable features — they encode combinations of features in superposition, exploiting the Johnson–Lindenstrauss room in high-dimensional spaces to pack exponentially many near-orthogonal features into the activation vector. Recovering human-interpretable features usually requires tools like sparse autoencoders rather than reading neurons directly.
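A quick numerical illustration of that room: random directions in a high-dimensional space are nearly orthogonal, so far more feature directions than dimensions can coexist with only small interference. The dimensions and counts below are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_features = 1_000, 10_000      # ten times more "features" than dimensions

# Random unit vectors standing in for feature directions
V = rng.standard_normal((n_features, dim))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Cosine similarity between 500 disjoint pairs of directions
cosines = np.einsum('ij,ij->i', V[:500], V[500:1000])
print(np.abs(cosines).max())   # ~0.1: every sampled pair is close to orthogonal
```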