Summary: The simplest neural-network architecture — fully-connected layers stacked in sequence, where information flows strictly from input to output with no cycles.
MLP vs "feed-forward network"
At this level of abstraction, “multilayer perceptron” and “feed-forward neural network” refer to the same thing: layers of neurons where every neuron in layer $l$ connects to every neuron in layer $l+1$, and information flows in one direction only. The name “perceptron” is historical — an MLP has little to do with Rosenblatt’s single-layer perceptron beyond the lineage.
Layer computation
Single neuron
A single neuron $j$ in layer $l$ computes two steps. First, the pre-activation $z_j^{(l)}$ — a weighted sum of all inputs from the previous layer, plus a bias:

$$z_j^{(l)} = \sum_{k} w_{jk}^{(l)} \, a_k^{(l-1)} + b_j^{(l)}$$

Then the activation $a_j^{(l)}$ — pass $z_j^{(l)}$ through a nonlinear activation-function $\sigma$ (e.g. ReLU or sigmoid):

$$a_j^{(l)} = \sigma\!\left(z_j^{(l)}\right)$$
Index conventions:
- Superscript $(l)$ — which layer’s parameters are being used. $W^{(l)}$ and $b^{(l)}$ are the weights and biases that transform layer $l-1$ into layer $l$. The superscript on the parameters labels the transition — a network has one weight matrix per transition, and by convention $W^{(l)}$ carries the index of the layer it feeds into.
- Subscripts on $w_{jk}^{(l)}$ — $j$ indexes the destination neuron (in layer $l$), $k$ indexes the source neuron (in layer $l-1$). Row $j$ of the weight matrix contains all the weights feeding into neuron $j$.
- Subscript $j$ on $z_j^{(l)}$ and $a_j^{(l)}$ — which neuron within layer $l$.
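A minimal sketch of those two steps for one neuron in NumPy (the helper name `single_neuron` and the toy values are mine, not from the source; ReLU stands in for the generic $\sigma$):

```python
import numpy as np

def single_neuron(a_prev, w_j, b_j):
    """One neuron j in layer l: weighted sum of the previous layer's
    activations plus a bias, then a nonlinearity (ReLU here)."""
    z_j = np.dot(w_j, a_prev) + b_j   # pre-activation: sum_k w_jk * a_k + b_j
    a_j = max(z_j, 0.0)               # activation: ReLU(z_j)
    return a_j

# Example: 3 inputs feeding one neuron
a_prev = np.array([0.2, 0.8, 0.5])   # activations from layer l-1
w_j = np.array([1.0, -2.0, 0.5])     # row j of W^(l): incoming weights of neuron j
b_j = 0.1
print(single_neuron(a_prev, w_j, b_j))  # ReLU(-1.05) = 0.0
```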
Matrix form (full layer at once)
The same computation for all neurons in a layer, written as a matrix-vector operation. Separate the two steps:
Step 1 — Pre-activation (linear):

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$$

Step 2 — Activation (nonlinear, element-wise):

$$a^{(l)} = \sigma\!\left(z^{(l)}\right)$$

Or collapsed into one line: $a^{(l)} = \sigma\!\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)$.
- $a^{(l)}$ — activation vector of layer $l$ (length = number of neurons in layer $l$)
- $W^{(l)}$ — weight matrix for the layer $l-1$ → layer $l$ transition. Rows = neurons in layer $l$, columns = neurons in layer $l-1$. Each row is one neuron’s full set of incoming weights.
- $b^{(l)}$ — bias vector (one entry per neuron in layer $l$)
- $\sigma$ — applied element-wise (activation-function: typically ReLU for hidden layers)
Expanded with explicit dimensions
For a layer $l-1$ with $n$ neurons connecting to a layer $l$ with $m$ neurons:
Step 1 — Pre-activation:

$$\underbrace{z^{(l)}}_{m \times 1} = \underbrace{W^{(l)}}_{m \times n}\,\underbrace{a^{(l-1)}}_{n \times 1} + \underbrace{b^{(l)}}_{m \times 1}$$

Inner dimensions cancel ($(m \times n)(n \times 1)$), output is $m \times 1$ as expected.
Step 2 — Activation: $a_j^{(l)} = \sigma\!\left(z_j^{(l)}\right)$ for each $j = 1, \dots, m$.
The entire network is this two-step operation composed $L$ times (once per layer) — a chain of matrix multiplies, bias additions, and nonlinearities.
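As a hedged sketch of that composition: one vectorised function per layer, folded over a list of $(W, b)$ pairs. The helper names and the random toy parameters are illustrative only, and ReLU is used for every layer purely to keep the sketch short (a real output layer would usually use softmax or sigmoid instead):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer(a_prev, W, b, sigma=relu):
    """One layer: z = W a_prev + b, then an element-wise nonlinearity."""
    z = W @ a_prev + b          # (m, n) @ (n,) + (m,)  ->  (m,)
    return sigma(z)

def mlp_forward(x, params):
    """Compose the layer computation once per (W, b) pair."""
    a = x
    for W, b in params:
        a = layer(a, W, b)
    return a

# Toy parameters mirroring the 784 -> 16 -> 16 -> 10 MNIST shape
rng = np.random.default_rng(0)
sizes = [784, 16, 16, 10]
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
out = mlp_forward(rng.standard_normal(784), params)
print(out.shape)  # (10,)
```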
Image: Example MLP architecture diagram
A fully connected MLP, with 16 neurons in the input layer, two hidden layers with 12 neurons each, and 4 neurons in the output layer
Why layers work
Layers decompose hard problems into sub-problems of increasing abstraction. For image recognition the hope is: pixels → edges → shapes → digits. Each layer performs a relatively simple transformation; the composition is what produces complex behaviour.
Whether the trained network actually learns this neat decomposition is a separate question — in practice, the hidden layers of a simple MLP often learn blobby, hard-to-interpret patterns rather than clean edges and loops (see src-3b1b-neural-networks-ch3).
Key properties
- Expressiveness comes from the nonlinearity. Without an activation-function, stacking layers just produces a single affine map ($W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, still of the form $W'x + b'$) — no better than one layer. The nonlinearity is what lets the network represent non-linear decision boundaries.
- Weights are interpretable (in principle). For image inputs, each neuron’s weight vector can be reshaped into the input dimensions and visualised as the pattern the neuron responds to (e.g. positive weights in a strip with negative surround = edge detector).
- Local minima are real. The learned parameters depend on random initialisation and may not correspond to the interpretable decomposition we’d hope for (see src-3b1b-neural-networks-ch3).
- Parameter count grows fast. A fully-connected layer from $n$ to $m$ neurons has $m \times n$ weights + $m$ biases. For the 3Blue1Brown MNIST example (784→16→16→10), that’s 13,002 total parameters. This density is why CNNs use weight-sharing for large spatial inputs.
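A quick check of that count, using the layer sizes from the 3Blue1Brown example (the helper name is mine):

```python
def mlp_param_count(sizes):
    """Weights (m*n) plus biases (m) for each consecutive pair of layer sizes."""
    return sum(m * n + m for n, m in zip(sizes[:-1], sizes[1:]))

print(mlp_param_count([784, 16, 16, 10]))   # 13002
```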
How it learns
The weights and biases are adjusted by gradient-descent to minimise a cost-function over labelled data. The gradient is computed efficiently via backpropagation.
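As a minimal sketch, one gradient-descent step then looks like this, assuming the per-parameter gradients have already been computed by backpropagation (the function name and learning-rate value are placeholders):

```python
def gradient_descent_step(params, grads, lr=0.01):
    """params and grads are parallel lists of (W, b) and (dW, db) arrays.
    Each parameter moves a small step against its own gradient."""
    return [(W - lr * dW, b - lr * db)
            for (W, b), (dW, db) in zip(params, grads)]
```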
Limitations
- No spatial awareness. An MLP treats each input dimension independently — it doesn’t know that pixel (3,4) is adjacent to pixel (3,5). A pattern learned in one region of the input doesn’t transfer to another. This is the gap that convolutional neural networks fill.
- No memory. Each input is processed independently. For sequential data (text, audio, time series), recurrent architectures or transformers are needed.
- Scales poorly to high-dimensional inputs. Full connectivity means the parameter count is $n_{\text{in}} \times n_{\text{out}}$ per layer. For a 224×224 RGB image (150,528 inputs), even one hidden layer of 1,000 neurons would have ~150M parameters — most of them redundant.
MLP blocks inside a transformer
A transformer interleaves attention blocks and MLP blocks. The MLP block is a two-layer MLP applied independently and in parallel at every token position (no cross-token information flow — that’s attention’s job).
For each token’s residual-stream vector $x \in \mathbb{R}^{d_{\text{model}}}$:
- Up-projection (Linear): $h = W_{\text{up}}\,x + b_{\text{up}}$, where $W_{\text{up}} \in \mathbb{R}^{d_{\text{mlp}} \times d_{\text{model}}}$ and $d_{\text{mlp}} > d_{\text{model}}$. For GPT-3, $d_{\text{mlp}} = 4\,d_{\text{model}}$, so $W_{\text{up}}$ has ~4× as many rows as the embedding dimension, i.e. $49{,}152$ rows against $d_{\text{model}} = 12{,}288$.
- Nonlinearity: $n = \sigma(h)$, element-wise. The standard choice is ReLU or the smoother GELU (Gaussian Error Linear Unit), which looks like ReLU but with a soft knee near zero.
- This ensures cleaner yes/no triggering for when a vector cleanly “answers” a question (a row of $W_{\text{up}}$).
- Down-projection (Linear): $m = W_{\text{down}}\,n + b_{\text{down}}$, with $W_{\text{down}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{mlp}}}$, bringing the vector back to the embedding dimension.
- Residual add: output $= x + m$.
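A sketch of one MLP block’s forward pass for a single token vector, following the steps above. The tanh approximation of GELU and the toy dimensions are my choices; GPT-3’s actual sizes would be $d_{\text{model}} = 12{,}288$ and $d_{\text{mlp}} = 49{,}152$:

```python
import numpy as np

def gelu(h):
    """Tanh approximation of GELU: like ReLU but with a soft knee near zero."""
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def mlp_block(x, W_up, b_up, W_down, b_down):
    """Applied independently at every token position; no cross-token flow."""
    h = W_up @ x + b_up        # up-projection: (d_mlp, d_model) @ (d_model,)
    n = gelu(h)                # element-wise nonlinearity
    m = W_down @ n + b_down    # down-projection back to (d_model,)
    return x + m               # residual add

# Toy dimensions (GPT-3 would be d_model=12288, d_mlp=49152)
d_model, d_mlp = 8, 32
rng = np.random.default_rng(0)
out = mlp_block(rng.standard_normal(d_model),
                rng.standard_normal((d_mlp, d_model)), np.zeros(d_mlp),
                rng.standard_normal((d_model, d_mlp)), np.zeros(d_model))
print(out.shape)  # (8,)
```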
Rows as questions, columns as answers
Two useful lenses on what the two matrices do:
- Rows of $W_{\text{up}}$ = “questions being asked”.
- Computing $W_{\text{up}}\,x$ is a pile of dot products between the row vectors of $W_{\text{up}}$ and $x$.
- Each dot product measures how much $x$ aligns with some learned direction (analogous to a “question”).
- With the right bias and ReLU, the $i$-th “neuron” fires if and only if $x$ contains some specific combination of features — an AND gate over directions in the embedding space.
- Columns of $W_{\text{down}}$ = “answers written back”.
- Computing $W_{\text{down}}\,n$ can be read as summing the columns of $W_{\text{down}}$ weighted by the neuron activations $n_i$ (i.e. rescaling then adding columns)
- Each column is a direction in embedding space that gets added to $x$ whenever its corresponding neuron is active.
Together, a single neuron implements: “if the residual stream contains direction X (plus any threshold features encoded in the bias), add direction Y to it”. This is how MLPs are often said to store facts — they’re lookup structures keyed on directions in the embedding space. Worked example: Michael Jordan → basketball. See src-3b1b-llms-ch4-mlps-store-facts for the full walkthrough.
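A toy illustration of that “if direction X, add direction Y” reading, with made-up 4-dimensional directions (nothing here comes from a real model):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Toy embedding space: one "question" direction and one "answer" direction
question = np.array([1.0, 1.0, 0.0, 0.0])   # e.g. "Michael" AND "Jordan" present?
answer   = np.array([0.0, 0.0, 0.0, 1.0])   # e.g. the "basketball" direction
bias = -1.5   # threshold: both features must be present before the neuron fires

def one_neuron_mlp(x):
    activation = relu(question @ x + bias)   # fires only if x aligns with the question
    return x + activation * answer           # writes the answer direction back

x_both    = np.array([1.0, 1.0, 0.0, 0.0])  # contains both features -> neuron fires
x_partial = np.array([1.0, 0.0, 0.0, 0.0])  # only one feature       -> stays silent
print(one_neuron_mlp(x_both))     # last component becomes 0.5
print(one_neuron_mlp(x_partial))  # unchanged
```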
Where most of the parameters live
In GPT-3, each MLP block contributes:
- $W_{\text{up}}$ ($49{,}152 \times 12{,}288$): ~604M params
- $W_{\text{down}}$ ($12{,}288 \times 49{,}152$): ~604M params
- Biases: trivial
~1.2B per block × 96 blocks ≈ 116B params devoted to MLPs — about two-thirds of GPT-3’s total 175B. Attention grabs the spotlight; MLPs hold most of the memory.
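The arithmetic behind those figures, using the GPT-3 sizes quoted above:

```python
d_model, n_blocks = 12_288, 96
d_mlp = 4 * d_model                  # 49,152

per_matrix = d_model * d_mlp         # ~604M weights each for W_up and W_down
per_block  = 2 * per_matrix          # ~1.21B per MLP block (biases are negligible)
total      = per_block * n_blocks    # ~116B of GPT-3's 175B parameters
print(per_matrix, per_block, total)  # 603979776 1207959552 115964116992
```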
Superposition complicates the story
The “rows ask clean questions, columns write clean answers” picture is an idealisation. In practice, individual MLP neurons rarely correspond to single interpretable features — they encode combinations of features in superposition, exploiting the Johnson–Lindenstrauss room in high-dimensional spaces to pack exponentially many near-orthogonal features into the activation vector. Recovering human-interpretable features usually requires tools like sparse autoencoders rather than reading neurons directly.
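A quick numerical illustration of that room: random directions in a high-dimensional space are nearly orthogonal, so far more feature directions than dimensions can coexist with only small interference. The dimensions and counts below are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_features = 1_000, 10_000      # ten times more "features" than dimensions

# Random unit vectors standing in for feature directions
V = rng.standard_normal((n_features, dim))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Cosine similarity between 500 disjoint pairs of directions
cosines = np.einsum('ij,ij->i', V[:500], V[500:1000])
print(np.abs(cosines).max())   # ~0.1: every sampled pair is close to orthogonal
```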