Summary: The MLP “circles and arcs” diagram describes which neurons connect to which (a static architectural view); the computation graph describes every individual mathematical operation that runs (a dynamic operational view). They are two representations of the same network at different levels of abstraction — one for humans designing models, one for autograd computing gradients.
One-line distinction
| | Network diagram | Computation graph |
|---|---|---|
| What a node is | A neuron | A mathematical operation (or leaf tensor) |
| What an edge is | A weighted connection | A data dependency (no weight) |
| Where weights live | On the edges (labels) | As leaf nodes, alongside inputs |
| Granularity | One circle per neuron | One node per op (matmul, add, exp, …) |
| Built when | Drawn by a human at design time | Recorded by autograd during the forward pass |
| Used for | Reasoning about architecture, layer counts, connectivity | Executing forward + traversing backward to compute gradients |
| Audience | Humans | The autograd engine (PyTorch, JAX, TF, micrograd) |
Worked example: a single neuron
A single neuron with inputs $x_1, \dots, x_K$ computes $y = \sigma\left(\sum_{k=1}^{K} w_k x_k + b\right)$.
As a network diagram
One circle. $K$ incoming arcs labelled $w_1, \dots, w_K$. One bias term $b$ attached to the neuron. One activation function $\sigma$ implied inside the circle.
```
x₁ ──w₁──┐
x₂ ──w₂──┤
 ⋮       ├─► (σ) ──► y
x_K ─w_K─┘
     + b
```
As a computation graph (scalar-op level, e.g. micrograd)
Many nodes. Every multiply and add is its own node, then an exp (or other op) for the activation:
```
x₁ ─┐        x₂ ─┐
    *──► p₁      *──► p₂   …   p_K
w₁ ─┘        w₂ ─┘

p₁ + p₂ + … + p_K = s
s + b = z
σ(z) = y
```
Roughly $K$ multiplies + $(K-1)$ adds + 1 bias add + 1 activation op ≈ $2K+1$ op nodes for one neuron.
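To make this concrete, here is a minimal sketch using the `Value` class from Karpathy's micrograd package (assuming its published API, with `relu` standing in for $\sigma$):

```python
# Sketch: one neuron at scalar-op granularity with micrograd's Value.
# Each *, +, and activation call records one node in the graph.
from micrograd.engine import Value

xs = [Value(1.0), Value(2.0), Value(3.0)]   # inputs: leaf nodes
ws = [Value(0.5), Value(-0.3), Value(0.8)]  # weights: also leaf nodes, not edge labels
b = Value(0.1)                              # bias: one more leaf

s = sum((w * x for w, x in zip(ws, xs)), Value(0.0))  # K multiplies, then adds
y = (s + b).relu()                          # bias add, then the activation node

y.backward()                                # traverse the recorded graph in reverse
print([w.grad for w in ws], b.grad)         # gradients accumulate on the leaves
```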
As a computation graph (tensor-op level, e.g. PyTorch)
If you batch the inputs as a row vector $\mathbf{x} \in \mathbb{R}^{1 \times K}$ and stack the weights as a column vector $W \in \mathbb{R}^{K \times 1}$ (or a matrix $W \in \mathbb{R}^{K \times N}$ for a layer of $N$ neurons), the same neuron compiles to just three internal ops:
```
x ──┐
    ├─► matmul ──► z₁ ──┐
W ──┘                   ├─► add ──► z₂ ──► σ ──► y
                   b ───┘
```
Same gradients, far fewer nodes — matmul is one fused op rather than $K$ scalar multiplies plus a sum tree. Granularity is an implementation choice; see Granularity is a choice.
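A minimal PyTorch sketch of this three-node graph (shapes chosen arbitrarily); `grad_fn` is how PyTorch exposes the recorded nodes:

```python
# Sketch: the same neuron at tensor-op granularity in PyTorch.
import torch

x = torch.randn(1, 4)                      # input row vector
W = torch.randn(4, 1, requires_grad=True)  # weights: a leaf tensor in the graph
b = torch.randn(1, requires_grad=True)     # bias: another leaf

y = torch.sigmoid(x @ W + b)               # records three ops: matmul, add, sigmoid

print(y.grad_fn)                 # SigmoidBackward0 (the newest node)
print(y.grad_fn.next_functions)  # its parent: the add node, with matmul beneath
```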
Where the views diverge
Weights as nodes vs weights as edges
This is the most common conceptual stumbling block.
- Network diagram. Each weight is a number painted on an arc between two neurons. It modifies the data passing along that arc. Weights are part of the connection.
- Computation graph. Weights are first-class nodes — leaf tensors that participate in `matmul` operations alongside the inputs. The computation graph treats the input $x$ and the parameter $W$ symmetrically: both are leaves, both feed into the `matmul` op, both can have gradients accumulated against them. Edges are weight-free; they only encode “is consumed by.”
This symmetry is why the same autograd machinery handles training (where you optimise over $W$) and adversarial-input crafting (where you optimise over $x$): they’re just two leaves of the same graph.
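A small sketch of that symmetry, assuming we mark the input with `requires_grad=True` (normally only parameters are):

```python
# Sketch: one graph, two optimisation targets: W (training) or x (attack).
import torch

x = torch.randn(1, 4, requires_grad=True)  # track gradients w.r.t. the *input* too
W = torch.randn(4, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

loss = torch.sigmoid(x @ W + b).sum()
loss.backward()

print(W.grad)  # step on this to train
print(x.grad)  # step on this to craft an adversarial input
```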
Biases
- Network diagram. The bias $b$ is attached to the neuron — a property of the destination node.
- Computation graph. $b$ is a leaf tensor, fed into an `add` op alongside the matmul output. No different from any other parameter.
Activation functions
- Network diagram. Implied — drawn as a small symbol inside the neuron, or assumed by convention. One per neuron.
- Computation graph. Explicit op-node. For sigmoid, $\sigma(z) = 1/(1 + e^{-z})$ may decompose into 4 nodes (`neg`, `exp`, `add`, `div`) at the scalar-op level, or be fused into one `sigmoid` op at the tensor-op level.
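A quick sketch of both granularities in PyTorch; same values, different node counts:

```python
# Sketch: sigmoid spelled out as primitive ops vs. fused into one node.
import torch

z = torch.randn(3, requires_grad=True)

y_manual = 1.0 / (1.0 + torch.exp(-z))  # records a neg/exp/add/div-style chain
y_fused = torch.sigmoid(z)              # records a single node

print(y_manual.grad_fn)  # last node of the manual chain
print(y_fused.grad_fn)   # SigmoidBackward0
```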
Where the views converge
For an MLP layer expressed in matrix form, $\mathbf{y} = \sigma(\mathbf{x}W + \mathbf{b})$,
the tensor-op computation graph (PyTorch) maps almost 1:1 onto the network diagram drawn one layer at a time — matmul corresponds to “all incoming arcs of the layer,” add to all the biases, the activation to all the $\sigma$’s. The tensor-op graph is the network diagram’s compact computational form.
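For instance, a minimal sketch (assuming `torch.nn`; `nn.Linear` records the matmul and bias add as a single fused `addmm` node):

```python
# Sketch: one diagram layer <-> one Linear module in the recorded graph.
import torch
import torch.nn as nn

layer = nn.Linear(4, 3)          # all arcs (W) and all biases (b) of the layer
x = torch.randn(1, 4)
y = torch.sigmoid(layer(x))      # recorded graph: addmm -> sigmoid

print(y.grad_fn)                 # SigmoidBackward0
print(y.grad_fn.next_functions)  # AddmmBackward0: the whole layer's arcs at once
```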
The two views diverge most sharply when:
- You go down to scalar-op granularity (every multiply explicit).
- You include sub-neuron operations the diagram hides — softmax, layer norm, attention masking, residual additions.
- You’re inside a Transformer block, where many ops happen between “the embedding” and “the output,” none of which map onto clean per-neuron drawings.
When to use each
- Reach for the network diagram when you’re designing a model, choosing layer widths, explaining connectivity, or reasoning about parameter counts at the layer level.
- Reach for the computation graph when you’re debugging gradient flow, writing custom autograd functions, profiling memory (every node holds intermediate tensors for backward), or understanding why something like `.detach()` or `requires_grad=False` works (see the sketch after this list).
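A minimal sketch of the `.detach()` case referenced above:

```python
# Sketch: .detach() cuts the "is consumed by" edge at a chosen point.
import torch

W = torch.randn(3, 3, requires_grad=True)
x = torch.randn(1, 3)

h = x @ W           # recorded: a matmul node
h_cut = h.detach()  # same values, but a fresh leaf with no history

print(h.grad_fn)     # <MmBackward0 ...>
print(h_cut.grad_fn) # None; backward cannot flow past this tensor
```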
In short: the network diagram is the architectural specification; the computation graph is the compiled program. Frameworks like PyTorch take the layer code you write and produce the latter automatically — you almost never have to draw it by hand, but knowing it’s there is what lets you reason about autograd, memory, and gradient-related bugs.
See also
- computation-graph — full concept page on the operational view
- neural-network — the architectural abstraction
- multilayer-perceptron — the canonical “circles and arcs” example
- backpropagation — what walks the computation graph in reverse
- backprop-graph-terminology — root/leaf/upstream/downstream as used in the computation graph
Sources
- src-3b1b-neural-networks-ch1 — network diagrams of MNIST classifier
- 02_nn_data_structs_and_forward_pass — scalar-op computation graph in code
- 04_from_bigrams_to_nns — tensor-op view via `xenc @ W` then `exp`, `softmax`