Summary: The MLP “circles and arcs” diagram describes which neurons connect to which (a static architectural view); the computation-graph describes every individual mathematical operation that runs (a dynamic operational view). They are two representations of the same network at different levels of abstraction — one for humans designing models, one for autograd computing gradients.

One-line distinction

                        Network diagram                    Computation graph
  What a node is        A neuron                           A mathematical operation (or leaf tensor)
  What an edge is       A weighted connection              A data dependency (no weight)
  Where weights live    On the edges (labels)              As leaf nodes, alongside inputs
  Granularity           One circle per neuron              One node per op (matmul, add, exp, …)
  Built when            Drawn by a human at design time    Recorded by autograd during the forward pass
  Used for              Reasoning about architecture,      Executing forward + traversing backward
                        layer counts, connectivity         to compute gradients
  Audience              Humans                             The autograd engine (PyTorch, JAX, TF, micrograd)

Worked example: a single neuron

A single neuron with inputs x₁, …, x_K computes y = σ(w₁x₁ + w₂x₂ + … + w_K x_K + b).

As a network diagram

One circle, K incoming arcs labelled w₁, …, w_K, one bias term b attached to the neuron, and one activation function σ implied inside the circle.

  x₁ ──w₁──┐
  x₂ ──w₂──┤
   ⋮       ├─► (σ) ──► y
  x_K ─w_K─┘
        + b

As a computation graph (scalar-op level, e.g. micrograd)

Many nodes. Every multiply and add is its own node, then an exp (or other op) for the activation:

  x₁ ─┐                   x₂ ─┐
       *──► p₁              *──► p₂   ...   p_K
  w₁ ─┘                   w₂ ─┘

  p₁ + p₂ + … + p_K = s
  s + b = z
  σ(z) = y

Roughly K multiplies + (K − 1) adds + 1 bias add + 1 activation op = 2K + 1 op-nodes for one neuron (on top of the 2K + 1 leaf nodes for the inputs, weights and bias).
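
To make that concrete, here is a minimal, self-contained sketch of a scalar-op engine in the spirit of micrograd (the Value class below is a stand-in written for this note, not micrograd's actual implementation). It records one node per operation for a K = 3 neuron and then counts the nodes:

  import math

  class Value:
      """Scalar wrapper that records every operation as a node in a graph."""
      def __init__(self, data, parents=(), op=""):
          self.data = data
          self.grad = 0.0
          self._parents = parents   # edges: "this node was computed from ..."
          self._op = op             # which op produced this node

      def __add__(self, other):
          return Value(self.data + other.data, (self, other), "+")

      def __mul__(self, other):
          return Value(self.data * other.data, (self, other), "*")

      def sigmoid(self):
          return Value(1.0 / (1.0 + math.exp(-self.data)), (self,), "sigmoid")

  def count_nodes(root):
      """Walk the recorded graph from the output and count distinct nodes."""
      seen, stack = set(), [root]
      while stack:
          v = stack.pop()
          if id(v) not in seen:
              seen.add(id(v))
              stack.extend(v._parents)
      return len(seen)

  # One neuron with K = 3: y = sigmoid(w1*x1 + w2*x2 + w3*x3 + b)
  xs = [Value(1.0), Value(2.0), Value(3.0)]
  ws = [Value(0.5), Value(-0.3), Value(0.8)]
  b = Value(0.1)

  s = xs[0] * ws[0]
  for x, w in zip(xs[1:], ws[1:]):
      s = s + x * w
  y = (s + b).sigmoid()

  print(count_nodes(y))   # 14: 7 leaves + 3 muls + 2 sum adds + 1 bias add + 1 sigmoid

For K = 3 that matches the count above: 2K + 1 = 7 op-nodes, plus 2K + 1 = 7 leaves.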

As a computation graph (tensor-op level, e.g. PyTorch)

If you batch the inputs as a row vector x of shape 1 × K and stack the weights as a column vector W of shape K × 1 (or a K × N matrix for a layer of N neurons), the same neuron compiles to just three internal ops:

  x ──┐
       ├─► matmul ──► z₁ ──┐
  W ──┘                    ├─► add ──► z₂ ──► σ ──► y
                       b ──┘

Same gradients, far fewer nodes — matmul is one fused op rather than K scalar multiplies plus a sum tree. Granularity is an implementation choice; see “Granularity is a choice.”
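
You can see this recorded graph directly in PyTorch by walking the grad_fn pointers after a forward pass. A quick sketch (the exact grad_fn class names printed below vary across PyTorch versions):

  import torch

  K = 4
  x = torch.randn(1, K)                       # input, a leaf
  W = torch.randn(K, 1, requires_grad=True)   # weights, also a leaf
  b = torch.randn(1, requires_grad=True)      # bias, also a leaf

  y = torch.sigmoid(x @ W + b)                # matmul -> add -> sigmoid

  # Walk the recorded graph backwards from the output, following the
  # first available parent edge (enough for this straight-line graph).
  node = y.grad_fn
  while node is not None:
      print(type(node).__name__)
      parents = [fn for fn, _ in node.next_functions if fn is not None]
      node = parents[0] if parents else None

  # Prints something like: SigmoidBackward0, AddBackward0, MmBackward0,
  # AccumulateGrad (the leaf node for W).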

Where the views diverge

Weights as nodes vs weights as edges

This is the most common conceptual stumbling block.

  • Network diagram. Each weight is a number painted on an arc between two neurons. It modifies the data passing along that arc. Weights are part of the connection.
  • Computation graph. Weights are first-class nodes — leaf tensors that participate in matmul operations alongside the inputs. The computation graph treats the input and the parameter symmetrically: both are leaves, both feed into the matmul op, both can have gradients accumulated against them. Edges are weight-free; they only encode “is consumed by.”

This symmetry is why the same autograd machinery handles training (where you optimise over W) and adversarial-input crafting (where you optimise over x): they’re just two leaves of the same graph.
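
A short sketch of that symmetry, using an assumed toy squared-error loss: the only thing that changes between the two uses is which leaf is marked requires_grad=True and therefore which leaf receives a gradient:

  import torch

  torch.manual_seed(0)
  W = torch.randn(4, 1)
  x = torch.randn(1, 4)
  target = torch.tensor([[1.0]])

  def loss_fn(x, W):
      return ((torch.sigmoid(x @ W) - target) ** 2).mean()

  # Training: W is the trainable leaf, x is fixed data.
  W_train = W.clone().requires_grad_(True)
  loss_fn(x, W_train).backward()
  print(W_train.grad.shape)   # torch.Size([4, 1]) -- gradient w.r.t. the parameters

  # Adversarial-input crafting: x is the trainable leaf, W is frozen.
  x_adv = x.clone().requires_grad_(True)
  loss_fn(x_adv, W).backward()
  print(x_adv.grad.shape)     # torch.Size([1, 4]) -- gradient w.r.t. the input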

Biases

  • Network diagram. The bias b is attached to the neuron — a property of the destination node.
  • Computation graph. b is a leaf tensor, fed into an add op alongside the matmul output. No different from any other parameter.

Activation functions

  • Network diagram. Implied — drawn as a small symbol inside the neuron, or assumed by convention. One per neuron.
  • Computation graph. Explicit op-node. For sigmoid, may decompose into 4 nodes (neg, exp, add, div) at the scalar-op level, or be fused into one sigmoid op at the tensor-op level.
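
You can see the difference between a fused and a decomposed activation even within PyTorch, by writing the sigmoid out by hand (a small sketch; the op-node class names are version-dependent):

  import torch

  z = torch.randn(3, requires_grad=True)
  one = torch.tensor(1.0)

  fused = torch.sigmoid(z)              # one op-node for the whole activation
  manual = one / (one + torch.exp(-z))  # neg -> exp -> add -> div: four op-nodes

  print(type(fused.grad_fn).__name__)   # SigmoidBackward0
  print(type(manual.grad_fn).__name__)  # DivBackward0 (its parents: add, exp, neg)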

Where the views converge

For an MLP layer expressed in matrix form, the tensor-op computation graph (PyTorch) maps almost 1:1 onto the network diagram drawn one layer at a time — matmul corresponds to “all incoming arcs of the layer,” add to all the biases, and the activation op to all the per-neuron σ’s. The tensor-op graph is the network diagram’s compact computational form.
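
In code, one diagram layer compiles down to very few op-nodes. For instance, nn.Linear typically records the matmul and the bias add as a single fused addmm node (the grad_fn names shown are version-dependent):

  import torch
  import torch.nn as nn

  layer = nn.Linear(in_features=8, out_features=3)   # all arcs + all biases of one layer
  x = torch.randn(1, 8)
  y = torch.sigmoid(layer(x))

  print(type(y.grad_fn).__name__)                        # SigmoidBackward0
  print(type(y.grad_fn.next_functions[0][0]).__name__)   # AddmmBackward0 (matmul + bias, fused)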

The two views diverge most sharply when:

  • You go down to scalar-op granularity (every multiply explicit).
  • You include sub-neuron operations the diagram hides — softmax, layer norm, attention masking, residual additions.
  • You’re inside a Transformer block, where many ops happen between “the embedding” and “the output,” none of which correspond to a clean per-neuron drawing.

When to use each

  • Reach for the network diagram when you’re designing a model, choosing layer widths, explaining connectivity, or reasoning about parameter counts at the layer level.
  • Reach for the computation graph when you’re debugging gradient flow, writing custom autograd functions, profiling memory (every node holds an intermediate tensor for backward), or understanding why something like .detach() or requires_grad=False works.
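
As a minimal illustration of the last point: .detach() returns a tensor with the same values but no grad_fn, so the recorded graph stops there and nothing upstream of the detach can receive a gradient:

  import torch

  W = torch.randn(4, 1, requires_grad=True)
  x = torch.randn(1, 4)

  h = x @ W           # recorded: gradients can flow back to W
  h_cut = h.detach()  # same values, but no grad_fn: the graph is cut here

  print(h.grad_fn is not None)   # True  -- h is part of the recorded graph
  print(h_cut.grad_fn)           # None  -- h_cut is not
  # Calling backward() on anything built only from h_cut would leave W.grad
  # untouched (and raises if nothing in that subgraph requires grad).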

In short: the network diagram is the architectural specification; the computation graph is the compiled program. Frameworks like PyTorch take the layer code you write and produce the latter automatically — you almost never have to draw it by hand, but knowing it’s there is what lets you reason about autograd, memory, and gradient-related bugs.
