Summary: The MLP “circles and arcs” diagram describes which neurons connect to which (a static architectural view); the computation-graph describes every individual mathematical operation that runs (a dynamic operational view). They are two representations of the same network at different levels of abstraction — one for humans designing models, one for autograd computing gradients.

One-line distinction

                        Network diagram                    Computation graph
  What a node is        A neuron                           A mathematical operation (or leaf tensor)
  What an edge is       A weighted connection              A data dependency (no weight)
  Where weights live    On the edges (labels)              As leaf nodes, alongside inputs
  Granularity           One circle per neuron              One node per op (matmul, add, exp, …)
  Built when            Drawn by a human at design time    Recorded by autograd during the forward pass
  Used for              Reasoning about architecture,      Executing forward + traversing backward
                        layer counts, connectivity         to compute gradients
  Audience              Humans                             The autograd engine (PyTorch, JAX, TF, micrograd)

Worked example: a single neuron

A single neuron with inputs x₁, …, x_K computes y = σ(w₁x₁ + w₂x₂ + … + w_K x_K + b).

As a network diagram

One circle, K incoming arcs labelled w₁, …, w_K, one bias term b attached to the neuron, and one activation function σ implied inside the circle.

  x₁ ──w₁──┐
  x₂ ──w₂──┤
   ⋮       ├─► (σ) ──► y
  x_K ─w_K─┘
        + b

As a computation graph (scalar-op level, e.g. micrograd)

Many nodes. Every multiply and add is its own node, then an exp (or other op) for the activation:

  x₁ ─┐                   x₂ ─┐
       *──► p₁              *──► p₂   ...   p_K
  w₁ ─┘                   w₂ ─┘

  p₁ + p₂ + … + p_K = s
  s + b = z
  σ(z) = y

Roughly K multiplies + (K − 1) adds + 1 bias add + 1 activation op = 2K + 1 op-nodes for one neuron (on top of the 2K + 1 leaf nodes for the inputs, weights and bias).
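
To make that concrete, here is a minimal, self-contained sketch of a scalar-op engine in the spirit of micrograd (the Value class below is a stand-in written for this note, not micrograd's actual implementation). It records one node per operation for a K = 3 neuron and then counts the nodes:

  import math

  class Value:
      """Scalar wrapper that records every operation as a node in a graph."""
      def __init__(self, data, parents=(), op=""):
          self.data = data
          self.grad = 0.0
          self._parents = parents   # edges: "this node was computed from ..."
          self._op = op             # which op produced this node

      def __add__(self, other):
          return Value(self.data + other.data, (self, other), "+")

      def __mul__(self, other):
          return Value(self.data * other.data, (self, other), "*")

      def sigmoid(self):
          return Value(1.0 / (1.0 + math.exp(-self.data)), (self,), "sigmoid")

  def count_nodes(root):
      """Walk the recorded graph from the output and count distinct nodes."""
      seen, stack = set(), [root]
      while stack:
          v = stack.pop()
          if id(v) not in seen:
              seen.add(id(v))
              stack.extend(v._parents)
      return len(seen)

  # One neuron with K = 3: y = sigmoid(w1*x1 + w2*x2 + w3*x3 + b)
  xs = [Value(1.0), Value(2.0), Value(3.0)]
  ws = [Value(0.5), Value(-0.3), Value(0.8)]
  b = Value(0.1)

  s = xs[0] * ws[0]
  for x, w in zip(xs[1:], ws[1:]):
      s = s + x * w
  y = (s + b).sigmoid()

  print(count_nodes(y))   # 14: 7 leaves + 3 muls + 2 sum adds + 1 bias add + 1 sigmoid

For K = 3 that matches the count above: 2K + 1 = 7 op-nodes, plus 2K + 1 = 7 leaves.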

As a computation graph (tensor-op level, e.g. PyTorch)

If you batch the inputs as a row vector x of shape 1 × K and stack the weights as a column vector W of shape K × 1 (or a K × N matrix for a layer of N neurons), the same neuron compiles to just three internal ops:

  x ──┐
       ├─► matmul ──► z₁ ──┐
  W ──┘                    ├─► add ──► z₂ ──► σ ──► y
                       b ──┘

Same gradients, far fewer nodes — matmul is one fused op rather than K scalar multiplies plus a sum tree. Granularity is an implementation choice; see “Granularity is a choice.”
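
You can see this recorded graph directly in PyTorch by walking the grad_fn pointers after a forward pass. A quick sketch (the exact grad_fn class names printed below vary across PyTorch versions):

  import torch

  K = 4
  x = torch.randn(1, K)                       # input, a leaf
  W = torch.randn(K, 1, requires_grad=True)   # weights, also a leaf
  b = torch.randn(1, requires_grad=True)      # bias, also a leaf

  y = torch.sigmoid(x @ W + b)                # matmul -> add -> sigmoid

  # Walk the recorded graph backwards from the output, following the
  # first available parent edge (enough for this straight-line graph).
  node = y.grad_fn
  while node is not None:
      print(type(node).__name__)
      parents = [fn for fn, _ in node.next_functions if fn is not None]
      node = parents[0] if parents else None

  # Prints something like: SigmoidBackward0, AddBackward0, MmBackward0,
  # AccumulateGrad (the leaf node for W).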

Where the views diverge

Weights as nodes vs weights as edges

This is the most common conceptual stumbling block.

  • Network diagram. Each weight is a number painted on an arc between two neurons. It modifies the data passing along that arc. Weights are part of the connection.
  • Computation graph. Weights are first-class nodes — leaf tensors that participate in matmul operations alongside the inputs. The computation graph treats the input and the parameter symmetrically: both are leaves, both feed into the matmul op, both can have gradients accumulated against them. Edges are weight-free; they only encode “is consumed by.”

This symmetry is why the same autograd machinery handles training (where you optimise over W) and adversarial-input crafting (where you optimise over x): they’re just two leaves of the same graph.
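
A short sketch of that symmetry, using an assumed toy squared-error loss: the only thing that changes between the two uses is which leaf is marked requires_grad=True and therefore which leaf receives a gradient:

  import torch

  torch.manual_seed(0)
  W = torch.randn(4, 1)
  x = torch.randn(1, 4)
  target = torch.tensor([[1.0]])

  def loss_fn(x, W):
      return ((torch.sigmoid(x @ W) - target) ** 2).mean()

  # Training: W is the trainable leaf, x is fixed data.
  W_train = W.clone().requires_grad_(True)
  loss_fn(x, W_train).backward()
  print(W_train.grad.shape)   # torch.Size([4, 1]) -- gradient w.r.t. the parameters

  # Adversarial-input crafting: x is the trainable leaf, W is frozen.
  x_adv = x.clone().requires_grad_(True)
  loss_fn(x_adv, W).backward()
  print(x_adv.grad.shape)     # torch.Size([1, 4]) -- gradient w.r.t. the input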

Biases

  • Network diagram. The bias b is attached to the neuron — a property of the destination node.
  • Computation graph. b is a leaf tensor, fed into an add op alongside the matmul output. No different from any other parameter.

Activation functions

  • Network diagram. Implied — drawn as a small symbol inside the neuron, or assumed by convention. One per neuron.
  • Computation graph. Explicit op-node. For sigmoid, may decompose into 4 nodes (neg, exp, add, div) at the scalar-op level, or be fused into one sigmoid op at the tensor-op level.
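
You can see the difference between a fused and a decomposed activation even within PyTorch, by writing the sigmoid out by hand (a small sketch; the op-node class names are version-dependent):

  import torch

  z = torch.randn(3, requires_grad=True)
  one = torch.tensor(1.0)

  fused = torch.sigmoid(z)              # one op-node for the whole activation
  manual = one / (one + torch.exp(-z))  # neg -> exp -> add -> div: four op-nodes

  print(type(fused.grad_fn).__name__)   # SigmoidBackward0
  print(type(manual.grad_fn).__name__)  # DivBackward0 (its parents: add, exp, neg)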

Where the views converge

For an MLP layer expressed in matrix form, the tensor-op computation graph (PyTorch) maps almost 1:1 onto the network diagram drawn one layer at a time — matmul corresponds to “all incoming arcs of the layer,” add to all the biases, and the activation op to all the per-neuron σ’s. The tensor-op graph is the network diagram’s compact computational form.
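
In code, one diagram layer compiles down to very few op-nodes. For instance, nn.Linear typically records the matmul and the bias add as a single fused addmm node (the grad_fn names shown are version-dependent):

  import torch
  import torch.nn as nn

  layer = nn.Linear(in_features=8, out_features=3)   # all arcs + all biases of one layer
  x = torch.randn(1, 8)
  y = torch.sigmoid(layer(x))

  print(type(y.grad_fn).__name__)                        # SigmoidBackward0
  print(type(y.grad_fn.next_functions[0][0]).__name__)   # AddmmBackward0 (matmul + bias, fused)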

The two views diverge most sharply when:

  • You go down to scalar-op granularity (every multiply explicit).
  • You include sub-neuron operations the diagram hides — softmax, layer norm, attention masking, residual additions.
  • You’re inside a Transformer block, where many ops happen between “the embedding” and “the output,” none of which correspond to a clean per-neuron drawing.

When to use each

  • Reach for the network diagram when you’re designing a model, choosing layer widths, explaining connectivity, or reasoning about parameter counts at the layer level.
  • Reach for the computation graph when you’re debugging gradient flow, writing custom autograd functions, profiling memory (every node holds an intermediate tensor for backward), or understanding why something like .detach() or requires_grad=False works.
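
As a minimal illustration of the last point: .detach() returns a tensor with the same values but no grad_fn, so the recorded graph stops there and nothing upstream of the detach can receive a gradient:

  import torch

  W = torch.randn(4, 1, requires_grad=True)
  x = torch.randn(1, 4)

  h = x @ W           # recorded: gradients can flow back to W
  h_cut = h.detach()  # same values, but no grad_fn: the graph is cut here

  print(h.grad_fn is not None)   # True  -- h is part of the recorded graph
  print(h_cut.grad_fn)           # None  -- h_cut is not
  # Calling backward() on anything built only from h_cut would leave W.grad
  # untouched (and raises if nothing in that subgraph requires grad).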

In short: the network diagram is the architectural specification; the computation graph is the compiled program. Frameworks like PyTorch take the layer code you write and produce the latter automatically — you almost never have to draw it by hand, but knowing it’s there is what lets you reason about autograd, memory, and gradient-related bugs.
