Summary: How the standard “weight per arc, bias per neuron” convention from the multilayer-perceptron generalises (or doesn’t) to CNNs, RNNs, and Transformers.

TODO:

This is a stub page. CNN, RNN, and Transformer pages don’t exist yet — once they do, this comparison should be linked from each.

What’s universal across all architectures

  • Weights live on connections, biases live on neurons. This holds for nearly all standard architectures (MLP, CNN, RNN, Transformer): the bias is a property of the destination neuron, not of any individual arc (see the sketch after this list).
  • Input neurons are pass-through. True everywhere: they hold the raw input values and have no activation function and no parameters of their own.
  • Output-layer activation depends on the task. True everywhere: sigmoid/softmax for classification, a linear (identity) output for regression, and so on.
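
As a concrete illustration of the points above, here is a minimal PyTorch sketch (the layer sizes are arbitrary placeholders): a fully connected layer stores one weight per arc and one bias per destination neuron.

```python
import torch.nn as nn

# A fully connected (MLP) layer: 4 input neurons -> 3 output neurons.
layer = nn.Linear(in_features=4, out_features=3)

# One weight per arc: 3 x 4 = 12 connections between the two layers.
print(layer.weight.shape)  # torch.Size([3, 4])

# One bias per destination (output) neuron.
print(layer.bias.shape)    # torch.Size([3])
```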

Where it varies

CNNs (convolutional networks)

  • Weights are shared across spatial positions. A convolutional filter is a small set of weights (e.g. 3×3) that gets reused at every position in the input. So instead of one weight per arc, you have one weight per position in the filter, replicated across many “arcs.”
  • Biases still belong to neurons, but in CNNs there is typically one bias per filter (shared across all spatial positions where that filter is applied), not one per output neuron. This is part of the same weight-sharing principle; see the sketch after this list.
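
A minimal PyTorch sketch of the CNN parameter layout (channel counts and kernel size are arbitrary placeholders). Note that the parameter count depends only on the filter size and the number of filters, not on the spatial size of the input.

```python
import torch.nn as nn

# A convolutional layer: 3 input channels, 16 filters, each 3x3.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# Weights: one 3x3 kernel per (filter, input channel) pair, reused at
# every spatial position of the input.
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3])

# Biases: one per filter, shared spatially -- not one per output neuron.
print(conv.bias.shape)    # torch.Size([16])
```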

RNNs (recurrent networks)

  • Same weight-and-bias structure as MLPs, but the weights are reused across time steps: the same input-to-hidden and hidden-to-hidden matrices process every step of the sequence (see the sketch after this list).
  • Biases still belong to neurons, one per neuron in each layer, and they too are reused at every step.
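
A minimal sketch of a vanilla (Elman-style) RNN cell in PyTorch, with arbitrary placeholder sizes, to make the reuse across time steps explicit:

```python
import torch

input_size, hidden_size, seq_len = 8, 16, 5

# One input-to-hidden matrix, one hidden-to-hidden matrix, and one bias
# per hidden neuron -- all shared across every time step.
W_ih = torch.randn(hidden_size, input_size)
W_hh = torch.randn(hidden_size, hidden_size)
b = torch.zeros(hidden_size)

xs = torch.randn(seq_len, input_size)  # a toy input sequence
h = torch.zeros(hidden_size)           # initial hidden state
for x_t in xs:                         # the same parameters apply at each step
    h = torch.tanh(W_ih @ x_t + W_hh @ h + b)
```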

Transformers

  • Mostly follow the MLP convention (weights on connections, biases on neurons) within each linear/feed-forward sublayer.
  • Some Transformer variants drop biases entirely in certain projections (e.g. LLaMA omits biases in attention and feed-forward layers). Empirically, removing biases barely affects performance and saves a small number of parameters. So “every neuron has a bias” is a default, not a law.
  • LayerNorm and other normalisation layers have their own learnable parameters (a scale and a shift) that act bias-like, but they are applied after normalisation rather than inside a linear map (see the sketch after this list).
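
A minimal PyTorch sketch of both points, with an arbitrary placeholder model width: a bias-free projection in the LLaMA style, next to a LayerNorm that keeps its own scale and shift.

```python
import torch.nn as nn

d_model = 512  # arbitrary placeholder model width

# A LLaMA-style projection: weights only, no bias term at all.
proj = nn.Linear(d_model, d_model, bias=False)
print(proj.bias)                       # None

# LayerNorm keeps its own learnable scale ("weight") and shift ("bias"),
# applied after normalisation rather than inside a linear map.
ln = nn.LayerNorm(d_model)
print(ln.weight.shape, ln.bias.shape)  # torch.Size([512]) torch.Size([512])
```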

Batch normalisation

  • When BatchNorm is used, the bias of the preceding layer becomes redundant (BatchNorm has its own shift parameter), so practitioners often disable biases on layers immediately before BatchNorm.
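
A common way this shows up in code, sketched here in PyTorch (channel counts are placeholders): the convolution before a BatchNorm layer is created without a bias, since BatchNorm's shift parameter would absorb it anyway.

```python
import torch.nn as nn

# Conv -> BatchNorm -> ReLU: the conv bias is disabled because
# BatchNorm's own learnable shift (beta) makes it redundant.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),  # has its own learnable scale and shift
    nn.ReLU(),
)
```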

Summary table

| Architecture | Weights | Biases | Notes |
| --- | --- | --- | --- |
| multilayer-perceptron | One per arc | One per neuron | The baseline convention |
| CNN | Shared across spatial positions (one set per filter) | One per filter (shared spatially) | Weight-sharing reduces parameters |
| RNN | Shared across time steps | One per neuron | Same matrices reused per step |
| Transformer | One per arc (in linear sublayers) | One per neuron, or omitted (e.g. LLaMA) | Biases are often dropped |
| With BatchNorm | Standard | Often disabled on the preceding layer | BatchNorm absorbs the shift |

Takeaway

The “weights on arcs, one bias per neuron, nonlinearity in hidden layers” model is the right starting point and holds for MLPs exactly. For other architectures:

  • Weights can be shared (CNN, RNN) or have extra structure (attention), but they still live on connections.
  • Biases are usually one-per-neuron, but can be shared (CNN filters), omitted (some Transformers), or absorbed into a normalisation layer.
  • Activations in hidden layers are nearly always nonlinear; the choice of function varies (ReLU, GELU, SwiGLU, etc.).

The variations are optimisations and architectural tricks layered on top of the basic neural-network abstraction, not fundamental departures from it.

Sources

  • General ML knowledge; to be backed by architecture-specific sources as the corresponding wiki pages are added.