Summary: How the standard “weight per arc, bias per neuron” convention from the multilayer-perceptron generalises (or doesn’t) to CNNs, RNNs, and Transformers.

TODO:

This is a stub page. CNN, RNN, and Transformer pages don’t exist yet — once they do, this comparison should be linked from each.

What’s universal across all architectures

  • Weights live on connections, biases live on neurons. This holds for nearly all standard architectures (MLP, CNN, RNN, Transformer): the bias is a property of the destination neuron, not of any individual arc (see the sketch after this list).
  • Input neurons are pass-through. True everywhere: they hold the raw input values and have no activation function and no parameters of their own.
  • Output-layer activation depends on the task. True everywhere: sigmoid/softmax for classification, a linear (identity) output for regression, and so on.
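
As a concrete illustration of the points above, here is a minimal PyTorch sketch (the layer sizes are arbitrary placeholders): a fully connected layer stores one weight per arc and one bias per destination neuron.

```python
import torch.nn as nn

# A fully connected (MLP) layer: 4 input neurons -> 3 output neurons.
layer = nn.Linear(in_features=4, out_features=3)

# One weight per arc: 3 x 4 = 12 connections between the two layers.
print(layer.weight.shape)  # torch.Size([3, 4])

# One bias per destination (output) neuron.
print(layer.bias.shape)    # torch.Size([3])
```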

Where it varies

CNNs (convolutional networks)

  • Weights are shared across spatial positions. A convolutional filter is a small set of weights (e.g. 3×3) that gets reused at every position in the input. So instead of one weight per arc, you have one weight per position in the filter, replicated across many “arcs.”
  • Biases still belong to neurons, but in CNNs there is typically one bias per filter (shared across all spatial positions where that filter is applied), not one per output neuron. This is part of the same weight-sharing principle; see the sketch after this list.
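
A minimal PyTorch sketch of the CNN parameter layout (channel counts and kernel size are arbitrary placeholders). Note that the parameter count depends only on the filter size and the number of filters, not on the spatial size of the input.

```python
import torch.nn as nn

# A convolutional layer: 3 input channels, 16 filters, each 3x3.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# Weights: one 3x3 kernel per (filter, input channel) pair, reused at
# every spatial position of the input.
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3])

# Biases: one per filter, shared spatially -- not one per output neuron.
print(conv.bias.shape)    # torch.Size([16])
```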

RNNs (recurrent networks)

  • Same weight-and-bias structure as MLPs, but the weights are reused across time steps: the same input-to-hidden and hidden-to-hidden matrices process every step of the sequence (see the sketch after this list).
  • Biases still belong to neurons, one per neuron in each layer, and they too are reused at every step.
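
A minimal sketch of a vanilla (Elman-style) RNN cell in PyTorch, with arbitrary placeholder sizes, to make the reuse across time steps explicit:

```python
import torch

input_size, hidden_size, seq_len = 8, 16, 5

# One input-to-hidden matrix, one hidden-to-hidden matrix, and one bias
# per hidden neuron -- all shared across every time step.
W_ih = torch.randn(hidden_size, input_size)
W_hh = torch.randn(hidden_size, hidden_size)
b = torch.zeros(hidden_size)

xs = torch.randn(seq_len, input_size)  # a toy input sequence
h = torch.zeros(hidden_size)           # initial hidden state
for x_t in xs:                         # the same parameters apply at each step
    h = torch.tanh(W_ih @ x_t + W_hh @ h + b)
```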

Transformers

  • Mostly follow the MLP convention (weights on connections, biases on neurons) within each linear/feed-forward sublayer.
  • Some Transformer variants drop biases entirely in certain projections (e.g. LLaMA omits biases in attention and feed-forward layers). Empirically, removing biases barely affects performance and saves a small number of parameters. So “every neuron has a bias” is a default, not a law.
  • LayerNorm and other normalisation layers have their own learnable parameters (a scale and a shift) that act bias-like, but they are applied after normalisation rather than inside a linear map (see the sketch after this list).
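
A minimal PyTorch sketch of both points, with an arbitrary placeholder model width: a bias-free projection in the LLaMA style, next to a LayerNorm that keeps its own scale and shift.

```python
import torch.nn as nn

d_model = 512  # arbitrary placeholder model width

# A LLaMA-style projection: weights only, no bias term at all.
proj = nn.Linear(d_model, d_model, bias=False)
print(proj.bias)                       # None

# LayerNorm keeps its own learnable scale ("weight") and shift ("bias"),
# applied after normalisation rather than inside a linear map.
ln = nn.LayerNorm(d_model)
print(ln.weight.shape, ln.bias.shape)  # torch.Size([512]) torch.Size([512])
```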

Batch normalisation

  • When BatchNorm is used, the bias of the preceding layer becomes redundant (BatchNorm has its own shift parameter), so practitioners often disable biases on layers immediately before BatchNorm.
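
A common way this shows up in code, sketched here in PyTorch (channel counts are placeholders): the convolution before a BatchNorm layer is created without a bias, since BatchNorm's shift parameter would absorb it anyway.

```python
import torch.nn as nn

# Conv -> BatchNorm -> ReLU: the conv bias is disabled because
# BatchNorm's own learnable shift (beta) makes it redundant.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),  # has its own learnable scale and shift
    nn.ReLU(),
)
```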

Summary table

| Architecture | Weights | Biases | Notes |
| --- | --- | --- | --- |
| multilayer-perceptron | One per arc | One per neuron | The baseline convention |
| CNN | Shared across spatial positions (one set per filter) | One per filter (shared spatially) | Weight-sharing reduces parameters |
| RNN | Shared across time steps | One per neuron | Same matrices reused per step |
| Transformer | One per arc (in linear sublayers) | One per neuron, or omitted (e.g. LLaMA) | Biases are often dropped |
| With BatchNorm | Standard | Often disabled on the preceding layer | BatchNorm absorbs the shift |

Takeaway

The “weights on arcs, one bias per neuron, nonlinearity in hidden layers” model is the right starting point and holds for MLPs exactly. For other architectures:

  • Weights can be shared (CNN, RNN) or have extra structure (attention), but they still live on connections.
  • Biases are usually one-per-neuron, but can be shared (CNN filters), omitted (some Transformers), or absorbed into a normalisation layer.
  • Activations in hidden layers are nearly always nonlinear; the choice of function varies (ReLU, GELU, SwiGLU, etc.).

The variations are optimisations and architectural tricks layered on top of the basic neural-network abstraction, not fundamental departures from it.

Sources

  • General ML knowledge; to be backed by architecture-specific sources as the corresponding wiki pages are added.