Summary: A technique to represent categorical integer indices as binary vectors so that neural networks cannot misinterpret them as ordinal numbers.
Why integers can’t be fed directly to a neural network
Raw integer indices for categorical data should not be fed directly into a neural network. Integers imply false ordinal relationships: the network would interpret 'c' = 3 as literally three times 'a' = 1, which is meaningless for categorical data. Downstream multiplicative and non-linear operations would only compound this spurious ordering.
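A minimal sketch of the problem (the weight and the index labels here are made up purely for illustration): a linear operation applied to raw integer indices produces outputs that scale with the arbitrary index value.

```python
import torch

# Hypothetical integer labels: 'a' = 1, 'c' = 3 (arbitrary, no real order)
a = torch.tensor([1.0])
c = torch.tensor([3.0])

w = torch.tensor([0.5])   # a single weight of a toy linear layer
print(a * w, c * w)       # tensor([0.5000]) tensor([1.5000])
# The output for 'c' is exactly three times the output for 'a', a relationship
# that exists only because of the arbitrary integer labels, not the data.
```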
One-hot vectors make every category equidistant from every other: each vector lies on a different axis of $\mathbb{R}^C$ (where $C$ is the number of classes), so no pair of categories is “closer” or “further” than any other.
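A quick numerical check of the equidistance claim (a minimal sketch; the class count is arbitrary): every pair of distinct one-hot vectors is at Euclidean distance $\sqrt{2}$.

```python
import torch
import torch.nn.functional as F

C = 5                                          # arbitrary number of classes for the demo
onehots = F.one_hot(torch.arange(C)).float()   # (C, C) identity matrix
dists = torch.cdist(onehots, onehots)          # pairwise Euclidean distances
print(dists)                                   # every off-diagonal entry is sqrt(2) ≈ 1.4142
```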
Encoding
For a vocabulary of size $C$, the one-hot vector for index $i$ is the standard basis vector $e_i \in \mathbb{R}^C$:

$$(e_i)_j = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{otherwise} \end{cases}$$
A batch of $N$ examples becomes a matrix $X \in \{0,1\}^{N \times C}$, one row per example. The dtype must be float (not int) so gradients can flow during backpropagation.
import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes=C).float()  # (N, C)

GPT-3 example: encoding "The cat sat" ($C = 50{,}257$)
Token indices (BPE):
"The"→ 464,"cat"→ 3797,"sat"→ 3332. Each becomes a length-50,257 vector.For example,
"The":Stacked as a batch of :
Each row contains one
1and 50,256 zeros. The full matrix holds 150,771 floats — 99.998% of which are zero. This is why embeddings replace one-hot encoding in practice: an embedding lookup is the same row-select operation (see row-select insight), but without materialising the sparse matrix.
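The same encoding in PyTorch (a minimal sketch; the token IDs are the BPE indices quoted above, and the tokenisation step itself is assumed to have already happened):

```python
import torch
import torch.nn.functional as F

C = 50_257                                   # GPT-3 BPE vocabulary size
xs = torch.tensor([464, 3797, 3332])         # "The", "cat", "sat"

xenc = F.one_hot(xs, num_classes=C).float()  # (3, 50257)
print(xenc.shape)                            # torch.Size([3, 50257])
print(xenc.numel())                          # 150771 floats in total
print((xenc == 0).sum())                     # 150768 zeros (~99.998% of all entries)
```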
Neat insight: one-hot × weight matrix = row select
Multiplying a one-hot vector $e_i$ by a weight matrix $W \in \mathbb{R}^{C \times d}$ is algebraically equivalent to selecting a single row (the $i$-th row) from $W$:

$$e_i^\top W = W_{i,:}$$

GPT-3 example: $W \in \mathbb{R}^{50257 \times d}$, selecting row 464 ("The"):

$$e_{464}^\top W = W_{464,:}$$

GPT-3 batch: $N = 3$ tokens ("The cat sat"), each row of the product is a row-select:

$$X W = \begin{bmatrix} e_{464}^\top \\ e_{3797}^\top \\ e_{3332}^\top \end{bmatrix} W = \begin{bmatrix} W_{464,:} \\ W_{3797,:} \\ W_{3332,:} \end{bmatrix}$$

Each output row is a direct copy of the corresponding row of $W$ — three simultaneous lookups, no arithmetic.
The single 1 in each one-hot vector picks out its aligned row of $W$, and the zeros wipe out every other row. In effect no arithmetic is needed; it's a lookup.
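A numerical check of the equivalence (a minimal sketch; the vocabulary size, output width, and random weights are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, d = 27, 4                                  # arbitrary vocab size and output width
W = torch.randn(C, d)

xs = torch.tensor([0, 5, 13])                 # arbitrary example indices
xenc = F.one_hot(xs, num_classes=C).float()   # (3, 27)

matmul_result = xenc @ W                      # (3, 27) @ (27, 4) -> (3, 4)
indexed_result = W[xs]                        # direct row indexing (what an embedding lookup does)

print(torch.allclose(matmul_result, indexed_result))  # True
```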
Implication for bigram language models
This means xenc @ W (where xenc is a one-hot batch) produces one row of $W$ per training example — effectively treating each row of $W$ as the learned log-counts for that input category. When gradient descent converges, W.exp() (normalised row-wise) recovers the same probabilities that an explicit count-based frequency table would give; the two approaches are identical in their final result and differ only in how they arrive there (counting vs. gradient descent).
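A condensed sketch of that setup in the style of a character-level bigram model (assumptions: a 27-character vocabulary, made-up bigram pairs, and a plain softmax over the logits; this is one gradient step, not a full training loop):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C = 27                                        # e.g. 26 letters + a boundary token
W = torch.randn(C, C, requires_grad=True)     # row i = learned log-counts for the char following i

xs = torch.tensor([0, 5, 13, 5])              # made-up input character indices
ys = torch.tensor([5, 13, 5, 0])              # made-up target next-character indices

xenc = F.one_hot(xs, num_classes=C).float()   # (4, 27)
logits = xenc @ W                             # each row is a row of W (row-select)
counts = logits.exp()                         # "counts" analogous to a frequency table
probs = counts / counts.sum(1, keepdim=True)  # row-normalise into probabilities

loss = -probs[torch.arange(len(ys)), ys].log().mean()  # negative log-likelihood
loss.backward()                               # gradient descent nudges W toward the count table
```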
Sources
- Relevant Jupyter notebooks: