- Next: 02_train_mlp
- Related: word-embedding, backpropagation, multilayer-perceptron
# imports and data ingest
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline
# read in all the words
words = open('data/names.txt', 'r').read().splitlines()
print('number of names:', len(words), '\nfirst 8 names:', words[:8])
number of names: 32033
first 8 names: ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']
A bigram character-level LM does not scale well beyond 1 character of context. The counts matrix N (or weight matrix W of log-counts) grows exponentially with each additional character of context for next-character prediction, as the calculation after this list shows:
- For 1 character of context: N, W are 27 × 27 (729 entries) - built this in 01_define_bigram_model
- For 2 characters of context: N, W are 27² × 27 (19,683 entries)
- For 3 characters of context: N, W are 27³ × 27 (531,441 entries)
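A quick back-of-the-envelope check of that growth (assuming this notebook's 27-character vocabulary; a minimal sketch, not code from the original notebook):
# size of the counts matrix N for k characters of context: 27**k contexts x 27 next-characters
for k in range(1, 4):
    rows = 27 ** k
    print(f'{k} char(s) of context: {rows} rows x 27 cols = {rows * 27:,} entries')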
How to build an MLP
Image: NN structure in Bengio et al. 2003 - Inspiring example for this notebook's MLP
Andrej Karpathy's summary of the Bengio et al. 2003 NN
- word-embedding Embed all ~17,000 words (vocabulary) into a much smaller dimensionality space (e.g. 30-100 dims).
- E.g. each word will have an associated 30-dim feature vector, thereby “embedding” it into that 30-dim space
- Word indices: each vocabulary word is identified by an integer index
- Embedding matrix (lookup table) C of all words: one row per word, i.e. 17,000 × 30
- Each word vector is just a row of C, i.e. the embedding vector for that word
- Initially, word embeddings (vectors) are randomly initialised in the 30-dim space.
- Use an MLP NN to predict the next word given the 3 previous words
- Input layer: 30 neurons per word, 3 words → 90 input neurons
- Hidden layer 1: fully connected (to 90 input neurons) layer with:
- neurons: the number of neurons in the hidden layer is a “hyperparameter” (design choice)
- tanh nonlinearity
- Output layer: 1 neuron per “next word”, so 17,000 neurons (with logits) ← very expensive layer
- softmax nonlinearity: logits.exp() → normalise (sum to 1) → prob. dist. of next word in sequence
- Neural network’s parameters:
- weights and biases of output layer
- weights and biases of hidden layer
- weights and biases of input layer
- and the embedding matrix (lookup table) C
- Modelling approach:
- To train the NN, maximise log-likelihood of the training data
- During training, we have labels (i.e. we know the identity of the correct next word in sequence)
- Use the correct next word’s index to maximise its probability wrt the NN parameters
- backpropagation adjusts the parameters: the word embeddings, and the weights and biases of all layers
- We expect words with similar meanings to end up clustered in the space (high dot product)
- Words with different meanings to be in different parts of the space (low dot product)
- And highly tuned weights and biases to maximise the probability of the correct next word.
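A minimal sketch of the parameter shapes just described (the hidden size of 100 and the variable names are illustrative assumptions; the hidden size is the hyperparameter mentioned above, and these tensors are redefined later for the character model):
# parameter shapes for a Bengio-style word-level MLP (illustrative sizes)
import torch
vocab_size, embed_dim, context_len, hidden = 17000, 30, 3, 100  # hidden=100 is an assumed choice
C_words = torch.randn((vocab_size, embed_dim))       # embedding matrix (lookup table): one 30-dim vector per word
W1 = torch.randn((context_len * embed_dim, hidden))  # hidden layer weights: 90 inputs -> 100 neurons
b1 = torch.randn(hidden)                             # hidden layer biases
W2 = torch.randn((hidden, vocab_size))               # output layer weights: 100 -> 17,000 logits (the expensive layer)
b2 = torch.randn(vocab_size)                         # output layer biases
print(sum(p.nelement() for p in [C_words, W1, b1, W2, b2]))  # most parameters sit in C_words and the output layer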
Some insights
- If a phrase is encountered in testing/inference that was never encountered in training, it is said to be “out of distribution”
- A well-trained network can transfer knowledge through nearby embeddings to predict a reasonable next token.
- e.g. “the” may be near “a” and the model understands these are somewhat interchangeable
- e.g. the model recognises “cat” and “dog” are animals (similar embedding) and co-occur in many similar contexts
- Through the embeddings, models can hence generalise to novel scenarios
# build the vocabulary of characters, and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
Build MLP
Build dataset (training examples)
A single training example looks like a sliding window of context: 3 input tokens (chars) → 1 output token (label / desired output)
- Two examples: ... -> e or emm -> a
names.txt has 32,033 names, equating to 228,146 training examples (see X.shape or Y.shape).
Load all training examples (inspect output showing 32 training examples in the first 5 names):
# build dataset X and Y (& print the 32, 3-char examples in first 5 names)
block_size = 3 # context length: how many characters do we take to predict the next one?
X, Y = [], [] # X: NN input training examples, Y: labels for each input in X
for i, w in enumerate(words):
# for w in words:
if i < 5: print(w)
context = [0] * block_size
for ch in w + '.':
ix = stoi[ch]
X.append(context)
Y.append(ix)
if i < 5: print(''.join(itos[i] for i in context), '->', itos[ix])
context = context[1:] + [ix] # crop and append
X = torch.tensor(X)
Y = torch.tensor(Y)
print('\nX.shape:', X.shape, 'X.dtype:', X.dtype)
print('Y.shape:', Y.shape, 'Y.dtype:', Y.dtype)
emma
... -> e
..e -> m
.em -> m
emm -> a
mma -> .
olivia
... -> o
..o -> l
.ol -> i
oli -> v
liv -> i
ivi -> a
via -> .
ava
... -> a
..a -> v
.av -> a
ava -> .
isabella
... -> i
..i -> s
.is -> a
isa -> b
sab -> e
abe -> l
bel -> l
ell -> a
lla -> .
sophia
... -> s
..s -> o
.so -> p
sop -> h
oph -> i
phi -> a
hia -> .
X.shape: torch.Size([228146, 3]) X.dtype: torch.int64
Y.shape: torch.Size([228146]) Y.dtype: torch.int64
# inspect shape, dtype of X (NN input training ex's) & Y (labels for each input in X)
print('\nX.shape:', X.shape, 'X.dtype:', X.dtype)
print('Y.shape:', Y.shape, 'Y.dtype:', Y.dtype)
X.shape: torch.Size([228146, 3]) X.dtype: torch.int64
Y.shape: torch.Size([228146]) Y.dtype: torch.int64
# i - inspecting the training examples in X
print('first row in names.txt:', words[0])
print('...', X[0])
print('..e', X[1])
print('.em', X[2])
print('emm', X[3])
print('mma', X[4])
print('\nsecond row in names.txt:', words[1])
print('...', X[5])
print('..o', X[6])
print('.ol', X[7])
first row in names.txt: emma
... tensor([0, 0, 0])
..e tensor([0, 0, 5])
.em tensor([ 0, 5, 13])
emm tensor([ 5, 13, 13])
mma tensor([13, 13, 1])
second row in names.txt: olivia
... tensor([0, 0, 0])
..o tensor([ 0, 0, 15])
.ol tensor([ 0, 15, 12])
Create embedding matrix (lookup table)
- Bengio et al.: 17,000-word vocabulary size → embedded into 30-dim space (see word-embedding)
- This toy model: 27 characters → embedded into 2-dim space, so the lookup table C has shape 27 × 2
Recall, multiplying a one-hot encoded vector by C is identical to selecting a row from C.
The input neurons (i.e. the embedding process) have two equivalent interpretations:
- “Look up”: Explicitly index a row in lookup table C. This is that token’s embedding vector.
- One-hot encode: Interpret the input neurons as if they’re a “linear layer” with a fake 0th input layer preceding them.
- C is the weight matrix fully connecting the “fake 0th input layer” to our “linear” input layer
- The fake 0th layer is made up of one-hot encoded integers
- One-hot vectors (fake inputs) @ C (weight matrix) → embedding vectors (actual inputs to the network), equivalent to a row select from C
Hence the following are equivalent:
- C[5] - explicitly index the 5th row in C (easier, faster)
- F.one_hot(torch.tensor(5), num_classes=27).float() @ C - identical behaviour
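A quick sanity check of this equivalence (a minimal sketch; C_demo is a throwaway stand-in so the real C defined below is untouched):
# verify: one-hot row times the lookup table equals a direct row lookup
import torch
import torch.nn.functional as F
C_demo = torch.randn((27, 2))
row_via_onehot = F.one_hot(torch.tensor(5), num_classes=27).float() @ C_demo
print(torch.allclose(C_demo[5], row_via_onehot))  # True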
# i - toy vocab: all 27 chars -> simultaneously embedded into 2-dim embedding space
torch.manual_seed(2147483647)
C = torch.randn((27, 2)) # init parameter: each 1 of 27 chars has 2 embedding dims
# PyTorch can index on integer, tensor (N-dim!), or list (flexible!)
# C[X] works, because tensor C indexing with integer tensor X is a batched row lookup
# no explicit loop. typically nn.Embedding is a wrapper for this.
emb = C[X] # embedded training set (each token: scalar integer -> 2D vector)
print('Raw training examples (integer tokens) -> X.shape:', X.shape)
print('Embedding matrix (lookup table) -> C.shape:', C.shape)
print('Embedded training examples (2-dim) -> emb.shape:', emb.shape)
Raw training examples (integer tokens) -> X.shape: torch.Size([228146, 3])
Embedding matrix (lookup table) -> C.shape: torch.Size([27, 2])
Embedded training examples (2-dim) -> emb.shape: torch.Size([228146, 3, 2])
Inspect the outputs above, which explain each object’s shape.
# i - index a SPECIFIC token in training examples X, and indexing the same tokens's associated embedding vector in C
# indexing a single token (character) in two training examples:
print('\ntraining example at X[4, 2]: "a" in "mma" ->', X[4, 2])
print('training example at X[6, 2]: "o" in "..o" ->', X[6, 2])
# indexing associated embedding vectors (two methods)
print('\n2-dimensional embedding vector for token "a"', '\nmethod 1 C[1] :', C[1], '\nmethod 2 C[X][4,2] :', C[X][4,2])
print('\n2-dimensional embedding vector for token "o"', '\nmethod 1 C[15] :', C[15], '\nmethod 2 C[X][6,2] :', C[X][6,2])
training example at X[4, 2]: "a" in "mma" -> tensor(1)
training example at X[6, 2]: "o" in "..o" -> tensor(15)
2-dimensional embedding vector for token "a"
method 1 C[1] : tensor([-0.0274, -1.1008])
method 2 C[X][4,2] : tensor([-0.0274, -1.1008])
2-dimensional embedding vector for token "o"
method 1 C[15] : tensor([-1.0725, 0.7276])
method 2 C[X][6,2] : tensor([-1.0725, 0.7276])
Inspect the indexing examples above.
Implement a hidden layer
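The notebook cell for this step isn’t shown above; a minimal sketch consistent with the summary cell at the end of this section (6 inputs = 3 tokens × 2 embedding dims, 100 hidden neurons) would be:
# i - specify parameters of hidden layer, then compute hidden activations
W1 = torch.randn((6, 100))  # incoming weights: 6 inputs (3 tokens x 2 embedding dims), 100 hidden neurons
b1 = torch.randn(100)       # 100 biases live "in" the hidden layer's neurons
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # emb (228146, 3, 2) viewed as (228146, 6) -> h (228146, 100)
print('h.shape:', h.shape)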
Implement the output layer
W2: Output layer’s (incoming) weights matrix
- arg 1: 100 neurons coming into this output layer from the previous (hidden) layer
- arg 2: 27 (output) neurons in this (output) layer: 27 possible next characters
b2: Output layer bias vector (lives “in” the layer’s neurons)
# i - specify parameters of output layer
W2 = torch.randn((100, 27)) # output layer's incoming weights: 100 inputs from hidden layer, 27 output neurons
b2 = torch.randn(27) # 27 biases live "in" the output layer's neurons
Implement loss function: negative log likelihood
logits: output layer neuron activations. Interpret as “log counts”
# i - calculate "logits" (aka "log counts")
logits = h @ W2 + b2
print('logits.shape:', logits.shape)
logits.shape: torch.Size([228146, 27])
Apply softmax to calculate prob:
- logits.exp() makes the logits (“log counts”) behave like actual counts
- Normalise each row (by row sum) to convert counts → probabilities
- prob: probability distribution of the next token (character), for every training example
- Each row of prob sums to 1
# i - apply softmax
counts = logits.exp()
prob = counts / counts.sum(1, keepdims=True)
print('prob.shape:', prob.shape)
prob.shape: torch.Size([228146, 27])
Compute loss: negative log likelihood
Selecting specific probability of correct next token
Breaking down prob[torch.arange(228146), Y]:
- prob[torch.arange(228146)]: for each training example (of 228,146), get its row of next-token probabilities (27 elements, row sums to 1)
- The Y arg says: for each training example, select the specific probability of the actual next token in the sequence!
- In training, we know Y, the correct label for each training example, e.g. '.em' -> 'm' or 'emm' -> 'a'
- If prob[torch.arange(228146), Y] → 1.0 everywhere, then the model is making good (correct) next-character predictions!
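A toy illustration of this indexing pattern (made-up 3 × 4 probabilities, not the real prob tensor):
# pick one probability per row: row i contributes its element at column y_toy[i]
p_toy = torch.tensor([[0.1, 0.7, 0.1, 0.1],
                      [0.2, 0.2, 0.5, 0.1],
                      [0.9, 0.05, 0.03, 0.02]])
y_toy = torch.tensor([1, 2, 0])  # "correct" class index for each row
print(p_toy[torch.arange(3), y_toy])  # tensor([0.7000, 0.5000, 0.9000])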
# i - compute loss function (negative log likelihood)
loss = -prob[torch.arange(228146), Y].log().mean()
print('loss :', loss)
loss : tensor(19.5052)
F.cross_entropy() loss does the same thing!
Rationale: Why use F.cross_entropy()?
- More efficient forward pass: more memory efficient (no temporary intermediate counts, prob tensors created)
- Fused kernel: softmax + log + NLL computed in a single pass, avoiding materialising the full probability distribution tensor in memory
- More efficient backward pass: F.cross_entropy keeps softmax + log + NLL fused, so the backward pass has a clean closed-form (analytical) gradient (the softmax probabilities minus the one-hot target) rather than backpropping through each intermediate step separately.
- Recall: 07_breaking_up_tanh showed how decomposing a fused op into steps multiplies the number of backward pass operations needed
- Numerically more well behaved than a naive exp()
- If an element in logits is very large (e.g. 100), then counts = logits.exp() → inf for that element. Ran out of float range! prob then has a nan probability value for that next token.
- PyTorch offsets all logits by subtracting the largest element before applying exp()
- Largest element becomes 0, all others ≤ 0
- exp() of non-positive numbers stays in [0, 1], so no overflow
- Softmax is invariant to this shift (only relative differences matter)
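A small demonstration of the overflow issue and the max-subtraction fix (made-up logits, not values from the model):
# naive softmax overflows for a large logit; subtracting the max keeps exp() in range
big_logits = torch.tensor([1.0, 2.0, 100.0])
naive = big_logits.exp() / big_logits.exp().sum()  # exp(100) overflows float32 -> inf, and inf/inf -> nan
shifted = (big_logits - big_logits.max()).exp()    # largest logit becomes 0, the rest go negative
stable = shifted / shifted.sum()                   # same distribution, no overflow
print(naive)   # tensor([0., 0., nan])
print(stable)  # approximately tensor([0., 0., 1.]) with tiny but finite probabilities for the first two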
F.cross_entropy(logits, Y)
tensor(19.5052)
Summary of full network
# i - summary
print('X:', X.shape, '-> Y:', Y.shape)
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, 2), generator=g) # embedding matrix (lookup table for input tokens)
W1 = torch.randn((6, 100), generator=g) # hidden layer's incoming weights: 6 inputs to layer, 100 hidden neurons in layer
b1 = torch.randn(100, generator=g) # 100 biases live "in" hidden layer's neurons
W2 = torch.randn((100, 27), generator=g) # output layer's incoming weights: 100 inputs to layer, 27 output neurons in layer
b2 = torch.randn(27, generator=g) # 27 biases live "in" output layer's neurons
parameters = [C, W1, b1, W2, b2] # list of all parameters (makes easier to count)
print('num. of parameters:', sum(p.nelement() for p in parameters)) # total parameter count in network
emb = C[X] # (228146, 3, 2) -> (228146, 6) on next line via emb.view(-1, 6)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (228146, 100)
logits = h @ W2 + b2 # (228146, 27)
loss = F.cross_entropy(logits, Y) # simpler!
loss
X: torch.Size([228146, 3]) -> Y: torch.Size([228146])
num. of parameters: 3481
tensor(19.5052)
Sources
- YouTube: The spelled-out intro to language modeling: building makemore
- Bengio et al. 2003: A Neural Probabilistic Language Model (implemented here)
- karpathy/makemore on GitHub
- Google Colab: Exercises
- ezyang’s blog: PyTorch Internals
