- Next: 02_train_mlp
- Related: word-embedding, backpropagation, multilayer-perceptron
# imports and data ingest
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline
# read in all the words
words = open('data/names.txt', 'r').read().splitlines()
print('number of names:', len(words), '\nfirst 8 names:', words[:8])
number of names: 32033
first 8 names: ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']
A bigram character-level LM does not scale well beyond 1 character of context. The counts matrix N (or weight matrix W of log-counts) grows exponentially with each additional character of context for next-character prediction, as the calculation after this list shows:
- For 1 character of context: N, W are 27 × 27 (729 entries) - built this in 01_define_bigram_model
- For 2 characters of context: N, W are 27² × 27 (19,683 entries)
- For 3 characters of context: N, W are 27³ × 27 (531,441 entries)
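A quick back-of-the-envelope check of that growth (assuming this notebook's 27-character vocabulary; a minimal sketch, not code from the original notebook):
# size of the counts matrix N for k characters of context: 27**k contexts x 27 next-characters
for k in range(1, 4):
    rows = 27 ** k
    print(f'{k} char(s) of context: {rows} rows x 27 cols = {rows * 27:,} entries')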
How to build an MLP
Image: NN structure in Bengio et al. 2003 - Inspiring example for this notebook's MLP
Andrej Karpathy's summary of the Bengio et al. 2003 NN
- word-embedding Embed all ~17,000 words (vocabulary) into a much smaller dimensionality space (e.g. 30-100 dims).
- E.g. each word will have an associated 30-dim feature vector, thereby “embedding” it into that 30-dim space
- Word indices: each vocabulary word is identified by an integer index
- Embedding matrix (lookup table) C of all words: one row per word, i.e. 17,000 × 30
- Each word vector is just a row of C, i.e. the embedding vector for that word
- Initially, word embeddings (vectors) are randomly initialised in the 30-dim space.
- Use an MLP NN to predict the next word given the 3 previous words
- Input layer: 30 neurons per word, 3 words → 90 input neurons
- Hidden layer 1: fully connected (to 90 input neurons) layer with:
- neurons: the number of neurons in the hidden layer is a “hyperparameter” (design choice)
- tanh nonlinearity
- Output layer: 1 neuron per “next word”, so 17,000 neurons (with logits) ← very expensive layer
- softmax nonlinearity: logits.exp() → normalise (sum to 1) → prob. dist. of next word in sequence
- Neural network’s parameters:
- weights and biases of output layer
- weights and biases of hidden layer
- weights and biases of input layer
- and the embedding matrix (lookup table) C
- Modelling approach:
- To train the NN, maximise log-likelihood of the training data
- During training, we have labels (i.e. we know the identity of the correct next word in sequence)
- Use the correct next word’s index to maximise its probability wrt the NN parameters
- backpropagation adjusts the parameters: the word embeddings, and the weights and biases of all layers
- We expect words with similar meanings to end up clustered in the space (high dot product)
- Words with different meanings to be in different parts of the space (low dot product)
- And highly tuned weights and biases to maximise the probability of the correct next word.
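A minimal sketch of the parameter shapes just described (the hidden size of 100 and the variable names are illustrative assumptions; the hidden size is the hyperparameter mentioned above, and these tensors are redefined later for the character model):
# parameter shapes for a Bengio-style word-level MLP (illustrative sizes)
import torch
vocab_size, embed_dim, context_len, hidden = 17000, 30, 3, 100  # hidden=100 is an assumed choice
C_words = torch.randn((vocab_size, embed_dim))       # embedding matrix (lookup table): one 30-dim vector per word
W1 = torch.randn((context_len * embed_dim, hidden))  # hidden layer weights: 90 inputs -> 100 neurons
b1 = torch.randn(hidden)                             # hidden layer biases
W2 = torch.randn((hidden, vocab_size))               # output layer weights: 100 -> 17,000 logits (the expensive layer)
b2 = torch.randn(vocab_size)                         # output layer biases
print(sum(p.nelement() for p in [C_words, W1, b1, W2, b2]))  # most parameters sit in C_words and the output layer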
Some insights
- If a phrase is encountered in testing/inference that was never encountered in training, it is said to be “out of distribution”
- A well-trained network can transfer knowledge through nearby embeddings to predict a reasonable next token.
- e.g. “the” may be near “a” and the model understands these are somewhat interchangeable
- e.g. the model recognises “cat” and “dog” are animals (similar embedding) and co-occur in many similar contexts
- Through the embeddings, models can hence generalise to novel scenarios
# build the vocabulary of characters, and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
Build MLP
Build dataset (training examples)
A single training example looks like a sliding window of context: 3 input tokens (chars) → 1 output token (label / desired output)
- Two examples: ... -> e or emm -> a
names.txt has 32,033 names, equating to 228,146 training examples (see X.shape or Y.shape).
Load all training examples (inspect output showing 32 training examples in the first 5 names):
# build dataset X and Y (& print the 32, 3-char examples in first 5 names)
block_size = 3 # context length: how many characters do we take to predict the next one?
X, Y = [], [] # X: NN input training examples, Y: labels for each input in X
for i, w in enumerate(words):
# for w in words:
if i < 5: print(w)
context = [0] * block_size
for ch in w + '.':
ix = stoi[ch]
X.append(context)
Y.append(ix)
if i < 5: print(''.join(itos[i] for i in context), '->', itos[ix])
context = context[1:] + [ix] # crop and append
X = torch.tensor(X)
Y = torch.tensor(Y)
print('\nX.shape:', X.shape, 'X.dtype:', X.dtype)
print('Y.shape:', Y.shape, 'Y.dtype:', Y.dtype)
emma
... -> e
..e -> m
.em -> m
emm -> a
mma -> .
olivia
... -> o
..o -> l
.ol -> i
oli -> v
liv -> i
ivi -> a
via -> .
ava
... -> a
..a -> v
.av -> a
ava -> .
isabella
... -> i
..i -> s
.is -> a
isa -> b
sab -> e
abe -> l
bel -> l
ell -> a
lla -> .
sophia
... -> s
..s -> o
.so -> p
sop -> h
oph -> i
phi -> a
hia -> .
X.shape: torch.Size([228146, 3]) X.dtype: torch.int64
Y.shape: torch.Size([228146]) Y.dtype: torch.int64
# inspect shape, dtype of X (NN input training ex's) & Y (labels for each input in X)
print('\nX.shape:', X.shape, 'X.dtype:', X.dtype)
print('Y.shape:', Y.shape, 'Y.dtype:', Y.dtype)
X.shape: torch.Size([228146, 3]) X.dtype: torch.int64
Y.shape: torch.Size([228146]) Y.dtype: torch.int64
# i - inspecting the training examples in X
print('first row in names.txt:', words[0])
print('...', X[0])
print('..e', X[1])
print('.em', X[2])
print('emm', X[3])
print('mma', X[4])
print('\nsecond row in names.txt:', words[1])
print('...', X[5])
print('..o', X[6])
print('.ol', X[7])
first row in names.txt: emma
... tensor([0, 0, 0])
..e tensor([0, 0, 5])
.em tensor([ 0, 5, 13])
emm tensor([ 5, 13, 13])
mma tensor([13, 13, 1])
second row in names.txt: olivia
... tensor([0, 0, 0])
..o tensor([ 0, 0, 15])
.ol tensor([ 0, 15, 12])
Create embedding matrix (lookup table)
- Bengio et al.: 17,000-word vocabulary size → embedded into 30-dim space (see word-embedding)
- This toy model: 27 characters → embedded into 2-dim space, so the lookup table C has shape 27 × 2
Recall, multiplying a one-hot encoded vector by C is identical to selecting a row from C.
The input neurons (i.e. the embedding process) have two equivalent interpretations:
- “Look up”: Explicitly index a row in lookup table C. This is that token’s embedding vector.
- One-hot encode: Interpret the input neurons as if they’re a “linear layer” with a fake 0th input layer preceding them.
- C is the weight matrix fully connecting the “fake 0th input layer” to our “linear” input layer
- The fake 0th layer is made up of one-hot encoded integers
- One-hot vectors (fake inputs) @ C (weight matrix) → embedding vectors (actual inputs to the network), equivalent to a row select from C
Hence the following are equivalent:
- C[5] - explicitly index the 5th row in C (easier, faster)
- F.one_hot(torch.tensor(5), num_classes=27).float() @ C - identical behaviour
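A quick sanity check of this equivalence (a minimal sketch; C_demo is a throwaway stand-in so the real C defined below is untouched):
# verify: one-hot row times the lookup table equals a direct row lookup
import torch
import torch.nn.functional as F
C_demo = torch.randn((27, 2))
row_via_onehot = F.one_hot(torch.tensor(5), num_classes=27).float() @ C_demo
print(torch.allclose(C_demo[5], row_via_onehot))  # True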
# i - toy vocab: all 27 chars -> simultaneously embedded into 2-dim embedding space
torch.manual_seed(2147483647)
C = torch.randn((27, 2)) # init parameter: each 1 of 27 chars has 2 embedding dims
# PyTorch can index on integer, tensor (N-dim!), or list (flexible!)
# C[X] works, because tensor C indexing with integer tensor X is a batched row lookup
# no explicit loop. typically nn.Embedding is a wrapper for this.
emb = C[X] # embedded training set (each token: scalar integer -> 2D vector)
print('Raw training examples (integer tokens) -> X.shape:', X.shape)
print('Embedding matrix (lookup table) -> C.shape:', C.shape)
print('Embedded training examples (2-dim) -> emb.shape:', emb.shape)
Raw training examples (integer tokens) -> X.shape: torch.Size([228146, 3])
Embedding matrix (lookup table) -> C.shape: torch.Size([27, 2])
Embedded training examples (2-dim) -> emb.shape: torch.Size([228146, 3, 2])
Inspect the outputs above, which explain each object’s shape.
# i - index a SPECIFIC token in training examples X, and indexing the same tokens's associated embedding vector in C
# indexing a single token (character) in two training examples:
print('\ntraining example at X[4, 2]: "a" in "mma" ->', X[4, 2])
print('training example at X[6, 2]: "o" in "..o" ->', X[6, 2])
# indexing associated embedding vectors (two methods)
print('\n2-dimensional embedding vector for token "a"', '\nmethod 1 C[1] :', C[1], '\nmethod 2 C[X][4,2] :', C[X][4,2])
print('\n2-dimensional embedding vector for token "o"', '\nmethod 1 C[15] :', C[15], '\nmethod 2 C[X][6,2] :', C[X][6,2])
training example at X[4, 2]: "a" in "mma" -> tensor(1)
training example at X[6, 2]: "o" in "..o" -> tensor(15)
2-dimensional embedding vector for token "a"
method 1 C[1] : tensor([-0.0274, -1.1008])
method 2 C[X][4,2] : tensor([-0.0274, -1.1008])
2-dimensional embedding vector for token "o"
method 1 C[15] : tensor([-1.0725, 0.7276])
method 2 C[X][6,2] : tensor([-1.0725, 0.7276])
Inspect the indexing examples above.
Implement a hidden layer
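The notebook cell for this step isn’t shown above; a minimal sketch consistent with the summary cell at the end of this section (6 inputs = 3 tokens × 2 embedding dims, 100 hidden neurons) would be:
# i - specify parameters of hidden layer, then compute hidden activations
W1 = torch.randn((6, 100))  # incoming weights: 6 inputs (3 tokens x 2 embedding dims), 100 hidden neurons
b1 = torch.randn(100)       # 100 biases live "in" the hidden layer's neurons
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # emb (228146, 3, 2) viewed as (228146, 6) -> h (228146, 100)
print('h.shape:', h.shape)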
Implement the output layer
W2: Output layer’s (incoming) weights matrix
- arg 1: 100 neurons coming into this output layer from the previous (hidden) layer
- arg 2: 27 (output) neurons in this (output) layer: 27 possible next characters
b2: Output layer bias vector (lives “in” the layer’s neurons)
# i - specify parameters of output layer
W2 = torch.randn((100, 27)) # output layer's incoming weights: 100 inputs from hidden layer, 27 output neurons
b2 = torch.randn(27) # 27 biases live "in" the output layer's neurons
Implement loss function: negative log likelihood
logits: output layer neuron activations. Interpret as “log counts”
# i - calculate "logits" (aka "log counts")
logits = h @ W2 + b2
print('logits.shape:', logits.shape)
logits.shape: torch.Size([228146, 27])
Apply softmax to calculate prob:
- logits.exp() makes the logits (“log counts”) behave like actual counts
- Normalise each row (by row sum) to convert counts → probabilities
- prob: probability distribution of the next token (character), for every training example
- Each row of prob sums to 1
# i - apply softmax
counts = logits.exp()
prob = counts / counts.sum(1, keepdims=True)
print('prob.shape:', prob.shape)
prob.shape: torch.Size([228146, 27])
Compute loss: negative log likelihood
Selecting specific probability of correct next token
Breaking down prob[torch.arange(228146), Y]:
- prob[torch.arange(228146)]: for each training example (of 228,146), get its row of next-token probabilities (27 elements, row sums to 1)
- The Y arg says: for each training example, select the specific probability of the actual next token in the sequence!
- In training, we know Y, the correct label for each training example, e.g. '.em' -> 'm' or 'emm' -> 'a'
- If prob[torch.arange(228146), Y] → 1.0 everywhere, then the model is making good (correct) next-character predictions!
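A toy illustration of this indexing pattern (made-up 3 × 4 probabilities, not the real prob tensor):
# pick one probability per row: row i contributes its element at column y_toy[i]
p_toy = torch.tensor([[0.1, 0.7, 0.1, 0.1],
                      [0.2, 0.2, 0.5, 0.1],
                      [0.9, 0.05, 0.03, 0.02]])
y_toy = torch.tensor([1, 2, 0])  # "correct" class index for each row
print(p_toy[torch.arange(3), y_toy])  # tensor([0.7000, 0.5000, 0.9000])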
# i - compute loss function (negative log likelihood)
loss = -prob[torch.arange(228146), Y].log().mean()
print('loss :', loss)
loss : tensor(19.5052)
F.cross_entropy() loss does the same thing!
Rationale: Why use F.cross_entropy()?
- More efficient forward pass: more memory efficient (no temporary intermediate counts, prob tensors created)
- Fused kernel: softmax + log + NLL computed in a single pass, avoiding materialising the full probability distribution tensor in memory
- More efficient backward pass: F.cross_entropy keeps softmax + log + NLL fused, so the backward pass has a clean closed-form (analytical) gradient (the softmax probabilities minus the one-hot target) rather than backpropping through each intermediate step separately.
- Recall: 07_breaking_up_tanh showed how decomposing a fused op into steps multiplies the number of backward pass operations needed
- Numerically more well behaved than a naive exp()
- If an element in logits is very large (e.g. 100), then counts = logits.exp() → inf for that element. Ran out of float range! prob then has a nan probability value for that next token.
- PyTorch offsets all logits by subtracting the largest element before applying exp()
- Largest element becomes 0, all others ≤ 0
- exp() of non-positive numbers stays in [0, 1], so no overflow
- Softmax is invariant to this shift (only relative differences matter)
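A small demonstration of the overflow issue and the max-subtraction fix (made-up logits, not values from the model):
# naive softmax overflows for a large logit; subtracting the max keeps exp() in range
big_logits = torch.tensor([1.0, 2.0, 100.0])
naive = big_logits.exp() / big_logits.exp().sum()  # exp(100) overflows float32 -> inf, and inf/inf -> nan
shifted = (big_logits - big_logits.max()).exp()    # largest logit becomes 0, the rest go negative
stable = shifted / shifted.sum()                   # same distribution, no overflow
print(naive)   # tensor([0., 0., nan])
print(stable)  # approximately tensor([0., 0., 1.]) with tiny but finite probabilities for the first two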
F.cross_entropy(logits, Y)
tensor(19.5052)
Summary of full network
# i - summary
print('X:', X.shape, '-> Y:', Y.shape)
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, 2), generator=g) # embedding matrix (lookup table for input tokens)
W1 = torch.randn((6, 100), generator=g) # hidden layer's incoming weights: 6 inputs to layer, 100 hidden neurons in layer
b1 = torch.randn(100, generator=g) # 100 biases live "in" hidden layer's neurons
W2 = torch.randn((100, 27), generator=g) # output layer's incoming weights: 100 inputs to layer, 27 output neurons in layer
b2 = torch.randn(27, generator=g) # 27 biases live "in" output layer's neurons
parameters = [C, W1, b1, W2, b2] # list of all parameters (makes easier to count)
print('num. of parameters:', sum(p.nelement() for p in parameters)) # total parameter count in network
emb = C[X] # (228146, 3, 2) -> (228146, 6) on next line via emb.view(-1, 6)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (228146, 100)
logits = h @ W2 + b2 # (228146, 27)
loss = F.cross_entropy(logits, Y) # simpler!
loss
X: torch.Size([228146, 3]) -> Y: torch.Size([228146])
num. of parameters: 3481
tensor(19.5052)
Sources
- YouTube: The spelled-out intro to language modeling: building makemore
- Bengio et al. 2003: A Neural Probabilistic Language Model (implemented here)
- karpathy/makemore on GitHub
- Google Colab: Exercises
- ezyang’s blog: PyTorch Internals
