# imports, build vocabulary, build_dataset function, create train/val/test data split.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline
 
# import data
words = open('data/names.txt', 'r').read().splitlines()
 
# build the vocabulary of characters, and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
 
# fn: build dataset (training examples X, and labels Y) for an INPUT list of names only 
block_size = 3 # context length: how many characters do we take to predict the next one?
 
def build_dataset(words):  
    X, Y = [], [] # X: NN input training examples, Y: labels for each input in X
    
    for w in words:
        #print(w)
        context = [0] * block_size
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            #print(''.join(itos[i] for i in context), '--->', itos[ix])
            context = context[1:] + [ix] # crop and append
 
    X = torch.tensor(X)
    Y = torch.tensor(Y)
    print(X.shape, Y.shape)
    return X, Y
 
# randomly shuffle words data set, and create train, val, test splits
import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8*len(words)) # index at 80% of the word list: words[:n1] becomes the training split
n2 = int(0.9*len(words)) # index at 90% of the word list: words[n1:n2] becomes the validation split
 
Xtr, Ytr = build_dataset(words[:n1])     # 80% training set (Xtr: training examples, Ytr: training labels)
Xdev, Ydev = build_dataset(words[n1:n2]) # 10% validation set
Xte, Yte = build_dataset(words[n2:])     # 10% test set
torch.Size([182625, 3]) torch.Size([182625])
torch.Size([22655, 3]) torch.Size([22655])
torch.Size([22866, 3]) torch.Size([22866])

Redefine hidden layer: 100 -> 300 neurons

Changes to hidden layer parameters W1 and b1

  • W1 : Hidden layer’s (incoming) weights matrix
    • arg 1: 6 inputs to hidden layer: three embedding vectors, each with two embedding dims
    • arg 2: 100 -> 300 (hidden) neurons in this (hidden) layer: design parameter
  • b1 : Hidden layer’s bias vector (lives “in” the layer’s neurons)
    • gets broadcast over the batch dimension (32 examples per mini-batch here, or 182,625 if the full training split is pushed through at once; see the short illustration below)
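
A minimal illustration of that broadcast, using stand-in tensors and an assumed mini-batch of 32:

# b1 is broadcast across the batch dimension when added to the hidden pre-activation
pre_act = torch.randn(32, 300)      # stand-in for emb.view(-1, 6) @ W1: (batch, hidden)
bias    = torch.randn(300)          # one bias per hidden neuron, like b1
print((pre_act + bias).shape)       # torch.Size([32, 300]): the bias is added to every row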

Change output layer parameter W2

  • W2 : Output layer’s (incoming) weights matrix
    • arg 1: 100 -> 300 neurons coming into this output layer from previous (hidden) layer
    • arg 2: 27 (output) neurons in this (output) layer: 27 possible next characters
  • b2 : Output layer bias vector (lives “in” the layer’s neurons)

So the total parameter count goes from 3,481 to 10,281
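
For reference, a quick breakdown of where the 10,281 parameters come from (previously 54 + 600 + 100 + 2,700 + 27 = 3,481 with 100 hidden neurons):

# parameter count breakdown with 300 hidden neurons
counts = {
    'C  (27 x 2)':   27 * 2,     #    54 embedding weights
    'W1 (6 x 300)':  6 * 300,    # 1,800 hidden-layer weights
    'b1 (300,)':     300,        #   300 hidden-layer biases
    'W2 (300 x 27)': 300 * 27,   # 8,100 output-layer weights
    'b2 (27,)':      27,         #    27 output-layer biases
}
print(sum(counts.values()))      # 10281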

# (i) increase hidden layer size: 100 neurons -> 300 neurons
g = torch.Generator().manual_seed(2147483647) # for reproducibility
 
# define parameters
C = torch.randn((27, 2), generator=g)         # embedding matrix (lookup table for input tokens)
W1 = torch.randn((6, 300), generator=g)       # hidden layer's incoming weights: 6 inputs to layer, NOW 300 hidden neurons in layer 
b1 = torch.randn(300, generator=g)            # NOW 300 biases live "in" hidden layer's neurons
W2 = torch.randn((300, 27), generator=g)      # output layer's incoming weights: NOW 300 inputs to layer, 27 output neurons in layer
b2 = torch.randn(27, generator=g)             # 27 biases live "in" output layer's neurons
 
parameters = [C, W1, b1, W2, b2]              # list of all parameters (makes easier to count)
print('num. of parameters:', sum(p.nelement() for p in parameters))  # total parameter count in network: 10,281
 
# ensure gradients are tracked for all 10,281 parameters (to enable optimisation)
for p in parameters:
    p.requires_grad = True
 
lossi = []   # track resulting loss on each iter
stepi = []   # track steps
num. of parameters: 10281

Run 1: Train for 60,000 iters at lr = 0.1

# Run 1: 60,000 training iters on the training split (Xtr, Ytr); mini-batches of 32 examples each.
 
for i in range(60000):
    # minibatch construct
    ix = torch.randint(0, Xtr.shape[0], (32,))
    
    # forward pass
    emb = C[Xtr[ix]] # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 300)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Ytr[ix])
    #print(loss.item())
    
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    
    # update
    lr = 0.1
    for p in parameters:
        p.data += -lr * p.grad
 
    # track stats
    stepi.append(i)
    lossi.append(loss.item())

Compare training loss to val loss

# compare train loss vs val (dev) loss
# forward pass full train split (Xtr, Ytr): clean loss number showing true model progress
emb = C[Xtr]                               # (182625, 3, 2); flattened to (182625, 6) by emb.view(-1, 6) on the next line
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (182625, 300)
logits = h @ W2 + b2                       # (182625, 27)
loss = F.cross_entropy(logits, Ytr)
print('Training Run 1 (lr = 0.1): 60,000 iters on (Xtr, Ytr)\n\ntraining loss:',loss.item())
 
# forward pass val split (Xdev, Ydev): clean loss number showing true model progress on unseen data
emb = C[Xdev]                              # (22655, 3, 2); flattened to (22655, 6) by emb.view(-1, 6) on the next line
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (22655, 300)
logits = h @ W2 + b2                       # (22655, 27)
loss = F.cross_entropy(logits, Ydev)
print('validation (Xdev -> Ydev) loss:', loss.item())
Training Run 1 (lr = 0.1): 60,000 iters on (Xtr, Ytr)
 
training loss: 2.463120222091675
validation (Xdev -> Ydev) loss: 2.4652748107910156
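
Since this train-vs-val comparison is repeated after every run, a small helper could remove the duplication. A minimal sketch, assuming a hypothetical name split_loss and using torch.no_grad() so no autograd graph is built:

# sketch: reusable evaluation helper (hypothetical; the runs below repeat the code verbatim instead)
@torch.no_grad()
def split_loss(X, Y):
    emb = C[X]                                 # (N, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (N, 300)
    logits = h @ W2 + b2                       # (N, 27)
    return F.cross_entropy(logits, Y).item()

# usage:
# print('train loss:', split_loss(Xtr, Ytr))
# print('val loss:  ', split_loss(Xdev, Ydev))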

Visualise loss as a function of training step

plt.plot(stepi, lossi)
[plot: mini-batch training loss vs. training step]

Thoughts and observations

  • Increased parameter count (larger NN) may have necessitated more training iterations
  • Mini-batches are noisy, causing gradient thrashing (see the vertical thickness of the loss plot)
    • At 32 training examples per batch, there may be too much gradient noise to optimise a larger network effectively
    • Increasing the batch size above 32 examples per iteration may help (see the sketch below)
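
A minimal sketch of that change, assuming a hypothetical batch size of 128 (everything else in the training loop stays the same):

# sketch: same training loop with a larger (hypothetical) mini-batch size
batch_size = 128                                        # was 32; larger batches give less noisy gradient estimates per step
for i in range(60000):
    ix = torch.randint(0, Xtr.shape[0], (batch_size,))  # sample a bigger mini-batch
    emb = C[Xtr[ix]]                                     # (batch_size, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)            # (batch_size, 300)
    logits = h @ W2 + b2                                 # (batch_size, 27)
    loss = F.cross_entropy(logits, Ytr[ix])
    for p in parameters:
        p.grad = None
    loss.backward()
    for p in parameters:
        p.data += -0.1 * p.grad

The learning rate might also need retuning alongside the batch size, since each gradient estimate becomes less noisy.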

Run 2: Train for 60,000 iters at lr = 0.05

# 60,000 training iters on the training split only (Xtr, Ytr); mini-batches of 32 examples each.
 
for i in range(60000):
    # minibatch construct
    ix = torch.randint(0, Xtr.shape[0], (32,))
    
    # forward pass
    emb = C[Xtr[ix]] # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 300)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Ytr[ix])
    #print(loss.item())
    
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    
    # update
    lr = 0.05
    for p in parameters:
        p.data += -lr * p.grad
 
    # track stats
    stepi.append(i)
    lossi.append(loss.item())

Compare training loss to val loss

# compare train loss vs val (dev) loss
# forward pass full train split (Xtr, Ytr): clean loss number showing true model progress
emb = C[Xtr]                               # (182625, 3, 2); flattened to (182625, 6) by emb.view(-1, 6) on the next line
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (182625, 300)
logits = h @ W2 + b2                       # (182625, 27)
loss = F.cross_entropy(logits, Ytr)
print('Training Run 2 (lr = 0.05): 60,000 iters on (Xtr, Ytr)\n\ntraining loss:',loss.item())
 
# forward pass val split (Xdev, Ydev): clean loss number showing true model progress on unseen data
emb = C[Xdev]                              # (22655, 3, 2); flattened to (22655, 6) by emb.view(-1, 6) on the next line
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (22655, 300)
logits = h @ W2 + b2                       # (22655, 27)
loss = F.cross_entropy(logits, Ydev)
print('validation (Xdev -> Ydev) loss:', loss.item())
Training Run 2 (lr = 0.05): 60,000 iters on (Xtr, Ytr)
 
training loss: 2.3143296241760254
validation (Xdev -> Ydev) loss: 2.3253331184387207

Run 3: Train for 60,000 iters at lr = 0.01

# 60,000 training iters on the training split only (Xtr, Ytr); mini-batches of 32 examples each.
 
for i in range(60000):
    # minibatch construct
    ix = torch.randint(0, Xtr.shape[0], (32,))
    
    # forward pass
    emb = C[Xtr[ix]] # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 300)
    logits = h @ W2 + b2 # (32, 27)
    loss = F.cross_entropy(logits, Ytr[ix])
    #print(loss.item())
    
    # backward pass
    for p in parameters:
        p.grad = None
    loss.backward()
    
    # update
    lr = 0.01
    for p in parameters:
        p.data += -lr * p.grad
 
    # track stats
    stepi.append(i)
    lossi.append(loss.item())

Compare training loss to val loss

# compare train loss vs val (dev) loss
# forward pass full train split (Xtr, Ytr): clean loss number showing true model progress
emb = C[Xtr]                               # (182625, 3, 2); flattened to (182625, 6) by emb.view(-1, 6) on the next line
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (182625, 300)
logits = h @ W2 + b2                       # (182625, 27)
loss = F.cross_entropy(logits, Ytr)
print('Training Run 3 (lr = 0.01): 60,000 iters on (Xtr, Ytr)\n\ntraining loss:',loss.item())
 
# forward pass val split (Xdev, Ydev): clean loss number showing true model progress on unseen data
emb = C[Xdev]                              # (22655, 3, 2); flattened to (22655, 6) by emb.view(-1, 6) on the next line
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (22655, 300)
logits = h @ W2 + b2                       # (22655, 27)
loss = F.cross_entropy(logits, Ydev)
print('validation (Xdev -> Ydev) loss:', loss.item())
Training Run 3 (lr = 0.01): 60,000 iters on (Xtr, Ytr)
 
training loss: 2.235992908477783
validation (Xdev -> Ydev) loss: 2.2460684776306152
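
Once hyperparameter tuning is finished, the held-out test split built at the start can be evaluated the same way. A minimal sketch (run once, only at the very end):

# sketch: final evaluation on the held-out test split (not used for any tuning decisions)
emb = C[Xte]                               # (22866, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (22866, 300)
logits = h @ W2 + b2                       # (22866, 27)
loss = F.cross_entropy(logits, Yte)
print('test (Xte -> Yte) loss:', loss.item())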

Visualise character embeddings

Since we used 2-dimensional embedding vectors (see 01_build_mlp, embedding section), we can visualise the model’s trained embedding matrix C as a scatter plot on the xy-plane.

Clearly there is some structure in how the model treats certain characters:

  • The start/end character . is very different to everything else, so sits apart by itself
  • Vowels a, e, i, o have clustered to the bottom left
  • q is quite unique and out by itself.
  • u is also unique and out by itself: its uses are clearly dissimilar to most other letters, and maybe more like q
  • y sits between the vowels and everything else
  • Vague clustering of “hard / closed” consonants like c, p, k, d, t.
  • Vague alignment along one axis of “soft / flowy / open” consonants like f, l, r, n, w, v, h, m

It is possible the number of embedding dimensions is another bottleneck holding back model performance. Maybe cramming 27 tokens into 2 dimensions is too ambitious, and loses their semantic meaning.
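
A minimal sketch of how that could be tested, assuming a hypothetical 10-dimensional embedding (the hidden layer’s input width must then grow to block_size * emb_dim = 30). Names with a _wide suffix are placeholders so the existing C and W1 above are not overwritten:

# sketch: parameter shapes for a (hypothetical) 10-dim embedding; would replace C and W1 in a fresh run
emb_dim  = 10
n_hidden = 300
C_wide  = torch.randn((27, emb_dim), generator=g)                     # (27, 10) lookup table
W1_wide = torch.randn((block_size * emb_dim, n_hidden), generator=g)  # hidden layer would take 3*10 = 30 inputs
# the forward pass would then flatten to the new width:
# emb = C_wide[Xtr[ix]]                                  # (32, 3, 10)
# h = torch.tanh(emb.view(-1, block_size * emb_dim) @ W1_wide + b1)

Note that with more than two embedding dimensions, the scatter plot below no longer applies directly; a dimensionality-reduction step (e.g. PCA) would be needed to view the embeddings in 2-D.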

# visualize dimensions 0 and 1 of the embedding matrix C for all characters
plt.figure(figsize=(8,8))
plt.scatter(C[:,0].data, C[:,1].data, s=200) # graphing the columns of C. x: C[:,0] and y: C[:,1]
for i in range(C.shape[0]):
    plt.text(C[i,0].item(), C[i,1].item(), itos[i], ha="center", va="center", color='white')
plt.grid('minor')
[plot: 2-D character embeddings from C, one labelled point per character]

Sources