# imports and data ingest
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline
 
# read in all the words
words = open('data/names.txt', 'r').read().splitlines()
 
print('number of names:', len(words), '\nfirst 8 names:', words[:8])
number of names: 32033 
first 8 names: ['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

A bigram character-level LM does not scale well beyond 1 character of context. The counts matrix N (or weight matrix W of log-counts) grows exponentially with each additional character of context for next-character prediction (see the quick calculation below):

  • For 1 character of context: N, W are 27 × 27 (built in 01_define_bigram_model)
  • For 2 characters of context: N, W are 27² × 27 = 729 × 27
  • For 3 characters of context: N, W are 27³ × 27 = 19,683 × 27
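
A quick back-of-envelope check of that growth, as a hypothetical sketch (it only assumes the 27-character vocabulary built below):

# table size for k characters of context over a 27-character vocabulary
vocab_size = 27
for context_len in (1, 2, 3):
    rows = vocab_size ** context_len   # one row per possible context
    print(f'{context_len} chars of context -> table of {rows:,} x {vocab_size} = {rows * vocab_size:,} entries')
# expected: 729, 19,683 and 531,441 entries respectively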

How to build an MLP

Image: NN structure in Bengio et al. 2003 - the inspiring example for this notebook's MLP

Some insights

  • If a phrase is encountered in testing/inference that was never encountered in training, it is said to be “out of distribution”
    • A well-trained network can transfer knowledge through nearby embeddings to predict a reasonable next token.
      • e.g. “the” may be near “a” and the model understands these are somewhat interchangeable
      • e.g. the model recognises “cat” and “dog” are animals (similar embedding) and co-occur in many similar contexts
  • Through the embeddings, the model can hence generalise to novel scenarios (made concrete in the illustrative sketch below)
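
As a purely illustrative sketch (not from the original notebook), "nearby" can be made concrete as cosine similarity between embedding vectors; the emb_matrix values and token names here are made up:

# hypothetical 2-dim embeddings for three tokens (values invented for illustration)
emb_matrix = torch.tensor([[ 0.9,  0.1],    # "cat"
                           [ 0.8,  0.2],    # "dog"  (close to "cat")
                           [-0.7,  0.9]])   # "the"  (far from both)
# cosine similarity of "cat" against every token: high for "dog", low/negative for "the"
sims = F.cosine_similarity(emb_matrix[0].unsqueeze(0), emb_matrix, dim=1)
print(sims)  # roughly tensor([1.00, 0.99, -0.52])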
# build the vocabulary of characters, and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}

Build MLP

Build dataset (training examples)

A single training example looks like a sliding window of context: 3 input tokens (chars) -> 1 output token (the label / desired next character).

  • Two examples: ... -> e or emm -> a
  • names.txt has 32,033 names, equating to 228,146 training examples (see X.shape or Y.shape, and the quick check below).
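
Each name of length n contributes n + 1 training examples (one per character plus the terminating '.'), so the 228,146 total can be sanity-checked directly from the words list loaded above:

# one example per character of each name, plus one for the closing '.'
print(sum(len(w) + 1 for w in words))  # 228146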

Load all training examples (inspect output showing 32 training examples in the first 5 names):

# build dataset X and Y (& print the 32, 3-char examples in first 5 names)
 
block_size = 3 # context length: how many characters do we take to predict the next one?
X, Y = [], []  # X: NN input training examples, Y: labels for each input in X
 
for i, w in enumerate(words):
    if i < 5: print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        if i < 5: print(''.join(itos[i] for i in context), '->', itos[ix])
        context = context[1:] + [ix] # crop and append
    
X = torch.tensor(X)
Y = torch.tensor(Y)
 
print('\nX.shape:', X.shape, 'X.dtype:', X.dtype)
print('Y.shape:', Y.shape, 'Y.dtype:', Y.dtype)
emma
... -> e
..e -> m
.em -> m
emm -> a
mma -> .
olivia
... -> o
..o -> l
.ol -> i
oli -> v
liv -> i
ivi -> a
via -> .
ava
... -> a
..a -> v
.av -> a
ava -> .
isabella
... -> i
..i -> s
.is -> a
isa -> b
sab -> e
abe -> l
bel -> l
ell -> a
lla -> .
sophia
... -> s
..s -> o
.so -> p
sop -> h
oph -> i
phi -> a
hia -> .
 
X.shape: torch.Size([228146, 3]) X.dtype: torch.int64
Y.shape: torch.Size([228146]) Y.dtype: torch.int64
# i - inspecting the training examples in X
print('first row in names.txt:', words[0])
print('...', X[0])
print('..e', X[1])
print('.em', X[2])
print('emm', X[3])
print('mma', X[4])
print('\nsecond row in names.txt:', words[1])
print('...', X[5])
print('..o', X[6])
print('.ol', X[7])
first row in names.txt: emma
... tensor([0, 0, 0])
..e tensor([0, 0, 5])
.em tensor([ 0,  5, 13])
emm tensor([ 5, 13, 13])
mma tensor([13, 13,  1])
 
second row in names.txt: olivia
... tensor([0, 0, 0])
..o tensor([ 0,  0, 15])
.ol tensor([ 0, 15, 12])

Create embedding matrix (lookup table)

  • Bengio et al.: a 17,000-word vocabulary embedded into a 30-dimensional space (see word-embedding)
  • This toy model: 27 characters embedded into a 2-dimensional space:
# i - toy vocab: all 27 chars -> simultaneously embedded into 2-dim embedding space
torch.manual_seed(2147483647)
C = torch.randn((27, 2))  # init parameter: each 1 of 27 chars has 2 embedding dims
 
# PyTorch can index on integer, tensor (N-dim!), or list (flexible!)
# C[X] works, because tensor C indexing with integer tensor X is a batched row lookup
# no explicit loop. typically nn.Embedding is a wrapper for this.
emb = C[X] # embedded training set (each token: scalar integer -> 2D vector)
 
print('Raw training examples (integer tokens) -> X.shape:', X.shape)
print('Embedding matrix (lookup table)        -> C.shape:', C.shape)
print('Embedded training examples (2-dim)     -> emb.shape:', emb.shape)
Raw training examples (integer tokens) -> X.shape: torch.Size([228146, 3])
Embedding matrix (lookup table)        -> C.shape: torch.Size([27, 2])
Embedded training examples (2-dim)     -> emb.shape: torch.Size([228146, 3, 2])

Inspect the outputs above, explaining each object’s shape

# i - index a SPECIFIC token in training examples X, and index that token's associated embedding vector in C
 
# indexing a single token (character) in two training examples:
print('\ntraining example at X[4, 2]: "a" in "mma" ->', X[4, 2])
print('training example at X[6, 2]: "o" in "..o" ->', X[6, 2])
 
# indexing associated embedding vectors (two methods)
print('\n2-dimensional embedding vector for token "a"', '\nmethod 1      C[1] :', C[1], '\nmethod 2 C[X][4,2] :', C[X][4,2])
print('\n2-dimensional embedding vector for token "o"', '\nmethod 1     C[15] :', C[15], '\nmethod 2 C[X][6,2] :', C[X][6,2])
training example at X[4, 2]: "a" in "mma" -> tensor(1)
training example at X[6, 2]: "o" in "..o" -> tensor(15)
 
2-dimensional embedding vector for token "a" 
method 1      C[1] : tensor([-0.0274, -1.1008]) 
method 2 C[X][4,2] : tensor([-0.0274, -1.1008])
 
2-dimensional embedding vector for token "o" 
method 1     C[15] : tensor([-1.0725,  0.7276]) 
method 2 C[X][6,2] : tensor([-1.0725,  0.7276])

Inspect indexing examples above

Implement a hidden layer

  • W1 : Hidden layer’s (incoming) weights matrix
    • arg 1: 6 inputs to hidden layer: three embedding vectors, each with two embedding dims
    • arg 2: 100 (hidden) neurons in this (hidden) layer: design parameter
  • b1 : Hidden layer’s bias vector (lives “in” the layer’s neurons)
    • gets broadcast across all 228,146 training examples (from shape (100,) to (228146, 100))
  • h : Hidden layer activations: the tanh non-linearity applied to emb.view(-1, 6) @ W1 + b1
# i - specify parameters of hidden layer
W1 = torch.randn((6 , 100))                 # (incoming) weights matrix
b1 = torch.randn(100)                       # biases (on each neuron in layer) -> broadcasted!
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)   # tanh nonlinearity (-1 lets PyTorch infer that dim so all elements are accounted for)

Inspect outputs

Expand both outputs below if confused

# inspect all object shapes: emb, emb.view(-1,6), W1, emb.view(-1,6) @ W1, b1, h
print('emb.shape                    :', emb.shape, ' -> wrong dims for MatMul with W1')
print('emb.view(-1, 6).shape        :', emb.view(-1, 6).shape, '    -> fixed: dims match')
print('W1.shape                     :', W1.shape)
print('\n(emb.view(-1, 6) @ W1).shape :', (emb.view(-1, 6) @ W1).shape, '  -> MatMul successful, now add bias')
print('\nb1.shape                     :', b1.shape, '          -> interpreted as (1, 100) -> broadcast to (228146, 100)')
print('h.shape                      :', h.shape, '  -> activations of hidden layer (100 neurons) for all 228146 examples')
emb.shape                    : torch.Size([228146, 3, 2])  -> wrong dims for MatMul with W1
emb.view(-1, 6).shape        : torch.Size([228146, 6])     -> fixed: dims match
W1.shape                     : torch.Size([6, 100])
 
(emb.view(-1, 6) @ W1).shape : torch.Size([228146, 100])   -> MatMul successful, now add bias
 
b1.shape                     : torch.Size([100])           -> interpreted as (1, 100) -> broadcast to (228146, 100)
h.shape                      : torch.Size([228146, 100])   -> activations of hidden layer (100 neurons) for all 228146 examples
# inspect h, hidden layer neuron activations (100 neurons) for all 228,146 training examples
print('\nhidden layer activations, h:')
print(h)
hidden layer activations, h:
tensor([[-0.9348,  1.0000,  0.9258,  ...,  0.9786, -0.1926,  0.9515],
        [ 0.2797,  0.9997,  0.7675,  ...,  0.9929,  0.9992,  0.9981],
        [-0.9960,  1.0000, -0.8694,  ..., -0.5159, -1.0000, -0.0069],
        ...,
        [-0.4849,  0.9972, -0.6418,  ..., -0.9641,  0.9996,  0.9873],
        [-0.9318,  0.9926, -0.9841,  ..., -0.8989, -0.9938,  0.5930],
        [-0.9736,  0.3844, -0.8744,  ..., -0.5093,  0.9998, -0.9975]])
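
A quick sanity check, not in the original notebook: the emb.view(-1, 6) flattening is equivalent to explicitly concatenating the three 2-dim embedding vectors along dim 1 (view just avoids copying memory):

# the two flattenings agree element-for-element
flat_view = emb.view(-1, 6)
flat_cat = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1)
print(torch.equal(flat_view, flat_cat))  # True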

Implement the output layer

  • W2 : Output layer’s (incoming) weights matrix
    • arg 1: 100 neurons coming into this output layer from previous (hidden) layer
    • arg 2: 27 (output) neurons in this (output) layer: 27 possible next characters
  • b2 : Output layer bias vector (lives “in” the layer’s neurons)
# i - specify parameters of output layer
W2 = torch.randn((100, 27))   # (incoming) weights matrix: 100 inputs from hidden layer, 27 output neurons
b2 = torch.randn(27)          # biases (on each of the 27 output neurons) -> broadcasted!

Implement loss function: Negative log likelihood

  • logits : Output layer neuron activations. Interpreted as “log counts”
# i - calculate "logits" (aka "log counts")
logits = h @ W2 + b2
print('logits.shape:', logits.shape)
logits.shape: torch.Size([228146, 27])

Apply softmax to calculate prob:

  • logits.exp() makes the logits (“log counts”) behave like actual counts
  • Normalise each row (by its row sum) to convert counts to probabilities
  • prob : Probability distribution of next token (character), for every training example
    • Each row of prob sums to 1
# i - apply softmax
counts = logits.exp()
prob = counts / counts.sum(1, keepdims=True)
print('prob.shape:', prob.shape)
prob.shape: torch.Size([228146, 27])
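
A minimal check, using the prob just computed, that each row is a valid probability distribution:

# every row of prob should sum to 1 (up to floating-point error)
print(prob[0].sum())                                     # ~ tensor(1.)
print(torch.allclose(prob.sum(1), torch.ones(228146)))   # True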

Compute loss: negative log likelihood

Selecting the specific probability of the correct next token

Breaking down prob[torch.arange(228146), Y]:

  • prob[torch.arange(228146)]: For each of the 228,146 training examples, get its row of next-token probabilities (27 elements, summing to 1)
  • The Y argument then says: for each training example, pick out the probability assigned to the actual next token in the sequence.
    • In training we know Y, the correct label for each training example, e.g. '.em' -> 'm' or 'emm' -> 'a'
  • If the entries of prob[torch.arange(228146), Y] are all close to 1.0, the model is making good (correct) next-character predictions! (See the toy indexing sketch below.)
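
This paired row/column indexing can be seen on a toy tensor (a hypothetical 3-example, 4-class case, not from the notebook):

# toy "prob": 3 examples, 4 possible next tokens; toy_Y holds the correct token per example
toy_prob = torch.tensor([[0.1, 0.7, 0.1, 0.1],
                         [0.2, 0.2, 0.5, 0.1],
                         [0.9, 0.0, 0.0, 0.1]])
toy_Y = torch.tensor([1, 2, 0])
print(toy_prob[torch.arange(3), toy_Y])  # tensor([0.7000, 0.5000, 0.9000])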
# i - compute loss function (negative log likelihood)
loss = -prob[torch.arange(228146), Y].log().mean()
print('loss      :', loss)
loss      : tensor(19.5052)

F.cross_entropy() loss does the same thing!

Rationale: Why use F.cross_entropy()?

  • More efficient forward pass: more memory-efficient (no temporary intermediate counts or prob tensors are created)
    • Fused kernel: softmax + log + NLL computed in a single pass, avoiding materialising the full probability distribution tensor in memory
  • More efficient backward pass: F.cross_entropy keeps softmax + log + NLL fused, so the backward pass has a clean closed-form (analytical) gradient (softmax(logits) minus the one-hot encoding of Y) rather than backpropping through each intermediate step separately.
    • Recall: 07_breaking_up_tanh showed how decomposing a fused op into steps multiplies the number of backward pass operations needed
  • Numerically better behaved than the manual exp() route.
    • If an element of logits is very large (e.g. 100), then counts = logits.exp() overflows to inf for that element: it ran out of float range.
    • prob then contains a nan probability for that next token.
    • PyTorch offsets all logits by subtracting the largest element before applying exp() (demonstrated in the sketch after the cell below)
      • Largest element becomes 0, all others ≤ 0
      • exp() of non-positive numbers stays in [0, 1] — no overflow
      • Softmax is invariant to this shift (only relative differences matter)
F.cross_entropy(logits, Y)
tensor(19.5052)
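
To make the numerical-stability point concrete, a small standalone sketch (values chosen arbitrarily, not from the notebook):

fake_logits = torch.tensor([-3.0, 2.0, 100.0])
print(fake_logits.exp())                   # tensor([4.9787e-02, 7.3891e+00, inf]) -> overflow!
shifted = fake_logits - fake_logits.max()  # softmax is invariant to this shift
counts = shifted.exp()                     # exp of non-positive values stays in [0, 1]
print(counts / counts.sum())               # roughly tensor([0., 0., 1.]): a valid distribution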

Summary of full network

# i - summary
print('X:', X.shape, '-> Y:', Y.shape)
 
g = torch.Generator().manual_seed(2147483647) # for reproducibility
 
C = torch.randn((27, 2), generator=g)         # embedding matrix (lookup table for input tokens)
W1 = torch.randn((6, 100), generator=g)       # hidden layer's incoming weights: 6 inputs to layer, 100 hidden neurons in layer 
b1 = torch.randn(100, generator=g)            # 100 biases live "in" hidden layer's neurons
W2 = torch.randn((100, 27), generator=g)      # output layer's incoming weights: 100 inputs to layer, 27 output neurons in layer
b2 = torch.randn(27, generator=g)             # 27 biases live "in" output layer's neurons
 
parameters = [C, W1, b1, W2, b2]              # list of all parameters (makes it easier to count)
print('num. of parameters:', sum(p.nelement() for p in parameters))  # total parameter count in network
 
emb = C[X]                                 # (228146, 3, 2) -> (228146, 6) on next line via emb.view(-1, 6)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)  # (228146, 100)
logits = h @ W2 + b2                       # (228146, 27)
loss = F.cross_entropy(logits, Y)          # simpler!
loss
X: torch.Size([228146, 3]) -> Y: torch.Size([228146])
num. of parameters: 3481
tensor(19.5052)
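
The parameter count of 3,481 can be checked by hand from the shapes above:

# parameters term by term: C (27*2) + W1 (6*100) + b1 (100) + W2 (100*27) + b2 (27)
print(27*2 + 6*100 + 100 + 100*27 + 27)  # 54 + 600 + 100 + 2700 + 27 = 3481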

Sources