- Prev: 02_sampling
- Next: 04_from_bigrams_to_nns
- Related: NLL minimised in 05_optimisation, smoothing analogous to L2 regularisation in 05_optimisation
# import data
words = open('data/names.txt', 'r').read().splitlines()
import torch
# initialise 27 x 27 tensor (2D array), and create lookup tables (maps)
N = torch.zeros((27, 27), dtype=torch.int32) # init counts array: 32-bit ints (26 chars + EOL)
chars = sorted(list(set(''.join(words)))) # sorted list: unique (26) chars in full dataset
stoi = {s:i+1 for i,s in enumerate(chars)} # map (dict type): `str`->`int`. 'a'=1, ..., 'z'=26
stoi['.'] = 0 # map: '.'=0. Treat '<S>' and '<E>' as the same!
itos = {i:s for s,i in stoi.items()} # map (reverse): invert `stoi` dict
# (re)create freq map, but storing bigram counts in PyTorch tensor: N
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1
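As a sanity check, the `stoi` and `itos` tables are exact inverses of each other. A minimal sketch on a toy 3-letter alphabet (not the full dataset):

```python
# Minimal sketch of the stoi/itos lookup tables, on a toy 3-letter alphabet.
chars = ['a', 'b', 'c']
stoi = {s: i + 1 for i, s in enumerate(chars)}  # 'a'=1, 'b'=2, 'c'=3
stoi['.'] = 0                                   # special token at index 0
itos = {i: s for s, i in stoi.items()}          # invert the dict

# The two maps round-trip exactly.
assert all(itos[stoi[s]] == s for s in stoi)
print(stoi)  # {'a': 1, 'b': 2, 'c': 3, '.': 0}
```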
# visualise array (heatmap)
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues') # show entire array image
# iterate over each cell in array
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j] # create char strings (e.g. 'ac', 'gb', 'a.')
        plt.text(j, i, chstr, ha="center", va="bottom", color='gray') # write char string
        plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray') # write count (item)
plt.axis('off');
# broadcast row sums to normalise rows of N -> obtain probability distribution P
P = N.float()
P /= P.sum(dim=1, keepdim=True)
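The `keepdim=True` matters here: it keeps the summed dimension as size 1, so broadcasting divides each row by its own row sum. A minimal sketch on a toy 3x3 counts matrix (made-up values standing in for the 27x27 `N`):

```python
import torch

# Toy 3x3 counts matrix (made-up values) standing in for the 27x27 N.
counts = torch.tensor([[2., 1., 1.],
                       [0., 4., 0.],
                       [1., 1., 8.]])

# keepdim=True keeps shape (3, 1), so broadcasting divides each ROW
# by its own row sum -- each row becomes a probability distribution.
P_rows = counts / counts.sum(dim=1, keepdim=True)
print(P_rows.sum(dim=1))   # tensor([1., 1., 1.])

# keepdim=False yields shape (3,), which broadcasts across COLUMNS
# instead -- silently normalising the wrong axis.
P_wrong = counts / counts.sum(dim=1)
print(P_wrong.sum(dim=1))  # rows no longer sum to 1
```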
# # run generation loop (pre-normalised the bigram counts matrix `N`)
# g = torch.Generator().manual_seed(2147483647)
# ix = 0 # start at index 0. the 1st (special) token '.'
# for i in range(20):
#     out = [] # init list to hold name
#     while True:
#         p = P[ix] # expect same result!
#         # sample next index
#         ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
#         out.append(itos[ix]) # append char to list
#         # break loop if next predicted token is EOL '.'
#         if ix == 0:
#             break
#     print(''.join(out))
Building toward a loss function
To capture the “quality” of the model’s predictions (names) in a single number, a “loss value”, we compute a cost function (aka a loss function).
By convention one minimises the loss function by adjusting the parameters of the model. In the case of the bigram model, the parameters are the bigram probabilities P (visualised in the heatmap above).
Model parameters
Inspecting the model parameters (bigram probabilities in P tensor), each parameter is between 0 (low probability bigram) and 1 (high probability bigram). These will be further tuned to improve the quality of the model.
Comparing their current values against the Uniform Distribution (1/27 ≈ 0.037), any bigram [i,j] with probability P[i,j] > 0.037 implies something useful in the bigram statistics: that bigram is more probable than “random”.
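A sketch of that baseline comparison; the row of probabilities below is made up, standing in for one row of P:

```python
import torch

# Uniform baseline: with 27 symbols, every bigram would have prob 1/27.
uniform_p = 1.0 / 27                # ~0.037

# A made-up row of bigram probabilities standing in for one row of P.
row = torch.tensor([0.10, 0.02, 0.50, 0.01, 0.37])

# Entries beating the uniform baseline carry real statistical signal.
informative = row > uniform_p
print(informative)
```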
What can we optimise? Likelihood!
Maximum Likelihood Estimation (MLE) suggests the Likelihood (the product of the probabilities of all bigrams) should be maximised. This should produce a model with “good” predictions. Since there are ~32k names and each name has ~2-15 bigrams in it, multiplying all probabilities results in a tiny value for the Likelihood.
Likelihood → Log Likelihood
Recall the exponential identity, and its inverse
- The defining exponential identity: e^(a+b) = e^a · e^b
	- addition in the input → multiplication in the output
- And its inverse, in the natural logarithm: log(a·b) = log(a) + log(b)
	- multiplication in the input → addition in the output
It is highly convenient to utilise the log identity to replace the multiplicative operation with an additive one.
The chart of log(x) (a monotonic function) shows probability on the x-axis, bounded (0, 1]; log maps this interval to (−inf, 0].
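A quick numerical check of the identity, and of why it matters for long products of small probabilities:

```python
import math

a, b = 0.03, 0.2
# Multiplication in the input -> addition in the (log) output.
assert math.isclose(math.log(a * b), math.log(a) + math.log(b))

# Multiplying many small probabilities underflows float64 to 0.0...
product = 1.0
for _ in range(400):
    product *= 0.05
print(product)        # 0.0 -- underflowed

# ...while the equivalent log-space sum stays comfortably representable.
log_sum = 400 * math.log(0.05)
print(log_sum)        # about -1198.3
```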
Goal of training will be to optimise… something…
- Maximise likelihood of the data wrt. model parameters (the P tensor)
	- i.e. maximise product of bigram probabilities
- equivalent to maximising the log likelihood (because log is monotonic)
	- i.e. maximise sum of log(probabilities)
- equivalent to minimising the negative log likelihood
	- i.e. negate the previous step
- equivalent to minimising the average negative log likelihood
	- i.e. normalise by the number of samples (bigrams in the training data set)
	- stabilises the loss value (avoids exploding magnitudes)
The above 4 optimisation problems are equivalent
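A small sanity check (with made-up per-bigram probabilities for two toy models) that all four objectives agree on which model is better:

```python
import math

# Made-up probabilities two toy models assign to the same 4 bigrams.
model_a = [0.20, 0.10, 0.30, 0.25]   # the "better" model
model_b = [0.10, 0.10, 0.20, 0.15]

def likelihood(ps):     return math.prod(ps)
def log_likelihood(ps): return sum(math.log(p) for p in ps)
def nll(ps):            return -log_likelihood(ps)
def avg_nll(ps):        return nll(ps) / len(ps)

# All four objectives produce the same ranking of the two models.
assert likelihood(model_a) > likelihood(model_b)
assert log_likelihood(model_a) > log_likelihood(model_b)
assert nll(model_a) < nll(model_b)
assert avg_nll(model_a) < avg_nll(model_b)
print('all four objectives agree')
```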
# i - see prob and logprob for a few bigrams
log_likelihood = 0.0
n = 0
print('bigram ', 'prob ', 'logprob')
for w in words[:3]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2] # model's assigned prob for each bigram
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')
print(f'\n{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
print(f'{nll / n = }')
print('\nnote: bigrams with smaller probabilities have MORE NEGATIVE (i.e. WORSE) log-probabilities')
bigram  prob    logprob
.e: 0.0478 -3.0408
em: 0.0377 -3.2793
mm: 0.0253 -3.6772
ma: 0.3899 -0.9418
a.: 0.1960 -1.6299
.o: 0.0123 -4.3982
ol: 0.0780 -2.5508
li: 0.1777 -1.7278
iv: 0.0152 -4.1867
vi: 0.3541 -1.0383
ia: 0.1381 -1.9796
a.: 0.1960 -1.6299
.a: 0.1377 -1.9829
av: 0.0246 -3.7045
va: 0.2495 -1.3882
a.: 0.1960 -1.6299
log_likelihood = tensor(-38.7856)
nll = tensor(38.7856)
nll / n = tensor(2.4241)
note: bigrams with smaller probabilities have MORE NEGATIVE (i.e. WORSE) log-probabilities
Goal of training: Minimise loss nll / n (on all training data)
A high quality model is one where the parameters (elements in P) have been optimised to minimise negative log likelihood nll (or its average nll / n)
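The per-bigram loop below can also be vectorised: gather all bigram probabilities with one fancy-indexing step and average their logs. A self-contained sketch on a tiny made-up corpus (not the names dataset):

```python
import torch

# Tiny made-up corpus and vocab ('.' = 0), mirroring the setup above.
words = ['ab', 'ba', 'aa']
stoi = {'.': 0, 'a': 1, 'b': 2}

# Count bigrams into N, recording the index pairs as we go.
N = torch.zeros((3, 3))
pairs = []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1
        pairs.append((stoi[ch1], stoi[ch2]))
P = N / N.sum(dim=1, keepdim=True)

# Vectorised NLL: index all bigram probabilities at once, no inner loop.
ix1 = torch.tensor([a for a, _ in pairs])
ix2 = torch.tensor([b for _, b in pairs])
avg_nll = -torch.log(P[ix1, ix2]).mean()
print(f'{avg_nll.item():.4f}')   # ~0.8283
```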
# i - loss function for entire data set
log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2] # model's assigned prob for each bigram
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
print(f'{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
print(f'{nll / n = }')
print('\na BETTER model will MAXIMISE log_likelihood (i.e. MINIMISE nll or nll/n)')
log_likelihood = tensor(-559891.7500)
nll = tensor(559891.7500)
nll / n = tensor(2.4541)
a BETTER model will MAXIMISE log_likelihood (i.e. MINIMISE nll or nll/n)
Expand output cell above for nll and nll / n over all training examples
Model smoothing
Loss function for a single name
My name is relatively uncommon (nll / n > 3). Something like Nicole is more common.
# i - loss function for 1 name: 'noel'
log_likelihood = 0.0
n = 0
print('bigram ', 'prob ', 'logprob')
for w in ['noel']:
# for w in ['nicole']:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2] # model's assigned prob for each bigram
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')
print(f'\n{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
print(f'{nll / n = }')
bigram  prob    logprob
.n: 0.0358 -3.3305
no: 0.0271 -3.6096
oe: 0.0166 -4.0961
el: 0.1590 -1.8386
l.: 0.0941 -2.3630
log_likelihood = tensor(-15.2378)
nll = tensor(15.2378)
nll / n = tensor(3.0476)
Bug: Zero probabilities → infinite loss
For a name input like noelx, the bigram lx has prob = 0.0.
- Hence logprob = -inf
- So nll (and its average) becomes +inf
This is because some bigrams (e.g. lx or jq) never occur in the training data, so have zero counts in N. A zero-probability bigram sends log(prob) to -inf, making the loss infinite.
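A minimal reproduction of the failure mode:

```python
import torch

# A bigram that never occurs in training gets probability 0.
prob = torch.tensor(0.0)
logprob = torch.log(prob)
print(logprob)             # tensor(-inf)

# A single -inf term drags the summed log-likelihood to -inf,
# so the negative log likelihood becomes +inf: an unusable loss.
log_likelihood = torch.tensor(-15.0) + logprob
print(-log_likelihood)     # tensor(inf)
```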
inspect output below
# predictions and loss function for 1 name: 'noelx'
log_likelihood = 0.0
n = 0
print('bigram ', 'prob ', 'logprob')
for w in ['noelx']:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2] # model's assigned prob for each bigram
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')
print(f'\n{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
print(f'{nll / n = }')
bigram  prob    logprob
.n: 0.0358 -3.3305
no: 0.0271 -3.6096
oe: 0.0166 -4.0961
el: 0.1590 -1.8386
lx: 0.0000 -inf
x.: 0.2353 -1.4469
log_likelihood = tensor(-inf)
nll = tensor(inf)
nll / n = tensor(inf)
A solution: Model smoothing (add small counts)
Laplace Smoothing: Modify the P = N.float() init step to be P = (N+1).float(). This ensures every bigram has a count of at least 1, then recalculate the parameter matrix P of bigram probabilities.
- Adding a large count means more model smoothing (towards Uniform distribution)
- Adding a smaller count keeps the model peaky
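A sketch of how the size of the added pseudo-count trades peakiness for uniformity, on a toy count row (made-up values):

```python
import torch

# Toy count row: one dominant bigram, two never observed.
counts = torch.tensor([0., 10., 0.])

for k in (1, 100):
    p = (counts + k) / (counts + k).sum()
    print(k, p)
# k=1   -> tensor([0.0769, 0.8462, 0.0769])  still peaky
# k=100 -> tensor([0.3226, 0.3548, 0.3226])  nearly uniform
```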
# i - alternate between (N) or (N+1) to see the effect on predictions, and loss (NLL)
P = (N+1).float()
P /= P.sum(dim=1, keepdim=True)
Inspect predictions and loss nll
Findings (compare against nll output of Minimise loss section above):
- Subtle changes to loss values nll (and nll / n). The Laplace-smoothed P (with N+1) is mathematically different
	- Raw probabilities P: loss nll / n is 2.4541
	- Laplace-smoothed probabilities P: loss nll / n is 2.4544
- No change to generated outputs (predictions). The smoothed P is only different by a very small amount
	- Too small to move past the CDF crossing points in torch.multinomial
Expand output cell below
# print predictions and loss (nll / n) for all words
g = torch.Generator().manual_seed(2147483647)
# loss computation
log_likelihood = 0.0
n = 0
# print('bigram ', 'prob ', 'logprob')
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2] # model's assigned prob for each bigram
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        # print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')
print(f'{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
print(f'{nll / n = }\n')
# predictions
for i in range(20):
    out = []
    ix = 0
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
log_likelihood = tensor(-559951.5625)
nll = tensor(559951.5625)
nll / n = tensor(2.4544)
cexze.
momasurailezitynn.
konimittain.
llayn.
ka.
da.
staiyaubrtthrigotai.
moliellavo.
ke.
teda.
ka.
emimmsade.
enkaviyny.
ftlspihinivenvorhlasu.
dsor.
br.
jol.
pen.
aisan.
ja.
Sources
- YouTube: The spelled-out intro to language modeling: building makemore
- karpathy/makemore on GitHub
- Jupyter notebooks from this chapter
- CS231N: Python NumPy Tutorial (with Jupyter and Colab)
- PyTorch Docs: Introduction to PyTorch
- PyTorch Docs: Tensors in PyTorch
- PyTorch Docs: torch.sum
	- Specifically, see arg use: dim, keepdim and dtype
- PyTorch Docs: torch.tensor vs torch.Tensor