# import data
words = open('data/names.txt', 'r').read().splitlines()
 
import torch
# initialise 27 x 27 tensor (2D array), and create lookup tables (maps)
N = torch.zeros((27, 27), dtype=torch.int32) # init counts array: 32-bit ints (26 letters + 1 special '.' token)
 
chars = sorted(list(set(''.join(words))))  # sorted list: unique (26) chars in full dataset
stoi = {s:i+1 for i,s in enumerate(chars)} # map (dict type): `str`->`int`. 'a'=1, ..., 'z'=26
stoi['.'] = 0                              # map: '.'=0. Treat '<S>' and '<E>' as the same!
itos = {i:s for s,i in stoi.items()}       # map (reverse): invert `stoi` dict
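# sanity check the maps (illustrative): 27 tokens total, '.' at index 0
assert len(stoi) == 27 and itos[0] == '.' and itos[1] == 'a'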
 
# (re)create freq map, but storing bigram counts in PyTorch tensor: N
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1
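# spot-check a single count (row = first char, column = second char), e.g. bigram 'an'
# print(N[stoi['a'], stoi['n']].item())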
 
# visualise array (heatmap)
import matplotlib.pyplot as plt
%matplotlib inline
 
plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues') # show entire array image
 
# iterate over each cell in array
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j] # create char strings (e.g. 'ac', 'gb', 'a.')
        plt.text(j, i, chstr, ha="center", va="bottom", color='gray')       # write char string
        plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray') # write count (item)
plt.axis('off');
 
# broadcast row sums to normalise rows of N -> obtain probability distribution P
P  = N.float()
P /= P.sum(dim=1, keepdim=True)
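# sanity check the broadcast (illustrative): every row of P should now sum to 1
assert torch.allclose(P.sum(dim=1), torch.ones(27))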
 
# # run generation loop (pre-normalised the bigram counts matrix `N`)
# g = torch.Generator().manual_seed(2147483647)
# ix = 0 # start at index 0. the 1st (special) token '.' 
 
# for i in range(20):
#     out = [] # init list to hold name
#     while True:
#         p = P[ix] # pre-normalised row of P: expect the same samples as normalising each step
        
#         # sample next index
#         ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
#         out.append(itos[ix]) # append sampled char to list
        
#         # break loop if next predicted token is EOL '.'
#         if ix == 0:
#             break
#     print(''.join(out))

Building toward a loss function

To capture the “quality” of the model’s predictions (names), we need a single number, a “loss value”, computed by a cost function (aka a loss function).

By convention one minimises the loss function by adjusting the parameters of the model. In the case of the bigram model, we adjust the bigram probabilities P (visualised in the heatmap above).

Model parameters

Inspecting the model parameters (bigram probabilities in P tensor), each parameter is between 0 (low probability bigram) and 1 (high probability bigram). These will be further tuned to improve the quality of the model.

Comparing their current values against the uniform distribution (1/27 ≈ 0.037), any bigram [i,j] with probability P[i,j] > 0.037 implies something useful in the bigram statistics: that bigram is more probable than “random”.
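As a quick check (an illustrative snippet, assuming the P tensor defined above is in scope), we can count how many of the 27×27 entries beat the uniform baseline:

# count bigrams more probable than uniform (1/27 ≈ 0.037)
print((P > 1/27).sum().item())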

What can we optimise? Likelihood!

Maximum Likelihood Estimation (MLE) suggests the likelihood (the product of the probabilities of all bigrams) should be maximised. This should produce a model with “good” predictions. Since there are ~32k names and each name has ~2–15 bigrams in it, multiplying all the probabilities results in a tiny value for the likelihood.

Likelihood → Log Likelihood

It is highly convenient to use the identity log(a·b·c) = log(a) + log(b) + log(c), replacing the multiplicative operation with an additive one.
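A quick numeric check of the identity, with made-up probabilities:

import math
a, b, c = 0.1, 0.2, 0.3
print(math.log(a * b * c))                     # -5.1160
print(math.log(a) + math.log(b) + math.log(c)) # same value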

A plot of log(x) (a monotonic function), with probability x on the x-axis bounded in (0, 1], runs from −∞ (as x → 0) up to 0 (at x = 1).

Goal of training will be to optimise… something…

  • Maximise the likelihood of the data w.r.t. the model parameters (the P tensor)
    • i.e. maximise the product of the bigram probabilities
  • equivalent to maximising the log likelihood (because log is monotonic)
    • i.e. maximise the sum of log(probabilities)
  • equivalent to minimising the negative log likelihood
    • i.e. negate the previous step (by convention, a loss is something to minimise)
  • equivalent to minimising the average negative log likelihood
    • i.e. normalise by the number of training examples (bigrams)
    • keeps the loss value in a stable range (it no longer grows with dataset size)

The above four optimisation problems are equivalent, as the toy example below shows.
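A toy illustration (made-up probabilities, not taken from P) that the objectives rank two candidate models the same way:

model_a = torch.tensor([0.3, 0.5]) # hypothetical bigram probs assigned by model A
model_b = torch.tensor([0.2, 0.4]) # hypothetical bigram probs assigned by model B
for name, p in [('A', model_a), ('B', model_b)]:
    print(name, p.prod().item(), p.log().sum().item(), (-p.log()).mean().item())
# model A wins on every criterion: highest product, highest log sum, lowest average nll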

# i - see prob and logprob for a few bigrams
log_likelihood = 0.0
n = 0
 
print('bigram ', 'prob    ', 'logprob')
for w in words[:3]: 
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]         # model's assigned prob for each bigram
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}:     {prob:.4f}   {logprob:.4f}')
        
print(f'\n{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
print(f'{nll / n = }')
print('\nnote: bigrams with smaller probabilities have MORE NEGATIVE (i.e. WORSE) log-probabilities')
bigram  prob     logprob
.e:     0.0478   -3.0408
em:     0.0377   -3.2793
mm:     0.0253   -3.6772
ma:     0.3899   -0.9418
a.:     0.1960   -1.6299
.o:     0.0123   -4.3982
ol:     0.0780   -2.5508
li:     0.1777   -1.7278
iv:     0.0152   -4.1867
vi:     0.3541   -1.0383
ia:     0.1381   -1.9796
a.:     0.1960   -1.6299
.a:     0.1377   -1.9829
av:     0.0246   -3.7045
va:     0.2495   -1.3882
a.:     0.1960   -1.6299
 
log_likelihood = tensor(-38.7856)
nll = tensor(38.7856)
nll / n = tensor(2.4241)
 
note: bigrams with smaller probabilities have MORE NEGATIVE (i.e. WORSE) log-probabilities

Goal of training: Minimise loss nll / n (on all training data)

A high-quality model is one whose parameters (the elements in P) have been optimised to minimise the negative log likelihood nll (or its average nll / n).

# i - loss function for entire data set
log_likelihood = 0.0
n = 0
 
for w in words: 
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]         # model's assigned prob for each bigram
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        
print(f'{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
print(f'{nll / n = }')
print('\na BETTER model will MAXIMISE log_likelihood (i.e. MINIMISE nll or nll/n)')
log_likelihood = tensor(-559891.7500)
nll = tensor(559891.7500)
nll / n = tensor(2.4541)
 
a BETTER model will MAXIMISE log_likelihood (i.e. MINIMISE nll or nll/n)

See the output above for nll and nll / n over all training examples.

Model smoothing

Loss function for a single name

My name is relatively uncommon (nll / n > 3). Something like Nicole is more common.

# i - loss function for 1 name: 'noel'
log_likelihood = 0.0
n = 0
 
print('bigram ', 'prob    ', 'logprob')
for w in ['noel']:
# for w in ['nicole']:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]         # model's assigned prob for each bigram
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}:     {prob:.4f}   {logprob:.4f}')
 
print(f'\n{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
print(f'{nll / n = }')
bigram  prob     logprob
.n:     0.0358   -3.3305
no:     0.0271   -3.6096
oe:     0.0166   -4.0961
el:     0.1590   -1.8386
l.:     0.0941   -2.3630
 
log_likelihood = tensor(-15.2378)
nll = tensor(15.2378)
nll / n = tensor(3.0476)

Bug: zero probabilities (infinite loss)

For a name input like noelx, the bigram lx has prob = 0.0.

  • Hence logprob = -inf
  • So nll (and its average) become +inf

This is because some bigrams (e.g. lx or jq) occur zero times in the training data, so have zero counts in N. A zero-probability bigram gives logprob = log(0) = -inf, which blows up the loss.
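Both facts can be verified directly (assuming the N and stoi objects from above are in scope):

print(N[stoi['l'], stoi['x']].item()) # 0: 'lx' never occurs in the data
print(torch.log(torch.tensor(0.0)))   # tensor(-inf)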

Inspect the output below.

# predictions and loss function for 1 name: 'noelx'
log_likelihood = 0.0
n = 0
 
print('bigram ', 'prob    ', 'logprob')
for w in ['noelx']:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]         # model's assigned prob for each bigram
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}:     {prob:.4f}   {logprob:.4f}')
 
print(f'\n{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
print(f'{nll / n = }')
bigram  prob     logprob
.n:     0.0358   -3.3305
no:     0.0271   -3.6096
oe:     0.0166   -4.0961
el:     0.1590   -1.8386
lx:     0.0000   -inf
x.:     0.2353   -1.4469
 
log_likelihood = tensor(-inf)
nll = tensor(inf)
nll / n = tensor(inf)

A solution: Model smoothing (add small counts)

Laplace Smoothing: Modify the P = N.float() init step to be P = (N+1).float(). This ensures every bigram has a count of at least 1; then recalculate the parameter matrix P of bigram probabilities.

  • Adding a large count means more model smoothing (towards Uniform distribution)
  • Adding a smaller count keeps the model peaky
# i - alternate between (N) or (N+1) to see the effect on predictions, and loss (NLL)
P  = (N+1).float() 
P /= P.sum(dim=1, keepdim=True)
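
To see the trade-off, a hypothetical smoothing constant alpha can be swept (an illustrative sketch; only N+1 is used in this document). A separate Pa avoids overwriting the P used by the cell below:

for alpha in [0, 1, 100, 10000]:
    Pa = (N + alpha).float()
    Pa /= Pa.sum(dim=1, keepdim=True)
    print(alpha, Pa[stoi['q']].max().item()) # the peaky 'q' row flattens toward 1/27 ≈ 0.037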

Inspect predictions and loss nll

Findings (compare against nll output of Minimise loss section above):

  • Subtle changes to the loss values nll (and nll / n): the Laplace-smoothed P (with N+1) is mathematically different
    • Raw probabilities P: loss nll / n is 2.4541
    • Laplace-smoothed probabilities P: loss nll / n is 2.4544
  • No change to the generated outputs (predictions): the smoothed P differs only by a very small amount
    • Too small to move past the CDF crossing points in torch.multinomial

See the output of the cell below.

# print predictions and loss (nll / n) for all words
g = torch.Generator().manual_seed(2147483647)
 
# loss computation
log_likelihood = 0.0
n = 0
 
# print('bigram ', 'prob    ', 'logprob')
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]         # model's assigned prob for each bigram
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        # print(f'{ch1}{ch2}:     {prob:.4f}   {logprob:.4f}')
 
print(f'{log_likelihood = }')
nll = -log_likelihood
print(f'{nll = }')
print(f'{nll / n = }\n')
 
# predictions
for i in range(20):
    out = []
    ix = 0
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
log_likelihood = tensor(-559951.5625)
nll = tensor(559951.5625)
nll / n = tensor(2.4544)
 
cexze.
momasurailezitynn.
konimittain.
llayn.
ka.
da.
staiyaubrtthrigotai.
moliellavo.
ke.
teda.
ka.
emimmsade.
enkaviyny.
ftlspihinivenvorhlasu.
dsor.
br.
jol.
pen.
aisan.
ja.

Sources