# imports, `Value` class, graphviz: trace() & draw_dot()
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
 
# Object definitions from end of previous chapter:
 
# Value class:
class Value:
 
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self.label = label
 
    def __repr__(self):
        return f"Value(data={self.data})" 
 
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out
 
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out
 
# graphviz
from graphviz import Digraph
 
def trace(root):
    # recursively builds a set of all nodes and edges in a graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes: 
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges
 
def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'}) # LR = left to right
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # for any value in the graph, create a rectangular ('record') node for it
        dot.node(name = uid, label = "{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad), shape='record')
        if n._op:
            # if this value is a result of some operation, create an op node for it
            dot.node(name = uid + n._op, label = n._op)
            # and connect this node to it
            dot.edge(uid + n._op, uid)
    for n1, n2 in edges:
        # connect n1 to the op node of n2
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot
 
# draw_dot(L)

8. Manual backpropagation (train a neuron)

Motivating example: an MLP and a single neuron

An example neural network (MLP): [image: example MLP diagram]

A mathematical model of a single neuron in an MLP. Note the multiplicative relationship between each input and its weight (synapse): [image: mathematical model of a neuron]
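In the notation used in the code below (two inputs, as in this example), the neuron computes o = tanh(x1*w1 + x2*w2 + b), where w1 and w2 are the weights, b is the bias, and tanh is the activation function.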

  • The activation function is a squashing function (e.g. Sigmoid, ReLU, GELU)
    • In this example, the activation function is the hyperbolic tangent, tanh(x) = (e^(2x) - 1)/(e^(2x) + 1):
# example activation function: tanh smoothly caps large (+ or -) inputs to +1 or -1 respectively
plt.figure(figsize=(4, 3), dpi=80)
plt.plot(np.arange(-5,5,0.2), np.tanh(np.arange(-5,5,0.2))); plt.grid()
[plot: tanh curve, squashing inputs in [-5, 5) to the range (-1, +1)]

8.1. Define the forward pass (i.e. initialise NN)

  • Initialise:
    • neuron inputs (data): x1 and x2
    • weights: w1 and w2
    • bias: b
  • Then compute the neuron’s pre-activation value: n = x1*w1 + x2*w2 + b
  • Visualise the computation graph as a DAG (with graphviz):
# init nn: params x1,x2; w1,w2; b -> intermediate nodes x1w1, x2w2, x1w1x2w2 -> output (n)
# neuron inputs x1,x2 (2 dimensional neuron)
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
 
# weights of neuron w1,w2 (synaptic strengths for each input)
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
 
# bias of the neuron
b = Value(6.7, label='b')
 
# following the graph above to create: x1*w1 + x2*w2 + b
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
 
# cell body raw activation (without the activation function)
n = x1w1x2w2 + b; n.label = 'n'
 
draw_dot(n)
[graphviz output: computation graph x1, x2, w1, w2, b -> x1*w1, x2*w2 -> x1*w1 + x2*w2 -> (+ b) -> n; n.data = 0.7000, all grads 0.0000]

8.2. Define the activation function in Value

The cell below (calculating output via activation function tanh) throws an error.

  • tanh is not defined in Value
  • Hyperbolic functions cannot be computed via the Value object’s methods we defined earlier
    • __add__ (+) and __mul__ (*) are insufficient
    • Division and/or exponentiation would also be needed.
# i - output axon (via activation function tanh) -- THROWS ERROR!
o = n.tanh() # throws error since Python doesn't know how to do tanh for a Value object
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[4], line 2
      1 # i - output axon (via activation function tanh) -- THROWS ERROR!
----> 2 o = n.tanh() # throws error since Python doesn't know how to do tanh for a Value object
 
AttributeError: 'Value' object has no attribute 'tanh'
  • We could implement division (__truediv__()) and exp() as new methods on our Value object, and then compose tanh out of those primitives, since tanh(x) = (e^(2x) - 1)/(e^(2x) + 1) (see the sketch below)
  • However, we can also define tanh directly as a method of Value, as long as we know how to take its local derivative
    • Any arbitrarily complicated function can be defined directly in Value, as long as we know how to take its local derivative (how its inputs affect its output)
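As an illustration of that first route, here is a minimal, hypothetical sketch (forward pass only, no gradients; the names ValueSketch and tanh_composed are invented for this example and are not part of the notebook's Value class):

# hypothetical sketch: composing tanh from exp, +, * and division primitives
import math

class ValueSketch:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        return ValueSketch(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return ValueSketch(self.data * other.data, (self, other), '*')

    def __truediv__(self, other):
        return ValueSketch(self.data / other.data, (self, other), '/')

    def exp(self):
        return ValueSketch(math.exp(self.data), (self,), 'exp')

def tanh_composed(v):
    # tanh(x) = (e^(2x) - 1) / (e^(2x) + 1), built only from the primitives above
    e2x = (v * ValueSketch(2.0)).exp()
    return (e2x + ValueSketch(-1.0)) / (e2x + ValueSketch(1.0))

print(tanh_composed(ValueSketch(0.8814)).data)  # ~0.7071, matches math.tanh(0.8814)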

Now we can compute the neuron’s post-activation value (and visualise with graphviz):

# extend `Value` class with `tanh(self)` method; reset network (slightly modify bias `b`); visualise
class Value:
 
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self.label = label
 
    def __repr__(self):
        return f"Value(data={self.data})" 
 
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out
 
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out
 
    # defining the tanh method (our activation function) in one go!
    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
        
        # the tanh node only has 1 child, so it's a tuple of 1 node "(self, )", 
        # and op name is 'tanh'
        out = Value(t, (self, ), 'tanh')
        return out
 
# same values as earlier: define inputs (x1,x2), weights (w1,w2)
x1 = Value(2.0, label='x1'); x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1'); w2 = Value(1.0, label='w2')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'; x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
 
# manually change the bias to make the numbers nice for education:
# b=8 would show tanh squashing the post-activation value, o, to just below +1,
# b=6.8813735870195432 makes o = 0.7071 and the local derivative do/dn = 1 - o**2 = 0.5
b = Value(6.8813735870195432, label='b')
n = x1w1x2w2 + b; n.label = 'n'
 
# try re-run the activation function on n (the raw cell body) and draw the output node o
o = n.tanh(); o.label = 'o'
draw_dot(o)
[graphviz output: same computation graph extended with a tanh node; n.data = 0.8814, o.data = 0.7071, all grads 0.0000]

8.3. Now the backward pass (backpropagation)

  • We’re particularly interested in do/dw1 and do/dw2
    • We can only change the weights, w1 and w2, during training of the neural net.
    • The data, x1 and x2, is fixed.
  • Also note, this is only one neuron. A real NN has many connected neurons
    • The loss function evaluates to a single number, at the very end of that NN
    • It measures how well the NN is doing (a goalpost for the NN’s backpropagation)

8.3.1 Manually backpropagate (hand-assign gradients of prior nodes)

See the gradient-annotated graph below (after the code); it is immensely helpful.

  1. The base case is known: do/do = 1
  2. Per wikipedia (among many equivalent forms): d/dn tanh(n) = 1 - tanh(n)^2
  • We know o = tanh(n), so by substitution: do/dn = 1 - o^2
  3. n’s gradient is “distributed” (plus node, +) to n’s upstream nodes x1*w1 + x2*w2 and b
  4. The gradient of x1*w1 + x2*w2 is again distributed (plus node, +) to its upstream nodes x1*w1 and x2*w2
  5. Finally, for the last two * nodes, the gradient propagates upstream via multiplication by the other node’s value.
  • For brevity, only showing gradients for the weights, w1 and w2, since the input data, x1 and x2, cannot be changed during training.
# backpropagation: hand-assign gradients, `.grad`, of prior nodes
# base case: manually set o.grad (i.e. d(o)/do = 1 is known) 
o.grad = 1.0
 
# we know o = tanh(n); 
# per wikipedia (or calculus): d/dx tanh(x) = 1 - (tanh(x))^2;
# therefore: do/dn = 1 - tanh(n)**2 (and we know tanh(n) is o.data!)
n.grad = 1 - o.data**2
 
n.grad # (0.5 in this ex.)
 
# n's incoming nodes enter via a '+' node, so n's gradient is simply routed back (i.e. 0.5 again):
x1w1x2w2.grad = n.grad
b.grad = n.grad 
 
# same logic for x1w1x2w2's incoming nodes (another '+' node); route x1w1x2w2 gradient!
x1w1.grad = x1w1x2w2.grad
x2w2.grad = x1w1x2w2.grad 
 
# the final 4 nodes (x1, w1, x2, w2) flow through a '*' node. Per the (local) CHAIN RULE, their gradients are the incoming gradient times the other input's value:
x1.grad = x1w1.grad * w1.data # do/dx1 = do/dx1w1 * d(x1w1)/dx1 = x1w1.grad * w1.data = 0.5 * -3 = -1.5
w1.grad = x1w1.grad * x1.data
x2.grad = x2w2.grad * w2.data
w2.grad = x2w2.grad * x2.data
 
draw_dot(o)
[graphviz output: same graph with gradients filled in; o.grad = 1.0, n.grad = 0.5, b.grad = 0.5, (x1*w1 + x2*w2).grad = 0.5, x1*w1.grad = 0.5, x2*w2.grad = 0.5, x1.grad = -1.5, w1.grad = 1.0, x2.grad = 0.5, w2.grad = 0.0]

8.3.2 Analysis of this backpropagation graph

  • Note how w2.grad = 0 in the graph above.
  • This makes sense because w2.grad tells us how nudging w2 affects the final output o
    • Since x2 (the neuron input) is 0, it doesn’t matter how we nudge w2: o remains unchanged (because x2 and w2 combine through a * node)
    • So o is totally insensitive to w2, hence w2.grad = 0 (a quick numerical check of this is sketched below)
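As a quick numerical sanity check of these gradients, here is a minimal sketch (the forward helper below is hypothetical and not part of the original notebook): nudge one parameter by a small h, recompute the forward pass, and compare the finite-difference slope against the hand-assigned gradients.

# hypothetical finite-difference check of the hand-assigned gradients
import math

def forward(x1d=2.0, x2d=0.0, w1d=-3.0, w2d=1.0, bd=6.8813735870195432):
    return math.tanh(x1d*w1d + x2d*w2d + bd)

h = 1e-4
o0 = forward()
print((forward(w2d=1.0 + h) - o0) / h)   # ~0.0  -> matches w2.grad (because x2 = 0)
print((forward(w1d=-3.0 + h) - o0) / h)  # ~1.0  -> matches w1.grad
print((forward(x1d=2.0 + h) - o0) / h)   # ~-1.5 -> matches x1.grad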

Next: Automate backprop - Implement _backward method

_backward implements the chain rule locally at each node and passes the resulting gradient on to that node's inputs (its upstream nodes); a minimal sketch of this pattern follows the list below

  • Leaf node (e.g. input data / input-layer weights): do nothing.
    • Why: a leaf has no further nodes to pass its gradient on to
  • Addition node (+): the incoming gradient is distributed as-is
    • Why: the local derivative is 1.0, so the gradient passes on unchanged
  • Multiplication node (*): the incoming gradient is multiplied by the other input's value (the multipliers “swap”)
    • Why: the local derivative with respect to one input is the value of the other input
  • Activation functions (e.g. ReLU, Sigmoid, tanh): multiply the incoming gradient by the derivative of the activation function
  • Maximum node: switch behaviour - the gradient is 100% routed to the largest input, 0 to the others.
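The next chapter implements this properly inside Value; as a preview, here is a minimal, hedged sketch of what the _backward closures could look like for '+', '*' and tanh (the class name ValueWithBackward is invented for this illustration and is not the notebook's final implementation):

# hypothetical preview of the _backward pattern (see the next chapter for the real version)
import math

class ValueWithBackward:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # leaf node: nothing further to propagate to
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        out = ValueWithBackward(self.data + other.data, (self, other), '+')
        def _backward():
            # '+': local derivative is 1.0, so the incoming gradient passes through unchanged
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = ValueWithBackward(self.data * other.data, (self, other), '*')
        def _backward():
            # '*': incoming gradient times the *other* input's value
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = ValueWithBackward(t, (self,), 'tanh')
        def _backward():
            # tanh: incoming gradient times the local derivative 1 - tanh(x)^2
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

Calling each node's _backward(), from the output back towards the leaves, reproduces the hand-assigned gradients above.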

Note: See backprop-graph-terminology

Sources