- Prev: 06_backprop_backward_method_auto
- Next: 08_pytorch_backprop
- Related: `Value` object data structure, computation graph direction terminology
# imports, reset_graph() to init nn, graphviz: trace() & draw_dot()
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# helper: re-initialise graph
def reset_graph(reset_level):
# declare the graph's Value nodes as globals so they can be reset or rebuilt
global x1, x2, w1, w2, x1w1, x2w2, x1w1x2w2, b, n, o
if reset_level == 'gradients':
x1.grad = x2.grad = w1.grad = w2.grad = x1w1.grad = x2w2.grad = x1w1x2w2.grad = b.grad = n.grad = o.grad = 0
print("reset_graph(): All gradients have been reset to 0")
# reset all variables
elif reset_level == 'graph':
# redefine inputs (x1,x2), weights (w1,w2), and then the graph (n = x1*w1 + x2*w2 + b)
x1 = Value(2.0, label='x1'); x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1'); w2 = Value(1.0, label='w2')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'; x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
# manually set bias to make the numbers nice for teaching (b=8 to see tanh squashing!, b=6.8813735870195432 so the local tanh derivative is exactly 0.5)
b = Value(6.8813735870195432, label='b');
n = x1w1x2w2 + b; n.label = 'n'
# try re-run the activation function on n (the raw cell body) and draw the output node o
o = n.tanh(); o.label = 'o'
print("reset_graph(): All vars, initial and intermediate, have been reset. All gradients now 0")
else: print("reset_graph(): please specify the level of reset desired 'gradients' or 'graph'")
# graphviz
from graphviz import Digraph
def trace(root):
# recursively builds a set of all nodes and edges in a graph
nodes, edges = set(), set()
def build(v):
if v not in nodes:
nodes.add(v)
for child in v._prev:
edges.add((child, v))
build(child)
build(root)
return nodes, edges
def draw_dot(root):
dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'}) # LR = left to right
nodes, edges = trace(root)
for n in nodes:
uid = str(id(n))
# for any value in the graph, create a rectangular ('record') node for it
dot.node(name = uid, label = "{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad), shape='record')
if n._op:
# if this value is a result of some operation, create an op node for it
dot.node(name = uid + n._op, label = n._op)
# and connect this node to it
dot.edge(uid + n._op, uid)
for n1, n2 in edges:
# connect n1 to the op node of n2
dot.edge(str(id(n1)), str(id(n2)) + n2._op)
return dot
Exercise
- We implemented `tanh` as a single composite operation (method: `.tanh()`).
  - Valid, because we know its local derivative (see `self.grad` in `_backward()`)
- Now re-implement it using only its constituent operations (a quick check of the underlying identity follows this list)
- Bonus: good practice implementing a few more neuron operations!
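The re-implementation relies on the identity tanh(x) = (e^(2x) − 1) / (e^(2x) + 1). As a quick sanity check (an addition to this write-up, not part of the original notebook), the identity can be verified numerically with plain `math`:
# sanity check: tanh(x) == (exp(2x) - 1) / (exp(2x) + 1) for a few points
import math
for x in (-2.0, 0.0, 0.5, 3.0):
    lhs = math.tanh(x)
    rhs = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
    assert abs(lhs - rhs) < 1e-9, (x, lhs, rhs)
print("tanh identity holds at all test points")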
Approach:
- Generalise the existing “left operand” methods to handle expressions with mixed types:
  - `__add__()` must handle: `Value + int` (i.e. `Value.__add__(int)`)
  - `__mul__()` must handle: `Value * int` (i.e. `Value.__mul__(int)`)
  - How: assume the non-`Value` operand is an `int`/`float` → wrap it: `Value(int)`
- Fallbacks: create reflected versions of the above for swapped operands (see the dispatch sketch after this list)
  - New `__radd__()` handles: `int + Value` (i.e. `Value.__radd__(int)`)
  - New `__rmul__()` handles: `int * Value` (i.e. `Value.__rmul__(int)`)
- Define an exponentiation method: `exp()`
  - Uses the builtin `math.exp()` function and a single input (`self`)
- One could define a division method `__truediv__()`
  - But it’s more general to implement `__pow__()` (e.g. for `x**k`)
  - Division is then just a special case of multiplication by the inverse: `a / b => a * (1/b) => a * b**-1`
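To see why the reflected fallbacks are needed, here is a minimal sketch using a hypothetical `Box` class (illustrative only, not part of `Value`): Python first asks the left operand, and only falls back to the right operand's `__rmul__` when the left side cannot handle the multiplication.
# toy illustration of reflected-operand dispatch (hypothetical Box class)
class Box:
    def __init__(self, data):
        self.data = data
    def __mul__(self, other):   # handles Box * int
        return Box(self.data * other)
    def __rmul__(self, other):  # handles int * Box: int.__mul__(Box) is NotImplemented,
        return self * other     # so Python retries with Box.__rmul__(int)

print((Box(3.0) * 2).data)  # 6.0, via Box.__mul__
print((2 * Box(3.0)).data)  # 6.0, via Box.__rmul__ (TypeError without it)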
For steps 3 and 4 (`exp()` and `__pow__()`), we need to define the local derivative used to backpropagate gradients (`self.grad` ← `out.grad`):
| Operation | Forward | Local Derivative | `_backward()` Gradient Flow |
|---|---|---|---|
| Exponentiation | `out = e**x` | `d(e**x)/dx = e**x` | `self.grad += out.data * out.grad` |
| Power | `out = x**k` | `d(x**k)/dx = k * x**(k-1)` | `self.grad += other * self.data**(other - 1) * out.grad` |
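These two local derivatives can be spot-checked numerically with a central finite difference (a quick check added here, not in the original notebook):
# numerically verify d(e^x)/dx = e^x and d(x^k)/dx = k*x^(k-1)
import math
h = 1e-6
x, k = 1.3, 3.0
print((math.exp(x + h) - math.exp(x - h)) / (2*h), "vs", math.exp(x))   # exp
print(((x + h)**k - (x - h)**k) / (2*h), "vs", k * x**(k - 1))          # power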
Recall: the conventional `_backward()` pass gradient flow direction described here.
`_backward()` method gradients for previously implemented operations:
| Operation | Forward | Local Derivative | `_backward()` Gradient Flow |
|---|---|---|---|
| Addition | `a + b` | `1` (w.r.t. each operand) | `self.grad += 1.0 * out.grad` |
| Multiplication | `a * b` | `b` (w.r.t. `a`), `a` (w.r.t. `b`) | `self.grad += other.data * out.grad` |
| tanh | `tanh(x)` | `1 - tanh(x)**2` | `self.grad += (1 - t**2) * out.grad` |
| ReLU | `max(0, x)` | `1` if `x > 0`, else `0` | gradient passes through only where `x > 0` |
| Max | `max(a, b)` | `1` for the larger input, `0` for the other | gradient routed to the max input only |
| Sigmoid | `1 / (1 + e**-x)` | `sigmoid(x) * (1 - sigmoid(x))` | `self.grad += s * (1 - s) * out.grad` |
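For the Bonus above, any of these could be added to `Value` in exactly the same pattern as `tanh()` below. A sketch for sigmoid, using the table's local derivative sigmoid(x) * (1 - sigmoid(x)); it assumes the extended `Value` class defined in the next cell and is not part of the chapter's implementation:
# sketch: sigmoid as another composite op (assumes the Value class defined below)
def sigmoid(self):
    x = self.data
    s = 1.0 / (1.0 + math.exp(-x))            # forward pass: sigma(x)
    out = Value(s, (self, ), 'sigmoid')
    def _backward():
        self.grad += s * (1 - s) * out.grad   # local derivative: sigma(x) * (1 - sigma(x))
    out._backward = _backward
    return out
# Value.sigmoid = sigmoid  # attach once Value is defined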
Implementation (extend Value)
# extend `Value` class with the constituent methods listed above:
class Value:
def __init__(self, data, _children=(), _op='', label=''):
self.data = data
self.grad = 0.0
self._backward = lambda: None
self._prev = set(_children)
self._op = _op
self.label = label
def __repr__(self):
return f"Value(data={self.data})"
def __add__(self, other):
# pre-process `other`. If it is non-`Value`, assume `int`/`float` and wrap in `Value()`
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += out.grad * 1.0
other.grad += out.grad * 1.0
out._backward = _backward
return out
def __mul__(self, other):
# pre-process `other`. If it is non-`Value`, assume int/float and wrap in `Value()`
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, (self, other), '*')
def _backward():
self.grad += other.data * out.grad
other.grad += self.data * out.grad
out._backward = _backward
return out
# def __radd__(self, other): # fallback for swapped operands: i.e. other + self
# return self + other # route to `__add__`
def __rmul__(self, other): # fallback for swapped operands: i.e. other * self
return self * other # route to `__mul__`
# ensure `other` is NEVER a `Value` object. Only int/float allowed
def __pow__(self, other):
assert isinstance(other, (int, float)), "only supporting int/float powers for now"
out = Value(self.data**other, (self,), f'**{other}')
# recall downstream grad = local grad * upstream grad
# local gradient for x^k: d(x^k)/dx = kx^(k-1)
def _backward():
self.grad += other * (self.data ** (other - 1)) * out.grad
out._backward = _backward
return out
def __truediv__(self, other): # i.e. self / other but...
return self * other**-1 # use previously defined __mul__() and __pow__(), instead of implementing `/` operation and its own `_backward()``
def __neg__(self): # -self
return self * -1 # use previously defined __mul__() to evaluate this `Value` * `int` expression
def __sub__(self, other): # self - other
return self + (-other) # use previously defined __add__(), instead of implementing `-` operation and its own `_backward()``
def tanh(self):
x = self.data
t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
out = Value(t, (self, ), 'tanh')
def _backward():
self.grad += (1 - t**2) * out.grad
out._backward = _backward
return out
# define exponentiation method
def exp(self):
x = self.data # input data value
out = Value(math.exp(x), (self, ), 'exp') # output data value: use builtin math.exp(x)
# recall downstream grad = local grad * upstream grad
# local gradient for exp: d(e^x)/dx = e^x (i.e. out.data, just calculated!)
def _backward():
self.grad += out.data * out.grad
out._backward = _backward
return out
# full backward pass: topologically sort the graph, then call each node's _backward() in reverse order
def backward(self):
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(self)
self.grad = 1.0
for node in reversed(topo):
node._backward()
Test implementation
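Before re-building the graph, a quick smoke test of the new mixed-type operators (added here as a sketch; not part of the original notebook):
# smoke test the new operators on standalone Value objects
a = Value(4.0); b = Value(2.0)
print(a + 1)    # Value + int wraps the int via __add__: Value(data=5.0)
print(3 * a)    # int * Value routes through __rmul__:   Value(data=12.0)
print(a / b)    # division via a * b**-1:                Value(data=2.0)
print(a - b)    # subtraction via a + (-b):              Value(data=2.0)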
Original tanh
# i - original o = n.tanh() call
reset_graph('graph')
o.backward()
draw_dot(o)
reset_graph(): All vars, initial and intermediate, have been reset. All gradients now 0
Simplified tanh
Inspect the new graph:
- Data values must add up in the forward pass.
- The `tanh` operation node should now be decomposed into a series of simple operation nodes.
- Inspect the backpropagated gradients at the leaves (`x1`, `x2`, `w1`, `w2`, `b`). They should match the original graph above.
# reset graph -> overwrite o = n.tanh() node
reset_graph('graph')
# overwrite o = n.tanh() -> express activation function as constituent operations
e = (2*n).exp()
o = (e - 1) / (e + 1)
o.label = 'o'
# perform backward pass, and draw the output node o
o.backward()
draw_dot(o)
reset_graph(): All vars, initial and intermediate, have been reset. All gradients now 0
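As one more check (an addition to the original notebook), the decomposed output should agree numerically with the composite `tanh` of the same pre-activation `n`:
# decomposed activation should match math.tanh on the same pre-activation
print(o.data, math.tanh(n.data))    # both ~0.7071
assert abs(o.data - math.tanh(n.data)) < 1e-9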
Takeaways
- The level at which a neuron operation (method) is implemented is arbitrary.
  - Simple operations like addition (`+`) and complex composite ones like `tanh` are equivalent in this framework.
- The only prerequisites for implementing an operation are that you can perform:
  - Forward pass: produce some output(s) as a function of some input(s)
  - Backward pass: the operation is differentiable (i.e. we can find and chain its local gradient)
Sources
- YouTube: The spelled-out intro to neural networks and backpropagation: building micrograd
- karpathy/micrograd on GitHub
- Jupyter notebooks from this chapter
- Google Colab exercises