Recall from last time: Reverse-Mode AD¶
- Fix one scalar output $\ell \in \mathbb{R}$
- Compute the partial derivative $\frac{\partial \ell}{\partial y}$ for each intermediate value $y$
- Need to do this by going backward through the computation (chain-rule sketch below)
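As a one-step worked example: if $u = f(x)$ and $\ell = g(u)$, the backward pass first computes $\frac{\partial \ell}{\partial u} = g'(u)$ and then pushes it one step back with the chain rule, $\frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial u} \cdot \frac{\partial u}{\partial x}$.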
"Deep Learning" ML Frameworks¶
Classical Core Components:
- Numerical linear algebra library
- Hardware support (e.g. GPU)
- Backpropagation engine
- Library for expressing deep neural networks
All embedded in a high-level language
- Usually Python.
In [1]:
import numpy
In [2]:
x = numpy.zeros(1024)
In [3]:
x
Out[3]:
array([0., 0., 0., ..., 0., 0., 0.])
In [4]:
x.dtype
Out[4]:
dtype('float64')
In [6]:
y = x
In [7]:
y[1] = 17
In [8]:
x
Out[8]:
array([ 0., 17., 0., ..., 0., 0., 0.])
In [9]:
import torch
In [10]:
x = torch.zeros(1024)
y = x
y[1] = 17
x
Out[10]:
tensor([ 0., 17., 0., ..., 0., 0., 0.])
In [13]:
z = x[1:5]
z[1] = 6
In [14]:
x
Out[14]:
tensor([ 0., 17., 6., ..., 0., 0., 0.])
In [15]:
a = [0,0,0,0,0]
b = a
b[1] = 4
a
Out[15]:
[0, 4, 0, 0, 0]
In [18]:
c = a[1:3]
c[1] = 5
In [20]:
c
Out[20]:
[4, 5]
In [19]:
a
Out[19]:
[0, 4, 0, 0, 0]
Numerical Linear Algebra¶
You've already seen and used this sort of thing: NumPy.
- Arrays are objects "owned" by the library
- Any arithmetic operation on these objects goes through the library
- The library calls an optimized function to compute the operation
- This happens outside the Python interpreter (see the timing sketch after this list)
- Control is returned to python when the function finishes
- By default you're only going to be running one such function at a time.
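To make the "outside the interpreter" point concrete, here is a quick sketch (not from the lecture; the timings are illustrative):

import time
import numpy

x = numpy.random.rand(1_000_000)

# Pure-Python loop: every element access goes back through the interpreter
begin = time.time()
total = 0.0
for v in x:
    total += v
print(f"python loop: {(time.time() - begin) * 1000:.1f} ms")

# One vectorized call: a single optimized routine runs outside the interpreter
begin = time.time()
total = x.sum()
print(f"numpy sum:   {(time.time() - begin) * 1000:.1f} ms")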
Numerical Linear Algebra: More Details¶
- Arrays are mutable
- Multiple references can exist! (see the copy sketch below)
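If you do want independent data, you have to copy explicitly; a minimal sketch (`.clone()` in PyTorch, `.copy()` in NumPy):

import torch

x = torch.zeros(8)
z = x[1:5].clone()   # clone() copies the data instead of creating a view
z[0] = 99
print(x)             # still all zeros: the original is untouched
# NumPy equivalent: x_np[1:5].copy()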
Numerical Linear Algebra On-Device¶
- The simplest version of this is essentially a "copy" of NumPy for each kind of hardware we want to run on: every function we want to support exists once per device type.
- e.g. one copy that runs on the CPU, one copy that runs on the GPU
- Arrays are located explicitly on one device
- in PyTorch, you move them with x.to("device_name")
- When we try to call a function, the library checks where the inputs are located
- if they're all on one device, it calls that device's version of the function
- if they're not all on the same device, it raises an exception (sketched after the next cell)
In [28]:
x = torch.randn(4,4)
y = torch.randn(4,4,device='mps')
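Trying to combine the two raises an error; a sketch (this needs an MPS-capable machine, and the exact message depends on the PyTorch version):

try:
    z = x + y          # x lives on the CPU, y on the MPS device
except RuntimeError as e:
    print("RuntimeError:", e)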
Eager Execution vs Graph Execution¶
When we manifest a node of the compute graph, we can either:
- (eager) compute and manifest the value at that node immediately
- (graph) just manifest the node
- need to call some function to compute the forward pass later
This was the classic distinction between TensorFlow and PyTorch
In [40]:
import torch
In [41]:
x = torch.ones((1,3,4))
In [45]:
x = torch.ones(())
x = x + x
In [46]:
x
Out[46]:
tensor(2.)
In [47]:
x = torch.ones(())
x.requires_grad = True
u = (x + 2)
y = u.square()  # y = (x + 2)^2, so dy/dx = 2 * (x + 2) = 6 at x = 1
y
Out[47]:
tensor(9., grad_fn=<PowBackward0>)
In [48]:
with torch.no_grad():
    v = x * x   # not recorded in the graph: v has no grad_fn and requires_grad=False
In [49]:
v
Out[49]:
tensor(1.)
In [51]:
y.backward()
In [54]:
x.grad
Out[54]:
tensor(6.)
In [56]:
def dumb_abs(a):
    if a >= 0:
        return a
    else:
        return -a
In [65]:
x = torch.tensor(-5.0)
x.requires_grad = True
In [66]:
y = dumb_abs(x)
y
Out[66]:
tensor(5., grad_fn=<NegBackward0>)
In [67]:
y.backward()
x.grad
Out[67]:
tensor(-1.)
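Since x was negative, only the `return -a` branch entered the graph (hence grad_fn=<NegBackward0> and a gradient of -1). Re-running with a positive input traces the other branch; a quick sketch (not in the original notebook):

x2 = torch.tensor(5.0, requires_grad=True)
y2 = dumb_abs(x2)      # this time the positive branch is taken
y2.backward()
print(x2.grad)         # tensor(1.)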
Advantages of eager mode (compute values & manifest graph at the same time):
- much better for value-based debugging!
- varying shapes
- less complicated
- condition on values (see the tracing sketch after these lists)
Advantages of lazy mode/graph mode (manifest graph first, then compute values):
- heavier static optimization
- better for shape-based debugging
- could use less memory
- graph overhead less/amortized
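To see why value-dependent control flow is awkward once you capture a graph first, here is a sketch using `torch.jit.trace` (one way PyTorch can capture a graph; not part of the original lecture). Tracing records only the ops that actually ran for the example input, so the branch taken gets baked into the graph:

# uses dumb_abs from above; tracing emits a warning about converting a tensor to a Python boolean
traced = torch.jit.trace(dumb_abs, torch.tensor(-5.0))
print(traced(torch.tensor(-5.0)))   # tensor(5.)
print(traced(torch.tensor(3.0)))    # tensor(-3.) -- the 'if' was not captured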
In [68]:
import time
In [70]:
N = 1024 * 16
X = torch.randn(N,N)
Y = torch.randn(N,N)
In [82]:
begin = time.time()
Z = X @ Y
print(Z[0,0])
end = time.time()
print(f"elapsed: {(end - begin) * 1000} ms")
tensor(111.4226)
elapsed: 4410.971879959106 ms
In [83]:
X_mps = X.to("mps")
Y_mps = Y.to("mps")
begin = time.time()
Z_mps = X_mps @ Y_mps
print(Z_mps[0,0])
end = time.time()
print(f"elapsed: {(end - begin) * 1000} ms")
tensor(111.4226, device='mps:0')
elapsed: 912.1642112731934 ms
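A side note (not from the lecture): MPS kernels are launched asynchronously, and the timing above is only meaningful because printing `Z_mps[0,0]` forces the result back to the CPU inside the timed region. An explicit alternative, assuming a PyTorch build where `torch.mps.synchronize()` is available:

begin = time.time()
Z_mps = X_mps @ Y_mps
torch.mps.synchronize()   # wait for the GPU to actually finish the matmul
end = time.time()
print(f"elapsed: {(end - begin) * 1000} ms")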
In [84]:
import torch
In [85]:
torch.nn
Out[85]:
<module 'torch.nn' from '/opt/anaconda3/lib/python3.10/site-packages/torch/nn/__init__.py'>
In [86]:
X = torch.nn.Linear(128,128)
In [98]:
X = torch.nn.Sequential(
    torch.nn.Linear(128,128),
    torch.nn.ReLU(),
    torch.nn.Linear(128,128),
    torch.nn.ReLU(),
    torch.nn.Linear(128,1)
)
In [99]:
X
Out[99]:
Sequential(
  (0): Linear(in_features=128, out_features=128, bias=True)
  (1): ReLU()
  (2): Linear(in_features=128, out_features=128, bias=True)
  (3): ReLU()
  (4): Linear(in_features=128, out_features=1, bias=True)
)
In [100]:
a = torch.randn(128)
In [103]:
X[0].weight
Out[103]:
Parameter containing:
tensor([[-0.0642,  0.0879, -0.0534,  ...,  0.0206,  0.0588, -0.0675],
        [-0.0641, -0.0230,  0.0114,  ...,  0.0265,  0.0713, -0.0848],
        [-0.0686,  0.0794,  0.0154,  ...,  0.0783,  0.0211,  0.0513],
        ...,
        [ 0.0850, -0.0019,  0.0655,  ..., -0.0470, -0.0641,  0.0205],
        [ 0.0141,  0.0150,  0.0432,  ...,  0.0211, -0.0118, -0.0345],
        [-0.0493,  0.0454,  0.0276,  ..., -0.0297,  0.0097,  0.0564]],
       requires_grad=True)
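A sketch of running the network forward on the `a` created above (not executed in the original notebook):

out = X(a)           # 128 -> 128 -> 128 -> 1
print(out.shape)     # torch.Size([1])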
In [99]:
U_mps = U.to("mps")
In [100]:
U
Out[100]:
tensor([1.])
In [101]:
U_mps[0] = 3
In [102]:
U
Out[102]:
tensor([1.])
In [103]:
U_mps
Out[103]:
tensor([3.], device='mps:0')
In [104]:
U_mps.cpu()
Out[104]:
tensor([3.])
In [105]:
U_mps.to("cpu")
Out[105]:
tensor([3.])
In [106]:
U = torch.ones((5,5,5),device="mps")
In [111]:
(N ** 3)/(1024**3)  # elements in an N x N x N tensor, in units of 2^30: 4096 Gi elements, ~16 TiB at float32
Out[111]:
4096.0
In [113]:
U = torch.ones((N,N,N),device="meta")  # ~16 TiB of float32 if it were real -- fine on the meta device
The meta device: a fake device that stores only metadata (shapes, dtypes). It can "hold" tensors of any size and run ops of any size, but you can't see the results!
In [38]:
torch.ones((1000000,2000000),device="meta") @ torch.ones((2000000,3000000),device="meta")
Out[38]:
tensor(..., device='meta', size=(1000000, 3000000))
In [39]:
torch.ones((10000,20000),device="meta") @ torch.ones((30000,30000),device="meta")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[39], line 1
----> 1 torch.ones((10000,20000),device="meta") @ torch.ones((30000,30000),device="meta")

File /opt/anaconda3/lib/python3.10/site-packages/torch/_meta_registrations.py:2100, in meta_mm(a, b)
   2098 N, M1 = a.shape
   2099 M2, P = b.shape
-> 2100 torch._check(
   2101     M1 == M2,
   2102     lambda: f"a and b must have same reduction dim, but got [{N}, {M1}] X [{M2}, {P}].",
   2103 )
   2104 return a.new_empty(N, P)

RuntimeError: a and b must have same reduction dim, but got [10000, 20000] X [30000, 30000].
In [119]:
torch.ones((1000000,2000000),device="meta")
Out[119]:
tensor(..., device='meta', size=(1000000, 2000000))
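What a meta tensor does keep is its metadata; a small sketch:

t = torch.ones((1000000, 2000000), device="meta")
print(t.shape, t.dtype, t.device)   # shape/dtype/device are available; the values are not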
In [ ]:
torch.ones((1000000,2000000)).to("meta")  # unlike creating directly on "meta", this would first allocate the full ~8 TB tensor on the CPU