import numpy
import scipy
import matplotlib
import time
Recall our first principle from last lecture...
Principle #1: Write your learning task as an optimization problem, and solve it via fast algorithms that update the model iteratively with easy-to-compute steps using numerical linear algebra.
A simple example: we can represent the properties of an object using a feature vector (or embedding) in $\R^d$. Say we wanted to predict something about a group of people including this guy (former mayor of Ithaca Svante Myrick) using the fact that he is 36, graduated in 2009, started being mayor in 2012, and makes $58,561 a year.
One way to represent this is as a vector in $4$-dimensional space.
$$x = \begin{bmatrix}36 \\ 2009 \\ 2012 \\ 58561\end{bmatrix}.$$Representing the information as a vector makes it easier for us to express ML models with it. We can then represent other objects we want to make predictions about with their own vectors, e.g.
$$x = \begin{bmatrix}80 \\ 1965 \\ 2021 \\ 400000\end{bmatrix}.$$Before we start in on how to compute with vectors, matrices, et cetera, we should make sure we're all on the same page about what these objects are.
A vector (represented on a computer) is an array of numbers (usually floating point numbers). We say that the dimension (or length) of the vector is the size of the array, i.e. the number of numbers it contains.
A vector (in mathematics) is an element of a vector space. Recall: a vector space over the real numbers is a set $V$ together with two binary operations $+$ (mapping $V \times V$ to $V$) and $\cdot$ (mapping $\R \times V$ to $V$) satisfying the following axioms for any $x, y, z \in V$ and $a, b \in \R$:
- associativity of addition: $(x + y) + z = x + (y + z)$
- commutativity of addition: $x + y = y + x$
- additive identity: there is a vector $0 \in V$ with $x + 0 = x$
- additive inverses: for each $x \in V$ there is a $-x \in V$ with $x + (-x) = 0$
- compatibility of scalar multiplication: $a \cdot (b \cdot x) = (ab) \cdot x$
- scalar identity: $1 \cdot x = x$
- distributivity over vector addition: $a \cdot (x + y) = a \cdot x + a \cdot y$
- distributivity over scalar addition: $(a + b) \cdot x = a \cdot x + b \cdot x$
We can treat our CS-style array of numbers as modeling a mathematical vector by letting $+$ add the two vectors elementwise and $\cdot$ multiply each element of the vector by the same scalar.
Again from the maths perspective, we say that a set of vectors $x_1, x_2, \ldots, x_d$ is linearly independent when no vector can be written as a linear combination of the others. That is,
$$\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_d x_d = 0 \;\leftrightarrow \alpha_1 = \alpha_2 = \cdots = \alpha_d = 0.$$We say the span of some vectors $x_1, x_2, \ldots, x_d$ is the set of vectors that can be written as a linear combination of those vectors
$$\operatorname{span}(x_1,x_2,\ldots,x_d) = \{\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_d x_d \mid \alpha_i \in \R \}.$$Finally, a set of vectors is a basis for the vector space $V$ if it is linearly independent and if its span is the whole space $V$.
We say the dimension of the space is $d$ if it has a basis of size $d$.
Since the basis vectors are linearly independent and span the space, any vector $v$ in the space can be written uniquely as
$$v = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_d x_d$$for some real numbers $\alpha_1, \alpha_2, \ldots, \alpha_d$. So to represent $v$ on a computer, it suffices to store $\alpha_1$, $\alpha_2$, $\ldots$, and $\alpha_d$. We may as well store them in an array...and this gets us back to our CS-style notion of what a vector is.
Typically, when we work with a $d$-dimensional vector space, we call it $\R^d$, and we use the standard basis, which I denote $e_1, \ldots, e_d$. E.g. in 3 dimensions this is defined as
$$e_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \; e_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \; e_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix},$$and more generally $e_i$ has a $1$ in the $i$th entry of the vector and $0$ otherwise. In this case, if $x_i$ denotes the $i$th entry of a vector $x \in \R^d$, then
$$x = x_1 e_1 + x_2 e_2 + \cdots + x_d e_d = \sum_{i=1}^d x_i e_i.$$In Python, we can use the library numpy to compute using vectors.
import numpy
u = numpy.array([1.0,2.0,3.0])
v = numpy.array([4.0,5.0,6.0])
print('u = {}'.format(u))
print('v = {}'.format(v))
print('u + v = {}'.format(u + v))
print('2 * u = {}'.format(2 * u))
u = [1. 2. 3.]
v = [4. 5. 6.]
u + v = [5. 7. 9.]
2 * u = [2. 4. 6.]
We can see that both of the standard vector operations (addition and scalar multiplication) are supported easily!
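Relatedly, if we want the coefficients that represent a vector in some other (non-standard) basis, as discussed above, we can solve a small linear system. Here is a quick sketch (the basis below is just an arbitrary invertible matrix chosen for illustration) using numpy.linalg.solve.
B = numpy.array([[1.0,1,0],[0,1,1],[0,0,1]]) # columns of B are the (hypothetical) basis vectors
v = numpy.array([2.0,3.0,5.0])
alpha = numpy.linalg.solve(B, v) # coefficients alpha such that B @ alpha = v
print('alpha = {}'.format(alpha))
print('B @ alpha = {}'.format(B @ alpha)) # reconstructs v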
We say a function $F$ from a vector space $U$ to a vector space $V$ is a linear map if for any $x, y \in U$ and any $a \in \R$,
$$F(ax + y) = a F(x) + F(y).$$Just as we represent a vector on a computer as a one-dimensional array of numbers, we can represent a linear map from $\R^m$ to $\R^n$ as a two-dimensional $n \times m$ array of numbers (we'll see why shortly). We call this two-dimensional-array representation of a linear map a matrix. Here is an example of a matrix in $\R^{3 \times 3}$
$$A = \begin{bmatrix}1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}.$$We use multiplication to denote the effect of a matrix operating on a vector (this is equivalent to applying the corresponding linear map as a function). E.g. if $F$ is the linear map corresponding to matrix $A$ (really they are the same object, but I'm using different letters here to keep the notation clear), then
$$y = F(x) \;\equiv\; y = Ax.$$We can add two matrices, and scale a matrix by a scalar.
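For instance (with values chosen arbitrarily just for illustration), matrix addition and scaling work elementwise in numpy, just like they did for vectors.
A = numpy.array([[1.0,2],[3,4]])
B = numpy.array([[10.0,20],[30,40]])
print('A + B = {}'.format(A + B)) # matrix addition
print('3 * A = {}'.format(3 * A)) # scaling a matrix by a scalar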
If $A \in \R^{n \times m}$ is the matrix that corresponds to the linear map $F$, and $A_{ij}$ denotes the $(i,j)$th entry of the matrix, then by our construction
$$F(e_j) = \sum_{i=1}^n A_{ij} e_i$$and so for any $x \in \R^m$
$$F(x) = F\left( \sum_{j=1}^m x_j e_j \right) = \sum_{j=1}^m x_j F( e_j ) = \sum_{j=1}^m x_j \sum_{i=1}^n A_{ij} e_i = \sum_{i=1}^n \left( \sum_{j=1}^m A_{ij} x_j \right) e_i.$$So, this means that the $i$th entry of $F(x)$ will be
$$(F(x))_i = \sum_{j=1}^m A_{ij} x_j.$$A direct implementation of our matrix multiply formula: $$(F(x))_i = \sum_{j=1}^m A_{ij} x_j.$$
x = numpy.array([1.0,2.0,3.0])
A = numpy.array([[1.0,2,3],[4,5,6]])
def matrix_multiply(A, x):
    (n,m) = A.shape
    assert(m == x.size)
    y = numpy.zeros(n)
    for i in range(n):
        for j in range(m):
            y[i] += A[i,j] * x[j]
    return y
print('x = {}'.format(x))
print('A = {}'.format(A))
print('Ax = {}'.format(matrix_multiply(A,x)))
x = [1. 2. 3.]
A = [[1. 2. 3.]
 [4. 5. 6.]]
Ax = [14. 32.]
# numpy has its own built-in support for matrix multiply
print('Ax = {}'.format(A @ x)) # numpy uses @ to mean matrix multiply
Ax = [14. 32.]
Comparing numpy's built-in matrix multiply with my naive for-loop matrix multiply, the numpy version is much faster.
# generate some random data
x = numpy.random.randn(1024)
A = numpy.random.randn(1024,1024)
import time
t = time.time()
for trial in range(20):
    B = matrix_multiply(A,x)
my_time = time.time() - t
print('my matrix multiply: {} seconds'.format(my_time))
t = time.time()
for trial in range(20):
    B = A @ x
np_time = time.time() - t
print('numpy matmul: {} seconds'.format(np_time))
print('numpy was {:.0f}x faster'.format(my_time/np_time))
my matrix multiply: 4.296830177307129 seconds
numpy matmul: 0.0645897388458252 seconds
numpy was 67x faster
We can also multiply two matrices, which corresponds to function composition of linear maps: if $A \in \R^{n \times m}$ and $B \in \R^{m \times p}$, then $AB \in \R^{n \times p}$ has entries $(AB)_{ij} = \sum_{k=1}^m A_{ik} B_{kj}$, and $(AB)x = A(Bx)$ for any $x \in \R^p$.
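As a quick check (with matrices and a vector chosen arbitrarily for illustration), applying $B$ to a vector and then $A$ gives the same result as applying the single product matrix $AB$.
A = numpy.array([[1.0,2],[3,4]])
B = numpy.array([[0.0,1],[2,5]])
x = numpy.array([6.0,7.0])
print('A(Bx) = {}'.format(A @ (B @ x))) # apply B first, then A
print('(AB)x = {}'.format((A @ B) @ x)) # apply the composed map AB; same result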
One special matrix is the identity matrix $I$, which has the property that $Ix = x$ for any $x$.
A = numpy.ones((2,3))
B = numpy.array([[3.0,8,1],[-7,2,-1],[0,2,-2]])
I = numpy.eye(3) # identity matrix
print('size of A = {}'.format(A.shape))
print('size of B = {}'.format(B.shape))
print('u = {}'.format(u))
print('A = {}'.format(A))
print('B = {}'.format(B))
print('I = {}'.format(I))
print('A.shape = {}'.format(A.shape))
print('B.shape = {}'.format(B.shape))
print('Iu = {}'.format(I @ u)) # numpy uses @ to mean matrix multiply
# print('Au = {}'.format(A @ u))
print('AB = {}'.format(A @ B))
print('BA = {}'.format(B @ A)) # should cause an error!
size of A = (2, 3)
size of B = (3, 3)
u = [1. 2. 3.]
A = [[1. 1. 1.]
 [1. 1. 1.]]
B = [[ 3.  8.  1.]
 [-7.  2. -1.]
 [ 0.  2. -2.]]
I = [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
A.shape = (2, 3)
B.shape = (3, 3)
Iu = [1. 2. 3.]
AB = [[-4. 12. -2.]
 [-4. 12. -2.]]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[18], line 17
     15 # print('Au = {}'.format(A @ u))
     16 print('AB = {}'.format(A @ B))
---> 17 print('BA = {}'.format(B @ A))

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 3)
Transposition takes an $n \times m$ matrix and swaps the rows and columns to produce an $m \times n$ matrix. Formally,
$$(A^T)_{ij} = A_{ji}.$$A matrix that is its own transpose (i.e. $A = A^T$) is called a symmetric matrix.
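For example, here is a tiny sketch (with a matrix chosen arbitrarily) checking symmetry in numpy by comparing a matrix with its transpose.
S = numpy.array([[2.0,1,0],[1,3,4],[0,4,5]])
print('S symmetric? {}'.format((S == S.T).all())) # True, since S equals its own transpose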
We can also transpose a vector. Transposing a vector $x \in \R^d$ gives a matrix in $\R^{1 \times d}$, also known as a row vector. This gives us a handy way of defining the dot product which maps a pair of vectors to a scalar.
$$x^T y = y^T x = \langle x, y \rangle = \sum_{i=1}^d x_i y_i$$It also gives us a handy way of grabbing the $i$th element of a vector, since $x_i = e_i^T x$ (and $A_{ij} = e_i^T A e_j$).
A very useful identity: in $\R^d$, $\sum_{i=1}^d e_i e_i^T = I$.
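As a quick numerical check of this identity (and of the indexing trick $A_{ij} = e_i^T A e_j$ above), here is a small sketch with a matrix chosen just for illustration.
d = 3
A = numpy.array([[1.0,2,3],[4,5,6],[7,8,9]])
total = numpy.zeros((d,d))
for i in range(d):
    e_i = numpy.zeros(d)
    e_i[i] = 1.0
    total += numpy.outer(e_i, e_i) # accumulate the outer products e_i e_i^T
print(total) # prints the 3 x 3 identity matrix
e_i = numpy.zeros(d); e_i[1] = 1.0 # 0-indexed, so this is the "second" standard basis vector
e_j = numpy.zeros(d); e_j[2] = 1.0 # and this is the "third"
print(e_i @ A @ e_j) # 6.0, the entry of A in row 1, column 2 (0-indexed)
print(A[1,2]) # same entry, accessed directly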
A = numpy.array([[1,2,3],[4,5,6]])
print('A = {}'.format(A))
print('A.T = {}'.format(A.T))
print('u = {}'.format(u))
print('u.T @ u = {}'.format(u.T @ u))
A = [[1 2 3]
 [4 5 6]]
A.T = [[1 4]
 [2 5]
 [3 6]]
u = [1. 2. 3.]
u.T @ u = 14.0
Often, we want to express some mathematics that goes beyond the addition and scalar multiplication operations in a vector space. Sometimes, to do this we use elementwise operations which operate on a vector/matrix (or pair of vectors/matrices) on a per-element basis. E.g. if
$$x = \begin{bmatrix}1 \\ 4 \\ 9 \\ 16\end{bmatrix},$$then if $\operatorname{sqrt}$ operates elementwise,
$$\operatorname{sqrt}(x) = \begin{bmatrix}1 \\ 2 \\ 3 \\ 4\end{bmatrix}.$$We can also do this with matrices and with binary operations.
x = numpy.array([1.0,4,9])
y = numpy.array([2,5,3])
z = numpy.array([2,3,7,8])
print('x = {}'.format(x))
print('y = {}'.format(y))
print('z = {}'.format(z))
print('sqrt(x) = {}'.format(numpy.sqrt(x)))
print('x * y = {}'.format(x * y)) # simple numerical operations are elementwise by default in numpy
print('x / y = {}'.format(x / y))
print('x * z = {}'.format(x * z)) # should cause error
x = [1. 4. 9.]
y = [2 5 3]
z = [2 3 7 8]
sqrt(x) = [1. 2. 3.]
x * y = [ 2. 20. 27.]
x / y = [0.5 0.8 3. ]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[21], line 11
      9 print('x * y = {}'.format(x * y)) # simple numerical operations are elementwise by default in numpy
     10 print('x / y = {}'.format(x / y))
---> 11 print('x * z = {}'.format(x * z))

ValueError: operands could not be broadcast together with shapes (3,) (4,)
We just saw that we can't use elementwise operations on pairs of vectors/matrices if they are not the same size. Broadcasting allows us to be more expressive by automatically expanding a vector/matrix along an axis of dimension 1.
# x = numpy.array([2.0,3])
# A = numpy.array([[1.,2],[3,4]])
# print(x.shape)
# print(A.shape)
# print(x)
# print(A)
(numpy.ones((5,3,2)) @ (numpy.ones((7,2,4))))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[53], line 10
      1 # x = numpy.array([2.0,3])
      2 # A = numpy.array([[1.,2],[3,4]])
   (...)
      7 # print(x)
      8 # print(A)
---> 10 (numpy.ones((5,3,2)) @ (numpy.ones((7,2,4))))

ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (5,3,2)->(5,newaxis,newaxis) (7,2,4)->(7,newaxis,newaxis) and requested shape (3,4)
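On the other hand, here is a small sketch (values chosen arbitrarily) of broadcasting actually working: a shape-(3,) vector is expanded across the rows of a (2, 3) matrix, and a (2, 1) column is expanded across its columns.
A = numpy.array([[1.0,2,3],[4,5,6]]) # shape (2, 3)
r = numpy.array([10.0,20,30]) # shape (3,)
c = numpy.array([[100.0],[200.0]]) # shape (2, 1)
print('A + r = {}'.format(A + r)) # r is added to every row of A
print('A + c = {}'.format(A + c)) # c is added to every column of A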
We say that a matrix is stored as a 2-dimensional array. A tensor generalizes this to an array with however many dimensions (axes) you want.
From a mathematical perspective, a tensor is a multilinear map in the same way that a matrix is a linear map. That is, it's equivalent to a function
$$F(x_1, x_2, \ldots, x_n) \in \R$$where $F$ is linear in each of the inputs $x_i \in \R^{d_i}$ taken individually (i.e. with all the other inputs fixed).
For example, the following is a multilinear map of three vectors in $\R^2$:
$$F\left(\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}, \begin{bmatrix} x_2 \\ y_2 \end{bmatrix}, \begin{bmatrix} x_3 \\ y_3 \end{bmatrix} \right) = x_1 y_2 x_3.$$We'll come back to this later when we discuss tensors in ML frameworks.
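As a concrete sketch (with arbitrary input values, purely for illustration), this particular map can be stored as a $2 \times 2 \times 2$ numpy array and evaluated by contracting it against the three input vectors with numpy.einsum.
T = numpy.zeros((2,2,2))
T[0,1,0] = 1.0 # so that F(u,v,w) = sum_{ijk} T[i,j,k] u[i] v[j] w[k] = u[0] * v[1] * w[0]
u = numpy.array([2.0,3.0])
v = numpy.array([4.0,5.0])
w = numpy.array([6.0,7.0])
print(numpy.einsum('ijk,i,j,k->', T, u, v, w)) # 2 * 5 * 6 = 60
print(u[0] * v[1] * w[0]) # same value, computed directly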
Suppose that we have $n$ websites, and we have collected a matrix $A \in \R^{n \times n}$, where $A_{ij}$ counts the number of links from website $i$ to website $j$.
We want to produce a new matrix $B \in \R^{n \times n}$ such that $B_{ij}$ measures the fraction of links from website $i$ that go to website $j$.
How do we compute this?
$$B_{ij} = \frac{A_{ij}}{\sum_{k=1}^n A_{ik}}$$
# generate some random data to work with
n = 6
A = numpy.random.randint(0,6,(n,n))**2 + numpy.random.randint(0,5,(n,n))
B_for = numpy.zeros((n,n))
for i in range(n):
    for j in range(n):
        acc = 0
        for k in range(n):
            acc += A[i,k]
        B_for[i,j] = A[i,j] / acc
# print(B_for - (A / numpy.sum(A, axis=1, keepdims=True)))
sumAik = A @ numpy.ones((n,1))
print(B_for - (A / sumAik))
# numpy.sum(A, axis=1).shape
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
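For reference, here is the fully vectorized version hinted at by the commented-out line in the cell above (a sketch assuming A and B_for from that cell are still in scope): numpy.sum with keepdims=True produces an (n, 1) column of row sums, which broadcasting then divides into each row of A.
B_vec = A / numpy.sum(A, axis=1, keepdims=True)
print(B_for - B_vec) # should again print an all-zeros matrix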
Many, if not most, machine learning training algorithms use gradients to optimize a function.
What is a gradient?
Suppose I have a function $f$ from $\R^d$ to $\R$. The gradient, $\nabla f$, is a function from $\R^d$ to $\R^d$ such that
$$\left(\nabla f(w) \right)_i = \frac{\partial}{\partial w_i} f(w) = \lim_{\delta \rightarrow 0} \frac{f(w + \delta e_i) - f(w)}{\delta},$$that is, it is the vector of partial derivatives of the function. Another, perhaps cleaner (and basis-independent), definition is that $\nabla f(w)^T$ is the linear map such that for any $u \in \R^d$
$$\nabla f(w)^T u = \lim_{\delta \rightarrow 0} \frac{f(w + \delta u) - f(w)}{\delta}.$$More informally, it is the unique vector such that $f(w) \approx f(w_0) + (w - w_0)^T \nabla f(w_0)$ for $w$ nearby $w_0$.
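Here is a small sketch of checking a gradient numerically against the finite-difference definition above, using a function chosen purely for illustration ($f(w) = \sum_i \sin(w_i)$, whose gradient is $\cos$ applied elementwise); the same trick is handy for checking your answers to the examples below.
def f(w):
    return numpy.sum(numpy.sin(w))
w = numpy.array([0.1,0.2,0.3])
delta = 1e-6
numerical_grad = numpy.zeros(w.size)
for i in range(w.size):
    e_i = numpy.zeros(w.size)
    e_i[i] = 1.0
    numerical_grad[i] = (f(w + delta * e_i) - f(w)) / delta # finite-difference approximation
print(numerical_grad) # approximately cos(w)
print(numpy.cos(w)) # the exact gradient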
$f(x) = x^T A x$
$f(x) = \| x \|_2^2 = \sum_{i=1}^d x_i^2$
...
$f(x) = \| x \|_1 = \sum_{i=1}^d | x_i |$
...
$f(x) = \| x \|_{\infty} = \max(|x_1|, |x_2|, \ldots, |x_d|)$
...
...and you should become skilled in mapping from mathematical expressions to numpy and back.