Lecture 2: Linear algebra done efficiently

CS4787 — Principles of Large-Scale Machine Learning Systems

$\newcommand{\R}{\mathbb{R}}$

In [1]:
import numpy
import scipy
import matplotlib
import time

def hide_code_in_slideshow():   
    from IPython import display
    import binascii
    import os
    uid = binascii.hexlify(os.urandom(8)).decode()    
    html = """<div id="%s"></div>
    <script type="text/javascript">
        $(function(){
            var p = $("#%s");
            if (p.length==0) return;
            while (!p.hasClass("cell")) {
                p=p.parent();
                if (p.prop("tagName") =="body") return;
            }
            var cell = p;
            cell.find(".input").addClass("hide-in-slideshow")
        });
    </script>""" % (uid, uid)
    display.display_html(html, raw=True)

Recall our first principle from last lecture...

Principle #1: Write your learning task as an optimization problem, and solve it via fast algorithms that update the model iteratively with easy-to-compute steps using numerical linear algebra.

A simple example: we can represent the properties of an object using a feature vector (or embedding) in $\R^d$. Say we wanted to predict something about a group of people including this guy (Svante Myrick) using the fact that he is 33, graduated in 2009, started his current job in 2012, and makes $58,561 a year.

One way to represent this is as a vector in $4$ dimensional space.

$$x = \begin{bmatrix}33 \\ 2009 \\ 2012 \\ 58561\end{bmatrix}.$$

Representing the information as a vector makes it easier for us to express ML models with it. We can then represent other objects we want to make predictions about with their own vectors, e.g.

$$x = \begin{bmatrix}78 \\ 1965 \\ 2021 \\ 400000\end{bmatrix}.$$
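
As a small preview of the numpy code we will write below (the arrays here are just the two example vectors above), we can build these feature vectors on a computer and stack them into a little data matrix with one row per person:

In [ ]:
import numpy

# feature vectors for the two people above: [age, graduation year, job start year, salary]
x1 = numpy.array([33.0, 2009.0, 2012.0, 58561.0])
x2 = numpy.array([78.0, 1965.0, 2021.0, 400000.0])

# stacking one vector per row gives a small dataset matrix in R^{2 x 4}
X = numpy.stack([x1, x2])
print(X.shape)  # (2, 4)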

Linear Algebra: A Review

Before we start in on how to compute with vectors, matrices, et cetera, we should make sure we're all on the same page about what these objects are.

A vector (represented on a computer) is an array of numbers (usually floating point numbers). We say that the dimension (or length) of the vector is the size of the array, i.e. the number of numbers it contains.

A vector (in mathematics) is an element of a vector space. Recall: a vector space over the real numbers is a set $V$ together with two binary operations $+$ (mapping $V \times V$ to $V$) and $\cdot$ (mapping $\R \times V$ to $V$) satisfying the following axioms for any $x, y, z \in V$ and $a, b \in \R$

  • $x + y \in V$ and $a \cdot x = ax \in V$ (closure)
  • $(x + y) + z = x + (y + z)$ (associativity of addition)
  • $x + y = y + x$ (commutativity of addition)
  • there exists a $(-x)$ such that $x + (-x) = 0$ (negation)
  • there exists $0 \in V$ such that $0 + x = x + 0 = x$ (zero element)
  • $a(bx) = b(ax) = (ab)x$ (associativity of scalar multiplication)
  • $1x = x$ (multiplication by one)
  • $a(x+y) = ax + ay$ and $(a+b)x = ax + bx$ (distributivity)

We can treat our CS-style array of numbers as modeling a mathematical vector by letting $+$ add the two vectors elementwise and $\cdot$ multiply each element of the vector by the same scalar.

Again from the maths perspective, we say that a set of vectors $x_1, x_2, \ldots, x_d$ is linearly independent when no vector can be written as a linear combination of the others. That is,

$$\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_d x_d = 0 \;\Leftrightarrow\; \alpha_1 = \alpha_2 = \cdots = \alpha_d = 0.$$

We say the span of some vectors $x_1, x_2, \ldots, x_d$ is the set of vectors that can be written as a linear combination of those vectors

$$\operatorname{span}(x_1,x_2,\ldots,x_d) = \{\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_d x_d \mid \alpha_i \in \R \}.$$

Finally, a set of vectors is a basis for the vector space $V$ if it is linearly independent and if its span is the whole space $V$.

  • Equivalently, a set of vectors is a basis if any vector $v \in V$ can be written uniquely as a linear combination of vectors in the basis.

We say the dimension of the space is $d$ if it has a basis of size $d$.
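
Jumping ahead a little to numpy (introduced properly below), here is a minimal sketch of checking linear independence; the example vectors are made up for illustration. Stack the candidate vectors as the columns of a matrix: they are linearly independent exactly when the rank of that matrix equals the number of vectors.

In [ ]:
import numpy

x1 = numpy.array([1.0, 0.0, 2.0])
x2 = numpy.array([0.0, 1.0, 1.0])
x3 = x1 + 2 * x2  # deliberately a linear combination of x1 and x2

# stack the candidate vectors as the columns of a matrix
X = numpy.stack([x1, x2, x3], axis=1)

# linearly independent iff the rank equals the number of columns
print(numpy.linalg.matrix_rank(X) == X.shape[1])  # False: x3 is in span(x1, x2)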

What does this have to do with our computer-science definition of a vector?

If any vector $v$ in the space can be written uniquely as

$$v = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_d x_d$$

for some real numbers $\alpha_1, \alpha_2, \ldots$, then to represent $v$ on a computer, it suffices to store $\alpha_1$, $\alpha_2$, $\ldots$, and $\alpha_d$. We may as well store them in an array...and this gets us back to our CS-style notion of what a vector is.

  • Importantly, this only works for finite-dimensional vector spaces!

Typically, when we work with a $d$-dimensional vector space, we call it $\R^d$, and we use the standard basis, which I denote $e_1, \ldots, e_d$. E.g. in 3 dimensions this is defined as

$$e_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \; e_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \; e_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix},$$

and more generally $e_i$ has a $1$ in the $i$th entry of the vector and $0$ otherwise. In this case, if $x_i$ denotes the $i$th entry of a vector $x \in \R^d$, then

$$x = x_1 e_1 + x_2 e_2 + \cdots + x_d e_d = \sum_{i=1}^d x_i e_i.$$

In Python

In Python, we use the library numpy to compute using vectors.

In [2]:
import numpy

u = numpy.array([1.0,2.0,3.0])
v = numpy.array([4.0,5.0,6.0])

print('u = {}'.format(u))
print('v = {}'.format(v))
print('u + v = {}'.format(u + v))
print('2 * u = {}'.format(2 * u))
u = [1. 2. 3.]
v = [4. 5. 6.]
u + v = [5. 7. 9.]
2 * u = [2. 4. 6.]

We can see that the standard vector operations are both supported easily!
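
We can also sanity-check the standard-basis decomposition $x = \sum_{i=1}^d x_i e_i$ from above. This is a minimal sketch, using the fact that the columns of numpy.eye(d) are the standard basis vectors $e_1, \ldots, e_d$.

In [ ]:
import numpy

d = 3
x = numpy.array([1.0, 2.0, 3.0])

# the columns of the identity matrix are the standard basis vectors e_1, ..., e_d
E = numpy.eye(d)

# reconstruct x as sum_i x_i e_i
x_rebuilt = sum(x[i] * E[:, i] for i in range(d))
print(numpy.allclose(x, x_rebuilt))  # True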

Question: What have you seen represented as a vector in your previous experience with machine learning?

Answers:

  • words/text
  • images
  • gene sequences
  • speech & video
  • feature vectors
  • graphs
  • houses
  • stock prices!

Linear Maps

We say a function $F$ from a vector space $U$ to a vector space $V$ is a linear map if for any $x, y \in U$ and any $a \in \R$,

$$F(ax + y) = a F(x) + F(y).$$
  • Notice that if we know $F(e_i)$ for all the basis elements $e_i$ of $U$, then this uniquely determines $F$ (why?).
  • So, if we want to represent $F$ on a computer and $U$ and $V$ are finite-dimensional vector spaces of dimensions $m$ and $n$ respectively, it suffices to store $F(e_1), F(e_2), \ldots, F(e_m)$.
  • Each $F(e_i)$ is itself an element of $V$, which we can represent on a computer as an array of $n$ numbers (since $V$ is $n$-dimensional).
  • So, we can represent $F$ as an array of $m$ arrays of $n$ numbers...or equivalently as a two-dimensional array (see the sketch after this list).
    • Sadly, this overloads the meaning of the term "dimension"...but usually the meaning is clear from context.
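
As a concrete sketch of this representation (the map $F$ below is made up for illustration, and we use numpy's @ operator for matrix-vector multiplication, defined formally below): apply $F$ to each standard basis vector and store the results as the columns of a two-dimensional array; the resulting array then reproduces the map.

In [ ]:
import numpy

# a made-up linear map F from R^3 to R^2, written as a plain Python function
def F(x):
    return numpy.array([x[0] + 2 * x[1], 3 * x[2]])

m = 3  # input dimension
# column j of the representation is F(e_j)
A = numpy.stack([F(numpy.eye(m)[:, j]) for j in range(m)], axis=1)

x = numpy.array([1.0, 2.0, 3.0])
print(numpy.allclose(F(x), A @ x))  # True: the array acts like the map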

Matrices

We call this two-dimensional-array representation of a linear map a matrix. Here is an example of a matrix in $\R^{3 \times 3}$

$$A = \begin{bmatrix}1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}.$$

We use multiplication to denote the effect of a matrix operating on a vector (this is equivalent to applying the linear map as a function). E.g. if $F$ is the linear map corresponding to matrix $A$ (really they are the same object, but I'm using different letters here to keep the notation clear), then

$$y = F(x) \;\equiv\; y = Ax.$$

We can add two matrices, and scale a matrix by a scalar.

  • Note that this means that the set of matrices $\R^{n \times m}$ is itself a vector space.

Matrix Multiply

If $A \in \R^{n \times m}$ is the matrix that corresponds to the linear map $F$, and $A_{ij}$ denotes the $(i,j)$th entry of the matrix, then by our construction

$$F(e_j) = \sum_{i=1}^n A_{ij} e_i$$

and so for any $x \in \R^m$

$$F(x) = F\left( \sum_{j=1}^m x_j e_j \right) = \sum_{j=1}^m x_j F( e_j ) = \sum_{j=1}^m x_j \sum_{i=1}^n A_{ij} e_i = \sum_{i=1}^n \left( \sum_{j=1}^m A_{ij} x_j \right) e_i.$$

So, this means that the $i$th entry of $F(x)$ will be

$$(F(x))_i = \sum_{j=1}^m A_{ij} x_j.$$

Matrices in Python

A direct implementation of our matrix multiply formula: $$(F(x))_i = \sum_{j=1}^m A_{ij} x_j.$$

In [3]:
x = numpy.array([1.0,2.0,3.0])
A = numpy.array([[1.0,2,3],[4,5,6]])

def matrix_multiply(A, x):
    (n,m) = A.shape
    assert(m == x.size)
    y = numpy.zeros(n)
    for i in range(n):
        for j in range(m):
            y[i] += A[i,j] * x[j]
    return y

print('x = {}'.format(x))
print('A = {}'.format(A))
print('Ax = {}'.format(matrix_multiply(A,x)))
x = [1. 2. 3.]
A = [[1. 2. 3.]
 [4. 5. 6.]]
Ax = [14. 32.]
In [4]:
# numpy has its own built-in support for matrix multiply
print('Ax = {}'.format(A @ x)) # numpy uses @ to mean matrix multiply
Ax = [14. 32.]

Using numpy buys us performance!

Comparing numpy's built-in matrix multiply with my naive for-loop implementation, the built-in version is orders of magnitude faster.

In [9]:
# generate some random data
x = numpy.random.randn(1024)
A = numpy.random.randn(1024,1024)

t = time.time()
for trial in range(5):
    B = matrix_multiply(A,x)
my_time = time.time() - t
print('my matrix multiply: {} seconds'.format(my_time))

t = time.time()
for trial in range(5):
    B = A @ x
np_time = time.time() - t
print('numpy matmul:       {} seconds'.format(np_time))

print('numpy was {:.0f}x faster'.format(my_time/np_time))
my matrix multiply: 3.1413068771362305 seconds
numpy matmul:       0.0011799335479736328 seconds
numpy was 2662x faster
Question: What have you seen represented as a matrix in your previous experience with machine learning?

Answers:

  • the weights in a layer of a neural network
  • parameters
  • tables
  • a whole dataset
  • image
  • geometric transformation (physics, computer graphics)
  • PCA
  • covariance matrices

Multiplying Two Matrices

We can also multiply two matrices, which corresponds to function composition of linear maps.

  • Of course, this only makes sense if the dimensions match!
  • For example, if $A \in \R^{n \times m}$ and $B \in \R^{q \times p}$, then it only makes sense to write $AB$ if $m = q$.
  • In this context, we often want to think of a vector $x \in \R^d$ as a $d \times 1$ matrix.
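
Here is a quick numerical sketch of the composition property (the matrices below are random, just for illustration): multiplying by the product $MN$ gives the same result as applying $N$ and then $M$.

In [ ]:
import numpy

M = numpy.random.randn(2, 3)
N = numpy.random.randn(3, 4)
w = numpy.random.randn(4)

# (MN)w should equal M(Nw), since matrix multiply is composition of linear maps
print(numpy.allclose((M @ N) @ w, M @ (N @ w)))  # True, up to floating-point error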

One special matrix is the identity matrix $I$, which has the property that $Ix = x$ for any $x$.

In [6]:
B = numpy.array([[3.0,8,1],[-7,2,-1],[0,2,-2]])
I = numpy.eye(3) # identity matrix

print('u = {}'.format(u))
print('A = {}'.format(A))
print('B = {}'.format(B))
print('I = {}'.format(I))
print('A.shape = {}'.format(A.shape))
print('B.shape = {}'.format(B.shape))
print('Iu = {}'.format(I @ u)) # numpy uses @ to mean matrix multiply
print('Au = {}'.format(A @ u))
print('AB = {}'.format(A @ B))
print('BA = {}'.format(B @ A)) # should cause an error!
u = [1. 2. 3.]
A = [[ 2.49235899 -0.79169829 -1.15330389 ...  1.0745381  -0.4371076
   0.05276692]
 [-1.22523154 -0.63926429  1.39791807 ...  0.42955508  0.44796394
   0.03866751]
 [-1.62111808 -0.91904186  0.71170734 ...  2.82345617 -0.17558477
  -0.08574408]
 ...
 [ 2.45025228 -0.66925759  0.88846616 ...  0.54732669 -0.39968432
   1.22050626]
 [-0.61742792  0.05118947  0.65439684 ... -0.37513861  0.82803545
  -0.97951277]
 [ 0.20119055 -0.76758025  0.17296537 ... -1.52676426  0.30153432
  -0.69490939]]
B = [[ 3.  8.  1.]
 [-7.  2. -1.]
 [ 0.  2. -2.]]
I = [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
A.shape = (1024, 1024)
B.shape = (3, 3)
Iu = [1. 2. 3.]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-2b64b09504a2> in <module>
      9 print('B.shape = {}'.format(B.shape))
     10 print('Iu = {}'.format(I @ u)) # numpy uses @ to mean matrix multiply
---> 11 print('Au = {}'.format(A @ u))
     12 print('AB = {}'.format(A @ B))
     13 print('BA = {}'.format(B @ A)) # should cause an error!

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 1024)

Transposition

Transposition takes an $n \times m$ matrix and swaps the rows and columns to produce an $m \times n$ matrix. Formally,

$$(A^T)_{ij} = A_{ji}.$$

A matrix that is its own transpose (i.e. $A = A^T$) is called a symmetric matrix.

We can also transpose a vector. Transposing a vector $x \in \R^d$ gives a matrix in $\R^{1 \times d}$, also known as a row vector. This gives us a handy way of defining the dot product, which maps a pair of vectors to a scalar:

$$x^T y = y^T x = \langle x, y \rangle = \sum_{i=1}^d x_i y_i$$
  • This is very useful in machine learning to express similarities, make predictions, compute norms, etc.
  • It also gives us a handy way of grabbing the $i$th element of a vector, since $x_i = e_i^T x$ (and $A_{ij} = e_i^T A e_j$).

  • A very useful identity: in $\R^d$, $\sum_{i=1}^d e_i e_i^T = I$.
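
These identities are easy to check numerically; here is a minimal sketch (the matrix is just a small example).

In [ ]:
import numpy

d = 4
E = numpy.eye(d)

# sum_i e_i e_i^T should be the d x d identity
S = sum(numpy.outer(E[:, i], E[:, i]) for i in range(d))
print(numpy.allclose(S, numpy.eye(d)))  # True

# A_ij = e_i^T A e_j, here with i = 1, j = 2 (zero-indexed)
M = numpy.array([[1.0, 2, 3], [4, 5, 6]])
print(M[1, 2] == numpy.eye(2)[:, 1] @ M @ numpy.eye(3)[:, 2])  # True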

In [12]:
A = numpy.array([[1,2,3],[4,5,6]])
print('A = {}'.format(A))
print('A.T = {}'.format(A.T))
print('u = {}'.format(u))
print('u.T * u = {}'.format(u.T @ u))
A = [[1 2 3]
 [4 5 6]]
A.T = [[1 4]
 [2 5]
 [3 6]]
u = [1. 2. 3.]
u.T * u = 14.0

Elementwise Operations

Often, we want to express some mathematics that goes beyond the addition and scalar multiplication operations in a vector space. Sometimes, to do this we use elementwise operations which operate on a vector/matrix (or pair of vectors/matrices) on a per-element basis. E.g. if

$$x = \begin{bmatrix}1 \\ 4 \\ 9 \\ 16\end{bmatrix},$$

then if $\operatorname{sqrt}$ operates elementwise,

$$\operatorname{sqrt}(x) = \begin{bmatrix}1 \\ 2 \\ 3 \\ 4\end{bmatrix}.$$

We can also do this with matrices and with binary operations.

Elementwise Operations in Python

In [13]:
x = numpy.array([1.0,4,9])
y = numpy.array([2,5,3])
z = numpy.array([2,3,7,8])

print('x = {}'.format(x))
print('y = {}'.format(y))
print('z = {}'.format(z))
print('sqrt(x) = {}'.format(numpy.sqrt(x)))
print('x * y = {}'.format(x * y)) # simple numerical operations are elementwise by default in numpy
print('x / y = {}'.format(x / y))
print('x * z = {}'.format(x * z)) # should cause error
x = [1. 4. 9.]
y = [2 5 3]
z = [2 3 7 8]
sqrt(x) = [1. 2. 3.]
x * y = [ 2. 20. 27.]
x / y = [0.5 0.8 3. ]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-c4773cac2c05> in <module>
      9 print('x * y = {}'.format(x * y)) # simple numerical operations are elementwise by default in numpy
     10 print('x / y = {}'.format(x / y))
---> 11 print('x * z = {}'.format(x * z)) # should cause error

ValueError: operands could not be broadcast together with shapes (3,) (4,) 

The Power of Broadcasting

We just saw that we can't use elementwise operations on pairs of vectors/matrices if they are not the same size. Broadcasting allows us to be more expressive by automatically expanding a vector/matrix along an axis of dimension 1.

In [19]:
x = numpy.array([2.0,3])
A = numpy.array([[1.,2],[3,4]])

print('x = {}'.format(x))
print('A = {}'.format(A))
print('A + x = {}'.format(A + x)) # adds 2 to the first column of A and 3 to the second
print('A * x = {}'.format(A * x)) # DO NOT MIX THIS UP WITH MATRIX MULTIPLY!

numpy.dot(x,x)
numpy.matmul(x,x)
x = [2. 3.]
A = [[1. 2.]
 [3. 4.]]
A + x = [[3. 5.]
 [5. 7.]]
A * x = [[ 2.  6.]
 [ 6. 12.]]
Out[19]:
13.0

Tensors

We saw that a matrix is stored as a 2-dimensional array. A tensor generalizes this to an array with any number of dimensions.

From a mathematical perspective, a tensor is a multilinear map in the same way that a matrix is a linear map. That is, it's equivalent to a function

$$F(x_1, x_2, \ldots, x_n) \in \R$$

where $F$ is linear in each of the inputs $x_i \in \R^{d_i}$ taken individually (i.e. with all the other inputs fixed).

For example, the following is a multilinear map of three vectors in $\R^2$:

$$F\left(\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}, \begin{bmatrix} x_2 \\ y_2 \end{bmatrix}, \begin{bmatrix} x_3 \\ y_3 \end{bmatrix} \right) = x_1 y_2 x_3.$$

We'll come back to this later when we discuss tensors in ML frameworks.
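
As a small sketch (the tensor below is constructed to match the example map above): a 3-way array stores the coefficients of a multilinear map, and contracting it against the three input vectors with numpy.einsum evaluates the map.

In [ ]:
import numpy

# 2x2x2 tensor with a single nonzero coefficient, matching F = x_1 * y_2 * x_3 above
T = numpy.zeros((2, 2, 2))
T[0, 1, 0] = 1.0

a = numpy.array([5.0, 7.0])  # [x_1, y_1]
b = numpy.array([2.0, 3.0])  # [x_2, y_2]
c = numpy.array([4.0, 6.0])  # [x_3, y_3]

# F(a, b, c) = sum_{ijk} T_ijk a_i b_j c_k
print(numpy.einsum('ijk,i,j,k->', T, a, b, c))  # 5 * 3 * 4 = 60.0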

An Illustrative Example

Suppose that we have $n$ websites, and we have collected a matrix $A \in \R^{n \times n}$, where $A_{ij}$ counts the number of links from website $i$ to website $j$.

We want to produce a new matrix $B \in \R^{n \times n}$ such that $B_{ij}$ measures the fraction of links from website $i$ that go to website $j$.

How do we compute this?

$$B_{ij} = \frac{A_{ij}}{\sum_{k=1}^n A_{ik}} = \frac{A_{ij}}{(A \mathbf{1})_i},$$

where $\mathbf{1} \in \R^n$ denotes the vector of all ones.
In [62]:
# generate some random data to work with
n = 6
A = numpy.random.randint(0,6,(n,n))**2 + numpy.random.randint(0,5,(n,n))

# version 1: compute each entry with explicit for loops
B_for = numpy.zeros((n,n))
for i in range(n):
    for j in range(n):
        acc = 0
        for k in range(n):
            acc += A[i,k]          # sum of row i of A
        B_for[i,j] = A[i,j] / acc

# version 2: divide each row of A by its row sum, using broadcasting
print(B_for - (A / numpy.sum(A, axis=1, keepdims=True)))

# version 3: compute the row sums as a matrix-vector product with the all-ones vector
sumAik = A @ numpy.ones((n,1))
print(B_for - (A / sumAik))
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]

Gradients

Many, if not most, machine learning training algorithms use gradients to optimize a function.

What is a gradient?

Suppose I have a function $f$ from $\R^d$ to $\R$. The gradient, $\nabla f$, is a function from $\R^d$ to $\R^d$ such that

$$\left(\nabla f(w) \right)_i = \frac{\partial}{\partial w_i} f(w) = \lim_{\delta \rightarrow 0} \frac{f(w + \delta e_i) - f(w)}{\delta},$$

that is, it is the vector of partial derivatives of the function. Another, perhaps cleaner (and basis-independent), definition is that $\nabla f(w)^T$ is the linear map such that for any $u \in \R^d$

$$\nabla f(w)^T u = \lim_{\delta \rightarrow 0} \frac{f(w + \delta u) - f(w)}{\delta}.$$

More informally, it is the unique vector such that $f(w) \approx f(w_0) + (w - w_0)^T \nabla f(w_0)$ for $w$ near $w_0$.
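
Here is a minimal numerical sketch of this definition (the function $f$ below is made up for illustration, and is not one of the examples that follow): approximate the directional derivative with a small $\delta$ and compare it to $\nabla f(w)^T u$.

In [ ]:
import numpy

# a made-up smooth function f : R^d -> R with a known gradient
def f(w):
    return numpy.sum(numpy.sin(w))

def grad_f(w):
    return numpy.cos(w)  # elementwise derivative of sin

d = 5
w = numpy.random.randn(d)
u = numpy.random.randn(d)
delta = 1e-6

# directional derivative from the limit definition vs. the gradient formula
fd = (f(w + delta * u) - f(w)) / delta
print(fd, grad_f(w) @ u)  # the two numbers should closely agree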

Let's derive some gradients!

$f(x) = x^T A x$

...

$f(x) = \| x \|_2^2 = \sum_{i=1}^d x_i^2$

...

$f(x) = \| x \|_1 = \sum_{i=1}^d | x_i |$

...

$f(x) = \| x \|_{\infty} = \max(|x_1|, |x_2|, \ldots, |x_d|)$

...

Takeaway: numpy gives us powerful capabilities to express numerical linear algebra...

...and you should become skilled in mapping from mathematical expressions to numpy and back.
