In [1]:

```
import numpy
import scipy
import matplotlib
import time

def hide_code_in_slideshow():
    from IPython import display
    import binascii
    import os
    uid = binascii.hexlify(os.urandom(8)).decode()
    html = """<div id="%s"></div>
<script type="text/javascript">
$(function(){
var p = $("#%s");
if (p.length==0) return;
while (!p.hasClass("cell")) {
p=p.parent();
if (p.prop("tagName") =="body") return;
}
var cell = p;
cell.find(".input").addClass("hide-in-slideshow")
});
</script>""" % (uid, uid)
    display.display_html(html, raw=True)
```

Recall our first principle from last lecture...

**Principle #1: Write your learning task as an optimization problem, and solve it via fast algorithms
that update the model iteratively with easy-to-compute steps using numerical linear algebra.**


A simple example: we can represent the properties of an object using a **feature vector** (or embedding) in $\R^d$. Say we wanted to predict something about a group of people including this guy (Svante Myrick)
using the fact that he is 33, graduated in 2009, started his current job in 2012, and makes $58,561 a year.

One way to represent this is as a vector in $4$ dimensional space.

$$x = \begin{bmatrix}33 \\ 2009 \\ 2012 \\ 58561\end{bmatrix}.$$

Representing the information as a vector makes it easier for us to express ML models with it. We can then represent other objects we want to make predictions about with their own vectors, e.g.

$$x = \begin{bmatrix}78 \\ 1965 \\ 2021 \\ 400000\end{bmatrix}.$$

Before we start in on how to compute with vectors, matrices, et cetera, we should make sure we're all on the same page about what these objects are.
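As a quick illustration (a minimal sketch, not part of the original notes), these feature vectors map directly to numpy arrays:

```
import numpy

# feature vectors from above: [age, graduation year, year current job started, salary]
x_myrick = numpy.array([33.0, 2009.0, 2012.0, 58561.0])
x_other  = numpy.array([78.0, 1965.0, 2021.0, 400000.0])

# elementwise addition and scaling work as expected
print(x_myrick + x_other)
print(0.5 * x_myrick)
```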

A vector (represented on a computer) is an array of numbers (usually floating point numbers). We say that the **dimension** (or length) of the vector is the size of the array, i.e. the number of numbers it contains.

A vector (in mathematics) is an element of a **vector space**. Recall: a vector space over the real numbers is a set $V$ together with two binary operations $+$ (mapping $V \times V$ to $V$) and $\cdot$ (mapping $\R \times V$ to $V$) satisfying the following axioms for any $x, y, z \in V$ and $a, b \in \R$

- $x + y \in V$ and $a \cdot x = ax \in V$ *(closure)*
- $(x + y) + z = x + (y + z)$ *(associativity of addition)*
- $x + y = y + x$ *(commutativity of addition)*
- there exists a $(-x)$ such that $x + (-x) = 0$ *(negation)*
- there exists $0 \in V$ such that $0 + x = x + 0 = x$ *(zero element)*
- $a(bx) = b(ax) = (ab)x$ *(associativity of scalar multiplication)*
- $1x = x$ *(multiplication by one)*
- $a(x+y) = ax + ay$ and $(a+b)x = ax + bx$ *(distributivity)*

We can treat our CS-style array of numbers as modeling a mathematical vector by letting $+$ add the two vectors elementwise and $\cdot$ multiply each element of the vector by the same scalar.
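As a quick sanity check (a sketch, not part of the original notes), we can spot-check a few of these axioms numerically for numpy arrays:

```
import numpy

x = numpy.array([1.0, 2.0, 3.0])
y = numpy.array([4.0, 5.0, 6.0])
a, b = 2.0, -3.0

# commutativity of addition: x + y == y + x
print(numpy.allclose(x + y, y + x))
# distributivity: a * (x + y) == a * x + a * y
print(numpy.allclose(a * (x + y), a * x + a * y))
# associativity of scalar multiplication: a * (b * x) == (a * b) * x
print(numpy.allclose(a * (b * x), (a * b) * x))
```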

Again from the maths perspective, we say that a set of vectors $x_1, x_2, \ldots, x_d$ is **linearly independent** when no vector can be written as a linear combination of the others. That is, the only solution of $\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_d x_d = 0$ is $\alpha_1 = \alpha_2 = \cdots = \alpha_d = 0$.

We say the **span** of some vectors $x_1, x_2, \ldots, x_d$ is the set of vectors that can be written as a linear combination of those vectors

Finally, a set of vectors is a **basis** for the vector space $V$ if it is linearly independent and if its span is the whole space $V$.

- Equivalently, a set of vectors is a basis if any vector $v \in V$ can be written uniquely as a linear combination of vectors in the basis.

We say the **dimension** of the space is $d$ if it has a basis of size $d$.

If any vector $v$ in the space can be written uniquely as

$$v = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_d x_d$$

for some real numbers $\alpha_1, \alpha_2, \ldots$, then to represent $v$ on a computer, it suffices to store $\alpha_1$, $\alpha_2$, $\ldots$, and $\alpha_d$. We may as well store them in an array...and this gets us back to our CS-style notion of what a vector is.

- Importantly, this only works for finite-dimensional vector spaces!
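For instance (a minimal sketch, not from the original notes; the basis below is made up for illustration), we can recover those coefficients with numpy by solving a linear system:

```
import numpy

# columns of X are a made-up basis x_1, x_2, x_3 for R^3
X = numpy.array([[1.0, 1.0, 0.0],
                 [0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0]])
v = numpy.array([2.0, 3.0, 4.0])

# solve for the coordinates alpha such that alpha_1 x_1 + alpha_2 x_2 + alpha_3 x_3 = v
alpha = numpy.linalg.solve(X, v)
print(alpha)

# reconstruct v as a linear combination of the basis vectors
print(alpha[0] * X[:, 0] + alpha[1] * X[:, 1] + alpha[2] * X[:, 2])
```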

Typically, when we work with a $d$-dimensional vector space, we call it $\R^d$, and we use the **standard basis**, which I denote $e_1, \ldots, e_d$. E.g. in 3 dimensions this is defined as

$$e_1 = \begin{bmatrix}1 \\ 0 \\ 0\end{bmatrix}, \quad e_2 = \begin{bmatrix}0 \\ 1 \\ 0\end{bmatrix}, \quad e_3 = \begin{bmatrix}0 \\ 0 \\ 1\end{bmatrix},$$

and more generally $e_i$ has a $1$ in the $i$th entry of the vector and $0$ otherwise. In this case, if $x_i$ denotes the $i$th entry of a vector $x \in \R^d$, then

$$x = x_1 e_1 + x_2 e_2 + \cdots + x_d e_d = \sum_{i=1}^d x_i e_i.$$

In Python, we use the library **numpy** to compute using vectors.

In [2]:

```
import numpy
u = numpy.array([1.0,2.0,3.0])
v = numpy.array([4.0,5.0,6.0])
print('u = {}'.format(u))
print('v = {}'.format(v))
print('u + v = {}'.format(u + v))
print('2 * u = {}'.format(2 * u))
```

We can see that both of the standard vector operations (addition and scalar multiplication) are supported easily!

Question: What have you seen represented as a vector in your previous experience with machine learning?

Answers:

- words/text
- images
- gene sequences
- speech & video
- feature vectors
- graphs
- houses
- stock prices!

We say a function $F$ from a vector space $U$ to a vector space $V$ is a **linear map** if for any $x, y \in U$ and any $a \in \R$,

$$F(x + y) = F(x) + F(y) \quad \text{and} \quad F(a x) = a F(x).$$

- Notice that if we know $F(e_i)$ for all the basis elements $e_i$ of $U$, then this uniquely determines $F$ (why?).
- So, if we want to represent $F$ on a computer and $U$ and $V$ are finite-dimensional vector spaces of dimensions $m$ and $n$ respectively, it suffices to store $F(e_1), F(e_2), \ldots, F(e_m)$.
- Each $F(e_i)$ is itself an element of $V$, which we can represent on a computer as an array of $n$ numbers (since $V$ is $n$-dimensional).
- So, we can represent $F$ as an array of $m$ arrays of $n$ numbers...or equivalently as a **two-dimensional array**.
- Sadly, this overloads the meaning of the term "dimension"...but usually the meaning is clear from context.

We call this two-dimensional-array representation of a linear map a **matrix**. Here is the general form of a matrix in $\R^{3 \times 3}$, with entries indexed by row and then column:

$$A = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}.$$
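As a concrete sketch (not from the original notes; the linear map $F$ below is made up for illustration), we can build the matrix for a linear map on $\R^3$ by applying it to the standard basis vectors and stacking the results as columns:

```
import numpy

# a made-up linear map F : R^3 -> R^3
def F(x):
    return numpy.array([x[1] + x[2], x[0] + x[2], x[0] + x[1]])

# the columns of A are F(e_1), F(e_2), F(e_3)
I3 = numpy.eye(3)
A = numpy.column_stack([F(I3[:, j]) for j in range(3)])
print(A)

# applying F agrees with multiplying by A
x = numpy.array([1.0, 2.0, 3.0])
print(F(x))
print(A @ x)
```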

We use multiplication to denote the effect of a matrix operating on a vector (this is equivalent to applying the linear map as a function). E.g. if $F$ is the linear map corresponding to matrix $A$ (really they are the same object, but I'm using different letters here to keep the notation clear), then

$$y = F(x) \;\equiv\; y = Ax.$$

We can add two matrices, and scale a matrix by a scalar.

- Note that this means that the set of matrices $\R^{n \times m}$ **is itself a vector space**.

If $A \in \R^{n \times m}$ is the matrix that corresponds to the linear map $F$, and $A_{ij}$ denotes the $(i,j)$th entry of the matrix, then by our construction

$$F(e_j) = \sum_{i=1}^n A_{ij} e_i,$$

and so for any $x \in \R^m$

$$F(x) = F\left( \sum_{j=1}^m x_j e_j \right) = \sum_{j=1}^m x_j F( e_j ) = \sum_{j=1}^m x_j \sum_{i=1}^n A_{ij} e_i = \sum_{i=1}^n \left( \sum_{j=1}^m A_{ij} x_j \right) e_i.$$

So, this means that the $i$th entry of $F(x)$ will be

$$(F(x))_i = \sum_{j=1}^m A_{ij} x_j.$$

A direct implementation of our matrix multiply formula: $$(F(x))_i = \sum_{j=1}^m A_{ij} x_j.$$

In [3]:

```
x = numpy.array([1.0,2.0,3.0])
A = numpy.array([[1.0,2,3],[4,5,6]])

def matrix_multiply(A, x):
    (n,m) = A.shape
    assert(m == x.size)
    y = numpy.zeros(n)
    for i in range(n):
        for j in range(m):
            y[i] += A[i,j] * x[j]
    return y

print('x = {}'.format(x))
print('A = {}'.format(A))
print('Ax = {}'.format(matrix_multiply(A,x)))
```

In [4]:

```
# numpy has its own built-in support for matrix multiply
print('Ax = {}'.format(A @ x)) # numpy uses @ to mean matrix multiply
```

Comparing numpy's matrix multiply with my naive for-loop matrix multiply, numpy's is much faster.

In [9]:

```
# generate some random data
x = numpy.random.randn(1024)
A = numpy.random.randn(1024,1024)

t = time.time()
for trial in range(5):
    B = matrix_multiply(A,x)
my_time = time.time() - t
print('my matrix multiply: {} seconds'.format(my_time))

t = time.time()
for trial in range(5):
    B = A @ x
np_time = time.time() - t
print('numpy matmul: {} seconds'.format(np_time))

print('numpy was {:.0f}x faster'.format(my_time/np_time))
```

Question: What have you seen represented as a matrix in your previous experience with machine learning?

Answers:

- the weights in a layer of a neural network
- parameters
- tables
- a whole dataset
- image
- geometric transformation (physics, computer graphics)
- PCA
- covariance matrices

We can also multiply two matrices, which corresponds to function composition of linear maps.

- Of course, this only makes sense if the dimensions match!
- For example, if $A \in \R^{n \times m}$ and $B \in \R^{q \times p}$, then it only makes sense to write $AB$ if $m = q$.
- In this context, we often want to think of a vector $x \in \R^d$ as a $d \times 1$ matrix.
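A quick numpy sketch (not from the original notes; the matrices below are random placeholders) of matrix multiplication as composition of linear maps:

```
import numpy

# A represents a map from R^4 to R^2, B a map from R^3 to R^4
A = numpy.random.randn(2, 4)
B = numpy.random.randn(4, 3)
x = numpy.random.randn(3)

# A @ B is the matrix of the composed map x -> A(B(x))
print(numpy.allclose((A @ B) @ x, A @ (B @ x)))
print((A @ B).shape)   # (2, 3): maps R^3 to R^2
```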

One special matrix is the **identity matrix** $I$, which has the property that $Ix = x$ for any $x$.

In [6]:

```
B = numpy.array([[3.0,8,1],[-7,2,-1],[0,2,-2]])
I = numpy.eye(3) # identity matrix
print('u = {}'.format(u))
print('A = {}'.format(A))
print('B = {}'.format(B))
print('I = {}'.format(I))
print('A.shape = {}'.format(A.shape))
print('B.shape = {}'.format(B.shape))
print('Iu = {}'.format(I @ u)) # numpy uses @ to mean matrix multiply
print('Au = {}'.format(A @ u))
print('AB = {}'.format(A @ B))
print('BA = {}'.format(B @ A)) # should cause an error!
```

Transposition takes an $n \times m$ matrix $A$ and **swaps the rows and columns** to produce an $m \times n$ matrix $A^T$. Formally,

$$(A^T)_{ij} = A_{ji}.$$

A matrix that is its own transpose (i.e. $A = A^T$) is called a **symmetric matrix**.

We can also transpose a vector. Transposing a vector $x \in \R^d$ gives a matrix in $\R^{1 \times d}$, also known as a **row vector**. This gives us a handy way of defining the **dot product**, which maps a pair of vectors to a scalar:

$$x^T y = \sum_{i=1}^d x_i y_i.$$

- This is very useful in machine learning to express similarities, make predictions, compute norms, etc.
- It also gives us a handy way of grabbing the $i$th element of a vector, since $x_i = e_i^T x$ (and $A_{ij} = e_i^T A e_j$).

A very useful identity: in $\R^d$, $\sum_{i=1}^d e_i e_i^T = I$.

In [12]:

```
A = numpy.array([[1,2,3],[4,5,6]])
print('A = {}'.format(A))
print('A.T = {}'.format(A.T))
print('u = {}'.format(u))
print('u.T * u = {}'.format(u.T @ u))
```
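A small sketch (not part of the original notes) checking the identity $\sum_{i=1}^d e_i e_i^T = I$ numerically:

```
import numpy

d = 4
E = numpy.eye(d)
# sum the outer products e_i e_i^T over the standard basis vectors
S = sum(numpy.outer(E[:, i], E[:, i]) for i in range(d))
print(numpy.allclose(S, numpy.eye(d)))
```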

Often, we want to express some mathematics that goes beyond the addition and scalar multiplication operations in a vector space. Sometimes, to do this we use **elementwise operations** which operate on a vector/matrix (or pair of vectors/matrices) on a per-element basis. E.g. if

$$x = \begin{bmatrix}1 \\ 4 \\ 9 \\ 16\end{bmatrix},$$

then if $\operatorname{sqrt}$ operates elementwise,

$$\operatorname{sqrt}(x) = \begin{bmatrix}1 \\ 2 \\ 3 \\ 4\end{bmatrix}.$$

We can also do this with matrices and with binary operations.

In [13]:

```
x = numpy.array([1.0,4,9])
y = numpy.array([2,5,3])
z = numpy.array([2,3,7,8])
print('x = {}'.format(x))
print('y = {}'.format(y))
print('z = {}'.format(z))
print('sqrt(x) = {}'.format(numpy.sqrt(x)))
print('x * y = {}'.format(x * y)) # simple numerical operations are elementwise by default in numpy
print('x / y = {}'.format(x / y))
print('x * z = {}'.format(x * z)) # should cause error
```

We just saw that we can't use elementwise operations on pairs of vectors/matrices if they are not the same size. **Broadcasting** allows us to be more expressive by automatically expanding a vector/matrix along an axis of dimension 1.

In [19]:

```
x = numpy.array([2.0,3])
A = numpy.array([[1.,2],[3,4]])
print('x = {}'.format(x))
print('A = {}'.format(A))
print('A + x = {}'.format(A + x)) # adds 2 to the first column of A and 3 to the second
print('A * x = {}'.format(A * x)) # DO NOT MIX THIS UP WITH MATRIX MULTIPLY!
numpy.dot(x,x)
numpy.matmul(x,x)
```

Out[19]:

We say that a matrix is stored as a 2-dimensional array. A tensor generalizes this to an array with whatever number of dimensions you want.

From a mathematical perspective, a tensor is a **multilinear map** in the same way that a matrix is a linear map. That is, it's equivalent to a function

$$F: \R^{d_1} \times \R^{d_2} \times \cdots \times \R^{d_n} \rightarrow \R,$$

where $F$ is linear in each of the inputs $x_i \in \R^{d_i}$ taken individually (i.e. with all the other inputs fixed).

For example, here is a multilinear map on $\R^2 \times \R^2 \times \R^2$ (i.e. a $2 \times 2 \times 2$ tensor):

$$F\left(\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}, \begin{bmatrix} x_2 \\ y_2 \end{bmatrix}, \begin{bmatrix} x_3 \\ y_3 \end{bmatrix} \right) = x_1 y_2 x_3.$$

*We'll come back to this later when we discuss tensors in ML frameworks.*
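A hedged numpy sketch (not from the original notes) of this particular multilinear map as a $2 \times 2 \times 2$ array, contracted against its three inputs with einsum:

```
import numpy

# the 2x2x2 tensor for F(u, v, w) = u[0] * v[1] * w[0]
T = numpy.zeros((2, 2, 2))
T[0, 1, 0] = 1.0

u = numpy.array([2.0, 5.0])
v = numpy.array([3.0, 4.0])
w = numpy.array([7.0, 1.0])

# multilinear map: sum over i, j, k of T[i,j,k] * u[i] * v[j] * w[k]
print(numpy.einsum('ijk,i,j,k->', T, u, v, w))
print(u[0] * v[1] * w[0])   # same value
```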

Suppose that we have $n$ websites, and we have collected a matrix $A \in \R^{n \times n}$, where $A_{ij}$ counts the number of links from website $i$ to website $j$.

We want to produce a new matrix $B \in \R^{n \times n}$ such that $B_{ij}$ measures the *fraction* of links from website $i$ that go to website $j$.

*How do we compute this?*

In [62]:

```
# generate some random data to work with
n = 6
A = numpy.random.randint(0,6,(n,n))**2 + numpy.random.randint(0,5,(n,n))

B_for = numpy.zeros((n,n))
for i in range(n):
    for j in range(n):
        acc = 0
        for k in range(n):
            acc += A[i,k]
        B_for[i,j] = A[i,j] / acc
B_for

print(B_for - (A / numpy.sum(A, axis=1, keepdims=True)))

sumAik = A @ numpy.ones((n,1))
print(B_for - (A / sumAik))
```

Many, if not most, machine learning training algorithms use gradients to optimize a function.

*What is a gradient?*

Suppose I have a function $f$ from $\R^d$ to $\R$. The gradient, $\nabla f$, is a function from $\R^d$ to $\R^d$ such that

$$\left(\nabla f(w) \right)_i = \frac{\partial}{\partial w_i} f(w) = \lim_{\delta \rightarrow 0} \frac{f(w + \delta e_i) - f(w)}{\delta},$$

that is, it is the **vector of partial derivatives of the function**.
Another, perhaps cleaner (and basis-independent), definition is that $\nabla f(w)^T$ is the linear map such that for any $u \in \R^d$

$$\nabla f(w)^T u = \lim_{\delta \rightarrow 0} \frac{f(w + \delta u) - f(w)}{\delta}.$$

More informally, it is the unique vector such that $f(w) \approx f(w_0) + (w - w_0)^T \nabla f(w_0)$ for $w$ nearby $w_0$.

$f(x) = x^T A x$

...

$f(x) = \| x \|_2^2 = \sum_{i=1}^d x_i^2$

...

$f(x) = \| x \|_1 = \sum_{i=1}^d | x_i |$

...

$f(x) = \| x \|_{\infty} = \max(|x_1|, |x_2|, \ldots, |x_d|)$

...

**...and you should become skilled in mapping from mathematical expressions to numpy and back.**
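As one small sketch of mapping math to numpy (not part of the original notes), here's a finite-difference approximation of the gradient, following the partial-derivative definition above, checked against a function whose gradient we know in closed form:

```
import numpy

def numerical_gradient(f, w, delta=1e-6):
    # approximate (grad f(w))_i = (f(w + delta * e_i) - f(w)) / delta for each i
    g = numpy.zeros_like(w)
    for i in range(w.size):
        e_i = numpy.zeros_like(w)
        e_i[i] = 1.0
        g[i] = (f(w + delta * e_i) - f(w)) / delta
    return g

# example function: f(w) = sum(sin(w)), whose gradient is cos(w)
f = lambda w: numpy.sum(numpy.sin(w))
w = numpy.array([0.1, 0.5, -0.3])
print(numerical_gradient(f, w))
print(numpy.cos(w))   # should match up to finite-difference error
```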
