Homework 1
Due: Monday 09/28/2015 2:15pm in class
Up to 5 members in each group. Please include NetIDs and names on the front page.
Problem 1: L2 distances through matrix operations
This is the written part of project 0.
Many machine learning algorithms access their input data primarily through pairwise distances. It is therefore important that we have a fast function that computes pairwise (Euclidean) distances of input vectors. Assume we have $n$ data vectors $\vec{x}_1,\dots,\vec{x}_n\in\mathbb{R}^d$ and $m$ vectors $\vec{z}_1,\dots,\vec{z}_m\in\mathbb{R}^d$. Let us define two matrices $X=[\vec{x}_1,\dots,\vec{x}_n]\in\mathbb{R}^{d\times n}$, where the $i$th column is the vector $\vec{x}_i$, and similarly $Z=[\vec{z}_1,\dots,\vec{z}_m]\in\mathbb{R}^{d\times m}$. Our distance function takes as input the two matrices $X$ and $Z$ and outputs a matrix $D\in\mathbb{R}^{n\times m}$, where
$$D_{ij}=\sqrt{(\vec{x}_i-\vec{z}_j)^\top(\vec{x}_i-\vec{z}_j)}.$$
The key to efficient programming in Octave, and in machine learning in general, is to think in terms of mathematics and not in terms of loops.
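For reference, here is a minimal Octave sketch of the naive, loop-based computation of $D$ that a vectorized solution should improve upon (the function name is just for illustration):

```octave
function D = l2distance_slow(X, Z)
  % X is d x n, Z is d x m; returns D (n x m) with D(i,j) = ||x_i - z_j||.
  n = size(X, 2);
  m = size(Z, 2);
  D = zeros(n, m);
  for i = 1:n
    for j = 1:m
      delta = X(:, i) - Z(:, j);       % x_i - z_j
      D(i, j) = sqrt(delta' * delta);  % Euclidean distance
    end
  end
end
```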
(a) Show that the Gram matrix (aka inner-product matrix)
$$G_{ij}=\vec{x}_i^\top\vec{z}_j$$
can be expressed in terms of a pure matrix multiplication.
(b) Let us define two new matrices $S,R\in\mathbb{R}^{n\times m}$,
$$S_{ij}=\vec{x}_i^\top\vec{x}_i, \qquad R_{ij}=\vec{z}_j^\top\vec{z}_j.$$
Show that the squared-Euclidean distance matrix $D^2\in\mathbb{R}^{n\times m}$, defined as
$$D^2_{ij}=(\vec{x}_i-\vec{z}_j)^\top(\vec{x}_i-\vec{z}_j),$$
can be expressed as a linear combination of the matrices $S$, $G$, $R$. (Hint: It might help to first express $D^2_{ij}$ in terms of inner products.) What do you need to do to obtain the true Euclidean distance matrix $D$?
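The hint can be sanity-checked numerically; the short Octave sketch below expands a single squared distance into inner products for random vectors:

```octave
% Check that (x - z)'(x - z) equals x'x - 2*x'z + z'z for random vectors.
d = 5;
x = randn(d, 1);
z = randn(d, 1);
lhs = (x - z)' * (x - z);
rhs = x' * x - 2 * x' * z + z' * z;
printf("difference: %g\n", abs(lhs - rhs));  % ~0 up to floating-point error
```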
Problem 2: k-Nearest-Neighbor
In this problem, you are going to look at a small dataset to understand various properties of k-NN better. Suppose there is a set of points on a two-dimensional plane from two different classes. Below are the coordinates of all points.
Points in class Red: (0, 1), (2, 3), (4, 4)
Points in class Blue: (2, 0), (5, 2), (6, 3)
(a) Draw the k-nearest-neighbor decision boundary for $k = 1$. Remember that the decision boundary is defined as the line where the classification of a test point changes. Use the standard Euclidean distance between points to determine the nearest neighbors. Start by plotting the points on a two-dimensional graph. Please use the corresponding colors for the points of each class (e.g., blue and red).
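A minimal Octave sketch for the plotting step, using the coordinates listed above (the decision boundary itself still has to be drawn by hand):

```octave
% Plot the training points; red = class Red, blue = class Blue.
red  = [0 1; 2 3; 4 4];   % each row is a point (x, y)
blue = [2 0; 5 2; 6 3];
figure; hold on;
plot(red(:, 1),  red(:, 2),  'ro', 'MarkerFaceColor', 'r');
plot(blue(:, 1), blue(:, 2), 'bo', 'MarkerFaceColor', 'b');
axis equal; grid on;
legend('Red', 'Blue');
hold off;
```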
(b) If the y-coordinate of each point were multiplied by 5, what would happen to the $k = 1$ boundary (draw another picture)? Explain in at most two sentences how this effect might cause problems when working with real data.
(c) The k-NN decision boundary for $k = 3$ is shown below. Suppose now we have a test point at (1, 2). How would it be classified under 3-NN? Given that you can modify the 3-NN decision boundary by adding points to the training set in the diagram, what is the minimum number of points that you need to add to change the classification at (1, 2)? Show also where you need to add these points.
(d) What is the testing complexity for one instance, i.e., how long does it take for k-NN to classify one point? Assume your data has dimensionality $d$, you have $n$ training examples, and you use the Euclidean distance. Assume also that you use a quickselect implementation which gives you the $k$ smallest elements of a list of length $m$ in $O(m)$.
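As a concrete illustration of the quantities involved, here is a small Octave sketch that classifies a single test point; it uses a full sort in place of quickselect, so its selection step costs $O(n\log n)$ rather than the $O(n)$ assumed in the problem. The names `Xtr`, `ytr`, and `xtest` are placeholders.

```octave
% Classify one test point with k-NN.
% Xtr is d x n, ytr is 1 x n (numeric class labels), xtest is d x 1.
function yhat = knn_classify(xtest, Xtr, ytr, k)
  delta = Xtr - repmat(xtest, 1, size(Xtr, 2));  % d x n differences
  dists = sqrt(sum(delta .^ 2, 1));              % n distances: O(nd)
  [~, idx] = sort(dists);                        % full sort stands in for quickselect
  yhat = mode(ytr(idx(1:k)));                    % majority vote among the k nearest
end
```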
Problem 3: Naive Bayes
In this problem, we explore why the naive assumption is necessary.
After learning about using Bayes rule to make predictions (without the naive assumption), you want to apply the method to a complex problem. Keep in mind that you want to use the rule:
$$P(Y\mid X)=\frac{P(X\mid Y)\,P(Y)}{P(X)}$$
and you want to estimate the parameters of $P(X\mid Y)$ and $P(Y)$. However, before applying the method to your problem, you want to apply it to a toy problem first.
a)
In the toy problem, $X=[X_1,X_2]$ (so $d=2$), where each $X_i$ is binary. $Y$ is also binary. You want to estimate $P(X\mid Y)$ without the naive Bayes assumption; that is, you cannot write $P(X\mid Y)=\prod_{i=1}^{d}P(X_i=x_i\mid Y=y)$. Instead you must estimate $P(X\mid Y)=P(X_1=x_1,X_2=x_2,\dots,X_d=x_d\mid Y=y)$ for all combinations of the values of $x_1,x_2,\dots,x_d$ and $y$. How many parameters do you have to estimate for your toy problem?
(Here a parameter refers to an estimate of $P(X_1=x_1,X_2=x_2,\dots,X_d=x_d\mid Y=y)$ or $P(Y=y)$ for some combination of the $x_i$'s and $y$. In our case where $d=2$, examples of such parameters are $P(X_1=1,X_2=0\mid Y=1)$ and $P(Y=1)$.)
b)
After running the decision rule on your toy problem, you decide to apply it to the actual problem. However, in your problem, $d=100$. How many parameters do you have to estimate now?
c)
When is it necessary for us to make the naive assumption? Explain by showing how the assumption will affect one of the answers from above.
Problem 4: Naive Bayes
a)
We will attempt to write the multinomial naive Bayes classifier's decision rule as a linear rule. Suppose that we have a document that is represented as a sequence $d=(w_1,w_2,\dots,w_l)$ of length $l$. This can be classified as:
$$h(d)=\operatorname*{argmax}_{y\in\{+1,-1\}}\ \Pr(Y=y)\prod_{i=1}^{l}\Pr(W=w_i\mid Y=y)$$
Assuming that we have estimates for all the appropriate probabilities and none of them are zero, rewrite $h(d)$ as $h(d)=\operatorname{sign}(\vec{v}\cdot\vec{x}+b)$. We assume that we have a vocabulary $\{a_1,a_2,\dots,a_m\}$ that contains $m$ words. Each word $w_i$ in the document is one of the $a_j$, and $\vec{x}$ is a vector where each component $x_j$ is the count of the number of times the word $a_j$ appears in the document. Provide $\vec{v}$ and $b$.
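To make the representation concrete, here is a small Octave sketch that builds the count vector $\vec{x}$ from a document; the vocabulary and document below are made-up examples.

```octave
% Build the bag-of-words count vector x for a document.
vocab = {'the', 'cat', 'sat'};          % hypothetical vocabulary {a_1, ..., a_m}
doc   = {'the', 'cat', 'sat', 'the'};   % hypothetical document (w_1, ..., w_l)
m = numel(vocab);
x = zeros(m, 1);
for i = 1:numel(doc)
  j = find(strcmp(vocab, doc{i}));      % which vocabulary word is w_i?
  x(j) = x(j) + 1;                      % count occurrences of a_j
end
disp(x');                               % prints 2 1 1
```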
b)
Previous problems considered only those cases where $X$ consists of discrete values. Now we will also consider the case where $X$ can take continuous values.
But first, (i) write down the maximum likelihood estimates (MLE) for the parameters of the naive Bayes classifier, $\Pr(X_i\mid Y)$ and $\Pr(Y)$, where $X$ takes discrete, categorical values.
Now let's look at a naive Bayes classifier that takes vectors of continuous values as input. In this case, we need a different formulation for $\Pr(X_i\mid Y)$ (we don't need to worry about $\Pr(Y)$ because $Y$ still takes discrete values). One way is to assume that, for each discrete value $y$, each $X_i$ comes from a Gaussian.
For example, consider the simplified case above, where $X_i$ takes a continuous value (a value along the x-axis) and $Y$ can be either blue or green. As shown above, for each discrete value of $Y$ (blue or green), $X_i$ is a random variable drawn from a Gaussian specific to $X_i$ (not some other $X_j$) and to the value of $Y$.
As a result, we see two Gaussians, one generating the blue data points and one generating the green data points (see the sketch below).
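To make the generative assumption concrete, here is a minimal Octave sketch that samples from such a model for a single feature $X_i$; the means and standard deviations are hypothetical.

```octave
% Sample from the model: for each label, X_i is drawn from that label's Gaussian.
mu_blue  = 0;  sigma_blue  = 1;   % hypothetical parameters for Y = blue
mu_green = 3;  sigma_green = 1;   % hypothetical parameters for Y = green
n = 100;
is_blue = rand(n, 1) < 0.5;       % Bernoulli label for each point
x = zeros(n, 1);
x(is_blue)  = mu_blue  + sigma_blue  * randn(sum(is_blue), 1);   % blue points
x(~is_blue) = mu_green + sigma_green * randn(sum(~is_blue), 1);  % green points
```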
With this, we get a Gaussian naive Bayes classifier. Using this assumption, (ii) write down the MLE for the parameters of $\Pr(X_i\mid Y)$. (Note: since $\Pr(X_i\mid Y=y)$ is Gaussian, its parameters are its mean $\mu_{y,i}$ and standard deviation $\sigma_{y,i}$.) (iii) What is the total number of parameters?
c)
Using what we found in part (b), we will reformulate the classifier once more. Remember how a Gaussian naive Bayes classifier is defined:
- $X_i$ is continuous
- $\Pr(X_i\mid Y=y)$ is a Gaussian with mean $\mu_{y,i}$ and standard deviation $\sigma_i$ (we will assume that each $\sigma$ depends only on the feature and not on the label $y$)
- given the label, the features are conditionally independent, just like in the discrete version
- $Y$ is a Bernoulli random variable
Now, find the weight vector $\vec{w}$ such that
$$\Pr(Y=+1\mid X)=\frac{\Pr(X\mid Y=+1)\,\Pr(Y=+1)}{\Pr(X)}=\frac{1}{1+\exp\!\left(-\left(w_0+\sum_{i=1}^{n}w_i X_i\right)\right)}$$
Be sure to use your answers for part (b).
Note: The result that you got is precisely the formulation for logistic regression, which you can read about here. However, Gaussian naive Bayes and logistic regression are different learning algorithms. They output (asymptotically) the same model only under special conditions. We will explore the relation further when we cover logistic regression.
Problem 5: Perceptron
In this problem we will show that one of the oldest machine learning algorithms, the perceptron, indeed learns. The perceptron, invented by former Cornell professor Frank Rosenblatt, seeks a hyperplane that separates the data by sequentially updating an estimate when it makes an error. In other words, it tries to find a parameter vector $w$ such that for every feature-label pair $(x_i,y_i)$, where $y_i\in\{-1,1\}$, we have $w^\top x_i>0$ if $y_i=1$ and $w^\top x_i<0$ if $y_i=-1$.
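For reference, here is a minimal Octave sketch of one pass of the standard perceptron update; this is only an illustration, and the formal proof below should follow the pseudocode given with the assignment.

```octave
% One pass of perceptron updates (X is d x n, y is 1 x n with entries in {-1, +1}).
function w = perceptron_pass(X, y, w)
  for i = 1:size(X, 2)
    if y(i) * (w' * X(:, i)) <= 0   % mistake (or point exactly on the hyperplane)
      w = w + y(i) * X(:, i);       % the perceptron update
    end
  end
end
```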
5.1: Learning a single input
First, show that if we feed only one point to the perceptron algorithm over and over again (hence the dataset is only $(x_1,y_1)$), it will get it right after the first update. (You may assume $x_1\neq 0$.)
5.2: Learning a linearly separable dataset
We will now prove a theorem which bounds the number of errors that the perceptron algorithm makes assuming the data is linearly separable. We will make the following assumptions:
- (The data is linearly separable and there exists a margin $\gamma$ between the positive examples and the negative examples.) Assume there exists a unit vector $w^*$ with $\|w^*\|=1$, and some $\gamma>0$ such that for all $t=1,\dots,n$,
$$y_t\,(w^{*\top} x_t)\geq\gamma.$$
- (The data lies in a ball of radius $R$.) Assume that for all $t=1,\dots,n$, $\|x_t\|\leq R$.
Note that we assume there exists a parameter vector $w^*$ which has a "margin" of $\gamma$, but we do not claim that the perceptron algorithm will find $w^*$; we simply claim that the algorithm will find a parameter vector which separates the data.
We define $w_k$ to be the parameter vector (denoted $w$ in the algorithm) when the $k$-th update happens (which is not the same as the iteration variable $j$ or $t$ in the algorithm). The proof relies on the observation that the magnitude of $w_k$ (denoted $\|w_k\|$) can be bounded from above and from below as a function of $k$, but the lower bound increases faster than the upper bound! Hence, after finitely many updates, the algorithm must stop.
a)
In this first part of the proof, we show that $\|w_{k+1}\|$ is bounded from below as a function of $k$. Show that $w_{k+1}^\top w^* \geq w_k^\top w^* + \gamma$.
b)
Use part (a) to show that $\|w_{k+1}\|\geq k\gamma$. Hint: use induction, then use the Cauchy-Schwarz inequality ($\|u\|\,\|v\|\geq |u^\top v|$). This shows that $\|w_k\|$ increases as $k$ (the number of errors made) increases.
c)
Now we proceed to prove that $\|w_{k+1}\|$ is also bounded from above as a function of $k$. Show that $\|w_{k+1}\|^2\leq\|w_k\|^2+R^2$.
d)
Use (c) to show that $\|w_{k+1}\|^2\leq kR^2$. This shows that $\|w_k\|$ is bounded from above as a function of $k$.
e)
Use (b) and (d) to show that $k\leq\frac{R^2}{\gamma^2}$. What is the maximum number of errors made? Congratulations, you have just proved the convergence of the perceptron! Rosenblatt would be proud of you.
5.3: But when does it end?
We run the perceptron algorithm with a slight modification (see the algorithm below) on a linearly separable dataset. This algorithm exits the outer loop only when the perceptron has classified all the data correctly and no update has been done in a given iteration. For what values of $T$ in Algorithm 1 do Algorithm 1 and Algorithm 2 always output the same $w$?
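The exact pseudocode of Algorithm 1 and Algorithm 2 is given in the assignment figures; purely as an illustration of the stopping rule just described, here is a minimal Octave sketch of the modified variant, which keeps making passes over the data and exits only once a full pass produces no update:

```octave
% Keep making passes over the data; exit only once a full pass makes no update.
function w = perceptron_until_no_update(X, y)
  [d, n] = size(X);
  w = zeros(d, 1);
  updated = true;
  while updated
    updated = false;
    for i = 1:n
      if y(i) * (w' * X(:, i)) <= 0   % mistake on point i
        w = w + y(i) * X(:, i);       % perceptron update
        updated = true;               % remember that this pass changed w
      end
    end
  end
end
```

By the mistake bound proved in 5.2, this loop terminates on linearly separable data.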