Homework 1
Due: Monday 09/28/2015 2:15pm in class
Up to 5 members in each group. Please include NetIDs and names on the front page.
Problem 1: L2 distances through matrix operations
This is the written part of project 0.
Many machine learning algorithms access their input data primarily through pairwise distances. It is therefore important that we have a fast function that computes pairwise (Euclidean) distances of input vectors. Assume we have $n$ data vectors $\vec{x}_1,\dots,\vec{x}_n\in\mathbb{R}^d$ and $m$ vectors $\vec{z}_1,\dots,\vec{z}_m\in\mathbb{R}^d$. Let us define two matrices $X=[\vec{x}_1,\dots,\vec{x}_n]\in\mathbb{R}^{d\times n}$, where the $i$th column is the vector $\vec{x}_i$, and similarly $Z=[\vec{z}_1,\dots,\vec{z}_m]\in\mathbb{R}^{d\times m}$. Our distance function takes as input the two matrices $X$ and $Z$ and outputs a matrix $D\in\mathbb{R}^{n\times m}$, where
$$D_{ij}=\sqrt{(\vec{x}_i-\vec{z}_j)^\top(\vec{x}_i-\vec{z}_j)}.$$
The key to efficient programming in Octave, and in machine learning in general, is to think in terms of mathematics and not in terms of loops.
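For reference, here is a minimal Octave sketch of the naive, loop-based computation of $D$ that a vectorized solution should improve upon (the function name is just for illustration):

```octave
function D = l2distance_slow(X, Z)
  % X is d x n, Z is d x m; returns D (n x m) with D(i,j) = ||x_i - z_j||.
  n = size(X, 2);
  m = size(Z, 2);
  D = zeros(n, m);
  for i = 1:n
    for j = 1:m
      delta = X(:, i) - Z(:, j);       % x_i - z_j
      D(i, j) = sqrt(delta' * delta);  % Euclidean distance
    end
  end
end
```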
(a) Show that the Gram matrix (aka inner-product matrix)
$$G_{ij}=\vec{x}_i^\top\vec{z}_j$$
can be expressed in terms of a pure matrix multiplication.
(b) Let us define two new matrices $S,R\in\mathbb{R}^{n\times m}$,
$$S_{ij}=\vec{x}_i^\top\vec{x}_i, \qquad R_{ij}=\vec{z}_j^\top\vec{z}_j.$$
Show that the squared-Euclidean distance matrix $D^2\in\mathbb{R}^{n\times m}$, defined as
$$D^2_{ij}=(\vec{x}_i-\vec{z}_j)^\top(\vec{x}_i-\vec{z}_j),$$
can be expressed as a linear combination of the matrices $S$, $G$, $R$. (Hint: It might help to first express $D^2_{ij}$ in terms of inner products.) What do you need to do to obtain the true Euclidean distance matrix $D$?
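The hint can be sanity-checked numerically; the short Octave sketch below expands a single squared distance into inner products for random vectors:

```octave
% Check that (x - z)'(x - z) equals x'x - 2*x'z + z'z for random vectors.
d = 5;
x = randn(d, 1);
z = randn(d, 1);
lhs = (x - z)' * (x - z);
rhs = x' * x - 2 * x' * z + z' * z;
printf("difference: %g\n", abs(lhs - rhs));  % ~0 up to floating-point error
```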
Problem 2: k-Nearest-Neighbor
In this problem, you are going to look at a small dataset to understand various properties of k-NN better. Suppose there is a set of points on a two-dimensional plane from two different classes. Below are the coordinates of all points.
Points in class Red: (0, 1), (2, 3), (4, 4)
Points in class Blue: (2, 0), (5, 2), (6, 3)
(a) Draw the k-nearest-neighbor decision boundary for $k = 1$. Remember that the decision boundary is defined as the line where the classification of a test point changes. Use the standard Euclidean distance between points to determine the nearest neighbors. Start by plotting the points on a two-dimensional graph. Please use the corresponding colors for the points of each class (e.g., blue and red).
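A minimal Octave sketch for the plotting step, using the coordinates listed above (the decision boundary itself still has to be drawn by hand):

```octave
% Plot the training points; red = class Red, blue = class Blue.
red  = [0 1; 2 3; 4 4];   % each row is a point (x, y)
blue = [2 0; 5 2; 6 3];
figure; hold on;
plot(red(:, 1),  red(:, 2),  'ro', 'MarkerFaceColor', 'r');
plot(blue(:, 1), blue(:, 2), 'bo', 'MarkerFaceColor', 'b');
axis equal; grid on;
legend('Red', 'Blue');
hold off;
```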
(b) If the y-coordinate of each point were multiplied by 5, what would happen to the $k = 1$ boundary (draw another picture)? Explain in at most two sentences how this effect might cause problems when working with real data.
(c) The k-NN decision boundary for $k = 3$ is shown below. Suppose now we have a test point at (1, 2). How would it be classified under 3-NN? Given that you can modify the 3-NN decision boundary by adding points to the training set in the diagram, what is the minimum number of points that you need to add to change the classification at (1, 2)? Show also where you need to add these points.
(d) What is the testing complexity for one instance, i.e., how long does it take for k-NN to classify one point? Assume your data has dimensionality $d$, you have $n$ training examples, and you use the Euclidean distance. Assume also that you use a quickselect implementation which gives you the $k$ smallest elements of a list of length $m$ in $O(m)$.
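As a concrete illustration of the quantities involved, here is a small Octave sketch that classifies a single test point; it uses a full sort in place of quickselect, so its selection step costs $O(n\log n)$ rather than the $O(n)$ assumed in the problem. The names `Xtr`, `ytr`, and `xtest` are placeholders.

```octave
% Classify one test point with k-NN.
% Xtr is d x n, ytr is 1 x n (numeric class labels), xtest is d x 1.
function yhat = knn_classify(xtest, Xtr, ytr, k)
  delta = Xtr - repmat(xtest, 1, size(Xtr, 2));  % d x n differences
  dists = sqrt(sum(delta .^ 2, 1));              % n distances: O(nd)
  [~, idx] = sort(dists);                        % full sort stands in for quickselect
  yhat = mode(ytr(idx(1:k)));                    % majority vote among the k nearest
end
```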
Problem 3: Naive Bayes
In this problem, we explore why the naive assumption is necessary.
After learning about using Bayes rule to make predictions (without the naive assumption), you want to apply the method to a complex problem. Keep in mind that you want to use the rule:
$$P(Y\mid X)=\frac{P(X\mid Y)\,P(Y)}{P(X)}$$
and you want to estimate the parameters of $P(X\mid Y)$ and $P(Y)$. However, before applying the method to your problem, you want to apply it to a toy problem first.
a)
In the toy problem, $X=[X_1,X_2]$ (so $d=2$), where each $X_i$ is binary. $Y$ is also binary. You want to estimate $P(X\mid Y)$ without the naive Bayes assumption; that is, you cannot write $P(X\mid Y)=\prod_{i=1}^{d}P(X_i=x_i\mid Y=y)$. Instead you must estimate $P(X\mid Y)=P(X_1=x_1,X_2=x_2,\dots,X_d=x_d\mid Y=y)$ for all combinations of the values of $x_1,x_2,\dots,x_d$ and $y$. How many parameters do you have to estimate for your toy problem?
(Here a parameter refers to an estimate of $P(X_1=x_1,X_2=x_2,\dots,X_d=x_d\mid Y=y)$ or $P(Y=y)$ for some combination of the $x_i$'s and $y$. In our case where $d=2$, examples of such parameters are $P(X_1=1,X_2=0\mid Y=1)$ and $P(Y=1)$.)
b)
After running the decision rule on your toy problem, you decide to apply it to the actual problem. However, in your problem, $d=100$. How many parameters do you have to estimate now?
c)
When is it necessary for us to make the naive assumption? Explain by showing how the assumption will affect one of the answers from above.
Problem 4: Naive Bayes
a)
We will attempt to write the multinomial naive Bayes classifier's decision rule as a linear rule. Suppose that we have a document that is represented as a sequence $d=(w_1,w_2,\dots,w_l)$ of length $l$. This can be classified as:
$$h(d)=\operatorname*{argmax}_{y\in\{+1,-1\}}\ \Pr(Y=y)\prod_{i=1}^{l}\Pr(W=w_i\mid Y=y)$$
Assuming that we have estimates for all the appropriate probabilities and none of them are zero, rewrite $h(d)$ as $h(d)=\operatorname{sign}(\vec{v}\cdot\vec{x}+b)$. We assume that we have a vocabulary $\{a_1,a_2,\dots,a_m\}$ that contains $m$ words. Each word $w_i$ in the document is one of the $a_j$, and $\vec{x}$ is a vector where each component $x_j$ is the count of the number of times the word $a_j$ appears in the document. Provide $\vec{v}$ and $b$.
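To make the representation concrete, here is a small Octave sketch that builds the count vector $\vec{x}$ from a document; the vocabulary and document below are made-up examples.

```octave
% Build the bag-of-words count vector x for a document.
vocab = {'the', 'cat', 'sat'};          % hypothetical vocabulary {a_1, ..., a_m}
doc   = {'the', 'cat', 'sat', 'the'};   % hypothetical document (w_1, ..., w_l)
m = numel(vocab);
x = zeros(m, 1);
for i = 1:numel(doc)
  j = find(strcmp(vocab, doc{i}));      % which vocabulary word is w_i?
  x(j) = x(j) + 1;                      % count occurrences of a_j
end
disp(x');                               % prints 2 1 1
```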
b)
Previous problems considered only those cases where $X$ consists of discrete values. Now we will also consider the case where $X$ can take continuous values.
But first, (i) write down the maximum likelihood estimates (MLE) for the parameters of the naive Bayes classifier, $\Pr(X_i\mid Y)$ and $\Pr(Y)$, where $X$ takes discrete, categorical values.
Now let's look at a naive Bayes classifier that takes vectors of continuous values as input. In this case, we need a different formulation for $\Pr(X_i\mid Y)$ (we don't need to worry about $\Pr(Y)$ because $Y$ still takes discrete values). One way is to assume that, for each discrete value $y$, each $X_i$ comes from a Gaussian.
For example, consider the simplified case above, where $X_i$ takes a continuous value (a value along the x-axis) and $Y$ can be either blue or green. As shown above, for each discrete value of $Y$ (blue or green), $X_i$ is a random variable drawn from a Gaussian specific to $X_i$ (not some other $X_j$) and to the value of $Y$.
As a result, we see two Gaussians, one generating the blue data points and one generating the green data points (see the sketch below).
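To make the generative assumption concrete, here is a minimal Octave sketch that samples from such a model for a single feature $X_i$; the means and standard deviations are hypothetical.

```octave
% Sample from the model: for each label, X_i is drawn from that label's Gaussian.
mu_blue  = 0;  sigma_blue  = 1;   % hypothetical parameters for Y = blue
mu_green = 3;  sigma_green = 1;   % hypothetical parameters for Y = green
n = 100;
is_blue = rand(n, 1) < 0.5;       % Bernoulli label for each point
x = zeros(n, 1);
x(is_blue)  = mu_blue  + sigma_blue  * randn(sum(is_blue), 1);   % blue points
x(~is_blue) = mu_green + sigma_green * randn(sum(~is_blue), 1);  % green points
```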
With this, we get a Gaussian naive Bayes classifier. Using this assumption, (ii) write down the MLE for the parameters of $\Pr(X_i\mid Y)$. (Note: since $\Pr(X_i\mid Y=y)$ is Gaussian, its parameters are its mean $\mu_{y,i}$ and standard deviation $\sigma_{y,i}$.) (iii) What is the total number of parameters?
c)
Using what we found in part (b), we will reformulate the classifier once more. Remember how a Gaussian naive Bayes classifier is defined:
- $X_i$ is continuous
- $\Pr(X_i\mid Y=y)$ is a Gaussian with mean $\mu_{y,i}$ and standard deviation $\sigma_i$ (we will assume that each $\sigma$ depends only on the feature and not on the label $y$)
- given the label, the features are conditionally independent, just like in the discrete version
- $Y$ is a Bernoulli random variable
Now, find the weight vector $\vec{w}$ such that
$$\Pr(Y=+1\mid X)=\frac{\Pr(X\mid Y=+1)\,\Pr(Y=+1)}{\Pr(X)}=\frac{1}{1+\exp\!\left(-\left(w_0+\sum_{i=1}^{n}w_i X_i\right)\right)}$$
Be sure to use your answers for part (b).
Note: The result that you got is precisely the formulation for logistic regression, which you can read about here. However, Gaussian naive Bayes and logistic regression are different learning algorithms. They output (asymptotically) the same model only under special conditions. We will explore the relation further when we cover logistic regression.
Problem 5: Perceptron
In this problem we will show that one of the oldest machine learning algorithms, the perceptron, indeed learns. The perceptron, invented by former Cornell professor Frank Rosenblatt, seeks a hyperplane that separates the data by sequentially updating an estimate when it makes an error. In other words, it tries to find a parameter vector $w$ such that for every feature-label pair $(x_i,y_i)$, where $y_i\in\{-1,1\}$, we have $w^\top x_i>0$ if $y_i=1$ and $w^\top x_i<0$ if $y_i=-1$.
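For reference, here is a minimal Octave sketch of one pass of the standard perceptron update; this is only an illustration, and the formal proof below should follow the pseudocode given with the assignment.

```octave
% One pass of perceptron updates (X is d x n, y is 1 x n with entries in {-1, +1}).
function w = perceptron_pass(X, y, w)
  for i = 1:size(X, 2)
    if y(i) * (w' * X(:, i)) <= 0   % mistake (or point exactly on the hyperplane)
      w = w + y(i) * X(:, i);       % the perceptron update
    end
  end
end
```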
5.1: Learning a single input
First, show that if we feed only one point to the perceptron algorithm over and over again (hence the dataset is only $(x_1,y_1)$), it will get it right after the first update. (You may assume $x_1\neq 0$.)
5.2: Learning a linearly separable dataset
We will now prove a theorem which bounds the number of errors that the perceptron algorithm makes assuming the data is linearly separable. We will make the following assumptions:
- (The data is linearly separable and there exists a margin $\gamma$ between the positive examples and the negative examples.) Assume there exists a unit vector $w^*$ with $\|w^*\|=1$, and some $\gamma>0$ such that for all $t=1,\dots,n$,
$$y_t\,(w^{*\top} x_t)\geq\gamma.$$
- (The data lies in a ball of radius $R$.) Assume that for all $t=1,\dots,n$, $\|x_t\|\leq R$.
Note that we assume there exists a parameter vector $w^*$ which has a "margin" of $\gamma$, but we do not claim that the perceptron algorithm will find $w^*$; we simply claim that the algorithm will find a parameter vector which separates the data.
We define $w_k$ to be the parameter vector (denoted $w$ in the algorithm) when the $k$-th update happens (which is not the same as the iteration variable $j$ or $t$ in the algorithm). The proof relies on the observation that the magnitude of $w_k$ (denoted $\|w_k\|$) can be bounded from above and from below as a function of $k$, but the lower bound increases faster than the upper bound! Hence, after finitely many updates, the algorithm must stop.
a)
In this first part of the proof, we show that $\|w_{k+1}\|$ is bounded from below as a function of $k$. Show that $w_{k+1}^\top w^* \geq w_k^\top w^* + \gamma$.
b)
Use part (a) to show that $\|w_{k+1}\|\geq k\gamma$. Hint: use induction, then use the Cauchy-Schwarz inequality ($\|u\|\,\|v\|\geq |u^\top v|$). This shows that $\|w_k\|$ increases as $k$ (the number of errors made) increases.
c)
Now we proceed to prove that $\|w_{k+1}\|$ is also bounded from above as a function of $k$. Show that $\|w_{k+1}\|^2\leq\|w_k\|^2+R^2$.
d)
Use (c) to show that $\|w_{k+1}\|^2\leq kR^2$. This shows that $\|w_k\|$ is bounded from above as a function of $k$.
e)
Use (b) and (d) to show that $k\leq\frac{R^2}{\gamma^2}$. What is the maximum number of errors made? Congratulations, you have just proved the convergence of the perceptron! Rosenblatt would be proud of you.
5.3: But when does it end?
We run the perceptron algorithm with a slight modification (see the algorithm below) on a linearly separable dataset. This algorithm exits the outer loop only when the perceptron has classified all the data correctly and no update has been done in a given iteration. For what values of $T$ in Algorithm 1 do Algorithm 1 and Algorithm 2 always output the same $w$?
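The exact pseudocode of Algorithm 1 and Algorithm 2 is given in the assignment figures; purely as an illustration of the stopping rule just described, here is a minimal Octave sketch of the modified variant, which keeps making passes over the data and exits only once a full pass produces no update:

```octave
% Keep making passes over the data; exit only once a full pass makes no update.
function w = perceptron_until_no_update(X, y)
  [d, n] = size(X);
  w = zeros(d, 1);
  updated = true;
  while updated
    updated = false;
    for i = 1:n
      if y(i) * (w' * X(:, i)) <= 0   % mistake on point i
        w = w + y(i) * X(:, i);       % perceptron update
        updated = true;               % remember that this pass changed w
      end
    end
  end
end
```

By the mistake bound proved in 5.2, this loop terminates on linearly separable data.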