
Homework 3

Due: December 4, 1:25 PM, in class
Up to 5 members in each group. Please include the NetIDs and names of all members on the front page.

Kernels

Kernelized Ridge Regression

In this problem, we do a step-by-step derivation of the kernelized version of ridge regression. In particular, we need to formulate the training and testing problems in ridge regression so that the data points (both training and testing) appear only inside inner products. When formulated in such a way, we can simply replace the inner products with an appropriate kernel to model non-linear functions.
Recall that the optimal weight vector in ridge regression can be found in closed form as: $\beta = (X^T X + \lambda I)^{-1} X^T y$
  1. Show that $\beta$ can be written as $X^T v$ for some vector $v$.
  2. Using the formulation above, we can write the optimal $\beta$ as a weighted combination of training data points, $\beta = \sum_{i=1}^{n} \alpha_i x_i$, where $x_i$ denotes the $i$-th column of $X^T$. So the training problem now becomes finding the optimal $\alpha$, say $\alpha^*$. Show that $\alpha^* = (XX^T + \lambda I)^{-1} y$. This means all the training data points (i.e. the $x_i$'s) appear only inside inner products.
  3. For a given test point $x$, show that computing $\hat{f}(x)$ only requires inner products between $x$ and the training data points.
  4. If we are to store each real number as a floating point number, how many floating point numbers would be required to store the trained model in the non-kernelized and the kernelized versions of ridge regression?
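
For reference, below is a minimal sketch of what the kernelized training and prediction steps look like in code, assuming NumPy and an RBF kernel as one concrete kernel choice; the names rbf_kernel, fit_alpha, and predict are illustrative and not part of the assignment.

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # Pairwise RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * sq_dists)

    def fit_alpha(X_train, y_train, lam=1.0, gamma=1.0):
        # alpha* = (K + lambda I)^{-1} y, where K_ij = k(x_i, x_j) plays the role of X X^T.
        K = rbf_kernel(X_train, X_train, gamma)
        return np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)

    def predict(X_test, X_train, alpha, gamma=1.0):
        # f_hat(x) = sum_i alpha_i k(x_i, x): test points enter only through kernel values.
        return rbf_kernel(X_test, X_train, gamma) @ alpha

Note that both the training solve and the prediction touch the data only through kernel evaluations, which is exactly what questions 2 and 3 establish for the linear kernel (the plain inner product).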

Ball Trees

Remember the ball-tree data structure from class. Assume you have a data set $S = \{x_1, \dots, x_n\}$ and a ball $B$ with center $c$ and radius $r$, such that: $B = \{x_i \in S : \|x_i - c\|^2 \le r^2\}$
  1. Prove that for any test point $x_t$, the distance to any point $x_i \in B$ is always greater than or equal to the distance from $x_t$ to the ball, i.e. $\|x_t - x_i\| \ge \|x_t - c\| - r$. (Hint: Make a drawing and use the triangle inequality.)
  2. In some settings of kNN classification, it can be the case that one class is much more common than the other. Let's call them negative (very common) and positive (very rare). Real-world examples are fraud detection on credit cards, abnormal-behavior detection on security cameras, fault detection in assembly lines, etc.
    1. For a test point $x_t$, if you knew the distance to the $k$ closest positive examples, how could you speed up the kNN classification with ball trees even further?
    2. How would you adapt your ball-tree data structure to such a setting? Explain why your setup can be much faster for kNN classification in the most common case. (Hint: Construct two different ball trees and change the pruning rule.)
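
As a concrete illustration of the pruning rule that question 1 justifies, here is a minimal sketch assuming NumPy; the names ball_lower_bound and can_prune are illustrative only.

    import numpy as np

    def ball_lower_bound(x_t, center, radius):
        # By the triangle inequality, every x_i inside the ball satisfies
        # ||x_t - x_i|| >= ||x_t - center|| - radius (clipped at zero).
        return max(np.linalg.norm(x_t - center) - radius, 0.0)

    def can_prune(x_t, center, radius, dist_to_kth_neighbor):
        # If even the closest possible point of the ball is farther away than the
        # current k-th nearest neighbor, the entire ball can be skipped.
        return ball_lower_bound(x_t, center, radius) > dist_to_kth_neighbor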

CART (Classification and Regression Trees)

You are building a regression tree, and your recursive function is called with data $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ (where we have continuous labels $y_i \in \mathbb{R}$). For a leaf, let the prediction be $p$. Your loss function is the averaged squared loss: $L(S) = \frac{1}{|S|} \sum_{(x_j, y_j) \in S} (y_j - p)^2$.
  1. Show that the average predictor $p = \bar{y} = \frac{1}{n} \sum_{(x_i, y_i) \in S} y_i$ minimizes $L$.
  2. If the termination conditions are not met (i.e. you do not create a leaf), you need to split. For this, you need to find the best split for any feature $f$. Assume we have sorted our data set according to some feature $f$, such that $f_1 \le f_2 \le \dots \le f_n$, where $f_i$ is the value of feature $f$ in example $x_i$. Let all relevant splits be $c_1 \le c_2 \le \dots \le c_{n-1}$, for example $c_i = \frac{f_i + f_{i+1}}{2}$.
    1. Define the predictors $\bar{y}_L^i, \bar{y}_R^i$ of the left and right sub-trees after cutting at $c_i$ (left: $\le c_i$), in terms of $y_1, \dots, y_n$.
    2. Write down the expressions for the losses $L_L^i$ and $L_R^i$ on the left and right sub-trees after cutting at $c_i$. What is the time complexity of computing both terms for any cut point $c_i$?
    3. Express $\bar{y}_L^{i+1}, \bar{y}_R^{i+1}$ in terms of $\bar{y}_L^i, \bar{y}_R^i$ and $y_{i+1}$.
    4. Let $s_i = \sum_{j=1}^{i} y_j^2$ and $r_i = \sum_{j=i+1}^{n} y_j^2$; express $L_L^{i+1}, L_R^{i+1}$ in terms of $y_{i+1}, \bar{y}_L^i, \bar{y}_R^i, s_i, r_i$.
    5. Write a full update rule for iteration $i+1$, assuming you have $\bar{y}_L^i, \bar{y}_R^i, s_i, r_i, L_L^i, L_R^i$.
    6. What is your time complexity for the best-cut search with this method?
    7. What is the time complexity for a best-cut search with Entropy and Gini impurity measures?
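
To make the bookkeeping in the questions above concrete, here is a minimal sketch of the single-pass best-cut search for one already-extracted feature, assuming NumPy; it keeps running sums of $y_j$ and $y_j^2$ in the spirit of $s_i$ and $r_i$, and the name best_cut is illustrative only.

    import numpy as np

    def best_cut(f, y):
        # f: values of one feature, y: continuous labels (same length).
        f, y = np.asarray(f, dtype=float), np.asarray(y, dtype=float)
        order = np.argsort(f)                      # sort once: O(n log n)
        f, y = f[order], y[order]
        n = len(y)

        sum_left, sq_left = 0.0, 0.0               # running sums left of the cut
        sum_right, sq_right = y.sum(), (y ** 2).sum()
        best_loss, best_c = np.inf, None

        for i in range(n - 1):                     # O(1) work per candidate cut c_i
            sum_left += y[i]
            sq_left += y[i] ** 2
            sum_right -= y[i]
            sq_right -= y[i] ** 2
            n_left, n_right = i + 1, n - i - 1
            # Squared error with the mean predictor: sum_j y_j^2 - (sum_j y_j)^2 / n_side.
            loss = (sq_left - sum_left ** 2 / n_left) + (sq_right - sum_right ** 2 / n_right)
            if loss < best_loss:
                best_loss, best_c = loss, (f[i] + f[i + 1]) / 2.0
        return best_c, best_loss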

Decision Trees

In many applications (e.g. medical sciences) it is not always possible to obtain all features (e.g. medical examinations) on all data examples $x_i$ (e.g. patients). This results in some missing features in the train / test data. In those cases we have an additional bit that indicates that a particular feature value does not exist.
  1. How would you adapt decision trees to this sort of data? (Just describe the necessary changes to the algorithm.)
  2. Can you also create a binary decision tree that incorporates missing data?
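
For concreteness, one possible encoding of the "feature value plus missing bit" representation described above, assuming NumPy; the array names X and missing are illustrative, not prescribed by the assignment.

    import numpy as np

    # Three patients, two examinations each; np.nan marks a value that was never measured.
    X = np.array([[5.1, 120.0],
                  [4.8, np.nan],
                  [np.nan, 135.0]])

    missing = np.isnan(X)   # the additional bit per feature value
    print(missing)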

Bagging

Bagging is a method to reduce the variance of a classifier by averaging over several classifiers trained on subsets of the original training data. The subsets are obtained by uniform subsampling with replacement, i.e. if your data is $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, at each iteration you create a new data set $S'$ with $n$ random picks, picking each example pair with probability $\frac{1}{n}$ each time. As a result, you could end up with multiple identical pairs, or with some pairs not present at all.
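
As a small illustration of the subsampling step described above, assuming NumPy; bootstrap_sample is an illustrative name, not part of the assignment.

    import numpy as np

    def bootstrap_sample(n, rng):
        # n independent picks, each index chosen uniformly with probability 1/n,
        # so the same pair can appear several times while others never appear.
        return rng.integers(0, n, size=n)

    rng = np.random.default_rng(0)
    idx = bootstrap_sample(10, rng)
    print(idx, "-> unique pairs:", len(np.unique(idx)))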

Let $p_n(m, k)$ be the probability that you have drawn $m$ unique pairs after $k$ picks, with $|S| = n$. So clearly $p_n(m, k) = 0$ whenever $m > k$ (because you cannot end up with more unique elements $m$ than the number of picks $k$), and also $p_n(m, k) = 0$ whenever $m > n$.
  1. What are the base-case values of $p_n(1, 1)$, $p_n(m, 1)$, and $p_n(1, k)$?
  2. Assume you have already picked $k - 1$ elements. What is the probability that the $k$-th pick will not increase your number of unique elements $m$? What is the probability that it will?
  3. Can you express $p_n(m, k)$ in terms of $p_n(m, k-1)$ and $p_n(m-1, k-1)$?
  4. Write out the formula for $E_{k=n}\left[\frac{m}{n}\right]$, the expected ratio of the number of unique elements ($m$) to the total number of elements ($n$) after $n$ picks (i.e. $k = n$) from the set $S$ with $|S| = n$.
  5. Write a little recursive function (in the programming language of your choice) that evaluates $E_{k=n}\left[\frac{m}{n}\right]$. Plot its value as $n$ increases. What value does it converge to?
  6. If you average over $M$ classifiers, trained on subsets $S_1, \dots, S_M$, what is the probability that one input pair is never picked in any of the training data sets $S_i$? Plot this function as $M$ increases. (Assume that $n$ is large enough for the convergence observed in the previous part.)