Homework 3

Due: December 4, 1:25 PM, in class
Up to 5 members per group. Please include NetIDs and names on the front page.

Kernels

Kernelized Ridge Regression

In this problem, we do a step-by-step derivation of the kernelized version of ridge regression. In particular, we need to formulate the training and testing problems in ridge regression such that the data points (both training and testing) only appear inside inner products. Once formulated this way, we can simply replace the inner products with an appropriate kernel to model non-linear functions.
Recall that the optimal weight vector in ridge regression can be found in closed form as: \[\beta^* = (X^TX+\lambda I)^{-1}X^Ty\]
  1. Show that $\beta^*$ can be written as $X^Tv$ for some vector $v$.
  2. Using the formulation above, we can write the optimal $\beta$ as a weighted combination of training data points \[\beta = \sum_{i=1}^n \alpha_ix_i\] where $x_i$ denotes the $i^{th}$ column of $X^T$. So, the training problem now becomes finding the optimal $\alpha$, say $\alpha^*$. Show that \[\alpha^* = (XX^T + \lambda I)^{-1}y\] This means all the training data points (i.e. the $x_i$s) appear only inside inner products. (A numerical sanity check of this identity is sketched after this list.)
  3. For a given test point $x$, show that computing the prediction $\hat{f}(x)$ only requires inner products between $x$ and the training data points.
  4. If we store each real number as a floating point number, how many floating point numbers are required to store the trained model in the non-kernelized version and in the kernelized version of ridge regression?
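The following minimal sketch (Python/NumPy, on hypothetical random data) numerically checks that the primal closed form and the kernelized form stated in question 2 give the same weight vector; it illustrates the identity but is not a substitute for the derivation.

```python
import numpy as np

# Hypothetical toy data: n training points in d dimensions.
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))   # rows of X are the training points
y = rng.normal(size=n)

# Primal closed form: beta* = (X^T X + lambda I)^{-1} X^T y
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual form stated in question 2: alpha* = (X X^T + lambda I)^{-1} y,
# and beta = X^T alpha*, a weighted combination of the training points.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
beta_dual = X.T @ alpha

print(np.allclose(beta_primal, beta_dual))  # True up to numerical error
```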

Ball Trees

Recall the ball-tree data structure from class. Assume you have a data set $S=\{\vec x_1,\dots,\vec x_n\}$ and a ball $B$ with center $\vec c$ and radius $r$, such that: \begin{equation} B=\left\{\vec x_i\in S : \|\vec x_i-\vec c\|^2\leq r^2\right\} \end{equation}
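As a concrete illustration of this definition, here is a minimal Python/NumPy sketch (on hypothetical random data) that builds one such ball by taking the mean as the center $\vec c$ and the largest distance to it as the radius $r$:

```python
import numpy as np

# Hypothetical data set S of n = 100 points in R^3.
rng = np.random.default_rng(1)
S = rng.normal(size=(100, 3))

c = S.mean(axis=0)                        # center c (here: the mean of S)
r = np.linalg.norm(S - c, axis=1).max()   # radius r covering all of S

# Membership test straight from the definition: ||x_i - c||^2 <= r^2
in_ball = np.sum((S - c) ** 2, axis=1) <= r ** 2
print(in_ball.all())                      # True by construction
```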
  1. Prove that for any test point $\vec x_t$ the distance to any point $\vec x_i\in B$ is always greater than or equal to the distance from $\vec x_t$ to the ball. (Hint: Make a drawing and use the triangle inequality.)
  2. In some settings of $k$NN classification it can be the case that one class is much more common than the other. Let's call them negative (very common) and positive (very rare). Real-world examples are credit-card fraud detection, abnormal-behavior detection on security cameras, fault detection on assembly lines, etc.
    1. For a test point $\vec x_t$, if you knew the distances to the $k$ closest positive examples, how could you speed up $k$NN classification with ball-trees even further?
    2. How would you adapt your ball-tree data structure to such a setting? Explain why your setup can be much faster for $k$NN classification in the most common case. (Hint: Construct two different ball-trees and change the pruning rule.)

CART (Classification and Regression Trees)

You are building a regression tree, and your recursive function is called with data $S=\{(\vec x_1,y_1),\dots,(\vec x_n,y_n)\}$ (where the labels are continuous, $y_i\in{\cal R}$). For a leaf, let the prediction be $p$. Your loss function is the averaged squared loss: \begin{equation} {\cal L}(S)=\frac{1}{|S|}\sum_{(\vec x_j,y_j)\in S} {(y_j-p)}^2.\label{q2:loss} \end{equation}
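A minimal Python/NumPy sketch of this loss (the function name and the toy labels are illustrative, not prescribed by the assignment):

```python
import numpy as np

def leaf_loss(y, p):
    """Averaged squared loss L(S) of a leaf that predicts the constant p."""
    return np.mean((y - p) ** 2)

y = np.array([1.0, 2.0, 2.5, 4.0])   # labels of the examples reaching the leaf
print(leaf_loss(y, 2.0))             # loss for one candidate prediction p
```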
  1. Show that the average predictor $p\leftarrow\bar y=\frac{1}{n}\sum_{(\vec x_i,y_i)\in S}y_i$ minimizes ${\cal L}$.
  2. If the termination conditions are not met (i.e.\ you do not create a leaf), you need to split. For this, you need to find the best split for any feature $f$. Assume we have sorted our data set according to some feature $f$, such that $f_1\leq f_2\leq \cdots \leq f_n$, where $f_i$ is the value of feature $f$ in example $\vec x_i$. Let all relevant splits be $c_1\leq c_2\leq \cdots \leq c_{n-1}$, for example $c_i=\frac{f_{i}+f_{i+1}}{2}$.
    1. Define the predictors $\bar y_L^i, \bar y_R^i$ of the left and right sub-trees, after cutting at $c_i$ (left $\leq c_i$) in terms of $y_1,\dots,y_n$.
    2. Write down the expressions for the loss ${\cal L}_L^i$ and ${\cal L}_R^i$ on the left and right sub-trees, after cutting at $c_i$. What is the time complexity of computing both terms for any single cut point $c_i$? (A naive reference implementation is sketched after this list.)
    3. Express $\bar y^{i+1}_L,\bar y^{i+1}_R$ in terms of $\bar y^{i}_L,\bar y^{i}_R$ and $y_{i+1}$.
    4. Let $s^i=\sum_{j=1}^i y_j^2$ and $r^i=\sum_{j=i+1}^n y_j^2$; express ${\cal L}^{i+1}_L,{\cal L}^{i+1}_R$ in terms of $y_{i+1},\bar y^i_L,\bar y^i_R,s^i,r^i$.
    5. Write a full update rule for iteration $i+1$, assuming you have $\bar y^{i}_L,\bar y^i_R,s^i,r^i,{\cal L}^i_L,{\cal L}^i_R$.
    6. What is your time complexity for the best-cut search with this method?
    7. What is the time complexity for a best-cut search with Entropy and Gini impurity measures?
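For reference, here is a naive Python/NumPy sketch of the best-cut search set up in question 2: it recomputes the left/right means and losses from scratch for every cut, i.e. $O(n)$ work per cut and $O(n^2)$ overall. Questions 2.3 to 2.6 ask you to replace this with incremental updates. Function and variable names are illustrative.

```python
import numpy as np

def best_cut_naive(f, y):
    """f: sorted feature values f_1 <= ... <= f_n; y: the matching labels."""
    n = len(y)
    best_loss, best_cut = np.inf, None
    for i in range(1, n):                       # cut c_i between f_i and f_{i+1}
        c = 0.5 * (f[i - 1] + f[i])
        yL, yR = y[:i], y[i:]
        lossL = np.mean((yL - yL.mean()) ** 2)  # L_L^i with p = mean of the left
        lossR = np.mean((yR - yR.mean()) ** 2)  # L_R^i with p = mean of the right
        total = (i * lossL + (n - i) * lossR) / n
        if total < best_loss:
            best_loss, best_cut = total, c
    return best_cut, best_loss

f = np.array([0.2, 0.5, 1.1, 1.3, 2.0, 2.4])
y = np.array([1.0, 1.2, 0.9, 3.1, 3.0, 2.8])
print(best_cut_naive(f, y))   # best cut falls between 1.1 and 1.3
```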

Decision Trees

In many applications (e.g. the medical sciences) it is not always possible to obtain all features (e.g. medical examinations) for all data examples $\vec x_i$ (e.g. patients). This results in some missing features in the training / test data. In those cases we have an additional bit that indicates that a particular feature value does not exist.
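As a minimal illustration of this data format (a hypothetical encoding, not one prescribed by the assignment), each example can carry a mask of presence bits alongside its feature values:

```python
import numpy as np

# Two hypothetical patients with three features each; a False in `present`
# marks a feature value that does not exist (the stored value is meaningless).
X = np.array([[5.1, 0.0, 3.2],
              [4.8, 1.7, 0.0]])
present = np.array([[1, 0, 1],
                    [1, 1, 0]], dtype=bool)
print(present)
```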
  1. How would you adapt decision trees to this sort of data? (Just describe the necessary changes to the algorithm.)
  2. Can you also create a binary decision tree that incorporates missing data?

Bagging

Bagging is a method to reduce the variance of a classifier by averaging over several classifiers trained on subsets of the original training data. The subsets are obtained by uniform subsampling with replacement, i.e. if your data is $S=\{(\vec x_1,y_1),\dots,(\vec x_n,y_n)\}$, at each iteration you create a new data set $S'$ with $n$ random picks, choosing each example pair with probability $\frac{1}{n}$ on every pick. As a result, $S'$ may contain some pairs multiple times and other pairs not at all.
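A minimal Python/NumPy sketch of one such bagging round (indices stand in for the example pairs; the names and the value of $n$ are illustrative):

```python
import numpy as np

# One bootstrap round: n uniform picks with replacement from {0, ..., n-1}.
rng = np.random.default_rng(2)
n = 20
idx = rng.integers(0, n, size=n)            # each pair is chosen w.p. 1/n per pick

print(sorted(idx))                          # some indices repeat ...
print(len(np.unique(idx)), "unique of", n)  # ... and some never appear
```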

Let $p_n(m,k)$ be the probability that you have drawn $m$ unique pairs after $k$ picks with $|S|=n$. So clearly $p_n(m,k)=0$ whenever $m>k$ (because you cannot end up with more unique elements $m$ than you have drawn), and also $p_n(m,k)=0$ whenever $m>n$.
  1. What are the base-case values of $p_n(1,1), p_n(m,1), p_n(1,k)$?
  2. Assume you have already picked $k-1$ elements, $m$ of which are unique. What is the probability that the $k^{th}$ pick will not increase your number of unique elements $m$? What is the probability that it will?
  3. Can you express $p_n(m,k)$ in terms of $p_n(m,k-1)$ and $p_n(m-1,k-1)$?
  4. Write out the formula for $E_{k=n}[\frac{m}{n}]$, the expected ratio of the number of unique elements ($m$) to the total number of elements ($n$) after $n$ picks (i.e. $k=n$) from the set $S$ with $|S| = n$.
  5. Write a little recursive function (in the programming language of your choice) that evaluates $E_{k=n}[\frac{m}{n}]$. Plot its value as $n$ increases. What value does it converge to?
  6. If you average over $M$ classifiers, trained on subsets $S'_1,\dots,S'_M$, what is the probability that one particular input pair is never picked for any of the training data sets $S'_i$? Plot this function as $M$ increases. (Assume that $n$ is large enough for the convergence observed in question 5.)