## Homework 5

Due: Friday 04/28/2017 11:59pm on CMS
Up to 5 members in each group. Please include NetID and names on front page.

### Problem 1: Kernels as covariance functions

In this problem, we will take a look at gaining a more intuitive understanding of kernels and how the choice of kernel can affect your machine learning model. To do this, we explore the use of kernels as covariance functions in Gaussian process regression. Recall from class that in Gaussian process regression, we often define the covariance between two labels $Y_{i}$ and $Y_{j}$ of points $x_{i}$ and $x_{j}$ with a kernel, $$\textrm{Cov}(Y_{i},Y_{j}) = k(x_{i},x_{j})$$. This implies that the correlation coefficient between $Y_{i}$ and $Y_{j}$ is then given by $$\textrm{Corr}(Y_{i},Y_{j}) = \frac{\textrm{Cov}(Y_{i},Y_{j})}{\sqrt{\textrm{Var}(Y_{i})\textrm{Var}(Y_{j})}}=\frac{k(x_{i},x_{j})}{\sqrt{k(x_{i},x_{i})k(x_{j},x_{j})}}$$ This is in some sense the key assumption made by GPs about your data: the correlation between the labels of two data points is given by the equation above.

(a) Suppose we are using the the 0 mean Gaussian process model used in class $$\mathbf{Y} \sim \mathcal{N}(0,\mathbf{K})$$ and furthermore assume that, for any $x$ and $z$, $k(x,x)=k(z,z)=s$. Show that if $\textrm{Corr}(Y_{i},Y_{j})=1$, then $p(Y_{i}|Y_{j}=y_{j},x_i,x_j)=\mathcal{N}(y_{j},0)$, and similarly $p(Y_{j}|Y_{i}=y_{i},x_i,x_j)=\mathcal{N}(y_{i},0)$. Hint: use the Gaussian conditioning rules given in class to derive $p(Y_{i}|Y_{j}=y_{j},x_i,x_j)$, and write it in terms of the correlation coefficient. Based on this form, what happens as $\textrm{Corr}(Y_{i},Y_{j})$ approaches 0?

.

SOLUTION: The GP posterior mean of $p(Y_j|Y_i=y_i,x_i,x_j)$ is given by $k(x_j,x_i)\frac{1}{k(x_i,x_i)}y_i$. Given that $k(x_i,x_i)=s$, this means that the mean is: $$\frac{k(x_j,x_i)}{s}y_i=\frac{Cov(Y_i,Y_j)}{s}y_i=Corr(Y_i,Y_j)y_i$$

Since we know the correlation is 1, this means the GP posterior mean is $y_i$. Next, the variance is $k(x_j,x_j)-\frac{k(x_j,x_i)k(x_i,x_j)}{k(x_i,x_i)}$. Again substituting $k(x_i,x_i)=s$, we have $s-\frac{k(x_j,x_i)}{s}k(x_j,x_i)$. We can factor an $s$ out to equate this to: $$s\left(1 - \frac{k(x_j,x_i)}{s}\frac{k(x_j,x_i)}{s}\right)=s\left(1 - Corr(Y_i,Y_j)^2\right)$$. Since the correlation is 1, this means the variance is $s(1-1)=0$. Thus, $p(Y_j|Y_i=y_i,x_i,x_j)=\mathcal{N}(y_i,0)$.

(b) Suppose you were to do Gaussian process regression in one dimension (i.e., $x_{i} \in \mathcal{R}$) as above using the kernel $$k_{PER}(x_{i},x_{j}) = \exp\left(-2\sin^{2}\left(\frac{\pi}{p} \vert x_{i} - x_{j} \vert \right)\right)$$

(i) For arbitrary $z$, $z+p$, and $z+2p$ (where $p$ is the parameter in the kernel above), compute $\textrm{Corr}(Y_{z},Y_{z+p})$, $\textrm{Corr}(Y_{z},Y_{z+2p})$ and $\textrm{Corr}(Y_{z+p},Y_{z+2p})$. Based on the result from part (a), what does this tell you about $Y_{z+p},Y_{z+2p},Y_{z+3p},...$ if you are given that $Y_{z}=y_{z}$?

SOLUTION: The correlations between any of these turn out to be 1. Based on part 1, this suggests that if we are given the label of $z$ as $y_z$, our model is immediately confident that the label of $z+p$, $z+2p$, and so on are all also $y_z$.

(ii) Now suppose that you are told that $p=4$ in the kernel above. Suppose further that you are given the following labeled dataset: $$x_{1}=1,y_{1}=1 \\ x_{2}=2,y_{2}=2 \\ x_{3}=3,y_{3}=3 \\ x_{4}=4,y_{4}=2$$. What is the Gaussian process mean prediction for the following test points: $$x_{5}=5, x_{6}=6, x_{7}=7, x_{8}=8,x_{9}=9,x_{10}=10,x_{11}=11,x_{12}=12$$ Hint: You should have to do no math. Plot by hand the Gaussian process mean prediction as a function of $x$ on the range $x \in [0,17]$. What kind of functions do you think a Gaussian process model with this kernel would be good at learning?

SOLUTION: Notice that $x_5 = x_1 + p$, and $x_9 = x_1 + 2p$. Indeed, every test point we are asked about is one or two periods from the data we are given. Based on the result in part two, we can immediately conclude that the GP prediction for $x_5$ and $x_9$ will be $1$, the prediction for $x_6$ and $x_10$ will be $2$, the prediction for $x_7$ and $x_11$ will be 3, and the prediction for $x_8$ and $x_12$ will be $2$. For the plot, any roughly sinusoidal periodic function would be acceptable: the main point to get across here was that the GP predictive mean was a periodic function.

(c) Let $k_{b}(x,z)$ be any kernel with the property that for any $x$ and $z$, $k_{b}(x,x)=k_{b}(z,z)=s$. Define a new kernel in terms of this base kernel, $$k_{sym}(x,z) = k_{b}(x,z)+k_{b}(-x,z)$$ Let $Y_{x}$ be the label of some point $x$, and let $Y_{-x}$ be the label of $-x$. Using the definition of correlation above, show that $k_{sym}(x,z)$ encodes reflective symmetry, i.e. the predictive mean satisfies $f(x)=f(-x)$, by showing that $\textrm{Corr}(Y_{x},Y_{-x})=1$ (hint: recall that $k(x,z)=k(z,x)$ for any kernel).

SOLUTION: Substituting the definition of $k_{sym}$ in to the correlation equation given: $$Corr(Y_x,Y_{-x})=\frac{k_b(x,-x)+k_b(-x,-x)}{\sqrt{(k(x,x)+k(-x,x))(k(-x,-x)+k(x,-x))}}$$. Next, substituting in the facts that $k(x,x)=k(-x,-x)=s$ and that $k(x,-x)=k(-x,x)$: $$Corr(Y_x,Y_{-x})=\frac{k_b(-x,x)+s}{\sqrt{(s+k(-x,x))(s+k(-x,x))}}$$. From this point, we simplify to arrive at $$Corr(Y_x,Y_{-x})=\frac{k_b(-x,x)+s}{k_b(-x,x)+s}=1$$.

### Problem 2: Efficiently implementing regression trees

You are building a regression tree, and right now you need to choose a split for the given dataset $S=\{(\vec x_1,y_1),\dots,(\vec x_n,y_n)\}$ (where we have continuous labels $y_i\in{\cal R}$). Suppose you split on some feature $j$ with value $c$ and partition the dataset in to two sets of indices, $L$--the set of indices on the left (i.e., $i \in L \Rightarrow [x_{i}]_{j} < c$)--and $R$--the set of indices on the right (i.e., $i \in R \Rightarrow [x_{i}]_{j} > c$). Suppose you assign every data point on the left the prediction $T_{L}$ and every data point on the right the prediction $T_{R}$. Finally, suppose that each data point $x_{i}$ has an associated weight $w_{i}$, and that the weights are normalized (i.e., $\sum_{i} w_{i} = 1$). Your loss function is the averaged weighted squared-loss: $${\cal L}(S)=\sum_{i \in L} {w_{i}(y_{i} - T_{L})}^2+\sum_{i \in R} {w_{i}(y_{i} - T_{R})}^2.\label{q2:loss}$$

(a) First, show that setting $T_{L}$ and $T_{R}$ to the weighted average label over their respective sets (i.e., $T_{L} = \frac{1}{W_{L}}\sum_{i\in L}w_{i}y_{i}$ and $T_{R} = \frac{1}{W_{R}}\sum_{i\in R}w_{i}y_{i}$) minimizes the loss $\cal L$, where $W_{L}=\sum_{i \in L}w_{i}$ and $W_{R}=\sum_{i \in R} w_{r}$ are the total weight of the left and right side respectively.

SOLUTION: We first take the derivative of the loss with respect to $T_{L}$ to obtain $$\frac{d}{dT_{L}} {\cal L}(S) = -2\sum_{i \in L}w_{i}(y_i - T_L)=-2\sum_{i\in L}w_iy_i + 2T_{L}\sum_{i}w_{i}$$ Setting this equal to zero and solving, we get $$2T_{L}w_{L}=2\sum_{i \in L}w_{i}y_{i}$$ and therefore $$T_{L} = \frac{1}{w_{L}}\sum_{i \in L}w_{i}y_{i}$$ A symmetric argument holds for $T_{R}$.

(b) Now, imagine you are considering splitting on some feature $j$, and suppose you have already sorted the training points in the order of this feature value, so that $[x_{1}]_{j} < [x_{2}]_{j} < \cdots < [x_{n}]_{j}$. You'd like to choose a split from among $c_{1} \leq c_{2} \leq \cdots \leq c_{n-1}$, where $c_{i}=\frac{[x_{i}]_{j}+[x_{i+1}]_{j}}{2}$. One way to do this would be to, for each possible split $c_{k}$, decide whether each $x_{i}$ should be partitioned left or right, and compute $\cal L$. At the end, take the split with the lowest loss. What is the running time complexity of this algorithm in terms of the number of data points $n$?

SOLUTION: There are $O(n)$ splits to consider, and the proposed algorithm would require $O(n)$ per split to evaluate $\cal L$, for a total of $O(n^2)$ time.

(c) Suppose some split $c_{k}$ results in the data being partitioned in to $L^{(k)}$ and $R^{(k)}$. Suppose you are given the following quantities precomputed: $W_{L^{(k)}}$, $P_{L^{(k)}} = \sum_{i \in L} w_{i}y_{i}$, and $Q_{L^{(k)}} = \sum_{i \in L} w_{i}y_{i}^{2}$. Similarly, you are given $W_{R^{(k)}}$, $P_{R^{(k)}}$ and $Q_{R^{(k)}}$. Show how, equipped with these precomputed quantities, you can compute $\cal L$ in constant time.

SOLUTION: Expand the left side of the loss to $$\sum_{i \in L}w_{i}y_{i}^{2} - 2\sum_{i \in L}w_{i}y_{i}T_{L} + \sum_{i \in L}w_{i}T_{L}^{2}$$. The first term is exactly $Q_{L^{(k)}}$. The second term can be written as $-2P_{L^{(k)}}\frac{P_{L^{(k)}}}{W_{L^{(k)}}}=-2\frac{P_{L^{(k)}}^{2}}{w_{L^{(k)}}}$. The last term can be written as $w_{L^{(k)}}\frac{P_{L^{(k)}}^{2}}{w_{L^{(k)}}^{2}}=\frac{P_{L^{(k)}}^{2}}{w_{L^{(k)}}}$. The first term plus the third term is therefore simply $-\frac{P_{L^{(k)}}^{2}}{w_{L^{(k)}}}$. Therefore the whole expression can be evaluated as: $$Q_{L^{(k)}}-\frac{P_{L^{(k)}}^{2}}{w_{L^{(k)}}}$$ Similarly, the right term is: $$Q_{R^{(k)}}-\frac{P_{R^{(k)}}^{2}}{w_{R^{(k)}}}$$

(d) If we split on $c_{k+1}$ instead of $c_{k}$, how many data points move from the right partition to the left partition? How many move from left to right? Based on this:

1. Give an efficient update rule to compute $W_{L^{(k+1)}}$ and and $W_{R^{(k+1)}}$ given $W_{L^{(k)}}$ and $W_{R^{(k)}}$.
2. Give an efficient update rule to compute $P_{L^{(k+1)}}$ and $P_{R^{(k+1)}}$ given $P_{L^{(k)}}$ and $P_{R^{(k)}}$.
3. Give an efficient update rule to compute $Q_{L^{(k+1)}}$ and $Q_{R^{(k+1)}}$ given $Q_{L^{(k)}}$ and $Q_{R^{(k)}}$.

SOLUTION: If all feature values are distinct, only one data point moves from $R$ to $L$ when moving from split $k$ to split $k+1$. Therefore, we simply update the values accordingly. For example, we subtract $w_{k}$ from $W_{R^{(k)}}$ and add it to $W_{L^{(k)}}$. We subtract $w_{k}y_{k}$ from $P_{R^{(k)}}$ and add it to $P_{L^{(k)}}$. We subtract $w_{k}y_{k}^{2}$ from $Q_{R^{(k)}}$ and add it to $Q_{L^{(k)}}$. Crucially, all of these updates take only constant time.

(e) Suppose that you compute the loss of each split using the update rules you derived in part d. What is the time complexity of finding the best cut using this method?

SOLUTION: Now, the loss at each split can be computed in constant ($O(1)$) time given the $W$, $P$, and $Q$ values. Furthermore, each of these can be updated in constant time when moving from one split to the next. As a result, each split takes only constant time to evaluate. Since there are $n$ splits, the whole evaluation takes $O(n)$ time.

### Problem 3: Decision trees with missing data

In many applications (e.g. medical sciences) it is not always possible to obtain all features (e.g. medical examinations) on all data examples $\vec x_i$ (e.g. patients). This results in some missing features in the train / test data. In those cases we have an additional bit that indicates that a particular feature value does not exist.
1. How would you adapt decision trees to this sort of data? (Just describe the necessary changes to the algorithm.)
2. Can you also create a binary decision tree that incorporates missing data?

SOLUTION: There are many valid solutions to dealing with missing data. One common option is imputation, where each missing feature is estimated, for example using the median feature value from the data that is given. There are a number of alternative approaches. For example, when splitting on a feature if a data point is missing the value, it can be sent to both child nodes and the weights halved.