## Homework 4

Due: Friday 04/14/2017 11:59pm on CMS
Up to 5 members in each group. Please include NetIDs and names on the front page.

### Problem 1: Bias-variance of linear regression

In this problem, we derive the bias and variance of the MLE and MAP solutions of one-dimensional linear regression. Let $$y_i = w x_i + \epsilon_i$$ where $x_i \in \mathbb{R}$ are input features, $w \in \mathbb{R}$ is the true model parameter, and $\epsilon_i \sim N(0,\sigma^2)$ are iid Gaussian noise. The MLE solution is $$w^* = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}.$$ For any $\lambda > 0$, the MAP (i.e. ridge regression) solution is $$w^*(\lambda) = \frac{\sum_{i=1}^n x_i y_i}{\lambda + \sum_{i=1}^n x_i^2}.$$

(a) Derive the expectation and variance of $w^*$. Is $w^*$ an unbiased estimator of $w$?

Solution: \begin{align*} \mathbb{E}[w^*] &= \mathbb{E} \left[ \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \right] \\ &= \frac{\sum_{i=1}^n x_i \mathbb{E}_D[y_i]}{\sum_{i=1}^n x_i^2} \\ &= \frac{\sum_{i=1}^n x_i w x_i}{\sum_{i=1}^n x_i^2} \\ &= w \end{align*} \begin{align*} \text{Var}(w^*) &= \text{Var} \left( \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \right) \\ &= \frac{\sum_{i=1}^n x_i^2 \text{Var}_D[y_i]}{\left( \sum_{i=1}^n x_i^2 \right)^2} \\ &= \frac{\sum_{i=1}^n x_i^2 \sigma^2}{\left( \sum_{i=1}^n x_i^2 \right)^2} \\ &= \frac{\sigma^2}{\sum_{i=1}^n x_i^2} \end{align*}
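As a quick numerical sanity check (not part of the required derivation), the following NumPy simulation draws many datasets $D$ with the features held fixed and compares the empirical mean and variance of $w^*$ to the formulas above. The constants `n`, `w`, `sigma` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w, sigma = 50, 2.0, 0.5
x = rng.uniform(0.0, 1.0, size=n)      # features, held fixed across datasets

# Draw many datasets D (fresh noise each time) and compute the MLE for each.
trials = 100_000
eps = rng.normal(0.0, sigma, size=(trials, n))
y = w * x + eps                         # y_i = w x_i + eps_i, one row per dataset
w_mle = (y @ x) / (x @ x)               # sum_i x_i y_i / sum_i x_i^2, per dataset

print(w_mle.mean())                     # ≈ w            (unbiased)
print(w_mle.var())                      # ≈ sigma^2 / sum_i x_i^2
print(sigma**2 / (x @ x))
```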

Since $\mathbb{E}[w^*] = w$, $w^*$ is an unbiased estimator of $w$.

(b) For any $\lambda > 0$, derive the expectation and variance of $w^*(\lambda)$. Is $w^*(\lambda)$ an unbiased estimator of $w$?

Solution: \begin{align*} \mathbb{E}[w^*(\lambda)] &= \mathbb{E} \left[ \frac{\sum_{i=1}^n x_i y_i}{\lambda + \sum_{i=1}^n x_i^2} \right] \\ &= \frac{\sum_{i=1}^n x_i \mathbb{E}_D[y_i]}{\lambda + \sum_{i=1}^n x_i^2} \\ &= \frac{\sum_{i=1}^n x_i w x_i}{\lambda + \sum_{i=1}^n x_i^2} \\ &= \left( \frac{\sum_{i=1}^n x_i^2}{\lambda + \sum_{i=1}^n x_i^2} \right) w \end{align*} \begin{align*} \text{Var}(w^*(\lambda)) &= \text{Var} \left( \frac{\sum_{i=1}^n x_i y_i}{\lambda + \sum_{i=1}^n x_i^2} \right) \\ &= \frac{\sum_{i=1}^n x_i^2 \text{Var}_D[y_i]}{\left( \lambda + \sum_{i=1}^n x_i^2 \right)^2} \\ &= \frac{\sum_{i=1}^n x_i^2 \sigma^2}{\left( \lambda + \sum_{i=1}^n x_i^2 \right)^2} \\ &= \left[ \frac{\sum_{i=1}^n x_i^2}{\left( \lambda + \sum_{i=1}^n x_i^2 \right)^2} \right] \sigma^2 \end{align*}

For any $\lambda > 0$, $w^*(\lambda)$ is not an unbiased estimator of $w$.
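The shrinkage behavior of the MAP estimator can also be checked numerically (a sanity check only; `n`, `w`, `sigma`, `lam` are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, w, sigma, lam = 50, 2.0, 0.5, 5.0
x = rng.uniform(0.0, 1.0, size=n)
sx2 = x @ x                             # sum of x_i^2, fixed across datasets

trials = 100_000
eps = rng.normal(0.0, sigma, size=(trials, n))
w_map = ((w * x + eps) @ x) / (lam + sx2)   # ridge solution per dataset

alpha = sx2 / (lam + sx2)               # shrinkage factor, strictly < 1
print(w_map.mean(), alpha * w)          # expectation is alpha * w, not w
print(w_map.var(), sx2 * sigma**2 / (lam + sx2)**2)
```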

(c) Suppose now that $x \sim \text{Unif}(0,1)$. For $x_1,\ldots,x_n$ fixed, the expected test error $\mathbb{E}_{x,y,D}[(h_D(x)-y)^2]$ can be decomposed into bias (squared), variance, and noise as $$\mathbb{E}_{x,y,D}[(h_D(x)-y)^2] = \mathbb{E}_x[(\bar{h}(x) - \bar{y}(x))^2] + \mathbb{E}_{x,D}[(h_D(x) - \bar{h}(x))^2] + \mathbb{E}_{x,y}[(\bar{y}(x) - y)^2]$$ where the expectation over $D$ is only over the $y_i$'s. Using (a) and (b), express each of the three terms for both linear regression and ridge regression.

Solution: By definition, $\bar{y}(x) = wx$. Since $x \sim \text{Unif}(0,1)$, we have $\mathbb{E}_x[x^2] = \int_0^1 x^2 \, dx = \frac{1}{3}$. The noise term is $$\mathbb{E}_{x,y}[(\bar{y}(x) - y)^2] = \mathbb{E}_\epsilon[\epsilon^2] = \text{Var}(\epsilon) = \sigma^2.$$ For linear regression, $\bar{h}(x) = \mathbb{E}_D[w^*] x = wx = \bar{y}(x)$, hence the bias squared term is 0. The variance term is \begin{align*} \mathbb{E}_{x,D}[(h_D(x) - \bar{h}(x))^2] &= \mathbb{E}_{x,D}[(w^* x - wx)^2] \\ &= \mathbb{E}_{x,D}[(w^* - w)^2 x^2] \\ &= \mathbb{E}_D[(w^* - w)^2] \mathbb{E}_x[x^2] \hspace{18pt} \text{by independence} \\ &= \frac{\text{Var}(w^*)}{3}. \end{align*} For ridge regression, $$\bar{h}(x) = \mathbb{E}_D[w^*(\lambda)] x = \left( \frac{\sum_{i=1}^n x_i^2}{\lambda + \sum_{i=1}^n x_i^2} \right) wx = \left( \frac{\sum_{i=1}^n x_i^2}{\lambda + \sum_{i=1}^n x_i^2} \right) \bar{y}(x).$$ Letting $\alpha = \frac{\sum_{i=1}^n x_i^2}{\lambda + \sum_{i=1}^n x_i^2}$, we get \begin{align*} \mathbb{E}_x[(\bar{h}(x) - \bar{y}(x))^2] &= \mathbb{E}_x[(\alpha wx - wx)^2] \\ &= (\alpha-1)^2 w^2 \mathbb{E}_x[x^2] \\ &= \frac{(\alpha-1)^2 w^2}{3}. \end{align*} The variance term can be derived exactly as in the linear regression case: $$\mathbb{E}_{x,D}[(h_D(x) - \bar{h}(x))^2] = \frac{\text{Var}(w^*(\lambda))}{3}.$$
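The decomposition can be verified numerically: a Monte Carlo estimate of the expected test error for ridge regression should match the sum of the three closed-form terms. This is only a sanity check; the constants `n`, `w`, `sigma`, `lam` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, w, sigma, lam = 20, 2.0, 1.0, 10.0
x_tr = rng.uniform(0.0, 1.0, size=n)    # fixed training features
sx2 = x_tr @ x_tr
alpha = sx2 / (lam + sx2)

# Closed-form terms from the solution (E_x[x^2] = 1/3 for Unif(0,1)):
bias2 = (alpha - 1.0)**2 * w**2 / 3.0
var   = sx2 * sigma**2 / (lam + sx2)**2 / 3.0
noise = sigma**2

# Monte Carlo estimate of E_{x,y,D}[(h_D(x) - y)^2]: one ridge fit and one
# independent test point (x, y) per trial.
trials = 200_000
eps = rng.normal(0.0, sigma, size=(trials, n))
w_hat = ((w * x_tr + eps) @ x_tr) / (lam + sx2)
x_te = rng.uniform(0.0, 1.0, size=trials)
y_te = w * x_te + rng.normal(0.0, sigma, size=trials)
test_err = np.mean((w_hat * x_te - y_te)**2)

print(test_err, bias2 + var + noise)    # the two should agree closely
```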

(d) If you know ahead of time that $\sigma^2$ is small, i.e. the label does not vary much given the feature, which model would you choose? What about if $\sigma^2$ is large?

Solution: The noise term $\sigma^2$ is the same for both methods. If $\sigma^2$ is small, both variance terms $\frac{\text{Var}(w^*)}{3}$ and $\frac{\text{Var}(w^*(\lambda))}{3}$ are close to 0, so the comparison is dominated by the bias squared term, which is 0 for linear regression but positive for ridge regression. Hence linear regression is better.

When $\sigma^2$ is large, we can use a large $\lambda$ to offset the $\sigma^2$ factor in $\text{Var}(w^*(\lambda))$, so the variance term for ridge regression is much smaller than for linear regression. Meanwhile, since $(\alpha - 1)^2 \leq 1$, the bias squared term for ridge regression is bounded by $\frac{w^2}{3}$, a constant that does not grow with $\sigma^2$. Hence ridge regression is better.
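The trade-off above can be illustrated with the closed-form expressions from (c). The sketch below compares the excess error (bias squared plus variance, excluding the shared noise term) of the two estimators for a small and a large $\sigma^2$; the constants `n`, `w`, `lam` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, w, lam = 20, 2.0, 10.0
x = rng.uniform(0.0, 1.0, size=n)
sx2 = x @ x
alpha = sx2 / (lam + sx2)

def excess_error(sigma):
    """Bias^2 + variance from part (c), excluding the shared noise sigma^2."""
    ols   = (sigma**2 / sx2) / 3.0                                   # zero bias
    ridge = ((alpha - 1)**2 * w**2 + sx2 * sigma**2 / (lam + sx2)**2) / 3.0
    return ols, ridge

print(excess_error(0.1))    # small noise: linear regression (OLS) wins
print(excess_error(10.0))   # large noise: ridge wins
```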

### Problem 2: Constructing kernels

In class you saw how to construct new kernels from existing ones. This problem asks you to prove that these construction techniques indeed produce valid kernels.

Recall that a function $k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is a kernel function if and only if the matrix $$K_{ij} = k(x_i,x_j)$$ is positive semidefinite for any set of vectors $x_1,\ldots,x_n \in \mathbb{R}^d$. Let $k_1,k_2$ be kernel functions. Prove that the following are kernel functions.

(a) $k(x,z) = x^\top A z$ where $A$ is a positive semidefinite matrix

Solution: Since $A$ is positive semidefinite, it is real and symmetric, so by the spectral theorem we can write $A = QDQ^\top$ with $Q$ orthogonal and $D$ a diagonal matrix of eigenvalues. Moreover, since $A$ is positive semidefinite, all eigenvalues of $A$ are non-negative, so the diagonal matrix $D^\frac{1}{2}$ where $$D^\frac{1}{2}_{ii} = \sqrt{D_{ii}}$$ is well-defined. This allows us to write $A = L^\top L$ with $L = D^\frac{1}{2} Q^\top$.

Now define the feature map $\phi(x) = Lx$. Then $k(x,z) = x^\top A z = (Lx)^\top (Lz) = \phi(x)^\top \phi(z)$, so $k$ is a kernel since it is an inner product of feature vectors.
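As a numerical illustration (not part of the proof), the sketch below builds a generic PSD matrix $A$, checks that the resulting kernel matrix is PSD, and confirms that the explicit feature map $\phi(x) = Lx$ reproduces it. All matrix sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 5, 30
B = rng.normal(size=(d, d))
A = B.T @ B                              # a generic PSD matrix
X = rng.normal(size=(n, d))              # rows are x_1, ..., x_n

K = X @ A @ X.T                          # K_ij = x_i^T A x_j

# PSD check: all eigenvalues of the (symmetrized) kernel matrix are >= 0.
eig = np.linalg.eigvalsh((K + K.T) / 2)
print(eig.min() >= -1e-9)                # True up to floating-point error

# Same matrix via the explicit feature map phi(x) = L x with A = L^T L.
w_eig, Q = np.linalg.eigh(A)
L = np.diag(np.sqrt(np.clip(w_eig, 0, None))) @ Q.T
Phi = X @ L.T                            # rows are phi(x_i)
print(np.allclose(K, Phi @ Phi.T))       # True
```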

(b) $k(x,z) = c k_1(x,z)$ for some $c \geq 0$

Solution: Since $c \geq 0$, the kernel matrix $K$ for $k$ and $x_1,\ldots,x_n \in \mathbb{R}^d$ is a non-negative multiple of the kernel matrix $K_1$ for $k_1$. Thus $K$ is positive semidefinite when $K_1$ is.
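A minimal numerical check of this fact (sizes and the constant `c` are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(15, 4))
K1 = X @ X.T                             # PSD kernel matrix for k1(x,z) = x^T z
c = 3.5
K = c * K1                               # kernel matrix for k(x,z) = c * k1(x,z)
print(np.linalg.eigvalsh(K).min() >= -1e-9)   # still PSD
```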

(c) $k(x,z) = k_1(x,z) + k_2(x,z)$

Solution: Recall from class that $k$ is a kernel if and only if there exists a feature function $\phi : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ (with $d'$ being possibly infinite) such that $k(x,z) = \phi(x)^\top \phi(z)$.

Let $\phi_1$ be the feature function for $k_1$ and let $\phi_2$ be the feature function for $k_2$. Define the feature function $\phi$ such that $\phi(x)$ is the concatenation of $\phi_1(x)$ and $\phi_2(x)$. Then $$k(x,z) = k_1(x,z) + k_2(x,z) = \phi_1(x)^\top \phi_1(z) + \phi_2(x)^\top \phi_2(z) = \phi(x)^\top \phi(z).$$ This shows that $k$ is a kernel.
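The concatenation argument can be illustrated with two toy kernels whose feature maps are known explicitly; the choices of `phi1` and `phi2` below are illustrative, not canonical.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 20, 3
X = rng.normal(size=(n, d))

# Two toy kernels with known feature maps: linear and squared-coordinates.
phi1 = lambda X: X                       # k1(x, z) = x^T z
phi2 = lambda X: X**2                    # k2(x, z) = sum_j x_j^2 z_j^2

K1 = phi1(X) @ phi1(X).T
K2 = phi2(X) @ phi2(X).T

# Concatenated feature map phi(x) = [phi1(x); phi2(x)] realizes the sum kernel.
Phi = np.hstack([phi1(X), phi2(X)])
print(np.allclose(Phi @ Phi.T, K1 + K2))   # True
```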

(d) $k(x,z) = k_1(x,z) k_2(x,z)$