Up to 5 members in each group. Please include NetIDs and names on the front page.

In this problem, we derive the bias and variance of the MLE and MAP solutions of one-dimensional linear regression. Let $$y_i = w x_i + \epsilon_i$$ where $x_i \in \mathbb{R}$ are input features, $w \in \mathbb{R}$ is the true model parameter, and $\epsilon_i \sim N(0,\sigma^2)$ are iid Gaussian noise. The MLE solution is $$w^* = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}.$$ For any $\lambda > 0$, the MAP (i.e. ridge regression) solution is $$w^*(\lambda) = \frac{\sum_{i=1}^n x_i y_i}{\lambda + \sum_{i=1}^n x_i^2}.$$

(a) Derive the expectation and variance of $w^*$. Is $w^*$ an unbiased estimator of $w$?

Solution: \begin{align*} \mathbb{E}[w^*] &= \mathbb{E} \left[ \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \right] \\ &= \frac{\sum_{i=1}^n x_i \mathbb{E}_D[y_i]}{\sum_{i=1}^n x_i^2} \\ &= \frac{\sum_{i=1}^n x_i w x_i}{\sum_{i=1}^n x_i^2} \\ &= w \end{align*} \begin{align*} \text{Var}(w^*) &= \text{Var} \left( \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \right) \\ &= \frac{\sum_{i=1}^n x_i^2 \text{Var}_D[y_i]}{\left( \sum_{i=1}^n x_i^2 \right)^2} \\ &= \frac{\sum_{i=1}^n x_i^2 \sigma^2}{\left( \sum_{i=1}^n x_i^2 \right)^2} \\ &= \frac{\sigma^2}{\sum_{i=1}^n x_i^2} \end{align*}

Since $\mathbb{E}[w^*] = w$, $w^*$ is an unbiased estimator of $w$.
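
A quick Monte Carlo check of these formulas can be helpful; the sketch below (with arbitrary illustrative values for $n$, $w$, and $\sigma$, which are not part of the problem) redraws the noise many times and compares the empirical mean and variance of $w^*$ to $w$ and $\sigma^2 / \sum_i x_i^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w_true, sigma = 20, 1.5, 0.5          # illustrative values, not from the problem
x = rng.uniform(0, 1, size=n)            # fixed inputs
trials = 100_000

# Draw fresh noise for each dataset D and recompute the MLE w*.
eps = rng.normal(0, sigma, size=(trials, n))
y = w_true * x + eps
w_mle = (y @ x) / (x @ x)                # (sum_i x_i y_i) / (sum_i x_i^2), one per trial

print("empirical mean:", w_mle.mean(), "vs w =", w_true)
print("empirical var :", w_mle.var(), "vs sigma^2 / sum x_i^2 =", sigma**2 / (x @ x))
```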

(b) For any $\lambda > 0$, derive the expectation and variance of $w^*(\lambda)$. Is $w^*(\lambda)$ an unbiased estimator of $w$?

Solution: \begin{align*} \mathbb{E}[w^*(\lambda)] &= \mathbb{E} \left[ \frac{\sum_{i=1}^n x_i y_i}{\lambda + \sum_{i=1}^n x_i^2} \right] \\ &= \frac{\sum_{i=1}^n x_i \mathbb{E}_D[y_i]}{\lambda + \sum_{i=1}^n x_i^2} \\ &= \frac{\sum_{i=1}^n x_i w x_i}{\lambda + \sum_{i=1}^n x_i^2} \\ &= \left( \frac{\sum_{i=1}^n x_i^2}{\lambda + \sum_{i=1}^n x_i^2} \right) w \end{align*} \begin{align*} \text{Var}(w^*(\lambda)) &= \text{Var} \left( \frac{\sum_{i=1}^n x_i y_i}{\lambda + \sum_{i=1}^n x_i^2} \right) \\ &= \frac{\sum_{i=1}^n x_i^2 \text{Var}_D[y_i]}{\left( \lambda + \sum_{i=1}^n x_i^2 \right)^2} \\ &= \frac{\sum_{i=1}^n x_i^2 \sigma^2}{\left( \lambda + \sum_{i=1}^n x_i^2 \right)^2} \\ &= \left[ \frac{\sum_{i=1}^n x_i^2}{\left( \lambda + \sum_{i=1}^n x_i^2 \right)^2} \right] \sigma^2 \end{align*}

For any $\lambda > 0$, $w^*(\lambda)$ is not an unbiased estimator of $w$.
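
The same kind of simulation, again with made-up values for $w$, $\sigma$, and $\lambda$, illustrates the shrinkage of $\mathbb{E}[w^*(\lambda)]$ and the smaller variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, w_true, sigma, lam = 20, 1.5, 0.5, 5.0   # illustrative values
x = rng.uniform(0, 1, size=n)
sxx = x @ x
trials = 100_000

eps = rng.normal(0, sigma, size=(trials, n))
y = w_true * x + eps
w_map = (y @ x) / (lam + sxx)                # ridge solution, one per trial

alpha = sxx / (lam + sxx)                    # shrinkage factor from part (b)
print("empirical mean:", w_map.mean(), "vs alpha * w =", alpha * w_true)
print("empirical var :", w_map.var(), "vs formula    =", sxx * sigma**2 / (lam + sxx)**2)
```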

(c) Suppose now that $x \sim \text{Unif}(0,1)$. For $x_1,\ldots,x_n$ *fixed*, the expected test error $\mathbb{E}_{x,y,D}[(h_D(x)-y)^2]$ can be decomposed into bias (squared), variance, and noise as
$$\mathbb{E}_{x,y,D}[(h_D(x)-y)^2] = \mathbb{E}_x[(\bar{h}(x) - \bar{y}(x))^2] + \mathbb{E}_{x,D}[(h_D(x) - \bar{h}(x))^2] + \mathbb{E}_{x,y}[(\bar{y}(x) - y)^2]$$
where the expectation over $D$ is only over the $y_i$'s. Using (a) and (b), express each of the three terms for both linear regression and ridge regression.

Solution: By definition, $\bar{y}(x) = wx$. The noise term is $$\mathbb{E}_{x,y}[(\bar{y}(x) - y)^2] = \mathbb{E}_\epsilon[\epsilon^2] = \text{Var}(\epsilon) = \sigma^2.$$ For linear regression, $\bar{h}(x) = \mathbb{E}_D[w^*] x = wx = \bar{y}(x)$, hence the bias squared term is 0. Since $x \sim \text{Unif}(0,1)$, we have $\mathbb{E}_x[x^2] = \frac{1}{3}$, so the variance term is \begin{align*} \mathbb{E}_{x,D}[(h_D(x) - \bar{h}(x))^2] &= \mathbb{E}_{x,D}[(w^* x - wx)^2] \\ &= \mathbb{E}_{x,D}[(w^* - w)^2 x^2] \\ &= \mathbb{E}_D[(w^* - w)^2] \mathbb{E}_x[x^2] \hspace{18pt} \text{by independence} \\ &= \frac{\text{Var}(w^*)}{3}. \end{align*} For ridge regression, $$\bar{h}(x) = \mathbb{E}_D[w^*(\lambda)] x = \left( \frac{\sum_{i=1}^n x_i^2}{\lambda + \sum_{i=1}^n x_i^2} \right) wx = \left( \frac{\sum_{i=1}^n x_i^2}{\lambda + \sum_{i=1}^n x_i^2} \right) \bar{y}(x).$$ Letting $\alpha = \frac{\sum_{i=1}^n x_i^2}{\lambda + \sum_{i=1}^n x_i^2}$, we get \begin{align*} \mathbb{E}_x[(\bar{h}(x) - \bar{y}(x))^2] &= \mathbb{E}_x[(\alpha wx - wx)^2] \\ &= (\alpha-1)^2 w^2 \mathbb{E}_x[x^2] \\ &= \frac{(\alpha-1)^2 w^2}{3}. \end{align*} The variance term can be derived exactly as in the linear regression case: $$\mathbb{E}_{x,D}[(h_D(x) - \bar{h}(x))^2] = \frac{\text{Var}(w^*(\lambda))}{3}.$$
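
As an optional numerical check (using the same illustrative constants as above, which are assumptions rather than part of the problem), one can evaluate the three closed-form terms for ridge regression and compare their sum against a direct Monte Carlo estimate of the expected test error.

```python
import numpy as np

rng = np.random.default_rng(2)
n, w_true, sigma, lam = 20, 1.5, 0.5, 5.0      # illustrative values
x_train = rng.uniform(0, 1, size=n)            # fixed training inputs
sxx = x_train @ x_train
alpha = sxx / (lam + sxx)

# Closed-form terms from part (c): E_x[x^2] = 1/3 for x ~ Unif(0,1).
bias_sq = (alpha - 1)**2 * w_true**2 / 3
variance = sxx * sigma**2 / (lam + sxx)**2 / 3
noise = sigma**2
print("bias^2 + var + noise =", bias_sq + variance + noise)

# Direct Monte Carlo estimate of E_{x,y,D}[(h_D(x) - y)^2].
trials = 200_000
eps = rng.normal(0, sigma, size=(trials, n))
w_hat = ((w_true * x_train + eps) @ x_train) / (lam + sxx)   # one w*(lambda) per D
x_test = rng.uniform(0, 1, size=trials)
y_test = w_true * x_test + rng.normal(0, sigma, size=trials)
print("empirical test error =", np.mean((w_hat * x_test - y_test)**2))
```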

(d) If you know ahead of time that $\sigma^2$ is small, i.e. the label does not vary much given the feature, which model would you choose? What about if $\sigma^2$ is large?

Solution: The noise term is the same for both methods. If $\sigma^2$ is small, both variance terms are close to 0, so the expected test error is dominated by the bias squared term, which is 0 for linear regression but positive for ridge regression. Hence linear regression is better.

When $\sigma^2$ is large, we can use a large $\lambda$ to offset the $\sigma^2$ factor in $\text{Var}(w^*(\lambda))$, so the variance term for ridge regression can be made much smaller than for linear regression. Meanwhile, the bias squared term for ridge regression is bounded by $\frac{w^2}{3}$, a constant that does not grow with $\sigma^2$. Hence ridge regression is better.

In class you saw how to construct new kernels from existing ones. This problem asks you to prove that these construction techniques indeed produce valid kernels.

Recall that a function $k : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is a kernel function if and only if the matrix $$K_{ij} = k(x_i,x_j)$$ is positive semidefinite for any set of vectors $x_1,\ldots,x_n \in \mathbb{R}^d$. Let $k_1,k_2$ be kernel functions. Prove that the following are kernel functions.

(a) $k(x,z) = x^\top A z$ where $A$ is a positive semidefinite matrix

Solution: Since $A$ is real and symmetric, the spectral theorem gives an orthogonal matrix $Q$ and a diagonal matrix $D$ of eigenvalues with $A = QDQ^\top$. Moreover, since $A$ is positive semidefinite, all eigenvalues of $A$ are non-negative, so the diagonal matrix $D^\frac{1}{2}$ with $$D^\frac{1}{2}_{ii} = \sqrt{D_{ii}}$$ is well-defined. This allows us to write $A = L^\top L$ with $L = D^\frac{1}{2} Q^\top$.

Now define feature function $\phi(x) = Lx$, so $k(x,z) = \phi(x)^\top \phi(z)$ is a well-defined kernel since it is the inner product of two vectors.
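
A small numerical illustration, not part of the proof: build a random PSD matrix $A$, form $K_{ij} = x_i^\top A x_j$ for random inputs, and confirm that the eigenvalues of $K$ are non-negative (up to floating-point error).

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 10
B = rng.normal(size=(d, d))
A = B.T @ B                      # B^T B is positive semidefinite
X = rng.normal(size=(n, d))      # rows are x_1, ..., x_n

K = X @ A @ X.T                  # K_ij = x_i^T A x_j
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # ~0 up to round-off
```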

(b) $k(x,z) = c k_1(x,z)$ for some $c \geq 0$

Solution: Since $c \geq 0$, the kernel matrix $K$ for $k$ and $x_1,\ldots,x_n \in \mathbb{R}^d$ is a non-negative multiple of the kernel matrix $K_1$ for $k_1$. Thus $K$ is positive semidefinite when $K_1$ is.

(c) $k(x,z) = k_1(x,z) + k_2(x,z)$

Solution: Recall from class that $k$ is a kernel if and only if there exists a feature function $\phi : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ (with $d'$ being possibly infinite) such that $k(x,z) = \phi(x)^\top \phi(z)$.

Let $\phi_1$ be the feature function for $k_1$ and let $\phi_2$ be the feature function for $k_2$. Define the feature function $\phi$ such that $\phi(x)$ is the concatenation of $\phi_1(x)$ and $\phi_2(x)$. Then $$k(x,z) = k_1(x,z) + k_2(x,z) = \phi_1(x)^\top \phi_1(z) + \phi_2(x)^\top \phi_2(z) = \phi(x)^\top \phi(z).$$ This shows that $k$ is a kernel.

(d) $k(x,z) = k_1(x,z) k_2(x,z)$

Solution: Let $\phi_1, \phi_2$ be as in (c). Then \begin{align*} k(x,z) &= (\phi_1(x)^\top \phi_1(z)) (\phi_2(x)^\top \phi_2(z)) \\ &= \left( \sum_{i=1}^{d_1} \phi_1(x)_i \phi_1(z)_i \right) \left( \sum_{j=1}^{d_2} \phi_2(x)_j \phi_2(z)_j \right) \\ &= \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} \phi_1(x)_i \phi_2(x)_j \phi_1(z)_i \phi_2(z)_j. \end{align*} Define the feature function $\phi : \mathbb{R}^d \rightarrow \mathbb{R}^{d_1 \times d_2}$ by $\phi(x)_{ij} = \phi_1(x)_i \phi_2(x)_j$, and take the inner product $\langle \cdot , \cdot \rangle$ to be the sum of the element-wise products of two matrices. This gives $k(x,z) = \langle \phi(x), \phi(z) \rangle$, so $k$ is a kernel.
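
Equivalently, this says that the entrywise (Schur) product of two PSD kernel matrices is again PSD. The sketch below checks this numerically on random feature maps $\phi_1$, $\phi_2$, chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d1, d2 = 8, 3, 5
Phi1 = rng.normal(size=(n, d1))      # rows are phi_1(x_i)
Phi2 = rng.normal(size=(n, d2))      # rows are phi_2(x_i)

K1 = Phi1 @ Phi1.T                   # kernel matrix for k_1
K2 = Phi2 @ Phi2.T                   # kernel matrix for k_2
K = K1 * K2                          # elementwise product = matrix for k_1 * k_2

print("min eigenvalue of K:", np.linalg.eigvalsh(K).min())   # non-negative up to round-off
```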

(e) $k(x,z) = q(k_1(x,z))$ where $q$ is a polynomial with non-negative coefficients

Solution: Let $q(t) = \sum_{i=0}^m c_i t^i$ for non-negative coefficients $c_0,\ldots,c_m$. The constant function $k'(x,z) = c_0$ is a kernel (take $\phi(x) = \sqrt{c_0}$). For each $i=1,\ldots,m$, apply (d) repeatedly to show that $k_1(x,z)^i$ is a kernel, then apply (b) to show that $c_i k_1(x,z)^i$ is a kernel. Finally, apply (c) to the sum to conclude that $k$ is a kernel.

(f) $k(x,z) = e^{k_1(x,z)}$

Solution: The idea is to use the Taylor expansion $e^t = \sum_{n=0}^\infty \frac{t^n}{n!}$, which converges pointwise for all $t \in \mathbb{R}$. The coefficients are all non-negative, so we can use (e) on this "infinite polynomial" to show that $k$ is a kernel.

Although we accept this argument for full credit, a more rigorous proof involves using the following lemma on pointwise limit of kernels.

**Lemma:** Suppose that $g_1,g_2,\ldots$ is an infinite sequence of kernels that converges pointwise to $g$. Then $g$ is a kernel.

We leave the proof of this lemma as an exercise. To prove (f) rigorously, define $g_m(x,z) = \sum_{j=0}^m \frac{k_1(x,z)^j}{j!}$. This is a sequence of kernels (which can be shown using (e)) that converges pointwise to $k$, so $k$ is a kernel by the above lemma.
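
As a sanity check of (f) on an example (not a proof), entrywise exponentiation of a kernel matrix built from a linear kernel yields a PSD matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 3))
K1 = X @ X.T                          # a valid kernel matrix (linear kernel)
K = np.exp(K1)                        # entrywise exp, i.e. k(x,z) = e^{k_1(x,z)}
print("min eigenvalue:", np.linalg.eigvalsh(K).min())   # non-negative up to round-off
```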

It is possible to define kernel functions over discrete objects such as strings. This is used in applications such as bio-informatics and natural language processing. Let $\Sigma$ be the alphabet space and let $x,z \in \Sigma^*$. Define \[ \delta(x,z) = \begin{cases} 1 & \text{if } x = z \\ 0 & \text{otherwise} \end{cases} \]

(a) Let $\sigma_n(x)$ denote the set of $n$-grams of $x$, e.g. $\sigma_2(\texttt{abcd}) = \{\texttt{ab},\texttt{bc},\texttt{cd}\}$. Express the kernel function $$k(x,z) = \sum_{s \in \sigma_n(x)} \sum_{t \in \sigma_n(z)} \delta(s,t)$$ using an appropriate feature expansion.

Solution: Let $S = \{s_1,s_2,\ldots,s_N\}$ be the set of $n$-grams of $\Sigma$. Let $\phi : \Sigma^* \rightarrow \mathbb{R}^N$ be such that \[ \phi(x)_i = \begin{cases} 1 & \text{if } s_i \in \sigma_n(x) \\ 0 & \text{otherwise} \end{cases} \] Then $\phi(x)_i \phi(z)_i = 1$ if and only if $s_i \in \sigma_n(x) \cap \sigma_n(z)$. Hence $$k(x,z) = \sum_{s \in \sigma_n(x)} \sum_{t \in \sigma_n(z)} \delta(s,t) = \sum_{s_i \in \sigma_n(x) \cap \sigma_n(z)} 1 = \sum_{s_i \in S} \phi(x)_i \phi(z)_i = \phi(x)^\top \phi(z).$$
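
A small sketch in code, using the set interpretation above, computes $k$ both by the double sum over $\delta$ and via the feature-expansion view (the size of the intersection of $n$-gram sets); the two agree.

```python
def ngrams(x, n):
    """Set of n-grams of string x."""
    return {x[i:i + n] for i in range(len(x) - n + 1)}

def k_delta(x, z, n):
    """Double sum over delta(s, t): counts common n-grams."""
    return sum(1 for s in ngrams(x, n) for t in ngrams(z, n) if s == t)

def k_feature(x, z, n):
    """Same value via the feature expansion phi(x)_i = 1[s_i in sigma_n(x)]."""
    return len(ngrams(x, n) & ngrams(z, n))

print(k_delta("abcd", "bcde", 2), k_feature("abcd", "bcde", 2))   # both 2: {"bc", "cd"}
```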

(b) Let $\tau(x)$ denote the set of words of $x$ delimited by whitespace, e.g. $\tau(\texttt{winter is cold}) = \{\texttt{winter},\texttt{is},\texttt{cold}\}$. Express the kernel function $$k(x,z) = \sum_{s \in \tau(x)} \sum_{t \in \tau(z)} \delta(s,t)$$ using an appropriate feature expansion.

A feature expansion function $\phi : \Sigma^* \rightarrow \mathbb{R}^d$ is such that $k(x,z) = \phi(x)^\top \phi(z)$. Note that $d$ does not necessarily have to be finite.

Solution: Let $\Sigma^* = \{s_1,s_2,\ldots\}$ be an enumeration of $\Sigma^*$. We can define $\phi$ similarly, except that $\phi(x)$ is an infinite-dimensional vector.

**Note:** Both (a) and (b) have an alternative interpretation where $\sigma_n(x)$ and $\tau(x)$ are *multisets* rather than sets, i.e. the same element may appear multiple times. In this case, defining $\phi(x)_i$ to be the number of occurrences of $s_i$ works.
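
Under the multiset interpretation, the kernel of (b) is just the dot product of word-count vectors. A sketch (the helper name `k_words` is ours, for illustration):

```python
from collections import Counter

def k_words(x, z):
    """Multiset word kernel: the double sum over delta(s, t), with multiplicities,
    equals the dot product of word-count vectors."""
    cx, cz = Counter(x.split()), Counter(z.split())
    return sum(cx[word] * cz[word] for word in cx)

print(k_words("winter is cold cold", "cold cold cold"))   # 2 * 3 = 6
```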

One weakness of logistic regression is that it is a linear classifier. In this problem, you will use kernel functions to derive a non-linear version of logistic regression.

Given the parameter vector $w$, the model predicts $$\mathbb{P}(y | x) = \frac{1}{1 + e^{-yw^\top x}}.$$ The MLE estimate for $w$ can be found by minimizing the negative log-likelihood over the training data $(x_1,y_1),\ldots,(x_n,y_n)$: $$\ell(w) = \sum_{i=1}^n \log(1+e^{-y_i w^\top x_i}).$$

(a) Express the gradient of $\ell(w)$ as a linear combination of input vectors $x_1,\ldots,x_n$. You may use your solution from Homework 3 for the gradient.

Solution: The gradient with respect to $w$ is $$\nabla \ell(w) = -\sum_{i=1}^n \frac{y_i x_i e^{-y_i w^\top x_i}}{1+e^{-y_i w^\top x_i}}.$$ Each term in the sum is a scalar multiple of $x_i$, so the gradient, and hence every gradient descent update, is a linear combination of $x_1,\ldots,x_n$.

(b) Use (a) to express the model in terms of inner products between training inputs and a test input $x$. Write down the kernel logistic regression model by replacing inner products with a kernel function $k$.

Solution: If we initialize $w$ to the all-zero vector, then since every gradient step adds a linear combination of the $x_i$'s (by (a)), we can write $w = \sum_{i=1}^n \alpha_i x_i$ for some $\alpha_i \in \mathbb{R}$. The logistic regression model becomes $$\mathbb{P}(y | x) = \frac{1}{1 + e^{-y\sum_{i=1}^n \alpha_i x_i^\top x}}.$$ Replacing inner products with a kernel $k$, we get $$\mathbb{P}(y | x) = \frac{1}{1 + e^{-y\sum_{i=1}^n \alpha_i k(x_i,x)}}.$$

(c) For the kernelized model, write down the loss function $\ell(\alpha)$ for the coefficient vector $\alpha$ and derive its gradient.

Solution: $$\ell(\alpha) = \sum_{i=1}^n \log(1+e^{-y_i \sum_{j=1}^n \alpha_j k(x_j,x_i)}) = \sum_{i=1}^n \log(1+e^{-y_i (K \alpha)_i})$$ where $K_{ij} = k(x_i,x_j)$ is the kernel matrix. Writing $K_i$ for the $i$-th row of $K$ (viewed as a column vector), so that $(K\alpha)_i = K_i^\top \alpha$, the gradient is $$\nabla \ell(\alpha) = \sum_{i=1}^n \frac{\nabla_\alpha \left(1+e^{-y_i(K \alpha)_i}\right)}{1+e^{-y_i(K \alpha)_i}} = -\sum_{i=1}^n \frac{y_i e^{-y_i(K \alpha)_i}}{1+e^{-y_i(K \alpha)_i}} K_i.$$
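
A sketch of this loss and gradient in code, together with a finite-difference check; the RBF kernel and random data are arbitrary choices for illustration.

```python
import numpy as np

def loss_and_grad(alpha, K, y):
    """Negative log-likelihood of kernel logistic regression and its gradient."""
    m = y * (K @ alpha)                        # margins y_i (K alpha)_i
    loss = np.sum(np.log1p(np.exp(-m)))
    # d/d alpha of log(1 + e^{-m_i}) = -y_i e^{-m_i} / (1 + e^{-m_i}) * K_i
    coef = -y * np.exp(-m) / (1 + np.exp(-m))
    grad = K.T @ coef                          # sum_i coef_i K_i
    return loss, grad

# Finite-difference check on random data (illustrative only).
rng = np.random.default_rng(6)
n = 5
X = rng.normal(size=(n, 2))
y = rng.choice([-1.0, 1.0], size=n)
K = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1))   # RBF kernel matrix
alpha = rng.normal(size=n)

_, g = loss_and_grad(alpha, K, y)
eps = 1e-6
num = np.array([(loss_and_grad(alpha + eps * np.eye(n)[i], K, y)[0]
                 - loss_and_grad(alpha - eps * np.eye(n)[i], K, y)[0]) / (2 * eps)
                for i in range(n)])
print("max gradient error:", np.max(np.abs(g - num)))   # should be tiny
```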

(d) Suppose now that we add a regularization term to the parameter vector for the original model, i.e. $$\ell(w) = \sum_{i=1}^n \log(1+e^{-y_i w^\top x_i}) + \lambda \|w\|_2^2.$$ Express the equivalent kernelized loss function for $\alpha$ and derive its gradient.

At this point, you should be able to use gradient descent to train the kernel logistic regression model for any binary classification task!

Solution: Writing $w = \sum_{i=1}^n \alpha_i x_i$ gives $$\|w\|_2^2 = w^\top w = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j x_i^\top x_j.$$ Replacing each inner product with $k(x_i,x_j)$ and rewriting in terms of matrix products gives $$\ell(\alpha) = \sum_{i=1}^n \log(1+e^{-y_i (K \alpha)_i}) + \lambda \alpha^\top K \alpha.$$ Its gradient is $$\nabla \ell(\alpha) = -\sum_{i=1}^n \frac{y_i e^{-y_i(K \alpha)_i}}{1+e^{-y_i(K \alpha)_i}} K_i + 2\lambda K \alpha.$$
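
Putting everything together, a minimal gradient-descent training loop for the regularized kernel logistic regression objective; the RBF kernel, step size, and hyperparameters below are illustrative assumptions, not prescribed by the problem.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = np.sum((A[:, None] - B[None]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def train_klr(X, y, lam=0.1, lr=0.01, steps=2000):
    """Gradient descent on l(alpha) = sum_i log(1 + e^{-y_i (K alpha)_i}) + lam * alpha^T K alpha."""
    K = rbf_kernel(X, X)
    alpha = np.zeros(len(y))
    for _ in range(steps):
        m = y * (K @ alpha)
        coef = -y * np.exp(-m) / (1 + np.exp(-m))
        grad = K.T @ coef + 2 * lam * K @ alpha
        alpha -= lr * grad
    return alpha

def predict_proba(alpha, X_train, X_test):
    """P(y = +1 | x) = 1 / (1 + exp(-sum_j alpha_j k(x_j, x)))."""
    scores = rbf_kernel(X_test, X_train) @ alpha
    return 1 / (1 + np.exp(-scores))

# Toy usage on synthetic data (illustrative only).
rng = np.random.default_rng(7)
X = rng.normal(size=(40, 2))
y = np.where(np.sum(X**2, axis=1) > 2, 1.0, -1.0)     # a non-linear decision boundary
alpha = train_klr(X, y)
print("train accuracy:", np.mean((predict_proba(alpha, X, X) > 0.5) == (y > 0)))
```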