Homework 3

Due: Friday, March 17th in class
Up to 5 members in each group. Please include NetIDs and names on the front page.

ERM

I. Loss function optimization using gradient descent

Derive the gradient update (with stepsize $c$) for your weight vector $w$ for each of the following loss functions: (here: $\|w\|_2^2=w^\top w$ and $|w|=\sum_{\alpha=1}^d |w_\alpha|$, also $\lambda$ and $C$ are non-negative constants.)
a
Ridge Regression: ${\cal L}(w)=\sum_{i=1}^n (w^\top x_i-y_i)^2+\lambda \|w\|_2^2$. (You will implement this in a future project)
solution:
Let $X = \left[x_1, x_2, \cdots, x_n\right]^\top, Y = \left[y_1, y_2, \cdots, y_n\right]^\top$,

$$\begin{eqnarray*} {\cal L}(w)&=&\sum_{i=1}^n (w^\top x_i-y_i)^2+\lambda \|w\|_2^2\\ &=& (Xw - Y)^\top(Xw - Y) + \lambda w^\top w \\ &=& w^\top X^\top Xw - w^\top X^\top Y - Y^\top Xw + Y^\top Y + \lambda w^\top w\\ \bigtriangledown {\cal L}(w) &=& 2X^\top Xw - 2X^\top Y + 2\lambda w \end{eqnarray*}$$ $$w = w - c \bigtriangledown {\cal L}(w) \mbox{, where } c \mbox{ is the stepsize.}$$
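As a quick numerical sanity check (not part of the required derivation), here is a minimal numpy sketch of this update; the data `X`, `Y` and the constants `lam`, `c` are made-up placeholders, and the result is compared against the closed-form ridge solution.

```python
import numpy as np

# Sketch of the ridge update w <- w - c * (2 X^T X w - 2 X^T Y + 2 lam w).
# X, Y, lam and c below are made-up placeholders, not values from the assignment.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam, c = 0.1, 1e-3

w = np.zeros(d)
for _ in range(1000):
    grad = 2 * X.T @ (X @ w - Y) + 2 * lam * w
    w = w - c * grad

# The iterates should approach the closed-form minimizer (X^T X + lam I)^{-1} X^T Y.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
print(np.allclose(w, w_closed, atol=1e-3))
```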
b
Lasso Regression: ${\cal L}(w)=\sum_{i=1}^n (w^\top x_i-y_i)^2+\lambda |w|$
solution:
The only difference from the previous loss is the regularization term. Since $|w|$ is not differentiable where a coordinate is zero, we use a subgradient, with $\mathrm{sign}(\cdot)$ applied elementwise and $\mathrm{sign}(0)=0$: $$\bigtriangledown {\cal L}(w) = 2X^\top Xw - 2X^\top Y + \lambda\, \mathrm{sign}(w) $$ $$w = w - c \bigtriangledown {\cal L}(w)$$
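A minimal sketch of the corresponding (sub)gradient step in numpy, assuming the same matrix conventions as above; the names `lasso_subgradient`, `X`, `Y`, `lam`, `c` are illustrative only.

```python
import numpy as np

def lasso_subgradient(w, X, Y, lam):
    """Subgradient of sum_i (w^T x_i - y_i)^2 + lam * ||w||_1.
    np.sign is applied elementwise and returns 0 at w_j == 0,
    which is an admissible subgradient of |w_j| there."""
    return 2 * X.T @ (X @ w - Y) + lam * np.sign(w)

# One (sub)gradient step, with placeholder X, Y, w, lam, c:
# w = w - c * lasso_subgradient(w, X, Y, lam)
```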
c
Logistic Regression ($y_i\in\{+1,-1\}$): ${\cal L}(w)=\sum_{i=1}^n \log(1+\exp{(-y_i w^\top x_i)})$
solution:
$$\bigtriangledown {\cal L}(w) = -\sum_{i=1}^n \frac{\exp{(-y_i w^\top x_i)}}{1+\exp{(-y_i w^\top x_i)}} y_ix_i = -\sum_{i=1}^n \frac{y_ix_i}{1+\exp{(y_i w^\top x_i)}} $$ $$w = w - c \bigtriangledown {\cal L}(w)$$
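The second form of the gradient is the one to implement, since it avoids evaluating $\exp(-y_i w^\top x_i)$ in both numerator and denominator. A short numpy sketch with made-up data (the variable names and constants are illustrative only):

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of sum_i log(1 + exp(-y_i w^T x_i)) for labels y_i in {+1,-1},
    using the second form above: -sum_i y_i x_i / (1 + exp(y_i w^T x_i))."""
    margins = y * (X @ w)                     # y_i * w^T x_i
    weights = 1.0 / (1.0 + np.exp(margins))   # = 1 / (1 + exp(y_i w^T x_i))
    return -X.T @ (weights * y)

# Made-up placeholder data to exercise the update w <- w - c * logistic_grad(w, X, y).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50))
w, c = np.zeros(3), 0.01
for _ in range(500):
    w = w - c * logistic_grad(w, X, y)
```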
d
Linear Support Vector Machine ($y_i\in\{+1,-1\}$): ${\cal L}(w)=C\sum_{i=1}^n \max(1-y_iw^\top x_i,0)+\|w\|_2^2$
solution:
Here $\delta(\cdot)$ denotes the indicator function; the hinge loss is not differentiable at $1-y_iw^\top x_i=0$, so this is a subgradient: $$\bigtriangledown {\cal L}(w) = -C\sum_{i=1}^n \delta(1-y_iw^\top x_i > 0)y_ix_i + 2w$$ $$w = w - c \bigtriangledown {\cal L}(w)$$
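A sketch of this subgradient in numpy, with `active` playing the role of $\delta(1-y_iw^\top x_i > 0)$; the function name and arguments are illustrative placeholders.

```python
import numpy as np

def svm_subgradient(w, X, y, C):
    """Subgradient of C * sum_i max(1 - y_i w^T x_i, 0) + ||w||_2^2.
    `active` marks the examples with 1 - y_i w^T x_i > 0, i.e. delta(...) = 1."""
    active = (1 - y * (X @ w)) > 0
    return -C * X[active].T @ y[active] + 2 * w

# One step of w <- w - c * svm_subgradient(w, X, y, C); X, y, C, c are placeholders.
```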

II. Another look at SVM

You suddenly have this vision that it might be beneficial to pre-compute all inner products in your data set, i.e., you store an $n\times n$ matrix $K_{ij}=x_i^\top x_j$.
a
Imagine you start the gradient descent procedure from above with initial weights $w^0=0$ (this is just to help intuition). Take the gradient update of SVM (question I.d) and show that after $t$ gradient steps you can express the weight vector as $w^t=\sum_{i=1}^n\alpha_i^t x_i$ for some values $\alpha_i^t$. (HINT: each gradient step updates the $\alpha$'s; if the update is correct, you can always write $w^t=\sum_{i=1}^n\alpha_i^t x_i$, from update $t=0$ to any update $t>0$.)
solution:
In the previous problem, we have derived the update rule to be $$\begin{eqnarray*} w &=& w - c \bigtriangledown {\cal L}(w)\\ &=& (1-2c)w + c C\sum_{i=1}^n \delta(1-y_iw^\top x_i > 0)y_ix_i\\ \end{eqnarray*}$$
Initialize $\alpha_i^0 = 0, \forall i = 1, \cdots, n$, and define $$\Delta\alpha_i^t =\left\{ \begin{array}{ll} cCy_i & \mbox{if } 1-y_i{w^t}^\top x_i > 0\\ 0 & \mbox{otherwise.} \end{array}\right.$$
At each iteration, we update $\alpha_i^{t+1} = (1-2c)\alpha_i^t + \Delta\alpha_i^t$.
Starting at iteration $0$ we have $w^0 = \sum_{i=1}^n\alpha_i^0 x_i = 0$. If $w^t = \sum_{i=1}^n\alpha_i^t x_i$, then the gradient step gives $w^{t+1} = (1-2c)w^t + cC\sum_{i=1}^n \delta(1-y_i{w^t}^\top x_i > 0)y_ix_i = \sum_{i=1}^n\left((1-2c)\alpha_i^t + \Delta\alpha_i^t\right)x_i = \sum_{i=1}^n\alpha_i^{t+1} x_i$. By induction, $w^t = \sum_{i=1}^n\alpha_i^t x_i$ for every $t \geq 0$.
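A small numerical illustration of this argument (with made-up data and constants): running the primal update for $w$ and the $\alpha$-update side by side, the identity $w^t=\sum_i\alpha_i^t x_i$ holds at every step.

```python
import numpy as np

# Made-up data; C and c are arbitrary placeholder constants.
rng = np.random.default_rng(0)
n, d = 30, 4
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))
C, c = 1.0, 1e-3

w = np.zeros(d)        # primal iterate w^t
alpha = np.zeros(n)    # coefficients alpha^t
for t in range(100):
    active = (1 - y * (X @ w)) > 0                       # delta(1 - y_i w^T x_i > 0), using w^t
    w = (1 - 2 * c) * w + c * C * X[active].T @ y[active]
    alpha = (1 - 2 * c) * alpha + c * C * y * active     # Delta alpha_i^t = cCy_i on active points

print(np.allclose(w, X.T @ alpha))   # w^t == sum_i alpha_i^t x_i
```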
b
Take this new definition $w = \sum_{i=1}^n \alpha_i x_i$ and substitute it into the loss function of SVM (question I.d). You can now write the loss function as ${\cal L}(\alpha)$ where $\alpha=[\alpha_1 \dots \alpha_n]^\top$ (no need to force the vector representation in your formula). Further, write the loss ${\cal L}(\alpha)$ only using the precomputed matrix entries $K_{ij}$.
solution:
Let $X = \left[x_1, x_2, \cdots, x_n\right]^\top$; then $w = X^\top \alpha$.
Let $K_j = \left[K_{1j}, K_{2j}, \cdots, K_{nj}\right]^\top$ denote the $j$-th column of $K=XX^\top$. Then $$\begin{eqnarray*} {\cal L}(\alpha)&=& C\sum_{i=1}^n \max(1-y_i(X^\top \alpha)^\top x_i,0)+\|X^\top \alpha\|_2^2\\ &=& C\sum_{i=1}^n \max(1-y_i\alpha^\top X x_i,0)+ \alpha^\top XX^\top \alpha\\ &=& C\sum_{i=1}^n \max(1-y_i\alpha^\top K_i,0)+ \alpha^\top K \alpha \end{eqnarray*}$$
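As a quick check of this rewriting (with made-up data), the kernelized loss evaluated with $K$ matches the primal loss evaluated at $w=X^\top\alpha$:

```python
import numpy as np

# Made-up data; alpha and C are arbitrary placeholders.
rng = np.random.default_rng(0)
n, d = 30, 4
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))
alpha = rng.normal(size=n)
C = 1.0

K = X @ X.T                       # K_ij = x_i^T x_j
w = X.T @ alpha                   # w = sum_i alpha_i x_i

loss_primal = C * np.sum(np.maximum(1 - y * (X @ w), 0)) + w @ w
loss_kernel = C * np.sum(np.maximum(1 - y * (K @ alpha), 0)) + alpha @ K @ alpha
print(np.isclose(loss_primal, loss_kernel))
```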
c
Can you derive a gradient descent update rule of SVM (question I.d) with respect to $\alpha_i$?
solution:
$$\bigtriangledown {\cal L}(\alpha) = -C\sum_{i=1}^n \delta(1-y_i\alpha^\top K_i > 0)y_iK_i + 2K\alpha$$ $$\alpha = \alpha - c \bigtriangledown {\cal L}(\alpha)$$
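A sketch of this update in numpy; only the precomputed Gram matrix `K` is needed, and the function name and arguments are illustrative placeholders.

```python
import numpy as np

def kernel_svm_subgradient(alpha, K, y, C):
    """Subgradient of C * sum_i max(1 - y_i alpha^T K_i, 0) + alpha^T K alpha
    with respect to alpha, where K_i is the i-th column of the Gram matrix K."""
    margins = y * (K @ alpha)          # y_i * alpha^T K_i
    active = (1 - margins) > 0
    return -C * K[:, active] @ y[active] + 2 * K @ alpha

# One step of alpha <- alpha - c * kernel_svm_subgradient(alpha, K, y, C).
```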

III. Weighted Ridge Regression

Assume that in addition to your data $\left\{(x_1,y_1),\dots,(x_n,y_n)\right\}$ you also have weights $p_i\geq 0$ for each example. Let your loss function be \begin{equation} {\cal L}(w)=\sum_{i=1}^n p_i(w^\top x_i-y_i)^2+\lambda w^\top w.\label{prob2:loss} \end{equation}
a
Rephrase the previous equation in terms of the matrices $X=[x_1,\dots,x_n]^\top$, $Y=[y_1,\dots,y_n]^\top$ and the diagonal matrix $P=diag([p_1,\dots,p_n])$ (where the diag operator performs like the Matlab function with the same name.)
solution:
$${\cal L}(w)=(Xw-Y)^\top P(Xw-Y)+\lambda w^\top w.$$
b
Derive a closed form solution for $w$. (You can use: $\frac{\partial (w^\top A)}{\partial w}=A$, $\frac{\partial (w^\top B w)}{\partial w}=Bw + B^\top w$ and $w^\top w=w^\top I w$ where $I$ is the identity matrix.)
solution:
$$\frac{\partial {\cal L}(w)}{\partial w} = 2X^\top P(Xw-Y) + 2\lambda Iw = 0$$ $$w = (X^\top PX+\lambda I)^{-1} X^\top PY$$
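A minimal numpy sketch of this closed form (the function name and arguments are illustrative); `p` holds the example weights $p_i$, and `solve` is used instead of forming the inverse explicitly.

```python
import numpy as np

def weighted_ridge(X, Y, p, lam):
    """Closed-form weighted ridge solution w = (X^T P X + lam I)^{-1} X^T P Y."""
    d = X.shape[1]
    XtP = X.T * p   # equals X^T P without building the diagonal matrix P
    return np.linalg.solve(XtP @ X + lam * np.eye(d), XtP @ Y)

# Placeholder usage: w = weighted_ridge(X, Y, p, lam), with p_i >= 0 and lam > 0.
```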

IV. Regularized Logistic Regression & MAP estimate

a
One way to reduce overfitting is by adding a regularization term to the loss function. For example, look at Ridge Regression (I.a): it is just a regularized version of ordinary least squares, where the regularization term is $\lambda ||w||^2_2$ ($\lambda$ is the regularization parameter and is fixed). Define a "Regularized Logistic Regression" (RLR) loss function where the regularization term is a multiple $\lambda$ of the squared 2-norm of $w$ (as in Ridge Regression): ${\cal L}(w)=\sum_{i=1}^n \log(1+\exp{(-y_i w^\top x_i)}) + \lambda||w||^2_2 $
b
Another reasonable (seemingly unrelated) way to compute an estimate for logistic regression is through the Maximum A Posteriori (MAP) estimate: $$w_{MAP}=argmax\prod_{i=1}^N P(y_i|x_i,w)P(w)$$ where we assume that $P(y_i|x_i,w)=\frac{1}{1+\exp(-y_i w^\top x_i)}$ and that the weights have independent zero-mean Gaussian priors with variance $\sigma^2$, i.e., $P(w)=\prod_{j=1}^d \frac{1}{\sqrt{2\pi}\sigma}\exp\left(\frac{-w_j^2}{2\sigma^2}\right)$. Show that for a particular value of $\sigma$ and $\lambda$, the MAP estimate is the same as the $w$ obtained by minimizing the RLR loss.
solution:
$$\begin{eqnarray*} w_{MAP}&=&argmax\prod_{i=1}^N P(y_i|x_i,w)P(w)\\ &=&argmax \log \left(\prod_{i=1}^N P(y_i|x_i,w) P(w)\right)\\ &=&argmax \sum_{i=1}^N \log(P(y_i|x_i,w)) + \log(P(w))\\ &=&argmax \sum_{i=1}^N \log{\frac{1}{1+\exp(-y_iw^\top x_i)}} + \log (P(w))\\ &=&argmax \sum_{i=1}^N -\log(1+\exp(-y_iw^\top x_i)) + \log (P(w))\\ \end{eqnarray*}$$ Note: $$\begin{eqnarray*} \log(P(w)) &=&\log\left(\prod_{j=1}^d \frac{1}{\sqrt{2 \pi} \sigma}\exp\left({\frac{-w_j^2}{2 \sigma^2}}\right)\right)\\ &=& d\log\left(\frac{1}{\sqrt{2 \pi} \sigma}\right)+ \sum_{j=1}^d \log\left(\exp\left({\frac{-w_j^2}{2 \sigma^2}}\right)\right)\\ &=& d\log\left(\frac{1}{\sqrt{2 \pi} \sigma}\right)+ \sum_{j=1}^d \frac{-w_j^2}{2 \sigma^2}\\ &=& d\log\left(\frac{1}{\sqrt{2 \pi} \sigma}\right)- \sum_{j=1}^d \frac{w_j^2}{2 \sigma^2}\\ &=& d\log\left(\frac{1}{\sqrt{2 \pi} \sigma}\right) - \frac{||w||_2^2}{2 \sigma^2}\\ \end{eqnarray*}$$ Hence: $$\begin{eqnarray*} w_{MAP}&=&argmax \sum_{i=1}^N -\log(1+\exp(-y_iw^\top x_i)) + d\log\left(\frac{1}{\sqrt{2 \pi} \sigma}\right) - \frac{||w||_2^2}{2 \sigma^2}\\ &=&argmax - \left(\sum_{i=1}^N \log(1+\exp(-y_iw^\top x_i)) + \frac{||w||_2^2}{2 \sigma^2}\right)\\ &=&argmin \sum_{i=1}^N \log(1+\exp(-y_iw^\top x_i)) + \frac{||w||_2^2}{2 \sigma^2}\\ \end{eqnarray*}$$ (The constant $d\log(\frac{1}{\sqrt{2\pi}\sigma})$ does not depend on $w$ and can be dropped.) Hence if $\lambda = \frac{1}{2 \sigma^2}$, $w_{MAP}$ also minimizes the loss function of the RLR.
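A small numerical illustration of this equivalence (data, $\sigma$, and the test points are made up): with $\lambda = \frac{1}{2\sigma^2}$, the RLR loss and the negative log posterior differ by the same constant at every $w$, so they share the same minimizer.

```python
import numpy as np

# Made-up data and sigma; the w's at which we evaluate are random test points.
rng = np.random.default_rng(0)
n, d = 40, 3
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))
sigma = 0.7
lam = 1.0 / (2 * sigma ** 2)

def rlr_loss(w):
    return np.sum(np.log1p(np.exp(-y * (X @ w)))) + lam * w @ w

def neg_log_posterior(w):
    log_lik = -np.sum(np.log1p(np.exp(-y * (X @ w))))
    log_prior = d * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - w @ w / (2 * sigma ** 2)
    return -(log_lik + log_prior)

ws = [rng.normal(size=d) for _ in range(3)]
diffs = [neg_log_posterior(w) - rlr_loss(w) for w in ws]
print(np.allclose(diffs, diffs[0]))   # same offset -d*log(1/(sqrt(2*pi)*sigma)) everywhere
```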

V. Linearity of Naive Bayes with Gaussian likelihood factors

In this question, you will show that Naive Bayes is a linear classifier when using Gaussian likelihood factors with shared variances. Specifically, consider the following Naive Bayes model: \begin{equation} p(y|x) = \frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y)p(y)}{p(x)} \end{equation} With: \begin{equation} p([x]_{\alpha}|y) = \mathcal{N}([\mu_{y}]_{\alpha},[\sigma^{2}]_{\alpha}) \end{equation} That is, there is a separate mean value for each feature $[x]_{\alpha}$ and each class $y \in \left\{0,1\right\}$. However, the variances are shared across classes, so that there is only one variance $[\sigma^{2}]_{\alpha}$ per feature.
a
Show that the decision rule $p(y=1|x)$ can equivalently be written as \begin{equation} p(y=1|x) = \frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)p(y=1)}{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)p(y=1)+\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0)p(y=0)}. \end{equation} Hint: remember the sum rule and product rule.
solution:
First, note that the numerator follows immediately from Bayes' rule; we have simply substituted the given value $y=1$ for $y$ in the first equation of this question. For the denominator, we expand $p(x)$ using first the sum rule and then the product rule. By the sum rule, $p(x)=p(x,y=1)+p(x,y=0)$. Applying the product rule to both terms on the right-hand side, we get: \begin{equation} p(x) = p(x|y=1)p(y=1)+p(x|y=0)p(y=0) \end{equation} Next, we apply the Naive Bayes assumption to $p(x|y=1)$ and $p(x|y=0)$ to get: \begin{equation} p(x) = \prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)p(y=1)+\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0)p(y=0) \end{equation} Plugging this in for the denominator in Bayes' rule, we obtain the desired result.
b
Using this formulation, show how to rewrite $p(y=1|x)$ as: \begin{equation} p(y=1|x) = \frac{1}{1+\exp{\left(-\log\frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)p(y=1)}{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0)p(y=0)}\right)}} \end{equation}
solution:
Observe that, in general, $\frac{a}{a+b}$ can equivalently be written as $\frac{1}{1+\frac{b}{a}}$. Furthermore, the equation for $p(y=1|x)$ derived in the previous part has exactly the form $\frac{a}{a+b}$! Therefore, we can rewrite it as: \begin{equation} p(y=1 \mid x)=\frac{1}{1+\frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0)p(y=0)}{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)p(y=1)}} \end{equation} Next, since $\exp(\log(x))=x$, \begin{equation} p(y=1 \mid x)=\frac{1}{1+\exp\left(\log\left(\frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0)p(y=0)}{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)p(y=1)}\right)\right)} \end{equation} Finally, pulling a negative sign out of the log lets us flip the fraction inside: \begin{equation} p(y=1 \mid x) = \frac{1}{1+\exp{\left(-\log\frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)p(y=1)}{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0)p(y=0)}\right)}} \end{equation}
c
Given the above expression for $p(y=1|x)$, show that Naive Bayes with this definition of $p([x]_{\alpha}|y)$ is a linear model. Hint: the form you derived in part b should remind you of a decision rule you've seen before.
solution:
To show this, we simply plug the following definitions into the equation we derived in part b: \begin{align} p(y=1) &= \rho \\ p([x]_{\alpha} \mid y=1)&=\frac{1}{\sqrt{2\pi[\sigma^{2}]_{\alpha}}}\exp\left(\frac{-([x]_{\alpha}-[\mu_{1}]_{\alpha})^{2}}{2[\sigma^{2}]_{\alpha}}\right) \\ p([x]_{\alpha} \mid y=0)&=\frac{1}{\sqrt{2\pi[\sigma^{2}]_{\alpha}}}\exp\left(\frac{-([x]_{\alpha}-[\mu_{0}]_{\alpha})^{2}}{2[\sigma^{2}]_{\alpha}}\right) \end{align} Expanding $-\log\frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)p(y=1)}{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0)p(y=0)}$ we get: \begin{equation} -\log p(y=1) - \log \prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1) + \log p(y=0) + \log \prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0) \end{equation} Observing that $\log \prod_{i} x_{i} = \sum_{i} \log x_{i}$ and rearranging terms, this is equal to: \begin{equation} \log \frac{p(y=0)}{p(y=1)} + \sum_{\alpha=1}^{d} \log \frac{p([x]_{\alpha}|y=0)}{p([x]_{\alpha}|y=1)} \end{equation} Plugging in the definition of $p(y=1)$, the first term is equal to $\log \frac{1-\rho}{\rho}$. For the second term, we plug in the Gaussian densities for $p([x]_{\alpha} \mid y=1)$ and $p([x]_{\alpha} \mid y=0)$; the shared normalization constants cancel, and a bit of algebra gives: \begin{equation} \sum_{\alpha=1}^{d} \left(\frac{([\mu_{0}]_{\alpha}-[\mu_{1}]_{\alpha})[x]_{\alpha}}{[\sigma^{2}]_{\alpha}} + \frac{[\mu_{1}]^{2}_{\alpha}-[\mu_{0}]^{2}_{\alpha}}{2[\sigma^{2}]_{\alpha}}\right) \end{equation} Putting everything together we get: \begin{equation} \log \frac{1-\rho}{\rho} + \sum_{\alpha=1}^{d} \left(\frac{([\mu_{0}]_{\alpha}-[\mu_{1}]_{\alpha})[x]_{\alpha}}{[\sigma^{2}]_{\alpha}} + \frac{[\mu_{1}]^{2}_{\alpha}-[\mu_{0}]^{2}_{\alpha}}{2[\sigma^{2}]_{\alpha}}\right) \end{equation} And finally we just start renaming terms. Let's first collect everything that does not depend on $x$ into a bias: \begin{equation} b = \log \frac{1-\rho}{\rho} + \sum_{\alpha=1}^{d}\frac{[\mu_{1}]^{2}_{\alpha}-[\mu_{0}]^{2}_{\alpha}}{2[\sigma^{2}]_{\alpha}} \end{equation} Next, create a vector $\mathbf{w}$ so that: \begin{equation} [\mathbf{w}]_{\alpha} = \frac{[\mu_{0}]_{\alpha}-[\mu_{1}]_{\alpha}}{[\sigma^{2}]_{\alpha}} \end{equation} Then the part of the sum that depends on $x$ is simply equal to $\mathbf{w}^{\top}x$. Therefore, \begin{equation} -\log\frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)p(y=1)}{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0)p(y=0)} = \mathbf{w}^{\top}x+b \end{equation} Plugging this into the decision rule $p(y=1|x)$ we derived in part b, we finally see that: \begin{equation} p(y=1|x) = \frac{1}{1+\exp\left(\mathbf{w}^{\top}x+b\right)} \end{equation} This is not only a linear decision boundary, but should look very similar indeed to the linear decision rule you've seen from logistic regression!
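As a closing sanity check (with made-up Naive Bayes parameters), the vector $\mathbf{w}$ and bias $b$ defined above reproduce the Naive Bayes posterior exactly through the sigmoid:

```python
import numpy as np

# Made-up NB parameters: class prior rho, per-class means, shared per-feature variances.
rng = np.random.default_rng(0)
d = 4
rho = 0.3                                   # p(y=1)
mu0, mu1 = rng.normal(size=d), rng.normal(size=d)
var = rng.uniform(0.5, 2.0, size=d)         # [sigma^2]_alpha, shared across classes

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = rng.normal(size=d)                      # a test point

# Direct Naive Bayes posterior p(y=1|x).
num1 = np.prod(gauss(x, mu1, var)) * rho
num0 = np.prod(gauss(x, mu0, var)) * (1 - rho)
posterior = num1 / (num1 + num0)

# Linear form derived above: p(y=1|x) = 1 / (1 + exp(w^T x + b)).
w = (mu0 - mu1) / var
b = np.log((1 - rho) / rho) + np.sum((mu1 ** 2 - mu0 ** 2) / (2 * var))
posterior_linear = 1.0 / (1.0 + np.exp(w @ x + b))

print(np.isclose(posterior, posterior_linear))
```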