Homework 3

Due: Friday, March 17th in class.
Up to 5 members in each group. Please include NetIDs and names on the front page.

I. Loss function optimization using gradient descent

Derive the gradient update (with step size $c$) for your weight vector $w$ for each of the following loss functions. (Here $\|w\|_2^2=w^\top w$ and $|w|=\sum_{\alpha=1}^d |w_\alpha|$; $\lambda$ and $C$ are non-negative constants.)
a) Ridge Regression: ${\cal L}(w)=\sum_{i=1}^n (w^\top x_i-y_i)^2+\lambda \|w\|_2^2$. (You will implement this in a future project; a small numerical sketch appears after part d.)
b) Lasso Regression: ${\cal L}(w)=\sum_{i=1}^n (w^\top x_i-y_i)^2+\lambda |w|$
c) Logistic Regression ($y_i\in\{+1,-1\}$): ${\cal L}(w)=\sum_{i=1}^n \log(1+\exp{(-y_i w^\top x_i)})$
d) Linear Support Vector Machine ($y_i\in\{+1,-1\}$): ${\cal L}(w)=C\sum_{i=1}^n \max(1-y_iw^\top x_i,0)+\|w\|_2^2$
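For intuition (and since part a reappears in a later project), here is a minimal numerical sketch of the gradient-descent loop for the ridge loss in part a. The step size, iteration count, and toy data are arbitrary illustrative choices, not part of the assignment.

```python
import numpy as np

# Minimal sketch: gradient descent on the ridge loss from part (a),
#   L(w) = sum_i (w^T x_i - y_i)^2 + lambda * ||w||_2^2.
# Its gradient is 2 * X^T (X w - y) + 2 * lambda * w.
# Step size c, iteration count T, and the toy data are illustrative only.

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))            # rows are the x_i
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam, c, T = 0.1, 1e-3, 500
w = np.zeros(d)
for _ in range(T):
    grad = 2.0 * X.T @ (X @ w - y) + 2.0 * lam * w
    w = w - c * grad                   # gradient step with step size c

loss = np.sum((X @ w - y) ** 2) + lam * w @ w
print(loss)
```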

II. Another look at SVM

You suddenly have this vision that it might be beneficial to pre-compute all inner products in your data set, i.e., you store an $n\times n$ matrix $K_{ij}=x_i^\top x_j$ (a small numerical illustration of this appears after part c).
a) Imagine you start the gradient descent procedure from above with initial weights $w^0=0$ (this is just to help intuition). Take the gradient update of the SVM (question I.d) and show that after $t$ gradient steps you can express the weight vector as $w^t=\sum_{i=1}^n\alpha_i^t x_i$ for some values $\alpha_i^t$. (HINT: each gradient step updates the $\alpha$'s; if the update is correct, you can always write $w^t=\sum_{i=1}^n\alpha_i^t x_i$, from update $t=0$ to any update $t>0$.)
b) Take this new definition $w = \sum_{i=1}^n \alpha_i x_i$ and substitute it into the loss function of the SVM (question I.d). You can now write the loss function as ${\cal L}(\alpha)$, where $\alpha=[\alpha_1 \dots \alpha_n]^\top$ (no need to force the vector representation in your formula). Further, write the loss ${\cal L}(\alpha)$ using only the precomputed matrix entries $K_{ij}$.
c) Can you derive a gradient descent update rule for the SVM (question I.d) with respect to $\alpha_i$?
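As a sanity check on the representation used in parts a and b, the sketch below (an illustration, not a required part of the submission) precomputes $K$ and confirms that for any coefficients $\alpha$, the inner products $w^\top x_i$ with $w=\sum_j \alpha_j x_j$ can be read off from $K$ alone.

```python
import numpy as np

# Sketch: precompute the Gram matrix K_ij = x_i^T x_j and check that,
# for w = sum_j alpha_j x_j, the inner product w^T x_i equals (K @ alpha)_i.
# The data and alpha below are arbitrary, for illustration only.

rng = np.random.default_rng(1)
n, d = 20, 4
X = rng.normal(size=(n, d))            # rows are the x_i
K = X @ X.T                            # precomputed n x n matrix of inner products

alpha = rng.normal(size=n)             # any coefficient vector
w = X.T @ alpha                        # w = sum_j alpha_j x_j

print(np.allclose(X @ w, K @ alpha))   # w^T x_i == sum_j alpha_j K_ij for every i
```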

III. Weighted Ridge Regression

Assume that in addition to your data $\left\{(x_1,y_1),\dots,(x_n,y_n)\right\}$ you also have weights $p_i\geq 0$ for each example. Let your loss function be \begin{equation} {\cal L}(w)=\sum_{i=1}^n p_i(w^\top x_i-y_i)^2+\lambda w^\top w.\label{prob2:loss} \end{equation}
a) Rephrase the previous equation in terms of the matrices $X=[x_1,\dots,x_n]^\top$, $Y=[y_1,\dots,y_n]^\top$ and the diagonal matrix $P=\mathrm{diag}([p_1,\dots,p_n])$ (where the diag operator behaves like the Matlab function of the same name).
b) Derive a closed-form solution for $w$. (You can use: $\frac{\partial (w^\top A)}{\partial w}=A$, $\frac{\partial (w^\top B w)}{\partial w}=Bw + B^\top w$, and $w^\top w=w^\top I w$, where $I$ is the identity matrix.)

IV. Regularized Logistic Regression & MAP estimate

a) One way to reduce overfitting is to add a regularization term to the loss function. For example, look at Ridge Regression (I.a): it is just a regularized version of ordinary least squares, where the regularization term is $\lambda \|w\|^2_2$ ($\lambda$ is the regularization parameter and is fixed). Define a "Regularized Logistic Regression" (RLR) loss function whose regularization term is a multiple $\lambda$ of the squared 2-norm of $w$ (as in Ridge Regression).
b) Another reasonable (seemingly unrelated) way to compute an estimate for logistic regression is through the Maximum A Posteriori (MAP) estimate: $$w_{MAP}=\operatorname{argmax}_w\prod_{i=1}^n P(y_i|x_i,w)P(w),$$ where we assume the logistic likelihood $P(y_i|x_i,w)=\frac{1}{1+\exp(-y_i w^\top x_i)}$ and a zero-mean Gaussian prior on $w$ with variance $\sigma^2$, i.e. $P(w)\propto\exp\left(-\frac{\|w\|_2^2}{2\sigma^2}\right)$. Show that for a particular value of $\sigma$ and $\lambda$, the MAP estimate is the same as the $w$ obtained by minimizing the RLR loss.
c) Suppose that instead of a normal prior on $w$ we place a zero-mean Laplace prior on each weight, i.e. $P(w)=C\exp{\left(-\lambda |w|\right)}$. Show that the MAP estimate for $w$ is the same as the one obtained by minimizing the logistic regression loss with L1 regularization, i.e. a regularization term of $\gamma |w|$.

V. Linearity of Naive Bayes with Gaussian likelihood factors

In this question, you will show that Naive Bayes is a linear classifier when using Gaussian likelihood factors with shared variances. Specifically, consider the following Naive Bayes model: \begin{equation} p(y|x) = \frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y)\,p(y)}{p(x)}, \end{equation} with \begin{equation} p([x]_{\alpha}|y) = \mathcal{N}([\mu_{y}]_{\alpha},[\sigma^{2}]_{\alpha}). \end{equation} That is, there is a separate mean value for each feature $[x]_{\alpha}$ and each class $y \in \left\{0,1\right\}$. However, the variances are shared across classes, so that there is only one variance $[\sigma^{2}]_{\alpha}$ per feature.
a) Show that the decision rule $p(y=1|x)$ can equivalently be written as \begin{equation} p(y=1|x) = \frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)\,p(y=1)}{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)\,p(y=1)+\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0)\,p(y=0)}. \end{equation} Hint: remember the sum rule and the product rule.
b) Using this formulation, show how to rewrite $p(y=1|x)$ as: \begin{equation} p(y=1|x) = \frac{1}{1+\exp{\left(-\log\frac{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=1)\,p(y=1)}{\prod_{\alpha=1}^{d} p([x]_{\alpha}|y=0)\,p(y=0)}\right)}} \end{equation}
c) Given the above expression for $p(y=1|x)$, show that Naive Bayes with this definition of $p([x]_{\alpha}|y)$ is a linear model. Hint: the form you derived in part b should remind you of a decision rule you've seen before.