
Homework 2

Due: October 14, 1:25 PM, in class.
Up to 5 members per group. Please include NetIDs and names on the front page.

ERM

I. Loss function optimization using gradient descent

Derive the gradient update (with stepsize c) for your weight vector w for each of the following loss functions. (Here |w|=\sum_{\alpha=1}^d |w_\alpha|, and \lambda and C are non-negative constants.) A small code sketch of such an update, using the ridge objective, follows part (d).
(a) Ridge Regression: {\cal L}(w)=\sum_{i=1}^n (w^\top x_i-y_i)^2+\lambda \|w\|_2^2. (You will implement this in a future project.)
(b) Lasso Regression: {\cal L}(w)=\sum_{i=1}^n (w^\top x_i-y_i)^2+\lambda |w|
(c) Logistic Regression (y_i\in\{+1,-1\}): {\cal L}(w)=\sum_{i=1}^n \log(1+\exp{(-y_i w^\top x_i)})
(d) Linear Support Vector Machine (y_i\in\{+1,-1\}): {\cal L}(w)=C\sum_{i=1}^n \max(1-y_iw^\top x_i,0)+\|w\|_2^2

II. Another look at SVM

You suddenly have this vision that it might be beneficial to pre-compute all inner products in your data set, i.e., you store an n\times n matrix K_{ij}=x_i^\top x_j. (A short code sketch of this bookkeeping follows part (c).)
(a) Imagine you start the gradient descent procedure from above with initial weights w^0=0 (this is just to help intuition). Take the gradient update of the SVM (question I.d) and show that after t gradient steps you can express the weight vector as w^t=\sum_{i=1}^n\alpha_i^t x_i for some values \alpha_i^t. (HINT: each gradient step updates the \alpha's; if the update is correct, you can write w^t=\sum_{i=1}^n\alpha_i^t x_i for every step, from t=0 to any t>0.)
(b) Take this new definition w = \sum_{i=1}^n \alpha_i x_i and substitute it into the SVM loss function (question I.d). You can now write the loss as {\cal L}(\alpha), where \alpha=[\alpha_1 \dots \alpha_n]^\top (no need to force the vector representation in your formula). Further, write the loss {\cal L}(\alpha) using only the precomputed matrix entries K_{ij}.
(c) Can you derive a gradient descent update rule for the SVM (question I.d) with respect to \alpha_i?
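
The sketch below illustrates the bookkeeping behind this representation, assuming the rows of X are the x_i^\top: once K is stored, every inner product w^\top x_j that the loss and its gradient need can be read off from \alpha and K without touching the raw x_i again. The \alpha update itself is left for part (c).

```python
import numpy as np

def precompute_kernel(X):
    """K_ij = x_i^T x_j for every pair, i.e. K = X X^T, an n x n matrix."""
    return X @ X.T

def margin(alpha, K, j):
    """With w = sum_i alpha_i x_i, the inner product w^T x_j = sum_i alpha_i K_ij."""
    return alpha @ K[:, j]

def recover_w(alpha, X):
    """Expand w = sum_i alpha_i x_i explicitly, i.e. w = X^T alpha (sanity check only)."""
    return X.T @ alpha
```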

III. Weighted Ridge Regression

Assume that in addition to your data \left\{(x_1,y_1),\dots,(x_n,y_n)\right\} you also have weights p_i\geq 0 for each example. Let your loss function be \begin{equation} {\cal L}(w)=\sum_{i=1}^n p_i(w^\top x_i-y_i)^2+\lambda w^\top w.\label{prob2:loss} \end{equation}
(a) Rephrase the previous equation in terms of the matrices X=[x_1,\dots,x_n]^\top, Y=[y_1,\dots,y_n]^\top, and the diagonal matrix P=diag([p_1,\dots,p_n]) (where the diag operator behaves like the Matlab function of the same name).
(b) Derive a closed-form solution for w. (You may use: \frac{\partial (w^\top A)}{\partial w}=A, \frac{\partial (w^\top B w)}{\partial w}=Bw + B^\top w, and w^\top w=w^\top I w, where I is the identity matrix.)

IV. Newton's Method

Let us revisit Logistic Regression, this time with y_i\in\{0,1\}. (A code sketch of the resulting update follows part (e).)
(a) Let \sigma(a) = 1/(1+e^{-a}). Show that with these new labels the loss function can be written as \begin{equation} {\cal L}(w)=-\sum_{i=1}^n (y_i \log (\sigma(w^\top x_i))+(1-y_i)\log (1-\sigma(w^\top x_i))) \end{equation}
(b) Show that the gradient of {\cal L} can be written as \begin{equation} \frac{\partial {\cal L}}{\partial w}=-\sum_{i=1}^n(y_i-\sigma (w^\top x_i))x_i. \end{equation} (HINT: You can use the fact that \frac{\partial \sigma(z)}{\partial z}=\sigma(z)(1-\sigma (z)).)
(c) Let W be the n\times n diagonal matrix with W_{ii}=\sigma(w^\top x_i)(1-\sigma(w^\top x_i)), and let X=[x_1,\dots,x_n]^\top. Show that the Hessian matrix is H=X^\top W X. For which examples is W_{ii} large, and for which is it small?
(d) Write down the update rule for a Newton step. Show that if you use the substitution z where z_i=x_i^\top w+\frac{1}{W_{ii}}(y_i-\sigma(w^\top x_i)), you arrive at \begin{equation} w^{new}\leftarrow (X^\top W X)^{-1} X^\top W z. \end{equation}
(e) Look at the result of Question III in the Weighted Ridge Regression section (with \lambda=0). Hold your breath and enjoy the moment of amazement. Why is this algorithm also called Iteratively Reweighted Least Squares?
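
Here is a minimal numpy sketch of one Newton step using the update from part (d), with labels y_i \in \{0,1\} and rows of X equal to x_i^\top; it assumes X^\top W X is invertible.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_step(w, X, y):
    """One Newton/IRLS step from part (d):
    W_ii = sigma_i (1 - sigma_i),  z_i = x_i^T w + (y_i - sigma_i) / W_ii,
    w_new = (X^T W X)^{-1} X^T W z  (assumes X^T W X is invertible)."""
    s = sigmoid(X @ w)                # sigma(w^T x_i) for every example
    Wdiag = s * (1.0 - s)             # diagonal of W
    z = X @ w + (y - s) / Wdiag       # the "working response" z
    A = X.T @ (Wdiag[:, None] * X)    # X^T W X, without forming W explicitly
    b = X.T @ (Wdiag * z)             # X^T W z
    return np.linalg.solve(A, b)
```

Iterating newton_step until the weights stop changing gives the algorithm referred to in part (e).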

V. Regularized Logistic Regression & MAP estimate

(a) You just used your brilliant new algorithm from part IV, which gave great results on your training set but abysmal results on your testing set. You stare at your screen in disbelief until it hits you: "My algorithm must have overfitted!"

One way to reduce overfitting is to add a regularization term to the loss function. For example, look at Ridge Regression (I.a): it is just a regularized version of ordinary least squares, where the regularization term is \lambda \|w\|^2_2 (\lambda is the regularization parameter and is fixed). Define a "Regularized Logistic Regression" (RLR) loss function where the regularization term is a multiple \lambda of the squared 2-norm of w (as in Ridge Regression). A small code sketch of one possible such objective follows part (b).
(b) Another reasonable (seemingly unrelated) way to compute an estimate for logistic regression is through the Maximum A Posteriori (MAP) estimate: w_{MAP}=\mathrm{argmax}_w \prod_{i=1}^n P(y_i|x_i,w)P(w), where we assume that P(y_i|x_i,w) is the logistic likelihood from part IV and that P(w) is a zero-mean isotropic Gaussian prior, P(w)\propto \exp\left(-\frac{\|w\|_2^2}{2\sigma^2}\right). Show that for a particular value of \sigma and \lambda, the MAP estimate is the same as the w obtained by minimizing the RLR loss.
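
For comparison with your answer to part (a), here is a minimal sketch of one reasonable RLR objective, assuming the y_i \in \{0,1\} formulation from part IV; treat its exact form as an assumption to check against your own definition, not as the intended answer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rlr_loss(w, X, y, lam):
    """One possible RLR objective (labels y_i in {0,1}):
    the negative log-likelihood from part IV plus lam * ||w||_2^2."""
    s = sigmoid(X @ w)
    nll = -np.sum(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))
    return nll + lam * w @ w
```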