Homework 2
Due: October 14, 1:25 PM, in class
Up to 5 members in each group. Please include NetIDs and names on the front page.
ERM
I. Loss function optimization using gradient descent
Derive the gradient update (with stepsize $c$) for your weight vector $w$ for each of the following loss functions. (Here $\|w\|_2^2=w^\top w$ and $|w|=\sum_{\alpha=1}^{d}|w_\alpha|$; $\lambda$ and $C$ are non-negative constants.) A numerical gradient-check sketch follows part d.
a
Ridge Regression: $L(w)=\sum_{i=1}^{n}(w^\top x_i-y_i)^2+\lambda\|w\|_2^2$. (You will implement this in a future project.)
b
Lasso Regression: $L(w)=\sum_{i=1}^{n}(w^\top x_i-y_i)^2+\lambda|w|$
c
Logistic Regression ($y_i\in\{+1,-1\}$): $L(w)=\sum_{i=1}^{n}\log\left(1+\exp(-y_i w^\top x_i)\right)$
d
Linear Support Vector Machine ($y_i\in\{+1,-1\}$): $L(w)=C\sum_{i=1}^{n}\max(1-y_i w^\top x_i,\,0)+\|w\|_2^2$
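Here is the gradient-check sketch mentioned above: to sanity-check the gradients you derive in parts a-d, it can help to compare them against a finite-difference approximation. This is a minimal NumPy sketch (not part of the assignment; the toy data and names like `ridge_loss` and `numerical_grad` are our own), using the ridge loss from part a as the example; the same checker works for the other losses.

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Part a: L(w) = sum_i (w^T x_i - y_i)^2 + lambda * ||w||_2^2."""
    r = X @ w - y
    return np.sum(r ** 2) + lam * (w @ w)

def numerical_grad(loss, w, eps=1e-6):
    """Central finite-difference approximation of dL/dw."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.1
X = rng.normal(size=(n, d))   # row i is x_i^T
y = rng.normal(size=n)
w = rng.normal(size=d)

num_grad = numerical_grad(lambda v: ridge_loss(v, X, y, lam), w)
print(num_grad)
# Compare num_grad against the expression you derive in part a; a gradient
# step with stepsize c would then be  w <- w - c * (your gradient).
```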
II. Another look at SVM
You suddenly have the vision that it might be beneficial to pre-compute all inner products in your data set, i.e., you store an $n\times n$ matrix $K_{ij}=x_i^\top x_j$.
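For concreteness, if the examples are stored as the rows of a matrix, precomputing $K$ is a one-liner; a short NumPy sketch with our own toy data (the assignment does not require any particular code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))   # row i is x_i^T

K = X @ X.T                   # K[i, j] = x_i^T x_j, an n x n matrix
assert K.shape == (n, n)
assert np.allclose(K, K.T)    # inner products are symmetric
```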
a
Imagine you start the gradient descent procedure from above with initial weights $w^0=0$ (this is just to help intuition).
Take the gradient update of SVM (question I.d) and show that after $t$ gradient steps you can express the weight vector as $w^t=\sum_{i=1}^{n}\alpha_i^t x_i$ for some values $\alpha_i^t$. (HINT: each gradient step updates the $\alpha$'s; if the update is correct you can always write $w^t=\sum_{i=1}^{n}\alpha_i^t x_i$, from update $t=0$ to any update $t>0$.) A small numerical illustration of this representation follows part c.
b
Take this new definition $w=\sum_{i=1}^{n}\alpha_i x_i$ and substitute it into the loss function of SVM (question I.d). You can now write the loss function as $L(\alpha)$ where $\alpha=[\alpha_1,\ldots,\alpha_n]^\top$ (no need to force the vector representation in your formula). Further, write the loss $L(\alpha)$ using only the precomputed matrix entries $K_{ij}$.
c
Can you derive a gradient descent update rule for SVM (question I.d) with respect to $\alpha_i$?
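Here is the small illustration mentioned in part a, a sketch under our own assumptions (random data and an arbitrary $\alpha$): it shows what the representation $w=\sum_i\alpha_i x_i$ means computationally, and how one could test numerically whether a given vector lies in the span of the training points.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
X = rng.normal(size=(n, d))      # row i is x_i^T
alpha = rng.normal(size=n)       # arbitrary coefficients alpha_1 ... alpha_n

# "w is a linear combination of the training points" simply means:
w = X.T @ alpha                  # w = sum_i alpha_i x_i

# Conversely, to test whether some vector v admits such a representation,
# solve X^T a = v in the least-squares sense and check the residual:
v = w.copy()
a, *_ = np.linalg.lstsq(X.T, v, rcond=None)
print(np.linalg.norm(X.T @ a - v))   # ~0  ->  v lies in the span of x_1, ..., x_n
```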
III. Weighted Ridge Regression
Assume that in addition to your data $\{(x_1,y_1),\ldots,(x_n,y_n)\}$ you also have weights $p_i\ge 0$ for each example.
Let your loss function be
$$L(w)=\sum_{i=1}^{n}p_i(w^\top x_i-y_i)^2+\lambda w^\top w.$$
a
Rephrase the previous equation in terms of the matrices $X=[x_1,\ldots,x_n]^\top$, $Y=[y_1,\ldots,y_n]^\top$, and the diagonal matrix $P=\operatorname{diag}([p_1,\ldots,p_n])$ (where the $\operatorname{diag}$ operator behaves like the Matlab function of the same name).
b
Derive a closed-form solution for $w$. (You can use: $\frac{\partial(w^\top A)}{\partial w}=A$, $\frac{\partial(w^\top Bw)}{\partial w}=Bw+B^\top w$, and $w^\top w=w^\top I w$ where $I$ is the identity matrix.)
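One way to check whatever closed form you derive is to minimize the loss numerically and compare. A minimal sketch using NumPy and `scipy.optimize.minimize` on our own random data (names and sizes are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, d, lam = 30, 4, 0.5
X = rng.normal(size=(n, d))        # rows are x_i^T
Y = rng.normal(size=n)
p = rng.uniform(0.1, 2.0, size=n)  # non-negative example weights p_i

def weighted_ridge_loss(w):
    """L(w) = sum_i p_i (w^T x_i - y_i)^2 + lambda * w^T w"""
    return np.sum(p * (X @ w - Y) ** 2) + lam * (w @ w)

w_numeric = minimize(weighted_ridge_loss, np.zeros(d)).x
print(w_numeric)
# Your closed-form w from part b (built from X, Y, P = diag(p), and lambda)
# should match w_numeric to several decimal places.
```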
IV. Newton's Method
Let us revisit Logistic Regression, this time with $y_i\in\{0,1\}$.
a
Let $\sigma(a)=1/(1+e^{-a})$. Show that with these new labels the loss function can be written as
$$L(w)=-\sum_{i=1}^{n}\left(y_i\log\big(\sigma(w^\top x_i)\big)+(1-y_i)\log\big(1-\sigma(w^\top x_i)\big)\right)$$
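For reference, both this loss and the $\{+1,-1\}$ formulation from question I.c are easy to evaluate, and they agree numerically once the labels are re-coded (the algebraic argument is what part a asks for). A small sketch with our own toy data:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_01(w, X, y01):
    """-sum_i [ y_i log sigma(w^T x_i) + (1 - y_i) log(1 - sigma(w^T x_i)) ], y_i in {0, 1}."""
    s = sigmoid(X @ w)
    return -np.sum(y01 * np.log(s) + (1 - y01) * np.log(1 - s))

def loss_pm1(w, X, ypm):
    """sum_i log(1 + exp(-y_i w^T x_i)), y_i in {-1, +1} (question I.c)."""
    return np.sum(np.log(1.0 + np.exp(-ypm * (X @ w))))

rng = np.random.default_rng(3)
n, d = 10, 3
X = rng.normal(size=(n, d))
ypm = rng.choice([-1, 1], size=n)   # labels in {-1, +1}
y01 = (ypm + 1) // 2                # the same labels re-coded as {0, 1}
w = rng.normal(size=d)

print(loss_pm1(w, X, ypm), loss_01(w, X, y01))   # the two values coincide
```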
b
Show that the gradient of $L$ can be written as
$$\frac{\partial L}{\partial w}=-\sum_{i=1}^{n}\big(y_i-\sigma(w^\top x_i)\big)x_i.$$
(HINT: You can use the fact that $\frac{\partial\sigma(z)}{\partial z}=\sigma(z)(1-\sigma(z))$.)
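The stated gradient vectorizes neatly; a short sketch (our own names and toy data), again with labels in $\{0,1\}$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_grad(w, X, y01):
    """dL/dw = -sum_i (y_i - sigma(w^T x_i)) x_i  =  -X^T (y - sigma(X w))."""
    return -X.T @ (y01 - sigmoid(X @ w))

rng = np.random.default_rng(4)
n, d = 10, 3
X = rng.normal(size=(n, d))
y01 = rng.integers(0, 2, size=n)
w = rng.normal(size=d)
print(logistic_grad(w, X, y01))
# A finite-difference check against the part-a loss (as in question I) is a good sanity test.
```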
c
Let $W$ be the $n\times n$ diagonal matrix with $W_{ii}=\sigma(w^\top x_i)(1-\sigma(w^\top x_i))$, and let $X=[x_1,\ldots,x_n]^\top$. Show that the Hessian matrix is $H=X^\top W X$. For which examples is $W_{ii}$ large, and for which is it small?
d
Write down the update rule for a Newton step. Show that if you use the substitution $z$ where $z_i=x_i^\top w+\frac{1}{W_{ii}}\big(y_i-\sigma(w^\top x_i)\big)$, you arrive at
$$w_{\text{new}}\leftarrow(X^\top W X)^{-1}X^\top W z.$$
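Putting parts b-d together, one Newton step uses only the quantities defined above; a minimal sketch on our own random data (no safeguards against tiny $W_{ii}$ or separable data):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(5)
n, d = 50, 3
X = rng.normal(size=(n, d))      # rows are x_i^T
y = rng.integers(0, 2, size=n)   # labels in {0, 1}
w = np.zeros(d)

for _ in range(5):               # a few Newton / IRLS iterations
    s = sigmoid(X @ w)
    Wdiag = s * (1 - s)          # W_ii = sigma(w^T x_i)(1 - sigma(w^T x_i))
    z = X @ w + (y - s) / Wdiag  # z_i = x_i^T w + (y_i - sigma(w^T x_i)) / W_ii
    # w_new = (X^T W X)^{-1} X^T W z, computed by solving a linear system
    w = np.linalg.solve(X.T @ (Wdiag[:, None] * X), X.T @ (Wdiag * z))
print(w)
```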
e
Look at the result of Question III in the Weighted Ridge Regression section ($\lambda=0$). Hold your breath and enjoy the moment of amazement. Why is this algorithm also called Iteratively Reweighted Least Squares?
V. Regularized Logistic Regression & MAP estimate
a
You just used your brilliant new algorithm from part IV, which gave great results on your training set but abysmal results on your testing set.
You stare at your screen in disbelief until it hits you: "My algorithm must have overfitted!".
One way to reduce overfitting is by adding a regularization term to the loss function. For example, look at Ridge Regression (I.a): it is just a regularized version of ordinary least squares, where the regularization term is $\lambda\|w\|_2^2$ ($\lambda$ is the regularization parameter and is fixed).
Define a "Regularized Logistic Regression" (RLR) loss function where the regularization term is a multiple $\lambda$ of the squared 2-norm of $w$ (as in Ridge Regression).
b
Another reasonable (seemingly unrelated) way to compute an estimate for logistic regression is through the Maximum A Posteriori (MAP) estimate:
$$w_{\text{MAP}}=\arg\max_{w}\prod_{i=1}^{N}P(y_i\mid x_i,w)\,P(w)$$
where we assume that:
- (Logistic Regression Condition) $P(y_i\mid x_i,w)=\frac{1}{1+\exp(-y_i w^\top x_i)}$ for all $i\in 1,\ldots,N$
- (Prior on $w$) $w$ is normally distributed with zero mean, and its covariance matrix is a multiple of the identity matrix, i.e. $P(w)=\prod_{j=1}^{d}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{w_j^2}{2\sigma^2}\right)$ for some given $\sigma$.
Show that for a particular value of $\sigma$ and $\lambda$, the MAP estimate is the same as the $w$ obtained by minimizing the RLR loss.
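A numerical way to build intuition (not a substitute for the derivation): take the negative logarithm of the quantity being maximized and evaluate it directly from the two assumptions above. A sketch with our own toy data:

```python
import numpy as np

def neg_log_posterior(w, X, ypm, sigma):
    """-sum_i log P(y_i | x_i, w) - log P(w), using the likelihood and prior above."""
    # Likelihood: P(y_i | x_i, w) = 1 / (1 + exp(-y_i w^T x_i)), y_i in {-1, +1}
    nll = np.sum(np.log(1.0 + np.exp(-ypm * (X @ w))))
    # Prior: independent N(0, sigma^2) on each coordinate w_j
    d = len(w)
    nlp = np.sum(w ** 2) / (2 * sigma ** 2) + d * np.log(np.sqrt(2 * np.pi) * sigma)
    return nll + nlp

rng = np.random.default_rng(6)
n, d, sigma = 15, 3, 1.0
X = rng.normal(size=(n, d))
ypm = rng.choice([-1, 1], size=n)
w = rng.normal(size=d)
print(neg_log_posterior(w, X, ypm, sigma))
# Maximizing the posterior is the same as minimizing this function; part b asks
# you to relate it to your RLR loss.
```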