Homework 2
Due: October 14, 1:25 PM, in class
Up to 5 members in each group. Please include NetIDs and names on the front page.
ERM
I. Loss function optimization using gradient descent
Derive the gradient update (with stepsize $c$) for your weight vector $w$ for each of the following loss functions. (Here $\|w\|_2^2=w^\top w$ and $|w|=\sum_{\alpha=1}^{d}|w_\alpha|$; $\lambda$ and $C$ are non-negative constants.) A numerical gradient-check sketch follows part d.
a
Ridge Regression: $L(w)=\sum_{i=1}^{n}(w^\top x_i-y_i)^2+\lambda\|w\|_2^2$. (You will implement this in a future project.)
b
Lasso Regression: $L(w)=\sum_{i=1}^{n}(w^\top x_i-y_i)^2+\lambda|w|$
c
Logistic Regression ($y_i\in\{+1,-1\}$): $L(w)=\sum_{i=1}^{n}\log\left(1+\exp(-y_i w^\top x_i)\right)$
d
Linear Support Vector Machine ($y_i\in\{+1,-1\}$): $L(w)=C\sum_{i=1}^{n}\max(1-y_i w^\top x_i,\,0)+\|w\|_2^2$
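Here is the gradient-check sketch mentioned above: to sanity-check the gradients you derive in parts a-d, it can help to compare them against a finite-difference approximation. This is a minimal NumPy sketch (not part of the assignment; the toy data and names like `ridge_loss` and `numerical_grad` are our own), using the ridge loss from part a as the example; the same checker works for the other losses.

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Part a: L(w) = sum_i (w^T x_i - y_i)^2 + lambda * ||w||_2^2."""
    r = X @ w - y
    return np.sum(r ** 2) + lam * (w @ w)

def numerical_grad(loss, w, eps=1e-6):
    """Central finite-difference approximation of dL/dw."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.1
X = rng.normal(size=(n, d))   # row i is x_i^T
y = rng.normal(size=n)
w = rng.normal(size=d)

num_grad = numerical_grad(lambda v: ridge_loss(v, X, y, lam), w)
print(num_grad)
# Compare num_grad against the expression you derive in part a; a gradient
# step with stepsize c would then be  w <- w - c * (your gradient).
```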
II. Another look at SVM
You suddenly have the vision that it might be beneficial to pre-compute all inner products in your data set, i.e., you store an $n\times n$ matrix $K_{ij}=x_i^\top x_j$.
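For concreteness, if the examples are stored as the rows of a matrix, precomputing $K$ is a one-liner; a short NumPy sketch with our own toy data (the assignment does not require any particular code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))   # row i is x_i^T

K = X @ X.T                   # K[i, j] = x_i^T x_j, an n x n matrix
assert K.shape == (n, n)
assert np.allclose(K, K.T)    # inner products are symmetric
```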
a
Imagine you start the gradient descent procedure from above with initial weights $w^0=0$ (this is just to help intuition).
Take the gradient update of SVM (question I.d) and show that after $t$ gradient steps you can express the weight vector as $w^t=\sum_{i=1}^{n}\alpha_i^t x_i$ for some values $\alpha_i^t$. (HINT: each gradient step updates the $\alpha$'s; if the update is correct you can always write $w^t=\sum_{i=1}^{n}\alpha_i^t x_i$, from update $t=0$ to any update $t>0$.) A small numerical illustration of this representation follows part c.
b
Take this new definition $w=\sum_{i=1}^{n}\alpha_i x_i$ and substitute it into the loss function of SVM (question I.d). You can now write the loss function as $L(\alpha)$ where $\alpha=[\alpha_1,\ldots,\alpha_n]^\top$ (no need to force the vector representation in your formula). Further, write the loss $L(\alpha)$ using only the precomputed matrix entries $K_{ij}$.
c
Can you derive a gradient descent update rule for SVM (question I.d) with respect to $\alpha_i$?
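Here is the small illustration mentioned in part a, a sketch under our own assumptions (random data and an arbitrary $\alpha$): it shows what the representation $w=\sum_i\alpha_i x_i$ means computationally, and how one could test numerically whether a given vector lies in the span of the training points.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
X = rng.normal(size=(n, d))      # row i is x_i^T
alpha = rng.normal(size=n)       # arbitrary coefficients alpha_1 ... alpha_n

# "w is a linear combination of the training points" simply means:
w = X.T @ alpha                  # w = sum_i alpha_i x_i

# Conversely, to test whether some vector v admits such a representation,
# solve X^T a = v in the least-squares sense and check the residual:
v = w.copy()
a, *_ = np.linalg.lstsq(X.T, v, rcond=None)
print(np.linalg.norm(X.T @ a - v))   # ~0  ->  v lies in the span of x_1, ..., x_n
```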
III. Weighted Ridge Regression
Assume that in addition to your data $\{(x_1,y_1),\ldots,(x_n,y_n)\}$ you also have weights $p_i\ge 0$ for each example.
Let your loss function be
$$L(w)=\sum_{i=1}^{n}p_i(w^\top x_i-y_i)^2+\lambda w^\top w.$$
a
Rephrase the previous equation in terms of the matrices $X=[x_1,\ldots,x_n]^\top$, $Y=[y_1,\ldots,y_n]^\top$, and the diagonal matrix $P=\operatorname{diag}([p_1,\ldots,p_n])$ (where the $\operatorname{diag}$ operator behaves like the Matlab function of the same name).
b
Derive a closed-form solution for $w$. (You can use: $\frac{\partial(w^\top A)}{\partial w}=A$, $\frac{\partial(w^\top Bw)}{\partial w}=Bw+B^\top w$, and $w^\top w=w^\top I w$ where $I$ is the identity matrix.)
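One way to check whatever closed form you derive is to minimize the loss numerically and compare. A minimal sketch using NumPy and `scipy.optimize.minimize` on our own random data (names and sizes are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, d, lam = 30, 4, 0.5
X = rng.normal(size=(n, d))        # rows are x_i^T
Y = rng.normal(size=n)
p = rng.uniform(0.1, 2.0, size=n)  # non-negative example weights p_i

def weighted_ridge_loss(w):
    """L(w) = sum_i p_i (w^T x_i - y_i)^2 + lambda * w^T w"""
    return np.sum(p * (X @ w - Y) ** 2) + lam * (w @ w)

w_numeric = minimize(weighted_ridge_loss, np.zeros(d)).x
print(w_numeric)
# Your closed-form w from part b (built from X, Y, P = diag(p), and lambda)
# should match w_numeric to several decimal places.
```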
IV. Newton's Method
Let us revisit Logistic Regression, this time with $y_i\in\{0,1\}$.
a
Let $\sigma(a)=1/(1+e^{-a})$. Show that with these new labels the loss function can be written as
$$L(w)=-\sum_{i=1}^{n}\left(y_i\log\big(\sigma(w^\top x_i)\big)+(1-y_i)\log\big(1-\sigma(w^\top x_i)\big)\right)$$
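For reference, both this loss and the $\{+1,-1\}$ formulation from question I.c are easy to evaluate, and they agree numerically once the labels are re-coded (the algebraic argument is what part a asks for). A small sketch with our own toy data:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_01(w, X, y01):
    """-sum_i [ y_i log sigma(w^T x_i) + (1 - y_i) log(1 - sigma(w^T x_i)) ], y_i in {0, 1}."""
    s = sigmoid(X @ w)
    return -np.sum(y01 * np.log(s) + (1 - y01) * np.log(1 - s))

def loss_pm1(w, X, ypm):
    """sum_i log(1 + exp(-y_i w^T x_i)), y_i in {-1, +1} (question I.c)."""
    return np.sum(np.log(1.0 + np.exp(-ypm * (X @ w))))

rng = np.random.default_rng(3)
n, d = 10, 3
X = rng.normal(size=(n, d))
ypm = rng.choice([-1, 1], size=n)   # labels in {-1, +1}
y01 = (ypm + 1) // 2                # the same labels re-coded as {0, 1}
w = rng.normal(size=d)

print(loss_pm1(w, X, ypm), loss_01(w, X, y01))   # the two values coincide
```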
b
Show that the gradient of $L$ can be written as
$$\frac{\partial L}{\partial w}=-\sum_{i=1}^{n}\big(y_i-\sigma(w^\top x_i)\big)x_i.$$
(HINT: You can use the fact that $\frac{\partial\sigma(z)}{\partial z}=\sigma(z)(1-\sigma(z))$.)
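The stated gradient vectorizes neatly; a short sketch (our own names and toy data), again with labels in $\{0,1\}$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_grad(w, X, y01):
    """dL/dw = -sum_i (y_i - sigma(w^T x_i)) x_i  =  -X^T (y - sigma(X w))."""
    return -X.T @ (y01 - sigmoid(X @ w))

rng = np.random.default_rng(4)
n, d = 10, 3
X = rng.normal(size=(n, d))
y01 = rng.integers(0, 2, size=n)
w = rng.normal(size=d)
print(logistic_grad(w, X, y01))
# A finite-difference check against the part-a loss (as in question I) is a good sanity test.
```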
c
Let $W$ be the $n\times n$ diagonal matrix with $W_{ii}=\sigma(w^\top x_i)(1-\sigma(w^\top x_i))$, and let $X=[x_1,\ldots,x_n]^\top$. Show that the Hessian matrix is $H=X^\top W X$. For which examples is $W_{ii}$ large, and for which is it small?
d
Write down the update rule for a Newton step. Show that if you use the substitution $z$ where $z_i=x_i^\top w+\frac{1}{W_{ii}}\big(y_i-\sigma(w^\top x_i)\big)$, you arrive at
$$w_{\text{new}}\leftarrow(X^\top W X)^{-1}X^\top W z.$$
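Putting parts b-d together, one Newton step uses only the quantities defined above; a minimal sketch on our own random data (no safeguards against tiny $W_{ii}$ or separable data):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(5)
n, d = 50, 3
X = rng.normal(size=(n, d))      # rows are x_i^T
y = rng.integers(0, 2, size=n)   # labels in {0, 1}
w = np.zeros(d)

for _ in range(5):               # a few Newton / IRLS iterations
    s = sigmoid(X @ w)
    Wdiag = s * (1 - s)          # W_ii = sigma(w^T x_i)(1 - sigma(w^T x_i))
    z = X @ w + (y - s) / Wdiag  # z_i = x_i^T w + (y_i - sigma(w^T x_i)) / W_ii
    # w_new = (X^T W X)^{-1} X^T W z, computed by solving a linear system
    w = np.linalg.solve(X.T @ (Wdiag[:, None] * X), X.T @ (Wdiag * z))
print(w)
```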
e
Look at the result of Question III in the Weighted Ridge Regression section ($\lambda=0$). Hold your breath and enjoy the moment of amazement. Why is this algorithm also called Iteratively Reweighted Least Squares?
V. Regularized Logistic Regression & MAP estimate
a
You just used your brilliant new algorithm from part IV, which gave great results on your training set but abysmal results on your testing set.
You stare at your screen in disbelief until it hits you: "My algorithm must have overfitted!".
One way to reduce overfitting is by adding a regularization term to the loss function. For example, look at Ridge Regression (I.a): it is just a regularized version of ordinary least squares, where the regularization term is $\lambda\|w\|_2^2$ ($\lambda$ is the regularization parameter and is fixed).
Define a "Regularized Logistic Regression" (RLR) loss function where the regularization term is a multiple $\lambda$ of the squared 2-norm of $w$ (as in Ridge Regression).
b
Another reasonable (seemingly unrelated) way to compute an estimate for logistic regression is through the Maximum A Posteriori (MAP) estimate:
$$w_{\text{MAP}}=\arg\max_{w}\prod_{i=1}^{N}P(y_i\mid x_i,w)\,P(w)$$
where we assume that:
- (Logistic Regression Condition) $P(y_i\mid x_i,w)=\frac{1}{1+\exp(-y_i w^\top x_i)}$ for all $i\in 1,\ldots,N$
- (Prior on $w$) $w$ is normally distributed with zero mean, and its covariance matrix is a multiple of the identity matrix, i.e. $P(w)=\prod_{j=1}^{d}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{w_j^2}{2\sigma^2}\right)$ for some given $\sigma$.
Show that for a particular value of $\sigma$ and $\lambda$, the MAP estimate is the same as the $w$ obtained by minimizing the RLR loss.
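A numerical way to build intuition (not a substitute for the derivation): take the negative logarithm of the quantity being maximized and evaluate it directly from the two assumptions above. A sketch with our own toy data:

```python
import numpy as np

def neg_log_posterior(w, X, ypm, sigma):
    """-sum_i log P(y_i | x_i, w) - log P(w), using the likelihood and prior above."""
    # Likelihood: P(y_i | x_i, w) = 1 / (1 + exp(-y_i w^T x_i)), y_i in {-1, +1}
    nll = np.sum(np.log(1.0 + np.exp(-ypm * (X @ w))))
    # Prior: independent N(0, sigma^2) on each coordinate w_j
    d = len(w)
    nlp = np.sum(w ** 2) / (2 * sigma ** 2) + d * np.log(np.sqrt(2 * np.pi) * sigma)
    return nll + nlp

rng = np.random.default_rng(6)
n, d, sigma = 15, 3, 1.0
X = rng.normal(size=(n, d))
ypm = rng.choice([-1, 1], size=n)
w = rng.normal(size=d)
print(neg_log_posterior(w, X, ypm, sigma))
# Maximizing the posterior is the same as minimizing this function; part b asks
# you to relate it to your RLR loss.
```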