Lecture 11: Reduction from Elastic Net to SVM



Let's say that for every elastic net problem there is an equivalent SVM problem, such that the elastic net solution we obtain from the SVM solution is optimal if and only if the SVM solution is optimal. Then we can take advantage of very efficient SVM solvers that utilize GPUs and multi-core CPUs in order to solve the elastic net problem. So, how can we reduce an elastic net problem to an SVM problem?

Elastic Net

Given input data $x_1, \dots, x_n \in \mathbb{R}^d$ with real labels $y_1, \dots, y_n \in \mathbb{R}$,
we want to find $w \in \mathbb{R}^d$ that minimizes $\sum_{i=1}^n (x_i^\top w - y_i)^2 + \lambda \|w\|_2^2$ subject to $\|w\|_1 \le t$.
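
For concreteness, here is a minimal NumPy sketch of this objective; the function name and the one-row-per-sample layout of $X$ are our own choices, and the $\ell_1$ constraint is assumed to be enforced separately by the solver:

```python
import numpy as np

def elastic_net_objective(w, X, y, lam):
    """Elastic net objective: squared residuals plus an l2 penalty.
    X is n x d (one row per sample x_i); the constraint ||w||_1 <= t
    is not checked here and must be enforced by the solver."""
    residuals = X @ w - y
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)
```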

SVM

Given $\hat{X} = [\hat{x}_1, \hat{x}_2, \dots, \hat{x}_m]$ and $\hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_m]$ where $\hat{x}_i \in \mathbb{R}^p$ and $\hat{y}_i \in \mathbb{R}$, and some constant $C$,

we want to find $\hat{w} \in \mathbb{R}^p$ that minimizes $C \sum_{i=1}^m \max(1 - \hat{w}^\top \hat{x}_i \hat{y}_i, 0)^2 + \hat{w}^\top \hat{w}$.
(This is the SVM formulation with squared hinge loss and without a bias term.)
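
A matching NumPy sketch of this objective (again, names and the one-row-per-point layout are our own):

```python
import numpy as np

def svm_objective(w_hat, X_hat, Y_hat, C):
    """Squared-hinge SVM objective, no bias term.
    X_hat is m x p (one row per point); Y_hat holds the m labels."""
    hinges = np.maximum(1.0 - Y_hat * (X_hat @ w_hat), 0.0)
    return C * np.sum(hinges ** 2) + w_hat @ w_hat
```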

Reduction

Given an input to the elastic net problem, we need to transform it into a corresponding input to the SVM problem.

The input to the elastic net problem is $X = [x_1, \dots, x_n]$ and $Y = [y_1, y_2, \dots, y_n]$ where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.

Then we can create an input to the SVM problem in the following way:

For $\alpha = 1, \dots, d$, let $x^\alpha = (x_{1\alpha}, \dots, x_{n\alpha})^\top \in \mathbb{R}^n$ collect the $\alpha$-th feature of all $n$ samples, and let $y = (y_1, \dots, y_n)^\top$. We create two points with opposite labels:

$$\hat{x}_\alpha = x^\alpha - \frac{y}{t} \ \text{ with } \ \hat{y}_\alpha = +1, \qquad \hat{x}_{\alpha+d} = x^\alpha + \frac{y}{t} \ \text{ with } \ \hat{y}_{\alpha+d} = -1.$$

We have $\hat{X} = [\hat{x}_1, \hat{x}_2, \dots, \hat{x}_{2d}]$ and $\hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{2d}]$ where $\hat{x}_i \in \mathbb{R}^n$ and $\hat{y}_i \in \{-1, +1\}$, and some constant $C \ge 0$. Hence, we can solve this SVM problem (with $m = 2d$ and $p = n$) in order to find the $\hat{w}$ that minimizes $C \sum_{i=1}^{2d} \max(1 - \hat{w}^\top \hat{x}_i \hat{y}_i, 0)^2 + \hat{w}^\top \hat{w}$.
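
A sketch of this transformation, under the same assumed layout ($X$ stored as an $n \times d$ array):

```python
import numpy as np

def build_svm_input(X, y, t):
    """Turn an elastic net instance (X: n x d, y: n) into an SVM instance
    with 2d points in R^n: feature column x^alpha gives the point
    x^alpha - y/t with label +1 and the point x^alpha + y/t with label -1."""
    cols = X.T                          # row alpha is feature column x^alpha
    X_hat = np.vstack([cols - y / t,    # points 1..d,     labels +1
                       cols + y / t])   # points d+1..2d,  labels -1
    d = X.shape[1]
    Y_hat = np.concatenate([np.ones(d), -np.ones(d)])
    return X_hat, Y_hat
```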
Now, from $\hat{w}$, we can recover the elastic net solution, i.e., the $w \in \mathbb{R}^d$ that minimizes $\sum_{i=1}^n (x_i^\top w - y_i)^2 + \lambda \|w\|_2^2$ subject to $\|w\|_1 \le t$.

Calculating w

Let $h_\alpha = C \max(1 - \hat{w}^\top \hat{x}_\alpha \hat{y}_\alpha, 0)$ denote the (scaled) hinge loss of point $\hat{x}_\alpha$, for $\alpha = 1, \dots, 2d$. Then, for $\alpha = 1, \dots, d$:

$$w_\alpha = t \, \frac{h_\alpha - h_{\alpha+d}}{\sum_{\beta=1}^{2d} h_\beta}$$
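
A sketch of this recovery step, continuing with the helpers defined above:

```python
import numpy as np

def recover_w(w_hat, X_hat, Y_hat, t, C):
    """Recover the elastic net solution from the SVM solution: w is a
    signed, t-scaled normalization of the hinge losses of the 2d points."""
    h = C * np.maximum(1.0 - Y_hat * (X_hat @ w_hat), 0.0)
    d = h.shape[0] // 2
    return t * (h[:d] - h[d:]) / np.sum(h)
```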

Intuition

For each feature $\alpha$ of the original elastic net problem we create two points for the SVM problem. These points have $n$ dimensions, one for each sample. The $i$-th dimension of such a point is the value of feature $\alpha$ in sample $x_i$, plus or minus the label $y_i$ scaled by $\frac{1}{t}$.

The SVM finds a separating hyperplane. Only support vectors, i.e., points that violate the margin constraint or lie exactly on the margin, contribute to the loss function (the hinge loss of a point $\hat{x}_\alpha$ is $C \max(1 - \hat{w}^\top \hat{x}_\alpha \hat{y}_\alpha, 0)$). And we know that the number of vectors that violate the margin constraint is typically small. These "points" correspond to the features to which the elastic net assigns non-zero weights.

Note that if $\lambda = 0$, which is equivalent to pure $\ell_1$ regularization, then $C$ becomes infinitely large. In this case the SVM problem becomes identical to running a hard-margin SVM, i.e., an SVM without slack variables.
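
To tie the steps together, here is a toy end-to-end sanity check. It reuses `build_svm_input` and `recover_w` from the sketches above, assumes the mapping $C = 1/\lambda$ (our reading, consistent with $C \to \infty$ as $\lambda \to 0$; the exact constant depends on scaling conventions), and stands in a generic quasi-Newton optimizer for the dedicated SVM solver one would use in practice:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, t, lam = 20, 5, 1.0, 0.1           # toy sizes and hyperparameters
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

X_hat, Y_hat = build_svm_input(X, y, t)  # from the sketches above
C = 1.0 / lam                            # assumed lambda <-> C mapping

def svm_obj(w_hat):
    hinges = np.maximum(1.0 - Y_hat * (X_hat @ w_hat), 0.0)
    return C * np.sum(hinges ** 2) + w_hat @ w_hat

w_hat = minimize(svm_obj, np.zeros(n), method="L-BFGS-B").x
w = recover_w(w_hat, X_hat, Y_hat, t, C)
print(np.sum(np.abs(w)))                 # ~t when the l1 constraint is active
```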