Lecture 11: Reduction from Elastic Net to SVM
Suppose that for every elastic net problem there is an equivalent SVM problem, such that the elastic net solution we obtain from the SVM solution is optimal if and only if the SVM solution is optimal. Then we can take advantage of very efficient SVM solvers that utilize GPUs and multi-core CPUs in order to solve the elastic net problem. So, how can we reduce an elastic net problem to an SVM problem?
Elastic Net
Given input data $x_1, \dots, x_n \in \mathbb{R}^d$ with real-valued labels $y_1, \dots, y_n \in \mathbb{R}$,
we want to find $w \in \mathbb{R}^d$ that minimizes
$$\sum_{i=1}^{n} \left(x_i^\top w - y_i\right)^2 + \lambda \|w\|_2^2 \quad \text{subject to: } \|w\|_1 \leq t$$
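To make the notation concrete, here is a minimal numpy sketch (the function names are ours, not part of the lecture) that evaluates this objective and checks the $\ell_1$ constraint:

```python
import numpy as np

def elastic_net_objective(w, X, y, lam):
    """Squared error plus ridge penalty: sum_i (x_i^T w - y_i)^2 + lam * ||w||_2^2."""
    residuals = X @ w - y              # X: (n, d) with one sample per row, y: (n,)
    return np.sum(residuals ** 2) + lam * np.dot(w, w)

def satisfies_l1_constraint(w, t):
    """Feasibility check for the constraint ||w||_1 <= t."""
    return np.sum(np.abs(w)) <= t
```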
SVM
Given $\hat{X} = [\hat{x}_1, \hat{x}_2, \dots, \hat{x}_m]$ and $\hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_m]$ where $\hat{x}_i \in \mathbb{R}^p$ and $\hat{y}_i \in \mathbb{R}$ and some constant $C$,
we want to find $\hat{w} \in \mathbb{R}^p$ that minimizes
$$C \sum_{i=1}^{m} \max\left(1 - \hat{w}^\top \hat{x}_i \hat{y}_i, 0\right)^2 + \hat{w}^\top \hat{w}$$
(This is the formulation of the SVM with squared hinge loss and without a bias term.)
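For comparison with the elastic net sketch above, here is a corresponding sketch of this squared-hinge SVM objective (again with hypothetical names):

```python
import numpy as np

def svm_objective(w_hat, X_hat, y_hat, C):
    """C * sum_i max(1 - y_i * w^T x_i, 0)^2 + w^T w  (squared hinge loss, no bias)."""
    margins = y_hat * (X_hat @ w_hat)      # X_hat: (m, p) with one point per row
    hinge = np.maximum(1.0 - margins, 0.0)
    return C * np.sum(hinge ** 2) + np.dot(w_hat, w_hat)
```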
Reduction
Given an input to the elastic net problem, we need to transform it such that we have a corresponding input to the SVM problem.
The input to the elastic net problem is $X = [x_1, \dots, x_n]$ and $Y = [y_1, y_2, \dots, y_n]$ where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.
Then we can create an input to the SVM problem in the following way (a short code sketch follows the list):
For $\alpha = 1, \dots, d$:
- $\hat{x}_\alpha = \begin{bmatrix} [x_1]_\alpha + \frac{y_1}{t} \\ \vdots \\ [x_n]_\alpha + \frac{y_n}{t} \end{bmatrix} \in \mathbb{R}^n$ with: $\hat{y}_\alpha = +1$
- $\hat{x}_{\alpha+d} = \begin{bmatrix} [x_1]_\alpha - \frac{y_1}{t} \\ \vdots \\ [x_n]_\alpha - \frac{y_n}{t} \end{bmatrix} \in \mathbb{R}^n$ with: $\hat{y}_{\alpha+d} = -1$
- and regularization constant: $C = \frac{1}{2\lambda}$
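Here is a minimal numpy sketch of this construction, assuming $X$ is an $n \times d$ matrix with one sample per row and $y$ a length-$n$ label vector (the names are illustrative, not from the lecture):

```python
import numpy as np

def elastic_net_to_svm(X, y, t, lam):
    """Build the 2d SVM points (each n-dimensional) and labels from the elastic net input."""
    n, d = X.shape
    shift = (y / t)[:, None]                 # y_i / t, broadcast over all features
    X_plus  = X + shift                      # entries [x_i]_alpha + y_i / t
    X_minus = X - shift                      # entries [x_i]_alpha - y_i / t
    # Each SVM point is one *feature column*, so stack the columns as rows.
    X_hat = np.vstack([X_plus.T, X_minus.T])             # shape (2d, n)
    y_hat = np.concatenate([np.ones(d), -np.ones(d)])    # +1 for alpha, -1 for alpha+d
    C = 1.0 / (2.0 * lam)
    return X_hat, y_hat, C
```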
We have $\hat{X} = [\hat{x}_1, \hat{x}_2, \dots, \hat{x}_{2d}]$ and $\hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{2d}]$ where $\hat{x}_i \in \mathbb{R}^n$ and $\hat{y}_i \in \{-1, +1\}$ and some constant $C \geq 0$. Hence, we can solve this SVM problem in order to find $\hat{w} \in \mathbb{R}^n$ that minimizes
$$C \sum_{i=1}^{2d} \max\left(1 - \hat{w}^\top \hat{x}_i \hat{y}_i, 0\right)^2 + \hat{w}^\top \hat{w}$$
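Any solver for the squared-hinge, no-bias SVM can be used here; the efficient GPU and multi-core solvers mentioned above are the point of the reduction. Purely as an illustration, here is a sketch that minimizes this (smooth) objective with scipy's L-BFGS:

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm(X_hat, y_hat, C):
    """Minimize C * sum_i max(1 - y_i w^T x_i, 0)^2 + w^T w (squared hinge, no bias)."""
    m, p = X_hat.shape

    def objective(w):
        hinge = np.maximum(1.0 - y_hat * (X_hat @ w), 0.0)
        return C * np.sum(hinge ** 2) + np.dot(w, w)

    def gradient(w):
        hinge = np.maximum(1.0 - y_hat * (X_hat @ w), 0.0)
        return -2.0 * C * (X_hat.T @ (y_hat * hinge)) + 2.0 * w

    result = minimize(objective, np.zeros(p), jac=gradient, method="L-BFGS-B")
    return result.x
```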
Now, from $\hat{w}$, we can recover the elastic net solution, i.e. the $w \in \mathbb{R}^d$ that minimizes
$$\sum_{i=1}^{n} \left(x_i^\top w - y_i\right)^2 + \lambda \|w\|_2^2 \quad \text{subject to: } \|w\|_1 \leq t$$
Calculating w
For $\alpha = 1, \dots, 2d$, let $h_\alpha = C \cdot \max\left(1 - \hat{w}^\top \hat{x}_\alpha \hat{y}_\alpha, 0\right)$, which is the hinge loss of point $\hat{x}_\alpha$.
Now, for $\alpha = 1, \dots, d$:
$$w_\alpha = t \cdot \frac{h_\alpha - h_{\alpha+d}}{\sum_{\beta=1}^{2d} h_\beta}$$
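Continuing the sketch above, the recovery step reads the hinge losses off the SVM solution and rescales them (this assumes the sum of hinge losses is non-zero, i.e. the $\ell_1$ constraint is active at the optimum):

```python
import numpy as np

def recover_w(w_hat, X_hat, y_hat, C, t, d):
    """Recover the elastic net weights from the SVM solution via the hinge losses h_alpha."""
    h = C * np.maximum(1.0 - y_hat * (X_hat @ w_hat), 0.0)   # h_alpha for alpha = 1..2d
    # Assumes sum(h) > 0, i.e. the l1 constraint is tight at the optimum.
    return t * (h[:d] - h[d:]) / np.sum(h)
```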
Intuition
For each feature $\alpha$ of the original elastic net problem we create two points for the SVM problem. These points have $n$ dimensions, one for each sample. The $i$-th dimension of such a point is the value of feature $\alpha$ in sample $x_i$, plus or minus the label $y_i$ scaled by $\frac{1}{t}$.
The SVM finds a separating hyperplane. Only support vectors, i.e. points that violate the margin constraint or lie exactly on the margin, contribute to the loss function (the hinge loss of a point $\hat{x}_\alpha$ is $C \cdot \max\left(1 - \hat{w}^\top \hat{x}_\alpha \hat{y}_\alpha, 0\right)$). Typically only a few points violate the margin constraint. These "points" correspond to the features to which the elastic net assigns non-zero weights.
Note that if $\lambda = 0$, which is equivalent to pure $\ell_1$ regularization, then $C = \frac{1}{2\lambda}$ becomes infinitely large. In this case the SVM problem becomes identical to running a hard-margin SVM, i.e. an SVM without slack variables.
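Putting the sketches above together, a quick end-to-end check on synthetic data might look as follows (reusing the hypothetical helpers defined earlier; the data is random and only meant to exercise the reduction):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small synthetic elastic net instance (dense ground truth so the l1 constraint binds).
n, d = 50, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)
lam, t = 0.5, 1.0

# Reduce to an SVM instance, solve it, and recover the elastic net weights.
X_hat, y_hat, C = elastic_net_to_svm(X, y, t, lam)
w_hat = solve_svm(X_hat, y_hat, C)
w = recover_w(w_hat, X_hat, y_hat, C, t, d)

print("||w||_1 =", np.sum(np.abs(w)))                       # should be <= t
print("objective =", elastic_net_objective(w, X, y, lam))
```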