Lecture 11: Reduction from Elastic Net to SVM
Suppose that for every elastic net problem there is an equivalent SVM problem, such that the elastic net solution we recover from the SVM solution is optimal if and only if the SVM solution is optimal. Then we can take advantage of very efficient SVM solvers that utilize GPUs and multi-core CPUs in order to solve the elastic net problem. So, how can we reduce an elastic net problem to an SVM problem?
Elastic Net
Given input data \mathbf{x}_1, \dots, \mathbf{x}_n \in \mathbb{R}^d with real labels y_1, \dots, y_n \in \mathbb{R},
we want to find \mathbf{w} \in \mathbb{R}^d such that it minimizes
\sum_{i=1}^n (\mathbf{x}^\top_i \mathbf{w} - y_i)^2 + \lambda||\mathbf{w}||^2_2 \textrm{ subject to: } ||\mathbf{w}||_1 \le t
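For concreteness, here is a minimal NumPy sketch that evaluates this objective and checks the constraint (the names X, Y, w, lam, t are illustrative; X holds one sample per row):

```python
import numpy as np

def elastic_net_objective(w, X, Y, lam):
    """Squared-error loss plus l2 penalty: sum_i (x_i^T w - y_i)^2 + lam * ||w||_2^2."""
    residuals = X @ w - Y          # X is (n, d), one sample per row; Y is (n,)
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)

def satisfies_constraint(w, t):
    """The constraint ||w||_1 <= t."""
    return np.sum(np.abs(w)) <= t
```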
SVM
Given \hat{X}=[\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \dots, \hat{\mathbf{x}}_m] and \hat{Y}=[\hat{y}_1, \hat{y}_2, \dots, \hat{y}_m] where \hat{\mathbf{x}}_i \in \mathbb{R}^p and \hat{y}_i \in \{-1,+1\} and some constant C,
we want to find \hat{\mathbf{w}} \in \mathbb{R}^p such that it minimizes
C \sum_{i=1}^m \max(1-\hat{\mathbf{w}}^\top \hat{\mathbf{x}}_i \hat{y}_i, 0)^2 + \hat{\mathbf{w}}^\top \hat{\mathbf{w}}
(This is the formulation of the SVM with squared hinge loss and without a bias term.)
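For reference, a minimal NumPy sketch of this objective (the names X_hat, Y_hat, w_hat, C are illustrative; the rows of X_hat are the points \hat{\mathbf{x}}_i):

```python
import numpy as np

def svm_objective(w_hat, X_hat, Y_hat, C):
    """Squared-hinge-loss SVM without bias: C * sum_i max(1 - w^T x_i y_i, 0)^2 + w^T w."""
    margins = Y_hat * (X_hat @ w_hat)       # y_i * w_hat^T x_i for every point
    hinge = np.maximum(1.0 - margins, 0.0)  # hinge loss of each point
    return C * np.sum(hinge ** 2) + w_hat @ w_hat
```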
Reduction
Given an input to the elastic net problem, we need to transform it such that we have a corresponding input to the SVM problem.
The input to the elastic net problem is X=[\mathbf{x}_1, \dots, \mathbf{x}_n] and Y=[y_1, y_2, \dots, y_n] where \mathbf{x}_i \in \mathbb{R}^d and y_i \in \mathbb{R}.
Then we can create an input to the SVM problem in the following way:
For \alpha=1, \dots, d
- \hat{\mathbf{x}}_\alpha = \begin{bmatrix}
[x_1]_\alpha + \frac{y_1}{t} \\
\vdots \\
[x_n]_\alpha + \frac{y_n}{t}\\
\end{bmatrix}\in \mathbb{R}^n \textrm{ with: } \hat{y}_\alpha = + 1
- \hat{\mathbf{x}}_{\alpha+d} = \begin{bmatrix}
[x_1]_\alpha - \frac{y_1}{t} \\
\vdots \\
[x_n]_\alpha - \frac{y_n}{t}\\
\end{bmatrix}\in \mathbb{R}^n \textrm{ with: } \hat{y}_{\alpha+d} = - 1
- \textrm{ and regularization constant: } C = \frac{1}{2\lambda}
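A minimal NumPy sketch of this transformation (assuming X is stored as an n \times d matrix with one sample per row and \lambda > 0; all names are illustrative):

```python
import numpy as np

def elastic_net_to_svm(X, Y, t, lam):
    """Turn elastic net input (X, Y, t, lam) into SVM input (X_hat, Y_hat, C).

    X : (n, d) array, one sample per row.
    Y : (n,) array of real labels.
    Returns X_hat of shape (2d, n), Y_hat of shape (2d,), and C.
    """
    d = X.shape[1]
    # Row alpha of X.T holds the alpha-th feature of every sample.
    X_hat = np.vstack([X.T + Y / t,    # hat{x}_alpha       with label +1
                       X.T - Y / t])   # hat{x}_{alpha + d} with label -1
    Y_hat = np.concatenate([np.ones(d), -np.ones(d)])
    C = 1.0 / (2.0 * lam)              # assumes lam > 0
    return X_hat, Y_hat, C
```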
We have \hat{X} = [\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \dots, \hat{\mathbf{x}}_{2d}] and \hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{2d}] where \hat{\mathbf{x}}_i \in \mathbb{R}^n and \hat{y}_i \in \{-1,+1\} and some constant C \geq 0. Hence, we can solve this SVM problem (here m = 2d) in order to find \hat{\mathbf{w}} such that it minimizes
C \sum_{i=1}^{2d} \max(1-\hat{\mathbf{w}}^\top \hat{\mathbf{x}}_i \hat{y}_i, 0)^2 + \hat{\mathbf{w}}^\top \hat{\mathbf{w}}
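As a sketch, this objective can be minimized with a generic solver such as scipy.optimize.minimize; in practice one would hand \hat{X}, \hat{Y}, and C to a dedicated GPU or multi-core SVM solver instead (the function below is illustrative only):

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm(X_hat, Y_hat, C):
    """Minimize C * sum_i max(1 - w^T x_i y_i, 0)^2 + w^T w over w."""
    def objective(w_hat):
        hinge = np.maximum(1.0 - Y_hat * (X_hat @ w_hat), 0.0)
        return C * np.sum(hinge ** 2) + w_hat @ w_hat

    n = X_hat.shape[1]                           # dimensionality of the SVM points
    result = minimize(objective, x0=np.zeros(n), method="L-BFGS-B")
    return result.x                              # approximate hat{w}
```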
Now, from \hat{\mathbf{w}}, we can recover the \mathbf{w} \in \mathbb{R}^d that minimizes
\sum_{i=1}^n (\mathbf{x}^\top _i\mathbf{w}-y_i)^2 + \lambda||\mathbf{w}||^2_2 \textrm{ subject to: } ||\mathbf{w}||_1 \le t
Calculating w
For \alpha = 1, \dots, 2d, let h_\alpha = C \cdot \max(1-\hat{\mathbf{w}}^\top \hat{\mathbf{x}}_\alpha \hat{y}_\alpha, 0), which is the hinge loss of point \hat{\mathbf{x}}_\alpha.
Then, for \alpha = 1, \dots, d,
w_\alpha = t \cdot \frac{h_\alpha - h_{\alpha+d}}{\sum_{\beta=1}^{2d} h_\beta}
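Putting this last step into code, a minimal sketch (all names illustrative) that recovers \mathbf{w} from \hat{\mathbf{w}}:

```python
import numpy as np

def recover_w(w_hat, X_hat, Y_hat, C, t):
    """Recover the elastic net solution w from the SVM solution w_hat."""
    d = X_hat.shape[0] // 2
    # Hinge loss h_alpha of every constructed point hat{x}_alpha, alpha = 1, ..., 2d.
    h = C * np.maximum(1.0 - Y_hat * (X_hat @ w_hat), 0.0)
    # w_alpha = t * (h_alpha - h_{alpha + d}) / sum_beta h_beta
    return t * (h[:d] - h[d:]) / np.sum(h)
```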
Intuition
For each feature \alpha of the original elastic net problem we create two points for the SVM problem. These points have n dimensions, one for each sample. The i^{th} dimension of such a point is the \alpha^{th} feature value of sample \mathbf{x}_i, plus or minus the label y_i scaled by \frac{1}{t}.
The SVM finds a separating hyperplane. Only support vectors, i.e. points that violate the margin constraint or lie exactly on the margin, contribute to the loss function (the hinge loss of a point \hat{\mathbf{x}}_\alpha is C \cdot \max(1-\hat{\mathbf{w}}^\top \hat{\mathbf{x}}_\alpha \hat{y}_\alpha, 0)). Typically, only few points violate the margin constraint. These "points" correspond to the features to which the elastic net assigns non-zero weights.
Note that if \lambda=0, which corresponds to pure \ell_1 regularization (the lasso), then C becomes infinitely large. In this case the SVM problem becomes identical to a hard-margin SVM, i.e. an SVM without slack variables.