Lecture 11: Reduction from Elastic Net to SVM
Suppose that for every elastic net problem there is an equivalent SVM problem, such that the elastic net solution we obtain from the SVM solution is optimal if and only if the SVM solution is optimal. Then we can take advantage of very efficient SVM solvers that utilize GPUs and multi-core CPUs in order to solve the elastic net problem. So, how can we reduce an elastic net problem to an SVM problem?
Elastic Net
Given input data $x_1, \dots, x_n \in \mathbb{R}^d$ with real-valued labels $y_1, \dots, y_n \in \mathbb{R}$,
we want to find $w \in \mathbb{R}^d$ that minimizes
$$\sum_{i=1}^{n} \left(x_i^\top w - y_i\right)^2 + \lambda \|w\|_2^2 \quad \text{subject to: } \|w\|_1 \leq t$$
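To make the notation concrete, here is a minimal numpy sketch (the function names are ours, not part of the lecture) that evaluates this objective and checks the $\ell_1$ constraint:

```python
import numpy as np

def elastic_net_objective(w, X, y, lam):
    """Squared error plus ridge penalty: sum_i (x_i^T w - y_i)^2 + lam * ||w||_2^2."""
    residuals = X @ w - y              # X: (n, d) with one sample per row, y: (n,)
    return np.sum(residuals ** 2) + lam * np.dot(w, w)

def satisfies_l1_constraint(w, t):
    """Feasibility check for the constraint ||w||_1 <= t."""
    return np.sum(np.abs(w)) <= t
```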
SVM
Given $\hat{X} = [\hat{x}_1, \hat{x}_2, \dots, \hat{x}_m]$ and $\hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_m]$ where $\hat{x}_i \in \mathbb{R}^p$ and $\hat{y}_i \in \mathbb{R}$ and some constant $C$,
we want to find $\hat{w} \in \mathbb{R}^p$ that minimizes
$$C \sum_{i=1}^{m} \max\left(1 - \hat{w}^\top \hat{x}_i \hat{y}_i, 0\right)^2 + \hat{w}^\top \hat{w}$$
(This is the formulation of the SVM with squared hinge loss and without a bias term.)
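For comparison with the elastic net sketch above, here is a corresponding sketch of this squared-hinge SVM objective (again with hypothetical names):

```python
import numpy as np

def svm_objective(w_hat, X_hat, y_hat, C):
    """C * sum_i max(1 - y_i * w^T x_i, 0)^2 + w^T w  (squared hinge loss, no bias)."""
    margins = y_hat * (X_hat @ w_hat)      # X_hat: (m, p) with one point per row
    hinge = np.maximum(1.0 - margins, 0.0)
    return C * np.sum(hinge ** 2) + np.dot(w_hat, w_hat)
```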
Reduction
Given an input to the elastic net problem, we need to transform it such that we have a corresponding input to the SVM problem.
The input to the elastic net problem is $X = [x_1, \dots, x_n]$ and $Y = [y_1, y_2, \dots, y_n]$ where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.
Then we can create an input to the SVM problem in the following way (a short code sketch follows the list):
For $\alpha = 1, \dots, d$:
- $\hat{x}_\alpha = \begin{bmatrix} [x_1]_\alpha + \frac{y_1}{t} \\ \vdots \\ [x_n]_\alpha + \frac{y_n}{t} \end{bmatrix} \in \mathbb{R}^n$ with: $\hat{y}_\alpha = +1$
- $\hat{x}_{\alpha+d} = \begin{bmatrix} [x_1]_\alpha - \frac{y_1}{t} \\ \vdots \\ [x_n]_\alpha - \frac{y_n}{t} \end{bmatrix} \in \mathbb{R}^n$ with: $\hat{y}_{\alpha+d} = -1$
- and regularization constant: $C = \frac{1}{2\lambda}$
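Here is a minimal numpy sketch of this construction, assuming $X$ is an $n \times d$ matrix with one sample per row and $y$ a length-$n$ label vector (the names are illustrative, not from the lecture):

```python
import numpy as np

def elastic_net_to_svm(X, y, t, lam):
    """Build the 2d SVM points (each n-dimensional) and labels from the elastic net input."""
    n, d = X.shape
    shift = (y / t)[:, None]                 # y_i / t, broadcast over all features
    X_plus  = X + shift                      # entries [x_i]_alpha + y_i / t
    X_minus = X - shift                      # entries [x_i]_alpha - y_i / t
    # Each SVM point is one *feature column*, so stack the columns as rows.
    X_hat = np.vstack([X_plus.T, X_minus.T])             # shape (2d, n)
    y_hat = np.concatenate([np.ones(d), -np.ones(d)])    # +1 for alpha, -1 for alpha+d
    C = 1.0 / (2.0 * lam)
    return X_hat, y_hat, C
```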
We have $\hat{X} = [\hat{x}_1, \hat{x}_2, \dots, \hat{x}_{2d}]$ and $\hat{Y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{2d}]$ where $\hat{x}_i \in \mathbb{R}^n$ and $\hat{y}_i \in \{-1, +1\}$ and some constant $C \geq 0$. Hence, we can solve this SVM problem in order to find $\hat{w} \in \mathbb{R}^n$ that minimizes
$$C \sum_{i=1}^{2d} \max\left(1 - \hat{w}^\top \hat{x}_i \hat{y}_i, 0\right)^2 + \hat{w}^\top \hat{w}$$
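Any solver for the squared-hinge, no-bias SVM can be used here; the efficient GPU and multi-core solvers mentioned above are the point of the reduction. Purely as an illustration, here is a sketch that minimizes this (smooth) objective with scipy's L-BFGS:

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm(X_hat, y_hat, C):
    """Minimize C * sum_i max(1 - y_i w^T x_i, 0)^2 + w^T w (squared hinge, no bias)."""
    m, p = X_hat.shape

    def objective(w):
        hinge = np.maximum(1.0 - y_hat * (X_hat @ w), 0.0)
        return C * np.sum(hinge ** 2) + np.dot(w, w)

    def gradient(w):
        hinge = np.maximum(1.0 - y_hat * (X_hat @ w), 0.0)
        return -2.0 * C * (X_hat.T @ (y_hat * hinge)) + 2.0 * w

    result = minimize(objective, np.zeros(p), jac=gradient, method="L-BFGS-B")
    return result.x
```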
Now, from $\hat{w}$, we can recover the elastic net solution, i.e. the $w \in \mathbb{R}^d$ that minimizes
$$\sum_{i=1}^{n} \left(x_i^\top w - y_i\right)^2 + \lambda \|w\|_2^2 \quad \text{subject to: } \|w\|_1 \leq t$$
Calculating w
For $\alpha = 1, \dots, 2d$, let $h_\alpha = C \cdot \max\left(1 - \hat{w}^\top \hat{x}_\alpha \hat{y}_\alpha, 0\right)$, which is the hinge loss of point $\hat{x}_\alpha$.
Now, for $\alpha = 1, \dots, d$:
$$w_\alpha = t \cdot \frac{h_\alpha - h_{\alpha+d}}{\sum_{\beta=1}^{2d} h_\beta}$$
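Continuing the sketch above, the recovery step reads the hinge losses off the SVM solution and rescales them (this assumes the sum of hinge losses is non-zero, i.e. the $\ell_1$ constraint is active at the optimum):

```python
import numpy as np

def recover_w(w_hat, X_hat, y_hat, C, t, d):
    """Recover the elastic net weights from the SVM solution via the hinge losses h_alpha."""
    h = C * np.maximum(1.0 - y_hat * (X_hat @ w_hat), 0.0)   # h_alpha for alpha = 1..2d
    # Assumes sum(h) > 0, i.e. the l1 constraint is tight at the optimum.
    return t * (h[:d] - h[d:]) / np.sum(h)
```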
Intuition
For each feature $\alpha$ of the original elastic net problem we create two points for the SVM problem. These points have $n$ dimensions, one for each sample. The $i$-th dimension of such a point is the value of feature $\alpha$ in sample $x_i$, plus or minus the label $y_i$ scaled by $\frac{1}{t}$.
The SVM finds a separating hyperplane. Only support vectors, i.e. points that violate the margin constraint or lie exactly on the margin, contribute to the loss function (the hinge loss of a point $\hat{x}_\alpha$ is $C \cdot \max\left(1 - \hat{w}^\top \hat{x}_\alpha \hat{y}_\alpha, 0\right)$). Typically only a few points violate the margin constraint. These "points" correspond to the features to which the elastic net assigns non-zero weights.
Note that if $\lambda = 0$, which is equivalent to pure $\ell_1$ regularization, then $C = \frac{1}{2\lambda}$ becomes infinitely large. In this case the SVM problem becomes identical to running a hard-margin SVM, i.e. an SVM without slack variables.
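Putting the sketches above together, a quick end-to-end check on synthetic data might look as follows (reusing the hypothetical helpers defined earlier; the data is random and only meant to exercise the reduction):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small synthetic elastic net instance (dense ground truth so the l1 constraint binds).
n, d = 50, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)
lam, t = 0.5, 1.0

# Reduce to an SVM instance, solve it, and recover the elastic net weights.
X_hat, y_hat, C = elastic_net_to_svm(X, y, t, lam)
w_hat = solve_svm(X_hat, y_hat, C)
w = recover_w(w_hat, X_hat, y_hat, C, t, d)

print("||w||_1 =", np.sum(np.abs(w)))                       # should be <= t
print("objective =", elastic_net_objective(w, X, y, lam))
```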