Loss and Regularizer | Comments
Ordinary Least Squares
\(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}\)

 Squared Loss
 No Regularization
 Closed form solution:
 \(\mathbf{w}=(\mathbf{X}\mathbf{X}^{\top})^{-1}\mathbf{X}\mathbf{y}^{\top}\), where
 \(\mathbf{X}=[\mathbf{x}_{1}, \ldots, \mathbf{x}_{n}]\) and
 \(\mathbf{y}=[y_{1}, \ldots, y_{n}]\)
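A minimal NumPy sketch of the closed form above, on hypothetical noiseless toy data, using the same column-wise convention (each column of \(\mathbf{X}\) is one example):

```python
import numpy as np

# Hypothetical toy data: columns of X are the n examples (d x n),
# matching X = [x_1, ..., x_n] above.
rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X  # noiseless targets, shape (n,)

# Closed-form OLS solution: w = (X X^T)^{-1} X y^T.
# np.linalg.solve is preferred over forming the inverse explicitly.
w = np.linalg.solve(X @ X.T, X @ y)
```

With noiseless data, `w` recovers `w_true` up to floating-point precision.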

Ridge Regression
\(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\|\mathbf{w}\|_{2}^{2}\)

 Squared Loss
 \(l_{2}\) Regularization
 Closed form solution:
 \(\mathbf{w}=(\mathbf{X}\mathbf{X}^{\top}+\lambda\mathbb{I})^{-1}\mathbf{X}\mathbf{y}^{\top}\)
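The ridge closed form differs from OLS only by the \(\lambda\mathbb{I}\) term. A sketch on hypothetical toy data (the value of `lam` is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 50
X = rng.normal(size=(d, n))  # columns are examples, as above
y = rng.normal(size=n)
lam = 0.1                    # hypothetical regularization strength

# Ridge closed form: w = (X X^T + lambda I)^{-1} X y^T
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# For comparison, the unregularized OLS solution
w_ols = np.linalg.solve(X @ X.T, X @ y)
```

The \(l_2\) penalty shrinks the solution: `np.linalg.norm(w_ridge)` is no larger than `np.linalg.norm(w_ols)`.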

Lasso
\(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\|\mathbf{w}\|_{1}\)

 + Sparsity inducing (good for feature selection)
 + Convex
 - Not strictly convex (no unique solution)
 - Not differentiable (at 0)
 Solve with (sub)gradient descent or SVEN

Elastic Net
\(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\alpha\|\mathbf{w}\|_{1}+(1-\alpha)\|\mathbf{w}\|_{2}^{2}\)
\(\alpha\in[0, 1)\)

 + Strictly convex (i.e. unique solution)
 + Sparsity inducing (good for feature selection)
 + Dual of squared-loss SVM, see SVEN
 - Non-differentiable

Logistic Regression
\(\min_{\mathbf{w},b} \frac{1}{n}\sum\limits_{i=1}^n \log{(1+e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_{i}+b)})}\)

 Often \(l_{1}\) or \(l_{2}\) Regularized
 Solve with gradient descent.
 \(\Pr{(y\mid\mathbf{x})}=\frac{1}{1+e^{-y(\mathbf{w}^{\top}\mathbf{x}+b)}}\)
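A minimal gradient-descent sketch for the logistic loss on hypothetical linearly separable toy data with labels in \(\{-1,+1\}\) (`step` and the iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 2, 200
X = rng.normal(size=(d, n))        # columns are examples
w_true, b_true = np.array([1.5, -1.0]), 0.3
y = np.sign(w_true @ X + b_true)   # labels in {-1, +1}

w, b, step = np.zeros(d), 0.0, 0.1
for _ in range(500):
    margins = y * (w @ X + b)              # y_i (w^T x_i + b)
    p = 1.0 / (1.0 + np.exp(margins))      # = 1 - Pr(y_i | x_i)
    # gradient of (1/n) sum log(1 + e^{-margin_i})
    grad_w = -(X @ (y * p)) / n
    grad_b = -np.sum(y * p) / n
    w -= step * grad_w
    b -= step * grad_b

train_acc = np.mean(np.sign(w @ X + b) == y)
```

Since the toy data is separable by construction, the learned classifier reaches near-perfect training accuracy.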

Linear Support Vector Machine
\(\min_{\mathbf{w},b} C\sum\limits_{i=1}^n \max[1-y_{i}(\mathbf{w}^{\top}\mathbf{x}_{i}+b), 0]+\|\mathbf{w}\|_{2}^{2}\)

 Typically \(l_2\) regularized (sometimes \(l_1\)).
 Quadratic program.
 When kernelized, leads to sparse solutions.
 Kernelized version can be solved very efficiently with specialized algorithms (e.g. SMO)
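Although specialized solvers (e.g. SMO) are standard, the primal objective above can also be minimized directly by subgradient descent. A sketch on hypothetical separable toy data (`C` and `step` are hand-picked; SMO is not shown):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 2, 200
X = rng.normal(size=(d, n))               # columns are examples
y = np.sign(np.array([1.0, -1.0]) @ X + 0.2)
C, step = 1.0, 0.001

w, b = np.zeros(d), 0.0
for _ in range(2000):
    margins = y * (w @ X + b)
    active = margins < 1                  # examples with nonzero hinge loss
    # subgradient of C * sum(hinge_i) + ||w||_2^2
    grad_w = -C * (X[:, active] @ y[active]) + 2.0 * w
    grad_b = -C * np.sum(y[active])
    w -= step * grad_w
    b -= step * grad_b

train_acc = np.mean(np.sign(w @ X + b) == y)
```

Only margin-violating examples contribute to the subgradient, which mirrors why the solution depends only on support vectors.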
