| Loss \(\ell(h_{\mathbf{w}}(\mathbf{x}_i),y_i)\) | Comments |
|---|---|
| Squared Loss \((h(\mathbf{x}_{i})-y_{i})^{2}\) | Most popular regression loss function. Estimates the mean label. ADVANTAGE: Differentiable everywhere. DISADVANTAGE: Somewhat sensitive to outliers/noise. Also known as Ordinary Least Squares (OLS). |
| Absolute Loss \(\lvert h(\mathbf{x}_{i})-y_{i}\rvert\) | Also a very popular loss function. Estimates the median label. ADVANTAGE: Less sensitive to noise. DISADVANTAGE: Not differentiable at \(0\). |
| Huber Loss \(\frac{1}{2}\left(h(\mathbf{x}_{i})-y_{i}\right)^{2}\) if \(\lvert h(\mathbf{x}_{i})-y_{i}\rvert<\delta\), otherwise \(\delta\left(\lvert h(\mathbf{x}_{i})-y_{i}\rvert-\frac{\delta}{2}\right)\) | Also known as Smooth Absolute Loss. ADVANTAGE: "Best of both worlds" of Squared and Absolute Loss. Once-differentiable. Behaves like Squared Loss when the residual is small and like Absolute Loss when it is large. |
| Log-Cosh Loss \(\log(\cosh(h(\mathbf{x}_{i})-y_{i}))\), where \(\cosh(x)=\frac{e^{x}+e^{-x}}{2}\) | ADVANTAGE: Similar to Huber Loss, but twice differentiable everywhere. |
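
The four regression losses in the table above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names, the default `delta=1.0`, and the toy label vector are assumptions made here. The last part checks numerically, by a simple grid search over constant predictions, that squared loss is minimized at the mean label and absolute loss at the median.

```python
import numpy as np

def squared_loss(pred, y):
    """(h(x) - y)^2, elementwise."""
    return (pred - y) ** 2

def absolute_loss(pred, y):
    """|h(x) - y|, elementwise."""
    return np.abs(pred - y)

def huber_loss(pred, y, delta=1.0):
    """Quadratic for |residual| < delta, linear otherwise (delta is an assumed default)."""
    r = np.abs(pred - y)
    return np.where(r < delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def log_cosh_loss(pred, y):
    """log(cosh(residual)), written in a form that does not overflow for large residuals."""
    a = np.abs(pred - y)
    # cosh(a) = e^a * (1 + e^(-2a)) / 2, so log(cosh(a)) = a + log1p(e^(-2a)) - log(2)
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

# Toy labels with one outlier (illustrative data, not from the notes).
labels = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Grid search over constant predictions c: which c minimizes the average loss?
candidates = np.linspace(0.0, 100.0, 10001)
sq_risk = np.array([squared_loss(c, labels).mean() for c in candidates])
abs_risk = np.array([absolute_loss(c, labels).mean() for c in candidates])

best_sq = candidates[np.argmin(sq_risk)]    # close to the mean of the labels
best_abs = candidates[np.argmin(abs_risk)]  # close to the median of the labels
print(best_sq, labels.mean())
print(best_abs, np.median(labels))
```

The single outlier pulls the squared-loss minimizer far away from the bulk of the labels, while the absolute-loss minimizer stays at the median, which is the sensitivity difference the table describes.
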

| Regularizer \(r(\mathbf{w})\) | Properties |
|---|---|
| \(l_{2}\)-Regularization \(r(\mathbf{w}) = \mathbf{w}^{\top}\mathbf{w} = \lVert\mathbf{w}\rVert_{2}^{2}\) | ADVANTAGE: Strictly convex. ADVANTAGE: Differentiable. DISADVANTAGE: Puts weight on all features, i.e. relies on all features to some degree (ideally we would like to avoid this). These are known as dense solutions. |
| \(l_{1}\)-Regularization \(r(\mathbf{w}) = \lVert\mathbf{w}\rVert_{1}\) | Convex (but not strictly). DISADVANTAGE: Not differentiable at \(0\) (the point to which minimization is intended to bring us). Effect: Sparse (i.e. not dense) solutions. |
| \(l_p\)-Norm \(\lVert\mathbf{w}\rVert_{p} = \left(\sum\limits_{i=1}^d \lvert w_{i}\rvert^{p}\right)^{1/p}\) | (often \(0<p\leq1\)) DISADVANTAGE: Non-convex (for \(p<1\)). ADVANTAGE: Very sparse solutions. Initialization dependent. DISADVANTAGE: Not differentiable. |
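
To make the dense-versus-sparse distinction in the regularizer table concrete, the sketch below evaluates the three regularizers on two weight vectors with the same squared \(l_2\) norm, one dense and one sparse. The function names and the example vectors are assumptions chosen for illustration. It also checks one midpoint inequality showing that the \(l_p\) expression with \(p=0.5\) is not convex.

```python
import numpy as np

def l2_regularizer(w):
    """r(w) = w^T w, the squared l2 norm."""
    return float(w @ w)

def l1_regularizer(w):
    """r(w) = sum_i |w_i|."""
    return float(np.sum(np.abs(w)))

def lp_norm(w, p):
    """(sum_i |w_i|^p)^(1/p); a true norm only for p >= 1."""
    return float(np.sum(np.abs(w) ** p) ** (1.0 / p))

# Two illustrative weight vectors with identical squared l2 norm.
w_dense = np.array([0.5, 0.5, 0.5, 0.5])
w_sparse = np.array([1.0, 0.0, 0.0, 0.0])

for name, w in [("dense", w_dense), ("sparse", w_sparse)]:
    print(name, l2_regularizer(w), l1_regularizer(w), lp_norm(w, 0.5))

# Non-convexity for p = 0.5: the value at the midpoint of two points
# exceeds the average of the values at those points.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
midpoint_value = lp_norm(0.5 * (e1 + e2), 0.5)
average_value = 0.5 * (lp_norm(e1, 0.5) + lp_norm(e2, 0.5))
print(midpoint_value, average_value)
```

The \(l_2\) penalty cannot tell the two vectors apart, whereas \(l_1\) and especially \(l_p\) with small \(p\) charge the dense vector more, which is one way to see why they favor sparse solutions.
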

| Loss and Regularizer | Comments |
|---|---|
| Ordinary Least Squares \(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}\) | Squared loss. No regularization. Closed-form solution: \(\mathbf{w}=(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^{\top}\), where \(\mathbf{X}=[\mathbf{x}_{1}, ..., \mathbf{x}_{n}]\) and \(\mathbf{y}=[y_{1},...,y_{n}]\). |
| Ridge Regression \(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\lVert\mathbf{w}\rVert_{2}^{2}\) | Squared loss. \(l_{2}\)-regularization. Closed-form solution: \(\mathbf{w}=(\mathbf{X}\mathbf{X}^{\top}+\lambda\mathbb{I})^{-1}\mathbf{X}\mathbf{y}^{\top}\) (the factor \(n\) from the \(\frac{1}{n}\) average is absorbed into \(\lambda\)). |
| Lasso \(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\lVert\mathbf{w}\rVert_{1}\) | + Sparsity inducing (good for feature selection). + Convex. - Not strictly convex (no unique solution). - Not differentiable (at 0). Solve with (sub)-gradient descent or SVEN. |
| Elastic Net \(\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\alpha\lVert\mathbf{w}\rVert_{1}+(1-\alpha)\lVert\mathbf{w}\rVert_{2}^{2}\), with \(\alpha\in[0, 1)\) | ADVANTAGE: Strictly convex (i.e. unique solution). + Sparsity inducing (good for feature selection). + Dual of squared-loss SVM, see SVEN. DISADVANTAGE: Non-differentiable. |
| Logistic Regression \(\min_{\mathbf{w},b} \frac{1}{n}\sum\limits_{i=1}^n \log{(1+e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_{i}+b)})}\) | Often \(l_{1}\) or \(l_{2}\) regularized. Solve with gradient descent. \(\Pr{(y \mid \mathbf{x})}=\frac{1}{1+e^{-y(\mathbf{w}^{\top}\mathbf{x}+b)}}\) |
| Linear Support Vector Machine \(\min_{\mathbf{w},b} C\sum\limits_{i=1}^n \max[1-y_{i}(\mathbf{w}^\top\mathbf{x}_i+b), 0]+\lVert\mathbf{w}\rVert_2^2\) | Typically \(l_2\) regularized (sometimes \(l_1\)). Quadratic program. When kernelized, leads to sparse solutions. The kernelized version can be solved very efficiently with specialized algorithms (e.g. SMO). |
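
The closed-form solutions for Ordinary Least Squares and Ridge Regression in the table above can be sketched as below, using the table's convention that \(\mathbf{X}\) is \(d \times n\) with one example per column. The function names, the synthetic data, and the value `lam=10.0` are assumptions for illustration. The ridge function implements the formula as written in the table, so its `lam` corresponds to the \(\lambda\) of the unnormalized objective \(\sum_i (\mathbf{w}^\top\mathbf{x}_i - y_i)^2 + \lambda\lVert\mathbf{w}\rVert_2^2\).

```python
import numpy as np

def ols_closed_form(X, y):
    """Solve (X X^T) w = X y; X is d x n, y has length n."""
    return np.linalg.solve(X @ X.T, X @ y)

def ridge_closed_form(X, y, lam):
    """Solve (X X^T + lam * I) w = X y; lam > 0 makes the system well conditioned."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))             # columns are the examples x_i
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.1 * rng.normal(size=n)

w_ols = ols_closed_form(X, y)
w_ridge = ridge_closed_form(X, y, lam=10.0)

print("OLS:  ", w_ols)
print("Ridge:", w_ridge)
print("norms:", np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is generally the more numerically stable choice. The ridge weights have a smaller norm than the OLS weights, reflecting the shrinkage effect of the \(l_2\) penalty.
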