Loss $\ell(h_{\mathbf{w}}(\mathbf{x}_i),y_i)$  Usage  Comments  

1. Hinge-Loss $\max\left[1-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i},0\right]^{p}$  Standard SVM ($p=1$); squared-hinge SVM ($p=2$)  When used for the Standard SVM, the loss function denotes the size of the margin between the linear separator and its closest points in either class. Differentiable everywhere only for $p=2$.  
2. Log-Loss $\log(1+e^{-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i}})$  Logistic Regression  One of the most popular loss functions in Machine Learning, since its outputs are well-calibrated probabilities.  
3. Exponential Loss $e^{-h_{\mathbf{w}}(\mathbf{x}_{i})y_{i}}$  AdaBoost  This function is very aggressive. The loss of a misprediction increases exponentially with the value of $-h_{\mathbf{w}}(\mathbf{x}_i)y_i$.  
4. Zero-One Loss $\delta(\textrm{sign}(h_{\mathbf{w}}(\mathbf{x}_{i}))\neq y_{i})$  Actual Classification Loss  Non-continuous and thus impractical to optimize.  
Some additional notes on loss functions:

Figure 4.1: Plots of Common Classification Loss Functions. x-axis: $h(\mathbf{x}_{i})y_{i}$, or "correctness" of prediction; y-axis: loss value.
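The four classification losses in the table can be written as functions of the margin $h_{\mathbf{w}}(\mathbf{x}_{i})y_{i}$. The following is a minimal NumPy sketch, assuming labels $y_i\in\{-1,+1\}$; the function names and the `margin` argument are illustrative and not part of the notes.

```python
import numpy as np

# All functions take margin = h(x_i) * y_i, with labels assumed in {-1, +1}.

def hinge_loss(margin, p=1):
    """max(1 - margin, 0)^p; p=1 is the standard hinge, p=2 the squared hinge."""
    return np.maximum(1.0 - np.asarray(margin, dtype=float), 0.0) ** p

def log_loss(margin):
    """log(1 + exp(-margin)), computed stably via logaddexp."""
    return np.logaddexp(0.0, -np.asarray(margin, dtype=float))

def exponential_loss(margin):
    """exp(-margin); grows exponentially as the margin becomes more negative."""
    return np.exp(-np.asarray(margin, dtype=float))

def zero_one_loss(margin):
    """1 where sign(h(x_i)) differs from y_i (margin <= 0), else 0."""
    return (np.asarray(margin, dtype=float) <= 0).astype(float)

margins = np.array([-2.0, 0.0, 0.5, 3.0])
for name, fn in [("hinge", hinge_loss), ("log", log_loss),
                 ("exponential", exponential_loss), ("zero-one", zero_one_loss)]:
    print(name, fn(margins))
```

Evaluating the functions on a grid of margins shows the behavior described in the Comments column: the exponential loss dominates for strongly negative margins, while the zero-one loss is a step function.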
Loss $\ell(h_{\mathbf{w}}(\mathbf{x}_i),y_i)$  Comments  

1. Squared Loss $(h(\mathbf{x}_{i})-y_{i})^{2}$ 


2. Absolute Loss $|h(\mathbf{x}_{i})-y_{i}|$ 


3. Huber Loss



4. Log-Cosh Loss $\log(\cosh(h(\mathbf{x}_{i})-y_{i}))$, $\cosh(x)=\frac{e^{x}+e^{-x}}{2}$ 
ADVANTAGE: Similar to Huber Loss, but twice differentiable everywhere 
Figure 4.2: Plots of Common Regression Loss Functions. x-axis: $h(\mathbf{x}_{i})-y_{i}$, or "error" of prediction; y-axis: loss value.
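The regression losses can be sketched as functions of the residual $r = h(\mathbf{x}_{i})-y_{i}$. The notes do not give a formula for the Huber loss here, so the version below assumes its common textbook form with a threshold `delta` (quadratic for $|r|\leq\delta$, linear beyond); the function names and `delta` are illustrative assumptions.

```python
import numpy as np

# All functions take the residual r = h(x_i) - y_i.

def squared_loss(r):
    return np.asarray(r, dtype=float) ** 2

def absolute_loss(r):
    return np.abs(np.asarray(r, dtype=float))

def huber_loss(r, delta=1.0):
    """Assumed standard form: 0.5*r^2 if |r| <= delta, else delta*(|r| - 0.5*delta)."""
    r = np.asarray(r, dtype=float)
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quadratic, linear)

def log_cosh_loss(r):
    """log(cosh(r)) = log(e^r + e^-r) - log(2), computed without overflow."""
    r = np.asarray(r, dtype=float)
    return np.logaddexp(r, -r) - np.log(2.0)

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
for fn in (squared_loss, absolute_loss, huber_loss, log_cosh_loss):
    print(fn.__name__, fn(residuals))
```

For small residuals the Huber and log-cosh losses both behave like a scaled squared loss, and for large residuals both grow roughly linearly, which is the similarity noted in the table.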
Regularizer $r(\mathbf{w})$  Properties  

1. $l_{2}$-Regularization $r(\mathbf{w}) = \mathbf{w}^{\top}\mathbf{w} = \|\mathbf{w}\|_{2}^{2}$ 


2. $l_{1}$-Regularization $r(\mathbf{w}) = \|\mathbf{w}\|_{1}$ 


3. Elastic Net $\alpha\|\mathbf{w}\|_{1}+(1-\alpha)\|\mathbf{w}\|_{2}^{2}$, $\alpha\in[0, 1)$ 


4. $l_{p}$-Norm, often $0<p\leq1$  $\|\mathbf{w}\|_{p} = \left(\sum\limits_{i=1}^d |w_{i}|^{p}\right)^{1/p}$ 

Figure 4.3: Plots of Common Regularizers
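The regularizers in the table translate directly into a few lines of NumPy. This is a small sketch for a weight vector `w`; the function names are illustrative and not taken from the notes.

```python
import numpy as np

def l2_regularizer(w):
    """r(w) = w^T w = ||w||_2^2."""
    return float(w @ w)

def l1_regularizer(w):
    """r(w) = ||w||_1."""
    return float(np.sum(np.abs(w)))

def elastic_net(w, alpha):
    """alpha * ||w||_1 + (1 - alpha) * ||w||_2^2, with alpha in [0, 1)."""
    return alpha * l1_regularizer(w) + (1.0 - alpha) * l2_regularizer(w)

def lp_norm(w, p):
    """||w||_p = (sum_i |w_i|^p)^(1/p); for 0 < p < 1 this is not a true norm."""
    return float(np.sum(np.abs(w) ** p) ** (1.0 / p))

w = np.array([3.0, -4.0])
print(l2_regularizer(w), l1_regularizer(w), elastic_net(w, 0.5), lp_norm(w, 0.5))
```

With `alpha = 0` the elastic net reduces to the $l_{2}$ regularizer, and `lp_norm(w, 1)` agrees with the $l_{1}$ regularizer.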
Loss and Regularizer  Properties  Solutions  

1. Ordinary Least Squares $\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}$ 



2. Ridge Regression $\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\|\mathbf{w}\|_{2}^{2}$ 



3. Lasso $\min_{\mathbf{w}} \frac{1}{n}\sum\limits_{i=1}^n (\mathbf{w}^{\top}\mathbf{x}_{i}-y_{i})^{2}+\lambda\|\mathbf{w}\|_{1}$ 



4. Logistic Regression $\min_{\mathbf{w},b} \frac{1}{n}\sum\limits_{i=1}^n \log{(1+e^{-y_i(\mathbf{w}^{\top}\mathbf{x}_{i}+b)})}$ 
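The four objectives above can be evaluated directly for a candidate $\mathbf{w}$. The sketch below assumes a data matrix `X` with one example per row, which is a convention chosen for this example only. It also includes the standard closed-form minimizer of the ridge objective as written here, $\mathbf{w} = (X^{\top}X + n\lambda I)^{-1}X^{\top}\mathbf{y}$, which follows from setting the gradient to zero; the synthetic data and all function names are illustrative.

```python
import numpy as np

def mean_squared_error(w, X, y):
    """(1/n) * sum_i (w^T x_i - y_i)^2, the OLS objective."""
    return np.mean((X @ w - y) ** 2)

def ridge_objective(w, X, y, lam):
    return mean_squared_error(w, X, y) + lam * (w @ w)

def lasso_objective(w, X, y, lam):
    return mean_squared_error(w, X, y) + lam * np.sum(np.abs(w))

def logistic_objective(w, b, X, y):
    """(1/n) * sum_i log(1 + exp(-y_i (w^T x_i + b))), labels assumed in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w + b)))

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + n*lam*I) w = X^T y; lam = 0 gives OLS when X^T X is invertible."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w_ridge = ridge_closed_form(X, y, lam=0.1)
print("ridge objective:", ridge_objective(w_ridge, X, y, lam=0.1))
print("lasso objective at the same w:", lasso_objective(w_ridge, X, y, lam=0.1))
```

The Lasso and logistic regression objectives have no closed-form minimizer in general, so the sketch only evaluates them; in practice they are minimized with iterative methods.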

