| Loss $\ell(h_{\mathbf{w}}(\mathbf{x}_i), y_i)$ | Usage | Comments |
|---|---|---|
| Hinge Loss $\max\left[1 - h_{\mathbf{w}}(\mathbf{x}_i) y_i, 0\right]^p$ | Standard SVM ($p=1$), squared hinge-loss SVM ($p=2$) | When used for the standard SVM, the loss function denotes the size of the margin between the linear separator and its closest points in either class. Only differentiable everywhere with $p=2$. |
| Log Loss $\log\left(1 + e^{-h_{\mathbf{w}}(\mathbf{x}_i) y_i}\right)$ | Logistic Regression | One of the most popular loss functions in machine learning, since its outputs are well-calibrated probabilities. |
| Exponential Loss $e^{-h_{\mathbf{w}}(\mathbf{x}_i) y_i}$ | AdaBoost | This function is very aggressive. The loss of a mis-prediction increases exponentially with the value of $-h_{\mathbf{w}}(\mathbf{x}_i) y_i$. This can lead to nice convergence results, for example in the case of AdaBoost, but it can also cause problems with noisy data. |
| Zero-One Loss $\delta\left(\operatorname{sign}(h_{\mathbf{w}}(\mathbf{x}_i)) \neq y_i\right)$ | Actual Classification Loss | Non-continuous and thus impractical to optimize. |
Figure 4.1: Plots of common classification loss functions. x-axis: $h(\mathbf{x}_i) y_i$, or "correctness" of prediction; y-axis: loss value.
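The table translates directly into code. Below is a minimal numpy sketch (the function names are my own, not from the notes) that evaluates each loss as a function of the "correctness" $z = h_{\mathbf{w}}(\mathbf{x}_i) y_i$ plotted in Figure 4.1:

```python
import numpy as np

# Each classification loss is a function of the "correctness"
# z = h_w(x_i) * y_i, matching the x-axis of Figure 4.1.

def hinge_loss(z, p=1):
    # max[1 - z, 0]^p; p=1 for the standard SVM, p=2 for the squared hinge
    return np.maximum(1 - z, 0) ** p

def log_loss(z):
    # log(1 + e^{-z}); logaddexp(0, -z) computes this without overflow
    return np.logaddexp(0, -z)

def exponential_loss(z):
    # e^{-z}; the penalty grows exponentially as predictions become more wrong
    return np.exp(-z)

def zero_one_loss(z):
    # 1 if the prediction has the wrong sign (or is 0), 0 otherwise
    return (np.sign(z) != 1).astype(float)

z = np.linspace(-2, 2, 5)
for name, fn in [("hinge", hinge_loss), ("log", log_loss),
                 ("exponential", exponential_loss), ("zero-one", zero_one_loss)]:
    print(name, fn(z))
```

Running this reproduces the qualitative picture in Figure 4.1: all four losses agree that $z \gg 0$ is good, but they penalize mistakes at very different rates.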
| Loss $\ell(h(\mathbf{x}_i), y_i)$ | Comments |
|---|---|
| Squared Loss $\left(h(\mathbf{x}_i) - y_i\right)^2$ | Most popular regression loss function. Estimates the mean label. ADVANTAGE: differentiable everywhere. DISADVANTAGE: somewhat sensitive to outliers and noise. |
| Absolute Loss $\lvert h(\mathbf{x}_i) - y_i \rvert$ | Also a very popular loss function. Estimates the median label. ADVANTAGE: less sensitive to noise. DISADVANTAGE: not differentiable at $0$. |
| Huber Loss $\frac{1}{2}\left(h(\mathbf{x}_i) - y_i\right)^2$ if $\lvert h(\mathbf{x}_i) - y_i \rvert < \delta$, otherwise $\delta\left(\lvert h(\mathbf{x}_i) - y_i \rvert - \frac{\delta}{2}\right)$ | Also known as Smooth Absolute Loss. ADVANTAGE: "best of both worlds" of squared and absolute loss: behaves like squared loss for small errors and like absolute loss for large ones. Once differentiable. |
| Log-Cosh Loss $\log\left(\cosh\left(h(\mathbf{x}_i) - y_i\right)\right)$, where $\cosh(x) = \frac{e^x + e^{-x}}{2}$ | ADVANTAGE: similar to Huber loss, but twice differentiable everywhere. |
Figure 4.2: Plots of common regression loss functions. x-axis: $h(\mathbf{x}_i) - y_i$, or "error" of prediction; y-axis: loss value.
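As with the classification losses, these are short expressions in the residual $r = h(\mathbf{x}_i) - y_i$. A minimal numpy sketch, with an illustrative `delta` parameter for the Huber transition point:

```python
import numpy as np

# Each regression loss is a function of the residual r = h(x_i) - y_i,
# matching the x-axis of Figure 4.2.

def squared_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # quadratic for |r| < delta, linear beyond; the two pieces meet with
    # matching value and slope at |r| = delta, so the loss is once differentiable
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) < delta, quadratic, linear)

def log_cosh_loss(r):
    # log(cosh(r)) behaves like r^2 / 2 near 0 and like |r| - log 2 far from 0
    return np.log(np.cosh(r))

r = np.linspace(-3, 3, 7)
print(huber_loss(r))
print(log_cosh_loss(r))
```

Comparing `huber_loss` and `log_cosh_loss` on the same residuals shows why log-cosh is described as a smooth stand-in for Huber: the values are close, but log-cosh has no hand-tuned transition point.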
| Regularizer $r(\mathbf{w})$ | Properties |
|---|---|
| $l_2$-Regularization $r(\mathbf{w}) = \mathbf{w}^\top \mathbf{w} = \lVert \mathbf{w} \rVert_2^2$ | ADVANTAGE: strictly convex. ADVANTAGE: differentiable. DISADVANTAGE: places weight on all features, i.e. relies on all features to some degree; yields dense solutions. |
| $l_1$-Regularization $r(\mathbf{w}) = \lVert \mathbf{w} \rVert_1$ | Convex (but not strictly). DISADVANTAGE: not differentiable at $0$ (the point to which minimization is intended to bring us). Effect: sparse solutions. |
| $l_p$-Norm $\lVert \mathbf{w} \rVert_p = \left(\sum_{i=1}^d \lvert w_i \rvert^p\right)^{1/p}$ | Often used with $0 < p \leq 1$. DISADVANTAGE: non-convex for $p < 1$. ADVANTAGE: very sparse solutions. DISADVANTAGE: not differentiable. |
Figure 4.3: Plots of common regularizers.
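A short numpy sketch of the three regularizers (the function names are illustrative). Note that for $p < 1$ the $l_p$-"norm" is not actually a norm, which is exactly why the resulting penalty is non-convex:

```python
import numpy as np

def l2_regularizer(w):
    # r(w) = w^T w = ||w||_2^2; strictly convex and differentiable
    return w @ w

def l1_regularizer(w):
    # r(w) = ||w||_1; convex but not differentiable at 0
    return np.abs(w).sum()

def lp_norm(w, p):
    # ||w||_p = (sum_i |w_i|^p)^(1/p); non-convex penalty for p < 1
    return (np.abs(w) ** p).sum() ** (1 / p)

w = np.array([0.5, -1.0, 0.0, 2.0])
print(l2_regularizer(w), l1_regularizer(w), lp_norm(w, 0.5))
```

Evaluating `lp_norm` with small `p` on vectors of equal $l_1$ length shows the sparsity effect: mass concentrated on few coordinates is penalized less than mass spread across many.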
| Loss and Regularizer | Comments |
|---|---|
| Ordinary Least Squares $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i - y_i)^2$ | Squared loss, no regularization. Closed-form solution: $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}$, where $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]$ and $\mathbf{y} = (y_1, \dots, y_n)^\top$. |
| Ridge Regression $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i - y_i)^2 + \lambda \lVert \mathbf{w} \rVert_2^2$ | Squared loss with $l_2$-regularization. Also has a closed-form solution: $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top + \lambda n \mathbf{I})^{-1}\mathbf{X}\mathbf{y}$. |
| Lasso $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i - y_i)^2 + \lambda \lVert \mathbf{w} \rVert_1$ | + sparsity-inducing (good for feature selection). + convex. − not strictly convex (the solution may not be unique). − not differentiable at $0$. Solve with (sub-)gradient descent. |
| Elastic Net $\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{w}^\top \mathbf{x}_i - y_i)^2 + \alpha \lVert \mathbf{w} \rVert_1 + (1-\alpha)\lVert \mathbf{w} \rVert_2^2$, $\alpha \in [0, 1)$ | ADVANTAGE: strictly convex (i.e. unique solution). + sparsity-inducing (good for feature selection). DISADVANTAGE: non-differentiable. |
| Logistic Regression $\min_{\mathbf{w},b} \frac{1}{n}\sum_{i=1}^n \log\left(1 + e^{-y_i(\mathbf{w}^\top \mathbf{x}_i + b)}\right)$ | Logistic loss; often $l_1$- or $l_2$-regularized in practice. Solve with gradient descent. ADVANTAGE: the outputs are well-calibrated probabilities, $P(y = +1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$. |
| Linear Support Vector Machine $\min_{\mathbf{w},b}\, C\sum_{i=1}^n \max\left[1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b), 0\right] + \lVert \mathbf{w} \rVert_2^2$ | Hinge loss with $l_2$-regularization; $C$ trades off margin violations against margin size. The max-margin classifier; can be solved as a quadratic program. |
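The two closed-form solutions in the table are easy to check numerically. Below is a sketch under the conventions above ($\mathbf{X}$ stored with one column per example); the $\lambda n$ factor appears because the squared loss is averaged over $n$ points while the regularizer is not:

```python
import numpy as np

# X = [x_1, ..., x_n] stored column-wise (d x n), y = (y_1, ..., y_n).
rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(size=(d, n))
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.1 * rng.normal(size=n)

# Ordinary Least Squares: w = (X X^T)^{-1} X y
w_ols = np.linalg.solve(X @ X.T, X @ y)

# Ridge Regression: setting the gradient of the averaged objective to zero
# gives w = (X X^T + lambda * n * I)^{-1} X y
lam = 0.1
w_ridge = np.linalg.solve(X @ X.T + lam * n * np.eye(d), X @ y)

print(w_ols)    # close to w_true
print(w_ridge)  # shrunk toward zero relative to w_ols
```

Lasso and Elastic Net have no such closed form because the $l_1$ term is not differentiable at $0$; in practice they are solved iteratively, e.g. with subgradient or coordinate descent.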