## 19: Boosting

### Boosting reduces Bias

Scenario: Hypothesis class $\mathbb{H}$, the set of classifiers we can choose from, has large bias, i.e. the training error is high.
Famous question: In his machine learning class project in 1988, Michael Kearns asked: can weak learners ($\mathbb{H}$) be combined to generate a strong learner with low bias?
Famous answer: Yes! (Robert Schapire in 1990)
Solution: Create an ensemble classifier $H(\vec x) = \sum_{t = 1}^{T}\alpha_t h_t(\vec{x})$. This ensemble classifier is built in an iterative fashion: in iteration $t$ we add the classifier $\alpha_th_t(\vec x)$ to the ensemble. At test time we evaluate all classifiers and return the weighted sum.
The process of constructing such an ensemble in a stage-wise fashion is very similar to gradient descent. However, instead of updating the model parameters in each iteration, we add functions to our ensemble.
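As a concrete illustration, the test-time evaluation of such a weighted ensemble can be sketched as follows (the two weak learners and their weights are made up for this example):

```python
import numpy as np

# Two hypothetical weak learners and their ensemble weights alpha_t.
weak_learners = [np.sign, lambda x: -np.sign(x - 1.5)]
alphas = [0.7, 0.3]

def H(x):
    """Ensemble prediction H(x) = sum_t alpha_t * h_t(x)."""
    return sum(a * h(x) for a, h in zip(alphas, weak_learners))

x = np.array([-1.0, 1.0, 2.0])
print(np.sign(H(x)))  # final classification: [-1.  1.  1.]
```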
Let $\ell$ denote a (convex and differentiable) loss function. With a slight abuse of notation we write $$\ell(H)=\frac{1}{n}\sum_{i=1}^n \ell(H(\vec x_i),y_i).$$ Assume we have already finished $t$ iterations and have an ensemble classifier $H_t(\vec{x})$. In iteration $t+1$ we want to add one more weak learner $h_{t+1}$ to the ensemble. To this end we search for the weak learner that minimizes the loss the most, $$h_{t+1} = argmin_{h \in \mathbb{H}}\ell(H_t + \alpha h).$$ Once $h_{t+1}$ has been found, we add it to our ensemble, i.e. $H_{t+1} := H_t + \alpha h_{t+1}$.
How can we find such $h \in \mathbb{H}$?
Answer: Use gradient descent in function space. In function space, the inner product can be defined as $\langle h,g\rangle=\int\limits_x h(x)g(x)dx$. Since we only have the training set, we define $\langle h,g\rangle= \sum_{i = 1}^{n} h(x_i)g(x_i)$.
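On a finite training set this inner product is just a dot product between the two prediction vectors. A minimal sketch (the functions and points below are made up for illustration):

```python
import numpy as np

def inner_product(h, g, X):
    """<h, g> = sum_i h(x_i) g(x_i), evaluated on the training points X."""
    return float(np.dot(h(X), g(X)))

# Toy 1-d example with two hypothetical functions.
X = np.array([-2.0, -1.0, 1.0, 2.0])
h = np.sign                    # predictions: [-1, -1, 1, 1]
g = lambda x: x                # the identity function
print(inner_product(h, g, X))  # 2 + 1 + 1 + 2 = 6.0
```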

#### Gradient descent in functional space

Given $H$, we want to find the step size $\alpha$ and (weak learner) $h$ that minimize the loss $\ell(H+\alpha h)$. Use a Taylor approximation of $\ell(H+\alpha h)$: $$\ell(H+\alpha h) \approx \ell(H) + \alpha\langle\nabla \ell(H),h\rangle.$$ This approximation (of $\ell$ as a linear function) only holds within a small region around $\ell(H)$, i.e. as long as $\alpha$ is small. We therefore fix it to a small constant (e.g. $\alpha\approx 0.1$). With the step size $\alpha$ fixed, we can use the approximation above to find an almost optimal $h$: $$argmin_{h\in \mathbb{H}}\ell(H+\alpha h) \approx argmin_{h\in \mathbb{H}}\langle\nabla \ell(H),h\rangle= argmin_{h \in \mathbb{H}}\sum_{i = 1}^{n}\frac{\partial \ell}{\partial [H(x_i)]}h(x_i)$$ Here we can write $\ell(H) = \sum_{i = 1}^{n}\ell(H(x_i)) = \ell(H(x_1), \dots , H(x_n))$ (each prediction is an input to the loss function) and use the shorthand $\frac{\partial \ell}{\partial H}(x_i) = \frac{\partial \ell}{\partial [H(x_i)]}$.
So we can do boosting if we have an algorithm $\mathbb{A}$ to solve
$h_{t+1} = argmin_{h \in \mathbb{H}} \sum_{i = 1}^{n} \frac{\partial \ell}{\partial [H(x_i)]} h(x_i)$
Given weights $w_i$ and inputs $x_i$, $\mathbb{A}$ returns $h = argmin_{h \in \mathbb{H}} \sum_{i = 1}^{n} w_i h(x_i)$.
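Putting the pieces together, the generic boosting loop can be sketched as below. The callables `loss_grad` and `oracle_A` are hypothetical stand-ins for the loss gradient and the algorithm $\mathbb{A}$; nothing here is tied to a particular library.

```python
import numpy as np

def generic_boosting(X, y, loss_grad, oracle_A, alpha=0.1, T=100):
    """Functional gradient descent sketch.

    loss_grad(preds, y) -> vector of partial derivatives dl/d[H(x_i)];
    oracle_A(X, w)      -> weak learner h approximately minimizing
                           sum_i w_i h(x_i).
    Both callables are assumptions of this sketch, not a fixed API.
    """
    ensemble = []                # list of (alpha_t, h_t) pairs
    preds = np.zeros(len(y))     # H_0 = 0
    for _ in range(T):
        w = loss_grad(preds, y)  # w_i = dl/d[H(x_i)]
        h = oracle_A(X, w)       # best weak learner for these weights
        ensemble.append((alpha, h))
        preds += alpha * h(X)    # H_{t+1} = H_t + alpha * h_{t+1}
    return ensemble, preds
```

For example, with the squared loss the gradient is simply `preds - y`, and any learner that can (approximately) minimize the weighted sum can serve as the oracle.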

### Case study #1: Gradient Boosted Regression Trees (GBRT)

Choice of weak learners $\mathbb{H}$: limited-depth decision/regression trees (e.g. depth = 4).
But how can we find a tree $h = argmin_{h \in \mathbb{H}} \sum_{i = 1}^{n} w_i h(x_i)$, where $w_i = \frac{\partial \ell}{\partial H(x_i)}$?
Assume that $\sum_{i = 1}^{n} h^2(x_i)$ = constant (e.g. normalize the output).
Realize:
1. $\sum_{i = 1}^{n} \left(\frac{\partial \ell}{\partial H(x_i)}\right)^2 = \sum_{i = 1}^{n} w_i^2$ is independent of $h_{t+1}$, i.e. also constant.
2. CART trees are negation closed, i.e. $h \in \mathbb{H} \implies -h \in \mathbb{H}$.

Since we only care about the minimizing tree (not the value of the minimum), we may substitute $h' = -h$ (valid by negation closure), rescale by $2$, and add the two constants above:
$$argmin_{h' \in \mathbb{H}} -2\sum_{i = 1}^{n} w_i h'(x_i) = argmin_{h' \in \mathbb{H}} \sum_{i = 1}^{n} w_i^2 - 2w_i h'(x_i) + \left(h'(x_i)\right)^2 = argmin_{h' \in \mathbb{H}}\sum_{i = 1}^{n}\left(h'(x_i)-w_i\right)^2$$
In other words, we fit a regression tree $h'$ to the gradient values $w_i$ with the squared loss and add its negation $h_{t+1} = -h'$ to the ensemble; equivalently, we fit the tree directly to the negative gradient $-w_i$.
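A minimal sketch of this procedure with depth-1 regression trees (stumps) and the squared loss $\ell = \frac{1}{2}(H(x_i)-y_i)^2$, for which the negative gradient is simply the residual $y_i - H(x_i)$. The stump implementation below is written from scratch purely for illustration:

```python
import numpy as np

def fit_stump(x, t):
    """Depth-1 regression tree on 1-d inputs: pick the threshold split
    that minimizes squared error, predicting the mean target per side."""
    best = None
    for s in np.unique(x)[:-1]:        # largest value gives an empty side
        left, right = t[x <= s], t[x > s]
        lm, rm = left.mean(), right.mean()
        err = ((left - lm) ** 2).sum() + ((right - rm) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, lm, rm)
    _, s, lm, rm = best
    return lambda z: np.where(z <= s, lm, rm)

def gbrt(x, y, T=200, alpha=0.1):
    """Each round fits a stump to the current residuals (the negative
    gradient of the squared loss) and adds it to the ensemble."""
    H = np.zeros(len(y))
    trees = []
    for _ in range(T):
        residuals = y - H              # targets t_i = -dl/d[H(x_i)]
        h = fit_stump(x, residuals)
        trees.append(h)
        H += alpha * h(x)              # H_{t+1} = H_t + alpha * h_{t+1}
    return trees, H
```

With a small step size the training residuals shrink geometrically, so the ensemble's training predictions converge toward $y$.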

### Case study #2: AdaBoost

(This is one of Kilian's favorite ML algorithms.)

• Classification ($y_i \in \{+1,-1\}$); the weak learners $h \in \mathbb{H}$ are binary classifiers: $h(\vec x) \in \{-1,+1\}, \forall \vec x$
• Perform line-search to obtain best step size
• Loss function: Exponential loss $\ell(H)=\sum_{i=1}^{n} e^{-y_i H(x_i)}$
#### Finding the best weak learner

Gradient: $\frac{\partial \ell}{\partial H(x_i)}=-y_i \underbrace{e^{-y_i H(x_i)}}_\mathrm{defines \enspace w_i}$, where $e^{-y_i H(x_i)} > 0 \enspace \forall x_i$

$\underbrace{r_i}_\mathrm{raw\enspace weight}=e^{-H(x_i)y_i} \qquad \underbrace{w_i}_\mathrm{normalized\enspace weight}= \frac{e^{-H(x_i)y_i}}{\underbrace{z}_\mathrm{normalization\enspace for\enspace convenience}},\qquad \forall x_i$

$z=\sum_{i=1}^{n} e^{-H(x_i)y_i}$ so that $\sum_{i=1}^{n} w_i=1$

$argmin_{h}-\sum_{i=1}^{n}y_i e^{-H(x_i)y_i} h(x_i) = argmax_{h}\underbrace{\sum_{i=1}^{n} w_i \underbrace{y_i h(x_i)}_\mathrm{+1\enspace if\enspace h(x_i)=y_i,\enspace -1\enspace o/w}}_\mathrm{the \enspace training\enspace accuracy\enspace (up\enspace to\enspace scaling) \enspace weighted\enspace by\enspace the\enspace distribution\enspace w_1,w_2,\dots,w_n}$
So for AdaBoost, we only need a classifier that takes the training data and a distribution over the training set, and returns a classifier $h\in \mathbb{H}$ with less than $0.5$ weighted training error.
Weighted training error: $\epsilon=\sum_{i:h(x_i)y_i=-1} w_i$
Condition: for any $w_1,\dots,w_n$ with $w_i\geq 0$ and $\sum_{i=1}^{n} w_i=1$, the algorithm returns $h = \mathbb{A}(D, w_1,\dots,w_n)$ such that $\epsilon <0.5$.
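The weighted accuracy in the argmax above and the weighted error $\epsilon$ are two sides of the same coin: since $y_i h(x_i)$ is $+1$ on correct points and $-1$ on mistakes, $\sum_i w_i y_i h(x_i) = (1-\epsilon) - \epsilon = 1 - 2\epsilon$. A quick numerical check (the predictions and weights are made up):

```python
import numpy as np

y = np.array([1, -1, 1, 1])          # labels
h = np.array([1, 1, 1, -1])          # hypothetical weak-learner predictions
w = np.array([0.4, 0.3, 0.2, 0.1])   # a distribution over the training set

eps = w[h != y].sum()                # weighted error: points 2 and 4 wrong
acc = w @ (y * h)                    # weighted accuracy score
print(eps, acc)                      # eps ~ 0.4, acc ~ 0.2 = 1 - 2*eps
```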

#### Finding the stepsize $\alpha$

(found by line search to minimize $\ell$)
Remember: $\epsilon=\sum_{i:y_i h(x_i)=-1} w_i$
Choose: $\alpha=argmin_{\alpha}\ell(H+\alpha h)$
= $argmin_{\alpha} \sum_{i=1}^{n} e^{-y_i[H(x_i)+\alpha h(x_i)]}$
$\downarrow$ Differentiating w.r.t. $\alpha$ and equating with zero.
$\sum_{i=1}^{n} -\underbrace{y_i h(x_i)}_\mathrm{\in \{+1,-1\}} e^{-(y_i H(x_i)+\alpha y_i h(x_i))}=0$
Splitting the sum by the sign of $y_i h(x_i)$:
$-\sum_{i:h(x_i) y_i=1} e^{-(y_i H(x_i)+\alpha \underbrace{y_i h(x_i)}_\mathrm{1})} + \sum_{i:h(x_i) y_i = -1} e^{-(y_i H(x_i)+\alpha \underbrace{y_i h(x_i)}_\mathrm{-1})}=0 \mid : \underbrace{\sum_{i=1}^{n} e^{-y_i H(x_i)}}_\mathrm{normalizer\enspace z}$
$-\sum_{i:h(x_i) y_i=1} w_i e^{-\alpha} + \sum_{i:h(x_i) y_i = -1} w_i e^{+\alpha}=0$
$-(1-\epsilon)e^{-\alpha}+\epsilon e^{+\alpha}=0$
$e^{2 \alpha}=\frac{1-\epsilon}{\epsilon}$
$\alpha=\frac{1}{2}\ln \frac{1-\epsilon}{\epsilon}$
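Combining the weight computation, weak-learner selection, and this closed-form step size gives the full AdaBoost loop. A self-contained sketch with a made-up pool of threshold classifiers playing the role of the weak learners:

```python
import numpy as np

def adaboost(X, y, pool, T=20):
    """pool: candidate classifiers h with h(X) in {-1,+1}.  Each round
    picks the h with smallest weighted error eps and uses the step size
    alpha = 0.5 * ln((1 - eps) / eps) derived above."""
    H = np.zeros(len(y))
    ensemble = []
    for _ in range(T):
        r = np.exp(-y * H)             # raw weights r_i
        w = r / r.sum()                # normalized weights w_i, sum to 1
        errs = [w[h(X) != y].sum() for h in pool]
        idx = int(np.argmin(errs))
        eps, h = errs[idx], pool[idx]
        if eps >= 0.5:                 # no weak learner beats chance
            break
        if eps == 0:                   # a perfect classifier: just use it
            ensemble.append((1.0, h))
            H += h(X)
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        H += alpha * h(X)
    return ensemble, np.sign(H)

# Hypothetical pool: decision stumps on 1-d inputs, both polarities.
def stump(s, sign):
    return lambda x: sign * np.where(x <= s, 1.0, -1.0)

pool = [stump(s, sgn) for s in (0.5, 1.5, 2.5) for sgn in (+1, -1)]
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 1.0, -1.0, -1.0])
ensemble, preds = adaboost(X, y, pool)
print(preds)  # matches y: [ 1.  1. -1. -1.]
```

Note how the weights $w_i$ are recomputed from $H$ each round exactly as in the definition above, rather than updated multiplicatively; the two formulations are equivalent.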