19: Boosting

Boosting reduces Bias

Scenario: The hypothesis class $\mathbb{H}$, the set of classifiers we can choose from, has large bias, i.e. the training error is high.
Famous question: In his machine learning class project in 1988, Michael Kearns asked the question: can weak learners ($\mathbb{H}$) be combined to generate a strong learner with low bias?
Famous answer: Yes! (Robert Schapire in 1990)
Solution: Create an ensemble classifier $H(\vec x) = \sum_{t = 1}^{T}\alpha_t h_t(\vec{x})$. This ensemble classifier is built in an iterative fashion: in iteration $t$ we add the classifier $\alpha_t h_t(\vec x)$ to the ensemble. At test time we evaluate all classifiers and return the weighted sum.
The process of constructing such an ensemble in a stage-wise fashion is very similar to gradient descent. However, instead of updating the model parameters in each iteration, we add functions to our ensemble.
Let $\ell$ denote a (convex and differentiable) loss function. With a little abuse of notation we write \begin{equation} \ell(H_T)=\frac{1}{n}\sum_{i=1}^n \ell(H_T(\vec x_i),y_i). \end{equation} Assume we have already finished $t$ iterations and already have an ensemble classifier $H_t(\vec{x})$. Now in iteration $t+1$ we want to add one more weak learner $h_{t+1}$ to the ensemble. To this end we search for the weak learner that decreases the loss the most, \begin{equation} h_{t+1} = \textrm{argmin}_{h \in \mathbb{H}}\ell(H_t + \alpha h). \end{equation} Once $h_{t+1}$ has been found, we add it to our ensemble, i.e. $H_{t+1} := H_t + \alpha h_{t+1}$.
How can we find such $h \in \mathbb{H}$?
Answer: Use gradient descent in function space. In function space, an inner product can be defined as $\langle h,g\rangle=\int_x h(x)g(x)\,dx$. Since we only have access to the training set, we instead define $\langle h,g\rangle = \sum_{i = 1}^{n} h(x_i)g(x_i)$.
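As a quick sanity check, this empirical inner product is just an ordinary dot product of the two functions' values on the training inputs. A minimal sketch with made-up toy data (the inputs and functions below are illustrative, not from the notes):

```python
import numpy as np

# Empirical inner product (h, g) = sum_i h(x_i) g(x_i),
# evaluated on hypothetical training inputs x_i.
x = np.array([-1.0, 0.0, 2.0, 3.0])

def h(z):
    return z          # one function, h(x) = x

def g(z):
    return z ** 2     # another function, g(x) = x^2

inner = np.sum(h(x) * g(x))   # sum_i x_i^3 = -1 + 0 + 8 + 27 = 34
```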

Gradient descent in functional space

Given $H$, we want to find the step-size $\alpha$ and (weak learner) $h$ that minimize the loss $\ell(H+\alpha h)$. Use a Taylor approximation of $\ell(H+\alpha h)$: \begin{equation} \ell(H+\alpha h) \approx \ell(H) + \alpha\langle\nabla \ell(H),h\rangle. \label{c8:eq:taylorapprox} \end{equation} This approximation (of $\ell$ as a linear function) only holds within a small region around $\ell(H)$, i.e. as long as $\alpha$ is small. We therefore fix it to a small constant (e.g. $\alpha\approx 0.1$). With the step-size $\alpha$ fixed, we can use the approximation above to find an almost optimal $h$: \begin{equation} \textrm{argmin}_{h\in \mathbb{H}}\ell(H+\alpha h) \approx \textrm{argmin}_{h\in \mathbb{H}}\langle\nabla \ell(H),h\rangle = \textrm{argmin}_{h \in \mathbb{H}}\sum_{i = 1}^{n}\frac{\partial \ell}{\partial [H(x_i)]}h(x_i). \end{equation}
Here we use that we can write $\ell(H) = \sum_{i = 1}^{n}\ell(H(x_i)) = \ell(H(x_1), \dots , H(x_n))$ (each prediction is an input to the loss function), so the gradient $\nabla \ell(H)$ is simply the vector of partial derivatives with respect to the individual predictions: $\frac{\partial \ell}{\partial H}(x_i) = \frac{\partial \ell}{\partial [H(x_i)]}$.
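For a concrete instance (the squared loss here is an assumed example, not the only choice): with $\ell(H)=\frac{1}{2}\sum_i (H(x_i)-y_i)^2$, the partial derivatives $\frac{\partial \ell}{\partial [H(x_i)]}$ are simply $H(x_i)-y_i$, and each candidate $h$ is scored by the inner product above. A minimal sketch with toy numbers:

```python
import numpy as np

# Functional gradient of the squared loss l(H) = 1/2 * sum_i (H(x_i) - y_i)^2:
# dl/d[H(x_i)] = H(x_i) - y_i. All numbers below are illustrative.
y = np.array([1.0, 2.0, 3.0])
H = np.array([0.5, 2.5, 2.0])        # current ensemble predictions H(x_i)

w = H - y                            # gradient entries, one per training point

# Score of a candidate weak learner h: sum_i w_i * h(x_i).
# The more negative the score, the more adding h decreases the loss.
h_vals = np.array([1.0, -1.0, 1.0])  # h evaluated on the training points
score = np.sum(w * h_vals)           # (-0.5)(1) + (0.5)(-1) + (-1.0)(1) = -2.0
```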
So we can do boosting whenever we have an algorithm $\mathbb{A}$ that solves
$h_{t+1} = \textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} \frac{\partial \ell}{\partial [H(x_i)]} h(x_i)$.
In general: given weights $w_i$ and inputs $x_i$, $\mathbb{A}$ returns $h = \textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} w_i h(x_i)$.

Generic boosting (a.k.a. AnyBoost)
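The generic procedure can be sketched as follows. This is a minimal sketch, not a reference implementation; `grad` and `weak_learner` are hypothetical callbacks standing in for the loss gradient and the algorithm $\mathbb{A}$:

```python
import numpy as np

def anyboost(X, y, grad, weak_learner, alpha=0.1, T=100):
    """Generic boosting (AnyBoost) sketch.

    grad(preds, y)     -> vector of dl/d[H(x_i)] at the current predictions
    weak_learner(X, w) -> a function h approximately minimizing sum_i w_i h(x_i)
    """
    ensemble = []                     # list of (alpha_t, h_t) pairs
    preds = np.zeros(len(y))          # H_0 = 0
    for _ in range(T):
        w = grad(preds, y)            # functional gradient at H_t
        h = weak_learner(X, w)        # inner problem, solved by algorithm A
        ensemble.append((alpha, h))
        preds = preds + alpha * h(X)  # H_{t+1} = H_t + alpha * h
    return ensemble, preds
```

At test time, a point is scored by summing $\alpha_t h_t(\vec x)$ over the stored pairs, exactly as in the ensemble definition above.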

Case study #1: Gradient Boosted Regression Trees (GBRT)

Choice of weak learners $\mathbb{H}$: decision/regression trees of limited depth (e.g. depth = 4).
But how can we find a tree $h$ such that $h = \textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} w_i h(x_i)$, where $w_i = \frac{\partial \ell}{\partial [H(x_i)]}$?
Assume that $\sum_{i = 1}^{n} h^2(x_i) = \textrm{constant}$ (e.g. normalize the output of each tree). Two observations:
1. $\sum_{i = 1}^{n} \left(\frac{\partial \ell}{\partial [H(x_i)]}\right)^2$ is independent of $h_{t+1}$ (it depends only on the current ensemble $H$).
2. CART trees are negation closed, i.e. $h \in \mathbb{H} \Rightarrow -h \in \mathbb{H}$.
Substituting $h = -h'$ (allowed because $\mathbb{H}$ is negation closed), and using that scaling by $2$ and adding terms that are constant with respect to $h'$ (namely $\sum_i w_i^2$ by observation 1 and $\sum_i h'(x_i)^2$ by the normalization assumption) do not change the minimizer, we obtain
\begin{align*}
\textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} w_i h(x_i) &= -\,\textrm{argmin}_{h' \in \mathbb{H}} \sum_{i = 1}^{n} -2 w_i h'(x_i)\\
&= -\,\textrm{argmin}_{h' \in \mathbb{H}} \sum_{i = 1}^{n} w_i^2 - 2 w_i h'(x_i) + \left(h'(x_i)\right)^2\\
&= -\,\textrm{argmin}_{h' \in \mathbb{H}} \sum_{i = 1}^{n} \left(h'(x_i) - w_i\right)^2.
\end{align*}
In other words: fit a regression tree $h'$ to the gradient values $w_i$ with the squared loss, and add its negation $h_{t+1} = -h'$ to the ensemble. Equivalently, fit $h_{t+1}$ directly to the negative gradients $-w_i$, which for the squared loss are (up to scaling) exactly the residuals $y_i - H(x_i)$.
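Putting this together for the squared loss, where the negative gradient is the residual $y_i - H(x_i)$: the sketch below fits a depth-1 regression stump (a stand-in for the depth-4 CART trees above) to the residuals in each round. Function names and data are illustrative, not from the notes.

```python
import numpy as np

def fit_stump(x, t):
    """Least-squares regression stump: a depth-1 tree h (approximately)
    minimizing sum_i (h(x_i) - t_i)^2. Assumes x has >= 2 distinct values."""
    best = None
    for s in np.unique(x):                         # candidate split thresholds
        left, right = x <= s, x > s
        if not left.any() or not right.any():
            continue
        cl, cr = t[left].mean(), t[right].mean()   # optimal leaf values
        sse = ((t[left] - cl) ** 2).sum() + ((t[right] - cr) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, cl, cr)
    _, s, cl, cr = best
    return lambda z: np.where(z <= s, cl, cr)

def gbrt(x, y, alpha=0.1, T=100):
    """GBRT sketch for the squared loss: in each round, fit a tree to the
    negative gradient t_i = y_i - H(x_i) (the residuals)."""
    H = np.zeros(len(y))
    trees = []
    for _ in range(T):
        t = y - H                  # negative gradient = residuals
        h = fit_stump(x, t)        # regression tree fit to the residuals
        trees.append(h)
        H = H + alpha * h(x)       # H_{t+1} = H_t + alpha * h
    return trees, H
```

With a small fixed step-size $\alpha$, the training residuals shrink geometrically round by round, which is exactly the bias-reduction behavior the derivation promises.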


(This is one of Kilian's favorite ML algorithms.)

Case Study #2: AdaBoost