19: Boosting


Boosting reduces Bias

Scenario: Hypothesis class $\mathbb{H}$, the set of classifiers, has large bias, i.e. training error is high.
Famous question: In his 1988 machine learning class project, Michael Kearns asked: Can weak learners ($\mathbb{H}$) be combined to generate a strong learner with low bias?
Famous answer: Yes! (Robert Schapire in 1990)
Solution: Create an ensemble classifier $H_T(\vec x) = \sum_{t = 1}^{T}\alpha_t h_t(\vec{x})$. This ensemble classifier is built in an iterative fashion. In iteration $t$ we add the classifier $\alpha_th_t(\vec x)$ to the ensemble. At test time we evaluate all classifiers and return the weighted sum.
The process of constructing such an ensemble in a stage-wise fashion is very similar to gradient descent. However, instead of updating the model parameters in each iteration, we add functions to our ensemble.
Let $\ell$ denote a (convex and differentiable) loss function. With a little abuse of notation we write \begin{equation} \ell(H)=\frac{1}{n}\sum_{i=1}^n \ell(H(\mathbf{x}_i),y_i). \end{equation} Assume we have already finished $t$ iterations and have an ensemble classifier $H_t(\vec{x})$. In iteration $t+1$ we want to add one more weak learner $h_{t+1}$ to the ensemble. To this end we search for the weak learner that decreases the loss the most, \begin{equation} h_{t+1} = \textrm{argmin}_{h \in \mathbb{H}}\ell(H_t + \alpha h). \end{equation} Once $h_{t+1}$ has been found, we add it to our ensemble, i.e. $H_{t+1} := H_t + \alpha h_{t+1}$.
How can we find such $h \in \mathbb{H}$?
Answer: Use gradient descent in function space. In function space, the inner product can be defined as $\langle h,g \rangle=\int\limits_x h(x)g(x)dx$. Since we only have a training set, we define $\langle h,g \rangle= \sum_{i = 1}^{n} h(\mathbf{x}_i)g(\mathbf{x}_i)$.

Gradient descent in functional space

Given $H$, we want to find the step-size $\alpha$ and (weak learner) $h$ to minimize the loss $\ell(H+\alpha h)$. Use a Taylor approximation of $\ell(H+\alpha h)$: \begin{equation} \ell(H+\alpha h) \approx \ell(H) + \alpha\langle\nabla \ell(H),h\rangle. \label{c8:eq:taylorapprox} \end{equation} This approximation (of $\ell$ as a linear function) only holds within a small region around $H$, i.e. as long as $\alpha$ is small. We therefore fix it to a small constant (e.g. $\alpha\approx 0.1$). With the step-size $\alpha$ fixed, we can use the approximation above to find an almost optimal $h$: \begin{equation} \textrm{argmin}_{h\in \mathbb{H}}\ell(H+\alpha h) \approx \textrm{argmin}_{h\in \mathbb{H}}\langle\nabla \ell(H),h\rangle= \textrm{argmin}_{h \in \mathbb{H}}\sum_{i = 1}^{n}\frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}h(\mathbf{x}_i) \end{equation} Here we view the loss as a function of the $n$ training predictions, $\ell(H) = \sum_{i = 1}^{n}\ell(H(\mathbf{x}_i)) = \ell(H(\mathbf{x}_1), \dots , H(\mathbf{x}_n))$ (each prediction is an input to the loss function), so that the $i$-th entry of the gradient is
$\frac{\partial \ell}{\partial H}(\mathbf{x}_i) = \frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}$.
So we can do boosting if we have an algorithm $\mathbb{A}$ to solve
$h_{t+1} = \textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} \underbrace{\frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}}_{r_i} h(\mathbf{x}_i)$

We need a function $\mathbb{A}(\{(\mathbf{x}_1,r_1),\dots,(\mathbf{x}_n,r_n)\})=\textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} r_i h(\mathbf{x}_i)$. In order to make progress this $h$ does not have to be great. We still make progress as long as $\sum_{i = 1}^{n} r_i h(\mathbf{x}_i)<0$.

Generic boosting (a.k.a Anyboost)
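The loop described above can be sketched in a few lines. This is a minimal illustration and not the original pseudo-code: the names `anyboost`, `squared_loss_grad` and `constant_learner` are hypothetical, the weak learner is a deliberately trivial one (a constant function fitted to the negative gradient), and the loop stops as soon as the inner product $\sum_i r_i h(\mathbf{x}_i)$ is no longer negative.

```python
def anyboost(xs, ys, loss_grad, weak_learner, alpha=0.1, T=100):
    """Generic boosting loop (sketch).

    loss_grad(pred, y) returns r_i, the derivative of the loss w.r.t. H(x_i).
    weak_learner(xs, r) returns a function h (the algorithm A in the text).
    """
    ensemble = []  # list of (alpha, h) pairs

    def H(x):
        # The ensemble prediction is the weighted sum of all weak learners.
        return sum(a * h(x) for a, h in ensemble)

    for _ in range(T):
        r = [loss_grad(H(x), y) for x, y in zip(xs, ys)]
        h = weak_learner(xs, r)
        # Only a direction with negative inner product makes progress.
        if sum(ri * h(x) for ri, x in zip(r, xs)) >= 0:
            break
        ensemble.append((alpha, h))
    return H


def squared_loss_grad(pred, y):
    # Assumed example loss: 0.5 * (pred - y)^2, so the gradient is pred - y.
    return pred - y


def constant_learner(xs, r):
    # Toy weak learner (an assumption for illustration only):
    # the constant function that best fits the negative gradient.
    c = -sum(r) / len(r)
    return lambda x: c


H = anyboost([0.0, 1.0, 2.0], [1.0, 1.0, 1.0],
             squared_loss_grad, constant_learner)
print(H(0.0))  # approaches the common label 1.0 as T grows
```

With a fixed step-size $\alpha=0.1$ each round closes a tenth of the remaining gap, which is why many small steps are needed.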

Case study #1: Gradient Boosted Regression Tree(GBRT)

In order to use regression trees for gradient boosting, we must be able to find a tree $h()$ that minimizes $\sum_{i = 1}^{n} r_i h(\mathbf{x}_i)$, i.e. $h = \textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} r_i h(\mathbf{x}_i)$ where $r_i = \frac{\partial \ell}{\partial H(\mathbf{x}_i)}$.
We will make two assumptions:
  1. First, we assume that $\sum_{i = 1}^{n} h^2(\mathbf{x}_i)$ = constant. This is simple to do (we normalize the predictions) and important because we could always decrease $\sum_{i=1}^n h(\mathbf{x}_i)r_i$ by rescaling $h$ with a large constant. By fixing $\sum_{i=1}^n h^2(\mathbf{x}_i)$ to a constant we are essentially fixing the vector $h$ to lie on a sphere, and we are only concerned with its direction but not its length.

  2. Second, CART trees are negation closed, i.e. for every $h \in \mathbb{H}$ we also have $-h \in \mathbb{H}$. (This is generally true.)

In addition, we define the negative gradient as $t_i=-r_i$.

$\textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} r_i h(\mathbf{x}_i)$
= $\textrm{argmin}_{h \in \mathbb{H}}-2\sum_{i = 1}^{n} t_i h(\mathbf{x}_i)$
= $\textrm{argmin}_{h \in \mathbb{H}} \sum_{i = 1}^{n} \underbrace{t_i^2}_{\textrm{constant}} - 2t_i h(\mathbf{x}_i) + \underbrace{(h(\mathbf{x}_i))^2}_{\textrm{constant}}$
= $\textrm{argmin}_{h \in \mathbb{H}}\sum_{i = 1}^{n}(h(\mathbf{x}_i)-t_i)^2$
In other words, we can use the good old regression trees and feed in the negative gradient $t_i$ as the label for each $\mathbf{x}_i.$ In each iteration we build a new tree for a different set of "labels" $t_1,\dots,t_n$.

If the loss function $\ell$ is the squared loss, i.e. $\ell(H)=\frac{1}{2}\sum_{i=1}^n (H(\mathbf{x}_i)-y_i)^2$, then it is easy to show that $$t_i=-\frac{\partial \ell}{\partial H(\mathbf{x}_i)}=y_i-H(\mathbf{x}_i),$$ which is simply the residual, i.e. $\mathbf{t}$ is the vector pointing from $\mathbf{H}$ to $\mathbf{y}$ (and $\mathbf{r}=-\mathbf{t}$ points from $\mathbf{y}$ to $\mathbf{H}$). Importantly, you can use any other differentiable and convex loss function $\ell$, and the solution for your next weak learner $h()$ will always be the regression tree minimizing the squared loss to the targets $t_i$.
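For completeness, the residual claim follows in one step from the squared loss as defined above. Only the $i$-th term of the sum depends on $H(\mathbf{x}_i)$, so \begin{align} r_i=\frac{\partial \ell}{\partial H(\mathbf{x}_i)} &= \frac{\partial}{\partial H(\mathbf{x}_i)}\,\frac{1}{2}\sum_{j=1}^n (H(\mathbf{x}_j)-y_j)^2 = H(\mathbf{x}_i)-y_i, \\ t_i=-r_i &= y_i-H(\mathbf{x}_i). \end{align}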

GBRT in Pseudo Code
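A minimal sketch of the procedure for the squared loss, under simplifying assumptions that are not part of the text above: inputs are one-dimensional, the weak learner is a depth-1 regression tree (a stump) rather than a full CART tree, and the normalization of $h$ is skipped. The names `fit_stump` and `gbrt` are hypothetical.

```python
def fit_stump(xs, ts):
    """Fit a depth-1 regression tree to targets ts by least squares."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    mean_all = sum(ts) / n
    best = (float("inf"), None, mean_all, mean_all)  # (sse, threshold, left, right)
    for k in range(1, n):
        lo, hi = order[:k], order[k:]
        if xs[lo[-1]] == xs[hi[0]]:
            continue  # cannot split between identical inputs
        ml = sum(ts[i] for i in lo) / len(lo)
        mr = sum(ts[i] for i in hi) / len(hi)
        sse = (sum((ts[i] - ml) ** 2 for i in lo)
               + sum((ts[i] - mr) ** 2 for i in hi))
        if sse < best[0]:
            best = (sse, (xs[lo[-1]] + xs[hi[0]]) / 2, ml, mr)
    _, thr, ml, mr = best
    if thr is None:
        return lambda x: ml  # no valid split: predict the mean
    return lambda x: ml if x <= thr else mr


def gbrt(xs, ys, alpha=0.1, T=200):
    """Gradient boosting with stumps for the squared loss (sketch)."""
    trees = []
    preds = [0.0] * len(xs)  # H(x_i) on the training set, starting at H = 0
    for _ in range(T):
        # Negative gradient of the squared loss = residual y_i - H(x_i).
        t = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, t)
        trees.append(h)
        preds = [p + alpha * h(x) for p, x in zip(preds, xs)]
    return lambda x: alpha * sum(h(x) for h in trees)


xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
H = gbrt(xs, ys)
print([round(H(x), 3) for x in xs])
```

For a different loss only the line computing `t` changes. The tree-fitting step stays a least-squares regression.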

Case Study #2: AdaBoost

Finding the best weak learner

AdaBoost uses the exponential loss $\ell(H)=\sum_{i=1}^{n} e^{-y_i H(\mathbf{x}_i)}$ with labels $y_i\in\{+1,-1\}$. First we compute the gradient $r_i=\frac{\partial \ell}{\partial H(\mathbf{x}_i)}=-y_i {e^{-y_i H(\mathbf{x}_i)}}$.

For notational convenience (and for reasons that will become clear in a little bit), let us define $w_i= \frac{1}{Z}e^{-y_iH(\mathbf{x}_i)}$, where $Z=\sum_{i=1}^{n} e^{-y_iH(\mathbf{x}_i)}$ is a normalizing factor so that $\sum_{i=1}^{n} w_i=1.$ Note that the normalizing constant $Z$ is identical to the loss function. Each weight $w_i$ therefore has a very nice interpretation: it is the relative contribution of the training point $(\mathbf{x}_i,y_i)$ towards the overall loss.

In order to find the best next weak learner, we need to solve the following optimization problem (we will make use of the fact that $h(\mathbf{x}_i)\in \{+1,-1\}$): \begin{align} h&=\textrm{argmin}_{h \in \mathbb{H}}\sum_{i=1}^{n}r_ih(\mathbf{x}_i) && \Big(\textrm{substitute in: } r_i=-y_ie^{-H(\mathbf{x}_i)y_i}\Big)\\ &=\textrm{argmin}_{h \in \mathbb{H}}-\sum_{i=1}^n y_i e^{-H(\mathbf{x}_i)y_i}h(\mathbf{x}_i) && \Big(\textrm{divide by the constant } Z>0 \textrm{ and substitute in: } w_i=\frac{1}{Z}e^{-H(\mathbf{x}_i)y_i}\Big)\\ &=\textrm{argmin}_{h \in \mathbb{H}}-\sum_{i=1}^{n} w_i y_i h(\mathbf{x}_i) && \Big(y_ih(\mathbf{x}_i)\in \{+1,-1\} \textrm{ with } h(\mathbf{x}_i)y_i=1 \iff h(\mathbf{x}_i)=y_i \Big)\\ &=\textrm{argmin}_{h \in \mathbb{H}}\sum_{i: h(\mathbf{x}_i)\neq y_i} w_i - \sum_{i: h(\mathbf{x}_i)= y_i} w_i && \Big(\sum_{i: h(\mathbf{x}_i)= y_i} w_i=1-\sum_{i: h(\mathbf{x}_i)\neq y_i} w_i\Big)\\ &=\textrm{argmin}_{h \in \mathbb{H}}\sum_{i: h(\mathbf{x}_i)\neq y_i} w_i && \Big(\textrm{This is the weighted classification error.}\Big) \end{align}

Let us denote this weighted classification error as $\epsilon=\sum_{i:h(\mathbf{x}_i)y_i=-1} w_i$. So for AdaBoost, we only need a classifier that can take training data and a distribution over the training set (i.e. normalized weights $w_i$ for all training samples) and that returns a classifier $h\in \mathbb{H}$ with low weighted classification error on these training samples. It doesn't have to do all that well: for the inner product $\sum_i r_i h(\mathbf{x}_i)$ to be negative, it just needs a weighted training error $\epsilon<0.5$.
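To make the weights and the weighted error concrete, here is a small numeric sketch. The function names and the toy values are assumptions chosen for illustration. It also checks the identity used in the derivation, $\sum_i w_i y_i h(\mathbf{x}_i) = 1-2\epsilon$, which is why $\epsilon<0.5$ is exactly the condition for a negative inner product.

```python
import math


def adaboost_weights(H_vals, ys):
    """Normalized weights w_i = exp(-y_i H(x_i)) / Z."""
    u = [math.exp(-y * Hx) for y, Hx in zip(ys, H_vals)]
    Z = sum(u)  # Z equals the exponential loss of H
    return [ui / Z for ui in u]


def weighted_error(ws, h_vals, ys):
    """Sum of the weights of the points that h misclassifies."""
    return sum(w for w, hx, y in zip(ws, h_vals, ys) if hx != y)


ys = [+1, +1, -1, -1]
H_vals = [0.0, 0.0, 0.0, 0.0]   # current ensemble H = 0, so all weights are equal
h_vals = [+1, -1, -1, -1]       # a weak learner that gets the second point wrong

ws = adaboost_weights(H_vals, ys)
eps = weighted_error(ws, h_vals, ys)
margin = sum(w * y * hx for w, y, hx in zip(ws, ys, h_vals))
print(eps, margin, 1 - 2 * eps)
```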

Finding the stepsize $\alpha$

In the previous example, GBRT, we set the stepsize $\alpha$ to be a small constant. As it turns out, in the AdaBoost setting we can find the optimal stepsize (i.e. the one that minimizes $\ell$ the most) in closed form every time we take a "gradient" step.

When we are given $\ell, H, h$, we would like to solve the following optimization problem: \begin{align} \alpha&=\textrm{argmin}_{\alpha}\ell(H+\alpha h)\\ &=\textrm{argmin}_{\alpha} \sum_{i=1}^{n} e^{-y_i[H(\mathbf{x}_i)+\alpha h(\mathbf{x}_i)]} \end{align}

We differentiate w.r.t. $\alpha$ and equate with zero: \begin{align} -\sum_{i=1}^{n} y_i h(\mathbf{x}_i) e^{-(y_i H(\mathbf{x}_i)+\alpha y_i h(\mathbf{x}_i))} &=0 && \Big( y_ih(\mathbf{x}_i)\in\{+1,-1\}\Big)\\ -\sum_{i:h(\mathbf{x}_i) y_i=1} e^{-(y_i H(\mathbf{x}_i)+\alpha \underbrace{y_i h(\mathbf{x}_i)}_{1})} + \sum_{i:h(\mathbf{x}_i) y_i \neq 1} e^{-(y_i H(\mathbf{x}_i)+\alpha \underbrace{y_i h(\mathbf{x}_i)}_{-1})}&=0 && \Big(\textrm{divide by } Z \textrm{ and use } w_i= \frac{1}{Z}e^{-y_iH(\mathbf{x}_i)}\Big)\\ -\sum_{i:h(\mathbf{x}_i) y_i=1} w_i e^{-\alpha} + \sum_{i:h(\mathbf{x}_i) y_i \neq 1} w_i e^{\alpha}&=0 && \Big( \epsilon=\!\!\sum_{i:h(\mathbf{x}_i)y_i=-1} \!\!w_i \Big)\\ -(1-\epsilon)e^{-\alpha}+\epsilon e^{\alpha}&=0 && \\ e^{2 \alpha}&=\frac{1-\epsilon}{\epsilon}\\ \alpha&=\frac{1}{2}\ln \frac{1-\epsilon}{\epsilon} \end{align}

It is unusual that we can find the optimal step-size in such a simple closed form. One consequence is that AdaBoost converges extremely fast.
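As a quick sanity check of the closed form, the following sketch evaluates the $\alpha$-dependent part of the loss, $(1-\epsilon)e^{-\alpha}+\epsilon e^{\alpha}$, at the closed-form step-size and at slightly perturbed values. The function names and the value $\epsilon=0.25$ are illustrative assumptions.

```python
import math


def optimal_alpha(eps):
    """Closed-form AdaBoost step-size for weighted error 0 < eps < 0.5."""
    return 0.5 * math.log((1 - eps) / eps)


def loss_factor(alpha, eps):
    """Factor by which the exponential loss changes after adding alpha * h."""
    return (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)


eps = 0.25
a = optimal_alpha(eps)
print(a)                           # 0.5 * ln(3)
print(loss_factor(a, eps))         # equals 2 * sqrt(eps * (1 - eps))
print(loss_factor(a - 0.1, eps))   # larger than at the optimum
print(loss_factor(a + 0.1, eps))   # larger than at the optimum
```

Since the factor at the optimum is $2\sqrt{\epsilon(1-\epsilon)}<1$ whenever $\epsilon<0.5$, every such step strictly decreases the loss.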


After you take a step, i.e. $H_{t+1}=H_{t}+\alpha h$, you need to re-compute all the weights and then re-normalize. It is, however, straightforward to show that the unnormalized weight $\hat{w}_i$ is updated as $$\hat{w}_i \leftarrow \hat{w}_i\, e^{-\alpha h(\mathbf{x}_i)y_i}$$ and that the normalizer $Z$ becomes $$Z\leftarrow Z \cdot 2\sqrt{\epsilon(1-\epsilon)}.$$ Putting these two together we obtain the following multiplicative update rule: $${w}_i\leftarrow w_i\frac{e^{-\alpha h(\mathbf{x}_i)y_i}}{2\sqrt{\epsilon(1-\epsilon)}}.$$
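The update of the normalizer can be verified directly from the definitions of $Z$, $w_i$, $\epsilon$ and the optimal step-size $\alpha=\frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$, which gives $e^{-\alpha}=\sqrt{\epsilon/(1-\epsilon)}$ and $e^{\alpha}=\sqrt{(1-\epsilon)/\epsilon}$: \begin{align} Z_{\textrm{new}} &= \sum_{i=1}^{n} e^{-y_i[H(\mathbf{x}_i)+\alpha h(\mathbf{x}_i)]} = Z\sum_{i=1}^{n} w_i e^{-\alpha y_i h(\mathbf{x}_i)} \\ &= Z\left[(1-\epsilon)e^{-\alpha}+\epsilon e^{\alpha}\right] \\ &= Z\left[\sqrt{\epsilon(1-\epsilon)}+\sqrt{\epsilon(1-\epsilon)}\right] = Z\cdot 2\sqrt{\epsilon(1-\epsilon)}. \end{align}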

AdaBoost Pseudo-code
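The pieces derived above (weighted error, closed-form step-size, multiplicative weight update) can be combined into a short sketch. It is an illustration under assumptions that are not in the text: inputs are one-dimensional, the weak learners are threshold stumps, and the case $\epsilon=0$ (where the closed-form $\alpha$ would be infinite) is handled by adding the perfect stump once and stopping. The names `best_stump` and `adaboost` are hypothetical.

```python
import math


def best_stump(xs, ys, ws):
    """Return (weighted error, h) for the best threshold stump on 1-D inputs."""
    pts = sorted(set(xs))
    # One threshold below all points (constant classifiers) plus all midpoints.
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best = None
    for thr in thresholds:
        for sign in (+1, -1):
            h = lambda x, thr=thr, sign=sign: sign if x > thr else -sign
            err = sum(w for x, y, w in zip(xs, ys, ws) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best


def adaboost(xs, ys, T=20):
    """AdaBoost with threshold stumps (sketch). Labels must be +1 or -1."""
    n = len(xs)
    ws = [1.0 / n] * n  # H = 0 initially, so all weights are equal
    ensemble = []
    for _ in range(T):
        eps, h = best_stump(xs, ys, ws)
        if eps >= 0.5:
            break  # no weak learner with negative inner product
        if eps == 0:
            ensemble.append((1.0, h))  # assumption: perfect stump, stop here
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        norm = 2 * math.sqrt(eps * (1 - eps))
        ws = [w * math.exp(-alpha * h(x) * y) / norm
              for w, x, y in zip(ws, xs, ys)]
    return lambda x: sum(a * h(x) for a, h in ensemble)


xs = [0, 1, 2, 3, 4, 5]
ys = [+1, +1, -1, -1, +1, +1]   # not separable by any single stump
H = adaboost(xs, ys, T=3)
print([1 if H(x) > 0 else -1 for x in xs])
```

On this toy set no single stump has zero error, but the weighted combination of three stumps classifies every training point correctly.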

Further analysis

Let us examine each one of these updates.


Boosting is a great way to turn a weak classifier into a strong classifier. It defines a whole family of algorithms, including Gradient Boosting, AdaBoost, LogitBoost, and many others. Gradient Boosted Regression Trees are among the most popular algorithms for Learning to Rank, the branch of machine learning focused on learning ranking functions, for example for web search engines. A few additional things to know: