
19: Boosting


Boosting reduces Bias

Scenario: The hypothesis class $\mathbb{H}$, the set of classifiers we can choose from, has large bias, i.e. the training error is high.
Famous question: In his machine learning class project in 1988, Michael Kearns asked: Can weak learners ($\mathbb{H}$) be combined to generate a strong learner with low bias?
Famous answer: Yes! (Robert Schapire in 1990)
Solution: Create an ensemble classifier $H(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})$. This ensemble classifier is built in an iterative fashion: in iteration $t$ we add the classifier $\alpha_t h_t(\mathbf{x})$ to the ensemble. At test time we evaluate all classifiers and return the weighted sum.
The process of constructing such an ensemble in a stage-wise fashion is very similar to gradient descent. However, instead of updating the model parameters in each iteration, we add functions to our ensemble.
Let $\ell$ denote a (convex and differentiable) loss function. With a little abuse of notation we write $\ell(H_T) = \frac{1}{n}\sum_{i=1}^{n} \ell(H_T(\mathbf{x}_i), y_i)$. Assume we have already finished $t$ iterations and already have an ensemble classifier $H_t(\mathbf{x})$. Now in iteration $t+1$ we want to add one more weak learner $h_{t+1}$ to the ensemble. To this end we search for the weak learner that minimizes the loss the most,
$$h_{t+1} = \operatorname{argmin}_{h \in \mathbb{H}} \ell(H_t + \alpha h).$$
Once $h_{t+1}$ has been found, we add it to our ensemble, i.e. $H_{t+1} := H_t + \alpha h_{t+1}$.
How can we find such an $h \in \mathbb{H}$?
Answer: Use gradient descent in function space. In function space, the inner product can be defined as $\langle h, g \rangle = \int_{x} h(x)\, g(x)\, dx$. Since we only have a finite training set, we instead define $\langle h, g \rangle = \sum_{i=1}^{n} h(\mathbf{x}_i)\, g(\mathbf{x}_i)$.
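A minimal sketch of this empirical inner product in Python (assuming $h$ and $g$ are functions that map an $n \times d$ data matrix to a length-$n$ vector of predictions; the helper name is illustrative):

```python
import numpy as np

def inner_product(h, g, X):
    # <h, g> = sum_i h(x_i) * g(x_i), evaluated on the n training points in X
    return float(np.dot(h(X), g(X)))
```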

Gradient descent in functional space

Given $H$, we want to find a step size $\alpha$ and a weak learner $h$ that minimize the loss $\ell(H + \alpha h)$. Use a Taylor approximation of $\ell(H + \alpha h)$:
$$\ell(H + \alpha h) \approx \ell(H) + \alpha \langle \nabla \ell(H), h \rangle.$$
This approximation (of $\ell$ as a linear function) only holds within a small region around $\ell(H)$, i.e. as long as $\alpha$ is small. We therefore fix it to a small constant (e.g. $\alpha \approx 0.1$). With the step size $\alpha$ fixed, we can use the approximation above to find an (almost) optimal $h$:
$$\operatorname{argmin}_{h \in \mathbb{H}} \ell(H + \alpha h) \approx \operatorname{argmin}_{h \in \mathbb{H}} \langle \nabla \ell(H), h \rangle = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}\, h(\mathbf{x}_i).$$
Here we can write $\ell(H) = \sum_{i=1}^{n} \ell(H(\mathbf{x}_i)) = \ell(H(\mathbf{x}_1), \dots, H(\mathbf{x}_n))$ (each prediction is an input to the loss function; the labels $y_i$ are suppressed for brevity), so the gradient of $\ell$ with respect to the function $H$ is simply the vector of partial derivatives with respect to the $n$ predictions:
$$\frac{\partial \ell}{\partial H}(\mathbf{x}_i) = \frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}.$$
So we can do boosting as long as we have an algorithm $\mathcal{A}$ that solves
$$h_{t+1} = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}\, h(\mathbf{x}_i).$$
That is, given weights $w_i$ and inputs $\mathbf{x}_i$, the algorithm $\mathcal{A}$ returns $h^* = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} w_i\, h(\mathbf{x}_i)$.
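As a concrete (standard) example of what these weights look like, not spelled out above: for the squared loss $\ell(H) = \frac{1}{2}\sum_{i=1}^{n} \left(H(\mathbf{x}_i) - y_i\right)^2$, the weight is simply the residual, $w_i = \frac{\partial \ell}{\partial [H(\mathbf{x}_i)]} = H(\mathbf{x}_i) - y_i$.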

Generic boosting (a.k.a Anyboost)

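A minimal Python sketch of this generic procedure, assuming the caller supplies a `gradient` function that returns the weights $w_i = \frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}$ at the current predictions and an `oracle` implementing the algorithm $\mathcal{A}$ above; both helper names and the early-stopping test are illustrative assumptions:

```python
import numpy as np

def anyboost(X, gradient, oracle, alpha=0.1, T=100):
    """Sketch of generic boosting. `gradient(preds)` returns the weights
    w_i = dl/dH(x_i) at the current predictions; `oracle(X, w)` returns a
    weak learner h (a callable X -> predictions) approximately minimizing
    sum_i w_i * h(x_i). Returns the ensemble as a list of (alpha, h) pairs."""
    ensemble = []
    preds = np.zeros(len(X))               # H_0 = 0
    for _ in range(T):
        w = gradient(preds)                # w_i = dl / dH(x_i)
        h = oracle(X, w)                   # h ~ argmin_h sum_i w_i h(x_i)
        if np.dot(w, h(X)) >= 0:           # no descent direction left: stop early
            break
        ensemble.append((alpha, h))
        preds = preds + alpha * h(X)       # H_{t+1} = H_t + alpha * h_{t+1}
    return ensemble

def predict(ensemble, X):
    return sum(a * h(X) for a, h in ensemble)
```

The stopping test $\sum_i w_i\, h(\mathbf{x}_i) \ge 0$ reflects the linear approximation above: if even the best weak learner is not a descent direction, adding it cannot decrease the (linearized) loss.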


Case study #1: Gradient Boosted Regression Tree (GBRT)

Choice of weak learners $\mathbb{H}$: decision/regression trees of limited depth (e.g. depth = 4).
But how can we find a tree $h = \operatorname{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} w_i\, h(\mathbf{x}_i)$, where $w_i = \frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}$? CART trees are built to minimize the squared loss, not a weighted sum of this form.
Assume that $\sum_{i=1}^{n} h^2(\mathbf{x}_i) = \text{constant}$ (e.g. normalize the output).
Realize:
1. $\sum_{i=1}^{n} \left(\frac{\partial \ell}{\partial [H(\mathbf{x}_i)]}\right)^2 = \sum_{i=1}^{n} w_i^2$ is independent of $h_{t+1}$.
2. CART trees are negation closed, i.e. $h \in \mathbb{H} \Rightarrow -h \in \mathbb{H}$.
$$\begin{aligned}
\min_{h \in \mathbb{H}} \sum_{i=1}^{n} w_i\, h(\mathbf{x}_i)
&= \min_{h \in \mathbb{H}} 2\sum_{i=1}^{n} w_i\, h(\mathbf{x}_i)\\
&= \min_{h \in \mathbb{H}} \sum_{i=1}^{n} \left( w_i^2 + 2 w_i\, h(\mathbf{x}_i) + h^2(\mathbf{x}_i) \right)\\
&= \min_{h \in \mathbb{H}} \sum_{i=1}^{n} \left( h(\mathbf{x}_i) - (-w_i) \right)^2
\end{aligned}$$
(Scaling by 2 and adding terms that do not depend on $h$, namely $\sum_i w_i^2$ and $\sum_i h^2(\mathbf{x}_i)$, leave the minimizer unchanged.) In other words, we can use the standard CART algorithm to fit a regression tree $h$ to the negative gradient values $-w_i$ under the squared loss.
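A small numerical sanity check of this equivalence (a sketch under the stated assumptions, not part of the original derivation): over a finite, negation-closed candidate set whose prediction vectors all have the same squared norm, the weighted-sum objective and the squared loss against the negative gradient pick the same candidate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
w = rng.normal(size=n)                                   # gradient weights w_i

# Candidate weak learners, represented by their prediction vectors (h(x_1),...,h(x_n)),
# normalized so that sum_i h(x_i)^2 = 1, and made negation closed.
H_candidates = rng.normal(size=(8, n))
H_candidates /= np.linalg.norm(H_candidates, axis=1, keepdims=True)
H_candidates = np.vstack([H_candidates, -H_candidates])

weighted_sum = (H_candidates * w).sum(axis=1)            # sum_i w_i h(x_i)
squared_loss = ((H_candidates - (-w)) ** 2).sum(axis=1)  # sum_i (h(x_i) - (-w_i))^2

assert weighted_sum.argmin() == squared_loss.argmin()    # same minimizer
```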

GBRT

(This is one of Kilian's favorite ML algorithms.)
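A minimal sketch of GBRT for the squared loss (an assumed choice; the derivation above allows any convex, differentiable loss), using scikit-learn's `DecisionTreeRegressor` as the depth-limited weak learner; function names and defaults are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrt_fit(X, y, T=100, alpha=0.1, depth=4):
    """Fit an ensemble H(x) = sum_t alpha * h_t(x) of depth-limited regression trees."""
    trees = []
    preds = np.zeros(len(y))                    # H_0 = 0
    for _ in range(T):
        # For the squared loss 1/2 * sum_i (H(x_i) - y_i)^2, the negative gradient
        # -w_i = y_i - H(x_i) is just the current residual.
        residuals = y - preds
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residuals)
        trees.append(tree)
        preds += alpha * tree.predict(X)        # H_{t+1} = H_t + alpha * h_{t+1}
    return trees

def gbrt_predict(trees, X, alpha=0.1):
    return alpha * sum(tree.predict(X) for tree in trees)

# Usage (illustrative):
# trees = gbrt_fit(X_train, y_train, T=100, alpha=0.1, depth=4)
# y_hat = gbrt_predict(trees, X_test)
```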

Case Study #2: AdaBoost