18: Bagging
Also known as Bootstrap Aggregating (Breiman 96). Bagging is an ensemble method.
Bagging Reduces Variance
Remember the Bias / Variance decomposition:
$$\underbrace{\mathbb{E}[(h_D(x)-y)^2]}_{\text{Error}} = \underbrace{\mathbb{E}[(h_D(x)-\bar{h}(x))^2]}_{\text{Variance}} + \underbrace{\mathbb{E}[(\bar{h}(x)-\bar{y}(x))^2]}_{\text{Bias}} + \underbrace{\mathbb{E}[(\bar{y}(x)-y(x))^2]}_{\text{Noise}}$$
Our goal is to reduce the variance term: $\mathbb{E}[(h_D(x)-\bar{h}(x))^2]$.
For this, we want $h_D \to \bar{h}$.
Weak law of large numbers
The weak law of large numbers says (roughly) that for i.i.d. random variables $x_i$ with mean $\mu$, we have
$$\frac{1}{m}\sum_{i=1}^{m} x_i \to \mu \quad \text{as } m \to \infty$$
Apply this to classifiers: Assume we have $m$ training sets $D_1, D_2, \dots, D_m$ drawn i.i.d. from $P^n$. Train a classifier on each one and average the results:
$$\hat{h} = \frac{1}{m}\sum_{i=1}^{m} h_{D_i} \to \bar{h} \quad \text{as } m \to \infty$$
We refer to such an average of multiple classifiers as an ensemble of classifiers.
Good news: If $\hat{h} \to \bar{h}$, the variance component of the error must also become zero, i.e.
$$\mathbb{E}[(\hat{h}(x)-\bar{h}(x))^2] \to 0$$
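To see this variance reduction numerically, here is a small sketch (my addition, not from the original notes) that averages $m$ i.i.d. noisy estimates at a single test point; the Gaussian noise model and all constants are placeholder assumptions, but the variance of the average shrinking like $1/m$ is exactly the effect described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate h_{D_i}(x) as an unbiased but noisy prediction of a fixed target.
true_value, noise_std, trials = 1.0, 2.0, 10_000

for m in [1, 10, 100, 1000]:
    # each trial averages m independent "classifiers" at the same test point x
    averages = rng.normal(true_value, noise_std, size=(trials, m)).mean(axis=1)
    print(f"m={m:5d}   variance of the average ≈ {averages.var():.4f}")   # ≈ noise_std**2 / m
```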
Problem
We don't have $m$ data sets $D_1, \dots, D_m$, we only have one set $D$.
Solution: Bagging (Bootstrap Aggregating)
Simulate drawing from $P$ by drawing uniformly with replacement from the set $D$.
i.e. let $Q((x_i,y_i)) = \frac{1}{n} \;\; \forall\, (x_i,y_i) \in D$, with $n = |D|$.
Draw $D_i \sim Q^n$, i.e. $|D_i| = n$ and $D_i$ is picked with replacement from $D$.
Q: What is $\mathbb{E}[\,|D \cap D_i|\,]$?
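If you want to check your answer empirically, here is a small Monte Carlo sketch (my addition, with arbitrary $n$ and trial count): it draws bootstrap index sets and counts how many distinct original points each one contains.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 200

overlaps = []
for _ in range(trials):
    boot = rng.integers(0, n, size=n)        # D_i: n indices drawn uniformly with replacement
    overlaps.append(np.unique(boot).size)    # |D ∩ D_i| = number of distinct points of D in D_i
print(np.mean(overlaps))                     # empirical estimate of E[|D ∩ D_i|]
```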
Bagged classifier: $\hat{h}_D = \frac{1}{m}\sum_{i=1}^{m} h_{D_i}$
Notice: $\hat{h}_D = \frac{1}{m}\sum_{i=1}^{m} h_{D_i} \nrightarrow \bar{h}$ (we cannot use the W.L.L.N. here, since it only holds for i.i.d. samples and the $D_i$ are not independent). However, in practice bagging still reduces variance very effectively.
Analysis
Although we cannot prove that the new samples are i.i.d., we can show that they are drawn from the original distribution P.
Assume $P$ is discrete, with $P(X = x_i) = p_i$ over some set $\Omega = \{x_1, \dots, x_N\}$ ($N$ very large)
(let's ignore the label for now for simplicity)
$$Q(X=x_i) = \sum_{k=1}^{n} \underbrace{\binom{n}{k} p_i^k (1-p_i)^{n-k}}_{\substack{\text{probability that there are}\\ k \text{ copies of } x_i \text{ in } D}} \;\; \underbrace{\frac{k}{n}}_{\substack{\text{probability to pick}\\ \text{one of these copies}}} = \frac{1}{n} \underbrace{\sum_{k=1}^{n} \binom{n}{k} p_i^k (1-p_i)^{n-k}\, k}_{\substack{\text{expected value of the Binomial}\\ \text{distribution with parameter } p_i:\\ \mathbb{E}[B(p_i,n)] = n p_i}} = \frac{1}{n}\, n p_i = p_i$$
Each data set $D_i$ is therefore drawn from $P$, but not independently.
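The derivation above can also be checked by simulation. The sketch below is my addition (the distribution $p$, the value of $n$, and the trial count are placeholders): it samples a training set $D \sim P^n$, draws one bootstrap point from it, and confirms that the marginal distribution of that point matches $P$.

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.2])     # a small discrete P over Omega = {x_1, x_2, x_3}
n, trials = 50, 100_000

draws = np.empty(trials, dtype=int)
for t in range(trials):
    D = rng.choice(len(p), size=n, p=p)   # training set D ~ P^n
    draws[t] = rng.choice(D)              # one bootstrap draw, uniform over D
print(np.bincount(draws, minlength=len(p)) / trials)   # ≈ p, i.e. Q(X = x_i) ≈ p_i
```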
Bagging summarized
- Sample $m$ data sets $D_1, \dots, D_m$ from $D$ with replacement.
- For each $D_j$ train a classifier $h_j(\cdot)$.
- The final classifier is $h(x) = \frac{1}{m}\sum_{j=1}^{m} h_j(x)$.
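These three steps fit in a few lines of code. The sketch below is my own illustration (the function names are not a standard API); it assumes scikit-learn decision trees as the high-variance base classifiers and integer class labels $0,\dots,C-1$. For regression you would average the predictions instead of voting.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_fit(X, y, m=100, seed=0):
    """Train m classifiers, each on a bootstrap sample D_j of (X, y)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    ensemble = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)   # n points drawn uniformly with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagged_predict(ensemble, X):
    """Combine the m predictions; for classification this is a majority vote."""
    votes = np.stack([h.predict(X) for h in ensemble]).astype(int)   # shape (m, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```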
In practice, a larger $m$ results in a better ensemble, although at some point you will obtain diminishing returns. Note that setting $m$ unnecessarily high will only slow down your classifier, but it will not increase its error.
Advantages of Bagging
- Easy to implement
- Works well with many (high variance) classifiers
- Provides an unbiased estimate of the test error, which we refer to as the out-of-bag error. For each training point $(x_i,y_i) \in D$, compute the out-of-bag error as the average error obtained on this training point from all the classifiers that were trained without it (a computational sketch follows this list).
For $(x_i,y_i) \in D$, let
$$z_i = \sum_{D_j:\, (x_i,y_i) \notin D_j} 1 \qquad \leftarrow \text{number of data sets without } (x_i,y_i).$$
The out-of-bag error is then
$$\epsilon_{\mathrm{OOB}} = \sum_{(x_i,y_i) \in D} \frac{1}{z_i} \sum_{D_l:\, (x_i,y_i) \notin D_l} \ell(h_{D_l}(x_i), y_i).$$
- From the predictions of the individual classifiers in the bag we can estimate the variance of our ensemble. This can be very helpful for estimating the uncertainty with which the classifier makes a prediction. For example, if every one of the $m$ classifiers agrees on the label, the ensemble is very certain. If, on the other hand, only 51% agree on the label and the other 49% disagree, the classifier is very uncertain.
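Here is a sketch of the out-of-bag computation, under the same assumptions as the bagging sketch above (my own helper, not a standard API): while training we record which points each bootstrap sample left out, average each point's zero-one loss over the classifiers that never saw it, and then average over the training points.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_fit_oob(X, y, m=100, seed=0):
    """Train m bootstrap classifiers and return them with an out-of-bag error estimate."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    loss_sum = np.zeros(n)   # summed zero-one loss per point, over classifiers that never saw it
    z = np.zeros(n)          # z_i: number of data sets D_j without (x_i, y_i)
    ensemble = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)
        h = DecisionTreeClassifier().fit(X[idx], y[idx])
        ensemble.append(h)
        oob = np.setdiff1d(np.arange(n), idx)     # points not in this bootstrap sample
        if oob.size:                              # (almost always non-empty)
            loss_sum[oob] += (h.predict(X[oob]) != y[oob])
            z[oob] += 1
    seen = z > 0                                  # skip points contained in every D_j
    eps_oob = np.mean(loss_sum[seen] / z[seen])   # average OOB error over training points
    return ensemble, eps_oob
```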
Random Forest
One of the most famous and useful bagged algorithms is the Random Forest! A Random Forest is essentially nothing else but bagged decision trees, with a slightly modified splitting criterion.
The algorithm works as follows:
- Sample $m$ data sets $D_1, \dots, D_m$ from $D$ with replacement.
- For each $D_j$ train a full decision tree $h_j(\cdot)$ (max-depth $=\infty$) with one small modification: before each split, randomly subsample $k \leq d$ features (without replacement) and only consider these for the split. (This further increases the variance of the trees.)
- The final classifier is $h(x) = \frac{1}{m}\sum_{j=1}^{m} h_j(x)$.
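In practice you rarely implement this loop yourself, since the same recipe is available off the shelf. A hedged usage sketch with scikit-learn's `RandomForestClassifier` (the toy dataset is only there to make the example runnable): `n_estimators` plays the role of $m$, `max_features` the role of $k$, and `oob_score=True` gives the out-of-bag estimate discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data just to make the example runnable; replace with your own features/labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# m = n_estimators fully grown trees (max_depth=None), each on a bootstrap sample,
# with k = sqrt(d) features considered at every split (max_features="sqrt").
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                max_depth=None, oob_score=True, random_state=0)
forest.fit(X_tr, y_tr)
print("out-of-bag estimate:", forest.oob_score_)
print("test accuracy:      ", forest.score(X_te, y_te))
```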
The Random Forest is one of the best, most popular, and easiest-to-use out-of-the-box classifiers.
There are two reasons for this:
- The RF only has two hyper-parameters, $m$ and $k$, and it is extremely insensitive to both of them. A good choice for $k$ is $k = \sqrt{d}$ (where $d$ denotes the number of features). You can set $m$ as large as you can afford.
- Decision trees do not require a lot of preprocessing. For example, the features can be of different scale, magnitude, or slope. This can be highly advantageous in scenarios with heterogeneous data, for example in medical settings where features could be things like blood pressure, age, gender, ..., each of which is recorded in completely different units.