18: Bagging
Also known as Bootstrap Aggregating (Breiman 96)
\begin{equation*}
\underbrace{\mathbb{E}[(h_D(x) - y)^2]}_\mathrm{Error} = \underbrace{\mathbb{E}[(h_D(x)-\bar{h}(x))^2]}_\mathrm{Variance} + \underbrace{\mathbb{E}[(\bar{h}(x)-\bar{y}(x))^2]}_\mathrm{Bias} + \underbrace{\mathbb{E}[(\bar{y}(x)-y(x))^2]}_\mathrm{Noise}
\end{equation*}
Bagging Reduces Variance
Reduce: $\mathbb{E}[(h_D(x)-\bar{h}(x))^2]$
We want $h_D \to \bar{h}$, where $\bar{h} = \mathbb{E}_{D\sim P^n}[h_D]$ is the expected classifier, averaged over all possible training sets.
Weak law of large numbers
The weak law of large numbers says (roughly) that for i.i.d. random variables $x_i$ with mean $\mu$, we have
\[
\frac{1}{m}\sum_{i = 1}^{m}x_i \rightarrow \mu \quad \textrm{ as } m\rightarrow \infty
\]
Apply this to classifiers: Assume we have $m$ training sets $D_1, D_2, \dots, D_m$, each drawn i.i.d. from $P^n$. Train a classifier on each one and average the results:
$$\hat{h} = \frac{1}{m}\sum_{i = 1}^m h_{D_i} \to \bar{h} \qquad \text{as } m \to \infty$$
Good news: If $\hat{h}\rightarrow \bar{h}$, the variance component of the error must also become zero, i.e.
$\mathbb{E}[(\hat{h}(x)-\bar{h}(x))^2]\rightarrow 0$
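A quick numerical illustration of this effect (a minimal sketch, not from the lecture: the "classifier" here is just a least-squares line through the origin, and all names and numbers are made up for the example). Averaging the predictions of models trained on independent datasets shrinks the variance of the averaged prediction roughly like $1/m$:
```python
# Minimal simulation: variance of an averaged predictor shrinks as m grows.
# Each "model" is a least-squares line through the origin fit to a fresh dataset.
import numpy as np

rng = np.random.default_rng(0)

def draw_dataset(n=20):
    x = rng.uniform(-1, 1, n)
    y = 2.0 * x + rng.normal(0.0, 1.0, n)     # true function y = 2x plus label noise
    return x, y

def train(x, y):
    return (x @ y) / (x @ x)                  # least-squares slope w, i.e. h_D(t) = w * t

x_test = 1.0
for m in [1, 10, 100]:
    preds = []
    for _ in range(2000):                     # repeat to measure the variance over D_1..D_m
        slopes = [train(*draw_dataset()) for _ in range(m)]
        preds.append(np.mean(slopes) * x_test)
    print(f"m = {m:3d}   Var of averaged prediction ~ {np.var(preds):.4f}")
```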
Problem
We don't have $m$ data sets $D_1, \dots, D_m$; we only have $D$.
Solution: Bagging (Bootstrap Aggregating)
Simulate drawing from P by drawing uniformly with replacement from the set D.
i.e. let $Q((\vec{x_i}, y_i)) = \frac{1}{n} \qquad\forall (\vec{x_i}, y_i)\in D$
Draw $D_i \sim Q^n$, i.e. $|D_i| = n$, and the elements of $D_i$ are picked from $D$ uniformly at random with replacement.
Q: What is $\mathbb{E}[|D\cap D_i|]$?
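For reference, a standard calculation (not spelled out in the notes): a given point of $D$ is missed by all $n$ draws with probability $\left(1-\tfrac{1}{n}\right)^n \approx e^{-1}$, so
\begin{equation*}
\mathbb{E}[|D\cap D_i|] = n\left(1-\left(1-\tfrac{1}{n}\right)^{n}\right) \approx \left(1-e^{-1}\right) n \approx 0.632\, n.
\end{equation*}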
Bagged classifier: $\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}$
Notice: $\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}\nrightarrow \bar{h}$ (we cannot invoke the weak law of large numbers here, since the $D_i$ are not i.i.d. draws from $P$)
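A minimal code sketch of the bagging procedure (assuming NumPy arrays and a generic `fit(X, y)` that returns a model with a `.predict` method; the function name `bagged_predict` and this interface are illustrative, not from the lecture):
```python
# Bagging sketch: train m models on bootstrap samples of D and average their predictions.
import numpy as np

def bagged_predict(fit, X, y, X_test, m=100, seed=None):
    """fit(X, y) -> model with .predict(X); returns the averaged (bagged) prediction."""
    rng = np.random.default_rng(seed)
    n = len(X)
    preds = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)   # draw n indices uniformly with replacement: D_i ~ Q^n
        model = fit(X[idx], y[idx])        # h_{D_i}
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)          # \hat{h}_D = (1/m) sum_i h_{D_i}(X_test)
```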
Analysis
Assume $P$ is discrete, with $P(X=x_i) = p_i$ over some set $\Omega = \{x_1, \dots, x_N\}$ ($N$ very large)
(let's ignore the label for now for simplicity)
\begin{equation*}
\begin{aligned}
Q(X=x_i)&= {\sum_{k = 1}^{n}}\underbrace{{n\choose k}p_i^k(1-p_i)^{n-k}}_{\substack{\text{Probability that there are}\\\text{$k$ copies of $x_i$ in $D$}}}\;\underbrace{\frac{k}{n}}_{\substack{\text{Probability to}\\\text{pick one of}\\\text{these copies}}}\\
&=\frac{1}{n}\underbrace{{\sum_{k = 1}^{n}}{n\choose k}p_i^k(1-p_i)^{n-k}k}_{\substack{\text{Expected value of}\\\text{Binomial Distribution}\\\text{with parameters $n$, $p_i$:}\\\mathbb{E}[\mathbb{B}(p_i,n)]=np_i}}\\
&=\frac{1}{n}np_i\\
&=p_i\leftarrow\underline{TADAAA}\text{!! Each data set $D_i$ is drawn from $P$, but not independently.}
\end{aligned}
\end{equation*}
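A small Monte Carlo check of this claim (a sketch with made-up numbers $p=(0.5, 0.3, 0.2)$; both the training set $D$ and the bootstrap sample $D_i$ are redrawn each trial, which is exactly the expectation over $D$ taken in the derivation above):
```python
# Empirical check that a bootstrap draw has the same marginal distribution as P.
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])              # P over Omega = {x_1, x_2, x_3}
n, trials = 50, 20000

counts = np.zeros(len(p))
for _ in range(trials):
    D = rng.choice(len(p), size=n, p=p)    # training set D ~ P^n (points encoded as 0, 1, 2)
    D_i = rng.choice(D, size=n)            # bootstrap sample D_i ~ Q^n (uniform over D, with replacement)
    counts += np.bincount(D_i, minlength=len(p))

print(counts / (trials * n))               # should be close to p
```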
Advantages of Bagging
- Easy to implement
- Works well with many (high variance) classifiers
- Gives out-of-bag estimate of test error (unbiased)
For each $(\vec{x}_i, y_i) \in D$, let
\begin{equation*}
z_i=\sum_{\substack{D_j:\\(\vec{x}_i, y_i) \notin D_j}}1 \leftarrow \text{number of data sets $D_j$ without $(\vec{x}_i, y_i)$}
\end{equation*}
\begin{equation*}
\epsilon_\mathrm{OOB}=\sum_{(\vec{x}_i, y_i) \in D}\frac{1}{z_i}\sum_{\substack{D_l:\\(\vec{x}_i, y_i) \notin D_l}}l(h_{D_l}(\vec{x}_i),y_i)
\end{equation*}
(a code sketch of this computation follows at the end of the section)
- Estimates variance: the spread of the individual predictions $h_{D_i}(\vec{x})$ can be used as an estimate of the variance of the prediction
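A minimal sketch of the out-of-bag computation above (same assumed NumPy-array and `fit`/`predict` interface as in the earlier sketch; `loss` is any pointwise loss $l(\hat{y}, y)$ supplied by the caller):
```python
# Out-of-bag error: each point is evaluated only by the models whose bootstrap sample missed it.
import numpy as np

def oob_error(fit, X, y, loss, m=100, seed=None):
    rng = np.random.default_rng(seed)
    n = len(X)
    models, in_bag = [], []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)           # bootstrap sample D_j
        models.append(fit(X[idx], y[idx]))         # h_{D_j}
        mask = np.zeros(n, dtype=bool)
        mask[idx] = True
        in_bag.append(mask)

    eps_oob = 0.0
    for i in range(n):
        held_out = [h for h, mask in zip(models, in_bag) if not mask[i]]
        z_i = len(held_out)                        # number of data sets without (x_i, y_i)
        if z_i == 0:
            continue                               # (x_i, y_i) appeared in every bootstrap sample
        eps_oob += sum(loss(h.predict(X[i:i+1])[0], y[i]) for h in held_out) / z_i
    return eps_oob                                 # divide by n for a per-point average
```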