## 18: Bagging

Also known as Bootstrap Aggregating (Breiman 1996). Recall the bias-variance decomposition of the expected test error: \begin{equation*} \underbrace{\mathbb{E}[(h_D(x) - y)^2]}_\mathrm{Error} = \underbrace{\mathbb{E}[(h_D(x)-\bar{h}(x))^2]}_\mathrm{Variance} + \underbrace{\mathbb{E}[(\bar{h}(x)-\bar{y}(x))^2]}_\mathrm{Bias} + \underbrace{\mathbb{E}[(\bar{y}(x)-y(x))^2]}_\mathrm{Noise} \end{equation*}

### Bagging Reduces Variance

Goal: reduce the variance term $\mathbb{E}[(h_D(x)-\bar{h}(x))^2]$, i.e. we want $h_D \to \bar{h}$.

#### Weak law of large numbers

The weak law of large numbers says (roughly) that for i.i.d. random variables $x_i$ with mean $\mu$, we have $\frac{1}{m}\sum_{i = 1}^{m}x_i \rightarrow \mu \textrm{ as } m\rightarrow \infty$
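A quick numerical illustration of the weak law of large numbers (a toy sketch, using Bernoulli draws with an assumed mean of $0.3$): the sample mean gets closer to $\mu$ as the number of draws grows.

```python
import random

# Weak law of large numbers: the sample mean of i.i.d. draws
# approaches the true mean mu as m grows.
random.seed(0)
mu = 0.3  # assumed true mean of a Bernoulli(0.3) variable

def sample_mean(m):
    """Average of m i.i.d. Bernoulli(mu) draws."""
    return sum(random.random() < mu for _ in range(m)) / m

# The deviation from mu shrinks as m increases.
print(abs(sample_mean(100) - mu), abs(sample_mean(200_000) - mu))
```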
Apply this to classifiers: assume we have $m$ training sets $D_1, D_2, ..., D_m$ drawn i.i.d. from $P^n$. Train a classifier on each one and average the results: $$\hat{h} = \frac{1}{m}\sum_{i = 1}^m h_{D_i} \to \bar{h} \qquad \text{as } m \to \infty$$
Good news: if $\hat{h}\rightarrow \bar{h}$, the variance component of the error must also become zero, i.e. $\mathbb{E}[(\hat{h}(x)-\bar{h}(x))^2]\rightarrow 0$
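The variance reduction can be seen empirically. In this toy sketch (all names and the "learner" are assumptions for illustration) each $h_{D_i}$ simply predicts the mean label of its training set, the labels are $\mathcal{N}(0,1)$, so $\bar{h} = 0$, and averaging $m$ independently trained learners shrinks $\mathbb{E}[(\hat{h}-\bar{h})^2]$:

```python
import random

random.seed(1)

def train(dataset):
    """Toy 'learner': predicts the mean label of its training set."""
    return sum(dataset) / len(dataset)

def draw_dataset(n=20):
    """Draw n labels y ~ N(0, 1); the averaged predictor h_bar is 0."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def variance_of_average(m, trials=2000):
    """Empirical E[(h_hat - h_bar)^2] when averaging m independent learners."""
    total = 0.0
    for _ in range(trials):
        h_hat = sum(train(draw_dataset()) for _ in range(m)) / m
        total += h_hat ** 2  # h_bar = 0
    return total / trials

# Averaging m learners divides the variance by roughly m.
print(variance_of_average(1), variance_of_average(25))
```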
##### Problem
We don't have $m$ independent data sets $D_1, ...., D_m$; we only have one set $D$.

#### Solution: Bagging (Bootstrap Aggregating)

Simulate drawing from $P$ by drawing uniformly with replacement from the set $D$,
i.e. let $Q((\vec{x}_i, y_i)) = \frac{1}{n} \qquad\forall (\vec{x}_i, y_i)\in D$
Draw $D_i \sim Q^n$, i.e. $|D_i| = n$, and $D_i$ is sampled from $D$ with replacement
Q: What is $\mathbb{E}[|D\cap D_i|]$?
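A short simulation answers this (a sketch; the sample sizes are arbitrary choices). A given point is absent from one bootstrap sample with probability $(1-\frac{1}{n})^n \to \frac{1}{e}$, so the expected overlap is $n\big(1-(1-\frac{1}{n})^n\big) \approx 0.632\,n$:

```python
import random

random.seed(2)

def expected_overlap(n, trials=500):
    """Empirically estimate E[|D ∩ D_i|]: how many distinct points of D
    survive into one bootstrap sample D_i of size n (with replacement)."""
    total = 0
    for _ in range(trials):
        sample = {random.randrange(n) for _ in range(n)}  # D_i as a set of indices
        total += len(sample)
    return total / trials

n = 1000
# Each point is missed with probability (1 - 1/n)^n -> 1/e, so the
# expected fraction of D that appears in D_i is about 1 - 1/e ≈ 0.632.
print(expected_overlap(n) / n)
```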
Bagged classifier: $\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}$
Notice: $\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}\nrightarrow \bar{h}$ (we cannot use the W.L.L.N. here, since it only holds for i.i.d. samples and the $D_i$ are not independent draws from $P^n$)
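The bagged classifier $\hat{h}_D$ can be sketched in a few lines. This is a minimal illustration, not a production implementation; the constant-predictor "learner" and the data set are assumptions chosen only to keep the example self-contained:

```python
import random

random.seed(3)

def bootstrap(D):
    """Draw D_i ~ Q^n: n uniform draws from D with replacement."""
    n = len(D)
    return [D[random.randrange(n)] for _ in range(n)]

def bagged_predict(D, train, x, m=25):
    """Bagged predictor: h_hat(x) = (1/m) * sum_i h_{D_i}(x)."""
    return sum(train(bootstrap(D))(x) for _ in range(m)) / m

# Toy learner (an assumption for illustration): fit a constant, the mean label.
def train_mean(dataset):
    c = sum(y for (_, y) in dataset) / len(dataset)
    return lambda x: c

# Labels cluster around 2.0, so the bagged prediction is close to 2.0.
D = [(x, 2.0 + random.gauss(0.0, 0.1)) for x in range(10)]
print(bagged_predict(D, train_mean, x=0.0))
```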
##### Analysis
Assume $P$ is discrete, with $P(X=x_i) = p_i$ over some set $\Omega = \{x_1, \dots, x_N\}$ ($N$ very large); let's ignore the labels for now for simplicity.
\begin{equation*} \begin{aligned} Q(X=x_i)&= \sum_{k = 1}^{n}\underbrace{{n\choose k}p_i^k(1-p_i)^{n-k}}_{\substack{\text{Probability that there are}\\\text{$k$ copies of $x_i$ in $D$}}} \underbrace{\frac{k}{n}}_{\substack{\text{Probability of picking}\\\text{one of these copies}}}\\ &=\frac{1}{n}\underbrace{\sum_{k = 1}^{n}{n\choose k}p_i^k(1-p_i)^{n-k}\,k}_{\substack{\text{Expected value of the}\\\text{Binomial distribution}\\\text{with parameters $n$, $p_i$:}\\\mathbb{E}[\mathbb{B}(p_i,n)]=np_i}}\\ &=\frac{1}{n}np_i\\ &=p_i\leftarrow\underline{TADAA}\text{!! Each data set $D_i$ is drawn from $P$, but not independently.} \end{aligned} \end{equation*}
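The identity $Q(X=x_i)=p_i$ can be checked by simulation (a sketch; the three-point distribution and sample sizes are arbitrary assumptions): draw $D \sim P^n$, then draw one bootstrap point from $D$; marginally that point should be distributed as $P$.

```python
import random
from collections import Counter

random.seed(4)

# Assumed discrete P over Omega = {0, 1, 2} with probabilities p_i.
p = [0.5, 0.3, 0.2]
n = 20

def draw_from_P():
    """One draw from the discrete distribution P via its CDF."""
    r, acc = random.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if r < acc:
            return i
    return len(p) - 1

# Draw a fresh D ~ P^n each trial, then one bootstrap point from D.
counts = Counter()
trials = 20_000
for _ in range(trials):
    D = [draw_from_P() for _ in range(n)]
    counts[random.choice(D)] += 1  # one draw from Q given this D

q = [counts[i] / trials for i in range(len(p))]
print(q)  # marginally Q(X = x_i) = p_i, so q ≈ p
```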

For each $(\vec{x}_i, y_i) \in D$, let \begin{equation*} z_i=\sum_{\substack{l\\(\vec{x}_i, y_i) \notin D_l}}\mathbf{1} \leftarrow \text{number of data sets without $(\vec{x}_i, y_i)$} \end{equation*} The out-of-bag error evaluates each training point using only the classifiers whose bootstrap sets did not contain it, and averages over all points: \begin{equation*} \epsilon_\mathrm{OOB}=\frac{1}{n}\sum_{(\vec{x}_i, y_i) \in D}\frac{1}{z_i}\sum_{\substack{l\\(\vec{x}_i, y_i) \notin D_l}}\ell(h_{D_l}(\vec{x}_i),y_i) \end{equation*}
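The OOB computation can be sketched as follows (a minimal illustration; the squared loss, the constant-mean "learner", and the data are assumptions made to keep the example self-contained):

```python
import random

random.seed(5)

def squared_loss(pred, y):
    return (pred - y) ** 2

def oob_error(D, train, m=50):
    """Out-of-bag error: each point (x_i, y_i) is evaluated only by the
    z_i classifiers whose bootstrap sets did not contain index i."""
    n = len(D)
    models, in_bag = [], []
    for _ in range(m):
        idx = [random.randrange(n) for _ in range(n)]  # indices forming D_l
        models.append(train([D[j] for j in idx]))
        in_bag.append(set(idx))
    total, counted = 0.0, 0
    for i, (x, y) in enumerate(D):
        losses = [squared_loss(models[l](x), y)
                  for l in range(m) if i not in in_bag[l]]
        if losses:                 # z_i > 0: point was left out at least once
            total += sum(losses) / len(losses)
            counted += 1
    return total / counted

# Toy learner (an assumption): predict the training-set mean label.
def train_mean(dataset):
    c = sum(y for _, y in dataset) / len(dataset)
    return lambda x: c

D = [(x, random.gauss(1.0, 0.1)) for x in range(30)]
print(oob_error(D, train_mean))  # small, since labels have low noise
```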