## 18: Bagging

Bagging, short for Bootstrap Aggregating (Breiman, 1996), is an ensemble method.

### Bagging Reduces Variance

Remember the Bias / Variance decomposition: \begin{equation*} \underbrace{\mathbb{E}[(h_D(x) - y)^2]}_\mathrm{Error} = \underbrace{\mathbb{E}[(h_D(x)-\bar{h}(x))^2]}_\mathrm{Variance} + \underbrace{\mathbb{E}[(\bar{h}(x)-\bar{y}(x))^2]}_\mathrm{Bias} + \underbrace{\mathbb{E}[(\bar{y}(x)-y(x))^2]}_\mathrm{Noise} \end{equation*} Our goal is to reduce the variance term: $\mathbb{E}[(h_D(x)-\bar{h}(x))^2]$.
For this, we want $h_D \to \bar{h}$.

#### Weak law of large numbers

The weak law of large numbers says (roughly) for i.i.d. random variables $x_i$ with mean $\bar{x}$, we have, $\frac{1}{m}\sum_{i = 1}^{m}x_i \rightarrow \bar{x} \textrm{ as } m\rightarrow \infty$
Apply this to classifiers: assume we have $m$ training sets $D_1, D_2, \dots, D_m$, each drawn from $P^n$. Train a classifier on each one and average the results: $$\hat{h} = \frac{1}{m}\sum_{i = 1}^m h_{D_i} \to \bar{h} \qquad \text{as } m \to \infty$$ We refer to such an average of multiple classifiers as an ensemble of classifiers.
Good news: If $\hat{h}\rightarrow \bar{h}$ the variance component of the error must also vanish, i.e. $\mathbb{E}[(\hat{h}(x)-\bar{h}(x))^2]\rightarrow 0$
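To see this variance reduction in action, here is a minimal numpy sketch under toy assumptions: the base learner is a least-squares line fit (a hypothetical, deliberately high-variance choice) on data from $y = \sin(2x) + \text{noise}$, and `hbar` approximates $\bar{h}(x_0)$ with a large ensemble. The measured squared deviation of the averaged predictor from $\bar{h}$ shrinks roughly like $1/m$.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(x, y):
    # Base learner: least-squares line fit (a deliberately high-variance toy model).
    a, b = np.polyfit(x, y, 1)
    return lambda q: a * q + b

def draw_dataset(n=20):
    # One training set D ~ P^n for the toy problem y = sin(2x) + noise.
    x = rng.uniform(-1, 1, n)
    return x, np.sin(2 * x) + rng.normal(0, 0.5, n)

x0 = 0.5  # measure the variance of the prediction at a single test point
hbar = np.mean([train(*draw_dataset())(x0) for _ in range(2000)])  # ≈ h̄(x0)

variances = {}
for m in (1, 10, 100):
    # Average m classifiers trained on *independent* data sets, 200 times over,
    # and estimate E[(ĥ(x0) - h̄(x0))²] empirically.
    preds = [np.mean([train(*draw_dataset())(x0) for _ in range(m)])
             for _ in range(200)]
    variances[m] = np.mean((np.array(preds) - hbar) ** 2)
    print(m, variances[m])
```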
Problem: We don't have $m$ data sets $D_1, \dots, D_m$; we only have the one set $D$.

#### Solution: Bagging (Bootstrap Aggregating)

Simulate drawing from P by drawing uniformly with replacement from the set D.
i.e. let $Q(X,Y|D)$ be a probability distribution that picks a training sample $(\mathbf{x}_i,y_i)$ from $D$ uniformly at random. More formally, $Q((\mathbf{x_i}, y_i)|D) = \frac{1}{n} \qquad\forall (\mathbf{x_i}, y_i)\in D$ with $n=|D|$.
We sample sets $D_i\sim Q^n$, i.e. $|D_i| = n$ and each element of $D_i$ is drawn from $Q(\cdot|D)$ — in other words, $D_i$ is picked with replacement from $D$.

Q: What is $\mathbb{E}[|D\cap D_i|]$?
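(A classic bootstrap fact: any fixed point of $D$ is missed by one draw with probability $1-\frac{1}{n}$, so it appears in $D_i$ with probability $1-(1-\frac{1}{n})^n \to 1-\frac{1}{e}\approx 63.2\%$, giving $\mathbb{E}[|D\cap D_i|]\approx 0.632\,n$.) A quick numpy sketch confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 500

# Fraction of the original n points that survive into a bootstrap sample D_i:
# draw n indices with replacement and count how many distinct ones appear.
fracs = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(trials)]

empirical = np.mean(fracs)          # empirical E[|D ∩ D_i|] / n
exact = 1 - (1 - 1 / n) ** n        # exact value, → 1 - 1/e ≈ 0.632
print(empirical, exact)
```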
Bagged classifier: $\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}$
Notice: $\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}\nrightarrow \bar{h}$ (we cannot use the W.L.L.N. here, as it only applies to i.i.d. samples, and the $D_i$ are all drawn from the same $D$). However, in practice bagging still reduces variance very effectively.
##### Analysis
Although we cannot prove that the new samples are i.i.d., we can show that they are drawn from the original distribution $P$. Assume $P$ is discrete, with $P(X=x_i) = p_i$ over some set $\Omega = \{x_1, \dots, x_N\}$ ($N$ very large); for simplicity we ignore the labels for now.
\begin{equation*} \begin{aligned} Q(X=x_i)&= \sum_{k = 1}^{n}\underbrace{{n\choose k}p_i^k(1-p_i)^{n-k}}_{\substack{\text{Probability that there are}\\\text{$k$ copies of $x_i$ in $D$}}} \underbrace{\frac{k}{n}}_{\substack{\text{Probability to}\\\text{pick one of}\\\text{these copies}}}\\ &=\frac{1}{n}\underbrace{\sum_{k = 1}^{n}{n\choose k}p_i^k(1-p_i)^{n-k}k}_{\substack{\text{Expected value of the}\\\text{Binomial distribution}\\\text{with parameters $p_i$ and $n$:}\\\mathbb{E}[\mathbb{B}(p_i,n)]=np_i}}\\ &=\frac{1}{n}np_i\\ &=p_i\leftarrow\underline{TATAAA}\text{!! Each data set $D_i$ is drawn from $P$, but not independently.} \end{aligned} \end{equation*}
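The derivation can be sanity-checked numerically. The sketch below (assuming a small three-point distribution $P$) repeatedly draws $D \sim P^n$, lets $Q$ pick one element of $D$ uniformly, and recovers $p$:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])   # a small discrete P over {x_1, x_2, x_3}
n, trials = 50, 20_000

counts = np.zeros(3)
for _ in range(trials):
    D = rng.choice(3, size=n, p=p)   # draw D ~ P^n
    counts[rng.choice(D)] += 1       # Q picks one element of D uniformly

q_hat = counts / trials
print(q_hat)   # ≈ p, i.e. Q(X = x_i) = p_i
```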

There is a simple intuitive argument for why $Q(X=x_i)=P(X=x_i)$. So far we assumed that you first draw $D$ from $P^n$ and then $Q$ picks a sample from $D$. However, you don't have to do it in that order; you can also view sampling from $Q$ in reverse. First use $Q$ to reserve a "slot" in $D$, i.e. a number $j\in\{1,\dots,n\}$, meaning that you will return the $j^{th}$ data point of $D$. At this point you only have the slot $j$, and you still need to fill it with a data point, which you do by sampling it from $P$. It is now obvious that which slot you picked doesn't really matter, so we have $Q(X=x)=P(X=x)$.

##### Bagging summarized
1. Sample $m$ data sets $D_1,\dots,D_m$ from $D$ with replacement.
2. For each $D_j$, train a classifier $h_j(\cdot)$.
3. The final classifier is $h(\mathbf{x})=\frac{1}{m}\sum_{j=1}^m h_j(\mathbf{x})$.
In practice, a larger $m$ results in a better ensemble, though at some point you will see diminishing returns. Note that setting $m$ unnecessarily high will only slow down your classifier; it will not increase its error.
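The three steps above can be sketched end-to-end. The toy example below (a hypothetical 1-d regression problem with a 1-nearest-neighbour base learner, chosen because it has high variance) bags $m=50$ models; by convexity of the squared loss, the ensemble's test MSE is never worse than the average member's MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_1nn(X, y):
    # Base learner: 1-nearest-neighbour regression (low bias, high variance).
    return lambda q: y[np.argmin(np.abs(X[None, :] - q[:, None]), axis=1)]

# Toy 1-d regression problem: y = sin(x) + noise.
n = 200
X = rng.uniform(0, 6, n)
y = np.sin(X) + rng.normal(0, 0.3, n)
X_test = np.linspace(0, 6, 500)
y_test = np.sin(X_test)

# Steps 1 & 2: sample m data sets from D with replacement, train one model each.
m = 50
members = [fit_1nn(X[idx], y[idx])
           for idx in (rng.integers(0, n, size=n) for _ in range(m))]

# Step 3: the final predictor is the average of the members.
preds = np.stack([h(X_test) for h in members])      # shape (m, 500)
member_mse = ((preds - y_test) ** 2).mean(axis=1)   # per-member test error
ensemble_mse = ((preds.mean(axis=0) - y_test) ** 2).mean()

print("mean single-model MSE:", member_mse.mean())
print("bagged ensemble  MSE:", ensemble_mse)
```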