Also known as Bootstrap Aggregating (Breiman 96). Bagging is an ensemble method.

For this, we want $h_D \to \bar{h}$.

Apply this to classifiers: Assume we have $m$ training sets $D_1, D_2, \dots, D_m$ drawn from $P^n$. Train a classifier on each one and average the results: $$\hat{h} = \frac{1}{m}\sum_{i = 1}^m h_{D_i} \to \bar{h} \qquad \text{as } m \to \infty$$ We refer to such an average of multiple classifiers as an ensemble of classifiers.

i.e. let $Q((\mathbf{x_i}, y_i)) = \frac{1}{n} \qquad\forall (\mathbf{x_i}, y_i)\in D$

Draw $D_i \sim Q^n$, i.e. $|D_i| = n$, and $D_i$ is picked with replacement from $D$.

Bagged classifier: $\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}$
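A minimal sketch of drawing one bootstrap data set, assuming $D$ is held as a Python list of $(\mathbf{x}, y)$ pairs. The helper name `bootstrap_sample` and the toy data are illustrative and not part of the notes.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)                  # fixed seed so the sketch is repeatable
D = [(x, x % 2) for x in range(10)]     # toy data set of (x, y) pairs (assumption)
D_1 = bootstrap_sample(D, rng)          # same size as D; duplicates are likely
```

Because the draw is with replacement, some points of $D$ typically appear several times in $D_1$ while others do not appear at all.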

Notice: $\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}\nrightarrow \bar{h}$ (we cannot use the weak law of large numbers (W.L.L.N.) here, because it only applies to i.i.d. samples). However, in practice bagging still reduces variance very effectively.

\begin{equation*}
\begin{aligned}
Q(X=x_i) &= \sum_{k = 1}^{n}\underbrace{{n\choose k}p_i^k(1-p_i)^{n-k}}_{\substack{\text{Probability that there are}\\\text{$k$ copies of $x_i$ in $D$}}}\;\underbrace{\frac{k}{n}}_{\substack{\text{Probability of}\\\text{picking one of}\\\text{these copies}}}\\
&= \frac{1}{n}\underbrace{\sum_{k = 1}^{n}{n\choose k}p_i^k(1-p_i)^{n-k}k}_{\substack{\text{Expected value of the}\\\text{Binomial distribution}\\\text{with parameters $n$ and $p_i$:}\\\mathbb{E}[\mathbb{B}(p_i,n)]=np_i}}\\
&= \frac{1}{n}np_i\\
&= p_i \leftarrow \underline{\text{TATAAA}}\text{!! Each data set $D_i$ is drawn from $P$, but not independently.}
\end{aligned}
\end{equation*}
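The identity $Q(X=x_i)=p_i$ can be checked empirically. The sketch below assumes a hypothetical discrete distribution $P$ over three values; it first draws $D \sim P^n$, then picks one element of $D$ uniformly at random, and compares the resulting frequencies with $P$. All names and numbers are illustrative.

```python
import random
from collections import Counter

def two_stage_draw(values, probs, n, rng):
    """Draw D ~ P^n, then return one element of D picked uniformly (a draw from Q)."""
    D = rng.choices(values, weights=probs, k=n)
    return rng.choice(D)

rng = random.Random(0)
values = ["a", "b", "c"]        # hypothetical support of P
probs = [0.5, 0.3, 0.2]         # hypothetical probabilities p_i
trials = 20000

counts = Counter(two_stage_draw(values, probs, 5, rng) for _ in range(trials))
freqs = {v: counts[v] / trials for v in values}   # empirical estimate of Q
```

Up to sampling noise, `freqs` should match `probs`, which is what the derivation predicts for the marginal distribution of a single bootstrap draw.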

- Sample $m$ data sets $D_1,\dots,D_m$ from $D$ with replacement.
- For each $D_j$ train a classifier $h_j()$
- The final classifier is $h(\mathbf{x})=\frac{1}{m}\sum_{j=1}^m h_j(\mathbf{x})$.
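The three steps above can be sketched as follows. The base learner `train_stump` (a one-dimensional threshold classifier with labels in $\{-1,+1\}$) is an assumption chosen to keep the example short; any training routine that returns a function $h_j$ could be substituted.

```python
import random

def train_stump(data):
    """Hypothetical base learner: 1-D threshold classifier minimising training errors."""
    best = None
    for t, _ in data:                       # candidate thresholds at the data points
        for sign in (+1, -1):
            errors = sum(1 for x, y in data if (sign if x >= t else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: sign if x >= t else -sign

def bagging(data, m, rng, train=train_stump):
    """Sample m data sets from data with replacement and train one classifier on each."""
    n = len(data)
    classifiers = []
    for _ in range(m):
        D_j = [data[rng.randrange(n)] for _ in range(n)]
        classifiers.append(train(D_j))
    return classifiers

def predict(classifiers, x):
    """Final classifier: the average of the individual predictions."""
    return sum(h(x) for h in classifiers) / len(classifiers)

rng = random.Random(0)
D = [(x, +1 if x >= 10 else -1) for x in range(20)]   # toy data (assumption)
ensemble = bagging(D, m=25, rng=rng)
score_low, score_high = predict(ensemble, 0), predict(ensemble, 19)
```

For classification, the sign of the averaged score can be taken as the predicted label.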

- Easy to implement
- Works well with many (high variance) classifiers
- Provides an __unbiased__ estimate of the test error, which we refer to as the *out-of-bag error*. For each training point $(\mathbf{x}_i,y_i)\in D$, compute the out-of-bag error as the average error obtained on this training point from all the classifiers that were trained **without** it.

For $(\mathbf{x}_i, y_i) \in D$, let
\begin{align}
z_i &= \sum_{\substack{D_j\\(\mathbf{x}_i, y_i) \notin D_j}}\mathbf{1} \leftarrow \text{number of data sets w/o $(\mathbf{x}_i, y_i)$}\\
\epsilon_\mathrm{OOB} &= \frac{1}{n}\sum_{(\mathbf{x}_i, y_i) \in D}\frac{1}{z_i}\sum_{\substack{D_l\\(\mathbf{x}_i, y_i) \notin D_l}}l(h_{D_l}(\mathbf{x}_i),y_i)
\end{align}

- From the predictions of the individual classifiers in the bag we can estimate the variance of our ensemble. This can be very helpful to estimate the uncertainty with which the classifier makes a prediction. For example, if each one of the $m$ classifiers agrees on the label, the ensemble is very certain. On the other hand, if only $51\%$ agree on the label and the other $49\%$ disagree, the classifier would be very uncertain.
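A sketch of the out-of-bag error and of the vote-based uncertainty described above. It assumes a hypothetical 1-nearest-neighbour base learner on one-dimensional inputs and the zero-one loss; the helper names are illustrative. Points that happen to appear in every bag ($z_i = 0$) are skipped, which is a choice made for this sketch.

```python
import random

def train_1nn(data):
    """Hypothetical base learner: 1-nearest-neighbour on 1-D inputs."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def build_bags(data, m, train, rng):
    """Train m classifiers and remember which indices of data each one saw."""
    n = len(data)
    bags = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]
        bags.append((set(idx), train([data[i] for i in idx])))
    return bags

def oob_error(data, bags, loss):
    """Average, over training points, of the mean loss of classifiers trained without them."""
    total, counted = 0.0, 0
    for i, (x, y) in enumerate(data):
        losses = [loss(h(x), y) for used, h in bags if i not in used]
        if losses:                          # z_i > 0
            total += sum(losses) / len(losses)
            counted += 1
    return total / counted if counted else float("nan")

def vote_fraction(bags, x, label=+1):
    """Fraction of classifiers in the bag that predict `label` at x."""
    return sum(1 for _, h in bags if h(x) == label) / len(bags)

rng = random.Random(0)
D = [(x, +1 if x >= 10 else -1) for x in range(20)]   # toy data (assumption)
bags = build_bags(D, m=50, train=train_1nn, rng=rng)
zero_one = lambda pred, y: float(pred != y)
err = oob_error(D, bags, zero_one)
```

A vote fraction near $0$ or $1$ indicates a confident ensemble, while a value near $0.5$ indicates high uncertainty.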

- Sample $m$ data sets $D_1,\dots,D_m$ from $D$ with replacement.
- For each $D_j$ train a full decision tree $h_j()$ (max-depth=$\infty$) with one small modification: before each split randomly subsample $k\leq d$ features (without replacement) and only consider these for your split. (This further increases the variance of the trees.)
- The final classifier is $h(\mathbf{x})=\frac{1}{m}\sum_{j=1}^m h_j(\mathbf{x})$.
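The modification to the split step can be sketched in isolation. The code below is not a full tree learner: it only shows how one split might be chosen after subsampling $k$ of the $d$ features without replacement. Gini impurity with binary labels in $\{0,1\}$ is an assumption of this sketch, as are all function names.

```python
import random

def gini(labels):
    """Gini impurity for binary labels in {0, 1}."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == 1) / n
    return 2 * p * (1 - p)

def best_split(X, y, k, rng):
    """Return (feature, threshold, impurity) of the best split among k random features.

    Returns None if none of the sampled features admits a non-trivial split.
    """
    d = len(X[0])
    features = rng.sample(range(d), k)      # k <= d features, without replacement
    best = None
    for f in features:
        for t in sorted(set(row[f] for row in X)):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

rng = random.Random(0)
# Toy data with d = 4 features; only feature 0 separates the labels (assumption).
X = [[i, (i * 7) % 5, i % 3, 1.0] for i in range(12)]
y = [1 if i >= 6 else 0 for i in range(12)]
split = best_split(X, y, k=2, rng=rng)      # k = sqrt(d) = 2
```

A full Random Forest tree would call such a routine recursively on the left and right partitions, drawing a fresh feature subset at every split.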

The Random Forest is one of the best, most popular, and easiest to use out-of-the-box classifiers. There are two reasons for this:

- The RF only has two hyper-parameters, $m$ and $k$. It is extremely *insensitive* to both of these. A good choice for $k$ is $k=\sqrt{d}$ (where $d$ denotes the number of features). You can set $m$ as large as you can afford.
- Decision trees do not require a lot of preprocessing. For example, the features can be of different scale, magnitude, or slope. This can be highly advantageous in scenarios with heterogeneous data, for example in medical settings where features could be things like *blood pressure*, *age*, *gender*, ..., each of which is recorded in completely different units.