Spring 2022

For this, we want \(h_D \to \bar{h}\).

Apply this to classifiers: Assume we have \(m\) training sets \(D_1, D_2, \dots, D_m\) drawn from \(P^n\). Train a classifier on each one and average the results: $$\hat{h} = \frac{1}{m}\sum_{i = 1}^m h_{D_i} \to \bar{h} \qquad \text{as } m \to \infty$$ We refer to such an average of multiple classifiers as an ensemble of classifiers.

Simulate drawing from P by drawing uniformly with replacement from the set D.

That is, let \(Q(X,Y|D)\) be a probability distribution that picks a training sample \((\mathbf{x}_i,y_i)\) from \(D\) uniformly at random. More formally, \(Q((\mathbf{x}_i, y_i)|D) = \frac{1}{n} \qquad\forall (\mathbf{x}_i, y_i)\in D\) with \(n=|D|\).

We sample the set \(D_i\sim Q^n\), i.e. \(|D_i| =n\), and \(D_i\) is picked __with replacement__ from \(Q|D\).
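A minimal sketch of drawing one bootstrap set \(D_i\), assuming \(D\) is stored as a Python list of \((x, y)\) pairs. The helper name `bootstrap_sample`, the toy data, and the fixed seed are illustrative choices, not part of the notes.

```python
import random

def bootstrap_sample(D, rng):
    """Draw |D| points from D uniformly at random, with replacement."""
    n = len(D)
    # Each draw picks every point of D with probability 1/n, as Q does.
    return [D[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)               # fixed seed, only for reproducibility
D = [(x, 2 * x) for x in range(10)]  # toy (x, y) pairs, purely illustrative
D_i = bootstrap_sample(D, rng)

print(len(D_i), len(set(D_i)))       # size n, and how many distinct points were drawn
```

Because sampling is with replacement, \(D_i\) has the same size as \(D\) but typically contains repeated points and misses others.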

Bagged classifier: \(\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}\)

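Notice: \(\hat{h}_D = \frac{1}{m}\sum_{i = 1}^{m}h_{D_i}\nrightarrow \bar{h}\). We cannot use the weak law of large numbers (W.L.L.N.) here, because it only applies to i.i.d. samples and the bootstrapped sets \(D_i\) are not drawn independently from \(P\). However, in practice bagging still reduces variance very effectively.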

Let \(p_i = P(X=x_i)\) denote the probability of drawing \(x_i\) from \(P\). Then:

\begin{equation*}
\begin{aligned}
Q(X=x_i)&= \sum_{k = 1}^{n}\underbrace{{n\choose k}p_i^k(1-p_i)^{n-k}}_{\substack{\text{Probability that there are}\\\text{\(k\) copies of \(x_i\) in \(D\)}}} \underbrace{\frac{k}{n}}_{\substack{\text{Probability of}\\\text{picking one of}\\\text{these copies}}}\\
&=\frac{1}{n}\underbrace{\sum_{k = 1}^{n}{n\choose k}p_i^k(1-p_i)^{n-k}k}_{\substack{\text{Expected value of}\\\text{Binomial Distribution}\\\text{with parameter \(p_i\)}\\\mathbb{E}[\mathbb{B}(p_i,n)]=np_i}}\\
&=\frac{1}{n}np_i\\
&=p_i\leftarrow\underline{TATAAA}\text{!! Each data set \(D_i\) is drawn from \(P\), but not independently.}
\end{aligned}
\end{equation*}
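The identity can also be checked numerically. The sketch below evaluates the binomial sum exactly as written in the derivation for a few arbitrarily chosen values of \(p_i\) and \(n\); the function name `q_prob` and the test values are assumptions made for illustration.

```python
from math import comb

def q_prob(p, n):
    """Sum over k of P(k copies of x_i in D) times P(picking one of them) = k/n."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) * (k / n)
        for k in range(1, n + 1)
    )

# Arbitrary (p_i, n) pairs; the sum should reproduce p_i up to rounding error.
for p, n in [(0.1, 5), (0.3, 20), (0.75, 50)]:
    assert abs(q_prob(p, n) - p) < 1e-9
```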

There is a simple intuitive argument why \(Q(X=x_i)=P(X=x_i)\). So far we assumed that you draw \(D\) from \(P^n\) and then \(Q\) picks a sample from \(D\). However, you don't have to do it in that order. You can also view sampling from \(Q\) in reverse order: Consider that you first use \(Q\) to reserve a "spot" in \(D\), i.e. a number from \(1,\dots,n\), where \(i\) means that you sampled the \(i^{th}\) data point in \(D\). So far you only have the slot, \(i\), and you still need to fill it with a data point \((x_i,y_i)\). You do this by sampling \((x_i,y_i)\) from \(P\). Which slot you picked clearly does not matter, so we have \(Q(X=x)=P(X=x)\).

- Sample \(m\) data sets \(D_1,\dots,D_m\) from \(D\) with replacement.
- For each \(D_j\) train a classifier \(h_j()\)
- The final classifier is \(h(\mathbf{x})=\frac{1}{m}\sum_{j=1}^m h_j(\mathbf{x})\).
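The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base learner `train_1nn` (1-nearest-neighbor regression on a scalar input) is a hypothetical stand-in for any high-variance learner, and the toy data and seed are made up.

```python
import random

def train_1nn(D):
    """Hypothetical base learner: 1-nearest-neighbor regression on scalar x."""
    def h(x):
        # Return the label of the training point closest to x.
        return min(D, key=lambda pt: abs(pt[0] - x))[1]
    return h

def bagging(D, m, train, rng):
    n = len(D)
    models = []
    for _ in range(m):
        # Step 1: sample D_j from D with replacement, |D_j| = n.
        D_j = [D[rng.randrange(n)] for _ in range(n)]
        # Step 2: train a classifier h_j on D_j.
        models.append(train(D_j))
    # Step 3: the final predictor averages the m individual predictions.
    def h(x):
        return sum(h_j(x) for h_j in models) / m
    return h

rng = random.Random(0)  # fixed seed, only for reproducibility
# Toy regression data: y = x^2 plus Gaussian noise (illustrative only).
D = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.1)) for x in range(50)]
h = bagging(D, m=25, train=train_1nn, rng=rng)
print(h(2.5))
```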

- Easy to implement
- Reduces variance, so has a strong beneficial effect on high variance classifiers.
- As the prediction is an average of many classifiers, you obtain a mean score *and variance*. The latter can be interpreted as the uncertainty of the prediction. Especially in regression tasks, such uncertainties are otherwise hard to obtain. For example, imagine the prediction of a house price is \$300,000. If a buyer wants to decide how much to offer, it would be very valuable to know whether this prediction has a standard deviation of \(\pm\)\$10,000 or \(\pm\)\$50,000.
- Bagging provides an __unbiased__ estimate of the test error, which we refer to as the *out-of-bag error*. The idea is that each training point was not picked in all the data sets \(D_k\). If we average the classifiers \(h_k\) of all the data sets that do not contain a given point \((\mathbf{x}_i,y_i)\), we obtain a classifier (with a slightly smaller \(m\)) that was never trained on \((\mathbf{x}_i,y_i)\), so for this classifier the point is equivalent to a test sample. If we compute the error of all these classifiers, we obtain an estimate of the true test error. The beauty is that we can do this without reducing the training set. We just run bagging as it is intended and obtain this so-called out-of-bag error for free.

  More formally, for each training point \((\mathbf{x}_i,y_i)\in D\) let \(S_i=\{k| (\mathbf{x}_i,y_i)\notin D_k\}\); in other words, \(S_i\) is the set of indices of all the training sets \(D_k\) that do not contain \((\mathbf{x}_i,y_i)\). Let the averaged classifier over all these data sets be $$ \tilde h_i(\mathbf{x})=\frac{1}{|S_i|}\sum_{k\in S_i}h_k(\mathbf{x}). $$ The out-of-bag error becomes simply the average error/loss that all these classifiers yield $$ \epsilon_\mathrm{OOB}=\frac{1}{n}\sum_{(\mathbf{x}_i, y_i) \in D}l(\tilde h_i(\mathbf{x}_i),y_i). $$ This is an estimate of the test error, because for each training point we used the subset of classifiers that never saw that training point during training. If \(m\) is sufficiently large, the fact that we take out some classifiers has no significant effect and the estimate is pretty reliable.
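The out-of-bag computation can be sketched as follows. The code tracks, for every bootstrap set, which indices of \(D\) it contains, so that \(S_i\) can be formed per training point. The base learner `train_1nn`, the squared loss, and the toy data are assumptions for illustration. One further assumption: a point that happens to appear in every \(D_k\) has an empty \(S_i\) and is skipped, so the average is taken over the remaining points rather than over all \(n\).

```python
import math
import random

def train_1nn(D):
    """Hypothetical base learner: 1-nearest-neighbor regression on scalar x."""
    def h(x):
        return min(D, key=lambda pt: abs(pt[0] - x))[1]
    return h

def oob_error(D, m, train, loss, rng):
    n = len(D)
    models, in_bag = [], []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]  # indices forming D_k
        models.append(train([D[i] for i in idx]))
        in_bag.append(set(idx))
    total, counted = 0.0, 0
    for i, (x_i, y_i) in enumerate(D):
        # S_i: indices k of the data sets D_k that do not contain point i.
        S_i = [k for k in range(m) if i not in in_bag[k]]
        if not S_i:  # assumption: skip points that landed in every bag
            continue
        # Averaged predictor over the models that never saw point i.
        h_tilde = sum(models[k](x_i) for k in S_i) / len(S_i)
        total += loss(h_tilde, y_i)
        counted += 1
    return total / counted

rng = random.Random(0)  # fixed seed, only for reproducibility
# Toy regression data: y = x plus Gaussian noise (illustrative only).
D = [(x / 10, x / 10 + rng.gauss(0, 0.1)) for x in range(50)]
squared_loss = lambda pred, y: (pred - y) ** 2
err = oob_error(D, m=50, train=train_1nn, loss=squared_loss, rng=rng)
print(err)
```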

The Random Forest algorithm works as follows:

- Sample \(m\) data sets \(D_1,\dots,D_m\) from \(D\) with replacement.
- For each \(D_j\) train a full decision tree \(h_j()\) (max-depth=\(\infty\)) with one small modification: before each split randomly subsample \(k\leq d\) features (without replacement) and only consider these for your split. (This further increases the variance of the trees.)
- The final classifier is \(h(\mathbf{x})=\frac{1}{m}\sum_{j=1}^m h_j(\mathbf{x})\).
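A compact sketch of these steps in plain Python, for illustration only. It assumes labels in \(\{-1,+1\}\), numeric features, and Gini impurity as the split criterion; none of these choices are prescribed by the steps above. All function names and the toy data are hypothetical. As a simplification, a node becomes a majority-vote leaf if all \(k\) sampled features are constant at that node.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, k, rng):
    """Search only k randomly chosen features for the lowest weighted Gini impurity."""
    best = None  # (impurity, feature, threshold)
    for f in rng.sample(range(len(X[0])), k):  # k features, without replacement
        # Dropping the largest value keeps both sides of the split non-empty.
        for t in sorted(set(row[f] for row in X))[:-1]:
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(X, y, k, rng):
    if len(set(y)) == 1:
        return y[0]  # pure leaf
    split = best_split(X, y, k, rng)
    if split is None:  # simplification: sampled features are all constant here
        return Counter(y).most_common(1)[0][0]
    _, f, t = split
    L = [i for i, row in enumerate(X) if row[f] <= t]
    R = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in L], [y[i] for i in L], k, rng),
            build_tree([X[i] for i in R], [y[i] for i in R], k, rng))

def predict_tree(node, x):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are labels
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node

def random_forest(X, y, m, k, rng):
    n = len(X)
    trees = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample D_j
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx], k, rng))
    # Average of the m tree predictions; the sign gives the predicted class.
    return lambda x: sum(predict_tree(tree, x) for tree in trees) / m

rng = random.Random(0)  # fixed seed, only for reproducibility
# Toy data: two Gaussian clusters in d = 4 dimensions (illustrative only).
X = [[rng.gauss(c, 1.0) for _ in range(4)] for c in (-2, 2) for _ in range(30)]
y = [-1] * 30 + [+1] * 30
h = random_forest(X, y, m=20, k=2, rng=rng)  # k = sqrt(d) = 2
print(h([-2, -2, -2, -2]), h([2, 2, 2, 2]))
```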

The Random Forest is one of the best, most popular and easiest to use out-of-the-box classifiers. There are two reasons for this:

- The RF only has two hyper-parameters, \(m\) and \(k\). It is extremely *insensitive* to both of these. A good choice for \(k\) is \(k=\sqrt{d}\) (where \(d\) denotes the number of features). You can set \(m\) as large as you can afford.
- Decision trees do not require a lot of preprocessing. For example, the features can be of different scale, magnitude, or slope. This can be highly advantageous in scenarios with heterogeneous data, for example in medical settings where features could be things like *blood pressure*, *age*, *gender*, ..., each of which is recorded in completely different units.

Useful variants of Random Forests:

- Split each training set into two partitions \(D_l=D_l^A\cup D_l^B\), where \(D_l^A\cap D_l^B=\emptyset\). Build the tree on \(D_l^A\) and estimate the leaf labels on \(D_l^B\). You must stop splitting if a leaf has only a single point in \(D_l^B\) in it. This has the advantage that each tree and also the RF classifier become consistent.
- Do not grow each tree to its full depth; instead, prune based on the left-out samples. This can further improve your bias/variance trade-off.