Video II

As usual, we are given a dataset $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n,y_n)\}$, drawn i.i.d. from some distribution $P(X,Y)$. Throughout this lecture we assume a regression setting, i.e. $y \in \mathbb{R}$. In this lecture we will decompose the generalization error of a classifier into three rather interpretable terms. Before we do that, let us consider that for any given input $\mathbf{x}$ there might not exist a unique label $y$. For example, if your vector $\mathbf{x}$ describes features of house (e.g. #bedrooms, square footage, ...) and the label $y$ its price, you could imagine two houses with identical description selling for different prices. So for any given feature vector $\mathbf{x}$, there is a distribution over possible labels. We therefore define the following, which will come in useful later on:

Alright, so we draw our training set $D$, consisting of $n$ inputs, i.i.d. from the distribution $P$. As a second step we typically call some machine learning algorithm $\mathcal{A}$ on this data set to learn a hypothesis (aka classifier). Formally, we denote this process as $h_D = \mathcal{A}(D)$.

For a given $h_D$, learned on data set $D$ with algorithm $\mathcal{A}$, we can compute the generalization error (as measured in squared loss) as follows:

\[ E_{(\mathbf{x},y) \sim P} \left[ \left(h_D (\mathbf{x}) - y \right)^2 \right] = \int\limits_x \! \! \int\limits_y \left( h_D(\mathbf{x}) - y\right)^2 \Pr(\mathbf{x},y) \partial y \partial \mathbf{x}. \] Note that one can use other loss functions. We use squared loss because it has nice mathematical properties, and it is also the most common loss function.

The previous statement is true for a given training set $D$. However, remember that $D$ itself is drawn from $P^n$, and is therefore a random variable. Further, $h_D$ is a function of $D$, and is therefore also a random variable. And we can of course compute its expectation:

We can also use the fact that $h_D$ is a random variable to compute the expected test error only given $\mathcal{A}$, taking the expectation also over $D$.

We are interested in exactly this expression, because it evaluates the quality of a machine learning algorithm $\mathcal{A}$ with respect to a data distribution $P(X,Y)$. In the following we will show that this expression decomposes into three meaningful terms.

Fig 1: Graphical illustration of bias and variance.

Source: http://scott.fortmann-roe.com/docs/BiasVariance.html

Fig 2: The variation of Bias and Variance with the model complexity. This is similar to the concept of overfitting and underfitting. More complex models overfit while the simplest models underfit.

Source: http://scott.fortmann-roe.com/docs/BiasVariance.html

The graph above plots the training error and the test error and can be divided into two overarching regimes. In the first regime (on the left side of the graph), training error is below the desired error threshold (denoted by $\epsilon$), but test error is significantly higher. In the second regime (on the right side of the graph), test error is remarkably close to training error, but both are above the desired tolerance of $\epsilon$.Figure 3: Test and training error as the number of training instances increases.

In the first regime, the cause of the poor performance is high variance.

**Symptoms**:

- Training error is much lower than test error
- Training error is lower than $\epsilon$
- Test error is above $\epsilon$

**Remedies**:

- Add more training data
- Reduce model complexity -- complex models are prone to high variance
- Bagging (will be covered later in the course)

**Symptoms**:

- Training error is higher than $\epsilon$

**Remedies**:

- Use more complex model (e.g. kernelize, use non-linear models)
- Add features
- Boosting (will be covered later in the course)