Lecture 12: Bias-Variance Tradeoff

As usual, we have a dataset $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$. Each data point and its corresponding label are drawn from an unknown data distribution, i.e., $(\mathbf{x}_i, y_i) \sim \Pr(X, Y)$. Because $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, this is a regression problem. It is important to note that the $(\mathbf{x}_i, y_i)$ pairs are independent and identically distributed (i.i.d.): each draw of $(\mathbf{x}_i, y_i)$ is independent of the previous and next draws, and all draws are made from the same distribution.
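To make the setup concrete, here is a minimal sketch of drawing such an i.i.d. dataset. The distribution is a hypothetical toy choice (one-dimensional $\mathbf{x}$, $y = \sin(2\pi x)$ plus Gaussian noise), not anything specified in the lecture; the same toy setting is reused in the sketches below.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_dataset(n, sigma=0.3):
    """Draw n i.i.d. (x_i, y_i) pairs from a toy P(X, Y)."""
    x = rng.uniform(0.0, 1.0, size=n)                            # x_i ~ P(X)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)   # y_i ~ P(Y | x_i)
    return x, y

x_train, y_train = draw_dataset(n=50)   # one draw of D ~ P^n
```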

Expected Label (given $\mathbf{x} \in \mathbb{R}^d$):
$$\bar{y}(\mathbf{x}) = E_{y|\mathbf{x}}\left[Y\right] = \int_y y \, \Pr(y \mid \mathbf{x}) \, dy.$$
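In the toy setting above the conditional distribution is known, so we can sanity-check the expected label at a single query point by Monte Carlo averaging (again purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Estimate ybar(x) at x = 0.25 by averaging samples of y ~ P(y|x).
x = 0.25
y_samples = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=100_000)
print(np.mean(y_samples))   # ~ sin(pi/2) = 1, the true expected label
```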
If we draw $n$ points from our training distribution, i.e., $D \sim P^n$, we can use a Machine Learning algorithm to learn a classifier (a.k.a. hypothesis). Formally, this means $h_D = \mathcal{A}(D)$. Key points to interpret this phrase (a concrete sketch follows below):
- The algorithm $\mathcal{A}$ maps a training set $D$ to a hypothesis $h_D$.
- Because $D$ is drawn at random, $h_D$ is itself random: a different draw of $D$ yields a different $h_D$.
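A sketch of $h_D = \mathcal{A}(D)$ in the toy setting: here $\mathcal{A}$ is (arbitrarily) least-squares fitting of a degree-3 polynomial, so each draw of $D$ produces a different fitted function $h_D$.

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_dataset(n, sigma=0.3):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)

def A(D):
    """Learning algorithm: least-squares fit of a degree-3 polynomial."""
    x, y = D
    coeffs = np.polyfit(x, y, deg=3)
    return lambda x_new: np.polyval(coeffs, x_new)   # the hypothesis h_D

h_D = A(draw_dataset(50))
print(h_D(0.25))   # retraining on a fresh draw of D changes this prediction
```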

Expected Test Error (given $h_D$):
$$E_{(\mathbf{x},y)\sim P}\left[\left(h_D(\mathbf{x}) - y\right)^2\right] = \int_{\mathbf{x}}\!\int_y \left(h_D(\mathbf{x}) - y\right)^2 \Pr(\mathbf{x}, y) \, dy \, d\mathbf{x}.$$
Note that we could use other loss functions; we use the squared loss here because it has nice mathematical properties and because it is the most common loss function for regression.
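Since $P$ is unknown in practice, this double integral is typically approximated by an average over a held-out sample; in the toy setting we can simply draw a large fresh test set. A sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_dataset(n, sigma=0.3):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)

coeffs = np.polyfit(*draw_dataset(50), deg=3)    # train h_D once
x_test, y_test = draw_dataset(100_000)           # fresh (x, y) ~ P
test_error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"estimated expected test error of h_D: {test_error:.4f}")
```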

Expected Classifier (given $\mathcal{A}$):
$$\bar{h} = E_{D \sim P^n}\left[h_D\right] = \int_D h_D \Pr(D) \, dD,$$
where $\Pr(D)$ is the probability of drawing $D$ from $P^n$. This is a good reminder that $D$ is a random variable, and therefore $h_D$ is also a random variable.
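$\bar{h}$ cannot be computed exactly in practice (we cannot integrate over all datasets), but in the toy setting it can be approximated by averaging the predictions of many classifiers, each trained on an independent draw $D \sim P^n$:

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_dataset(n, sigma=0.3):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)

x_grid = np.linspace(0.0, 1.0, 200)
preds = [np.polyval(np.polyfit(*draw_dataset(50), deg=3), x_grid)
         for _ in range(2000)]          # h_D(x_grid) for 2000 draws of D
h_bar = np.mean(preds, axis=0)          # pointwise average classifier hbar
```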

Expected Test Error (given $\mathcal{A}$):
$$E_{\substack{(\mathbf{x},y)\sim P \\ D\sim P^n}}\left[\left(h_D(\mathbf{x}) - y\right)^2\right] = \int_D\!\int_{\mathbf{x}}\!\int_y \left(h_D(\mathbf{x}) - y\right)^2 \Pr(\mathbf{x}, y) \Pr(D) \, dy \, d\mathbf{x} \, dD.$$
To be clear, $D$ is our set of training points and the $(\mathbf{x}, y)$ pairs are the test points.
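A Monte Carlo sketch of this triple integral: an outer loop over draws of the training set $D$ and an inner average over fresh test points.

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_dataset(n, sigma=0.3):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)

errors = []
for _ in range(500):                               # D ~ P^n
    coeffs = np.polyfit(*draw_dataset(50), deg=3)  # h_D = A(D)
    x_te, y_te = draw_dataset(10_000)              # test points (x, y) ~ P
    errors.append(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))
print(f"estimated expected test error of A: {np.mean(errors):.4f}")
```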

Decomposition of Expected Test Error

$$\begin{aligned}
E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - y\right)^2\right]
&= E_{\mathbf{x},y,D}\left[\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right) + \left(\bar{h}(\mathbf{x}) - y\right)\right]^2\right] \\
&= E_{\mathbf{x},D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^2\right]
+ 2\, E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right]
+ E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - y\right)^2\right]
\end{aligned}$$
The middle term of the above equation is $0$, as we show below. (We can isolate the expectation over $D$ because the test point $(\mathbf{x}, y)$ is drawn independently of the training set $D$, and $\bar{h}(\mathbf{x}) - y$ does not depend on $D$.)
$$\begin{aligned}
E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right]
&= E_{\mathbf{x},y}\left[E_D\left[h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right]\left(\bar{h}(\mathbf{x}) - y\right)\right] \\
&= E_{\mathbf{x},y}\left[\left(E_D\left[h_D(\mathbf{x})\right] - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right] \\
&= E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - \bar{h}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - y\right)\right] \\
&= E_{\mathbf{x},y}\left[0\right] = 0
\end{aligned}$$
Returning to the earlier expression, we are left with the variance and another term:
$$E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - y\right)^2\right]
= \underbrace{E_{\mathbf{x},D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^2\right]}_{\text{Variance}}
+ E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - y\right)^2\right]$$
We can break down the second term in the above equation as follows:
$$E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - y\right)^2\right]
= \underbrace{E_{\mathbf{x},y}\left[\left(\bar{y}(\mathbf{x}) - y\right)^2\right]}_{\text{Noise}}
+ \underbrace{E_{\mathbf{x}}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2\right]}_{\text{Bias}^2}
+ 2\, E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\left(\bar{y}(\mathbf{x}) - y\right)\right]$$
The third term in the equation above is $0$, as we show below:
$$\begin{aligned}
E_{\mathbf{x},y}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\left(\bar{y}(\mathbf{x}) - y\right)\right]
&= E_{\mathbf{x}}\left[E_{y \mid \mathbf{x}}\left[\bar{y}(\mathbf{x}) - y\right]\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\right] \\
&= E_{\mathbf{x}}\left[\left(\bar{y}(\mathbf{x}) - E_{y \mid \mathbf{x}}\left[y\right]\right)\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\right] \\
&= E_{\mathbf{x}}\left[\left(\bar{y}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\right] \\
&= E_{\mathbf{x}}\left[0\right] = 0
\end{aligned}$$
This gives us the decomposition of expected test error:
$$\underbrace{E_{\mathbf{x},y,D}\left[\left(h_D(\mathbf{x}) - y\right)^2\right]}_{\text{Expected Test Error}}
= \underbrace{E_{\mathbf{x},D}\left[\left(h_D(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^2\right]}_{\text{Variance}}
+ \underbrace{E_{\mathbf{x},y}\left[\left(\bar{y}(\mathbf{x}) - y\right)^2\right]}_{\text{Noise}}
+ \underbrace{E_{\mathbf{x}}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^2\right]}_{\text{Bias}^2}$$
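The decomposition can be verified numerically in the toy setting, where $\bar{y}(x) = \sin(2\pi x)$ is known exactly, the noise is $\sigma^2$ by construction, and $\bar{h}$ is approximated by averaging over many training sets. A sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, n, trials = 0.3, 50, 2000

def draw_dataset(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)

x_te = rng.uniform(0.0, 1.0, size=10_000)          # test points x ~ P(X)
y_bar = np.sin(2 * np.pi * x_te)                   # expected label ybar(x)
preds = np.array([np.polyval(np.polyfit(*draw_dataset(n), deg=3), x_te)
                  for _ in range(trials)])         # h_D(x) for many D
h_bar = preds.mean(axis=0)                         # average classifier hbar(x)

variance = np.mean((preds - h_bar) ** 2)   # E_{x,D}[(h_D(x) - hbar(x))^2]
bias_sq  = np.mean((h_bar - y_bar) ** 2)   # E_x[(hbar(x) - ybar(x))^2]
noise    = sigma ** 2                      # E_{x,y}[(ybar(x) - y)^2], exact here
print(f"variance + bias^2 + noise = {variance + bias_sq + noise:.4f}")
# This should closely match the expected test error estimated earlier.
```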
Variance: Captures how much your classifier changes if you train on a different training set, i.e., how far the learned $h_D$ tends to be from the average classifier $\bar{h}$. How "over-specialized" is your classifier to a particular training set (overfitting)?

Bias: What is the inherent error that you obtain from your classifier even with infinite training data? This is due to your classifier being "biased" toward a particular kind of solution (e.g., a linear classifier). In other words, bias is inherent to your model.

Noise: How big is the data-intrinsic noise? This error measures ambiguity due to your data distribution and feature representation. You can never beat this; it is an aspect of the data.
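These three terms can be watched trading off against each other in the toy setting by sweeping the model complexity (here, the polynomial degree), foreshadowing Fig 2 below:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, n, trials = 0.3, 50, 500

def draw_dataset(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)

x_te = rng.uniform(0.0, 1.0, size=5_000)
y_bar = np.sin(2 * np.pi * x_te)
for deg in (1, 3, 9):                       # increasing model complexity
    preds = np.array([np.polyval(np.polyfit(*draw_dataset(n), deg=deg), x_te)
                      for _ in range(trials)])
    h_bar = preds.mean(axis=0)
    print(f"deg={deg}: bias^2={np.mean((h_bar - y_bar) ** 2):.4f}, "
          f"variance={np.mean((preds - h_bar) ** 2):.4f}")
```

As the degree increases, the bias² term shrinks while the variance term grows, which is exactly the tradeoff sketched in Fig 2.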

Fig 1: Graphical illustration of bias and variance.
Source: http://scott.fortmann-roe.com/docs/BiasVariance.html

Fig 2: How bias and variance vary with model complexity. This mirrors the concepts of overfitting and underfitting: overly complex models overfit (high variance), while overly simple models underfit (high bias).
Source: http://scott.fortmann-roe.com/docs/BiasVariance.html