Experimental Design

(How to use your data without cheating)

In experimental machine learning we need data to train (learn) models and to test how good the learned models are. The training data (call it set A) needs to be different (disjoint) from the test data (call it set B). Otherwise we would be testing the learned model on data it had already seen, and we would get a biased estimate of the model's generalization performance.

Most machine learning algorithms require choosing parameter values; very often this is done by setting aside some of the training data to evaluate the quality of different parameter settings. Altogether, we usually end up splitting our data into 3 disjoint sets:

- a training set, used to learn the models;
- a validation set, used to evaluate candidate parameter settings;
- a test set, used only to estimate the final performance.
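
For concreteness, here is a minimal sketch of such a three-way split in Python. The use of scikit-learn, the synthetic data set, and the 60/20/20 proportions are illustrative assumptions, not part of the discussion above.

    # A minimal sketch of a three-way split (assumes scikit-learn is installed).
    # The synthetic data set is only a stand-in for real data.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    # First carve off the test set (B); it is not touched until the final evaluation.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Split the remainder (A) into training and validation sets.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=0)
    # Result: 60% training, 20% validation, 20% test, all disjoint.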

The key thing to remember when planning experiments is that the test data is used to form conclusions but not to make decisions during model building. Basing decisions on test data results is frequently called "cheating" in the machine learning community, and often results in wrong conclusions.

[Rich Caruana succinctly summarizes the above as Machine Learning Law#1: "Because performance on data used for training often looks optimistically good, you should NEVER use test data for any part of learning."]

The distinction between forming conclusions and making training decisions can be subtle and confusing. Hopefully the following examples will clarify this distinction.

EXAMPLE 1:

Suppose we want to compare the performance of multiple algorithms on a given problem (e.g. decision trees, neural nets, and k-nearest neighbor). The question we want answered is which learning method is best. Naturally, we should learn the *best* instantiation of each algorithm that we can and compare those; for example, we want to compare the best decision tree against the best neural net against the best k-nearest neighbor model. Each of the algorithms has one or more parameters we can play with to try to learn a better model. (Sample parameters might include splitting criterion and loss ratio (decision trees); learning rate and early stopping point (neural nets); k and distance metric (k-NN).)

Before we can compare the best models we have to decide which parameter settings produce each of those best models. Because we want to make our final comparison using the test data, the parameter-setting decisions should be based on validation set performance. We should not use the training set for this, because the resulting performance estimates would be biased (the models have already seen this data). Using the test set would be worse, because it would invalidate any conclusion we might form from the comparison.

Once we have the best decision tree, best neural net, and best k-NN (all determined by their performance on validation data), we can compare their performance on the test data and form a conclusion about which is better for the given problem.
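
The following Python sketch illustrates this procedure, continuing with the split variables from the earlier sketch. The particular parameter grids (splitting criterion, learning rate, k) are small illustrative choices, not recommendations for any real problem.

    # Sketch of Example 1 (continues from the split sketch above, which defines
    # X_train, y_train, X_val, y_val, X_test, y_test). Assumes scikit-learn.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.neighbors import KNeighborsClassifier

    candidates = {
        "decision tree": [DecisionTreeClassifier(criterion=c, random_state=0)
                          for c in ("gini", "entropy")],
        "neural net": [MLPClassifier(learning_rate_init=lr, max_iter=500,
                                     random_state=0)
                       for lr in (0.001, 0.01)],
        "k-NN": [KNeighborsClassifier(n_neighbors=k) for k in (1, 5, 15)],
    }

    best = {}
    for name, models in candidates.items():
        # Decision: pick each algorithm's parameters by VALIDATION performance.
        scored = [(m.fit(X_train, y_train).score(X_val, y_val), m) for m in models]
        best[name] = max(scored, key=lambda pair: pair[0])[1]

    # Conclusion: compare the chosen models once, on the TEST data.
    for name, model in best.items():
        print(name, model.score(X_test, y_test))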

EXAMPLE 2:

Suppose we have developed a new machine learning algorithm that we will call Shotgun. Since Shotgun is new, we are curious how it behaves under different values of its parameter P. We take a data set and divide it into training, validation, and test sets. We use the training data to learn a Shotgun model; the validation data is used for choosing all the parameter values except for P; the test data is used to measure Shotgun's performance when P=0, P=0.1, P=0.2, ..., P=1. We can now analyze the results on the test data and draw conclusions about how P affects Shotgun's performance.
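
A sketch of this sweep is below. Since Shotgun is a made-up algorithm, the code substitutes a real parameter (the alpha regularization term of scikit-learn's MLPClassifier) as a stand-in for P; the point is the pattern of using the test data to form the conclusion about the parameter, not the particular model.

    # Sketch of the P sweep in Example 2 (continues from the split sketch above).
    # Shotgun is fictional; MLPClassifier's alpha is used as a stand-in for P.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    for P in np.linspace(0.0, 1.0, 11):          # P = 0, 0.1, ..., 1
        model = MLPClassifier(alpha=P, max_iter=500, random_state=0)
        model.fit(X_train, y_train)
        # The TEST data is used here to form the conclusion about how P behaves.
        print("P=%.1f  test accuracy=%.3f" % (P, model.score(X_test, y_test)))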

Continuing on, suppose that the best P value is 0.9. If we now want to compare Shotgun to some other algorithm, say decision trees, we would want to set P=0.9 so that we are comparing best against best. However, we have made a training decision (setting P=0.9) based on results on the test data, so we can no longer use that data to do a fair comparison between decision trees and Shotgun. If we want to continue using the same problem and the same test data, to be fair we would instead need to choose the P value based on validation set performance.

On the other hand, we can keep P=0.9 if we are going to use different data for the comparison. Perhaps we have some additional data set aside for this problem (a final-final test set, call it B2), or we can do the comparison using a different problem altogether.
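
If we do have to reuse the original test set, the fair version of the comparison chooses P on the validation data and touches the test data only once, as in the sketch below (again using MLPClassifier's alpha as a stand-in for Shotgun's P).

    # Fair comparison on the ORIGINAL test set: P is chosen on validation data.
    # Continues from the split sketch above; same stand-in for Shotgun.
    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    val_scores = {}
    for P in np.linspace(0.0, 1.0, 11):
        m = MLPClassifier(alpha=P, max_iter=500, random_state=0)
        val_scores[P] = m.fit(X_train, y_train).score(X_val, y_val)
    best_P = max(val_scores, key=val_scores.get)   # decision made on validation data

    shotgun = MLPClassifier(alpha=best_P, max_iter=500, random_state=0)
    tree = DecisionTreeClassifier(random_state=0)

    # One final look at the test data to form the conclusion.
    print("Shotgun (stand-in):", shotgun.fit(X_train, y_train).score(X_test, y_test))
    print("decision tree:", tree.fit(X_train, y_train).score(X_test, y_test))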

To summarize, a conclusion from a previous experiment can only be used to make training decisions when the test data for the new experiment is different from the previous experiment's test data.


History:
2006-10-13 Initial draft by Art Munson