An Empirical Characterization of Co-training Stopping Conditions
Steven Baker
Assoc. Prof. Claire Cardie

Co-training is a machine learning method in which standard supervised learning models are bootstrapped using a small amount of labeled training data and a large amount of unlabeled training data. Specifically, two classifiers are trained in parallel on the initial labeled data. Both classifiers then label unlabeled data, and a subset of the newly labeled examples is selected to be added back into the training corpus as new labeled data. This algorithm has been proven to work under certain independence assumptions about the two classifiers, and empirical studies have shown that co-training can be effective on a variety of learning tasks in natural language processing.
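
For concreteness, the basic loop can be sketched as follows. This is a minimal illustration assuming scikit-learn-style classifiers and two pre-computed feature views of each example; the base learner, growth size, and number of iterations are arbitrary choices for exposition, not the configuration used in our experiments.

```python
# Minimal co-training sketch (illustrative parameters, not our experimental setup).
import numpy as np
from sklearn.naive_bayes import MultinomialNB


def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             n_iterations=30, growth_size=5):
    """Bootstrap two classifiers from a small labeled set and a large
    unlabeled pool, in the style of the standard co-training algorithm."""
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    unlabeled = list(range(len(X1_unlab)))

    for _ in range(n_iterations):
        if not unlabeled:
            break
        # Train each view's classifier on the current labeled set.
        clf1.fit(X1_lab, y_lab)
        clf2.fit(X2_lab, y_lab)

        # Each classifier labels the unlabeled pool; keep its most
        # confident predictions as new "labeled" examples.
        new_idx, new_labels, seen = [], [], set()
        for clf, X_unlab in ((clf1, X1_unlab), (clf2, X2_unlab)):
            probs = clf.predict_proba(X_unlab[unlabeled])
            confident = np.argsort(probs.max(axis=1))[-growth_size:]
            for c in confident:
                idx = unlabeled[c]
                if idx in seen:
                    continue
                seen.add(idx)
                new_idx.append(idx)
                new_labels.append(clf.classes_[probs[c].argmax()])

        # Move the selected examples from the unlabeled pool into the
        # labeled training data for both views.
        X1_lab = np.vstack([X1_lab, X1_unlab[new_idx]])
        X2_lab = np.vstack([X2_lab, X2_unlab[new_idx]])
        y_lab = np.concatenate([y_lab, new_labels])
        unlabeled = [i for i in unlabeled if i not in seen]

    return clf1, clf2
```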

Co-training is thought to be most useful when only a small amount of labeled data is available. Unfortunately, the learning curve during co-training is often difficult to predict. For example, in some of the experiments we conducted, classifier performance improves for a certain number of iterations and then decays. Examples of this phenomenon, which depends in part on the experimental parameters, are shown later. Because of this unpredictability, some portion of the data has typically been held out to estimate performance over the course of co-training iterations so that the peak can be pinpointed. Unfortunately, this held-out set may need to be substantial in size to measure performance accurately, which further reduces the amount of labeled data available to initialize co-training. As a result, in practice co-training might not yield better performance than simply training a single classifier on all of the labeled data with fully supervised methods.
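
The monitoring procedure described above can be sketched as follows: evaluate the classifiers on a reserved held-out set after every co-training iteration and keep the iteration at which performance peaks. The accuracy metric and the averaging of the two views' probabilities are illustrative assumptions here, not a claim about any particular published setup.

```python
# Sketch of held-out monitoring: pick the co-training iteration with the
# best validation accuracy (illustrative combination scheme and metric).
from sklearn.metrics import accuracy_score


def pick_peak_iteration(snapshots, X1_val, X2_val, y_val):
    """Given per-iteration (clf1, clf2) snapshots, return the iteration
    with the best held-out accuracy."""
    best_iter, best_acc = -1, -1.0
    for i, (clf1, clf2) in enumerate(snapshots):
        # Combine the two views by averaging their class probabilities.
        probs = (clf1.predict_proba(X1_val) + clf2.predict_proba(X2_val)) / 2.0
        preds = clf1.classes_[probs.argmax(axis=1)]
        acc = accuracy_score(y_val, preds)
        if acc > best_acc:
            best_iter, best_acc = i, acc
    return best_iter, best_acc
```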

This research explores ways around the problem of needing to pinpoint performance maxima, and thus of requiring a held-out test set. If no held-out set is needed (or only a very small one), and if co-training boosts performance on a given task, then co-training has a clear advantage over training a single model with no unlabeled data.