Topics in Machine Learning
CS778, Spring 2003, Cornell University

Course Description    Handouts    Data Sets     ML Links


  • Covtype data sets now available for download (80meg): cs778.covtype.class1.tar
  • Code for computing performance measures: perf2.c (perf.testdata)
  • Class now meets one day a week in Upson 5130, Fridays 1:30-3:00!
  • Feel free to attend the initial classes even if you may not take the course.
  • We are coordinating with CS674 Natural Language Processing (Cardie) and CS678 Advanced Topics in Machine Learning (Joachims) so that projects in these courses may overlap with the project in CS778.
  • CS778 is listed as 4 credits, but may be taken for fewer credits, or even as a 490 or 790 independent study.  Please see the instructor if you want to adjust the number of credits or take the course as independent research.

New Time: Friday 1:30pm-3:00pm

New Place: Upson 5130

Instructor: Rich Caruana

Office: Upson 4157

Office Hours: Tue 4:30-5:30, Wed 2:30-3:30

Course Description

In this course we will compare the performance of different machine learning algorithms on a variety of test problems.  The goal is to determine which learning methods work best on each problem.  The learning methods we might use include:

  • Support Vector Machines (SVMs)

  • Artificial Neural Nets (ANNs)

  • Nearest Neighbor Methods (e.g., kNN)

  • Decision Trees (DTs)

  • Splines (e.g., MARS: Multivariate Adaptive Regression Splines)

  • Logistic Regression

  • Rule Learning

We will try to run each algorithm nearly optimally by tuning its parameters.  We will measure performance using a variety of performance measures such as accuracy, squared error, precision and recall, and ROC area.  We will use sound statistical tests to analyze the results.  In other words, we are going to try to do the comparisons as thoroughly as we can.
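For two-class problems, the measures above are straightforward to compute directly from a model's predicted probabilities.  The sketch below is illustrative only (it is not the perf2.c code linked above); the function names and the 0.5 threshold are assumptions:

```python
# Illustrative sketch of the performance measures named above, for a
# two-class problem.  `probs` are predicted P(class=1), `labels` are
# the true 0/1 labels.  The 0.5 decision threshold is an assumption.

def metrics(probs, labels, threshold=0.5):
    """Return (accuracy, mean squared error, precision, recall)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    acc = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)
    sqerr = sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return acc, sqerr, prec, rec

def roc_area(probs, labels):
    """ROC area via the rank-sum view: the probability that a random
    positive case is scored higher than a random negative case."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that squared error and ROC area use the raw probabilities, while accuracy, precision, and recall depend on the chosen threshold — one reason a comparison can come out differently under different measures.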

There are many issues that arise when doing this kind of empirical research.  Some of these are:

  • How do we optimize each learning method?

  • What data sets should we use?

  • What do we do with missing values?  Some methods, such as decision trees, handle missing values easily, but most learning methods do not.  How do we do a fair comparison between methods on data sets that have missing values?

  • What performance measures should we use to compare methods?

  • Some of the better performance measures such as ROC are only defined for two-class problems.  Also, some learning methods such as SVMs are best suited to two-class problems.  What do we do with problems that have more than two classes?

  • Some data sets are large enough that we can hold out a large final test set.  Many are not.  How do we structure cross-validation so that we can optimize each method and still do a final comparison between methods?

  • What statistical tests should we use to compare methods?

  • Should we use bagging or boosting?

  • What do we do if a data set is too big or the experiments too costly for some of the learning methods?

  • ???

There will be a half dozen lectures and papers to read to bring us all up to speed, but most of the classes will be more like group meetings than like lectures.  We'll use off-the-shelf code for most learning methods so that we don't have to implement everything from scratch.  

If all goes well, we'll publish a group paper on the results with each of us as co-authors.

This is a 700-level course.  You should not take this course if you do not already have some experience in machine learning (e.g., CS478, CS578, or equivalent) or statistical modeling. Please contact the instructor if you aren't sure that your background is adequate.

Textbooks:  There are no required textbooks.  The following texts might prove useful:

  • Machine Learning by Tom Mitchell
  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, & Jerome Friedman
  • Pattern Classification 2nd edition by Richard Duda, Peter Hart, & David Stork


Data Sets


ML Links
