Advanced Topics in Machine Learning

CS678 - Spring 2003
Cornell University
Department of Computer Science

 
Time and Place
First lecture: January 21st, 2003
Last lecture: May 1st, 2003
  • Tuesday, 1:25pm - 2:40pm in Hollister Hall 314
  • Thursday, 1:25pm - 2:40pm in Hollister Hall 314

Exam: April 1st, 2003 (in class)

Instructor
Thorsten Joachims, tj@cs.cornell.edu, 4153 Upson Hall, office hour Wednesday 3:30-4:15
Syllabus
This 4-credit course extends and complements CS478 and CS578. The goal of the course is twofold. The first part of the course is an in-depth introduction to advanced learning algorithms in the area of Kernel Machines, in particular Support Vector Machines and other margin-based learning methods such as Boosting. It also includes an introduction to the relevant aspects of machine learning theory, enabling you to understand current work in the field. This provides the basis for the second part of the course, which discusses current research topics in machine learning and provides starting points for novel research. In particular, the course will cover the following main topics:

Part 1:

  • Support Vector Machines and Related Methods: Perceptron, optimal hyperplane and maximum-margin separation, soft-margin, SVMs for regression, Gaussian Processes, Boosting, regularized regression methods (a short perceptron sketch follows this list)
  • Learning with Kernels: properties, real-valued feature vectors, sequences and other structured data, Fisher kernels
  • Statistical Learning Theory: no free lunch, VC theory, PAC-Bayesian, bias/variance, error bounds, leave-one-out bounds 
  • Error Estimation and Model Selection: leave-one-out and cross-validation, holdout testing, bootstrap estimation 
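
As a concrete illustration of the first topic above, the following is a minimal sketch of the classic perceptron in Python/numpy. It is not course code; the toy data set and the epoch limit are illustrative assumptions. The bias is folded into the weight vector via a constant feature, and the standard mistake-driven update is applied until a full pass over the data makes no mistakes.

import numpy as np

def perceptron(X, y, epochs=100):
    """Return a weight vector w (bias folded in) such that sign(w . x) fits the data, if separable."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant feature acting as the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:  # misclassified (or exactly on the boundary)
                w += y_i * x_i             # classic perceptron update
                mistakes += 1
        if mistakes == 0:                  # a full pass with no mistakes: converged
            break
    return w

# Toy data: two points per class, linearly separable.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print("weights:", w)
print("predictions:", np.sign(X @ w[:-1] + w[-1]))

Unlike the perceptron, which stops at whichever separating hyperplane it reaches first, the optimal hyperplane covered in Part 1 is the particular separating hyperplane that maximizes the margin to the closest training points.
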
Part 2:
  • Transductive Learning: How can one use unlabeled data to improve performance in supervised learning? What is the information contained in unlabeled data? What assumptions do we need to make? How can we design efficient algorithms?
  • Learning Complex Structures: What if the target function is more complex than in classification or regression? For example, the goal could be not a binary classification function, but an ordering (i.e. retrieval) function for information retrieval. Or, what if the input to the learning is not a classification, but merely pair-wise preferences like "A is preferred over B"?
  • Learning Kernels: The kernel defines the inductive bias of the learning algorithm and is key to achieving good performance. This makes selecting a kernel one of the most crucial design decisions. How can we automate the selection process? In particular, how can one construct a good kernel from data? What are the situations where this might work? What are the assumptions? (a short kernel-alignment sketch follows the syllabus overview below)

Methods and theory will be illustrated with practical examples, in particular from the areas of information retrieval and language technology.
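
As one such illustration, the following is a minimal sketch in Python/numpy of the kernel-target alignment idea that appears under "Learning Kernels" above (and in the Cristianini et al. reading below): it scores how well a Gram matrix K matches the ideal kernel y y^T built from the labels. The RBF kernel, its width gamma, and the toy data are illustrative assumptions, not course code.

import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix with entries K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

def alignment(K1, K2):
    """Frobenius inner-product alignment <K1, K2>_F / (||K1||_F * ||K2||_F)."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

# Toy data: two tight clusters, one per class.
X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.1, 2.9]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = rbf_kernel(X, gamma=0.5)
print("alignment with the ideal kernel y y^T:", alignment(K, np.outer(y, y)))

A kernel whose Gram matrix aligns well with y y^T groups same-label points together, which is one heuristic for choosing or constructing a kernel from data.
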

Lecture Notes, Slides, and Handouts
Lecture notes and slides will be handed out in class.
Homework Assignments
Homework 1: Perceptron and Optimal Hyperplanes

Homework 2: Training SVMs and Leave-One-Out

Homework 3: Kernels and Statistical Learning Theory

Reference Material
We will provide reading material and hand it out in class; it covers all material presented in this course. For further reading, we recommend the following books, each of which covers part of the syllabus:
  • Mitchell, "Machine Learning"
  • Devroye, Györfi, Lugosi, "A Probabilistic Theory of Pattern Recognition"
  • Shawe-Taylor, Cristianini, "Introduction to Support Vector Machines"
  • Schölkopf, Smola, "Learning with Kernels"
  • Herbrich, "Learning Kernel Classifiers"
  • Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning"
  • Duda, Hart, Stork, "Pattern Classification"
  • Vapnik, "Statistical Learning Theory"
Readings
We will read some of the following papers in the second half of the course:

Learning Rankings:

  1. William W. Cohen, Robert E. Schapire, Yoram Singer, Learning to order things, Journal of Artificial Intelligence Research, 10, 1999. (Steven, 4/15)
  2. Y. Freund, R. Iyer, R. Schapire, and Y. Singer, An efficient boosting algorithm for combining preferences, ICML, 1998. (Scott, 4/17)
  3. T. Joachims, Optimizing Search Engines using Clickthrough Data, KDD Conference, 2002. (Thorsten, 4/10)
  4. R. Herbrich, T. Graepel, and K. Obermayer. Large Margin Rank Boundaries for Ordinal Regression. Advances in Large Margin Classifiers, pages 115-132, 2000. (Thorsten, 4/8)
  5. R. Caruana, S. Baluja, and T. Mitchell, Using the Future to 'Sort Out' the Present: Rankprop and Multitask Learning for Medical Risk Evaluation, NIPS, 1995. (Rich, 4/10)

Transductive Learning / Learning from Labeled and Unlabeled Data:

  1. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3). pp. 103-134. 2000. (Mark, 4/22)
  2. T. Joachims, Transductive Inference for Text Classification using Support Vector Machines. ICML, 1999. (Thorsten, 4/17)
  3. A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-Training, COLT, 1998. (Andy, 4/24)
  4. O. Chapelle, J. Weston and B. Schölkopf, Cluster Kernels for Semi-Supervised Learning. NIPS, 2003. (Phil, 4/22)
  5. M. Szummer and T. Jaakkola, Partially labeled classification with Markov random walks, NIPS, 2001. (Filip, 4/22)
  6. A. Blum, S. Chawla, Learning from Labeled and Unlabeled Data using Graph Mincuts. ICML, 2001. (Alan, 4/24)

Learning to Learn / Learning Kernels:

  1. R. Caruana, Multitask Learning. Machine Learning 28(1): 41-75, 1997. (Rich, 4/29)
  2. Sebastian Thrun and Joseph O'Sullivan, Discovering Structure in Multiple Learning Tasks: The TC Algorithm, ICML, 1996. (Stefan, 4/29)
  3. N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor, On Kernel Target Alignment, JMLR. (Thorsten, 5/1)
  4. T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers, NIPS, 1998. (Joshua, 5/1)

Other topics:

  1. B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Technical Report 99-87, Microsoft Research, 1999. To appear in Neural Computation, 2001.
    and  
    Ben-Hur et al., Support Vector Clustering. JMLR, 2, 2001.
  2. A. J. Smola and B. Schölkopf. A tutorial on support vector regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998. To appear in Statistics and Computing, 2001. (pages 1-14 only) (Jingbo, 4/24)
  3. B. Schölkopf, A. Smola, K. Müller, Kernel Principal Component Analysis, in: B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1999, pages 327-352. Short version or chapter in Support Vector Learning for background. (Liviu, 4/17)
  4. John Platt, Large-Margin DAGs for Multi-Class Classification, NIPS 2000. (Alex, 4/15)
Prerequisites
Any of the following:
  • CS478
  • CS578
  • equivalent of any of the above
  • permission from the instructor
Grading
Grades will be determined based on a written exam after part 1, a final research project, homework assignments, and student presentations of selected papers.
  • 25%: Exam after Part 1
  • 30%: Final Project
  • 25%: Homework (at most 3 assignments; some programming, some non-programming)
  • 10%: Student Paper Presentation
  • 10%: Class Participation

Roughly: A=90-100; B=80-90; C=70-80; D=60-70; F=below 60