COM-S 678 Spring 2007

What:   Advanced Topics in Empirical Machine Learning
When:  MWF 1:25pm-2:15pm
Where: Upson 111
Who:    Rich Caruana
Why:    Time to write a textbook on empirical machine learning

Data Sets Spreadsheet

Lectures:

Jan 22: Administrivia and Introduction (Caruana)
Jan 24: Empirical Comparison of Learning Methods (Caruana) (slides)
Jan 26: Caruana & Niculescu-Mizil: Empirical Comparison of Learning Methods (Caruana)
Jan 29: Niculescu-Mizil & Caruana: Predicting Good Probabilities with Supervised Learning (Caruana)
Jan 31: Platt: Probabilistic Outputs for SVMs and Comparison to Regularized Likelihood Methods (Nikos Karampatziakis) (slides)
Feb 02: Data Sets and Learning Methods for High-D Empirical Study
Feb 05: Drish: Obtaining Calibrated Probability Estimates from SVMs (Amit Belani) (slides)
Feb 07: Zadrozny & Elkan: Transforming Classifier Scores into Accurate Multiclass Probability Estimates (Myle Ott) (slides)
Feb 09: Data Sets and Learning Methods for High-D Empirical Study
Feb 12: Provost & Fawcett: Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions (Ramazan Bitirgen) (slides)
Feb 14: classes canceled due to snow
Feb 16: Fawcett & Niculescu-Mizil: Technical Note: PAV and the ROC Convex Hull (Lars Backstrom)
Feb 19: Empirical Comparison of Learning Methods (Caruana) (same slides as above)
Feb 21: Empirical Comparison of Learning Methods (Caruana) (same slides as above)
Feb 23: Class Project
Feb 26: Margineantu, D. D. and Dietterich, T. G. (2002):  Improved class probability estimates from decision tree models (Michael Friedman)
Feb 28: Dietterich, T. G., (1998): Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10 (7) 1895-1924. Postscript preprint. (Revised December 30, 1997).
Mar 02: Class Project
Mar 05: Model Compression (Caruana)
Mar 07: Data Mining in Metric Space (Caruana)
Mar 09: Statlog
Mar 12: Statlog
Mar 14: Lowd & Domingos: Naive Bayes Models for Probability Estimation (Peter Majek)
Mar 16: Results of Class Project
Mar 19: Spring Break
Mar 21: Spring Break
Mar 23: Spring Break
Mar 26: George Forman: An Extensive Empirical Study of Feature Selection Metrics for Text Classification JMLR 3(Mar):1289-1305, 2003  (Artit)
Mar 28: Saul & Roweis: An Introduction to Locally Linear Embedding (Sergei Fotin)
Mar 30: PCA Tutorial: http://www.dgp.toronto.edu/~aranjan/tuts/pca.pdf  (Ainura)
Apr 02: Tatti: Distances between Data Sets Based on Summary Statistics (Nam Nguyen)
Apr 04: Kari Torkkola: Feature Extraction by Non-Parametric Mutual Information Maximization JMLR 3(Mar):1415-1438, 2003  (Chun-Nam)
Apr 06: Zhou, Foster, Stine, Ungar: Streaming Feature Selection Using Alpha-Investing  KDD 2005  (Amit Belani)
Apr 09: Tishby, Pereira, Bialek: The Information Bottleneck Method, Conference on Communication, Control, and Computing 1999  (Fan Yanga)
Apr 11: Friedman, Hastie, Tibshirani: Additive Logistic Regression: A Statistical View of Boosting, Annals of Statistics (2000) www.cse.psu.edu/~zha/CSE598/paper1.pdf  (Daria Sorokina)  NOTE: It's a long paper, and you are reading it on short notice, so read the intro, skim the paper, and be sure to look at the interesting discussion at the end. You might also want to look at The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, chapters 10.1-10.6 and 10.9-10.13.
Apr 13: project discussion
Apr 16: project discussion
Apr 18: project discussion
Apr 20: Hyvarinen & Oja: Independent Component Analysis: Algorithms and Applications, sections 1, 2, 3, 4.1, 4.2 (skim 4.2.1), 4.3, 7.1 (Art Munson)
Apr 23: Galbraith & van Norden: The Resolution and Calibration of Probabilistic Economic Forecasts (Myle Ott)
Apr 25: 5-minute project summaries
Apr 27: Breiman: Prediction Games and Arcing Algorithms, and Reyzin & Schapire: How Boosting the Margin Can Also Boost Classifier Complexity (Nikos Karampatziakis)
Apr 30:
May 02:

Slides & Handouts:

Survey of Empirical Methods (empirical.caruana.678.07.pdf)

Prerequisites: 478, or 578, or equivalent prior experience in machine learning.

Topics include:

Data Sets Desiderata: