CS 678

COM-S 678 Spring 2007

What:   Advanced Topics in Empirical Machine Learning
When: MWF 1:25pm-2:15pm
Where: Upson 111
Who:    Rich Caruana
Why:    Time to write a textbook on empirical machine learning

Data Sets Spreadsheet

add a few more potential data sets
fill in missing fields
preliminary sorting or classification by preference

Lectures:

Jan 22: Administrivia and Introduction (Caruana)
Jan 24: Empirical Comparison of Learning Methods (Caruana) (slides)
Jan 26: Caruana & Niculescu-Mizil: Empirical Comparison of Learning Methods (Caruana)
Jan 29: Niculescu-Mizil & Caruana: Predicting Good Probabilities with Supervised Learning (Caruana)
Jan 31: Platt: Probabilistic Outputs for SVMs and Comparison to Regularized Likelihood Methods (Nikos Karampatziakis) (slides)
Feb 02: Data Sets and Learning Methods for High-D Empirical Study
Feb 05: Drish: Obtaining Calibrated Probability Estimates from SVMs (Amit Belani) (slides)
Feb 07: Zadrozny & Elkan: Transforming Classifier Scores into Accurate Multiclass Probability Estimates (Myle Ott) (slides)
Feb 09: Data Sets and Learning Methods for High-D Empirical Study
Feb 12: Provost & Fawcett: Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions (Ramazan Bitirgen) (slides)
Feb 14: classes canceled due to snow
Feb 16: Fawcett & Niculescu-Mizil: Technical Note: PAV and the ROC Convex Hull (Lars Backstrom)
Feb 19: Empirical Comparison of Learning Methods (Caruana) (same slides as above)
Feb 21: Empirical Comparison of Learning Methods (Caruana) (same slides as above)
Feb 23: Class Project
Feb 26: Margineantu, D. D. and Dietterich, T. G. (2002): Improved class probability estimates from decision tree models (Michael Friedman)
Feb 28: Dietterich, T. G. (1998): Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10 (7) 1895-1924. Postscript preprint. (Revised December 30, 1997).
Mar 02: Class Project
Mar 05: Model Compression (Caruana)
Mar 07: Data Mining in Metric Space (Caruana)
Mar 09: Statlog
Mar 12: Statlog
Mar 14: Lowd & Domingos: Naive Bayes Models for Probability Estimation (Peter Majek)
Mar 16: Results of Class Project
Mar 19: Spring Break
Mar 21: Spring Break
Mar 23: Spring Break
Mar 26: George Forman: An Extensive Empirical Study of Feature Selection Metrics for Text Classification JMLR 3(Mar):1289-1305, 2003 (Artit)
Mar 28: Saul & Roweis: An Introduction to Locally Linear Embedding (Sergei Fotin)
Mar 30: PCA Tutorial: http://www.dgp.toronto.edu/~aranjan/tuts/pca.pdf (Ainura)
Apr 02: Tatti: Distances between Data Sets Based on Summary Statistics (Nam Nguyen)
Apr 04: Kari Torkkola: Feature Extraction by Non-Parametric Mutual Information Maximization JMLR 3(Mar):1415-1438, 2003 (Chun-Nam)
Apr 06: Zhou, Foster, Stine, Ungar: Streaming Feature Selection Using Alpha-Investing KDD 2005 (Amit Belani)
Apr 09: Tishby, Pereira, Bialek: The Information Bottleneck Method, Conference on Communication, Control, and Computing 1999 (Fan Yanga)
Apr 11: Friedman, Hastie, Tibshirani: Additive Logistic Regression: A Statistical View of Boosting Annals of Statistics (2000) www.cse.psu.edu/~zha/CSE598/paper1.pdf (Daria Sorokina) NOTE: It's a long paper, and you are reading it on short notice, so read the intro, skim the paper, and be sure to take a look at the interesting discussion at the end. You might also want to look at chapters 10.1-10.6 and 10.9-10.13 of the text The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.
Apr 13: project discussion
Apr 16: project discussion
Apr 18: project discussion
Apr 20: Hyvarinen & Oja: Independent Component Analysis: Algorithms and Applications sections 1, 2, 3, 4.1, 4.2 (skim 4.2.1), 4.3, 7.1 (Art Munson)
Apr 23: Galbraith & van Norden: The Resolution and Calibration of Probabilistic Economic Forecasts (Myle Ott)
Apr 25: 5-minute project summaries
Apr 27: Breiman: Prediction Games and Arcing Algorithms; Reyzin & Schapire: How Boosting the Margin Can Also Boost Classifier Complexity (Nikos Karampatziakis)
Apr 30:
May 02:

Slides & Handouts:

Survey of Empirical Methods (empirical.caruana.678.07.pdf)

Prerequisites: 478, or 578, or equivalent prior experience in machine learning.

Topics include:

Statistical Testing
Bootstrap, Jackknife, ...
Previous Significant Empirical Comparisons: Statlog, ...
Performance Measures
e.g., does training an SVM to optimize ordering really work?
Concept Drift
Explanation
Space vs. Time vs. Accuracy Tradeoffs
Predicting Probabilities, Calibration, ...
Beyond Binary Classification: multiclass SVMs, ...
Discriminative vs. Generative Models and Training
Decomposition of Squared Error into Bias, Calibration, and Refinement
Model Selection: AIC, BIC, cross validation, ...
Boosting and Additive Models
Curse of Dimensionality
Dimensionality Reduction (e.g., Information Bottleneck, PCA, ICA, ...)
???
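
As a rough illustration of the bootstrap topic listed above, here is a minimal sketch of a percentile bootstrap confidence interval for test-set accuracy. It is not taken from the course materials: the function name bootstrap_accuracy_ci, its parameters, and the defaults (1000 resamples, 95% interval) are assumptions chosen for the example.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    # 1 where the prediction is right, 0 where it is wrong
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]

    resampled = []
    for _ in range(n_resamples):
        # draw n test cases with replacement and recompute accuracy
        total = sum(correct[rng.randrange(n)] for _ in range(n))
        resampled.append(total / n)
    resampled.sort()

    # percentile interval: cut alpha/2 of the mass from each tail
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper


if __name__ == "__main__":
    y_true = [1] * 80 + [0] * 20
    y_pred = [1] * 100  # a classifier that always predicts the positive class
    print(bootstrap_accuracy_ci(y_true, y_pred))
```

The percentile interval is only one of several bootstrap intervals. It is used here because it is the simplest to state, not because it is the best choice for small test sets.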

Data Sets Desiderata:

high dimension: > 500 dimensions
large enough: 10k or more samples preferable
not too skewed (but could resample)
suitable for binary classification (may need to be transformed)
not all bags of words! images? ...?
local expertise?
different attribute types for different data sets
missing values?
readily available
used in previous data mining competitions so we have results to compare to?
used in papers so we have results to compare to?
???
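
The quantitative desiderata above could be screened automatically when filling in the data sets spreadsheet. The sketch below is only an illustration: the function name check_desiderata is hypothetical, and the skew threshold (majority class at most 90% of samples) is an assumption, since the list says only "not too skewed".

```python
from collections import Counter


def check_desiderata(n_samples, n_features, labels,
                     min_features=500, min_samples=10_000,
                     max_majority_frac=0.9):
    """Check a candidate data set against the measurable desiderata.

    Hypothetical helper. max_majority_frac is an assumed threshold;
    the list above does not define how skewed is too skewed.
    """
    counts = Counter(labels)
    majority_frac = max(counts.values()) / len(labels) if labels else 0.0
    return {
        "high_dimension": n_features > min_features,    # > 500 dimensions
        "large_enough": n_samples >= min_samples,       # 10k or more samples
        "binary": len(counts) == 2,                     # two classes as given
        "not_too_skewed": majority_frac <= max_majority_frac,
    }


if __name__ == "__main__":
    labels = [0] * 6000 + [1] * 6000
    print(check_desiderata(n_samples=12000, n_features=600, labels=labels))
```

A data set that fails the "binary" or "not_too_skewed" checks is not necessarily ruled out, since the list notes that problems may be transformed to binary and skewed data could be resampled.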