
CS6780 - Advanced Machine Learning

Spring 2019
Prof. Thorsten Joachims
Cornell University, Department of Computer Science & Department of Information Science

 

Time and Place

First lecture: January 29, 2019
Last meeting: May 7, 2019
Time: Tuesday/Thursday, 2:55pm - 4:10pm
Room: Gates G01 / Bloomberg 91

Exam: April 25
Project Report: May 13

Course Description

This course gives a graduate-level introduction to machine learning and in-depth coverage of new and advanced methods in machine learning, as well as their underlying theory. It emphasizes approaches with practical relevance and discusses a number of recent applications of machine learning in areas like information retrieval, recommender systems, data mining, computer vision, natural language processing and robotics. An open research project is a major part of the course.

In particular, the course will cover the following topics:

  • Supervised Batch Learning: model, decision-theoretic foundation, model selection, model assessment, empirical risk minimization
  • Decision Trees: TDIDT, attribute selection, pruning and overfitting
  • Statistical Learning Theory: generalization error bounds, VC dimension
  • Large-Margin Methods and Kernels: linear rules, margin, Perceptron, SVMs, duality, non-linear rules through kernels
  • Deep Learning: multi-layer perceptrons, deep networks, stochastic gradient
  • Probabilistic Models: generative vs. discriminative, maximum likelihood, Bayesian inference
  • Structured Output Prediction: undirected graphical models, structural SVMs, conditional random fields
  • Latent Variable Models: k-means clustering, mixture of Gaussians, expectation-maximization algorithm, matrix factorization, embeddings
  • Online Learning: experts, bandits, online convex optimization
  • Causal Inference: interventional vs. observational data, treatment effect estimation

The prerequisites for the class are programming skills (at the level of CS 2110) and basic knowledge of linear algebra (at the level of MATH 2940), probability theory (at the level of MATH 4710), and multivariable calculus (at the level of MATH 1920).

Enrollment is limited to PhD students.

 

Syllabus

  • 01/29: Introduction [slides] [slides 6-up] [video]
    • What is learning?
    • What is machine learning used for?
    • Overview of course, course policies, and contact info.
  • 01/31: Supervised Batch Learning [slides] [slides 6-up] [video]
    • Supervised learning for binary classification, multi-class classification, regression, and structured output prediction
    • Instance space, target function, training examples
    • Hypothesis space, consistency, and version space
    • Classifying with a decision tree
    • Representational power of decision trees
    • TDIDT decision tree learning algorithm
  • 02/05: Empirical Risk Minimization [slides] [slides 6-up] [video]
    • Independent and identically distributed (iid) data
    • Risk, empirical risk, and Empirical Risk Minimization (ERM)
    • Bayes decision rule and Bayes risk
    • Model selection and overfitting
  • 02/07: Statistical Learning Theory 1: Generalization Error Bounds [slides] [slides 6-up] [video]
    • Model assessment (see slides from last lecture)
    • Generalization error bound for finite H and zero error
  • 02/12: Statistical Learning Theory 2: Generalization Error Bounds [slides from last lecture] [video]
    • Generalization error bound for finite H and non-zero error
    • Probably Approximately Correct (PAC) learning
    • Theoretical characterization of overfitting
  • 02/14: Linear Rules and Perceptron [slides] [slides 6-up] [video]
    • Linear classification rules
    • Batch Perceptron learning algorithm
    • Online Perceptron learning algorithm
    • Margin of linear classifier
    • Perceptron mistake bound
  • 02/19: Optimal Hyperplanes and SVMs [slides] [slides 6-up] [video]
    • Optimal hyperplanes and margins
    • Hard-margin Support Vector Machine
    • Vapnik-Chervonenkis Dimension
    • VC dimension of hyperplanes
    • Symmetrization and error bound
  • 02/21: Soft-Margin and Duality [slides] [slides 6-up] [video]
    • Soft-margin Support Vector Machine
    • Dual Perceptron
    • Primal and dual SVM optimization problem
    • Connection between primal and dual
  • 02/28: Kernels [slides] [slides 6-up] [video]
    • Bounds on the leave-one-out error of an SVM (see slides from last lecture)
    • Input space and feature space
    • Kernels as a method for non-linear learning
    • Kernels for learning with non-vectorial data
  • 03/05: Regularized Linear Models [slides] [slides 6-up] [video]
    • Conditional maximum likelihood
    • Maximum a posteriori estimates
    • Logistic regression
    • Ridge regression
  • 03/07: Deep Network Models [slides] [slides 6-up] [video]
    • Multi-layer perceptrons
    • Relationship to kernels
    • Stochastic gradient descent
    • Convolution and pooling
  • 03/12: Generative Models for Classification [slides] [slides 6-up] [video]
    • Modeling the data generating process
    • Multivariate naive Bayes classifier
    • Linear discriminant analysis
    • Multinomial naive Bayes classifier
  • 03/14: Generative Models for Sequence Prediction [slides] [slides 6-up] [video]
    • Hidden Markov Model (HMM)
    • Estimation of HMM parameters
    • Viterbi algorithm
  • 03/19: Discriminative Training for Structured Output Prediction [slides] [slides 6-up] [video]
    • Structural SVMs
    • Conditional Random Fields
    • Encoder/Decoder Networks
  • 03/21: Online Learning: Expert Setting [slides] [slides 6-up] [video]
    • Expert online learning model
    • Halving algorithm and optimal mistake bound
    • Regret
    • Weighted Majority algorithm
  • 03/26: Online Learning: Bandit Setting [slides] [slides 6-up] [video]
    • Exponentiated Gradient algorithm (see slides from last lecture)
    • Adversarial Bandit model
    • EXP3 algorithm
    • Stochastic Bandit model
    • UCB1 algorithm
  • 03/28: Clustering [slides] [slides 6-up] [video]
    • Unsupervised learning tasks
    • Hierarchical agglomerative clustering
    • k-Means
  • 04/09: Latent Variable Models [slides] [slides 6-up] [video]
    • Mixture of Gaussians
    • Expectation-maximization (EM) algorithm
    • Derivation of EM
    • Latent variable problems beyond mixtures
  • 04/11: Recommender Systems [slides] [slides 6-up] [video]
    • Uses of recommender systems
    • Content vs. collaborative recommendation
    • Low-rank matrix completion
    • Selection biases in preference data
  • 04/16: Counterfactual Evaluation and Learning [slides] [slides 6-up] [video]
    • Contextual-bandit feedback
    • Policy and utility
    • Potential outcomes model
    • On-policy vs. off-policy estimators
  • 04/21: Counterfactual Evaluation and Learning [slides] [slides 6-up] [video]
    • Counterfactual Risk Minimization
    • POEM
    • BanditNet
  • 04/30: Learning to Rank [slides] [slides 6-up] [video]
    • Behavior and ranking metrics
    • LTR with expert labels
    • LTR with implicit feedback
    • Connection to counterfactual evaluation and learning
  • 05/07: Wrap-Up [slides] [slides 6-up] [video]
    • Main themes of class
    • What we did not cover
    • Follow-up courses

Staff and Office Hours

Please use the CS6780 Piazza Forum as the primary channel for questions and discussions.

For peer feedback, we are using this CMT Instance for this course.

For grades, we are using CMS.

Office hours (join via Zoom):

  • Mondays, 10:30am-11:30am: Ashudeep Singh (Rhodes 402)
  • Mondays, 1:30pm-2:30pm: Aman Agarwal (Rhodes 412)
  • Thursdays, 11:30am-12:30pm: Ashudeep Singh (Rhodes 400)
  • Fridays, 11:00am-12:00pm: Thorsten Joachims (Gates 418)
  • Fridays, 2:00pm-3:00pm: Aman Agarwal (Rhodes 402)
 

Assignments and Exams

Homework assignments can be downloaded from CMS and are submitted using the mechanism specified in the homework handout.

All assignments are due at the beginning of class on the due date. Assignments turned in late are charged a 1 percentage point reduction of the cumulative final homework grade for each 24-hour period that the assignment is late. However, every student has a budget of 5 late days (i.e., 24-hour periods after the time the assignment was due) throughout the semester for which there is no late penalty. So, if you have perfect scores of 100 on all 4 homeworks and a total of 8 late days, your final homework score will be 97. No assignment will be accepted after the solution has been made public, which is typically 3-5 days after the due date. You can submit late assignments via CMS.
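For concreteness, here is a minimal sketch in Python of how the late penalty in the example above works out; the function and variable names are illustrative, not part of any official grading code.

    # Illustrative sketch of the late-day policy described above (not official code).
    FREE_LATE_DAYS = 5     # semester-wide budget of penalty-free late days
    PENALTY_PER_DAY = 1    # percentage points off the cumulative homework grade per extra late day

    def homework_score_after_lateness(cumulative_homework_score, total_late_days):
        """Apply the late penalty to the cumulative homework grade."""
        penalized_days = max(0, total_late_days - FREE_LATE_DAYS)
        return cumulative_homework_score - PENALTY_PER_DAY * penalized_days

    # Example from the policy text: perfect scores (100) and 8 late days -> 97.
    print(homework_score_after_lateness(100, 8))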

Regrade requests can be submitted within 7 days after the grades have been made available using the mechanism specified in the homework handout.

 

Grading

This is a 4-credit course. Grades will be determined based on a written exam, a research project, homework assignments, and class participation.

  • 40%: Exam
  • 35%: Research Project
  • 20%: Homework (~4 assignments)
  • 5%: Class Participation (e.g., lecture, Piazza)

To eliminate outlier homework grades, the lowest homework grade is replaced by the second-lowest grade when the homework scores are combined at the end of the semester.
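As an illustration, here is a minimal Python sketch of how the homework scores are combined under this rule and weighted into the final score; the names and example numbers are hypothetical, not official grading code.

    # Illustrative sketch of the grading scheme described above (not official code).
    def combined_homework_score(homework_scores):
        """Replace the lowest homework grade by the second lowest, then average."""
        scores = sorted(homework_scores)
        scores[0] = scores[1]  # lowest grade is replaced by the second-lowest grade
        return sum(scores) / len(scores)

    def final_score(exam, project, homework_scores, participation):
        """Weighted total: 40% exam, 35% project, 20% homework, 5% participation."""
        return (0.40 * exam + 0.35 * project
                + 0.20 * combined_homework_score(homework_scores)
                + 0.05 * participation)

    # Hypothetical example with four homework scores:
    print(final_score(exam=90, project=85, homework_scores=[70, 95, 88, 92], participation=100))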

All assignment, exam, and final grades (including their + and - variants) are roughly on the following scale: A=92-100; B=82-88; C=72-78; D=60-68; F=below 60.

For the research project, we will use peer review analogous to how scientific papers are reviewed for conferences and journals. This means that you will read and comment on other students' work, which provides input for the TAs and the professor to determine the project grades. The quality of your reviewing also becomes a component of your own grade.

Students taking the class S/U do all work, except for the project, and need to receive at least a grade equivalent to a D to pass the course.

 

Reference Material

The main textbook for the class is:

  • Kevin Murphy, "Machine Learning - a Probabilistic Perspective", MIT Press, 2012. (online via Cornell Library)

We will also read original research papers and other sources from the following list:

  • Tom Mitchell, "Machine Learning", McGraw Hill, 1997.
  • Cristianini, Shawe-Taylor, "Introduction to Support Vector Machines", Cambridge University Press, 2000. (online via Cornell Library)
  • Schoelkopf, Smola, "Learning with Kernels", MIT Press, 2001. (online)
  • Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.
  • Ethem Alpaydin, "Introduction to Machine Learning", MIT Press, 2004.
  • Devroye, Gyoerfi, Lugosi, "A Probabilistic Theory of Pattern Recognition", Springer, 1997.
  • Duda, Hart, Stork, "Pattern Classification", Wiley, 2000.
  • Hastie, Tibshirani, Friedman, "The Elements of Statistical Learning", Springer, 2001.
  • Imbens, Rubin, "Causal Inference for Statistics, Social, and Biomedical Sciences", Cambridge, 2015. (online via Cornell Library)
  • Leeds Tutorial on HMMs (online)
  • Manning, Schuetze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999. (online via Cornell Library)
  • Manning, Raghavan, Schuetze, "Introduction to Information Retrieval", Cambridge, 2008. (online)
  • Vapnik, "Statistical Learning Theory", Wiley, 1998.

Academic Integrity

This course follows the Cornell University Code of Academic Integrity, and each student in this course is expected to abide by it. Any work submitted by a student in this course for academic credit will be the student's own work. Collaborations are allowed only if explicitly permitted. Violations of the rules (e.g., cheating, copying, non-approved collaborations) will not be tolerated. Respectful, constructive, and inclusive conduct is expected of all class participants.