Syllabus for CS6787

Advanced Machine Learning Systems — Fall 2019

Term	Fall 2019	Instructor	Christopher De Sa
Course website	www.cs.cornell.edu/courses/cs6787/2019fa/	E-mail	[email hidden]
Schedule	MW 7:30pm – 8:45pm	Office hours	W 2:00pm – 3:00pm or by appointment
Room	Upson Hall 142	Office	Bill and Melinda Gates Hall 450

[Piazza site]

Description: So you've taken a machine learning class. You know the models people use to solve their problems. You know the algorithms they use for learning. You know how to evaluate the quality of their solutions.

But when we look at a large-scale machine learning application that is deployed in practice, it's not always exactly what you learned in class. Sure, the basic models, the basic algorithms are all there. But they're modified a bit, in a bunch of different ways, to run faster and more efficiently. And these modifications are really important—they often are what make the system tractable to run on the data it needs to process.

CS6787 is a graduate-level introduction to these system-focused aspects of machine learning, covering guiding principles and commonly used techniques for scaling up learning to large data sets. Informally, we will cover the techniques that lie between a standard machine learning course and an efficient systems implementation: both statistical/optimization techniques based on improving the convergence rate of learning algorithms and techniques that improve performance by leveraging the capabilities of the underlying hardware. Topics will include stochastic gradient descent, acceleration, variance reduction, methods for choosing hyperparameters, parallelization within a chip and across a cluster, popular ML frameworks, and innovations in hardware architectures. An open-ended project in which students apply these techniques is a major part of the course.

Prerequisites: Knowledge of machine learning at the level of CS4780. If you are an undergraduate, you should have taken CS4780, since it is a prerequisite. Knowledge of computer systems and hardware at the level of CS3410 would also be useful, but it is not a prerequisite.

Format: For half of the classes, typically on Mondays, there will be a traditionally formatted lecture. For the other half of the classes, typically on Wednesdays, we will read and discuss two seminal papers relevant to the course topic. In these classes, groups of students will present the contents of each paper (each student will sign up in a group to present one paper for 15-20 minutes), followed by breakout discussions about the material.

Final project parameters and course calendar may be subject to change.

Grading: Students will be evaluated on the following basis.

20%	Paper presentation
10%	Discussion participation
20%	Paper reviews — students must submit a review for every pair of papers we discuss, but not for the week when they presented a paper
10%	Programming assignments
40%	Final project

Paper review parameters: Paper reviews should be about one page (single-spaced) in length. Your review should mirror what an actual conference review would look like (although you needn't assign scores or anything like that). In particular, you should at least: (1) summarize the paper, (2) discuss the paper's strengths and weaknesses, and (3) discuss the paper's impact. For reference, you can read the NIPS reviewer guidelines, starting with the "Review content" section. Of course, your review will not be precisely like a real review, in large part because we already know the impact of these papers. You can submit any review up to two days late with no penalty. Students who presented a paper do not have to submit a review of that paper (although you can if you want).

Final project parameters (subject to change): The final project can be done in groups of up to three (although more work will be expected from groups with more people). The subject of the project is open-ended, but it must include:

an implementation of a machine learning system for some task,
an exploration of one or more of the techniques discussed in the course (or similar techniques subject to instructor approval), and
an empirical evaluation of the system's performance, compared with some baseline method, in two ways:

statistical performance (e.g. iterations to converge to some accuracy threshold), and
hardware performance (e.g. throughput or wall-clock time).
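
As an illustration of the two kinds of measurement, here is a minimal sketch that reports both for plain SGD on a toy problem. Everything in it is an assumption made for the example, not a course requirement: the synthetic two-feature regression task, the helper names make_data, mean_squared_error, and run_sgd, the loss threshold, and the two step sizes being compared. A real project would substitute its own system, baseline, and metrics.

```python
import random
import time


def make_data(n, seed=0):
    """Synthetic, noiseless regression data with y = 2*x1 - 3*x2 (a made-up task)."""
    rng = random.Random(seed)
    xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    ys = [2.0 * x1 - 3.0 * x2 for x1, x2 in xs]
    return xs, ys


def mean_squared_error(w, xs, ys):
    """Average squared prediction error of the linear model w over the data set."""
    total = sum((w[0] * x1 + w[1] * x2 - y) ** 2 for (x1, x2), y in zip(xs, ys))
    return total / len(xs)


def run_sgd(xs, ys, step_size, loss_threshold,
            max_iters=100_000, eval_every=100, seed=1):
    """Run plain SGD and return (iterations to threshold or None, seconds, iterations/sec).

    The loss is only checked every `eval_every` iterations, so the reported
    iteration count is rounded up to a multiple of `eval_every`. The timing
    includes the cost of those loss evaluations.
    """
    rng = random.Random(seed)
    w = [0.0, 0.0]
    iters_to_threshold = None
    iters_done = 0
    start = time.perf_counter()
    for t in range(1, max_iters + 1):
        i = rng.randrange(len(xs))
        (x1, x2), y = xs[i], ys[i]
        err = w[0] * x1 + w[1] * x2 - y
        # Gradient of the squared error on one example is 2 * err * x.
        w[0] -= step_size * 2.0 * err * x1
        w[1] -= step_size * 2.0 * err * x2
        iters_done = t
        if t % eval_every == 0 and mean_squared_error(w, xs, ys) <= loss_threshold:
            iters_to_threshold = t
            break
    elapsed = time.perf_counter() - start
    throughput = iters_done / elapsed if elapsed > 0 else float("inf")
    return iters_to_threshold, elapsed, throughput


if __name__ == "__main__":
    xs, ys = make_data(200)
    # Compare a "baseline" step size against a larger one (both chosen arbitrarily).
    for name, step in [("baseline", 0.01), ("larger step", 0.1)]:
        iters, seconds, rate = run_sgd(xs, ys, step_size=step, loss_threshold=1e-6)
        print(f"{name}: iterations={iters}, seconds={seconds:.4f}, iters/sec={rate:.0f}")
```

The iteration count is the statistical measurement and the elapsed time and iterations per second are the hardware measurements. Reporting them separately makes it possible to tell whether a change helped by needing fewer steps or by making each step cheaper.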

There will be an in-class feedback activity on Wednesday, October 16, and you should prepare a two-minute pitch of your ideas by then. Project proposals are due on Monday, October 21. The project proposal should satisfy the following constraints:

The main body should be about one page in length.
It should describe the project you intend to do.
It should contain at least one citation of a relevant paper that we did not cover in class (but preferably more).
It should include some preliminary or exploratory work you've already done that helps to support the idea that your project is feasible (this preliminary work can be very minimal, but should indicate that you've gotten started, or at least have a clear idea of how to do so).
In addition to the one-page text proposal, it should contain one short experiment plan per person, which should consist of:
- a hypothesis
- a proxy statement which describes what metric you are going to use to measure the variables you care about
- a short protocol statement describing what you are going to do
- the results you expect to get
The experiment plan should not be longer than half a page, and may be much shorter.

The project will culminate in a project report of at least four pages, not including references. The project report should be formatted similarly to a workshop paper, and should use the ICML 2018 style or a similar style. An abstract for the report is due on Wednesday, December 4, and we will discuss the abstracts in class on that day. The final project report is due on Tuesday, December 10.

Course Calendar

Monday, September 2	Labor day. No lecture.
Wednesday, September 4	Lecture 1. [Slides] [Demo Notebook] [Demo HTML] Overview; Course outline and syllabus; Gradient descent; Stochastic gradient descent: the workhorse of machine learning; Theory of SGD for convex objectives
Monday, September 9	Lecture 2. [Slides] [Demo Notebook] [Demo HTML] Backpropagation and automatic differentiation; Machine learning frameworks I: the user interface; Overfitting; Generalization error; Early stopping. Optional extra reading. Some older papers on SGD and backpropagation! Hinton, Geoffrey E. Learning distributed representations of concepts. Proceedings of the eighth annual conference of the cognitive science society. Vol. 1. 1986. Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive modeling 5.3 (1988): 1. Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. Proceedings of the International Conference on Machine Learning (ICML), 2004.
Wednesday, September 11	Paper Discussion 1a. Martin Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Preliminary White Paper, 2015. Since this is a white paper and is a bit longer than what we'll usually be reading, we will cover Sections 1, 2, 4.1, 6, and 9 only. Paper Discussion 1b. Rich Caruana, Steve Lawrence, and C Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems (NeurIPS), 2001.
Monday, September 16	Due: Review of Paper 1a or 1b. Lecture 3. [Slides] [Demo Notebook] [Demo HTML] Our first hyperparameters: step size/learning rate, minibatch size; Regularization; Application-specific forms of regularization; The condition number; Momentum and acceleration; Momentum for quadratic optimization; Momentum for convex optimization
Wednesday, September 18	Paper Discussion 2a. Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the International Conference on Machine Learning (ICML), 2015. Paper Discussion 2b. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. Proceedings of the International Conference on Machine Learning (ICML), 2013.
Monday, September 23	Due: Review of Paper 2a or 2b. Lecture 4. [Slides] [Demo Notebook] [Demo HTML] The kernel trick; Gram matrix versus feature extraction: systems tradeoffs; Adaptive/data-dependent feature mappings; Dimensionality reduction
Wednesday, September 25	Programming Assignment 1 released. Paper Discussion 3a. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NeurIPS), 2007. Paper Discussion 3b. Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford and Alex Smola. Feature Hashing for Large Scale Multitask Learning. Proceedings of the International Conference on Machine Learning (ICML), 2009.
Monday, September 30	Due: Review of Paper 3a or 3b. Lecture 5. [Slides] [Demo Notebook] [Demo HTML] Online versus offline learning; Variance reduction; SVRG; Fast linear rates for convex objectives
Wednesday, October 2	Paper Discussion 4a. Justin Ma, Lawrence K. Saul, Stefan Savage and Geoffrey M. Voelker. Identifying Suspicious URLs: An Application of Large-Scale Online Learning. Proceedings of the International Conference on Machine Learning (ICML), 2009. Paper Discussion 4b. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NeurIPS), 2013.
Monday, October 7	Due: Review of Paper 4a or 4b. Lecture 6. [Slides] [Demo Notebook] [Demo HTML] Hyperparameter optimization; Assigning parameters from folklore; Random search over parameters
Wednesday, October 9	Programming Assignment 1 due. Paper Discussion 5a. James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research (JMLR), 2012. Paper Discussion 5b. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
Monday, October 14	Fall break. No lecture.
Wednesday, October 16	Programming Assignment 2 released. Due: Review of Paper 5a or 5b. Lecture 7. [Slides] Non-convex stochastic gradient descent; Weakness of theoretical guarantees; Deep learning as non-convex optimization; One case where we can say something: convergence to a stationary point. Activity: In-class discussion of course project ideas.
Monday, October 21	Due: Final Project Proposal. Lecture 8. [Slides] [Demo Notebook] [Demo HTML] Adaptive learning rates; Algorithms other than SGD; Inference; Methods for sampling and statistical inference
Wednesday, October 23	Paper Discussion 6a. Ashia C Wilson and Rebecca Roelofs and Mitchell Stern and Nati Srebro and Benjamin Recht. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017. Paper Discussion 6b. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the International Conference on Learning Representations (ICLR), 2015.
Monday, October 28	Programming Assignment 2 due. Due: Review of Paper 6a or 6b. Lecture 9. [Slides] Major bottleneck for ML systems: parallelism; Asynchronous execution; Hogwild!
Wednesday, October 30	Paper Discussion 7a. Feng Niu, Benjamin Recht, Christopher Re, and Stephen Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), 2011. Paper Discussion 7b. Jeff Dean et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
Monday, November 4	Due: Review of Paper 7a or 7b. Lecture 10. [Slides] [Demo Notebook] [Demo HTML] Major bottleneck for ML systems: memory bandwidth and locality; Low precision computation; Vector computation; Scan orders
Wednesday, November 6	Paper Discussion 8a. Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. Proceedings of the International Conference on Machine Learning (ICML), 2015. Paper Discussion 8b. Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. In Advances in Neural Information Processing Systems (NeurIPS), 2015.
Monday, November 11	Due: Review of Paper 8a or 8b. Lecture 11. What happens on the inference side? Specialized low-cost models; Compression; Hardware for machine learning; The dominance of GPUs; Accelerators for machine learning; Will all computation become matrix multiply?
Wednesday, November 13	Paper Discussion 9a. Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing Neural Networks with the Hashing Trick. Proceedings of the International Conference on Machine Learning (ICML), 2015. Paper Discussion 9b. Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. Proceedings of the International Conference on Learning Representations (ICLR), 2016.
Monday, November 18	Due: Review of Paper 9a or 9b. Lecture 12. Machine learning frameworks II: the rest of the story; TensorFlow; SciKit-Learn; PyTorch; Is Python the ML language of the future?
Wednesday, November 20	Paper Discussion 10a. Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017. Paper Discussion 10b. Martin Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
Monday, November 25	Due: Review of Paper 10a or 10b. Lecture 13. [Slides] [Demo Notebook] [Demo HTML] Final project abstract and report requirements; Sparse matrix computations; Structured matrices
Wednesday, November 27	Thanksgiving break. No lecture.
Monday, December 2	Class cancelled due to weather.
Wednesday, December 4	Due: Abstract for Final Project. Abstract swap and discussion.
Monday, December 9	Lecture 15. Guest Lecture (Anil Damle): Large scale numerical methods
Tuesday, December 10	Due: Final Project Report.