Term | Fall 2018 | Instructor | Christopher De Sa |
Course website | www.cs.cornell.edu/courses/cs6787/2018fa/ | [email hidden] | |
Schedule | MW 7:30pm – 8:45pm | Office hours | W 2:00pm – 3:00pm or by appointment |
Room | Upson Hall 142 | Office | Bill and Melinda Gates Hall 450 |
Description: So you've taken a machine learning class. You know the models people use to solve their problems. You know the algorithms they use for learning. You know how to evaluate the quality of their solutions.
But when we look at a large-scale machine learning application that is deployed in practice, it's not always exactly what you learned in class. Sure, the basic models, the basic algorithms are all there. But they're modified a bit, in a bunch of different ways, to run faster and more efficiently. And these modifications are really important—they often are what make the system tractable to run on the data it needs to process.
CS6787 is a graduate-level introduction to these system-focused aspects of machine learning, covering guiding principles and commonly used techniques for scaling up to large data sets. Informally, we will cover the techniques that lie between a standard machine learning course and an efficient systems implementation. Topics will include stochastic gradient descent, acceleration, variance reduction, methods for choosing hyperparameters, parallelization within a chip and across a cluster, popular ML frameworks, and innovations in hardware architectures. An open-ended project in which students apply these techniques is a major part of the course.
Prerequisites: Knowledge of machine learning at the level of CS 4780. Knowledge of computer systems and hardware at the level of CS 3410 is useful but not required.
Format: For half of the classes, typically on Mondays, there will be a traditionally formatted lecture. For the other half of the classes, typically on Wednesdays, we will read and discuss two seminal papers relevant to the course topic. In these classes, groups of students will present the paper contents (each student will sign up in a group to present one paper for 15-20 minutes), followed by breakout discussions about the material.
Final project parameters and course calendar may be subject to change.
Grading: Students will be evaluated on the following basis.
20% | Paper presentation |
10% | Discussion participation |
30% | Paper reviews — students must submit a review for every pair of papers we discuss |
40% | Final project |
Paper review parameters: Paper reviews should be about one page (single-spaced) in length. Each review should mirror what an actual conference review would look like (although you needn't assign scores or anything like that). In particular, you should at least: (1) summarize the paper, (2) discuss the paper's strengths and weaknesses, and (3) discuss the paper's impact. For reference, you can read the NIPS reviewer guidelines, starting with the "Review content" section. Of course, your review will not be precisely like a real review, in large part because we already know the impact of these papers. You can submit any review up to two days late with no penalty. Students who presented a paper do not have to submit a review of that paper (although you can if you want).
Final project parameters (subject to change): The final project can be done in groups of up to three (although more work will be expected from groups with more people). The subject of the project is open-ended, but it must include:
Monday, August 27 | Lecture 1. [Slides] [Demo Notebook] Topics:
|
Wednesday, August 29 |
In class: Sign-up for paper presentations.
Lecture 2. [Slides] [Demo Notebook] Topics:
|
Monday, September 3 | Labor Day. No lecture. |
Wednesday, September 5 | Paper Discussion 1a. Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. Proceedings of the twenty-first international conference on Machine learning (ICML), 2004 Paper Discussion 1b. Rich Caruana, Steve Lawrence, and C Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in neural information processing systems, pages 402–408, 2001 |
Monday, September 10 |
|
Wednesday, September 12 | Due: Review of Paper 1a or 1b. Paper Discussion 2a. Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015 Paper Discussion 2b. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013 |
Monday, September 17 |
Due: Review of Paper 2a or 2b.
Lecture 4. [Slides] [Demo Notebook] Topics:
|
Wednesday, September 19 | Paper Discussion 3a. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2007 Paper Discussion 3b. Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford and Alex Smola. Feature Hashing for Large Scale Multitask Learning. In International conference on machine learning (ICML), 2009 |
Monday, September 24 |
Due: Review of Paper 3a or 3b.
Lecture 5. [Slides] [Demo Notebook] Topics:
|
Wednesday, September 26 | Paper Discussion 4a. Justin Ma, Lawrence K. Saul, Stefan Savage and Geoffrey M. Voelker. Identifying Suspicious URLs: An Application of Large-Scale Online Learning. In International conference on machine learning (ICML), 2009 Paper Discussion 4b. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013 |
Monday, October 1 |
Due: Review of Paper 4a or 4b.
Lecture 6. [Slides] [Demo Notebook] Topics:
|
Wednesday, October 3 | Paper Discussion 5a. James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb): 281–305, 2012 Paper Discussion 5b. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012 |
Monday, October 8 | Fall break. No lecture. |
Wednesday, October 10 |
Due: Review of Paper 5a or 5b.
Lecture 7. [Slides] Topics:
|
Monday, October 15 |
Due: Final Project Proposal.
Lecture 8. [Slides] Topics:
|
Wednesday, October 17 | Paper Discussion 6a. John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul): 2121–2159, 2011 Paper Discussion 6b. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015 |
Monday, October 22 |
Due: Review of Paper 6a or 6b.
Lecture 9. [Slides] Topics:
|
Wednesday, October 24 |
Paper Discussion 7a.
Feng Niu, Benjamin Recht, Christopher Ré, and Stephen Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011
Paper Discussion 7b.
Jeff Dean |
Monday, October 29 |
Due: Review of Paper 7a or 7b.
Lecture 10. [Slides] [Demo Notebook] Topics:
|
Wednesday, October 31 | Paper Discussion 8a. Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International conference on machine learning, 2015 Paper Discussion 8b. Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. In Advances in neural information processing systems, 2015 |
Monday, November 5 |
Due: Review of Paper 8a or 8b.
Lecture 11. [Slides] [Demo Notebook] Topics:
|
Wednesday, November 7 | Paper Discussion 9a. Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing Neural Networks with the Hashing Trick. Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015 Paper Discussion 9b. Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR, 2016 |
Monday, November 12 |
Due: Review of Paper 9a or 9b.
Lecture 12. [Slides] Topics:
|
Wednesday, November 14 | Paper Discussion 10a. Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12. ACM, 2017 Paper Discussion 10b. Martin Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016 |
Monday, November 19 |
Due: Review of Paper 10a or 10b.
Lecture 13. [Slides] [Demo Notebook] Topics:
|
Wednesday, November 21 | Thanksgiving break. No lecture. |
Monday, November 26 | Due: Abstract for Final Project. Abstract swap and discussion. |
Wednesday, November 28 | Abstract swap continued. Lecture 14. Topics: discussion of advances in ML accelerator hardware. |
Monday, December 3 |
Lecture 15. Epilogue.
|
Wednesday, December 5 | No lecture. Due: Final Project Report. |