Syllabus for CS6787

Advanced Machine Learning Systems — Fall 2021

Term: Fall 2021
Instructor: Christopher De Sa
Room: Phillips Hall 219
E-mail: [email hidden]
Schedule: MW 7:30pm – 8:45pm
Office hours: W 2:00pm – 3:00pm
Forum: Ed Discussion
Office: Gates 450

So you've taken a machine learning class. You know the models people use to solve their problems. You know the algorithms they use for learning. You know how to evaluate the quality of their solutions.

But when we look at a large-scale machine learning application that is deployed in practice, it's not always exactly what you learned in class. Sure, the basic models, the basic algorithms are all there. But they're modified a bit, in a bunch of different ways, to run faster and more efficiently. And these modifications are really important—they often are what make the system tractable to run on the data it needs to process.

CS6787 is a graduate-level introduction to these system-focused aspects of machine learning, covering guiding principles and commonly used techniques for scaling up learning to large data sets. Informally, we will cover the techniques that lie between a standard machine learning course and an efficient systems implementation: both statistical/optimization techniques based on improving the convergence rate of learning algorithms and techniques that improve performance by leveraging the capabilities of the underlying hardware. Topics will include stochastic gradient descent, acceleration, variance reduction, methods for choosing hyperparameters, parallelization within a chip and across a cluster, popular ML frameworks, and innovations in hardware architectures. An open-ended project in which students apply these techniques is a major part of the course.

Prerequisites: Knowledge of machine learning at the level of CS4780. If you are an undergraduate, you should have taken CS4780 or an equivalent course, since it is a prerequisite. Knowledge of computer systems and hardware at the level of CS3410 is recommended but not required.

Format: About half of the classes will involve traditionally formatted lectures. For the other half of the classes, we will read and discuss two seminal papers relevant to the course topic. In these classes, groups of students will present the paper contents (each student will sign up in a group to present one paper for 15-20 minutes), followed by breakout discussions about the material. Historically, the lectures have occurred on Mondays and the discussions have occurred on Wednesdays, but due to the non-standard timeline this semester, these course elements will be scheduled irregularly (see schedule below).

Grading: Students will be evaluated on the following basis.

20%  Paper presentation
10%  Discussion participation
20%  Paper reviews
10%  Programming assignments
40%  Final project

Paper review parameters: Paper reviews should be about one page (single-spaced) in length. The review should mirror what an actual conference review would look like (although you needn't assign scores or anything like that). In particular, you should at least: (1) summarize the paper, (2) discuss the paper's strengths and weaknesses, and (3) discuss the paper's impact. For reference, you can read the ICML reviewer guidelines. Of course, your review will not be precisely like a real review, in large part because we already know the impact of these papers. You can submit any review up to two days late with no penalty. Students who presented a paper do not have to submit a review of that paper (although you can if you want).

Final project parameters (subject to change): The final project can be done in groups of up to three (although more work will be expected from groups with more people). The subject of the project is open-ended, but it must include:

There will be an in-class feedback activity on Monday, October 18, and you should prepare a two-minute pitch of your ideas by then. Project proposals are due on Monday, October 25. The project proposal should satisfy the following constraints: The project will culminate in a project report of at least four pages, not including references. The project report should be formatted similarly to a workshop paper, and should use the ICML 2019 style or a similar style. An abstract for the report is due on Monday, November 29, and we will discuss the abstracts in class on Wednesday, December 1 (these abstracts may be submitted late until Tuesday night with no penalty). The final project report is due on Tuesday, December 7, the last day of instruction.


Course Calendar

Monday, August 30
In Person
Lecture #1: Overview.
[Slides] [Demo Notebook] [Demo HTML]
  • Overview
  • Course outline and syllabus
  • Learning with gradient descent
  • Stochastic gradient descent: the workhorse of machine learning
  • Theory of SGD for convex objectives: our first look at trade-offs
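For a concrete picture of the stochastic gradient descent topic listed above, here is a minimal sketch, not taken from the lecture materials. It assumes a one-dimensional least-squares model and a made-up noiseless dataset; the function name, step size, and epoch count are illustrative choices, and the example order is reshuffled each epoch (one common SGD variant).

```python
import random

def sgd_least_squares(data, step_size=0.1, epochs=100, seed=0):
    """Fit y ~ w*x + b by SGD on the squared error, one example per step."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            x, y = data[i]
            err = (w * x + b) - y      # residual on this single example
            w -= step_size * err * x   # d/dw of 0.5 * err**2
            b -= step_size * err       # d/db of 0.5 * err**2
    return w, b

# Hypothetical noiseless data generated from y = 2x + 1.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_least_squares(data)
print(w, b)
```

Because the data here is noiseless, the iterates should settle very close to w = 2 and b = 1; with noisy data and a constant step size, SGD would typically hover in a neighborhood of the optimum instead.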
Wednesday, September 1
In Person
Lecture #2: Backpropagation & ML Frameworks.
[Slides] [Demo Notebook] [Demo HTML]
  • Backpropagation and automatic differentiation
  • Machine learning frameworks I: the user interface
  • Overfitting
  • Generalization error
  • Early stopping
Optional extra reading. Some older papers on SGD and backpropagation!

Presentation signup: due Friday. (Survey link)
Monday, September 6
Labor Day: No classes.
Wednesday, September 8
In Person
Lecture #3: Hyperparameters and Tradeoffs.
[Slides] [Demo Notebook] [Demo HTML]
  • Our first hyperparameters: step size/learning rate, minibatch size
  • Regularization
  • Application-specific forms of regularization
  • The condition number
  • Momentum and acceleration
  • Momentum for quadratic optimization
  • Momentum for convex optimization
Released: Programming Assignment 1.
Monday, September 13
In Person
Paper Discussion 1a.
On the importance of initialization and momentum in deep learning.
Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton.
Proceedings of the International Conference on Machine Learning (ICML), 2013.

Paper Discussion 1b.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
Sergey Ioffe, Christian Szegedy.
Proceedings of the International Conference on Machine Learning (ICML), 2015.
Wednesday, September 15
In Person
Lecture #4: Kernels and Dimensionality Reduction.
[Slides] [Demo Notebook] [Demo HTML]
  • The kernel trick
  • Gram matrix versus feature extraction: systems tradeoffs
  • Adaptive/data-dependent feature mappings
  • Dimensionality reduction
Monday, September 20
In Person
Paper Discussion 2a.
Random features for large-scale kernel machines.
Ali Rahimi and Benjamin Recht.
In Advances in Neural Information Processing Systems (NeurIPS), 2007.

Paper Discussion 2b.
Feature Hashing for Large Scale Multitask Learning.
Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford and Alex Smola.
Proceedings of the International Conference on Machine Learning (ICML), 2009.

Due: Review of paper 1a or 1b.
Wednesday, September 22
In Person
Lecture #5: Online Learning and Variance Reduction.
[Slides] [Demo Notebook] [Demo HTML]
  • Online versus offline learning
  • Variance reduction
  • SVRG
  • Fast linear rates for convex objectives
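As an illustration of the SVRG topic above, the following is a rough sketch of the variance-reduced update on a one-parameter least-squares problem. It is not the course's reference code: the dataset is synthetic, the function name and hyperparameters are assumptions, and the last inner iterate is used as the next snapshot (one of the options described in the Johnson and Zhang paper listed under Paper Discussion 3b).

```python
import random

def svrg_1d(xs, ys, step_size=0.1, outer_iters=30, seed=0):
    """Minimize the average of 0.5 * (w * x_i - y_i)**2 with SVRG."""
    n = len(xs)
    rng = random.Random(seed)

    def grad_i(w, i):
        # Gradient of the i-th loss term with respect to w.
        return (w * xs[i] - ys[i]) * xs[i]

    w_snap = 0.0
    for _ in range(outer_iters):
        # Full gradient at the snapshot, computed once per outer iteration.
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        w = w_snap
        for _ in range(2 * n):
            i = rng.randrange(n)
            # Stochastic gradient, corrected by the snapshot's gradients.
            w -= step_size * (grad_i(w, i) - grad_i(w_snap, i) + full_grad)
        w_snap = w  # the last inner iterate becomes the new snapshot
    return w_snap

# Hypothetical noisy data generated around y = 3x.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(50)]
ys = [3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

w_svrg = svrg_1d(xs, ys)
w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(w_svrg, w_star)
```

The closed-form least-squares solution `w_star` is included only as a check. Unlike constant-step SGD on noisy data, the corrected update should keep shrinking the error toward `w_star` rather than stalling in a noise ball.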
Due: Programming Assignment 1.

Released: Programming Assignment 2.
Monday, September 27
In Person
Paper Discussion 3a.
Identifying Suspicious URLs: An Application of Large-Scale Online Learning.
Justin Ma, Lawrence K. Saul, Stefan Savage and Geoffrey M. Voelker.
Proceedings of the International Conference on Machine Learning (ICML), 2009.

Paper Discussion 3b.
Accelerating stochastic gradient descent using predictive variance reduction.
Rie Johnson and Tong Zhang.
In Advances in Neural Information Processing Systems (NeurIPS), 2013.

Due: Review of paper 2a or 2b.
Wednesday, September 29
In Person
Lecture #6: Hyperparameter Optimization.
[Slides] [Demo Notebook] [Demo HTML]
  • Hyperparameter optimization
  • Assigning parameters from folklore
  • Random search over parameters
Monday, October 4
In Person
Paper Discussion 4a.
Random search for hyper-parameter optimization.
James Bergstra and Yoshua Bengio.
Journal of Machine Learning Research (JMLR), 2012.

Paper Discussion 4b.
A System for Massively Parallel Hyperparameter Tuning.
Liam Li et al.
Proceedings of the 2nd Conference on Machine Learning and Systems, 2020.

Due: Review of paper 3a or 3b.
Wednesday, October 6
In Person
Lecture #7: Adaptive Methods & Non-Convex Optimization.
[Slides] [Demo Notebook] [Demo HTML]
  • Adaptive methods
  • AdaGrad
  • Adam
  • Non-convex optimization
Due: Programming Assignment 2.
Monday, October 11
Indigenous Peoples' Day: No classes.
Wednesday, October 13
In Person
Paper Discussion 5a.
The Marginal Value of Adaptive Gradient Methods in Machine Learning.
Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro and Benjamin Recht.
In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Paper Discussion 5b.
Adam: A method for stochastic optimization.
Diederik Kingma and Jimmy Ba.
Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Due: Review of paper 4a or 4b.
Monday, October 18
In Person
Lecture #8: Parallelism.
[Slides] [Demo Notebook] [Demo HTML]
  • Hardware trends that lead to parallelism
  • Sources of parallelism in hardware
  • Data parallelism
  • Extracting parallelism at different places in the computation
  • Simple parallelism on multicore

In-class project feedback activity.
Wednesday, October 20
In Person
Paper Discussion 6a.
Map-reduce for machine learning on multicore.
Cheng-Tao Chu, Sang K Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun.
In Advances in Neural Information Processing Systems (NeurIPS), 2007.

Paper Discussion 6b.
Hogwild: A lock-free approach to parallelizing stochastic gradient descent.
Feng Niu, Benjamin Recht, Christopher Re, and Stephen Wright.
In Advances in Neural Information Processing Systems (NeurIPS), 2011.

Due: Review of paper 5a or 5b.
Monday, October 25
In Person
Lecture #9: Distributed Learning.
[Slides]
  • Learning on multiple machines
  • SGD with all-reduce
  • The parameter server
  • Asynchronous parallelism on multiple machines
  • Decentralized and local SGD
  • Model and pipeline parallelism

Due: Final project proposals.
Wednesday, October 27
In Person
Paper Discussion 7a.
Large scale distributed deep networks.
Jeff Dean et al.
In Advances in Neural Information Processing Systems (NeurIPS), 2012.

Paper Discussion 7b.
Towards federated learning at scale: System design.
Keith Bonawitz, et al.
In Proceedings of the 2nd MLSys Conference (MLSys), 2019.

Due: Review of paper 6a or 6b.
Monday, November 1
In Person
Lecture #10: Low-Precision Arithmetic.
[Slides] [Demo Notebook] [Demo HTML]
  • Memory
  • Low-precision formats
  • Floating-point machine epsilon
  • Low-precision training
  • Scan order
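To make the machine-epsilon and low-precision topics above a little more tangible, here is a small sketch using only the Python standard library. The helper names are made up for illustration; `to_float16` relies on the `struct` module's half-precision format code `e` to round a value to 16 bits.

```python
import struct
import sys

def machine_epsilon():
    """Gap between 1.0 and the next larger Python float (double precision)."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:  # stop once adding half of eps is rounded away
        eps /= 2.0
    return eps

def to_float16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

eps64 = machine_epsilon()
print(eps64 == sys.float_info.epsilon)

# A small update added to a weight stored in 16 bits can vanish entirely.
weight = 1.0
small_update = 2.0 ** -12  # below half of float16's spacing near 1.0
updated = to_float16(weight + small_update)
print(updated == weight)
```

Half precision keeps 10 fraction bits, so values near 1.0 are spaced 2^-10 apart and the 2^-12 update is rounded away. This is one reason low-precision training has to pay attention to how and where updates are accumulated.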
Wednesday, November 3
In Person
Paper Discussion 8a.
Deep learning with limited numerical precision.
Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan.
Proceedings of the International Conference on Machine Learning (ICML), 2015.

Paper Discussion 8b.
BinaryConnect: Training Deep Neural Networks with binary weights during propagations.
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David.
In Advances in Neural Information Processing Systems (NeurIPS), 2015.

Due: Review of paper 7a or 7b.
Monday, November 8
In Person
Lecture #11: Inference and Compression.
[Slides] [Demo Notebook] [Demo HTML]
  • Efficient inference
  • Metrics we care about when inferring
  • Compression
  • Fine-tuning
  • Hardware for inference
Wednesday, November 10
In Person
Paper Discussion 9a.
Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.
Song Han, Huizi Mao, and William J Dally.
Proceedings of the International Conference on Learning Representations (ICLR), 2016.

Paper Discussion 9b.
What is the State of Neural Network Pruning?
Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag.
Proceedings of the 2nd Conference on Machine Learning and Systems, 2020.

Due: Review of paper 8a or 8b.
Monday, November 15
In Person
Lecture #12: Machine Learning Frameworks II.
  • Large scale numerical linear algebra
  • Eager vs lazy
  • ML frameworks in Python
Wednesday, November 17
In Person
Paper Discussion 10a.
TensorFlow.js: Machine Learning for the Web and Beyond.
Daniel Smilkov et al.
Proceedings of the 2nd Conference on Machine Learning and Systems, 2019.

Paper Discussion 10b.
PyTorch: An Imperative Style, High-Performance Deep Learning Library.
Adam Paszke et al.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.

Due: Review of paper 9a or 9b.
Monday, November 22
In Person
Lecture #13: Hardware for Machine Learning.
  • CPUs vs GPUs
  • What makes for good ML hardware?
  • How can hardware help with ML?
  • What does modern ML hardware look like?
Wednesday, November 24
Thanksgiving Break: No classes.
Monday, November 29
In Person
Paper Discussion 11a.
In-datacenter performance analysis of a tensor processing unit.
Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al.
In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.

Paper Discussion 11b.
A Configurable Cloud-Scale DNN Processor for Real-Time AI.
Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, et al.
In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA), 2018.

Due: Review of paper 10a or 10b.

Due: Final project abstract. Can be submitted late until Wednesday afternoon; will discuss in class on Wednesday.
Wednesday, December 1
In Person
Lecture #15: Large Scale ML on the Cloud.

Abstract discussion.
Monday, December 6
In Person
Lecture #16: Final Project Discussion.

Due: Review of paper 11a or 11b.
Tuesday, December 7
Last day of instruction. No CS6787 lecture.

Due: Final project report.