Programming Assignment 2: Scaling with SGD

CS4787 — Principles of Large-Scale Machine Learning — Spring 2020

Project Due: March 4, 2020 at 11:59pm

Late Policy: Up to two slip days can be used for the final submission.

Please submit all required documents to CMS.

This is a partner project. You can either work alone, or work ONLY with your chosen partner. Failure to adhere to this rule (by e.g. copying code) may result in an Academic Integrity Violation.

Overview: In this project, you will be implementing and evaluating stochastic gradient descent on the same MNIST task as in Programming Assignment 1. You will also be evaluating the effects of different learning rates and batch sizes, as well as exploring empirically the impact of using different sampling schemes for stochastic gradient descent.

Background: So far in class, we've been talking about stochastic gradient descent implemented as the following

Algorithm 1 Stochastic Gradient Descent with Fixed $\alpha$ and Random Sampling

1:procedure SGD( $(f_1, \ldots, f_n), \alpha, w_0, T$ )

2:initialize $w \leftarrow w_0$

3:for $t = 1$ to $T$ do

4:sample $\tilde i$ uniformly at random from $\{1, \ldots, n\}$

5: $w \leftarrow w - \alpha \cdot \nabla f_{\tilde i}(w)$

6:end for

7:return $w$

8:end procedure

(Possibly we may want to use some averaging or random sampling of the iterates, but we'll ignore that for now.) One problem with Algorithm 1 is that since the training examples are chosen at random, they are unpredictable (or to use a more technical term, they have poor memory locality). This can make our results difficult to replicate, which can be bad when we want to scale our systems and test them automatically. Sampling at random can also be bad for the memory subsystem of the processor we are running on, especially if it depends on caches that try to predict which data to prefetch and make available for quick access (e.g. a CPU). We will discuss this more in detail later in the course, when we talk about hardware for machine learning. But for the moment, one (naive) way to address this problem is to just sample the training examples in a predictable order. And what order is more predictable than the order in which the training examples appear in memory? This motivates the development of the following sequential-scan-order version of SGD.

Algorithm 2 Stochastic Gradient Descent with Fixed $\alpha$ and Sequential Sampling

1:procedure SGD( $(f_1, \ldots, f_n), \alpha, w_0, \texttt{num\_epochs}$ )

2:initialize $w \leftarrow w_0$

3:for $t = 1$ to $\texttt{num\_epochs}$ do

4:for $i = 1$ to $n$ do

5: $w \leftarrow w - \alpha \cdot \nabla f_i(w)$

6:end for

7:end for

8:return $w$

9:end procedure

Here, $\texttt{num\_epochs}$ denotes the number of epochs, or passes through the dataset, used in training. The total number of iterations here will be $T = \texttt{num\_epochs} \cdot n / B$ . Algorithm 2 can run individual iterations faster than Algorithm 1 because its memory accesses are more predictable. (Although note that our convergence proofs from class do not apply to this modified sequential-sampling algorithm, and more care is needed to get convergence guarantees here.)

We can do the same thing with minibatching. The original version of minibatched SGD we discussed in class uses a random with-replacement subsample of size $B$ as follows:

Algorithm 3 Minibatch SGD with Fixed $\alpha$ and Random Sampling

1:procedure SGD( $(f_1, \ldots, f_n), \alpha, w_0, B, T$ )

2:initialize $w \leftarrow w_0$

3:for $t = 1$ to $T$ do

4:for $b = 1$ to $B$ do

5:sample $\tilde i_b$ uniformly at random from $\{1, \ldots, n\}$

6:end for

7: $w \leftarrow w - \alpha \cdot \frac{1}{B} \sum_{b = 1}^B \nabla f_{\tilde i_b}(w)$

8:end for

9:return $w$

10:end procedure

In order to make the accesses here more predictable, we can do the same thing as we did to baseline SGD, provided that the minibatch size $B$ evenly divides the training set size $n$ .

Algorithm 4 Minibatch SGD with Fixed $\alpha$ and Sequential Sampling

1:procedure SGD( $(f_1, \ldots, f_n), \alpha, w_0, B, \texttt{num\_epochs}$ )

2:initialize $w \leftarrow w_0$

3:for $t = 1$ to $\texttt{num\_epochs}$ do

4:for $i = 0$ to $(n / B) - 1$ do

5: $w \leftarrow w - \alpha \cdot \frac{1}{B} \sum_{b = 1}^B \nabla f_{i \cdot B + b}(w)$

6:end for

7:end for

8:return $w$

9:end procedure

Once again, $\texttt{num\_epochs}$ denotes the number of epochs, or passes through the dataset, used in training. The total number of iterations here will be $T = \texttt{num\_epochs} \cdot n / B$ , and we assume that $B$ divides $n$ evenly. In this programming assignment, we will implement, and explore the performance of all of these algorithms, as well as some variants that use a different step size at each iteration.

Instructions: This project is split into three parts: the implementation of the basic functionality of SGD, the exploration of the effects of the hyperparameters (the step size and minibatch size), and evaluation of the systems cost of running SGD.

Part 1: Implementation.

Implement a function to compute the gradient of a batch of training examples for multinomial logistic regression (with l2 regularization) on MNIST.

You can also use this function to compute the gradient of a single training example by passing in a list with only a single index in it (i.e. a minibatch of size 1).
This function should be very similar to the function you implemented in the first programming assignment (except that it computes a gradient for a minibatch, rather than the whole dataset).

Implement the four stochastic gradient descent algorithms described above in the Background section.

For consistency, the functions in the starter code in mnist.py take in how long the algorithm needs to run for in terms of epochs, i.e. total passes through the dataset. You will need to modify Algorithms 1 and 3 from their descriptions above accordingly, using the above expressions $T = \texttt{num\_epochs} \cdot n$ for Algorithm 1 and $T = \texttt{num\_epochs} \cdot n / B$ for Algorithm 3. This ensures that the same number of total gradient samples are used for all the algorithms.

Run both non-minibatched SGD algorithms (Algorithms 1 and 2) with L2 regularization parameter $\gamma = 0.0001$ and step size $\alpha = 0.001$ for 10 epochs, and record the value of the parameters every 6000 update iterations (i.e. 60000 training examples gives 10 times per epoch).
Run both minibatched SGD algorithms (Algorithms 3 and 4) with L2 regularization parameter $\gamma = 0.0001$ , step size $\alpha = 0.05$ , and minibatch size $B = 60$ for 10 epochs, and record the value of the parameters every 100 batch update iterations (i.e. 10 times per epoch).
Evaluate the training error and test error of each recorded model exactly, using the entire MNIST training and test dataset. (You can use your code from Programming Assignment 1 to do this.)
Plot the resulting error against the number of epochs in two figures, one for Training error and one for Test error. How does the performance of the four methods compare?

Part 2: Exploration.

For stochastic gradient descent (Algorithm 1) evaluate some other values of the step size other than the one given to you in Part 1.
For SGD (Algorithm 1), find a value of the step size that allows the algorithm to reach a lower training error after 10 epochs than the step size given to you in Part 1. What effect does this changed step size have on the final test error?
For SGD (Algorithm 1), can you find a value of the step size that allows the algorithm to reach the same (or better) training error after 5 epochs than could be achieved in 10 epochs using the step size given to you in Part 1? (This may be the same step size that you used in the previous step.) What effect does this changed step size have on the final test error after 5 epochs?
Now do the same thing for minibatch SGD with sequential scan (Algorithm 4). Can you find a value for the batch size and learning rate that allows the algorithm to reach the same (or better) training error after 5 epochs than could be achieved in 10 epochs using the step size given to you in Part 1? What effect does this changed step size have on the final test error after 5 epochs?
For at least three different algorithm configurations you explored in this Part, plot the resulting error against the number of epochs in two figures, one for Training error and one for Test error, just as you did for the evaluation in Part 1.

If you found hyperparameters that improved the performance in Steps 2, 3, and 4, use those hyperparameters for these figures.

Part 3: Systems Evaluation.

Measure the runtime of all four algorithms using the configurations prescribed in Part 1. How does the runtime of the algorithms compare?

To measure the runtime of each algorithm, measure the wall clock time it takes it to run all its epochs, and average your result across five (5) total runs of the algorithm.

You should observe that some of the algorithms run an epoch in substantially less time than the other algorithms do. Can you explain why?

What to submit:

An implementation of the functions in main.py.
A lab report containing:
- A one-paragraph summary of what you observed during this programming assignment.
- Plots of the training and test error of the result of your training, as described in Part 1.
- Text describing how the performance of the four models compare, when evaluated against number of epochs.
- A list of step sizes you evaluated in Part 2.1.
- A step size that allows SGD to reach a lower training error as described in Part 2.2.
- A step size that allows SGD to reach a lower training error as described in Part 2.3.
- A step size/minibatch size that allows minibatch SGD to reach a lower training error as described in Part 2.4.
- Plots of the training and test error (at least three pairs of plots) as described in Part 2.5.
- The measured runtime of all four algorithms. Text describing how the runtimes of the algorithms compare.
- An explanation for why minibatch SGD with sequential sampling runs faster.

Setup:

Run pip3 install -r requirements.txt to install the required python packages