Programming Assignment 3: Accelerating SGD

CS4787 — Principles of Large-Scale Machine Learning — Spring 2020

Project Due: April 8, 2020 at 11:59pm

Late Policy: Up to two slip days can be used for the final submission.

Please submit all required documents to CMS.

This is a partner project. You can either work alone, or work ONLY with your chosen partner. Failure to adhere to this rule (by e.g. copying code) may result in an Academic Integrity Violation.

Overview: In this project, you will be testing some of the methods we discussed in class for accelerating stochastic gradient descent on the same MNIST task as in Programming Assignment 1. These methods are all used in practice for machine learning systems at even the largest scales, and the goal of this assignment is to give you some experience working with them so that you can build your intuition for how they work.

Background: In the last programming assignment, we looked at minibatched SGD with sequential sampling (Algorithm 4 from that project):

\begin{algorithm} \caption{Minibatch SGD with Fixed $\alpha$ and Sequential Sampling} \begin{algorithmic} \PROCEDURE{SGD}{$(f_1, \ldots, f_n), \alpha, w_0, B, \texttt{num\_epochs}$} \STATE \textbf{initialize} $w \leftarrow w_0$ \FOR{$t = 1$ \TO $\texttt{num\_epochs}$} \FOR{$i = 0$ \TO $(n / B) - 1$} \STATE $w \leftarrow w - \alpha \cdot \frac{1}{B} \sum_{b = 1}^B \nabla f_{i \cdot B + b}(w)$ \ENDFOR \ENDFOR \RETURN $w$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}

As usual, $\texttt{num\_epochs}$ denotes the number of epochs, or passes through the dataset, used in training, and the total number of iterations here will be $T = \texttt{num\_epochs} \cdot n / B$, where we assume that $B$ divides $n$ evenly. When using stochastic gradients in this programming assignment, we are going to base our algorithms off of this one, by using for all our stochastic training algorithms both (1) minibatching and (2) sequential scan through the data.

Instructions: This project is split into three parts: the implementation and evaluation of momentum on top of gradient descent, the implementation and evaluation of momentum with stochastic gradients, and the exploration of Adam (an algorithm that uses adaptive learning rates).

Part 1: Momentum with gradient descent.

Implement a function to compute the gradient of a batch of training examples for multinomial logistic regression (with l2 regularization) on MNIST.

You can use the function you implemented in Programming Assignment 2 for this task.

Implement a function to compute the training/test error of multinomial logistic regression on MNIST.

You can use the function you implemented in Programming Assignment 1 for this task.

Implement a function to compute the training loss of multinomial logistic regression (with l2 regularization) on MNIST.
Implement a function to run gradient descent on multinomial logistic regression (with l2 regularization) on MNIST.

You can adapt the function you implemented in Programming Assignment 1 for this task.

Implement a function to run gradient descent using Nesterov's momentum on multinomial logistic regression (with l2 regularization) on MNIST. Here is some pseudocode for that algorithm.
\begin{algorithm} \caption{Nesterov's momentum GD} \begin{algorithmic} \PROCEDURE{Nesterov}{$f, \alpha, \beta, w_0, \texttt{num\_epochs}$} \STATE \textbf{initialize} $w_0 \leftarrow w_0$ \STATE \textbf{initialize} $v_0 \leftarrow w_0$ \FOR{$t = 1$ \TO $\texttt{num\_epochs}$} \STATE $v_t \leftarrow w_{t-1} - \alpha \nabla f(w_{t-1})$ \STATE $w_t \leftarrow v_t + \beta (v_t - v_{t-1})$ \ENDFOR \RETURN $w$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}
Using an L2 regularization parameter of $\gamma = 0.0001$ and step size $\alpha = 1.0$ for 100 epochs/iterations, record the value of the parameters every iteration (i.e. once per full gradient step), for the following three algorithms and parameter settings:

Gradient descent.
Nesterov's momentum with parameter $\beta = 0.9$.
Nesterov's momentum with parameter $\beta = 0.99$.

Evaluate the training error, test error, and training loss of each recorded model exactly, using the entire MNIST training and test dataset.
Plot the resulting error and loss against the number of epochs in three figures, one for Training error, one for Test error, and one for Training Loss. How does the performance of the three methods compare?
Measure the runtime of the two algorithms (Gradient descent and Nesterov's momentum) under the above parameter settings. How does the time to run an iteration compare between the two algorithms?

To measure the runtime of each algorithm, measure the wall clock time it takes it to run all its epochs, and average your result across five (5) total runs of the algorithm.

Now test some other values of the hyperparameters $\alpha$ and $beta$ for both algorithms. How do the best settings you can come up with compare between gradient descent and Nesterov's momentum? List at least two hyperparameter settings you tried for each of the two algorithms, and describe what you observed by looking at the training error, test error, and training loss.

Part 2: Momentum with SGD.

Implement a function to compute minibatch sequential sampling SGD for multinomial logistic regression (with l2 regularization) on MNIST.

You can use the function you implemented in Programming Assignment 2 for this task.

Implement a function to compute minibatch SGD with momentum and sequential sampling for multinomial logistic regression (With l2 regularization) on MNIST. Here is some pseudocode for that algorithm.
\begin{algorithm} \caption{SGD with Momentum using Minibatching and Sequential Sampling} \begin{algorithmic} \PROCEDURE{MomentumSGD}{$(f_1, \ldots, f_n), \alpha, \beta, w_0, B, \texttt{num\_epochs}$} \STATE \textbf{initialize} $w \leftarrow w_0$ \STATE \textbf{initialize} $v \leftarrow 0$ \FOR{$t = 1$ \TO $\texttt{num\_epochs}$} \FOR{$i = 0$ \TO $(n / B) - 1$} \STATE $v \leftarrow \beta v - \alpha \cdot \frac{1}{B} \sum_{b = 1}^B \nabla f_{i \cdot B + b}(w)$ \STATE $w \leftarrow w + v$ \ENDFOR \ENDFOR \RETURN $w$ \ENDPROCEDURE \end{algorithmic} \end{algorithm}
Using an L2 regularization parameter of $\gamma = 0.0001$, step size $\alpha = 0.2$, and minibatch size $B = 600$ for 10 epochs, record the value of the parameters every 10 iterations (i.e. ten times per epoch), for the following three algorithms and parameter settings:

Stochastic gradient descent.
Momentum with SGD with parameter $\beta = 0.9$.
Momentum with SGD with parameter $\beta = 0.99$.

Evaluate the training error, test error, and training loss of each recorded model exactly, using the entire MNIST training and test dataset.
Plot the resulting error and loss against the number of epochs in three figures, one for Training error, one for Test error, and one for Training Loss. How does the performance of the three methods compare?
Measure the runtime of the two algorithms (SGD and SGD+Momentum) under the above parameter settings. How does the time to run an iteration compare between the two algorithms?

To measure the runtime of each algorithm, measure the wall clock time it takes it to run all its epochs, and average your result across five (5) total runs of the algorithm.

Now test some other values of the hyperparameters $\alpha$ and $beta$ for both algorithms. How do the best settings you can come up with compare between SGD and SGD+momentum? List at least two hyperparameter settings you tried for each of the two algorithms, and describe what you observed by looking at the training error, test error, and training loss.

Part 3: ADAM.

Implement a function to compute the Adam optimization algorithm (using minibatching and sequential sampling) for multinomial logistic regression (with l2 regularization) on MNIST. Here is some pseudocode for that algorithm.
\begin{algorithm} \caption{Adam} \begin{algorithmic} \State \textbf{input:} learning rate factor $\alpha$, decay rates $\rho_1$ and $\rho_2$, initial parameters $w$ \State \textbf{input:} minibatch size $B$, small constant $\epsilon$ to avoid division by $0$ (e.g. $\epsilon = 10^{-5}$) \State \textbf{initialize } $r \leftarrow 0$ (a vector in $\R^d$) \State \textbf{initialize } $s \leftarrow 0$ (a vector in $\R^d$) \State \textbf{initialize timestep } $t \leftarrow 0$ \For{$k = 1$ \TO $\texttt{num\_epochs}$} \For{$i = 0$ \TO $(n / B) - 1$} \State $t \leftarrow t + 1$ \State sample a stochastic gradient $g \leftarrow \frac{1}{B} \sum_{b = 1}^B \nabla f_{i \cdot B + b}(w)$ \State accumulate first moment estimate $s_j \leftarrow \rho_1 s_j + (1 - \rho_1) g_j$ for all $j \in \{1, \ldots, d \}$ \State accumulate second moment estimate $r_j \leftarrow \rho_2 r_j + (1 - \rho_2) g_j^2$ for all $j \in \{1, \ldots, d \}$ \State correct first moment bias $\hat s \leftarrow \frac{s}{1 - \rho_1^t}$ \State correct second moment bias $\hat r \leftarrow \frac{r}{1 - \rho_2^t}$ \State update model $w_j \leftarrow w_j - \frac{\alpha}{\sqrt{\hat r_j + \epsilon}} \cdot \hat s_j$ for all $j \in \{1, \ldots, d \}$ \EndFor \EndFor \Return $w$ \end{algorithmic} \end{algorithm}
Using an L2 regularization parameter of $\gamma = 0.0001$, and minibatch size $B = 600$ for 10 epochs, record the value of the parameters every 10 iterations (i.e. ten times per epoch), for the following two algorithms and parameter settings:

Stochastic gradient descent with step size $\alpha = 0.2$. (You can re-use your results from Part 2 for this.)
Momentum with SGD with step size $\alpha = 0.01$, first moment decay $\rho_1 = 0.9$, and second moment decay $\rho_2 = 0.999$.

Evaluate the training error, test error, and training loss of each recorded model exactly, using the entire MNIST training and test dataset.
Plot the resulting error and loss against the number of epochs in three figures, one for Training error, one for Test error, and one for Training Loss. How does the performance of the two methods compare?
Measure the runtime of the two algorithms (SGD and Adam) under the above parameter settings. How does the time to run an iteration compare between the two algorithms?

To measure the runtime of each algorithm, measure the wall clock time it takes it to run all its epochs, and average your result across five (5) total runs of the algorithm.
For SGD, you can re-use your measurement from Part 2.

Now test some other values of the hyperparameters $\alpha$, $rho_1$, and $\rho_2$ for Adam. How do the best settings you can come up with compare between SGD (from Part 2) and Adam? List at least three hyperparameter settings you tried for Adam, and describe what you observed by looking at the training error, test error, and training loss.

What to submit:

An implementation of the functions in main.py.
A lab report containing:
- A one-paragraph summary of what you observed during this programming assignment.
- Plots of the training error, test error, and training loss of the result of your training, as described in Parts 1, 2, and 3.
- Text describing how the convergence performance of the various methods compares, as described in Parts 1, 2, and 3.
- Measurements of runtime for the various algorithms, as described in Parts 1, 2, and 3.
- Text describing how the runtime of the various algorithms compare, as described in Parts 1, 2, and 3.
- A list of hyperparameters you evaluated, as described in Parts 1, 2, and 3.
- Text describing how the best hyperparameter settings you tried compared among the different algorithms, and descriptions of what you observed while trying different hyperparameters for the algorithms, as described in Parts 1, 2, and 3.

Setup:

Run pip3 install -r requirements.txt to install the required python packages