Processing math: 100%

Programming Assignment 2: Deep Learning

CS6787 — Advanced Machine Learning Systems — Fall 2019

Project Due: October 28, 2019 at 11:59pm

Late Policy: Up to two slip days can be used for the final submission.

Please submit all required documents to CMS.

Motivation. This project is designed to familiarize you with using ML frameworks for deep learning. In particular, there are the following learning goals:

Practice writing code that implements neural network training in an ML framework, such as TensorFlow, that supports backpropagation.
Practice replicating results from previously published work.
Evaluate multiple optimization methds for training.
Explore hand-tuning for hyperparameter optimization.

At several points in this project description, you will be presented an alternative choice to one or more of the project instructions. The goal of these choices is to give you more options for exploring aspects of ML frameworks that you might not be familiar with. These choices are formatted in green like this. When making one of these choices, consider your own learning goals, and select the option that will best advance them. Note that my reference solution will use only the main choice given in black text in this assignment description, so it may be somewhat more difficult for us to help you debug if you choose alternately. (But as graduate students taking a 6000-level class, this shouldn't scare you!) If you do make an alternate choice, be sure to indicate that in your project report.

The deliverables for this project will include code and a project report. At certain points in this assignment description, you will be asked to present something in your project report. These choices are formatted in crimson like this. In addition to the crimson-formatted deliverables, your project report should also include a short (1-2 paragraphs) summary of what you did.

In this project, you will be implementing and evaluating multiple learning methods for training deep neural networks on the MNIST digit recognition dataset using a machine learning framework. The MNIST dataset is perhaps the most popular dataset used with machine learning, especially in small "toy" applications or demonstrations. The goal of the MNIST dataset is to predict what digit is represented by an image of a hand-written digit. It has 60000 training examples, and 10000 test examples, each of which is a $28 \times 28$ a grayscale image representing one of the 10 digits.

Since MNIST is such a common dataset, most ML frameworks will have built-in capabilities to load it. For example, in TensorFlow V2, you can use


        import tensorflow as tf
        mnist = tf.keras.datasets.mnist
        (Xs_tr, Ys_tr), (Xs_te, Ys_te) = mnist.load_data()

to load the MNIST training and test datasets.

Part 1: Identifying a Neural Network from the literature. One of the first successful neural networks for MNIST was LeNet-5. By doing a literature search, answer the following questions, and present your findings in your report.

What is the paper in which LeNet-5 was first proposed?
What is architecture of LeNet-5? That is, what layers does it have, how large are they, how are they connected, and what activation function is used?
What accuracy should one expect to get when implementing LeNet-5 on MNIST? That is, what accuacy was achieved in previous work?
What training algorithm was used to
How many parameters does LeNet-5 have? Show your work.

Part 2: Training a network. LeNet-5 is a bit dated, and doesn't actually represent modern deep learning practices. So, instead of training LeNet-5, let's train a simple "toy" convolutional neural network on the MNIST dataset. In TensorFlow, implement a feedforward neural network with the following layers:

A 2D convolution layer using $(3 \times 3)$ filter size, with 32 channels, and a ReLU activation.
A 2D MaxPool layer with a $(2 \times 2)$ downsampling factor.
Another 2D convolution layer using $(3 \times 3)$ filter size, with 32 channels, and a ReLU activation.
Another 2D MaxPool layer with a $(2 \times 2)$ downsampling factor.
A dense layer with a 128-dimensional output and a ReLU activation.
A softmax layer with a 10-dimensional output and a softmax activation, which is the final layer of the network and maps to a distribution over the 10 classes of the MNIST dataset.

Train this network using a categorical cross-entropy loss the following hyperparameters.

Use the default initialization for your ML framework.
Use a train/validation split with 10% of the training data assigned randomly to the validation set.
Run stochastic gradient descent with momentum, using a step size of $\alpha = 0.001$ , a momentum parameter of $\beta = 0.99$ , and a minibatch size of $B = 32$ . (These are mostly just default values for neural network training, and not chosen with much intelligence or tuning.)
Run for 5 epochs. (That is, 5 passes through the training data.)

Before training and after each epoch (i.e. a total of 6 times), measure the following:

The training loss.
The training error.
The validation error.

This setting should achieve validation accuracy and test accuracy of around 99%, and a slightly higher training accuracy. Plot these observed values (on a figure with iterations on the x-axis) and include the resulting figures in your report.

Also measure the amount of wall-clock time it took to train this model, and include this measurement in your report. (For reference, my implementation in TensorFlow V2 took about 2 minutes on my laptop, so if your implementation is substantially slower than this you may have a bug somewhere.)

Alternatively, you may choose to use a ML framework other than TensorFlow for this assignment. If you do, be sure to use a framework that supports automatic differentiation (otherwise, you will have to do much, much more work). Choices here include PyTorch, MXNet, and Flux.

In this part, we will explore how changing the minibatch size hyperparameter $B$ . Suppose that we were to decrease this hyperparameter to $B = 8$ , and then run the same experiment.

Make a hypothesis about how this change will affect the accuracy after training.
Make a hypothesis about how this change will affect the wall-clock time of training.

For each hypothesis, include it in your report, and explain your reasoning.

Now we will evaluate your hypotheses. Run the training process you ran above, now using $B = 8$ . Report your observations, including the same plots and wall-clock time measurement described in Part 2. Were your hypotheses validated or falsified? Explain.

In this part, we will explore how using momentum affects training. Suppose that we were to use plain SGD without momentum (equivalent to setting $\beta = 0$ ), and then run the same experiment (using the original minibatch size of $B = 32$ .

Make a hypothesis about how this change will affect the accuracy after training.
Make a hypothesis about how this change will affect the wall-clock time of training.

For each hypothesis, include it in your report, and explain your reasoning.

Now we will evaluate your hypotheses. Run the training process you ran above in Part 2, now using $\beta = 0$ . Report your observations, including the same plots and wall-clock time measurement described in Part 2. Were your hypotheses validated or falsified? Explain.

Part 5: The effect of network size. In this part, we will explore how changing the network size affects training. Suppose that we were to use a network with more channels. Specifically, suppose that we doubled the number of channels in each convolution layer of the network from 32 to 64 and doubled the output dimension of the dense layer from 128 to 256.

Make a hypothesis about how this change will affect the accuracy after training.
Make a hypothesis about how this change will affect the wall-clock time of training. Your hypothesis should include a guess of by what factor you would expect the wall-clock time of training to increase or decrease due to this change.

For each hypothesis, include it in your report, and explain your reasoning.

Now we will evaluate your hypotheses. Run the training process you ran above in Part 2, now using the increased size network (but with the original values of the minibatch size and momentum). Report your observations, including the same plots and wall-clock time measurement described in Part 2. Were your hypotheses validated or falsified? Explain.

Part 6: Testing other optimizers. Above, we trained a network using momentum SGD. There are many other optimizers that are popular in machine learning. One of them is the Adam optimizer, which we will discuss presently in class.

Use the Adam optimizer with step size parameter $\alpha = 0.001$ and decay parameters $\rho_1 = 0.99$ and $\rho_2 = 0.999$ . If you are using TensorFlow, you may find the code tf.keras.optimizers.Adam(0.001, 0.99, 0.999) useful for doing this. Run the training process you ran above in Part 2 (using the original minibatch size of $B = 32$ and the original network architecture). Report your observations, including the same plots and wall-clock time measurement described in Part 2.

Project Deliverables. Submit your code containing all the functions your implemented. Additionally, submit a written lab report containing the following.

A 1-2 paragraph abstract summarizing what you did in the project.
Your plots and wall-clock time observations for Part 2.
Your hypotheses and explanations of the hypotheses for Part 3.
Your plots and wall-clock time observations for Part 3, and text explaining whether your hypotheses were validated or falsified.
Your hypotheses and explanations of the hypotheses for Part 4.
Your plots and wall-clock time observations for Part 4, and text explaining whether your hypotheses were validated or falsified.
Your hypotheses and explanations of the hypotheses for Part 5.
Your plots and wall-clock time observations for Part 5, and text explaining whether your hypotheses were validated or falsified.
Your plots and wall-clock time observations for Part 6.