Project Due: October 28, 2019 at 11:59pm
Late Policy: Up to two slip days can be used for the final submission.
Please submit all required documents to CMS.
Motivation. This project is designed to familiarize you with using ML frameworks for deep learning. In particular, there are the following learning goals:
At several points in this project description, you will be presented with an alternative to one or more of the project instructions. The goal of these choices is to give you more options for exploring aspects of ML frameworks that you might not be familiar with. These choices are formatted in green like this. When making one of these choices, consider your own learning goals, and select the option that will best advance them. Note that my reference solution will use only the main choice given in black text in this assignment description, so it may be somewhat more difficult for us to help you debug if you choose an alternative. (But as graduate students taking a 6000-level class, this shouldn't scare you!) If you do make an alternate choice, be sure to indicate that in your project report.
The deliverables for this project will include code and a project report. At certain points in this assignment description, you will be asked to present something in your project report. These choices are formatted in crimson like this. In addition to the crimson-formatted deliverables, your project report should also include a short (1-2 paragraphs) summary of what you did.
Overview. In this project, you will be implementing and evaluating multiple learning methods for training deep neural networks on the MNIST digit recognition dataset using a machine learning framework. MNIST is perhaps the most popular dataset in machine learning, especially in small "toy" applications or demonstrations. The goal of the MNIST task is to predict which digit is represented by an image of a hand-written digit. The dataset has 60000 training examples and 10000 test examples, each of which is a \(28 \times 28\) grayscale image representing one of the 10 digits.
Since MNIST is such a common dataset, most ML frameworks have built-in capabilities to load it. For example, in TensorFlow V2, you can use

import tensorflow as tf
mnist = tf.keras.datasets.mnist
(Xs_tr, Ys_tr), (Xs_te, Ys_te) = mnist.load_data()

to load the MNIST training and test datasets.
Part 1: Identifying a Neural Network from the literature. One of the first successful neural networks for MNIST was LeNet-5. By doing a literature search, answer the following questions, and present your findings in your report.
Part 2: Training a network. LeNet-5 is a bit dated, and doesn't actually represent modern deep learning practices. So, instead of training LeNet-5, let's train a simple "toy" convolutional neural network on the MNIST dataset. In TensorFlow, implement a feedforward neural network with the following layers:
Also measure the amount of wall-clock time it took to train this model, and include this measurement in your report. (For reference, my implementation in TensorFlow V2 took about 2 minutes on my laptop, so if your implementation is substantially slower than this you may have a bug somewhere.)
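A sketch of how Part 2 might look in TensorFlow V2, with wall-clock timing included. The exact layer list is given in the assignment; here the architecture (two 32-channel 3x3 convolutions each followed by 2x2 max-pooling, a 128-unit dense layer, and a 10-way softmax output) as well as the learning rate and momentum values are illustrative assumptions, not assignment-mandated values:

```python
import time
import tensorflow as tf

# Load MNIST, add a channel dimension, and rescale pixels to [0, 1].
(Xs_tr, Ys_tr), (Xs_te, Ys_te) = tf.keras.datasets.mnist.load_data()
Xs_tr = Xs_tr[..., None].astype("float32") / 255.0
Xs_te = Xs_te[..., None].astype("float32") / 255.0

# Assumed architecture (see lead-in): the feature map shrinks
# 28 -> 26 -> 13 -> 11 -> 5, so Flatten yields 5 * 5 * 32 = 800 features.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # 128*10 + 10 = 1290 params
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Wall-clock timing for the report.
start = time.time()
model.fit(Xs_tr, Ys_tr, batch_size=32, epochs=10)
print("training took %.1f seconds" % (time.time() - start))
```

The same `time.time()` bracketing works unchanged for the reruns in Parts 3 through 6.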
Alternatively, you may choose to use an ML framework other than TensorFlow for this assignment. If you do, be sure to use a framework that supports automatic differentiation (otherwise, you will have to do much, much more work). Choices here include PyTorch, MXNet, and Flux.
Part 3: The effect of changing the minibatch size. In this part, we will explore how changing the minibatch size hyperparameter \(B\) affects training. Suppose that we were to decrease this hyperparameter to \(B = 8\), and then run the same experiment.
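One quantity worth keeping in mind when forming your hypotheses: with the number of epochs held fixed, a smaller minibatch size means proportionally more optimizer steps per epoch. A quick back-of-the-envelope check using MNIST's 60000 training examples:

```python
# Number of gradient (optimizer) steps per epoch over MNIST's
# 60000 training examples, for the two minibatch sizes considered.
n_train = 60000

steps_b32 = n_train // 32  # steps per epoch at B = 32
steps_b8 = n_train // 8    # steps per epoch at B = 8
print(steps_b32, steps_b8)  # 4x as many steps at B = 8
```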
Now we will evaluate your hypotheses. Run the training process you ran above, now using \(B = 8\). Report your observations, including the same plots and wall-clock time measurement described in Part 2. Were your hypotheses validated or falsified? Explain.
Part 4: The effect of momentum. In this part, we will explore how using momentum affects training. Suppose that we were to use plain SGD without momentum (equivalent to setting \(\beta = 0\)), and then run the same experiment (using the original minibatch size of \(B = 32\)).
Now we will evaluate your hypotheses. Run the training process you ran above in Part 2, now using \(\beta = 0\). Report your observations, including the same plots and wall-clock time measurement described in Part 2. Were your hypotheses validated or falsified? Explain.
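For intuition, the momentum update can be written as \(v \leftarrow \beta v - \alpha \nabla f(w)\), \(w \leftarrow w + v\). The following toy sketch (a 1-D quadratic, purely illustrative and unrelated to the MNIST experiment) shows that setting \(\beta = 0\) recovers plain SGD:

```python
# Momentum SGD on the toy objective f(w) = w^2 / 2, whose gradient is w.
# With beta = 0 the velocity term vanishes and this is plain SGD.
def sgd_momentum(w0, alpha, beta, steps):
    w, v = w0, 0.0
    for _ in range(steps):
        grad = w                     # gradient of w^2 / 2 at w
        v = beta * v - alpha * grad  # velocity update
        w = w + v                    # parameter update
    return w

print(sgd_momentum(1.0, 0.1, 0.0, 50))  # plain SGD (beta = 0)
print(sgd_momentum(1.0, 0.1, 0.9, 50))  # momentum SGD (beta = 0.9)
```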
Part 5: The effect of network size. In this part, we will explore how changing the network size affects training. Suppose that we were to use a network with more channels. Specifically, suppose that we doubled the number of channels in each convolution layer of the network from 32 to 64 and doubled the output dimension of the dense layer from 128 to 256.
Now we will evaluate your hypotheses. Run the training process you ran above in Part 2, now using the increased size network (but with the original values of the minibatch size and momentum). Report your observations, including the same plots and wall-clock time measurement described in Part 2. Were your hypotheses validated or falsified? Explain.
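When forming hypotheses about the larger network, a rough parameter count is useful. Assuming 3x3 convolution kernels (an assumption about the architecture, not stated in this part), the count for a conv layer is:

```python
# Parameters in a 2-D conv layer with a k x k kernel:
# c_in * k * k * c_out weights, plus c_out biases.
def conv_params(c_in, c_out, k=3):
    return c_in * k * k * c_out + c_out

# The first conv layer sees 1 input channel, so doubling 32 -> 64
# only doubles its size; deeper layers roughly quadruple, since both
# c_in and c_out double.
small = conv_params(1, 32) + conv_params(32, 32)
big = conv_params(1, 64) + conv_params(64, 64)
print(small, big)
```

A similar calculation applies to the dense layers, whose parameter counts scale with the product of their input and output dimensions.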
Part 6: Testing other optimizers. Above, we trained a network using momentum SGD. There are many other optimizers that are popular in machine learning. One of them is the Adam optimizer, which we will discuss presently in class.
Use the Adam optimizer with step size parameter \(\alpha = 0.001\) and decay parameters \(\rho_1 = 0.99\) and \(\rho_2 = 0.999\). If you are using TensorFlow, you may find the code

tf.keras.optimizers.Adam(0.001, 0.99, 0.999)

useful for doing this. Run the training process you ran above in Part 2 (using the original minibatch size of \(B = 32\) and the original network architecture).
Report your observations, including the same plots and wall-clock time measurement described in Part 2.
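For reference while interpreting your results, here is the Adam update written out in scalar form with the assignment's hyperparameters (\(\epsilon = 10^{-8}\) is a common default, not specified in the assignment). A notable property it illustrates: after bias correction, the very first step has magnitude roughly \(\alpha\) regardless of the gradient's scale.

```python
import math

# One scalar parameter's Adam update with the assignment's
# hyperparameters: alpha = 0.001, rho1 = 0.99, rho2 = 0.999.
def adam_step(w, grad, m, v, t, alpha=0.001, rho1=0.99, rho2=0.999, eps=1e-8):
    m = rho1 * m + (1 - rho1) * grad         # first-moment estimate
    v = rho2 * v + (1 - rho2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - rho1 ** t)              # bias correction
    v_hat = v / (1 - rho2 ** t)
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step (t = 1): the update is ~alpha whether the gradient is
# large (100.0) or tiny (0.01).
w1, _, _ = adam_step(5.0, grad=100.0, m=0.0, v=0.0, t=1)
w2, _, _ = adam_step(5.0, grad=0.01, m=0.0, v=0.0, t=1)
print(5.0 - w1, 5.0 - w2)
```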
Project Deliverables. Submit your code containing all the functions you implemented. Additionally, submit a written lab report containing the following.