Project Due: Monday, March 4, 2024 at 11:59pm
Late Policy: Up to two slip days can be used for the final submission.
Please submit all required documents to CMS.
Motivation. This project is designed to familiarize you with using ML frameworks for deep learning. In particular, it has the following learning goals:
At several points in this project description, you will be presented with an alternative choice to one or more of the project instructions. The goal of these choices is to give you more options for exploring aspects of ML frameworks that you might not be familiar with. These choices are formatted in violet like this. When making one of these choices, consider your own learning goals, and select the option that will best advance them. Note that my reference solution will use only the main choice given in black text in this assignment description, so it may be somewhat more difficult for us to help you debug if you choose an alternative. (But as graduate students taking a 6000-level class, this shouldn't scare you!) If you do make an alternate choice, be sure to indicate that in your project report.
The deliverables for this project will include code and a project report. At certain points in this assignment description, you will be asked to present something in your project report. These deliverables are formatted in crimson like this. In addition to the crimson-formatted deliverables, your project report should also include a short (1-2 paragraphs) summary of what you did.
Overview. In this project, you will be implementing and evaluating multiple learning methods for training deep neural networks on the MNIST digit recognition dataset using a machine learning framework. The MNIST dataset is perhaps the most popular dataset in machine learning, especially for small "toy" applications or demonstrations. The goal of the MNIST dataset is to predict what digit is represented by an image of a hand-written digit. It has 60000 training examples and 10000 test examples, each of which is a \(28 \times 28\) grayscale image representing one of the 10 digits.
Since MNIST is such a common dataset, most ML frameworks will have built-in capabilities to load it. For example, in PyTorch, you can use
import torch
import torchvision

# ToTensor() converts each PIL image into a float tensor with values in [0, 1].
transform = torchvision.transforms.ToTensor()
mnist_path = 'path/to/mnist/data'
# download=True fetches the dataset files into mnist_path if they are not already there.
mnist_train = torchvision.datasets.MNIST(root=mnist_path, train=True, transform=transform, download=True)
mnist_test = torchvision.datasets.MNIST(root=mnist_path, train=False, transform=transform)
to load the MNIST training and test datasets and normalize them to have values in the range [0,1]. This code will also automatically download the dataset if it isn't present.
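For minibatch training, you will also want to iterate over these datasets in minibatches. A minimal sketch using PyTorch's DataLoader (assuming the minibatch size \(B = 32\) used in Part 2; adjust batch_size as the later parts direct):
from torch.utils.data import DataLoader

# Wrap the datasets so that iteration yields minibatches of size B = 32;
# shuffling the training set each epoch is standard practice for SGD.
train_loader = DataLoader(mnist_train, batch_size=32, shuffle=True)
test_loader = DataLoader(mnist_test, batch_size=32, shuffle=False)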
Part 1: Identifying a Neural Network from the literature. One of the first successful neural networks for MNIST was LeNet-5. By doing a literature search, answer the following questions, and present your findings in your report.
Part 2: Training a network. LeNet-5 is a bit dated, and doesn't actually represent modern deep learning practices. So, instead of training LeNet-5, let's train a simple "toy" convolutional neural network on the MNIST dataset. In PyTorch, implement a feedforward neural network with the following layers:
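As a starting point, here is a minimal sketch of such a network in PyTorch. The 32-channel convolutions and the 128-dimensional linear layer match the baseline sizes referenced in Part 5; the kernel sizes, pooling, and activations shown here are assumptions, so be sure to match them against the layer list above:
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3),   # first convolution: 1 input channel, 32 output channels
    nn.ReLU(),
    nn.MaxPool2d(2),                   # assumed 2x2 max pooling
    nn.Conv2d(32, 32, kernel_size=3),  # second convolution, also 32 channels
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 5 * 5, 128),        # linear layer with output dimension 128
    nn.ReLU(),
    nn.Linear(128, 10),                # one output logit per digit class
)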
Also measure the amount of wall-clock time it took to train this model, and include this measurement in your report. (For reference, my implementation in TensorFlow V2 took about 2 minutes on my laptop, and my PyTorch implementation took a similar amount of time, so if your implementation is substantially slower than this you may have a bug somewhere.)
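One simple way to take this measurement is with Python's standard library; in the sketch below, train is a hypothetical stand-in for your own training loop:
import time

start = time.perf_counter()
train(model, train_loader)  # hypothetical: call your training loop here
elapsed = time.perf_counter() - start
print(f'Training took {elapsed:.1f} seconds of wall-clock time')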
Alternatively, you may choose to use an ML framework other than PyTorch for this assignment. If you do, be sure to use a framework that supports automatic differentiation (otherwise, you will have to do much, much more work). Choices here include TensorFlow, MXNet, and Flux.
Part 3: The effect of changing the minibatch size. In this part, we will explore how changing the minibatch size hyperparameter \(B\) affects training. Suppose that we were to decrease this hyperparameter to \(B = 8\), and then run the same experiment.
Now we will evaluate your hypotheses. Run the training process you ran above, now using \(B = 8\). Report your observations, including the same plots and wall-clock time measurement described in Part 2. Were your hypotheses validated or falsified? Explain.
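If you are using a DataLoader as sketched in Part 2, the only change needed is the batch_size argument:
from torch.utils.data import DataLoader

# Same training set as before, but with the minibatch size decreased from 32 to 8.
train_loader = DataLoader(mnist_train, batch_size=8, shuffle=True)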
Part 4: The effect of momentum. In this part, we will explore how using momentum affects training. Suppose that we were to use plain SGD without momentum (equivalent to setting \(\beta = 0\)), and then run the same experiment (using the original minibatch size of \(B = 32\)).
Now we will evaluate your hypotheses. Run the training process you ran above in Part 2, now using \(\beta = 0\). Report your observations, including the same plots and wall-clock time measurement described in Part 2. Were your hypotheses validated or falsified? Explain.
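In PyTorch, this amounts to zeroing the momentum argument of the SGD optimizer; the step size below is a placeholder for whatever value you used in Part 2:
import torch

alpha = 0.1  # placeholder: use the step size from your Part 2 configuration
optimizer = torch.optim.SGD(model.parameters(), lr=alpha, momentum=0.0)  # beta = 0: plain SGD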
Part 5: The effect of network size. In this part, we will explore how changing the network size affects training. Suppose that we were to use a network with more channels. Specifically, suppose that we doubled the number of channels in each convolution layer of the network from 32 to 64 and doubled the output dimension of the linear layer from 128 to 256.
Now we will evaluate your hypotheses. Run the training process you ran above in Part 2, now using the increased size network (but with the original values of the minibatch size and momentum). Report your observations, including the same plots and wall-clock time measurement described in Part 2. Were your hypotheses validated or falsified? Explain.
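Under the same assumed architecture sketched in Part 2, this change only doubles the channel and feature counts:
import torch.nn as nn

model_large = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3),   # 32 -> 64 channels
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 64, kernel_size=3),  # 32 -> 64 channels
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 5 * 5, 256),        # output dimension 128 -> 256
    nn.ReLU(),
    nn.Linear(256, 10),
)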
Part 6: Testing other optimizers. Above, we trained a network using momentum SGD. There are many other optimizers that are popular in machine learning. One of them is the Adam optimizer, which we will discuss shortly in class.
Use the Adam optimizer with step size parameter \(\alpha = 0.001\) and decay parameters \(\rho_1 = 0.99\) and \(\rho_2 = 0.999\). If you are using PyTorch, you may find the code
torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.99, 0.999))
useful for doing this. Run the training process you ran above in Part 2 (using the original minibatch size of \(B = 32\) and the original network architecture).
Report your observations, including the same plots and wall-clock time measurement described in Part 2.
Project Deliverables. Submit your code containing all the functions you implemented. Additionally, submit a written lab report containing the following.