Programming Assignment 4: Training Neural Networks

CS4787 — Principles of Large-Scale Machine Learning — Spring 2021

Project Due: Monday, April 19, 2021 at 11:59pm

Late Policy: Up to two slip days can be used for the final submission.

Please submit all required documents to CMS.

This is a group project. You can either work alone, or work ONLY with the other members of your group, which may contain AT MOST three people in total. Failure to adhere to this rule (by e.g. copying code) may result in an Academic Integrity Violation.

Overview: In this project, you will be learning how to train a deep neural network using a machine learning framework. While you've seen in the homework that differentiating even simple deep neural networks by hand can be tedious, machine learning frameworks that run backpropagation automatically make this easy. This assignment instructs you to use TensorFlow. Although this is not the most popular machine learning framework at the moment (that's PyTorch), the experience should transfer to whatever machine learning frameworks you decide to use in your own projects. (And for those of you who are familiar with PyTorch, this will give you some experience of how TensorFlow does it.) These frameworks and learning methods drive machine learning systems at the largest scales, and the goal of this assignment is to give you some experience working with them so that you can build your intuition for how they work.

This assignment is designed and tested with TensorFlow 2, using Keras (the default front-end for TensorFlow 2). If you run into issues with the code, please check you have the right version of TensorFlow installed.

In this assignment, you are going to explore training a neural network on the MNIST dataset, the same dataset you have been working with so far in this class. MNIST is actually a relatively small dataset to use with Deep Learning, but it's a good first dataset to use to start playing around with these frameworks and learning how they work. While image datasets like MNIST usually use convolutional neural networks (CNNs), here for simplicity we'll mostly look at how a fully connected neural network performs on MNIST, since this is closest to what we've discussed in class.

Please do not wait until the last minute to do this assignment! While I have constructed it so that the programming part will not take too long, actually training the networks can take some time, depending on the machine you run on. It takes my implementation about five minutes to train all the networks (without any hyperparameter optimization).

Instructions: This project is split into three parts: the training and evaluation of a fully connected neural network for MNIST, the exploration of hyperparameter optimization for this network, and the training and evaluation of a convolutional neural network (CNN), which is better suited to image classification tasks.

Part 1: Fully Connected Neural Network on MNIST.

Implement a function, train_fully_connected_sgd, that uses TensorFlow and Keras to train a neural network with the following architecture.
- A dense (fully connected) layer with output size \(d_1 = 1024\) and ReLU activation function.
- A second dense (fully connected) layer with output size \(d_2 = 256\) and ReLU activation function.
- As the last layer, a dense (fully connected) layer with output size \(c = 10\) and softmax activation function.
Run your function to train this network with a cross-entropy loss (hint: you might find the sparse_categorical_crossentropy loss from Keras to be useful here) using stochastic gradient descent (SGD). Use the following hyperparameter settings and instructions:
- Learning rate \(\alpha = 0.1\).
- Minibatch size \( B = 128 \).
- Run for 10 epochs.
- Split the training set into training and validation components, with 10% of the 60000 training examples used for validation. (Hint: one of the arguments to the model.fit method in TensorFlow will cause it to do this automatically.)
Your code should save the following statistics:
- The training loss after each epoch.
- The training accuracy after each epoch.
- The validation loss after each epoch.
- The validation accuracy after each epoch.
- The test loss at the end of training, on the 10000-example MNIST test set.
- The test accuracy at the end of training, on the 10000-example MNIST test set.
- The total wall-clock time used by the training algorithm.
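As a rough illustration of the workflow described above, the sketch below builds the three-layer network with tf.keras.models.Sequential and reads the per-epoch statistics from the object returned by model.fit. The helper names build_fully_connected and fit_and_record are hypothetical, the Flatten layer assumes the inputs arrive as 28x28 images, and passing validation_split to model.fit is one possible way to hold out 10% of the training data. Treat it as a starting point under those assumptions, not as a reference solution.

```python
import time

import tensorflow as tf


def build_fully_connected(d1=1024, d2=256, c=10):
    """Hypothetical helper: the dense architecture listed above, uncompiled."""
    return tf.keras.models.Sequential([
        tf.keras.Input(shape=(28, 28)),  # assumption: unflattened 28x28 images
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(d1, activation="relu"),
        tf.keras.layers.Dense(d2, activation="relu"),
        tf.keras.layers.Dense(c, activation="softmax"),
    ])


def fit_and_record(model, x_train, y_train, x_test, y_test,
                   epochs=10, batch_size=128):
    """Hypothetical helper: train a compiled model and collect the statistics."""
    start = time.time()
    history = model.fit(
        x_train, y_train,
        batch_size=batch_size,
        epochs=epochs,
        validation_split=0.1,  # Keras holds out the last 10% of the arrays
        verbose=0,
    )
    wall_clock = time.time() - start  # times the call to fit only
    test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=0)
    return {
        "train_loss": history.history["loss"],
        "train_accuracy": history.history["accuracy"],
        "val_loss": history.history["val_loss"],
        "val_accuracy": history.history["val_accuracy"],
        "test_loss": test_loss,
        "test_accuracy": test_accuracy,
        "wall_clock_seconds": wall_clock,
    }
```

Before calling fit_and_record, the model would need to be compiled, for example with model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="sparse_categorical_crossentropy", metrics=["accuracy"]). The "accuracy" and "val_accuracy" history keys only exist when that metric is requested at compile time.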
Now modify the function you designed above in Part 1.1 (train_fully_connected_sgd) to support learning with momentum. Train your network with momentum SGD using the following hyperparameter settings and instructions:
- Learning rate \(\alpha = 0.1\).
- Momentum \(\beta = 0.9\).
- Minibatch size \( B = 128 \).
- Run for 10 epochs.
- Split the training set into training and validation components, with 10% of the 60000 training examples used for validation. (Hint: one of the arguments to the model.fit method in TensorFlow will cause it to do this automatically.)
You should save the same statistics as listed above.
Now implement a function, train_fully_connected_adam, that trains the same neural network using Adam. Train your network using the following hyperparameter settings and instructions:
- Learning rate \(\alpha = 0.001\).
- First moment decay \(\rho_1 = 0.99\).
- Second moment decay \(\rho_2 = 0.999\).
- Minibatch size \( B = 128 \).
- Run for 10 epochs.
- Split the training set into training and validation components, with 10% of the 60000 training examples used for validation. (Hint: one of the arguments to the model.fit method in TensorFlow will cause it to do this automatically.)
You should save the same statistics as listed above.
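The optimizer settings listed so far map onto constructor arguments in tf.keras.optimizers. The sketch below shows one such mapping. It assumes that the handout's \(\rho_1\) and \(\rho_2\) correspond to the beta_1 and beta_2 arguments of Keras's Adam, and that \(\beta\) corresponds to the momentum argument of SGD.

```python
import tensorflow as tf

# Plain SGD with step size alpha = 0.1.
sgd = tf.keras.optimizers.SGD(learning_rate=0.1)

# Momentum SGD with alpha = 0.1 and beta = 0.9.
momentum_sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)

# Adam with alpha = 0.001; assumes rho_1 -> beta_1 and rho_2 -> beta_2.
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.99, beta_2=0.999)
```

Note that 0.99 differs from the Keras default for beta_1, so it has to be passed explicitly. It is probably safest to create a fresh optimizer object for each model you train, since optimizers keep internal state tied to the model's variables.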
Finally, let's explore whether batch normalization can help improve our accuracy or convergence speed here. Implement a function, train_fully_connected_bn_sgd, that trains the same neural network using momentum SGD and batch normalization. Add a batch norm layer after each linear layer in the original network. Train this network using the following hyperparameter settings and instructions:
- Learning rate \(\alpha = 0.001\).
- Momentum \(\beta = 0.9\).
- Minibatch size \( B = 128 \).
- Run for 10 epochs.
- Split the training set into training and validation components, with 10% of the 60000 training examples used for validation. (Hint: one of the arguments to the model.fit method in TensorFlow will cause it to do this automatically.)
You should save the same statistics as listed above.
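One possible reading of "a batch norm layer after each linear layer" is sketched below: each Dense layer is created without an activation, followed by tf.keras.layers.BatchNormalization, and then by the activation as a separate layer. This placement (before the nonlinearity, and also before the final softmax) is an assumption about what the instructions intend, and the helper name build_fully_connected_bn is hypothetical.

```python
import tensorflow as tf


def build_fully_connected_bn(d1=1024, d2=256, c=10):
    """Hypothetical helper: Dense -> BatchNormalization -> activation, three times."""
    layers = [
        tf.keras.Input(shape=(28, 28)),  # assumption: unflattened 28x28 images
        tf.keras.layers.Flatten(),
    ]
    for width, activation in [(d1, "relu"), (d2, "relu"), (c, "softmax")]:
        layers.append(tf.keras.layers.Dense(width))           # linear part only
        layers.append(tf.keras.layers.BatchNormalization())   # normalize its output
        layers.append(tf.keras.layers.Activation(activation))
    return tf.keras.models.Sequential(layers)
```

If you instead interpret the instruction as placing batch normalization after the activation, the Dense layers can keep their activation argument and the separate Activation layers can be dropped.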
For each of the four training algorithms you ran above, plot the following two figures:
- One figure that displays training loss, validation loss, and final test loss (should be displayed as a horizontal line) versus epoch number.
- A second figure that displays training accuracy, validation accuracy, and final test accuracy (should be displayed as a horizontal line) versus epoch number.
This is a total of eight figures.
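A figure of the kind described above could be produced with matplotlib roughly as follows. The function name plot_metric and its arguments are hypothetical. It draws the two per-epoch curves and uses axhline for the single final test value.

```python
import matplotlib.pyplot as plt


def plot_metric(train_values, val_values, test_value, ylabel, title):
    """Hypothetical helper: per-epoch curves plus a horizontal line for the test value."""
    epochs = range(1, len(train_values) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, train_values, label="training")
    ax.plot(epochs, val_values, label="validation")
    # The test metric is measured once, after training, so it is a flat line.
    ax.axhline(test_value, color="gray", linestyle="--", label="test (final)")
    ax.set_xlabel("epoch")
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.legend()
    return fig
```

The returned figure can be written to disk with fig.savefig("some_name.png"). Calling the function once with losses and once with accuracies gives the two figures for one algorithm.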
Report the wall-clock time used by each algorithm for training. How does the performance of the different algorithms compare? Please explain briefly.

Part 2: Hyperparameter Optimization.

For the SGD with momentum algorithm, use grid search to select the step size parameter from the options \(\alpha \in \{1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001\} \) that maximizes the validation accuracy. Report the values of the validation accuracy and validation loss you observed for each setting of \(\alpha\), and report the \(\alpha\) that the grid search selected. Did the step size found by grid search improve over the step size given in the instructions in Part 1?
Now choose any one of the four algorithms from Part 1, and choose any three hyperparameters you want to explore (e.g. the momentum, the layer width, the number of layers, et cetera). For each hyperparameter, choose a grid of possible values you think will be good to explore. Report your grid, and justify your selection. Then run grid search using the grid you selected. Report the best parameters you found, and the corresponding validation accuracy, validation loss, test accuracy, and test loss.
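Grid search over one or several hyperparameters can be organized around itertools.product, as in the sketch below. The names grid_search and evaluate are hypothetical. Here evaluate stands for whatever function you write that trains a network with the given settings and returns its validation accuracy. A one-dimensional grid, such as the step-size grid above, is just the special case of a dictionary with a single key.

```python
import itertools


def grid_search(grid, evaluate):
    """Try every combination in `grid` and return the best one.

    grid:     dict mapping a hyperparameter name to a list of candidate values
    evaluate: callable taking a dict of settings and returning a score to
              maximize (for example, validation accuracy)
    """
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results
```

The number of training runs is the product of the grid sizes, so three hyperparameters with four values each already require 64 runs. Keeping the full results list makes it easy to report the score for every setting, not only the best one.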
Now use random search to explore the same space as you did above in Part 2.2. For each hyperparameter you explored, choose a distribution over possible values that covers the same range as the grid you chose for that hyperparameter in Part 2.2. Report your distribution, and justify your selection. Then run random search using the distribution you selected, running at least 10 random trials. Report the best parameters you found, and the corresponding validation accuracy, validation loss, test accuracy, and test loss.
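Random search replaces the fixed grid with independent draws from a distribution for each hyperparameter. The sketch below is one minimal way to structure it. The names random_search, sample_log_uniform, samplers, and evaluate are hypothetical, and the log-uniform sampler is only one plausible choice for scale-like parameters such as the step size.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_search(samplers, evaluate, num_trials=10, seed=0):
    """Evaluate `num_trials` random settings and return the best one.

    samplers: dict mapping a hyperparameter name to a function rng -> value
    evaluate: callable taking a dict of settings and returning a score to
              maximize (for example, validation accuracy)
    """
    rng = random.Random(seed)  # fixed seed so the trials are reproducible
    results = []
    for _ in range(num_trials):
        params = {name: sampler(rng) for name, sampler in samplers.items()}
        results.append((evaluate(params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results
```

For example, samplers could be {"alpha": lambda rng: sample_log_uniform(rng, 0.001, 1.0), "width": lambda rng: rng.choice([256, 512, 1024])}. Unlike grid search, the cost is set directly by num_trials and does not grow with the number of hyperparameters.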
How did the performance of grid search compare to random search? How did the total amount of time spent running grid search compare to the total amount of time spent running random search?

Part 3: Convolutional Neural Networks.

Implement a function, train_CNN_sgd, that uses TensorFlow and Keras to train a convolutional neural network with the following architecture.
- A 2D convolution layer with 32 filters, a kernel size of (5,5), and ReLU activation function.
- A 2D max pooling layer with pool size (2,2). (This downscales the image by a factor of 2.)
- A second 2D convolution layer with 64 filters, a kernel size of (5,5), and ReLU activation function.
- A 2D max pooling layer with pool size (2,2). (This downscales the image by a factor of 2.)
- A dense (fully connected) layer with output size 512 and ReLU activation function.
- As the last layer, a dense (fully connected) layer with output size \(c = 10\) and softmax activation function.
Run your function to train this network with a cross-entropy loss (as before), using the Adam optimizer and the following hyperparameter settings and instructions:
- Learning rate \(\alpha = 0.001\).
- First moment decay \(\rho_1 = 0.99\).
- Second moment decay \(\rho_2 = 0.999\).
- Minibatch size \( B = 128 \).
- Run for 10 epochs.
- Split the training set into training and validation components, with 10% of the 60000 training examples used for validation. (Hint: one of the arguments to the model.fit method in TensorFlow will cause it to do this automatically.)
You should save the same statistics as listed above.
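The convolutional architecture listed above could be expressed in Keras along the following lines. The helper name build_cnn is hypothetical, and the sketch makes two assumptions that the list does not state: a Flatten layer sits between the last pooling layer and the first dense layer, and the convolutions use Keras's default "valid" padding. Under those assumptions the spatial size goes 28 -> 24 -> 12 -> 8 -> 4, so the flattened vector has 4 * 4 * 64 = 1024 entries.

```python
import tensorflow as tf


def build_cnn(c=10):
    """Hypothetical helper: the CNN architecture listed above, uncompiled."""
    return tf.keras.models.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),  # one grayscale channel
        tf.keras.layers.Conv2D(32, (5, 5), activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Conv2D(64, (5, 5), activation="relu"),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Flatten(),  # assumption: not in the list, needed before Dense
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(c, activation="softmax"),
    ])
```

Conv2D expects an explicit channel axis, so MNIST images stored as arrays of shape (N, 28, 28) would need to be reshaped to (N, 28, 28, 1), for example with x[..., None], before being passed to this model.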
Plot the following two figures:
- One figure that displays training loss, validation loss, and final test loss (should be displayed as a horizontal line) versus epoch number.
- A second figure that displays training accuracy, validation accuracy, and final test accuracy (should be displayed as a horizontal line) versus epoch number.
This is a total of two figures.
Report the wall-clock time used by the algorithm for training. How does the performance compare to the performance of the fully connected network you studied in Part 1?

Hints! You may find the following functions useful for doing this assignment.

tf.keras.models.Sequential
Layers found in the tf.keras.layers module
Optimization algorithms found in the tf.keras.optimizers module
The Keras model's compile and fit methods

What to submit:

An implementation of the functions in main.py.
A lab report containing:
- A one-paragraph summary of what you observed during this programming assignment.
- Plots of the training/validation/test loss/accuracy of the various training algorithms, as described in Parts 1 and 3.
- Wall clock times for each algorithm used for training.
- A paragraph explanation of how the performance of the different algorithms in Part 1 and Part 3 compares.
- A table that lists the values of the validation accuracy/loss you observed in Part 2.1, the step size that the grid search selected, and a comparison to the step size given in the Part 1 instructions.
- A description of the grid you used in Part 2.2, and a justification for why you used that grid.
- The best hyperparameter values you found in your grid search in Part 2.2, and the corresponding validation/test loss/accuracy.
- A description of the distribution you used in Part 2.3, and a justification for why you used that distribution.
- The best hyperparameter values you found in your random search in Part 2.3, and the corresponding validation/test loss/accuracy.
- A short paragraph comparison of the performance of random search and grid search, as described in Part 2.4.

Setup:

Run pip3 install -r requirements.txt to install the required Python packages.