CS578 Fall 2001
Empirical Methods in Machine Learning and Data Mining 
Homework Assignment #3
Due: Thursday, November 15, 2001

The goal of this assignment is to experiment with artificial neural
nets trained with backpropagation, early stopping, and N-fold cross
validation.  For this assignment use the same data set used in HW2.
The data is still available from the web page if you want a clean 
copy.  The goal is to predict the same boolean variable (col 1) from 
the other 143 inputs (cols 2-144).  

Because it is difficult to achieve accuracy better than baseline on
this problem, you will use ROC Area as a performance measure.  C-code
for calculating ROC Area is available on the course web page.

You may implement backprop yourself, or use a commercial/public domain
implementation.  Note that it probably will take more time to install,
learn to use, and possibly modify someone else's implementation than
to program a simple backprop yourself, so we encourage you to code
backprop yourself.  In fact, implementing bp counts as extra credit.
If you implement bp, you might want to start with your knn code
because this code already knows how to read/store the data and how to
compute accuracy and RMSE.

If you decide to use someone else's package, it is up to you to make
sure that it will support the experiments needed for this assignment.
One public domain package you might want to consider that runs on a 
variety of platforms is SNNS: Stuttgart Neural Network Simulator.
(Note: We have not used this package before.)

EXPERIMENTS:

 0: Scale each attribute so that the min value of the attribute
    is 0 and the max value is 1: new_val = (val-min)/(max-min).
    Your code from HW2 might help you here.  You should do this
    for the output as well so you can use sigmoid output units.

    Draw one random sample of 5,000 points from the data set to 
    use for the experiments.  Save the remaining 15,000+ points 
    as a final, final test set.
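
    A minimal sketch of step 0, assuming the data has already been
    read into a list of equal-length numeric rows.  The function
    names (minmax_scale, split_sample) and the choice to map a
    constant column to 0 are illustrative assumptions, not part of
    the assignment.

```python
import random

def minmax_scale(rows):
    """Scale every column of rows to [0, 1] with (val-min)/(max-min)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant column (max == min) is mapped to 0.0
            # to avoid dividing by zero.
            (v - l) / (h - l) if h > l else 0.0
            for v, l, h in zip(row, lo, hi)
        ])
    return scaled

def split_sample(rows, n_sample, seed=0):
    """Return (sample, remainder): n_sample random rows and all the others."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    sample = [rows[i] for i in idx[:n_sample]]
    remainder = [rows[i] for i in idx[n_sample:]]
    return sample, remainder
```

    With the full data set in rows, split_sample(minmax_scale(rows),
    5000) would give the 5,000 experiment cases and the held-back
    final test set.  Fixing the seed keeps the split reproducible.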

 1: For neural nets, you need train sets (backprop sets), early
    stopping sets (technically part of the train set), and test 
    sets.  Use N-fold CV from the 5,000 cases for the train/test 
    sets.  The early stopping set should be held out of the train
    set.  One way to do this is to split the data into 5 folds of
    1,000 cases each.  Do backprop on folds 1-3 (75% of the train
    data), use fold 4 for early stopping (25% of the train data),
    and test on fold 5.  Repeat this process 5 times, rotating the
    folds, for 5-fold CV.  There are other ways to do this and you
    don't have to use 5-fold CV.
    Carefully explain how you choose to do N-fold CV.  A diagram 
    or table showing the splits might be helpful.
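
    One possible way to generate the fold rotation described in
    step 1 is sketched below.  The names assign_folds and cv_splits
    are hypothetical, and the particular rotation (the stopping fold
    is always the fold just before the test fold) is only one of
    many valid choices.

```python
import random

def assign_folds(n_cases, n_folds=5, seed=0):
    """Shuffle case indices and deal them into n_folds equal-sized folds."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]

def cv_splits(n_folds=5):
    """Return one (backprop_folds, stop_fold, test_fold) triple per round.

    Fold numbers are 1-based.  Round 1 matches the example in the text:
    backprop on folds 1-3, early stopping on fold 4, test on fold 5.
    """
    plan = []
    for r in range(n_folds):
        test = (n_folds - 1 + r) % n_folds
        stop = (test - 1) % n_folds
        train = [f for f in range(n_folds) if f not in (test, stop)]
        plan.append(([f + 1 for f in train], stop + 1, test + 1))
    return plan

if __name__ == "__main__":
    for rnd, (train, stop, test) in enumerate(cv_splits(5), start=1):
        print("round", rnd, "backprop", train, "stop", stop, "test", test)
```

    Each fold serves as the test fold exactly once and as the early
    stopping fold exactly once, so no test case is ever used for
    training or stopping in the round that tests on it.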

 2: Train fully-connected feedforward neural nets using vanilla
    backpropagation with momentum.  Every backprop implementation
    defines learning rate and momentum somewhat differently, and
    the definitions often vary when using batch mode (updating
    once per epoch -- full pass through the training set) or when
    updating per pattern, so you'll have to experiment with the 
    parameter settings to find values that work well with your code.
    You can use batch mode, per-pattern, or per-group-of-patterns
    updating.  (HINT: If the nets are fully trained after fewer than
    50-100 passes through the train set you're probably training too
    fast.  If the nets are taking more than 10^6 passes through the
    train set you're probably training slower than necessary.)  Feel
    free to use any number of hidden units that seems to work.  I
    often use 16 or 32 hidden units when starting a new problem.
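
    For those coding backprop themselves, here is a rough sketch of
    per-pattern backprop with momentum for a net with one hidden
    layer, sigmoid units, and squared error.  The class name Net,
    the initial weight range, and the default lr and momentum values
    are assumptions chosen for illustration.  As noted above, good
    settings depend on the implementation, so expect to tune them.

```python
import math
import random

def sigmoid(x):
    # Numerically stable form: never calls exp() on a large positive value.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

class Net:
    def __init__(self, n_in, n_hid, seed=0):
        rng = random.Random(seed)
        # The extra weight in each row multiplies a constant 1.0 (bias).
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
                   for _ in range(n_hid)]
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hid + 1)]
        # Previous weight changes, used by the momentum term.
        self.v1 = [[0.0] * (n_in + 1) for _ in range(n_hid)]
        self.v2 = [0.0] * (n_hid + 1)

    def forward(self, x):
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = h + [1.0]
        y = sigmoid(sum(w * v for w, v in zip(self.w2, hb)))
        return xb, hb, y

    def update(self, x, target, lr=0.1, momentum=0.9):
        """One per-pattern update; returns the squared error before it."""
        xb, hb, y = self.forward(x)
        # Delta at the output for E = 0.5 * (y - target)^2.
        d_out = (y - target) * y * (1.0 - y)
        # Hidden deltas use the output weights before they are changed.
        d_hid = [d_out * self.w2[j] * hb[j] * (1.0 - hb[j])
                 for j in range(len(self.w1))]
        for j in range(len(self.w2)):
            self.v2[j] = momentum * self.v2[j] - lr * d_out * hb[j]
            self.w2[j] += self.v2[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                self.v1[j][i] = momentum * self.v1[j][i] - lr * d_hid[j] * xb[i]
                row[i] += self.v1[j][i]
        return 0.5 * (y - target) ** 2
```

    One epoch is a call to update for every case in the backprop
    set, ideally in a freshly shuffled order.  Batch mode would
    instead sum the gradients over the epoch and apply a single
    momentum update at the end.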

 3: Compute accuracy, RMSE, and ROC Area on the train, stopping,
    and test sets. Show graphs of performance vs number of epochs
    for the train and early stopping sets.  The performances on the
    test sets should be reported at the early stopping point.  Are the
    early stopping points for accuracy, RMSE, and ROC Area the same?
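
    The three measures in step 3 can be sketched as follows.  This
    is not the C code from the course web page: roc_area here uses
    the rank-sum (Mann-Whitney) formulation with average ranks for
    ties, which is an assumption about how ROC Area is computed.
    The helper best_epoch and the 0.5 threshold are also
    illustrative choices.

```python
import math

def accuracy(targets, preds, threshold=0.5):
    """Fraction of cases whose thresholded prediction matches the target."""
    hits = sum((p >= threshold) == (t >= 0.5) for t, p in zip(targets, preds))
    return hits / len(targets)

def rmse(targets, preds):
    """Root mean squared error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(targets, preds))
                     / len(targets))

def roc_area(targets, preds):
    """Area under the ROC curve via the rank-sum statistic."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    ranks = [0.0] * len(preds)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and preds[order[j + 1]] == preds[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    n_pos = sum(1 for t in targets if t >= 0.5)
    n_neg = len(targets) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC area needs both classes present")
    pos_rank_sum = sum(r for r, t in zip(ranks, targets) if t >= 0.5)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def best_epoch(curve, higher_is_better):
    """Index of the first epoch with the best value on the stopping set."""
    pick = max if higher_is_better else min
    return pick(range(len(curve)), key=lambda e: curve[e])
```

    Recording all three measures on the early stopping set after
    every epoch and calling best_epoch on each curve (with
    higher_is_better=False for RMSE) gives three candidate stopping
    points that can be compared directly.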

EXTRA CREDIT -- do one or more of the following:

 - experiment with different numbers of hidden units.  You might try
   1,2,4,8,16,32,64,... or even 1,4,16,64,...  What size net yields
   best generalization performance?
 - N-fold CV leaves you with N or more trained neural nets. compare
   the average prediction of these nets with the performance of each
   of the nets alone to see which works better.  use the final test
   set of 15k+ cases as the test set for this comparison.
 - do a study of the effect of altering the learning rate and 
   momentum on the generalization performance of the nets
 - try nets with two or more hidden layers.  do they perform better
   than nets with one hidden layer?  are they harder to train?
 - is there a difference in generalization performance and training
   time between batch mode updates and per-pattern or
   per-group-of-patterns updating?
 - compare weight decay with early stopping.  does one perform better
   than the other?  is one easier to use than the other?
 - use cross-entropy to train the net.  does training on cross-entropy
   yield better accuracy than training on squared error?
 - do feature selection to find a subset of the features that seems
   to perform better than using all the features
 - do a sensitivity analysis to figure out what inputs the trained
   nets use most.  sensitivity analysis can be done by looking at
   derivatives of the output of the net with respect to the inputs,
   or by experimenting with injecting noise into the inputs one at a
   time
 - recode the inputs so that they are symmetric about 0 (i.e., go
   negative and positive).  does this change performance?
 - take variable type into account when coding inputs to the nets.
   does it help performance?
 - implement vanilla backprop with momentum for fully-connected 
   feedforward neural nets containing one hidden layer and trained
   with squared error
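
Two of the options above, averaging the predictions of the N nets and
noise-injection sensitivity analysis, might be sketched as below.  The
function names, the Gaussian noise model, and the default sigma are
all assumptions.  predict stands for whatever function maps one input
row to a trained net's output.

```python
import random

def average_predictions(per_net_preds):
    """Average the predictions of several nets, case by case.

    per_net_preds holds one list of predictions per net, all made on
    the same cases in the same order.
    """
    n_nets = len(per_net_preds)
    return [sum(col) / n_nets for col in zip(*per_net_preds)]

def noise_sensitivity(predict, rows, sigma=0.1, seed=0):
    """Mean absolute change in output when noise is added to one input.

    Returns one score per input column; a larger score suggests the
    net relies more on that input.
    """
    rng = random.Random(seed)
    base = [predict(r) for r in rows]
    scores = []
    for i in range(len(rows[0])):
        total = 0.0
        for r, b in zip(rows, base):
            noisy = list(r)
            noisy[i] += rng.gauss(0.0, sigma)  # perturb only input i
            total += abs(predict(noisy) - b)
        scores.append(total / len(rows))
    return scores
```

The averaged predictions can be scored with the same accuracy, RMSE,
and ROC Area measures as a single net, which makes the comparison on
the final test set straightforward.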
 
Hand in a brief summary of the results with enough documentation so
that we can see what you did and how you did it.  Do not write a
paper.  This is homework, not a class project.  You'll probably want
to use the neural net code for the class project, so effort spent now
to write good code or become familiar with the package you use should
pay off later.

Have fun.