Data sets for covtype converted to a binary classification problem
(class 1 vs classes 2-7).  There are 54 inputs and one binary output.
All inputs have been scaled to have zero mean and standard deviation
of one, including the boolean variables.  If converting the booleans
this way causes problems, let me know and I'll try to make a matching
set with the booleans unconverted.  The binary output is in the last
column (col 55).
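
A minimal sketch, in Python, of splitting one case into its 54 inputs
and the binary output.  It assumes each case is one line of text with
55 values separated by whitespace or commas; the separator and the
coding of the output are not stated here, so look at a file first.
parse_case is a hypothetical helper, not part of the included script.

```python
N_INPUTS = 54  # inputs per case; the output is column 55


def parse_case(line):
    """Split one text line into (inputs, output)."""
    # Assumption: values are separated by whitespace or commas.
    fields = line.replace(",", " ").split()
    if len(fields) != N_INPUTS + 1:
        raise ValueError("expected 55 values, got %d" % len(fields))
    inputs = [float(v) for v in fields[:N_INPUTS]]
    output = float(fields[N_INPUTS])  # last column is the binary output
    return inputs, output
```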

The script that made the data sets is included in the directory if you
have any questions about how the data was created.

Train sets contain 100, 200, 500, 1000, 2000, 5000, 10000, 20000,
50000, 100000 cases.  I didn't make train sets with 200k cases since
things were starting to get big.

For each train set size there are ten files, numbered 1-10.  This lets
us run 10 trials at each size.  I was going to make 20, but I wanted to
keep the total size under 100 MB.  10 should be enough.  What we really
need is a program that creates each train and test set on the fly,
using the same seeds on all architectures.  Then we wouldn't have to
pass around these big files.  But this low-tech approach will do for
now.  Most of our data sets won't be this large.
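
One way the seeded, on-the-fly idea could look, as a rough Python
sketch.  This is not the included script.  The function name and the
seeding scheme (one integer seed per size/trial pair) are assumptions
made for illustration.  Python's random.Random gives the same draws on
any machine for a given seed, but the sample() algorithm can differ
between Python versions, so everyone would need to use the same
version.

```python
import random


def train_indices(n_cases, size, trial):
    """Pick `size` distinct row indices for one trial, reproducibly."""
    # Hypothetical seeding scheme: a distinct seed for each
    # (size, trial) pair, with trial running from 1 to 10.
    seed = size * 1000 + trial
    rng = random.Random(seed)
    return rng.sample(range(n_cases), size)


# Example: trial 3 of a 100-case train set drawn from 5000 cases.
idx = train_indices(5000, 100, 3)
```

Calling train_indices again with the same arguments returns the same
list, so only the full data file and the seeds would need to be shared.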

Data sets are named:

data.test1.10000.gz
data.test1.100000.gz

data.train.100.1.gz
data.train.100.2.gz
data.train.100.3.gz
...
data.train.100.10.gz
data.train.200.1.gz
...
data.train.200.9.gz
data.train.200.10.gz
data.train.500.1.gz
...

The first two are the test sets.  I made one with only 10,000 cases
because some programs might have trouble with the 100,000 test set.
But if you can use 100k test cases, that's better.

All the files are compressed.  If you have lots of disk space, just
gunzip all of them ("gunzip *.gz").  If you don't have that much space
you probably want to uncompress them as you need them and then
recompress when you're done with them.
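
If your own code reads the data, a third option is to read the
compressed files directly and never unpack them on disk.  A minimal
Python sketch using the standard gzip module; iter_cases is a
hypothetical helper, and it assumes the files are plain text with one
case per line.

```python
import gzip


def iter_cases(path):
    """Yield the non-empty lines of a gzip-compressed text file."""
    # "rt" opens the file in text mode and decompresses as it reads.
    with gzip.open(path, "rt") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line
```

Usage would be something like: for line in
iter_cases("data.train.100.1.gz"): ...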

The program roc.auto.c is included in the directory.  It computes
performance measures such as accuracy, RMS error, and ROC area.
Instructions for compiling and using the code are at the top of the C
file.  We'll also go over this in class.
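
For reference, here is a rough Python sketch of the three measures
named above.  It is not the roc.auto.c code and may differ from it in
details.  It assumes targets are coded 0/1, predictions are scores
where larger means more likely class 1, and accuracy uses a threshold
of 0.5; all three are assumptions, so check the C file for what it
actually expects.

```python
import math


def accuracy(targets, preds, threshold=0.5):
    """Fraction of cases where the thresholded prediction is right."""
    hits = sum((p >= threshold) == (t == 1)
               for t, p in zip(targets, preds))
    return hits / len(targets)


def rms_error(targets, preds):
    """Root-mean-squared difference between predictions and targets."""
    total = sum((p - t) ** 2 for t, p in zip(targets, preds))
    return math.sqrt(total / len(targets))


def roc_area(targets, preds):
    """Area under the ROC curve via the rank-sum formula."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    ranks = [0.0] * len(preds)
    i = 0
    while i < len(order):
        # Find the run of tied predictions starting at position i.
        j = i
        while j + 1 < len(order) and preds[order[j + 1]] == preds[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for t in targets if t == 1)
    n_neg = len(targets) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case of each class")
    rank_sum = sum(r for r, t in zip(ranks, targets) if t == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

The ROC area here is the fraction of (class 1, class 0) pairs that the
predictions rank in the right order, with ties counted as half.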

Don't worry about getting everything perfect the first time.  It's
more important to get started and get some things working.  As you try
to do the experiments, keep track of issues, questions, and problems
and bring them to class.

-Rich.
