TABLE OF CONTENTS

1 INTRODUCTION
2 GETTING STARTED
3 MODEL LIBRARY AND FORMAT
4 OPTIONS EXPLAINED
5 REFERENCES
6 DOCUMENT HISTORY

-----------------------------------------------------------------------
INTRODUCTION

Shotgun is a Java implementation of ensemble selection [1].  The code
base includes many more options and functionality modes than the ones
described in [1].  These exist largely as a record of past experiments
and are not needed in most circumstances.

-----------------------------------------------------------------------
GETTING STARTED

This section assumes you have a model library saved in folder
'library/', and this library contains two test sets: test1 and test2.
We're going to use test1 as the hillclimbing set (HC set) and test2
for evaluating the performance of the final ensemble.  (As an aside,
this set goes by different names; the most common alternatives are
validation set and training set.  The latter refers to the fact that
the HC set is used to train the ensemble.  "Training set" in this
sense is frequently spotted in the actual source code, despite my
dislike for this usage.)

First, compile the code into a jarfile by running 'make jarfile'.  You
should now see a file with a name like "shotgun.XX.Y.jar".

To see all command usage for shotgun:

% java -jar shotgun.XX.Y.jar

To run shotgun without bagging (it runs in a minute or two),
optimizing accuracy:

% java -jar shotgun.XX.Y.jar -sfr 200 -acc library test1

The "-sfr 200" flag warrants explanation.  It tells shotgun to use
forward selection with replacement ("fr") and to do a sort
initialization of the ensemble before hillclimbing (the "s").
Hillclimbing will take 200 steps.
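
To make the search concrete, here is a minimal Python sketch of
sort-initialized forward selection with replacement.  This is an
illustration only, not shotgun's Java code: it assumes accuracy as the
metric, represents each model by its prediction vector on the HC set,
and simplifies sort initialization to seeding with the single best
model.

```python
def accuracy(preds, targets, thresh=0.5):
    # Fraction of points whose thresholded prediction matches the label.
    return sum((p > thresh) == bool(t)
               for p, t in zip(preds, targets)) / len(targets)

def sfr(models, targets, steps):
    # Sort initialization ("s"): seed the ensemble with the best model.
    ensemble = [max(models, key=lambda m: accuracy(m, targets))]
    for _ in range(steps - 1):
        # Forward selection with replacement ("fr"): a model may be
        # added repeatedly, which acts like weighting it more heavily.
        def score(candidate):
            n = len(ensemble) + 1
            avg = [(sum(col) + c) / n
                   for col, c in zip(zip(*ensemble), candidate)]
            return accuracy(avg, targets)
        ensemble.append(max(models, key=score))
    # The ensemble's prediction is the average of its members'.
    return [sum(col) / len(ensemble) for col in zip(*ensemble)]
```

With "-sfr 200" the loop above would run for 200 steps; the real
implementation supports many metrics besides accuracy.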

If this works you will see files called perf.test1.1 and perf.test2.1
in your current directory that contain the results of the
hillclimbing.  The files contain performance measured by multiple
metrics, not just the one used for hillclimbing.  For example,

1 ACC 0.858 dt.covtype.shotgun.cv.dt.fold1.isobsted1024.progbayes
1 RMS 0.3227090101285506 dt.covtype.shotgun.cv.dt.fold1.isobsted1024.progbayes
1 ROC 0.9191309713669884 dt.covtype.shotgun.cv.dt.fold1.isobsted1024.progbayes
1 ALL 0.8099592472086845 dt.covtype.shotgun.cv.dt.fold1.isobsted1024.progbayes
...

The first column is the step index.  In the example, the first model
added to the ensemble is a boosted decision tree.  The second and
third columns show the metric and its performance at the current
ensemble size.  The last column is the name of the model.  Use grep
to look at one metric at a time, e.g. "grep ' ACC ' perf.test1.1".

To run shotgun with the options described in the ensemble selection
paper (basically adding bagging):

% java -jar shotgun.XX.Y.jar -sfr 200 -acc -bag 20 0.5 1 library test1

This has model bagging turned on; each of the 20 bags uses 50% of the
models.  Using bagging with 20 bags will cause shotgun to run about
20x longer than without.  With bagging, perf files are produced for
each bag, as well as a final summary file that shows the performance
after combining bags.

One thing to note is that if you are not using bagging, shotgun does
not pick a hillclimbing step as "best".  You will need to pick the
stopping point and look at the test2 performance at that point.
Typically this is done by finding the best performance for your metric
in the HC set (test1).  The script "sift.pl" in scripts/ automates
this task.  See its command usage for more information.
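
The stopping-point selection that sift.pl automates can be sketched as
follows (illustrative Python with made-up scores; this is not the
sift.pl code itself):

```python
def best_step(hc_scores, test_scores, higher_is_better=True):
    # hc_scores / test_scores: one score per hillclimbing step, for
    # the HC set (test1) and the final test set (test2) respectively.
    pick = max if higher_is_better else min
    step = hc_scores.index(pick(hc_scores))
    # Report the *test* performance at the step chosen on the HC set.
    return step, test_scores[step]

# Hypothetical accuracies at steps 0..3 on each set.
step, final_acc = best_step([0.84, 0.86, 0.87, 0.85],
                            [0.83, 0.85, 0.84, 0.82])
```

Note that the stopping point must come from the HC set; picking the
step with the best test2 score directly would bias the final estimate.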

-----------------------------------------------------------------------
MODEL LIBRARY AND FORMAT

Shotgun expects the model library to have a particular format.  There
should be a directory dedicated to holding the models; subdirectories
correspond to different data splits.  We'll refer to different data
splits as different test sets.  The subfolder names are usually train,
test1, test2, ... testN.  Do not place other subfolders in the library
directory; shotgun will try to treat all subdirectories as test sets.

Inside each test folder there are a series of model files.  For
maximum flexibility, a model is represented by the predictions it
makes on points in the test set.  There is one prediction per line.
For example, if test set 1 contains 1000 points, then all model files
in test1/ contain 1000 lines, each containing a single number between
0 and 1.

Shotgun treats a prediction as the "probability" that the point
belongs to the positive class.

There are 3 requirements regarding naming of model files.  First, the
file extension for a model file should match its test set.  So, all
models under test1/ should end with .test1.  Second, a model present
in any test folder must be present in all test folders (modulo the
file extension).  Third, there must be a targets file for each test
set.  This file contains the truth labels for the points (either a 0
or 1 on each line).  So for test2, targets.test2 must exist.
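
These three requirements can be checked mechanically.  A rough Python
sketch of such a check (not part of shotgun; directory names follow
the example above):

```python
import os

def check_library(library_dir):
    # Shotgun treats every subdirectory as a test set.
    stems_by_set = {}
    for test_set in sorted(os.listdir(library_dir)):
        folder = os.path.join(library_dir, test_set)
        if not os.path.isdir(folder):
            continue
        files = os.listdir(folder)
        ext = "." + test_set
        # 1. Every model file's extension must match its test set.
        if any(not f.endswith(ext) for f in files):
            return False
        # 3. A targets file must exist for each test set.
        if "targets" + ext not in files:
            return False
        stems_by_set[test_set] = {f[:-len(ext)] for f in files}
    # 2. The same model names must appear in every test folder.
    stems = list(stems_by_set.values())
    return bool(stems) and all(s == stems[0] for s in stems)
```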

-----------------------------------------------------------------------
OPTIONS EXPLAINED

The command line usage for many options in shotgun is pretty cryptic.
This section briefly explains the options that need more explanation.
Many of these are not needed in normal use.

-d   float               -> weight decay

[I honestly don't know.  Alex?]

-bsp numbsp numpts seed  -> bootstrapping

Make bootstrap samples of the points and measure the average
performance of the model / ensemble over the different bootstraps.
Very expensive.  The arguments are number of bootstraps, number of
points in each bootstrap, and a seed for a random number generator.
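
The idea can be sketched in Python (illustrative only, assuming
accuracy as the metric; this is not shotgun's implementation):

```python
import random

def accuracy(preds, targets, thresh=0.5):
    return sum((p > thresh) == bool(t)
               for p, t in zip(preds, targets)) / len(targets)

def bootstrap_perf(preds, targets, numbsp, numpts, seed):
    rng = random.Random(seed)        # the seed argument of -bsp
    scores = []
    for _ in range(numbsp):          # numbsp bootstrap samples
        # Draw numpts point indices with replacement.
        idx = [rng.randrange(len(targets)) for _ in range(numpts)]
        scores.append(accuracy([preds[i] for i in idx],
                               [targets[i] for i in idx]))
    return sum(scores) / numbsp      # average over the bootstraps
```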

-stepwise                -> resample models at every hillclimbing step.
                            Only meaningful if -bag option is used.

Instead of choosing a subset of models to use once per bag, choose a
different subset at each hillclimbing step.  We're not sure yet how
this compares to bagging without -stepwise.

-ir                      -> incrementally report perf stats after each bag
                            is added.  Useful for plots across # bags used.
                            Only meaningful if -bag option is used.

Good for generating learning curves as a function of number of bags.

-prune                   -> prune model library before hillclimbing

Experimentally this seems to reliably yield a small improvement, but
it is expensive (especially with large libraries and/or expensive
metrics).  Currently the number of models to keep is determined by
running ensemble selection with different percentages of the library
(top 5% of models, top 10% of models, ...) and then using a heuristic
to pick a percentage that looks promising based on hillclimbing set
performance.  A better/cheaper way to find the prune amount is still
needed.

-t   [float|data]     -> set threshold to float or determine from training data

Many metrics use a threshold on prediction values.  For example,
accuracy by default treats predictions > 0.5 as positive class.  For
data sets with skewed class distributions this may not be optimal.
This option lets the user set the threshold; alternatively, '-t data'
sets the threshold to 1 - %positive (one minus the fraction of
positive examples).  See also -autothresh.
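
For example, a data set that is 25% positive yields a threshold of
0.75.  A one-line sketch of the computation (illustrative, not
shotgun's code):

```python
def data_threshold(targets):
    # '-t data': one minus the fraction of positive (label 1) examples.
    return 1.0 - sum(targets) / len(targets)
```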

-l                       -> output performance in terms of loss

Normalize metric scores so that 0 is always best.

-cv folds bags inc seed  -> do cross validation

Shotgun automatically chooses the best percentage of models to use in
model bagging by doing cross-validation on the hillclimb set.  The
arguments here are the number of folds to use, the number of bags to
average for each fold, the increment amount for testing bag
percentages (e.g. 0.1), and a random seed.

-autothresh              -> set threshold automatically

For each model (or ensemble), set the threshold automatically to yield
best performance on the hillclimbing set with respect to the target
metric.  When deciding which model to add to the ensemble, a separate
threshold is set for each potential addition.  This can be a little
expensive.  If the target metric does not use a threshold, this option
is ignored.
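
A minimal sketch of this search, again assuming accuracy as the target
metric (illustrative; shotgun's actual candidate set and tie-breaking
may differ):

```python
def accuracy(preds, targets, thresh):
    return sum((p > thresh) == bool(t)
               for p, t in zip(preds, targets)) / len(targets)

def auto_threshold(preds, targets):
    # Only midpoints between adjacent distinct predictions need to be
    # tried: any two thresholds falling between the same pair of
    # predictions classify every point identically.
    cand = sorted(set(preds))
    mids = [(a + b) / 2 for a, b in zip(cand, cand[1:])] or [0.5]
    return max(mids, key=lambda t: accuracy(preds, targets, t))
```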

-----------------------------------------------------------------------
REFERENCES

[1] Rich Caruana, Alex Niculescu, Geoff Crew, and Alex Ksikes,
"Ensemble Selection from Libraries of Models," The International
Conference on Machine Learning (ICML'04), 2004.

-----------------------------------------------------------------------
DOCUMENT HISTORY
2006-03-08	Art Munson wrote first draft of README.