CS578 Fall 2001
Empirical Methods in Machine Learning and Data Mining 
Homework Assignment #1
Due: Thursday September 28, 2001

The goal of this assignment is to use the IND decision tree package to
do a few simple experiments with decision trees.  This will require
installing IND on a Unix platform, running experiments with the glass
and thyroid data sets that come with the package, and experimenting
with a third data set from the UC Irvine Machine Learning Repository
that you will select.

NOTE: Since you might have trouble installing the software, it is
important to do Steps 1-5 ASAP to verify that the code works for you.

STEP 1: Download IND to a Unix machine from the link on the course web
page: www.cs.cornell.edu/courses/cs578/2001fa/578home.html.

STEP 2: Untar the file by executing "tar -xvf hw1.tar".  This will
create a directory called cs578.hw1.

STEP 3: Cd into the directory and read the README files.  This text is
README.hw1.  After this, read README.install and README.setenv.

STEP 4: Run the script install.script by executing "./install.script".

The installation is not bulletproof.  We've tested it on several Unix
environments, and modified the code and makefiles to minimize
problems, but it may still fail to compile in some environments.

Note that it is normal to get warnings during compilation and while
using the code.  These warnings do not mean the software is broken.
They appear because the code dates from the 1980s and early 1990s and
is being built on modern Unix environments.

If you get errors and it fails to compile, please do your best to
debug and fix simple problems yourself.  If you can't get it working,
contact Alex Niculescu or Rich Caruana after class or during office
hours.

If things went well, you have installed IND and a set of simple Unix
utilities collectively called unixstat.  Be sure to set your path
variables each time you create a new session.

STEP 5: Execute "inddemo glass 100 c4 1 | more" in the subdirectory
ind/IND/Data/glass.  You should get results that look like:

[caruana@4157caruana glass]$ inddemo glass 100 c4 1 | more
tgen -e -ir -Pnull -s1,2 -Sfull glass.attr glass.bld glass.treec
tprune -fn glass.attr glass.treec
tclass -e -sl glass.attr glass.tree glass.tst
Percentage accuracy for tree 1 = 59.6491 +/- 4.5949
Mean square error for tree 1 = 0.780598
Expected accuracy for tree 1 = 97.0858
Leaf count for tree 1 = 14, expected = 14.000000

+28+39+4+0+9+6+14 2
Ba < 0.335: +28+39+4+0+8+6+0 2
|   K < 0.01: +1+1+0+0+0+6+0 6
|   K >= 0.01: +27+38+4+0+8+0+0 2
|   |   Mg < 1.985: +0+3+0+0+8+0+0 5
|   |   |   Al < 1.38: +0+2+0+0+0+0+0 2
|   |   |   Al >= 1.38: +0+1+0+0+8+0+0 5
|   |   Mg >= 1.985: +27+35+4+0+0+0+0 2
|   |   |   Al < 1.375: +27+12+1+0+0+0+0 1
|   |   |   |   Mg < 3.785: +27+7+1+0+0+0+0 1
|   |   |   |   |   Si < 71.36: +0+2+0+0+0+0+0 2
|   |   |   |   |   Si >= 71.36: +27+5+1+0+0+0+0 1
|   |   |   |   |   |   Mg < 3.605: +17+0+0+0+0+0+0 1
|   |   |   |   |   |   Mg >= 3.605: +10+5+1+0+0+0+0 1
|   |   |   |   |   |   |   Na < 13.035: +0+3+0+0+0+0+0 2
|   |   |   |   |   |   |   Na >= 13.035: +10+2+1+0+0+0+0 1
|   |   |   |   |   |   |   |   Si < 72.695: +8+0+0+0+0+0+0 1
|   |   |   |   |   |   |   |   Si >= 72.695: +2+2+1+0+0+0+0 1
|   |   |   |   |   |   |   |   |   RI < 1.5176: +2+0+0+0+0+0+0 1
|   |   |   |   |   |   |   |   |   RI >= 1.5176: +0+2+1+0+0+0+0 2
|   |   |   |   Mg >= 3.785: +0+5+0+0+0+0+0 2
|   |   |   Al >= 1.375: +0+23+3+0+0+0+0 2
|   |   |   |   Ca < 8.415: +0+20+0+0+0+0+0 2
|   |   |   |   Ca >= 8.415: +0+3+3+0+0+0+0 2
|   |   |   |   |   Mg < 3.105: +0+3+0+0+0+0+0 2
|   |   |   |   |   Mg >= 3.105: +0+0+3+0+0+0+0 3
Ba >= 0.335: +0+0+0+0+1+0+14 7

              3  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
              2  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000
              2  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
              1  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
              1  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
              2  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
              7  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
              1  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
...
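
The tree printout above can also be read by a script.  The following
is a minimal Python sketch, not part of IND, that assumes each node
line ends with the per-class training counts ("+28+39+...") followed
by the predicted class, that class labels are 1-based positions in
that count list, and that each "|" marks one level of nesting.  The
function names are hypothetical.  Under those assumptions it counts
the leaves and estimates training-set accuracy from the leaf counts.

```python
import re

# Trailing "+28+39+4+0+9+6+14 2": per-class counts, then predicted class.
NODE_RE = re.compile(r"((?:\+\d+)+)\s+(\d+)\s*$")


def parse_tree(text):
    """Return a list of (depth, counts, predicted_class), one per node line."""
    nodes = []
    for line in text.splitlines():
        match = NODE_RE.search(line)
        if not match:
            continue
        counts = [int(x) for x in match.group(1).split("+") if x]
        label = int(match.group(2))
        if line.lstrip().startswith("+"):
            depth = -1  # the root line carries no test, only counts
        else:
            prefix = re.match(r"[|\s]*", line).group(0)
            depth = prefix.count("|")
        nodes.append((depth, counts, label))
    return nodes


def leaves(nodes):
    """A node is a leaf if the next node is not nested deeper than it."""
    result = []
    for i, (depth, counts, label) in enumerate(nodes):
        is_last = i + 1 == len(nodes)
        if is_last or nodes[i + 1][0] <= depth:
            result.append((counts, label))
    return result


def train_accuracy(nodes):
    """Fraction of training cases whose class matches their leaf's label."""
    leaf_nodes = leaves(nodes)
    correct = sum(counts[label - 1] for counts, label in leaf_nodes)
    total = sum(sum(counts) for counts, _ in leaf_nodes)
    return correct / total


# The "Mg < 1.985" subtree from the output above, re-indented as a tree.
sample = """\
+0+3+0+0+8+0+0 5
Al < 1.38: +0+2+0+0+0+0+0 2
Al >= 1.38: +0+1+0+0+8+0+0 5
"""
nodes = parse_tree(sample)
print(len(leaves(nodes)), train_accuracy(nodes))
```

If the counts at each node are not training counts in your setup, the
accuracy figure from this sketch will not mean what it claims, so check
it against a tree you have verified by hand.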

STEP 6: The script inddemo is in cs578/bin.  Read it to get a flavor
of how the IND software is run.  The important steps are:

 - the lines that use perm and linex and your seed to randomly select
   the number of training cases you specified from the glass.dta file
   and put them in a file called glass.bld, and then put the remaining
   cases in the file glass.tst

 - the line that calls mktree to grow a tree of the type you specified
   using the description of the attributes in the glass.attr file and
   the training cases in glass.bld

 - the line that calls tprint to show the tree and the counts at each
   node in the tree

 - the line that calls tclass on the test cases in glass.tst and then
   massages the output, using the colex command to pull the true
   target values from glass.tst, so you can see the predictions
   side-by-side with the true targets
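
The first step above, a seeded random train/test split, can be
illustrated in Python.  This is only a sketch of the idea, not what
inddemo actually runs: it does not call perm or linex, the function
name is hypothetical, and a given seed will not reproduce the same
permutation that perm produces.

```python
import random


def split_cases(cases, n_train, seed):
    """Shuffle the cases with a seeded generator, then cut into two parts.

    Returns (train, test): the first n_train shuffled cases and the rest,
    playing the roles of stem.bld and stem.tst.
    """
    rng = random.Random(seed)  # same seed gives the same split
    shuffled = list(cases)     # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Toy stand-in for the lines of a .dta file.
cases = ["case%d" % i for i in range(10)]
train, test = split_cases(cases, 6, seed=1)
print(len(train), len(test))
```

Every case lands in exactly one of the two lists, so the test set is
never contaminated with training cases.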

If you set up the MANPATH variable, you can learn more about mktree,
tprint, tclass, colex, and linex by executing "man mktree" or "man
colex".  Other programs you might be interested in are tgen, tprune,
and perm.

STEP 7: Use IND to grow decision trees of types cart, c4, id3, smml,
and mml on different sized training samples using the glass and
thyroid data sets.  Run multiple experiments at each size using
different random seeds to get the average performance for each size.
Report on the accuracy of the tree on both the train and test sets,
the RMSE (root mean squared error) on both sets, and the tree size.
Calculate both means and variances.  Graphs would be a good way to
present some of the results.  Comment on or explain the results.  You
may want to write small scripts or programs to run the experiments and
process the results.  It's easier in the long run than doing it
manually!  Include this code in what you hand in.
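
As a starting point for such a script, here is a hedged Python sketch.
It assumes the inddemo argument order shown in Step 5 (stem, training
set size, tree type, seed), that inddemo is on your path, and that the
summary lines are worded exactly as in the Step 5 sample.  The
function names are hypothetical.  It also assumes RMSE can be taken as
the square root of the reported "Mean square error", which you should
confirm.  Note that inddemo as shown reports test-set figures only.

```python
import math
import re
import statistics
import subprocess

SUMMARY_PATTERNS = {
    "accuracy": re.compile(r"Percentage accuracy for tree \d+ = ([\d.]+)"),
    "mse": re.compile(r"Mean square error for tree \d+ = ([\d.]+)"),
    "leaves": re.compile(r"Leaf count for tree \d+ = (\d+)"),
}


def parse_summary(output):
    """Pull accuracy, MSE, and leaf count out of one run's text output."""
    result = {}
    for name, pattern in SUMMARY_PATTERNS.items():
        match = pattern.search(output)
        if match:
            result[name] = float(match.group(1))
    if "mse" in result:
        # Assumption: RMSE is the square root of the reported MSE.
        result["rmse"] = math.sqrt(result["mse"])
    return result


def run_inddemo(stem, n_train, tree_type, seed):
    """Run inddemo once and return its combined stdout and stderr."""
    completed = subprocess.run(
        ["inddemo", stem, str(n_train), tree_type, str(seed)],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    return completed.stdout


def summarize(runs):
    """Return {metric: (mean, sample variance)} over a list of parsed runs."""
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        variance = statistics.variance(values) if len(values) > 1 else 0.0
        summary[name] = (statistics.mean(values), variance)
    return summary
```

A typical use would be to call run_inddemo in the data directory for
seeds 1 through 10 at each training set size, pass each output to
parse_summary, and hand the resulting list to summarize.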

STEP 8: Use a search engine to find the UC Irvine Machine Learning
Repository.  Skim the data sets that are available, and pick one.
Copy the data set, create a stem.attr file for it, and run experiments
similar to those in Step 7 with it.  For variety, you might want to
select a data set which is large, or one which has attribute types
different from the ones in the glass and thyroid data sets.  Beware of
data sets with missing values.  (See EXTRA CREDIT below.)

STEP 9: Which tree type(s) are more accurate, larger, more
intelligible?  Are trees with better RMSE always more accurate?
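
To compare the two measures on the same predictions, both can be
computed from the side-by-side rows that inddemo prints.  The Python
sketch below assumes each row holds the true class followed by one
predicted probability per class, with classes numbered from 1, as the
Step 5 sample and the Step 6 description suggest.  The squared error
here is summed over classes and averaged over cases, which is one
common definition and may not match how tclass normalizes its "Mean
square error".  The function names are hypothetical.

```python
import math


def parse_rows(text):
    """Return a list of (true_class, probabilities) from prediction rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            target = int(fields[0])
            probs = [float(x) for x in fields[1:]]
        except ValueError:
            continue  # not a prediction row
        rows.append((target, probs))
    return rows


def accuracy(rows):
    """Fraction of rows whose most probable class equals the true class."""
    correct = 0
    for target, probs in rows:
        predicted = probs.index(max(probs)) + 1  # classes numbered from 1
        if predicted == target:
            correct += 1
    return correct / len(rows)


def rmse(rows):
    """Root of the mean, over cases, of squared error summed over classes."""
    total = 0.0
    for target, probs in rows:
        for k, p in enumerate(probs, start=1):
            truth = 1.0 if k == target else 0.0
            total += (p - truth) ** 2
    return math.sqrt(total / len(rows))
```

Accuracy only looks at which class gets the highest probability, while
the squared error also depends on how confident each prediction is.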

EXTRA CREDIT: do one of the following experiments.  All of these can
be accomplished using options in tgen and do not require modifying the
code. See the man pages (e.g., "man tgen") for a description of the
various options.

 - experiment with lookahead
 - use a UCI data set with missing values and use the tgen -U option
 - manually control the depth of the decision tree with tgen -d and
   see how this affects accuracy for different size train sets
 - pick something else that looks fun

Hand in a brief summary of the results with enough supporting
documentation so that we can see what you did and how you did it.  Do
not write a paper or anything like that.  This is homework, not a
class project.  Our goal is to get you to experiment with decision
trees, not give you a writing assignment!  You'll be using IND later
in the class project, so effort spent now to become familiar with IND
should pay off later.  You are allowed to get help from other students
installing the software and learning how to use it.  But you should
run the experiments and write any supporting code yourself.

There is a tutorial for IND in ind/IND/Doc in the files that look like
ind0-15.ps.  You don't need to read this, but it is there if you want
it.

Have fun.