CS578 Fall 2001
Empirical Methods in Machine Learning and Data Mining
Homework Assignment #1
Due: Thursday September 28, 2001

The goal of this assignment is to use the IND decision tree package to do a few simple experiments with decision trees. This will require installing IND on a Unix platform, running experiments with the glass and thyroid data sets that come with the package, and experimenting with a third data set that you select from the UC Irvine Machine Learning Repository.

NOTE: Since you might have trouble installing the software, it is important to do Steps 1-5 ASAP to verify that the code works for you.

STEP 1: Download IND to a Unix machine from the link on the course web page: www.cs.cornell.edu/courses/cs578/2001fa/578home.html.

STEP 2: Execute the following to untar the file: "tar -xvf hw1.tar". This will create a directory called cs578.hw1.

STEP 3: Cd into the directory and read the README files. This text is README.hw1. After this, read README.install and README.setenv.

STEP 4: Run the script install.script by executing "./install.script".

The installation is not bulletproof. We've tested it on several Unix environments, and modified the code and makefiles to minimize problems, but it may still fail to compile in some environments. Note that it is normal to get warnings during compilation and while using the code. These warnings do not mean the software is broken. They occur because we are running unmodified code from the 80's and early 90's on modern Unix environments. If you get errors and it fails to compile, please do your best to debug and fix simple problems yourself. If you can't get it working, contact Alex Niculescu or Rich Caruana after class or during office hours.

If things went well, you have installed IND and a set of simple Unix utilities collectively called unixstat. Be sure to set your path variables each time you create a new session.

STEP 5: Execute "inddemo glass 100 c4 1 | more" in the subdirectory ind/IND/Data/glass.
You should get results that look like:

[caruana@4157caruana glass]$ inddemo glass 100 c4 1 | more
tgen -e -ir -Pnull -s1,2 -Sfull glass.attr glass.bld glass.treec
tprune -fn glass.attr glass.treec
tclass -e -sl glass.attr glass.tree glass.tst
Percentage accuracy for tree 1 = 59.6491 +/- 4.5949
Mean square error for tree 1 = 0.780598
Expected accuracy for tree 1 = 97.0858
Leaf count for tree 1 = 14, expected = 14.000000
+28+39+4+0+9+6+14 2
Ba < 0.335: +28+39+4+0+8+6+0 2
| K < 0.01: +1+1+0+0+0+6+0 6
| K >= 0.01: +27+38+4+0+8+0+0 2
| | Mg < 1.985: +0+3+0+0+8+0+0 5
| | | Al < 1.38: +0+2+0+0+0+0+0 2
| | | Al >= 1.38: +0+1+0+0+8+0+0 5
| | Mg >= 1.985: +27+35+4+0+0+0+0 2
| | | Al < 1.375: +27+12+1+0+0+0+0 1
| | | | Mg < 3.785: +27+7+1+0+0+0+0 1
| | | | | Si < 71.36: +0+2+0+0+0+0+0 2
| | | | | Si >= 71.36: +27+5+1+0+0+0+0 1
| | | | | | Mg < 3.605: +17+0+0+0+0+0+0 1
| | | | | | Mg >= 3.605: +10+5+1+0+0+0+0 1
| | | | | | | Na < 13.035: +0+3+0+0+0+0+0 2
| | | | | | | Na >= 13.035: +10+2+1+0+0+0+0 1
| | | | | | | | Si < 72.695: +8+0+0+0+0+0+0 1
| | | | | | | | Si >= 72.695: +2+2+1+0+0+0+0 1
| | | | | | | | | RI < 1.5176: +2+0+0+0+0+0+0 1
| | | | | | | | | RI >= 1.5176: +0+2+1+0+0+0+0 2
| | | | Mg >= 3.785: +0+5+0+0+0+0+0 2
| | | Al >= 1.375: +0+23+3+0+0+0+0 2
| | | | Ca < 8.415: +0+20+0+0+0+0+0 2
| | | | Ca >= 8.415: +0+3+3+0+0+0+0 2
| | | | | Mg < 3.105: +0+3+0+0+0+0+0 2
| | | | | Mg >= 3.105: +0+0+3+0+0+0+0 3
Ba >= 0.335: +0+0+0+0+1+0+14 7
3 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000
2 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000
1 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
1 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
7 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000
1 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
...

STEP 6: The script inddemo is in cs578/bin. Read it to get a flavor of how the IND software is run.
The important steps are:

- the lines that use perm and linex and your seed to randomly select the number of training cases you specified from the glass.dta file and put them in a file called glass.bld, and then put the remaining cases in the file glass.tst

- the line that calls mktree to grow a tree of the type you specified using the description of the attributes in the glass.attr file and the training cases in glass.bld

- the line that calls tprint to show the tree and the counts at each node in the tree

- the line that calls tclass on the test cases in glass.tst and then uses the colex command to massage the output so you can see the predictions side-by-side with the true target values taken from glass.tst

If you set up the MANPATH variable, you can learn more about mktree, tprint, tclass, colex, and linex by executing "man mktree" or "man colex". Other programs you might be interested in are tgen, tprune, and perm.

STEP 7: Use IND to grow decision trees of types cart, c4, id3, smml, and mml on different sized training samples using the glass and thyroid data sets. Run multiple experiments at each size using different random seeds to get the average performance for each size. Report the accuracy of the tree on both the train and test sets, the RMSE (root mean squared error) on both sets, and the tree size. Calculate both means and variances. Graphs would be a good way to present some of the results. Comment on or explain the results.

You may want to write small scripts or programs to run the experiments and process the results. It's easier in the long run than doing it manually! Include this code in what you hand in.

STEP 8: Use a search engine to find the UC Irvine Machine Learning Repository. Skim the data sets that are available, and pick one. Copy the data set, create a stem.attr file for it, and run experiments similar to those in Step 7 with it.
For variety, you might want to select a data set that is large, or one that has attribute types different from the ones in the glass and thyroid data sets. Beware of data sets with missing values. (See EXTRA CREDIT below.)

STEP 9: Which tree type(s) are more accurate, larger, more intelligible? Are trees with better RMSE always more accurate?

EXTRA CREDIT: Do one of the following experiments. All of these can be accomplished using options in tgen and do not require modifying the code. See the man pages (e.g., "man tgen") for a description of the various options.

- experiment with lookahead

- use a UCI data set with missing values and use the tgen -U option

- manually control the depth of the decision tree with tgen -d and see how this affects accuracy for different size train sets

- pick something else that looks fun

Hand in a brief summary of the results with enough supporting documentation so that we can see what you did and how you did it. Do not write a paper or anything like that. This is homework, not a class project. Our goal is to get you to experiment with decision trees, not give you a writing assignment! You'll be using IND later in the class project, so effort spent now to become familiar with IND should pay off later.

You are allowed to get help from other students installing the software and learning how to use it. But you should run the experiments and write any supporting code yourself.

There is a tutorial for IND in ind/IND/Doc in the files that look like ind0-15.ps. You don't need to read this, but it is there if you want it.

Have fun.
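
For the means and variances requested in Step 7, one possible approach is a small script that runs inddemo once per seed and collects the summary lines it prints. The Python below is only a sketch under stated assumptions, not part of the IND distribution. It assumes the summary lines look exactly like the sample output in Step 5, that the inddemo arguments are stem, training set size, tree type, and seed (inferred from "inddemo glass 100 c4 1"), and that RMSE can be taken as the square root of the reported mean square error. The function names run_inddemo, parse_run, and summarize are hypothetical.

```python
import math
import re
import statistics
import subprocess

# Patterns copied from the summary lines in the Step 5 sample output.
ACC_RE = re.compile(r"Percentage accuracy for tree \d+ = ([\d.]+)")
MSE_RE = re.compile(r"Mean square error for tree \d+ = ([\d.]+)")
LEAF_RE = re.compile(r"Leaf count for tree \d+ = (\d+)")


def run_inddemo(stem, n_train, tree_type, seed):
    """Run inddemo once and return everything it printed.

    Assumption: the argument order matches "inddemo glass 100 c4 1".
    stdout and stderr are joined because it is not known here which
    stream carries the summary lines.
    """
    result = subprocess.run(
        ["inddemo", stem, str(n_train), tree_type, str(seed)],
        capture_output=True,
        text=True,
    )
    return result.stdout + result.stderr


def parse_run(text):
    """Extract test accuracy, RMSE, and leaf count from one run's output."""
    accuracy = float(ACC_RE.search(text).group(1))
    mse = float(MSE_RE.search(text).group(1))
    leaves = int(LEAF_RE.search(text).group(1))
    # Assumption: RMSE is the square root of the reported mean square error.
    return {"accuracy": accuracy, "rmse": math.sqrt(mse), "leaves": leaves}


def summarize(runs, key):
    """Return (mean, sample variance) of one metric across several runs."""
    values = [run[key] for run in runs]
    mean = statistics.mean(values)
    variance = statistics.variance(values) if len(values) > 1 else 0.0
    return mean, variance
```

Run from ind/IND/Data/glass with your path variables set, something like "runs = [parse_run(run_inddemo('glass', 100, 'c4', s)) for s in range(1, 11)]" followed by "summarize(runs, 'accuracy')" would give the mean and variance of test accuracy over ten seeds. The inddemo output shown in Step 5 only covers the test set, so train set numbers need a separate run of tclass on the .bld file.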