Project 7: CART

Boosting took a long time to be truly understood.
... cynics say we didn't see the forest for all the trees ...

Files you'll edit:
id3tree.m Returns a decision tree using the minimum entropy splitting rule (implemented for you in entropysplit.m)
evaltree.m Evaluates a decision tree on test data.
prunetree.m Prunes a tree using a validation set.
forest.m Builds a forest of id3trees.
evalforest.m Learns and evaluates a forest.
boosttree.m Applies AdaBoost to your id3tree.
evalboost.m Learns and evaluates a boosted classifier.
Files you might want to look at:
entropysplit.m This function implements minimum entropy splitting using information gain/entropy.
visualhw7.m This function visualizes trees for you.
randsample.m This function takes a random sample.

Introduction

In this assignment you will implement a decision tree algorithm and then use it for bagging and boosting. Please first update your svn directory by calling svn update in the svntop directory. Because Octave does not support pointers and is really only good with matrices, we will represent a tree through a matrix T. A tree T with q nodes must have dimensions 6 × q, where each column represents a node in the tree (the very first column is the root node). The rows hold the following information:

  1. prediction at this node
  2. index of feature to cut
  3. cutoff value c (<=c : left, and >c : right)
  4. index of left subtree (0 = leaf)
  5. index of right subtree (0 = leaf)
  6. parent (0 = root)
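
For example, a hypothetical three-node tree whose root splits on feature 3 at cutoff 0.5, with a left leaf predicting -1 and a right leaf predicting +1 (assuming labels in {-1, +1}; the root's own prediction is just its majority label), would be stored as:

    %        root  left  right
    T = [     1    -1     1   ;   % row 1: prediction at this node
              3     0     0   ;   % row 2: index of feature to cut (0 for a leaf)
            0.5     0     0   ;   % row 3: cutoff value c
              2     0     0   ;   % row 4: index of left subtree (0 = leaf)
              3     0     0   ;   % row 5: index of right subtree (0 = leaf)
              0     1     1   ];  % row 6: parent (0 = root)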

Implementing CART

We have implemented a function entropysplit.m, which searches through all features and returns the best (feature, cut-threshold) combination based on the entropy impurity (i.e., information gain). You don't need to implement this; it is already done in entropysplit.m. Please take a minute to familiarize yourself with the implementation; it will make the subsequent tasks easier.
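
To make the idea concrete, the weighted entropy impurity of one candidate split could be computed roughly as below. This is only an illustration of the quantity being minimized, not the provided implementation; the data layout (assumed here to be d × n with one example per column and labels in {-1, +1}) and the actual interface of entropysplit.m may differ.

    % Illustrative only: entropy impurity of splitting on feature f at cutoff c,
    % given example weights w. Labels assumed to be in {-1, +1}.
    function H = splitentropy(xTr, yTr, w, f, c)
      left = xTr(f, :) <= c;                     % examples sent to the left child
      H = branchentropy(yTr(left), w(left)) + branchentropy(yTr(~left), w(~left));
    end

    function h = branchentropy(y, w)
      if isempty(w) || sum(w) == 0, h = 0; return; end
      p = sum(w(y == 1)) / sum(w);               % weighted fraction of +1 labels
      q = [p, 1 - p];
      q = q(q > 0);                              % avoid log2(0)
      h = -sum(w) * sum(q .* log2(q));           % branch entropy weighted by its mass
    end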

Your tasks:
  1. Implement the function id3tree.m, which returns a decision tree based on the minimum entropy splitting rule. The function takes training data, test data, a maximum depth, and the weight of each training example (used in entropy splitting). Maximum depth and weight are optional arguments: if they are not provided, make the maximum depth infinity and weight each example equally. You can visualize your tree with the command visualhw7. You should use the function entropysplit.m. (Hint: to speed up your code, initialize the matrix T with T = zeros(6, q) for some estimate of q.) Rough, non-authoritative sketches of this function and the others below appear after the task list.

  2. Implement the function evaltree.m, which evaluates a decision tree on a given test data set. You can test the accuracy of your implementation with hw7tictoc.

  3. Decision trees (without a depth limit) are high-variance classifiers, so overfitting is a serious problem. One way around it is to prune your tree to stop making "silly" splits that overfit the training set. You prune a tree bottom up, removing leaves that do not reduce the validation error. Implement the function prunetree.m, which returns a decision tree pruned on a validation data set. The accuracy on the validation set should be the same or higher after pruning.

  4. A more effective way to prevent overfitting is bagging. Implement the function forest.m, which builds a forest of id3 decision trees. Each tree should be built on a training set drawn by randomly sampling n examples from the training data with replacement; you can use the provided randsample function to do this. (Do not randomly sample features.)

  5. Implement the function evalforest.m, which evaluates a forest on a test data set. Remember that random forests make predictions by popular vote.

  6. Another way to improve your decision trees is to build trees of small depth (e.g., only depth 3 or 4). These do not have high variance, but instead suffer from high bias. You can reduce the bias of a classifier with boosting. Implement the function boosttree.m, which applies AdaBoost to your id3tree function.

  7. Implement the function evalboost.m, which evaluates an ensemble of boosted decision trees.
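
Illustrative sketches

The sketches below are rough, non-authoritative illustrations of one way each function could be organized. They assume xTr and xTe are d × n matrices with one example per column, labels are in {-1, +1}, and that entropysplit(xTr, yTr, weights) returns a best [feature, cut] pair; the actual signatures in your svn directory (including the extra arguments of id3tree.m) may differ, so treat these only as a starting point. First, the core recursion of an ID3-style tree builder over the 6 × q matrix format:

    % Sketch of an ID3-style tree builder (not the required id3tree.m signature).
    function T = id3tree_sketch(xTr, yTr, maxdepth, weights)
      n = size(xTr, 2);
      if nargin < 3, maxdepth = Inf; end
      if nargin < 4, weights = ones(1, n) / n; end
      T = zeros(6, 0);                             % preallocating 6 x q is faster
      T = build(T, xTr, yTr, weights, 1, 0, maxdepth);
    end

    function T = build(T, x, y, w, depth, parent, maxdepth)
      node = size(T, 2) + 1;
      pred = sign(sum(w .* y));                    % weighted majority label
      if pred == 0, pred = 1; end
      T(:, node) = [pred; 0; 0; 0; 0; parent];
      if depth >= maxdepth || all(y == y(1))       % depth limit reached or node pure
        return;
      end
      [f, c] = entropysplit(x, y, w);              % assumed interface; see entropysplit.m
      goleft = x(f, :) <= c;
      if all(goleft) || ~any(goleft), return; end  % split separates nothing: stay a leaf
      T(2:3, node) = [f; c];
      T(4, node) = size(T, 2) + 1;                 % left subtree occupies the next columns
      T = build(T, x(:, goleft), y(goleft), w(goleft), depth + 1, node, maxdepth);
      T(5, node) = size(T, 2) + 1;                 % right subtree follows the left one
      T = build(T, x(:, ~goleft), y(~goleft), w(~goleft), depth + 1, node, maxdepth);
    end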
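
A sketch of evaluating such a tree on test points, assuming the same matrix layout:

    % Sketch: follow each test point from the root to a leaf and report the
    % leaf's prediction.
    function preds = evaltree_sketch(T, xTe)
      n = size(xTe, 2);
      preds = zeros(1, n);
      for i = 1:n
        node = 1;                                  % the root is the first column
        while T(4, node) ~= 0                      % internal nodes have children
          if xTe(T(2, node), i) <= T(3, node)
            node = T(4, node);                     % <= cutoff: go left
          else
            node = T(5, node);                     % > cutoff: go right
          end
        end
        preds(i) = T(1, node);
      end
    end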
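
One possible shape for bottom-up pruning: repeatedly try to collapse nodes whose children are both leaves, keeping the collapse whenever validation error does not increase. (The real prunetree.m may also need to compact the matrix; this sketch simply leaves the removed columns unreachable.)

    % Sketch of bottom-up pruning against a validation set (xVal, yVal).
    function T = prunetree_sketch(T, xVal, yVal)
      changed = true;
      while changed
        changed = false;
        baseerr = mean(evaltree_sketch(T, xVal) ~= yVal);
        for node = 1:size(T, 2)
          l = T(4, node); r = T(5, node);
          if l == 0 || r == 0, continue; end              % already a leaf
          if T(4, l) ~= 0 || T(4, r) ~= 0, continue; end  % children are not both leaves
          Ttry = T;
          Ttry(4:5, node) = 0;                            % collapse this node into a leaf
          if mean(evaltree_sketch(Ttry, xVal) ~= yVal) <= baseerr
            T = Ttry;                                     % keep the prune, rescan
            changed = true;
            break;
          end
        end
      end
    end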
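
A bagging sketch; it stores the trees in a cell array, although your forest.m may use a different representation, and it draws bootstrap samples with plain rand rather than the provided randsample:

    % Sketch: build m trees, each on a bootstrap sample of the n training
    % examples (sampled with replacement; features are not subsampled).
    function trees = forest_sketch(xTr, yTr, m)
      n = size(xTr, 2);
      trees = cell(1, m);
      for t = 1:m
        idx = ceil(rand(1, n) * n);                % n draws from 1..n with replacement
        trees{t} = id3tree_sketch(xTr(:, idx), yTr(idx));
      end
    end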
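
Evaluating the forest by majority vote, again assuming labels in {-1, +1}:

    % Sketch: each tree votes; the sign of the vote total is the prediction.
    function preds = evalforest_sketch(trees, xTe)
      votes = zeros(1, size(xTe, 2));
      for t = 1:numel(trees)
        votes = votes + evaltree_sketch(trees{t}, xTe);
      end
      preds = sign(votes);
      preds(preds == 0) = 1;                       % break ties arbitrarily
    end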
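
A sketch of AdaBoost with shallow id3 trees as weak learners; the exact return format of your boosttree.m may differ (here it is a cell array of trees plus a vector of weights):

    % Sketch of AdaBoost: reweight the training examples after each round so
    % that the next weak learner focuses on the mistakes. Labels in {-1, +1}.
    function [trees, alphas] = boosttree_sketch(xTr, yTr, m, maxdepth)
      n = size(xTr, 2);
      w = ones(1, n) / n;                          % start with uniform weights
      trees = {}; alphas = [];
      for t = 1:m
        T = id3tree_sketch(xTr, yTr, maxdepth, w);
        h = evaltree_sketch(T, xTr);
        err = sum(w(h ~= yTr));                    % weighted training error
        if err >= 0.5, break; end                  % no better than chance: stop
        alpha = 0.5 * log((1 - err) / max(err, eps));
        trees{end + 1} = T;
        alphas(end + 1) = alpha;
        w = w .* exp(-alpha * yTr .* h);           % up-weight misclassified examples
        w = w / sum(w);                            % renormalize
      end
    end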
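
Finally, evaluating the boosted ensemble as a weighted vote:

    % Sketch: sum the trees' predictions weighted by their alphas, take the sign.
    function preds = evalboost_sketch(trees, alphas, xTe)
      score = zeros(1, size(xTe, 2));
      for t = 1:numel(trees)
        score = score + alphas(t) * evaltree_sketch(trees{t}, xTe);
      end
      preds = sign(score);
      preds(preds == 0) = 1;                       % break ties arbitrarily
    end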