DECISION TREES INTERIOR NODES: tests on attributes. # of branches = # of values of attributes = # of splits LEAF NODES: classify objects into either discrete classes or into probabilities of each class DTs are best for classification problems. They are not well suited for the prediction of continuous values (e.g., regression). There are modifications to DTs that make them more suitable for regression problems. The concept class of DTs is disjunctions of conjunctions of attributes. Each path in a DT from the root test to a leaf node is a conjunction of attribute values. Each leaf node is a class. Since there can be multiple leaf nodes for each class, each class potentially is represented as a disjunction of these leaf nodes, i.e., as a disjunction of multiple conjunctive paths: CLASSi = (ATTRa = VALb) /\ (ATTRc = VALd) /\ ... /\ (ATTRe = VALf) \/ (ATTRg = VALh) /\ (ATTRi = VALj) /\ ... /\ (ATTRk = VALl) \/ ... \/ (ATTRm = VALn) /\ (ATTRo = VALp) /\ ... /\ (ATTRq = VALr) ===================================================================== MACHINE LEARNING OF DECISION TREES Can't exhaustively enumerate all DTs and test them to see which is best because the number of DTs is exponential in the number of attributes. Instead, we "grow" DTs using greedy (non-optimal) search methods. TDIDT = Top Down Induction of Decision Trees. Also known as recursive partitioning. Start with the null tree consisting of just a leaf node that contains all training cases. Then try to find the "best" split (attribute to test) to install at this leaf node. "Best" means that the leaf nodes resulting from the split on the attribute have the best class purity compared to all other possible splits. In particular, usually the class purity of the new leaf nodes will be better than the class purity of the parent leaf node that is being replaced with the test. One measure of class purity is Information Gain. InfoGain = decrease in entropy due to the split. Entropy(set of examples S) = SUMMATION [- Pc * LOG2 (Pc)] c in classes See graph of entropy function in Mitchell's text book on p.57. InfoGain = Entropy of Parent - Weighted Entropy of Children |CHILD| = Entropy(P) - SUMMATION ------- * Entropy(Child) children |P| There is a child for each value of the attribute used in the test, so this is a summation over the values of the attribute in the split. If InfoGain = 0, children have the same entropy as the parent, and there is no improvement in class purity due to the split. This is a poor split. If InfoGain = Entropy(Parent), then the children have 0 entropy, and the split has provided maximum purity. This is a great split. InfoGain is a criterion to use to find the best split (attribute test) to install at a leaf node. To grow a full decision tree, start with the null tree containing just a root leaf node. Try all the possible attributes and use InfoGain to find the "best". Install this best attribute at the root. This creates new leaf nodes, one for each value of that attribute. Examine each leaf node one at a time, and use InfoGain to find the best attribute to install at each leaf node. This adds a new level to the tree. Repeat. Decision tree growing is the recursive process of replacing leaf nodes with attribute tests (which creates additional leaf nodes). ===================================================================== DEALING WITH CONTINUOUS ATTRIBUTES: A bad way to handle continuous attributes to create a branch for each unique value fo the attribute. One good approach is to creat a binary split for continuous attributes by finding a threshold such that all cases <= to the threshold go left and all cases > threshold go right. One way to find a good threshold is to sort the attribute values, and consider placing the threshold midway between each successive value in the sort. If there are K unique values, you will need to consider K-1 thresholds. For each threhsold, compute the InfoGain, and use the threshold that yields the best InfoGain. A more sophisticated approach is to cluster the continuous values into a small number of disjoint sets such as quartiles, and use these as discrete splits. Research is still being done on how to efficiently find good partitionings of continuous attributes. ===================================================================== ALTERNATE SPLIT CRITERIA One problem with InfoGain is that it prefers splits that have many branches (i.e., attributes with many values) because the leaf nodes are more likely to yield purity. Consider a Social Security Number attribute. Each person has a unique SSN, so with N people we have N unique SSNs and each person ends up in a separate leaf node. Each leaf node has purity -- each person is a member of just one class. InfoGain computes 0 entropy for each leaf node, yielding maximum InfoGain. So the SSN attribute will always appear best and be selected. But this obviously doesn't generalize. We need to penalize the InfoGain so that the gain is reduced when it is the result of a split that creates many small leaf nodes. One approach is GainRatio. See Mitchell pp. 73-74. ===================================================================== QUESTION: Does it ever make sense to repeat a test in a path? Not if the attributes are boolean or discrete, because once an attribute is tested on a path, all the cases that go down a branch share the same attribute value, thus a subsequent test on the same attribute could not spearate the cases further. ===================================================================== QUESTION: When do we stop growing the decision tree? One approach is to continue growing the tree until the leaf nodes are empty, or until the leaf nodes are pure (i.e., contain items of just one class), or until there are no attribute tests left that will separate the cases (in which case the training set must be inconsistent). This approach leads to large DTs with many leaf nodes, many of which contain very small samples. Using small samples to assign the class of a leaf node is unreliable. This can yield high apparent accuracy/performance on the training set, but may generalize poorly to future unseen test cases. See the overfitting graph on p.67 of Mitchell. EARLY STOPPING: An alternate approach is to stop growing the DT when the leaf nodes contain fewer than K items. K might be a value like 10. This prevents us from expanding leaf nodes (adding new attribute tests) when the sample sizes are too small to allow us to reliably pick a good test and assign class labels to the resulting leaf nodes. More sophisticated variations on this theme include using statistical tests to judge when the sample size and mix is large enough to warrant adding a further test. Another approach to early stopping is to stop adding tests when no attribute test yields an improvement in class purity. This prevents us from adding tests that doe not appear to improve classification accuracy. One problem with this aproach is that it is difficult to judge if a test is useful without seeing the splits that might be installed after it. For example, if the XOR of two attributes perfectly correlates with the class, each of the two attributes in isolation will provide no improvement in class purity, but a path that contains both attributes will yield perfect purity. In general, early stopping is difficult when combinations of attributes can yield good classification, but those attributes in isolation look poor. This is a problem not only for early stopping, but shows a weakness of our greedy procedure for growing decision trees: the greedy process is biased towards attributes that work well alone, or in combination with the attributes already installed higher in the tree. The greedy process is not good at finding combinations of attributes that are poor in isolation but perform well in combination. One solution to this problem is lookahead. Grow the tree a few levels deeper than the level you are currently installing splits at, examine the info gain of the entire tree, and use this to pick the best attribute to install at the current level. Then recurse. This approach makes DT growing less greedy by allowing it to look ahead and see the consequences of the current test further down the tree. But it is expensive. If you allow lookahead equal to the maximum depth of the tree, it is equivalent to enumerating all possible trees to this depth, which usually is infeasible. POST PRUNING: A better approach to overgrowing the tree and overfitting, is to grow the tree to maximum size, and then prune it back. One way to prune leafs is to use a statistical test such as chi-square to measure when the sample size and class distribution of the node is statistically not warranted. An alternate approach is reduced error pruning: prune away any test that does not make the performance of the tree worse when it is pruned. TRAIN, TEST, AND VALIDATION SETS: An interesting issue when using post pruning, is whether to use the train set, or some other test set, to do the pruning. The performance of a tree and the splits in a tree usually look optimistic on the train set because the splits were chosen to optimize performance on the train set. Thus pruning using the train set tends to remove fewer nodes than pruning using an independent test set would. But you shouldn't use the final test set for pruning (because this means the final test set was used in the training procedure, and thus is no longer a fully independent test set), so you need to create a 2nd test set, usually called the validation set, just for pruning. This is unfortunate because typically you'd like to grow the tree using as large a data set as possible. Having to hold aside a validation set sometimes can hurt the grown tree enough that the subsequent benefits of pruning do not compensate for the loss in performance. There is no simple answer to this problem (yet). ===================================================================== MISSING VALUES We'll discuss missing values more later in the course. For now, it is sufficient to know that DTs deal reasonably well with missing values: when a case reaches a branch in the DT where it is missing the value of the test, the case can be sent down the branch that most of the other cases went down, or it can be sent down one branch, with that branch picked randomly with probability equal to the probability that other cases not missing values went down, or it can be split into fractional cases, each fraction being equal to the probability with which other cases went down the branches. ===================================================================== PACKAGES: ID3 was an early research implementation of decision trees. It is still around if you want code to work from. CART (Classification and Regression Trees) is a decision tree package created by statisticians that is popular in research communities with a stats bias. The book for CART is an excellent introduction to the issues associated with growing decision trees. C4.5 and C5 are packages created by Ross Quinlan and are the standard in the machine learning community. C5 can generate rules as well as decision trees. IND (INDuce) is a package available from NASA created by Wray Buntine. It is unsupported C code designed to run under Unix. I use this package in some of my research, and we may use it for the homework on decision trees. ===================================================================== ADVANTAGES OF DECISION TREES: - train fast - evaluate fast - compact models - intelligible if small - do feature selection - don't use all features - experts understand/accept them - easy to convert to rules - can handle missing values DISADVANTAGES OF DECISION TREES: - not good at regression (predicting continuous values) - not good at non-axis parallel splits - trees for problems with continuous attributes can be large - large trees are not intelligible - split ordering often counterintuitive to experts - not good at learning from many inputs (e.g., pixels)