DECISION TREES

INTERIOR NODES: tests on attributes. # of branches = # of values of
attributes = # of splits

LEAF NODES: classify objects into either discrete classes or into
probabilities of each class

DTs are best for classification problems.  They are not well suited
for the prediction of continuous values (e.g., regression).  There are
modifications to DTs that make them more suitable for regression
problems.

The concept class of DTs is disjunctions of conjunctions of
attributes.  Each path in a DT from the root test to a leaf node is a
conjunction of attribute values.  Each leaf node is a class.  Since
there can be multiple leaf nodes for each class, each class
potentially is represented as a disjunction of these leaf nodes, i.e.,
as a disjunction of multiple conjunctive paths:

CLASSi =    (ATTRa = VALb) /\ (ATTRc = VALd) /\ ... /\ (ATTRe = VALf)
         \/ (ATTRg = VALh) /\ (ATTRi = VALj) /\ ... /\ (ATTRk = VALl)
         \/ ...
         \/ (ATTRm = VALn) /\ (ATTRo = VALp) /\ ... /\ (ATTRq = VALr)

=====================================================================

MACHINE LEARNING OF DECISION TREES

Can't exhaustively enumerate all DTs and test them to see which is
best because the number of DTs is exponential in the number of
attributes.

Instead, we "grow" DTs using greedy (non-optimal) search methods.

TDIDT = Top Down Induction of Decision Trees.  Also known as recursive
partitioning.

Start with the null tree consisting of just a leaf node that contains
all training cases.  Then try to find the "best" split (attribute to
test) to install at this leaf node.  "Best" means that the leaf nodes
resulting from the split on the attribute have the best class purity
compared to all other possible splits.  In particular, usually the
class purity of the new leaf nodes will be better than the class
purity of the parent leaf node that is being replaced with the test.

One measure of class purity is Information Gain.  InfoGain = decrease
in entropy due to the split.

Entropy(set of examples S) = SUMMATION [- Pc * LOG2 (Pc)]
                             c in classes

See graph of entropy function in Mitchell's text book on p.57.

InfoGain = Entropy of Parent - Weighted Entropy of Children

                                   |CHILD|
         = Entropy(P) - SUMMATION  ------- * Entropy(Child)
                         children    |P|

There is a child for each value of the attribute used in the test, so
this is a summation over the values of the attribute in the split.

If InfoGain = 0, children have the same entropy as the parent, and
there is no improvement in class purity due to the split.  This is a
poor split.

If InfoGain = Entropy(Parent), then the children have 0 entropy, and
the split has provided maximum purity.  This is a great split.

InfoGain is a criterion to use to find the best split (attribute test)
to install at a leaf node.  To grow a full decision tree, start with
the null tree containing just a root leaf node.  Try all the possible
attributes and use InfoGain to find the "best".  Install this best
attribute at the root.  This creates new leaf nodes, one for each
value of that attribute.  Examine each leaf node one at a time, and
use InfoGain to find the best attribute to install at each leaf node.
This adds a new level to the tree.  Repeat.  Decision tree growing is
the recursive process of replacing leaf nodes with attribute tests
(which creates additional leaf nodes).

=====================================================================

DEALING WITH CONTINUOUS ATTRIBUTES: A bad way to handle continuous
attributes to create a branch for each unique value fo the attribute.
One good approach is to creat a binary split for continuous attributes
by finding a threshold such that all cases <= to the threshold go left
and all cases > threshold go right.  One way to find a good threshold
is to sort the attribute values, and consider placing the threshold
midway between each successive value in the sort.  If there are K
unique values, you will need to consider K-1 thresholds.  For each
threhsold, compute the InfoGain, and use the threshold that yields the
best InfoGain.

A more sophisticated approach is to cluster the continuous values into
a small number of disjoint sets such as quartiles, and use these as
discrete splits.  Research is still being done on how  to efficiently
find good partitionings of continuous attributes.

=====================================================================

ALTERNATE SPLIT CRITERIA

One problem with InfoGain is that it prefers splits that have many
branches (i.e., attributes with many values) because the leaf nodes
are more likely to yield purity.  Consider a Social Security Number
attribute.  Each person has a unique SSN, so with N people we have N
unique SSNs and each person ends up in a separate leaf node.  Each
leaf node has purity -- each person is a member of just one class.
InfoGain computes 0 entropy for each leaf node, yielding maximum
InfoGain.  So the SSN attribute will always appear best and be
selected.  But this obviously doesn't generalize.

We need to penalize the InfoGain so that the gain is reduced when it
is the result of a split that creates many small leaf nodes.  One
approach is GainRatio.  See Mitchell pp. 73-74.

=====================================================================

QUESTION: Does it ever make sense to repeat a test in a path?  Not if
the attributes are boolean or discrete, because once an attribute is
tested on a path, all the cases that go down a branch share the same
attribute value, thus a subsequent test on the same attribute could
not spearate the cases further.

=====================================================================

QUESTION: When do we stop growing the decision tree?  One approach is
to continue growing the tree until the leaf nodes are empty, or until
the leaf nodes are pure (i.e., contain items of just one class), or
until there are no attribute tests left that will separate the cases
(in which case the training set must be inconsistent).  This approach
leads to large DTs with many leaf nodes, many of which contain very
small samples.  Using small samples to assign the class of a leaf node
is unreliable.  This can yield high apparent accuracy/performance on
the training set, but may generalize poorly to future unseen test
cases.

See the overfitting graph on p.67 of Mitchell.

EARLY STOPPING: An alternate approach is to stop growing the DT when
the leaf nodes contain fewer than K items.  K might be a value like
10.  This prevents us from expanding leaf nodes (adding new attribute
tests) when the sample sizes are too small to allow us to reliably
pick a good test and assign class labels to the resulting leaf nodes.
More sophisticated variations on this theme include using statistical
tests to judge when the sample size and mix is large enough to warrant
adding a further test.

Another approach to early stopping is to stop adding tests when no
attribute test yields an improvement in class purity.  This prevents
us from adding tests that doe not appear to improve classification
accuracy.  One problem with this aproach is that it is difficult to
judge if a test is useful without seeing the splits that might be
installed after it.  For example, if the XOR of two attributes
perfectly correlates with the class, each of the two attributes in
isolation will provide no improvement in class purity, but a path that
contains both attributes will yield perfect purity.  In general, early
stopping is difficult when combinations of attributes can yield good
classification, but those attributes in isolation look poor.  

This is a problem not only for early stopping, but shows a weakness of
our greedy procedure for growing decision trees: the greedy process is
biased towards attributes that work well alone, or in combination with
the attributes already installed higher in the tree.  The greedy
process is not good at finding combinations of attributes that are
poor in isolation but perform well in combination.  One solution to
this problem is lookahead.  Grow the tree a few levels deeper than the
level you are currently installing splits at, examine the info gain of
the entire tree, and use this to pick the best attribute to install at
the current level.  Then recurse.  This approach makes DT growing less
greedy by allowing it to look ahead and see the consequences of the
current test further down the tree.  But it is expensive.  If you
allow lookahead equal to the maximum depth of the tree, it is
equivalent to enumerating all possible trees to this depth, which
usually is infeasible.

POST PRUNING:

A better approach to overgrowing the tree and overfitting, is to grow
the tree to maximum size, and then prune it back.  One way to prune
leafs is to use a statistical test such as chi-square to measure when
the sample size and class distribution of the node is statistically
not warranted.

An alternate approach is reduced error pruning: prune away any test
that does not make the performance of the tree worse when it is
pruned.

TRAIN, TEST, AND VALIDATION SETS: An interesting issue when using post
pruning, is whether to use the train set, or some other test set, to
do the pruning.  The performance of a tree and the splits in a tree
usually look optimistic on the train set because the splits were
chosen to optimize performance on the train set.  Thus pruning using
the train set tends to remove fewer nodes than pruning using an
independent test set would.  But you shouldn't use the final test set
for pruning (because this means the final test set was used in the
training procedure, and thus is no longer a fully independent test
set), so you need to create a 2nd test set, usually called the
validation set, just for pruning.  This is unfortunate because
typically you'd like to grow the tree using as large a data set as
possible. Having to hold aside a validation set sometimes can hurt the
grown tree enough that the subsequent benefits of pruning do not
compensate for the loss in performance.  There is no simple answer to
this problem (yet).

=====================================================================

MISSING VALUES

We'll discuss missing values more later in the course.  For now, it is
sufficient to know that DTs deal reasonably well with missing values:
when a case reaches a branch in the DT where it is missing the value
of the test, the case can be sent down the branch that most of the
other cases went down, or it can be sent down one branch, with that
branch picked randomly with probability equal to the probability that
other cases not missing values went down, or it can be split into
fractional cases, each fraction being equal to the probability with
which other cases went down the branches.

=====================================================================

PACKAGES:

ID3 was an early research implementation of decision trees.  It is
still around if you want code to work from.

CART (Classification and Regression Trees) is a decision tree package
created by statisticians that is popular in research communities with
a stats bias.  The book for CART is an excellent introduction to the
issues associated with growing decision trees.

C4.5 and C5 are packages created by Ross Quinlan and are the standard
in the machine learning community.  C5 can generate rules as well as
decision trees.

IND (INDuce) is a package available from NASA created by Wray Buntine.
It is unsupported C code designed to run under Unix.  I use this
package in some of my research, and we may use it for the homework on
decision trees.

=====================================================================

ADVANTAGES OF DECISION TREES:

 - train fast
 - evaluate fast
 - compact models
 - intelligible if small
 - do feature selection
 - don't use all features
 - experts understand/accept them
 - easy to convert to rules
 - can handle missing values

DISADVANTAGES OF DECISION TREES:

 - not good at regression (predicting continuous values)
 - not good at non-axis parallel splits
 - trees for problems with continuous attributes can be large
 - large trees are not intelligible
 - split ordering often counterintuitive to experts
 - not good at learning from many inputs (e.g., pixels)