Honest Decision Trees
Jeffrey M. Derstadt
Johannes Gehrke
{dobra, derstadt, johannes}@cs.cornell.edu
BOOM 2001
Project Abstract
As data mining technology matures and the technology transfer of scalable algorithms for the most prominent data mining models has made the transition into products, we believe that it is time to address important problems in the functionality of the second generation of data mining systems. This paper addresses the fundamental problem of bias in split attribute selection in classification tree construction.

An attribute selection criterion is unbiased if its choice of a split attribute is determined only by the strength of the dependency between the split attribute and the class label attribute; otherwise the criterion is biased. If an attribute selection criterion is biased, it might not return the "best" split attribute at a node of the classification tree, but rather an attribute with properties unrelated to its dependence on the class label (for example, a large domain) that "confuse" the split criterion.

Our first contribution is a thorough experimental and theoretical study showing that the most popular split selection criteria are biased towards attributes with a large domain. Our second contribution is a general methodology that provably eliminates this type of bias for any split selection criterion while preserving the good properties of the criterion. We describe scalable algorithms for efficient implementations of our methodology that make its application computationally efficient, and we demonstrate through a careful experimental study that our method has excellent behavior in practice.
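The large-domain bias described above can be seen with a small simulation (a sketch of our own, not code from the paper): both attributes below are generated independently of the class label, so an unbiased criterion should have no systematic preference between them, yet plain information gain almost always ranks the 32-valued attribute above the 2-valued one.

```python
import math
import random

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(attr_values, labels):
    """Entropy reduction from splitting the data on attr_values."""
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

random.seed(0)
trials, n, wins = 200, 100, 0
for _ in range(trials):
    labels = [random.randint(0, 1) for _ in range(n)]   # class label, pure noise
    small = [random.randint(0, 1) for _ in range(n)]    # 2-valued attribute, independent of class
    large = [random.randint(0, 31) for _ in range(n)]   # 32-valued attribute, independent of class
    if information_gain(large, labels) > information_gain(small, labels):
        wins += 1
print(f"large-domain attribute wins in {wins / trials:.0%} of trials")
```

Because the 32-valued attribute fragments the data into many small, nearly pure partitions by chance alone, its apparent gain is inflated; this is exactly the kind of bias the methodology in the paper is designed to correct.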