Honest Decision Trees

Alin Dobra
Jeffrey M. Derstadt
Johannes Gehrke

{dobra, derstadt, johannes}@cs.cornell.edu

BOOM 2001

Cornell University


Project Abstract

As data mining technology matures and scalable algorithms for the most prominent data mining models make the transition into products, we believe that it is time to address important problems in the functionality of the second generation of data mining systems. This paper addresses the fundamental problem of bias in split attribute selection in classification tree construction. An attribute selection criterion is unbiased if its choice of a split attribute is determined only by the strength of the dependency between the split attribute and the class label attribute; otherwise the criterion is biased. A biased attribute selection criterion might not return the "best" split attribute at a node of the classification tree, but rather an attribute with properties unrelated to its dependence on the class label (for example, a large domain) that "confuse" the split criterion.
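To make this bias concrete, here is a minimal sketch (illustrative only, not code from the paper) that estimates information gain for synthetic attributes that are all statistically independent of a random class label. On a finite sample the estimated gain still grows with the attribute's domain size, which is exactly the kind of bias described above.

import random
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    # entropy of the labels minus the weighted entropy after splitting
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):
        subset = [y for x, y in zip(attr_values, labels) if x == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

random.seed(0)
n = 200
labels = [random.randint(0, 1) for _ in range(n)]  # class label: pure noise
for domain_size in (2, 8, 32):
    # each attribute is drawn independently of the label, so its true gain is 0
    attr = [random.randrange(domain_size) for _ in range(n)]
    print(domain_size, round(information_gain(attr, labels), 4))
# The printed gain typically increases with domain size even though every
# attribute carries zero information about the label.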

Our first contribution is a thorough experimental and theoretical study showing that the most popular split selection criteria are biased towards attributes with large domains. Our second contribution is a general methodology that provably eliminates this type of bias for any split selection criterion while maintaining the good properties of the criterion. We describe scalable algorithms that make applying our methodology computationally efficient, and we demonstrate through a careful experimental study that our method has excellent behavior in practice.
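The abstract does not spell out the correction itself, so the sketch below shows only one generic, well-known way to put criterion values of attributes with different domain sizes on a common scale: replace each attribute's raw criterion value with a permutation-based p-value under the null hypothesis that the attribute and the class label are independent. The names p_value and n_perm are illustrative, not identifiers from the paper, and the paper's actual construction may differ.

import random

def p_value(criterion, attr_values, labels, n_perm=500):
    # Fraction of label permutations whose criterion value meets or exceeds
    # the observed one; smaller values indicate stronger dependence.
    observed = criterion(attr_values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        random.shuffle(shuffled)  # destroys any attribute/label dependence
        if criterion(attr_values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

Choosing the split attribute with the smallest p-value (rather than the largest raw criterion value, e.g. from the information_gain function above) compares each attribute against its own null distribution, so an attribute with a large domain inflates its observed value and its null distribution alike and gains no advantage.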

 

Poster Slides

 
