Decision trees are one of the most widely used data mining models, and decision tree construction has a long history in machine learning, statistics, and pattern recognition. One of the main advantages of decision trees is that the resulting data mining model (the tree) can be easily understood by the data mining analyst. Recent studies have shown that the variable selection process in decision trees is biased, i.e., the predictor variable chosen at a node of the tree might not actually be the predictor variable that is most relevant at that node. We have developed a computationally efficient, generic method that takes any traditional (biased) split selection method and turns it into an unbiased split selection method.
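To make the bias concrete, the following sketch (a hypothetical illustration, not the method developed in this work) simulates a setting in which the class label is independent of every predictor. A biased criterion such as Gini gain nonetheless tends to prefer the predictor with the larger domain, simply because a variable with more distinct values can fragment the data more finely:

```python
import random
from collections import Counter

def gini_gain(xs, ys):
    """Reduction in Gini impurity from partitioning ys by the values of xs."""
    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())
    weighted = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        weighted += len(subset) / len(ys) * gini(subset)
    return gini(ys) - weighted

random.seed(0)
wins = Counter()
trials, n = 500, 100
for _ in range(trials):
    # Class label is independent of both predictors: neither carries signal.
    y = [random.randint(0, 1) for _ in range(n)]
    x_small = [random.randint(0, 1) for _ in range(n)]    # 2-valued predictor
    x_large = [random.randint(0, 31) for _ in range(n)]   # 32-valued predictor
    winner = "large" if gini_gain(x_large, y) > gini_gain(x_small, y) else "small"
    wins[winner] += 1

print(wins)  # the 32-valued predictor is selected in the vast majority of trials
```

Even though neither predictor is informative, the many-valued variable wins almost every comparison, which is exactly the kind of domain-size dependence that an unbiased split criterion must eliminate.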
Our work addresses the problem of bias in split variable selection in classification tree construction. A split criterion is unbiased if the selection of a split variable X is based only on the strength of the dependency between X and the class label, regardless of other characteristics of X (such as the size of its domain); otherwise the split criterion is biased. Our work makes the following four contributions: