# Decision Tree Construction

## Bias Correction in Classification Tree Construction

Decision trees are one of the most widely used data mining models, and
decision tree construction has a long history in machine learning, statistics,
and pattern recognition. One of the main advantages of decision trees is that
the resulting data mining model (the tree) can be easily understood by the data
mining analyst. Recent studies have shown that the variable selection process in
decision trees is *biased*, i.e., the variable selected at a node of the tree
might not actually be the variable most relevant to the class label at that
point. We have developed a computationally efficient, generic method that takes
any traditional (biased) split selection method and generates an unbiased split
selection method.
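
To make the bias concrete, here is a minimal, self-contained sketch (assuming
NumPy; an illustration, not the paper's code): with a class label that is
independent of both predictors, the raw *gini* gain still tends to prefer a
high-cardinality predictor over a binary one, simply because more categories
offer more ways to fit noise in the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(y):
    """Gini impurity of a 0/1 label vector."""
    p = y.mean()
    return 2.0 * p * (1.0 - p)

def gini_gain(x, y):
    """Impurity decrease of the multiway split induced by categorical x."""
    total = gini(y)
    for v in np.unique(x):
        mask = x == v
        total -= mask.mean() * gini(y[mask])
    return total

n, trials = 100, 2000
wins = 0
for _ in range(trials):
    y = rng.integers(0, 2, n)        # class label, pure noise
    x_bin = rng.integers(0, 2, n)    # binary predictor, independent of y
    x_many = rng.integers(0, 16, n)  # 16-category predictor, also independent
    if gini_gain(x_many, y) > gini_gain(x_bin, y):
        wins += 1

print(f"high-cardinality variable wins {wins / trials:.0%} of the time")
```

On typical runs the 16-category noise variable outscores the binary noise
variable in the vast majority of trials; this size-of-domain effect is exactly
what the definition of unbiasedness below rules out.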

Our work addresses the problem of bias in split variable selection in classification tree construction. A split criterion is unbiased if the selection of a split variable
*X* is based only on the strength of the dependency between *X* and the class label, regardless of other characteristics (such as the size of the domain of
*X*); otherwise the split criterion is biased. Our work makes the following four contributions:

- We give a definition that allows us to quantify the extent of the bias of a split
criterion;
- We show that the p-value of **any** split criterion is itself a nearly
unbiased split criterion;
- We give theoretical and experimental evidence that the correction is
successful;
- We demonstrate the power of our method by correcting the bias of the
*gini* gain, as sketched below.
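
One generic way to obtain such a p-value is a Monte Carlo permutation test;
the sketch below (reusing the `gini_gain` helper above) is one possible
realization of the idea, not necessarily the paper's construction. It scores a
split variable by how extreme its observed *gini* gain is relative to gains
computed after shuffling the class label, which destroys any real dependency
while preserving the predictor's cardinality.

```python
import numpy as np

def split_p_value(x, y, n_perm=999, rng=None):
    """Monte Carlo p-value of gini_gain(x, y) under independence of x and y."""
    rng = rng or np.random.default_rng()
    observed = gini_gain(x, y)
    # Shuffling y preserves the marginal distributions (and hence the
    # cardinality effect) but removes any true x-y dependency.
    hits = sum(gini_gain(x, rng.permutation(y)) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)  # add-one rule avoids a zero p-value
```

Selecting the variable with the *smallest* p-value, rather than the largest
raw gain, puts predictors with different domain sizes on an equal footing: in
the noise experiment above, the binary and the 16-category variable now win
the comparison about equally often.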