One of my readers sent me a data mining question last week. He said that he had not been able to find any discussion about the statistical significance of a rule that comes out of a decision tree classifier. As he said,

"for example if I get the following rule out of a decision tree
IF AGE>32 AND AGE<=35
AND MARRIED 'T' AND
NUM_OF_CHILDREN=1.5 AND
SEX='MALE'
How can I be certain that this is not a spurious pattern happened there by chance? More importantly how can I *quantify* its statistical significance".

Using data mining to develop decision trees, especially for customer segmentation, is of course well established. He raises an interesting question though. Regular readers will know that this pushes my know-how to the very limits (indeed over them), so I asked some of the analytic brain trust here at Fair Isaac (thanks Stuart, Mac).

To our (their) knowledge there is no formal test of statistical significance for a tree (or for a rule extracted from one). To have such a test, one would have to understand the distributional theory for a tree. One can express the statistical significance of (say) the difference between two means because the distribution of the mean is well understood. This is not so for trees.

The best one can do is obtain an unbiased estimate of the error rate for the tree. A low error rate on data the tree has never seen provides confidence that the rules expressed by the tree are not spurious. The best way to get this estimate is with an independent test set: real data, drawn from the same population as the data being analyzed, that has not been used to develop or refine the tree. If an independent test set is unavailable, simple N-fold cross-validation will give a nearly unbiased estimate of the error rate.

Alternatively you could take a Bayesian approach. For example, you can compute the probability that the dataset you have was generated by the model. The idea is that if that probability is high, then the model accurately encodes the data.
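For readers who want to see the cross-validation idea in concrete terms, here is a minimal sketch in plain Python. It is an illustration under my own assumptions, not anything Fair Isaac ships: the one-split "tree" (`fit_stump`), the helper `k_fold_error`, and both toy datasets are hypothetical names and data invented for the example. Each record is scored only by a model that was fitted without it, which is what makes the error estimate trustworthy.

```python
import random

def fit_stump(train):
    """Fit a one-split 'tree' on (x, label) pairs by minimizing training errors."""
    best = None
    for t in sorted({x for x, _ in train}):
        for above in (0, 1):  # label predicted when x > t
            errs = sum((above if x > t else 1 - above) != y for x, y in train)
            if best is None or errs < best[0]:
                best = (errs, t, above)
    _, t, above = best
    return lambda x: above if x > t else 1 - above

def k_fold_error(rows, k, fit, seed=0):
    """Estimate the error rate; every row is scored by a model that never saw it."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]  # k disjoint folds
    errors = 0
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        predict = fit(train)
        errors += sum(predict(x) != y for x, y in test)
    return errors / len(rows)

# Hypothetical toy data: one set with a real pattern, one with coin-flip labels.
rng = random.Random(1)
signal = [(x, 1 if x >= 50 else 0) for x in range(100)]
noise = [(x, rng.randint(0, 1)) for x in range(200)]

signal_err = k_fold_error(signal, k=5, fit=fit_stump)
noise_err = k_fold_error(noise, k=5, fit=fit_stump)
print("cross-validated error, real pattern:", signal_err)
print("cross-validated error, pure noise:  ", noise_err)
```

The comparison is the point: on the noise data the stump still finds a "rule" that looks good on the records it was trained on, but its cross-validated error stays close to what guessing would give. A rule whose held-out error is no better than guessing is a good candidate for a spurious pattern.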

So, now you know.
