A reader's data mining question answered

One of my readers sent me a data mining question last week. He said that he had not been able to find any discussion about the statistical significance of a rule that comes out of a decision tree classifier. As he said,
"for example if I get the following rule out of a decision tree
IF AGE>32 AND AGE<=35
AND MARRIED 'T' AND
NUM_OF_CHILDREN=1.5 AND
SEX='MALE'
How can I be certain that this is not a spurious pattern happened there by chance? More importantly how can I *quantify* its statistical significance".
Using data mining to develop decision trees, especially for customer segmentation, is of course well established. He raises an interesting question, though. Regular readers will know that this pushes my know-how to its very limits (indeed past them), so I asked some of the analytic brain trust here at Fair Isaac (thanks Stuart, Mac).
To our (their) knowledge there is no formal test of statistical significance for a tree (or for a rule extracted from one). To have such a test, one would need a distributional theory for trees. One can express the statistical significance of (say) the difference between two means because the distribution of the mean is well understood; the same is not true for trees.
The best one can do is obtain an unbiased estimate of the error rate for the tree. A low error rate on data the tree has never seen gives you confidence that the rules it expresses are not spurious. The best way to get this estimate is with an independent test set, that is, real data drawn from the same population as the data being analyzed but not used to develop or refine the tree. If an independent test set is unavailable, simple N-fold cross-validation will give a nearly unbiased estimate of the error rate.
Alternatively, you could take a Bayesian approach. For example, you can compute the probability that the dataset you have was generated by the model. The idea is that if this probability is high, then the model accurately encodes the data.
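To make the test-set and cross-validation ideas concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the data are synthetic (a label driven by AGE > 32 with 10% noise), the "tree" is a single-split stump, and the helper names (`fit_stump`, `error_rate`, `cross_validated_error`, `make_data`) are hypothetical, not part of any product or library.

```python
import random

def fit_stump(rows):
    """Fit a one-split 'tree' on (age, label) pairs: predict 1 when age > threshold."""
    best_threshold, best_errors = None, len(rows) + 1
    for threshold in sorted({age for age, _ in rows}):
        errors = sum((age > threshold) != bool(label) for age, label in rows)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def error_rate(threshold, rows):
    """Fraction of rows misclassified by the rule 'age > threshold => 1'."""
    wrong = sum((age > threshold) != bool(label) for age, label in rows)
    return wrong / len(rows)

def cross_validated_error(rows, n_folds=5, seed=0):
    """N-fold cross-validation: every row is held out exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    total_wrong = 0
    for i, test_fold in enumerate(folds):
        # Train on all folds except the i-th, then score on the i-th.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        threshold = fit_stump(train)
        total_wrong += sum((age > threshold) != bool(label) for age, label in test_fold)
    return total_wrong / len(rows)

def make_data(n, seed):
    """Synthetic data (assumption): label is 1 when age > 32, flipped 10% of the time."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(18, 65)
        label = int(age > 32)
        if rng.random() < 0.1:
            label = 1 - label
        rows.append((age, label))
    return rows

data = make_data(500, seed=1)

# Independent test set: the last 100 rows are never used to fit the rule.
train, test = data[:400], data[400:]
threshold = fit_stump(train)
holdout_error = error_rate(threshold, test)

# No separate test set: estimate the error by 5-fold cross-validation instead.
cv_error = cross_validated_error(data, n_folds=5)

print(f"learned rule: AGE > {threshold}")
print(f"hold-out error: {holdout_error:.3f}")
print(f"5-fold CV error: {cv_error:.3f}")
```

If the rule were a chance artifact of the training data, its held-out error would sit near what guessing achieves; an error close to the noise level suggests the rule reflects real structure. This is an estimate of predictive error, not a formal significance test.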
So, now you know.