This post is the second of a three-part series highlighting ways in which the set-up of a score validation can greatly influence model performance results, and sometimes inflate them. Our goal is to help potential users of scores and data evaluate their options and understand when they might be looking at skewed results. The first post in this series covered how validating scores on narrow segments of the population, especially segments defined on the basis of one of the scores being evaluated, can distort model performance metrics. In this post, we evaluate a related scenario: selection bias.
Conducting a validation to compare two or more scoring systems being evaluated for use in credit decisioning is generally a straightforward exercise. A key guideline for such analyses is that the data sample be representative of the population on which the models are expected to be applied going forward. However, it’s often the case that the available data have been “truncated” because a prior screening process was applied to the population of interest.
As an illustration, say you’re in charge of credit risk for credit card originations in your organization. You have been using FICO® Score 8 and adhered fairly strictly to a minimum FICO Score of 680 in determining whether to approve applicants. Now you are validating the effectiveness of FICO Score 8, while also evaluating a competing credit bureau score that could potentially replace it going forward. The typical approach would be to select all of your accounts that have activated, are of similar vintage, and have sufficient maturity, say 18-24 months on books, to evidence their performance as good-paying or bad-paying accounts, and tag them accordingly. The FICO Score 8 from the time of decisioning is available in your database; the competing score would need to be appended by doing a retrospective score pull from your credit bureau as of the time of the booking decision. With this information, various reports and metrics can be generated to exhibit each score’s power, as of the credit decision, to predict the future good/bad outcomes of accounts in the population. Metrics could include divergence, the Gini coefficient, the K.S. (Kolmogorov-Smirnov) statistic, the area under the ROC (Receiver Operating Characteristic) curve, and others. The score with the higher value on your favored metrics would be considered the “winner.”
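To make the metrics concrete, here is a minimal Python sketch of two of them, the K.S. statistic and the Gini coefficient, computed from score values and good/bad tags. The function names and the brute-force pairwise Gini computation are illustrative only, not any particular vendor’s implementation:

```python
import numpy as np

def ks_statistic(scores, is_bad):
    """K.S. statistic: the maximum gap between the cumulative score
    distributions of the bad-paying and good-paying accounts."""
    scores = np.asarray(scores, dtype=float)
    is_bad = np.asarray(is_bad, dtype=bool)
    thresholds = np.unique(scores)
    cum_bad = np.array([(scores[is_bad] <= t).mean() for t in thresholds])
    cum_good = np.array([(scores[~is_bad] <= t).mean() for t in thresholds])
    return float(np.max(np.abs(cum_bad - cum_good)))

def gini_coefficient(scores, is_bad):
    """Gini = 2*AUC - 1, where AUC is the probability that a randomly chosen
    good account outscores a randomly chosen bad account (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    is_bad = np.asarray(is_bad, dtype=bool)
    bad, good = scores[is_bad], scores[~is_bad]
    wins = (good[:, None] > bad[None, :]).sum()   # O(n_good * n_bad): fine for a sketch
    ties = (good[:, None] == bad[None, :]).sum()
    auc = (wins + 0.5 * ties) / (len(good) * len(bad))
    return 2.0 * auc - 1.0
```

A perfectly separating score yields a K.S. and Gini of 1.0; a score no better than random sits near 0 on both.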
The flaw here is that the population on which the test is undertaken consists only of booked accounts that were screened using the incumbent FICO® Score 8. Having used a fairly strict 680 cut-off, the validation sample is truncated: it contains no, or very few, accounts with FICO Score 8 values below 680. The competing score under consideration, on the other hand, has the advantage that its full score range remains represented in the population, to the extent that it is uncorrelated with the incumbent score. The lower the acceptance rate from having used the incumbent score in originations decisions, the more truncated the sample, and the more muted the metrics observed on the incumbent score are likely to be.
This truncation bias phenomenon was recently presented in a paper published by FICO entitled “Homecourt Disadvantage: Truncation Bias and the Art of Comparing Consumer Credit Scoring Models”, by Ethan Dornhelm, Paul Panichelli, and Tom Parrent. The research results substantiated the truncation bias effect in two ways:
- Regardless of the model being evaluated (incumbent or “challenger”), the model’s predictiveness metrics are observed to be lower on a truncated sample than they would be on a non-truncated sample. This is because when lower scoring subjects are dropped from the validation, the remaining sample is more homogeneous and contains less explanatory information to allow either model to separate on the future good/bad outcome measure.
- Not only do the predictiveness metrics of each score decline when measured on the truncated sample; the incumbent score, the one used to effect the truncation, is the one whose metrics decline more. The model that the truncation is based on (i.e., the model that was used in decisioning) is especially disadvantaged when compared with models that were not previously used in decisioning. In simple terms, the incumbent score has already knocked out future bad accounts falling below the score cutoff in use. The challenger score may not have identified those same future bad accounts and in fact may be a weaker model, but we would never know, because the bad accounts identified by the incumbent model are no longer part of the analysis. The challenger benefits from the strength of the incumbent’s ability to identify bad accounts falling below the existing threshold.
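Both effects can be reproduced in a small simulation. The sketch below is purely illustrative and is not the methodology of the FICO paper: the latent-quality model, the noise levels, the bad-rate curve, and the 50% cut-off are all assumptions chosen to make the pattern visible. A “strong” incumbent and a noisier challenger are both validated on the full applicant pool and again on a sample truncated at the incumbent’s median; both Ginis fall on the truncated sample, and the incumbent’s falls further.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Latent credit quality drives both scores and the good/bad outcome.
quality = rng.normal(size=n)
incumbent = quality + 0.5 * rng.normal(size=n)    # less noise: the stronger score
challenger = quality + 0.8 * rng.normal(size=n)   # more noise: the weaker score
is_bad = rng.random(n) < 1.0 / (1.0 + np.exp(2.0 * quality))  # low quality -> bad

def gini(score, bad):
    """Gini = 2*AUC - 1, with AUC via the Mann-Whitney rank-sum formula."""
    good = ~bad
    ranks = score.argsort().argsort() + 1.0        # 1-based ranks (scores are continuous)
    n_good, n_bad = good.sum(), bad.sum()
    u = ranks[good].sum() - n_good * (n_good + 1) / 2.0
    return 2.0 * (u / (n_good * n_bad)) - 1.0

# Full "through-the-door" sample: the stronger incumbent wins.
g_inc_full = gini(incumbent, is_bad)
g_chl_full = gini(challenger, is_bad)

# Truncated sample: keep only applicants above the incumbent's median (50% cut).
kept = incumbent >= np.quantile(incumbent, 0.5)
g_inc_trunc = gini(incumbent[kept], is_bad[kept])
g_chl_trunc = gini(challenger[kept], is_bad[kept])
```

On this simulated population, the weaker challenger can even appear to “beat” the stronger incumbent on the truncated sample, despite losing on the full one.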
Beyond simply having an awareness of selection bias, there are several approaches to model validation that can help avoid or at least mitigate the issue:
- Using a non-truncated sample:
- This could be achieved if none of the scores being evaluated were used in decisioning the accounts: for example, an organization that is new to scoring, or a comparison between two new score versions intended to replace an older version that was used in the past but is not a candidate for use going forward. In the latter case, keep in mind that if one candidate score is more correlated with the old incumbent score than the other, it may still be disadvantaged by the truncation bias effect.
- Booking a small control group of accounts that fall below the cut-off score. The merits of this approach to avoid truncation bias need to be weighed against the risks and costs of booking low scoring applicants.
- Reject inference on applicants that were rejected due to the incumbent score cut-off: This entails including rejected applicants in the validation sample, and imputing whether they would have been good or bad paying accounts had they been booked. Reject inference requires careful application and has its own caveats, but if your organization has the expertise to undertake this with an understanding of the caveats, it provides a statistically viable approach for comparing scores on a “through-the-door” applicant population.
- Evaluating scores for account management purposes may be less subject to truncation bias since the incumbent score may be used more for strategic decisions like credit line increases or renewal terms, rather than for account closures.
- Mimicking the bias on the non-incumbent score: If the cut-off on the incumbent score led to rejecting a given percentage of the applicant population, determine the equivalent cut-off score on the other score(s) being considered, and truncate the population accordingly using the other score’s cut-off before conducting the validation and comparison. However, while this approach puts the scores being compared on an equal footing, the population being evaluated is now even less representative of the through-the-door population that would be the ideal basis for the test.
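For the reject inference approach above, one simple variant is “parceling”: estimate the bad rate in each score band from booked accounts, then assign that rate to rejected applicants falling in the same band as a fractional “expected bad” weight. This is a hedged illustration only; real reject inference requires more care (notably, rejects scoring below any booked account force an extrapolation), and the quantile banding scheme here is an assumption:

```python
import numpy as np

def parcel_reject_inference(accept_scores, accept_bad, reject_scores, n_bands=10):
    """Parceling-style reject inference sketch: estimate each score band's
    bad rate from booked accounts, then assign rejects in the same band
    that bad rate as a fractional 'expected bad' weight."""
    accept_scores = np.asarray(accept_scores, dtype=float)
    accept_bad = np.asarray(accept_bad, dtype=bool)
    reject_scores = np.asarray(reject_scores, dtype=float)
    # Band edges from the booked accounts' score quantiles, widened so that
    # rejects scoring outside the booked range still land in an end band.
    edges = np.quantile(accept_scores, np.linspace(0.0, 1.0, n_bands + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    bands_accept = np.digitize(accept_scores, edges[1:-1])
    bands_reject = np.digitize(reject_scores, edges[1:-1])
    band_bad_rate = np.array([
        accept_bad[bands_accept == b].mean() if np.any(bands_accept == b) else np.nan
        for b in range(n_bands)
    ])
    return band_bad_rate[bands_reject]  # expected bad probability per reject
```

The returned weights let rejected applicants re-enter the validation sample as part-good, part-bad observations, approximating the through-the-door population.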
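The last approach above, mimicking the truncation on the non-incumbent score, reduces to finding the challenger-score value at the same rejection quantile as the incumbent’s cut-off. A minimal sketch, where the function name and inputs are illustrative and the two score arrays are assumed to cover the same applicant population:

```python
import numpy as np

def equivalent_cutoff(incumbent_scores, challenger_scores, incumbent_cutoff):
    """Find the challenger score value that would have rejected the same
    share of the applicant population as the incumbent's cut-off did."""
    incumbent_scores = np.asarray(incumbent_scores, dtype=float)
    challenger_scores = np.asarray(challenger_scores, dtype=float)
    reject_rate = float(np.mean(incumbent_scores < incumbent_cutoff))
    return float(np.quantile(challenger_scores, reject_rate))
```

Each score’s sample would then be truncated at its own cut-off before any comparison metrics are computed, putting the two on an equal (if still unrepresentative) footing.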
Avoiding truncation bias is not always easy, but it is a best practice among sophisticated lenders, and taking these steps is critical to reaching the right conclusion: ignoring truncation bias can easily lead to spurious results, weaker model selection, and costly losses. As in a champion-challenger boxing match, the challenger may need to perform much more than marginally better to displace the champion. Among credit scores, the incumbent may be a proven product, with demonstrated through-the-cycle effectiveness and well-established infrastructure and strategies developed under the assumption that it would continue to be used. The expense of migrating to a new score might only be offset by a substantial predictive improvement relative to the incumbent in your testing exercise. With truncation bias in play, a marginal “win” by the challenger score may not ultimately prove to be a win for ROI at all. That is the bottom line: no ROI, no reason to move.