The purpose of this paper is to give non-technical readers background on several of the new and popular prediction and decision technologies, and to remind more technical readers of some of their key strengths and weaknesses. No attempt is made to compare the techniques directly, since their features tend to be application-dependent. Nor is this information intended to be an exhaustive or complete discussion of each technique.
A Guide for the Non-Technical Reader
At FICO we sometimes classify analytic techniques as belonging to one of four areas: exploratory data analysis, predictive modeling, optimization and decision analysis. Many of the underlying technologies described in this paper are not confined to one of these categories, and may in fact be used in multiple areas. As a result, the paper itself does not impose a classification scheme on the techniques discussed. Rather, we have simply listed them in alphabetical order for ease of reference. For this introduction, however, we have assigned each technique to one of the four categories described below in order to indicate its most common area of use.
Exploratory analysis (or undirected data mining) seeks to establish relationships in the data to gain insight. Within this exploration, no specific outcome is assumed. An example of this group of techniques would be cluster analysis, used to develop a strategic marketing segmentation. Other techniques in this category are factor and principal component analyses.
Predictive modeling (sometimes called directed data mining) seeks to identify and mathematically represent underlying relationships in historical data, in order to explain the data and make predictions or classifications about new data. Predictive models are frequently used as ways to summarize large quantities of data as well as to increase the value of data. In the financial services, telecommunications, direct marketing and e-commerce industries, they are commonly used as inputs to decisions. An example would be the use of logistic regression to classify prospects as good or bad credit risks. Other techniques in this category are boosting, collaborative filtering, discrete choice modeling, discriminant analysis, scorecards, log-linear models, neural networks, pattern recognition, regression, support vector machines, survival analysis and tree modeling methods. Expert systems and RFM also fit into this category, but are different in that they can be derived judgmentally without historical data.
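As a rough illustration of the credit risk example above, the following minimal sketch fits a one-variable logistic regression by gradient descent and uses it to classify prospects. The data, the variable names and the 0.5 cutoff are all invented for illustration; this is not FICO's method, and a production model would use many more variables and a tested statistical library.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit P(bad) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for each historical record
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical historical data: number of past delinquencies per customer,
# and whether that customer later turned out to be a bad risk (1) or good (0).
delinquencies = [0, 0, 1, 1, 2, 3, 3, 4, 5, 6]
went_bad = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(delinquencies, went_bad)

def predict_bad_probability(x):
    """Estimated probability that a prospect with x delinquencies is a bad risk."""
    return sigmoid(w * x + b)

def classify(x, cutoff=0.5):
    """Label a prospect 'bad' when the estimated probability reaches the cutoff."""
    return "bad" if predict_bad_probability(x) >= cutoff else "good"

print(classify(0), classify(6))
```

The model summarizes the historical records in two numbers (w and b), which can then score prospects who were not in the original data.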
Optimization techniques seek to efficiently and effectively search across a set of possible solutions to a problem (either constrained or unconstrained) with the goal of maximizing or minimizing a particular mathematical function. Techniques in this category are genetic algorithms, linear programming and non-linear programming. Although we do not highlight them within the sections, several of the predictive modeling and decision analysis techniques rely on optimization techniques to reach their results.
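To make the idea of searching a set of possible solutions concrete, here is a small sketch of a constrained maximization. It uses exhaustive search over whole-number allocations, which is practical only for tiny problems; it is not linear programming or a genetic algorithm, both of which search far more efficiently. The scenario (two hypothetical marketing offers, a budget and volume limits) and all of the numbers are assumptions made up for this example.

```python
def best_allocation(profit_a, profit_b, cost_a, cost_b, budget, max_a, max_b):
    """Return (profit, units_a, units_b) maximizing total profit within the budget.

    Every whole-number combination of the two offers is tried; combinations
    that exceed the budget constraint are skipped.
    """
    best = None
    for a in range(max_a + 1):
        for b in range(max_b + 1):
            if a * cost_a + b * cost_b > budget:
                continue  # violates the budget constraint
            profit = a * profit_a + b * profit_b
            if best is None or profit > best[0]:
                best = (profit, a, b)
    return best

# Hypothetical inputs: offer A earns 30 per unit and costs 2,
# offer B earns 50 per unit and costs 4, with a total budget of 100.
result = best_allocation(
    profit_a=30, profit_b=50,
    cost_a=2, cost_b=4,
    budget=100, max_a=40, max_b=20,
)
print(result)
```

Here the function being maximized is total profit, and the constraints are the budget and the volume limits on each offer.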
Decision analysis goes one step further. By modeling the decision itself, it allows the optimal decision to be identified. The purpose of decision analysis is to help decision makers make better decisions in complex situations, usually under uncertainty. Components of decision analysis discussed in this paper include key concepts and tools, graphical decision models, multiple objective decision analysis, sequential decisions and utility theory. Since decision analysis delivers the most value when coupled with active, continuous learning from observations, well-planned or designed data is critical to building a robust decision model. For this reason, we draw your attention to the section on experimental design, which addresses the importance of, and the approach to, well-planned data collection.
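A minimal sketch of what modeling the decision itself can look like: each available action is described by its possible outcomes, their probabilities and their payoffs, and the action with the highest expected payoff is chosen. The approve/decline scenario, the payoff amounts and the function names are hypothetical, and the sketch uses plain expected value rather than the utility functions or multiple objectives covered in the paper.

```python
def expected_value(outcomes):
    """Probability-weighted average payoff of a list of (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)

def best_action(p_bad, gain_if_good=100.0, loss_if_bad=-500.0):
    """Choose between approving and declining an applicant.

    p_bad is the estimated probability that the applicant turns out bad.
    Declining is assumed to have a certain payoff of zero.
    """
    actions = {
        "approve": [(1.0 - p_bad, gain_if_good), (p_bad, loss_if_bad)],
        "decline": [(1.0, 0.0)],
    }
    values = {name: expected_value(outcomes) for name, outcomes in actions.items()}
    choice = max(values, key=values.get)
    return choice, values

low_risk_choice, low_risk_values = best_action(p_bad=0.1)
high_risk_choice, high_risk_values = best_action(p_bad=0.2)
print(low_risk_choice, high_risk_choice)
```

With these assumed payoffs, the preferred action changes once the probability of a bad outcome is high enough that the expected loss outweighs the expected gain.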
At FICO, our years of experience with noisy and biased data and business constraints have led us to value domain expertise and analytic experience as key components in the modeling and strategy optimization process. An analytic technique, in and of itself, works only with the empirical data provided. Often, however, there is more contextual information that should be incorporated, either through automated capture of business intelligence or by the imposition of operational constraints. Such contexts might include the source of the data; its past and future reliability; its deployment mechanism; its cost; and the potential legal, operational or customer relationship impact of using certain types of data or using certain criteria for a given decision.
FICO favors techniques that allow for the incorporation of prior knowledge beyond that provided in a particular dataset in order to create a solution of greater value. You will note that some of the strengths and weaknesses listed for each technique allude to this point. While other publications have criticized some technologies as naive, the scenarios they discuss frequently describe a naive analyst rather than a naive technique.
Each section is introduced with a brief one- to two-page discussion of the technique. Since you may not be familiar with some of the terms used in this paper, phrases in italics are defined in the glossary. Some of these definitions were written to clarify the terms as they are used here and ignore their broader interpretation.
To place the techniques in context, we have indicated some of their most common uses. When appropriate, we have noted particular business problems to which techniques have been applied successfully.
Strengths and Weaknesses
We have included strengths and weaknesses for the techniques, where appropriate, although these are not exhaustive lists. A weakness in one situation is rarely a strength in another, but a weakness (or strength) is often irrelevant for a particular application or set of data. For example, an inability to handle missing values is only a problem when there are missing values.
An ability to capture interactions in data is only a positive feature where these are suspected to exist. Multivariate normality assumptions are not a problem for linear regression if the data are, in fact, multivariate normal. Other issues to consider when evaluating analytic techniques include the use of categorical and/or continuous variables, the ease of interpretation of results, the robustness of solutions, the importance of sample size, the ability to handle multiple objectives and the ability to engineer solutions.
Revisions and Updates
We periodically (though not frequently) update this information, deleting topics that have become less relevant and adding new sections. In the current version, we have added new sections on Discrete-time Hazard Models, Ensemble Learning, Mathematical Programming and FICO’s new Xpress Optimization features, and have expanded the Decision Trees section.
We hope you find this information useful and welcome your feedback on future improvements.