Artificial intelligence (AI) can appear to be an odd juxtaposition of order and disorder: we direct the AI with algorithms, yet the system produces new insights, seemingly by magic. This two-part blog unpacks the mysteries of two very different AI techniques: supervised and unsupervised learning.
Supervised Learning: The Workhorse of AI
Most of the well-known applications of machine learning and computational AI involve supervised learning. The modeler amasses a vast set of existing data (e.g., financial transactions, internet photographs, or the texts of tweets) and a base-level “ground truth” outcome that is already known, perhaps in retrospect or by expensive human investigation.
Equipped with any number of computational algorithms, the scientist becomes the “supervisor” whose code trains the model to reproduce, in the lab, the known outcomes with a low probability of error. The models are then deployed to live a happy life scoring credit risk and fraud likelihood, finding pictures of Chihuahuas and muffins, or flagging insulting tweets. Technically, each model computes a probabilistically weighted predicted outcome that we believe to be like those outcomes from the training examples. The state of the art for supervised learning is now well established; you can choose from dozens of comprehensive predictive analytics and neural network packages.
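To make the "supervisor" idea concrete, here is a minimal sketch of supervised training: a tiny logistic regression fitted by gradient descent in plain Python. The transaction data, feature choices, learning rate and epoch count are all made up for illustration; a production model would use one of the established packages mentioned above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gap between prediction and the known outcome
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict_proba(weights, bias, x):
    """Return the model's estimated probability of the positive outcome."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical training set: [amount in thousands, foreign-merchant flag],
# with the known outcome 1 = fraud, 0 = legitimate.
X = [[0.1, 0], [0.3, 0], [0.2, 1], [0.4, 0],
     [2.5, 1], [3.0, 1], [2.8, 0], [3.5, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic(X, y)
p_large = predict_proba(weights, bias, [3.2, 1])  # resembles the fraud examples
p_small = predict_proba(weights, bias, [0.2, 0])  # resembles the legitimate ones
print(p_large, p_small)
```

The known outcomes in `y` are what make this supervised: every update is driven by the gap between the prediction and the ground truth.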
Unsupervised Learning: Inferences in the Absence of Outcomes
But what if there is no set of “true outcomes” known, or the ones at hand are restricted in quality or quantity? What can machine learning do for us then? This is the domain of the far trickier unsupervised learning, which draws inferences in the absence of outcomes.
Good unsupervised learning requires more care, judgement and experience than supervised learning, because there is no clear, mathematically representable goal that the computer can blindly optimize toward without understanding the underlying domain.
The Challenge of Outlier Detection
A central task within unsupervised modeling is outlier detection: Which examples are most unlike most of their peers? Outlier detection and transaction fraud scoring provide an easy illustration:
- Which customers request money transfers with patterns substantially different from most of their peers?
- Which medical providers bill insurance for sets of claims most unlike their peers?
- Which transactions on an individual payment card are most different from a customer’s usual behaviors?
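As a small illustration of the last question, the sketch below scores each transaction on one card by how far its amount sits from that card's median, measured in units of the median absolute deviation (MAD). This is a generic robust statistic chosen for the example, and the amounts are invented; it is not meant to represent any particular production method.

```python
import statistics

def robust_outlier_scores(amounts):
    """Score each amount by its distance from the median, in MAD units."""
    median = statistics.median(amounts)
    mad = statistics.median([abs(a - median) for a in amounts])
    if mad == 0:
        # Degenerate case: at least half the values equal the median.
        return [0.0 if a == median else float("inf") for a in amounts]
    return [abs(a - median) / mad for a in amounts]

# Hypothetical purchase history for a single payment card.
history = [12.5, 40.0, 18.2, 25.0, 31.9, 22.4, 950.0, 15.3]

scores = robust_outlier_scores(history)
most_unusual = max(range(len(history)), key=lambda i: scores[i])
print(history[most_unusual], round(scores[most_unusual], 1))
```

Note that no outcome labels appear anywhere: the score only says that a transaction is unlike this customer's usual behavior, not that it is fraudulent.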
Because there are far fewer established principles, and less instructional material and widely available software, than for classic supervised modeling, there are even more analytic “gotchas” that call for deep experience and judgement from the analytic scientist. Difficulties and considerations in outlier detection include:
- The need to define a metric or distance. Many techniques require defining a “metric” or “distance” function between pairs of observations. One problem is that the individual components of each observation’s feature vector have qualitatively different meanings: how can one balance adding or subtracting apples and oranges, and kumquats and kangaroos? Often this is done ad hoc or, unfortunately, unintentionally, because the underlying algorithm silently assumes a metric. What should be done in the real-life scenario of a combination of quantitative and categorical features? Supervised modeling can often be blissfully ignorant of this problem, since the quantitative optimization against known targets tends to scale and transform each feature automatically, to the degree that it contributes predictive value.
In an unsupervised context, the choice of explicit metric, which is imposed by the analytic scientist, has a major influence on the scoring of outliers. Additionally, in a high-dimensional space, our intuitions about the properties of neighborhoods and neighborliness, derived from our three-dimensional physical experience, are very misleading: a randomly selected point in the training dataset is often not much farther from a given point than that point’s nearest neighbors. At FICO, we believe outlier statistics derived under these intuitions ought to be approached with caution.
- Computational burdens on scoring. How expensive is it, in terms of computation and in memory, to score new observations with the outlier model? Do any complex data structures need to be created for scoring? Do we need to retain a significant fraction (or all) of the training data set to score a new observation in production?
- Calibration and interpretation of score. If we have a number representing “degree of outlierness,” what does it mean? Does it have a well-behaved, approximately continuous score distribution under the natural data set, or is the distribution irregular, with significant delta functions or gaps? What happens when the dimension of the training set changes, i.e. are there major systematic trends?
- Feature cross-correlation. This is a subtle yet critical problem that gets little attention in the field. Frequently, each underlying feature is designed to address a particular aspect of the problem domain, but a significant number of related, and therefore correlated, features often cover some conceptual axes of the problem, while other aspects of behavior are represented by only a few features each. The effect on outlier scores may be severe. Can one balance this automatically, in a principled manner?
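Returning to the metric question raised in the first bullet, one common way to put quantitative and categorical features on the same footing is a Gower-style distance: each numeric feature contributes its absolute difference divided by that feature's range, each categorical feature contributes 0 for a match and 1 for a mismatch, and the contributions are averaged. The sketch below is only an illustration; the feature names, ranges and equal weighting are assumptions made for the example.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-feature dissimilarity, each term scaled to [0, 1].

    numeric_ranges maps a feature name to its (min, max) in the reference
    data; any feature not listed there is treated as categorical.
    """
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            low, high = numeric_ranges[name]
            span = high - low
            total += abs(a[name] - b[name]) / span if span > 0 else 0.0
        else:
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)

# Hypothetical feature ranges and transactions.
ranges = {"amount": (0.0, 5000.0), "hour": (0, 23)}
t1 = {"amount": 120.0, "hour": 14, "channel": "web", "country": "US"}
t2 = {"amount": 135.0, "hour": 15, "channel": "web", "country": "US"}
t3 = {"amount": 4800.0, "hour": 3, "channel": "wire", "country": "US"}

d_similar = gower_distance(t1, t2, ranges)
d_different = gower_distance(t1, t3, ranges)
print(d_similar, d_different)
```

Even here the analyst is imposing a metric: equal weights per feature, range scaling, and treating hour of day as linear rather than circular are all choices that shape which observations look like outliers. Equal weighting also leaves the cross-correlation problem untouched, since a concept covered by many correlated features still counts many times.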
Requirements for Commercial-Grade Outlier Detection
Beyond clear technical issues, there are some higher-level properties that FICO scientists believe a state-of-the-art, commercial solution must address.
- Qualitative diversity of detected outlier behavior. Commonly, the quantitatively highest-ranking observations under some outlier statistic may all be the result of one particular “type” of outlier, for instance, a single modus operandi of fraud or abuse. However, the subject matter expert knows that substantial varieties of anomalous behavior are possible. A superior approach would generate a qualitative diversity of outlier cases. This is a tough problem for the less sophisticated practitioner and virtually unaddressed in the public literature, yet it is very important in commercial applications. Fully compensating for generalized feature cross-correlation in a principled algorithm goes a long way toward fulfilling this goal.
- Qualitative versus quantitative outlierness: discerning “unknown unknowns.” Can we distinguish outliers that are a significant “quantitative” exaggeration of normal behavior from ones that are fundamentally distinct, in a qualitative sense, from the norm? Both ought to score high on an outlier statistic, but we want the second to score even higher.
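To show what the diversity requirement means in practice, the sketch below applies a simple greedy filter: walk down the outlier ranking and keep a case only if it is sufficiently far from every case already kept. This is a generic heuristic for illustration, not the principled approach described above; the points, scores, Euclidean distance and `min_separation` threshold are all assumptions.

```python
import math

def diverse_top_outliers(points, scores, k, min_separation):
    """Greedily pick up to k high-scoring points that lie at least
    min_separation apart from every point already picked."""
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if all(math.dist(points[i], points[j]) >= min_separation for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen

# Hypothetical 2-D feature vectors: three near-identical outliers of one
# "type", one outlier of a different type, and two ordinary observations.
points = [(10.0, 0.1), (10.1, 0.0), (9.9, 0.2), (0.1, 10.0), (0.2, 0.1), (0.0, 0.3)]
scores = [9.9, 9.8, 9.7, 8.5, 1.0, 0.8]

plain_top2 = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:2]
diverse_top2 = diverse_top_outliers(points, scores, k=2, min_separation=2.0)
print(plain_top2, diverse_top2)
```

The plain ranking returns two copies of the same behavior, while the filtered list surfaces the second type. The filter inherits every difficulty of choosing a metric discussed earlier, which is one reason a principled solution is harder than this sketch suggests.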
Follow me on Twitter @ScottZoldi.