Improving Model and Data Governance with Auto-Encoders

One of our latest innovations in fraud and cybersecurity addresses a fundamental issue affecting predictive analytics: Data doesn’t sit still.

What do I mean by that? As any data scientist can tell you, a model development project begins with the lengthy process of collecting, identifying, cleansing and normalizing data. This is often the longest part of the process of building a model. There are many checks of data integrity to ensure proper data quality.

The paradox is that the data we’re spending so much time working with isn’t necessarily the data we really care about. The data we really care about is the data that the model will analyse in the future — which will be different than the data we’re studying to build the model.

Analytics methodologies include data governance to ensure that a model developed on today’s data will make good predictions based on future data. For instance, a “holdout sample” of the data available for modelling will be kept aside in order to validate the model. In the development of neural network models, the data used to build the model and the holdout sample are called the training sample and the testing sample, respectively.
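As a minimal sketch of the holdout idea, assuming the development data fits in a NumPy array, a random split might look like the following. The function name `split_holdout`, the 30% holdout fraction and the toy data are illustrative assumptions, not details of our methodology.

```python
import numpy as np

def split_holdout(n_rows, holdout_fraction=0.3, seed=0):
    """Return (train_idx, test_idx): disjoint row indices for a random split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(round(n_rows * holdout_fraction))
    # The first n_test shuffled rows are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for development data: 1,000 records with 5 fields each.
X = np.random.default_rng(1).standard_normal((1000, 5))
train_idx, test_idx = split_holdout(len(X))
X_train, X_test = X[train_idx], X[test_idx]
```

The model is fit on `X_train` only, so `X_test` behaves like a small preview of data the model has not seen.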

Post-development, we monitor data statistics, score distributions and model performance to make sure that the data the model is scoring in production doesn’t vary too greatly from the data used to develop the model.

That’s how it’s typically done today. But in order to improve our ability to rapidly diagnose and resolve data integrity issues and population changes, our team of data scientists has created a dynamic new approach using “auto-encoders.” I explored this topic in my presentation earlier this month at the Data Summit conference in New York.

Auto-encoders are a kind of neural network, and they work much like the neural networks used for scoring. In fraud detection, neural networks take in raw data and, using a network of computational “neurons” or nodes, output a score. With auto-encoders, the process is similar but the output isn’t a score: it’s a version or “reconstruction” of the input data. Through unsupervised machine learning, the auto-encoder minimizes the reconstruction error, producing output that is more and more like the input. Once the auto-encoder has been trained, it provides a compressed distributed representation (encoding) of the original data.

In short, an auto-encoder network is trained to output the input.
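To make “trained to output the input” concrete, here is a minimal sketch of a one-hidden-layer auto-encoder trained by plain gradient descent in NumPy. Everything in it is an assumption for illustration: the function name `train_autoencoder`, the layer size, the learning rate and the toy data are not taken from our production models.

```python
import numpy as np

def train_autoencoder(X, n_hidden=2, lr=0.05, epochs=3000, seed=0):
    """Train a tanh-encoder / linear-decoder auto-encoder on X.

    Returns the learned parameters and the mean squared reconstruction
    error recorded at every epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = 0.1 * rng.standard_normal((d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = 0.1 * rng.standard_normal((n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)       # compressed encoding
        X_hat = H @ W2 + b2            # reconstruction of the input
        diff = X_hat - X
        losses.append(float(np.mean(diff ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * diff / diff.size
        g_W2 = H.T @ g_out
        g_b2 = g_out.sum(axis=0)
        g_pre = (g_out @ W2.T) * (1.0 - H ** 2)
        g_W1 = X.T @ g_pre
        g_b1 = g_pre.sum(axis=0)
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2
    return (W1, b1, W2, b2), losses

# Toy data: 6 observed fields driven by 2 hidden factors, then standardized.
rng = np.random.default_rng(42)
latent = rng.standard_normal((500, 2))
mixing = rng.standard_normal((2, 6))
X = latent @ mixing + 0.05 * rng.standard_normal((500, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

params, losses = train_autoencoder(X)
print(f"reconstruction error: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Note that the target of the training loop is `X` itself. There are no labels, which is what makes the learning unsupervised.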

This auto-encoder model is important because it can indicate what types of future data have and have not been seen during model development. Where reconstruction errors are large, the combination of data elements being passed through the model in production is different from that seen during model training. This could indicate that the scores used in decision making will be less accurate.
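The sketch below illustrates how large reconstruction errors can flag unfamiliar records. To keep it short it uses a linear stand-in for the auto-encoder (PCA computed via SVD, which spans the same subspace a linear auto-encoder would learn) rather than a trained neural network. The 99th-percentile threshold, the function names and the toy data are all assumptions for illustration.

```python
import numpy as np

def fit_linear_autoencoder(X, n_components):
    """Fit a linear encoder/decoder via SVD (a PCA stand-in)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T   # columns span the learned subspace

def reconstruction_error(X, mean, components):
    """Per-record mean squared error between input and reconstruction."""
    centered = X - mean
    reconstructed = centered @ components @ components.T
    return np.mean((centered - reconstructed) ** 2, axis=1)

rng = np.random.default_rng(0)
mixing = rng.standard_normal((2, 5))

def sample_records(n):
    """Toy records whose 5 fields are driven by 2 hidden factors."""
    return rng.standard_normal((n, 2)) @ mixing + 0.1 * rng.standard_normal((n, 5))

# Development data defines what "seen during development" means.
X_dev = sample_records(2000)
mean, components = fit_linear_autoencoder(X_dev, n_components=2)
threshold = np.quantile(reconstruction_error(X_dev, mean, components), 0.99)

# Production data: familiar records, plus records whose fields no longer
# follow the relationships present in development.
X_familiar = sample_records(1000)
X_unseen = 2.0 * rng.standard_normal((50, 5))

flag_familiar = reconstruction_error(X_familiar, mean, components) > threshold
flag_unseen = reconstruction_error(X_unseen, mean, components) > threshold
print(f"flagged: familiar {flag_familiar.mean():.1%}, unseen {flag_unseen.mean():.1%}")
```

Each field of the unseen records could look plausible when examined alone. It is the combination of fields that the model fails to reconstruct.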

Let’s look at a couple of examples where this is useful.

Diagnosing a Data Feed

The creation of a neural network may involve terabytes of transaction data from multiple businesses. Standard statistical analysis is often too generic to find the subtle data integrity issues in such data.

The auto-encoder can identify transactions, across businesses, whose reconstruction errors differ from the rest, and these point to the key data integrity issues. This allows us to fix aspects of the data, minimize model score impacts and, if necessary, create rules to remedy the data quality issue.

These large reconstruction errors ferret out minute data quality issues that can be very important to specific segments. We may find, for example, a shift in transaction amounts for Kazakhstan from one issuer. That’s not a change likely to impact the whole population, but in terms of customers transacting in Kazakhstan it can be quite significant.
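A segment-level view of reconstruction error could be sketched as follows, assuming per-transaction errors have already been produced by a trained auto-encoder. The function name `segment_error_report`, the `ratio` and `min_count` parameters, the segment labels and the simulated error values are hypothetical.

```python
import numpy as np
from collections import defaultdict

def segment_error_report(errors, segments, ratio=3.0, min_count=20):
    """Return {segment: mean error} for segments whose mean reconstruction
    error exceeds `ratio` times the overall median error."""
    baseline = float(np.median(errors))
    by_segment = defaultdict(list)
    for err, seg in zip(errors, segments):
        by_segment[seg].append(err)
    return {
        seg: float(np.mean(vals))
        for seg, vals in by_segment.items()
        if len(vals) >= min_count and np.mean(vals) > ratio * baseline
    }

# Simulated errors for (issuer, country) segments.
rng = np.random.default_rng(0)
segments = (
    [("issuer_A", "US")] * 5000
    + [("issuer_B", "US")] * 5000
    + [("issuer_A", "KZ")] * 50
)
errors = np.concatenate([
    rng.exponential(0.01, 10000),   # typical reconstruction errors
    rng.exponential(0.10, 50),      # small segment with shifted amounts
])

report = segment_error_report(errors, segments)
print(report)
```

In this toy setup the 50 shifted transactions barely move the population-wide average, which is why the comparison is made segment by segment.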

Monitoring Unsupervised Models

Many unsupervised models are built with little to no historical data. As such, there is no ability to train an auto-encoder on historical data. Instead, when we build the unsupervised model we can create an accompanying auto-encoder model that will learn patterns in the production data and monitor for changes.

The pair of models can then be packaged together and installed in the production environment. The auto-encoder calculates the reconstruction error regularly, in batch mode. When the auto-encoder model tells us that the error has grown too large, signaling that the production environment is varying considerably, a new version of the unsupervised model may need to be constructed. Thanks to the auto-encoder model, we can see what patterns are newly emerging, which can be quite valuable for future unsupervised model enhancement.
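One way to picture the batch-mode check is sketched below. It assumes each batch arrives as an array of reconstruction errors from the accompanying auto-encoder, and that the first few production batches define “normal” because no historical data exists. The class name `ReconstructionErrorMonitor`, the warm-up length and the alert ratio are illustrative assumptions, not our production settings.

```python
import numpy as np

class ReconstructionErrorMonitor:
    """Tracks batch-level mean reconstruction error against a baseline
    learned from the first `warmup` production batches."""

    def __init__(self, warmup=5, ratio=2.0):
        self.warmup = warmup
        self.ratio = ratio
        self.warmup_means = []
        self.baseline = None

    def check(self, batch_errors):
        """Return True if this batch's mean error signals drift."""
        batch_mean = float(np.mean(batch_errors))
        if self.baseline is None:
            # Still learning what normal looks like: never alert here.
            self.warmup_means.append(batch_mean)
            if len(self.warmup_means) == self.warmup:
                self.baseline = float(np.mean(self.warmup_means))
            return False
        return batch_mean > self.ratio * self.baseline

# Simulated errors: 10 stable batches, then 3 from a shifted population.
rng = np.random.default_rng(0)
monitor = ReconstructionErrorMonitor()
scales = [0.01] * 10 + [0.05] * 3
alerts = [monitor.check(rng.exponential(s, 1000)) for s in scales]
print(alerts)
```

An alert here is only a prompt to investigate. Whether the unsupervised model actually needs rebuilding remains a judgment call.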

At FICO, we look for needles in the haystack around fraud and cybersecurity — the rare but important signs that something is wrong. Understanding the nearly undetectable changes and manipulations of the data presented to models helps us understand where models may not perform optimally, and raises our awareness of the new types of “needles” that show our adversaries at work.
