By Shafi Rahman
A common question these days is whether models can be built automatically. As I noted in an earlier blog post, FICO pioneered the use of Big Data and machine automation frameworks for building thousands of highly predictive models extremely rapidly by leveraging terabytes of data from a variety of sources. Naturally, some wonder whether this is the beginning of the end for analytic scientists; others, more cautious, are concerned about the black-box nature of these automated modeling farms.
It is worth noting that almost all predictive models are algorithm-driven and based on sound machine learning concepts, so training such models is easy to automate and scale to any new business problem. Simplistically speaking, one just needs to point the algorithm to the right training dataset and pass the required training parameters. Sometimes these algorithms, left on their own, produce over-fitted or poorly extrapolated models, so researchers at FICO have gone one step further and developed mechanisms to identify and fix most of these issues automatically. Analytic scientists still review every model trained this way and intervene manually when the algorithms and automation let a poorly performing model slip through.
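To make the idea concrete, here is a minimal sketch of that workflow, not FICO's actual system: train candidate models over a parameter grid, then flag any whose train/validation performance gap exceeds a threshold for human review. The data, the "models," and the 0.10 gap threshold are all illustrative assumptions.

```python
import random

random.seed(0)

# Synthetic noisy data: label = (x > 0.5), with 20% of labels flipped.
def make_data(n):
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

train, valid = make_data(300), make_data(200)

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

# Candidate 1: a sensible threshold rule (should generalize).
def rule(x):
    return int(x > 0.5)

# Candidate 2: a memorizing "model" (over-fits: perfect on training
# points, falls back to the majority class on unseen points).
memory = dict(train)
majority = round(sum(y for _, y in train) / len(train))
def memorizer(x):
    return memory.get(x, majority)

flagged, accepted = [], []
for name, model in [("rule", rule), ("memorizer", memorizer)]:
    gap = accuracy(model, train) - accuracy(model, valid)
    # 0.10 is a hypothetical review threshold, not a FICO parameter.
    (flagged if gap > 0.10 else accepted).append(name)

print("accepted:", accepted, "| flagged for review:", flagged)
```

The automated check catches the memorizer, but a human still decides what to do with flagged models, which is exactly the division of labor described above.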
The harder parts to automate are the experimental design and the entity characterization. These are the pieces where business and domain knowledge is infused into the problem solving using the available data. Of the two, entity characterization is the easier to scale up and automate. FICO scientists have done this for 1:1 customer dialogue by abstracting the customer characterization so that the automation can be applied to any transaction data source within that domain. Yet to extend it to similar problems in new domains, analytic scientists need to determine the applicability of the customer characterization given the available, and possibly different, types of data sources. They then make the appropriate updates to the automation and run various tests before putting the updated automation into production.
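The abstraction idea can be illustrated with a hypothetical sketch: a small set of generic aggregates computed over any transaction stream, so the same characterization code applies to any data source with comparable fields. The field names and features here are illustrative, not FICO's actual variables.

```python
from collections import defaultdict
from statistics import mean

def characterize(transactions):
    """Summarize one customer's transactions into model-ready features.

    Each transaction is a dict with 'amount' and 'category' keys;
    any transaction source mapped to this shape can reuse the code.
    """
    amounts = [t["amount"] for t in transactions]
    by_category = defaultdict(int)
    for t in transactions:
        by_category[t["category"]] += 1
    return {
        "txn_count": len(transactions),
        "total_spend": sum(amounts),
        "avg_spend": mean(amounts) if amounts else 0.0,
        "max_spend": max(amounts) if amounts else 0.0,
        "distinct_categories": len(by_category),
    }

features = characterize([
    {"amount": 25.0, "category": "grocery"},
    {"amount": 40.0, "category": "fuel"},
    {"amount": 15.0, "category": "grocery"},
])
print(features)
```

The point of the abstraction is that only the mapping from a new data source into this common shape changes per domain; the characterization itself, and everything downstream of it, stays automated.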
The hardest part by far to automate and scale is the design of the experiment. Questions like:
- What should be the data sources?
- How should you combine them?
- What should be the definition of the performance variable?
- How should you handle target leakage?
- What should be the most effective sampling scheme?
These questions are unique to the domain and the problem being solved, and they will always require an expert who understands both the business domain and analytic modeling. Once they are solved for a particular problem, it is easy to automate the solution for similar problems repeatedly. That, though, is the extent of the automation possible.
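One of the questions above, handling target leakage, lends itself to a small illustrative guard once an expert has made the design decision. This hypothetical sketch excludes any candidate predictor observed on or after the outcome date, since such fields can encode the target itself; the field names and dates are invented for the example.

```python
from datetime import date

def split_leaky(fields, outcome_date):
    """Split candidate predictors into usable and leaky sets.

    fields: {field_name: date the field's value was observed}.
    Anything observed on or after the outcome date is treated as
    potential target leakage and set aside for expert review.
    """
    usable, leaky = [], []
    for name, observed in fields.items():
        (leaky if observed >= outcome_date else usable).append(name)
    return sorted(usable), sorted(leaky)

usable, leaky = split_leaky(
    {
        "balance_6m_avg": date(2024, 1, 31),
        "chargeoff_flag": date(2024, 6, 30),  # recorded with the outcome
        "txn_count_3m": date(2024, 2, 28),
    },
    outcome_date=date(2024, 6, 30),
)
print("usable:", usable, "| leaky:", leaky)
```

The rule itself is trivial to automate; deciding what the outcome date is, and which observation dates to trust, is the expert judgment the paragraph above describes.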
In the era of Big Data and modeling automation, analytic scientists have become more important than ever. We have access to a wealth of data sources of all sorts, and modeling automation draws a great many inferences, but only experienced, seasoned analytic scientists can determine whether those inferences make analytic or business sense. As my colleague Mike Farrar puts it, ours is a solution that demands the attention of serious, experienced analytic scientists but has the machine handle the drudgery at scale. This distinction, in my opinion, puts the analytic scientist ahead of the automatons.