I consider myself a sensible person with good time management skills, so it’s always a little galling to get to the bottom of a news story on a respectable site only to be presented with an array of clickbait with alluring titles like “20 ways to stop hangovers your doctor won’t tell you” (made up by me) and “Architects took this water park too far” (real, and no they didn’t – it’s just a big slide).
It got me wondering what a clickbait article for machine learning might look like – and then I realised that many machine learning terms are pure clickbait without any augmentation. Test yourself. Would you click on these headlines?
Sensational Details of Model Over-Training in Action
No, not that sort of model!
Legendary running coach Arthur Lydiard famously taught “Train don’t strain” and it is important to avoid the temptation to over-train or “overfit” a machine learning model.
It can be tempting to train a model to the point where it fits the training data exactly, but that actually makes it less predictive. In practical terms, if you are building a supervised machine learning model to detect fraud, it is important to avoid the temptation to tweak it until it finds every tagged fraud in your data set.
You want your model to be effective on transactions it hasn’t seen. Overtraining makes the model focus on the noise in a particular sampled data set rather than on the pattern you are trying to detect.
Knowing how far to train a model requires skill and experience. Done right it produces fantastic models which deliver great results both on paper and in production when they’re truly tested. Be wary of systems that advise retraining models all the time. They’re almost guaranteed to learn unimportant and irrelevant patterns and block swathes of good activity.
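A toy sketch can make the point concrete. Below, a model flexible enough to fit noisy training data exactly (a polynomial interpolant, standing in for any over-trained model) scores perfectly on its own training set but badly on unseen points, while a simpler straight-line fit generalises better. The data and both models are illustrative inventions, not anything from a real fraud system:

```python
import random

random.seed(42)

# Toy data: the true pattern is y = 2x, observed with noise.
train = [(float(x), 2 * x + random.gauss(0, 1.0)) for x in range(8)]
# Unseen points drawn from the same underlying pattern, without noise.
test = [(x + 0.5, 2 * (x + 0.5)) for x in range(8)]

def lagrange_fit(points):
    """The 'over-trained' model: a polynomial that passes through every
    training point exactly, so its training error is zero by construction."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def linear_fit(points):
    """A simpler model: an ordinary least-squares line that tolerates noise."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda x: slope * x + intercept

def mse(model, points):
    """Mean squared error of a model over a set of (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

overfit = lagrange_fit(train)
simple = linear_fit(train)

print(mse(overfit, train))  # ~0: it has memorised the noise
print(mse(simple, train))   # > 0: it accepts some training error
print(mse(overfit, test), mse(simple, test))
```

On the unseen points the over-trained model loses to the simpler one, which is exactly the trap Lydiard’s advice warns against.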
You Won’t Believe Details of these Unsupervised Relationships
No, not that sort of relationship!
One area of machine learning we at FICO find particularly exciting, and have been researching for more than a decade, is unsupervised techniques. They help solve a range of problems where prior “tagged” data is not available, or where algorithms need to find new patterns, sometimes in data that an algorithm or model might never have seen before.
Unsupervised techniques such as multi-layer self-calibrating outlier models — used in FICO’s Cyber Security solution — give enterprises the tools to monitor their networks and detect hacking attempts before data or IP are compromised. And if a compromise has happened, unsupervised techniques help protect the organisations that are then assaulted with the stolen identities.
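To illustrate the self-calibrating idea in miniature (this toy is nothing like FICO’s actual multi-layer models), here is a streaming detector that continuously re-estimates a running mean and variance using Welford’s algorithm and scores each new value against what it has seen so far, with no tagged training data at all:

```python
class SelfCalibratingOutlierDetector:
    """Streams values, recalibrating its notion of 'normal' as data arrives.
    A minimal sketch of self-calibration only - not a production model."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's algorithm)

    def score(self, x):
        """Return |z-score| of x against the data seen so far, then update."""
        if self.n < 2:
            z = 0.0  # not enough history to judge anything yet
        else:
            var = self.m2 / (self.n - 1)
            z = abs(x - self.mean) / (var ** 0.5) if var > 0 else 0.0
        # Self-calibration step: fold the new value into the statistics.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return z

det = SelfCalibratingOutlierDetector()
scores = [det.score(v) for v in [10, 11, 9, 10, 12, 10, 11, 500]]
print(scores[-1])  # the 500 gets a very large outlier score
```

No one told the detector what an attack looks like; the anomalous value stands out simply because it departs from the behaviour the stream itself established.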
PSD2 — arriving over the next two years — is another great opportunity, particularly while there is a lack of training data.
The Secret Reality of Living in a Random Forest
No, not that sort of forest!
There’s a lot of hype specifically about the efficacy of random forest models. These models work by growing a forest of slightly different decision trees, each trained on a random sample of the training data, and then combining their predictions, typically by majority vote.
Whole academic theses (printed on actual forests) have been, and will continue to be, written about how good or bad the random forest approach is compared to other machine learning models. It has emerged as a popular technique — maybe partly because it is easier to comprehend than some of the other methods.
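The mechanics can be sketched in a few lines. The toy forest below grows many single-split “stump” trees, each trained on a bootstrap resample of an invented one-dimensional data set, and classifies by majority vote. (Real random forests also randomise the features considered at each split, which this sketch omits.)

```python
import random

random.seed(0)

# Invented 1-D data set: the true rule is label 1 when x > 5.
data = [(x, int(x > 5)) for x in [1, 2, 3, 4, 6, 7, 8, 9]]

def train_stump(sample):
    """One 'tree': pick the split threshold with the fewest errors
    on this particular bootstrap sample."""
    best_t, best_err = None, float("inf")
    for t, _ in sample:
        err = sum(int(x > t) != y for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def train_forest(data, n_trees=25):
    """Grow many slightly different trees by resampling the data
    with replacement (bootstrapping) before training each one."""
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(data) for _ in data]  # bootstrap sample
        forest.append(train_stump(sample))
    return forest

def predict(forest, x):
    """Majority vote across all the trees in the forest."""
    votes = sum(int(x > t) for t in forest)
    return int(votes * 2 > len(forest))

forest = train_forest(data)
print(predict(forest, 2), predict(forest, 8))
```

Each individual stump is a weak, noisy learner; the vote across many of them is what makes the ensemble robust.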
One problem is that random forests are inherently not explainable and that presents a real challenge with the introduction of GDPR next year. The legislation dictates, among other things, that enterprises must be able to explain their decisions. FICO have a number of approaches to add explainability to random forest and other machine learning models.
There is no single machine learning approach that will solve all problems, so it’s important to choose the right technique or set of techniques. Explainability is and will be important in many applications. Other considerations are speed, computational processing needs, storage needs and maintainability. FICO has experience of deploying machine learning at scale with applications such as FICO Falcon Fraud Manager capable of processing thousands of transactions a second with 10-20 ms latency.
The Sexy Features They Didn’t Want You to See
No, not those sorts of features!
Features are the lifeblood of predictive machine learning. They’re the input variables a model learns from. When designing a model, the data scientist identifies features which might help solve the problem at hand and then tests that theory, again and again. Features can be simple, like the amount of a transaction, or they can be complex, such as composite calculations of network and device information.
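As a small, invented example of the kind of composite feature a data scientist might engineer for fraud detection, the function below turns raw transaction timestamps into a “velocity” feature: the count of transactions in the trailing hour, a far more telling signal than the raw timestamps themselves. The name and window size are hypothetical choices for illustration:

```python
from collections import deque

def velocity_feature(timestamps, window_seconds=3600):
    """For each transaction (timestamps in ascending seconds), compute how
    many transactions fell within the trailing window - a classic engineered
    feature, since bursts of activity often indicate fraud."""
    recent = deque()
    features = []
    for t in timestamps:
        recent.append(t)
        # Drop transactions that have aged out of the window.
        while recent[0] < t - window_seconds:
            recent.popleft()
        features.append(len(recent))
    return features

# Four transactions: three within an hour, then one much later.
print(velocity_feature([0, 600, 1200, 90000]))  # [1, 2, 3, 1]
```

The raw data never changes; the engineering lies in deciding that “transactions per hour” is the lens worth looking through.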
Andrew Ng put it powerfully: “Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
So maybe features aren’t sexy, but machine learning and data science definitely are. The Harvard Business Review thinks it’s the sexiest job of the 21st century. The reason is that machine learning is transformational, and the data scientists who are part of that transformation will be those who combine domain expertise, precision and determination with creativity and originality.