In mid-March, David Lazer and his colleagues published a paper in Science demonstrating that Google Flu Trends substantially overestimated the number of flu cases. Google Flu Trends and similar success stories had amplified the hype around Big Data. After the publication, many more articles appeared taking potshots at Big Data. This is in sharp contrast to the almost juvenile euphoria about Big Data over the last two years. This negativity is as unwarranted as the earlier hype was.
Our Chief Analytics Officer Dr. Andrew Jennings has been advocating a more balanced approach to Big Data for quite a while. He wrote in 2012 that “it is dangerous to assume that more data is automatically better than less data.” I understand that he was referring to the volume and variety of data available in the Big Data paradigm. Such a measured approach has helped us consolidate Big Data for better analytics.
Our time-tested analytic development methodology, combined with new ways of leveraging Big Data tools and methods, has allowed us to make great progress in monetizing Big Data. Reviewing the Google Flu Trends “debacle” through this lens provides a useful study of how to leverage Big Data for solving business problems. The real issue is not Big Data per se, but the neglect of the basic tenets of data analytics, something that some data scientists occasionally stumble over. I describe a few of these tenets below.
- A data source is only as good as the value it provides: A new data source, whether “Big” or not, doesn’t lead to a better model by default. Nor does it suddenly make a traditional dataset redundant. The predictive power of a new data source needs to be carefully examined, and the source should be used only if it delivers additional lift. If the traditional data source provides the requisite predictive power, then no new data source is necessary. Sometimes using the new and the traditional together is the most effective approach, as the researchers showed in the case of Google Flu Trends.
- Business insight is a must: Without business insight and domain understanding, efforts to extract value from the vast number of new data sources made possible by Big Data will most likely falter. Dr. Jennings predicted that more organizations would “realize that domain expertise is a critical asset that helps analysts develop the intuition to understand what their data is really telling them, which findings are truly important, and when additional analysis is necessary.” In this light, it would be interesting to see how Google Flu Trends behaves once medical understanding of flu, as well as the reasons people search for flu-related keywords, is incorporated into the model.
- All models grow weak with time: Another reality is that all models eventually wither away. A model that validates well when it is built may cease to validate as time passes. So paying attention to building a robust model is important. This problem doesn’t go away in the Big Data realm, and a data scientist can’t be cavalier about it. Various techniques for building robust models apply equally to Big Data and small data. Ultimately, models need to be rebuilt before they stop validating. Google Flu Trends was eventually updated to address the prediction drift, but probably a bit too late.
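To make the first and third tenets concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used for Google Flu Trends or any particular production model: the function names, the choice of mean absolute error as the validation metric, the 5% minimum lift, and the 1.5x drift tolerance are all hypothetical, and the numbers are made up.

```python
from statistics import mean


def mean_abs_error(actual, predicted):
    """Average absolute gap between observed and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def adds_lift(actual, baseline_pred, candidate_pred, min_gain=0.05):
    """True if the candidate cuts holdout error by at least min_gain (relative).

    baseline_pred: predictions from the traditional data source alone.
    candidate_pred: predictions after adding the new data source.
    """
    base_err = mean_abs_error(actual, baseline_pred)
    cand_err = mean_abs_error(actual, candidate_pred)
    if base_err == 0:
        return False  # baseline is already perfect on this holdout
    return (base_err - cand_err) / base_err >= min_gain


def first_drifting_period(periods, build_error, tolerance=1.5):
    """Label of the first period whose error exceeds tolerance * build_error.

    periods: iterable of (label, actual_values, predicted_values).
    Returns None if every period stays within tolerance.
    """
    for label, actual, predicted in periods:
        if mean_abs_error(actual, predicted) > tolerance * build_error:
            return label
    return None


# Made-up holdout values, for illustration only.
actual = [2.0, 3.0, 5.0, 4.0]
traditional_only = [2.5, 3.5, 4.0, 4.5]
with_new_source = [2.1, 3.2, 4.8, 4.1]
barely_different = [2.5, 3.5, 4.0, 4.4]

print(adds_lift(actual, traditional_only, with_new_source))   # True
print(adds_lift(actual, traditional_only, barely_different))  # False

# Made-up monitoring windows after the model was built.
windows = [
    ("period_1", [2.0, 3.0], [2.4, 3.4]),
    ("period_2", [2.0, 3.0], [2.6, 3.7]),
    ("period_3", [2.0, 3.0], [3.5, 4.5]),
]
print(first_drifting_period(windows, build_error=0.5))  # period_3
```

The first check admits a new data source only when it beats the traditional one on held-out data by a meaningful margin. The second flags the point at which a model's error has grown enough relative to its build-time error that a rebuild is worth considering. In practice the metric and thresholds would depend on the business problem.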
We expect the negative hype around Big Data to die down eventually, and we hope a more even-tempered scientific discourse will emerge. There are lessons here for both the critics and the supporters of Big Data. Practitioners of Big Data analytics should explore the best ways to combine traditional analytic methods with the new capabilities of the Big Data paradigm. The focus should be on developing an optimum synthesis of the two approaches to create better and more predictive analytic models.