Big Data: Should You Collect it All?

A few years ago, following his success in predicting the results of the 2008 election, poll analyst Nate Silver wrote a book called The Signal and the Noise. The book outlined his approach to understanding the limitations of polling in representing an election's ultimate outcome. By better understanding what pollsters ask and whom they ask, he devised a methodology to focus on the polls that, in aggregate, best represented the voting electorate and, more importantly, to avoid those polls that historically skewed their results to benefit particular constituencies.

The title of the book best summarized the lesson:  Identify the meaningful signals and avoid the noise.  This is precisely the problem that Big Data presents to analytic professionals (and non-professionals) today.

In a world where you can collect everything, should you? Probably not. Recently, ZDNet's Stilgherrian wrote an article in which he contrasted the Big Data "collect everything" mentality that grounds many Hadoop implementations with data minimization (a bedrock principle of data protection law). He wrote:

“Big data's approach of collecting as much data as you can, even if it seems irrelevant, because it may reveal a previously unknown correlation, also collides with the "data minimization" principles of data privacy laws, which say that you only collect the data you need to do the job.”

This goes to the heart of a danger that many Big Data implementations will unknowingly face in the coming years. Collecting more data for its own sake is a bad philosophy: it makes gleaning value from that data inherently more expensive and challenging. Add in the legal constraints, and those who collect everything may have a real problem on their hands.
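
To make the contrast concrete, here is a minimal sketch of what data minimization can look like at the ingestion stage, assuming a simple CSV extract; the file and field names are hypothetical. Instead of landing every field because it might someday reveal a correlation, the pipeline declares what the job needs and drops the rest before storage:

```python
# Minimal data-minimization sketch: keep only the fields the business
# problem requires, and discard everything else before it is stored.
# All file and field names below are hypothetical.
import csv

# Fields the margin analysis actually needs; anything else is noise
# (and potential confidentiality liability) if retained.
REQUIRED_FIELDS = ["order_id", "sale_date", "unit_cost", "sale_price", "region"]

def minimize(in_path: str, out_path: str) -> None:
    """Copy only the required fields from a raw extract into storage."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=REQUIRED_FIELDS)
        writer.writeheader()
        for row in reader:
            writer.writerow({field: row[field] for field in REQUIRED_FIELDS})

minimize("raw_orders.csv", "minimized_orders.csv")
```

The point of the sketch is the declaration step: deciding up front which fields serve the job turns minimization from a legal afterthought into part of the pipeline design.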

Instead, businesses should focus on identifying data that is valuable and relevant to the specific challenges they face, before collection gets out of hand. For example, having all the historical temperature data for the Atlantic Ocean isn't going to help any business figure out how to address decreasing margins on Amish furniture sold in Poland. But understanding how fluctuations in ocean temperatures affect the quality of lumber in Ohio and Pennsylvania might, and that relevance can be tested cheaply, as sketched below.
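
One lightweight way to vet a candidate source before collecting it at scale is to test it against the outcome you care about on a small sample. The sketch below uses pandas with hypothetical file and column names; it checks whether ocean temperature fluctuations show any correlation with lumber quality before anyone commits to storing the full history:

```python
# Hedged sketch: test a candidate data source's relevance on a sample
# before collecting it at scale. The data file and column names are
# hypothetical illustrations, not a real data set.
import pandas as pd

df = pd.read_csv("lumber_quality_by_month.csv")  # small sample extract

# Correlate month-over-month ocean temperature change with quality scores.
df["temp_delta"] = df["atlantic_sst_celsius"].diff()
corr = df["temp_delta"].corr(df["lumber_quality_score"])  # Pearson by default

# Only a meaningful relationship would justify collecting the source in full.
print(f"correlation between SST fluctuation and lumber quality: {corr:.2f}")
```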

This is where FICO® Big Data Analyzer comes in. The tool is designed specifically to give the new data professional, from business users to analysts and data scientists, the data query and visualization capabilities, along with the heterogeneous data interconnect functionality, to parse any data set and separate the proverbial wheat from the chaff. That is, it identifies the data relevant to solving a particular problem or providing predictive value.

Big Data Analyzer isn't designed in a silo, however. It is a critical new tool in the FICO® Analytic Modeler software family. It works seamlessly with Analytic Modeler, providing optimized data sets to the analytic modeling software for building decision trees, scorecards, sentiment analysis and more. In this way, the data repository is optimized to wrangle, refine, analyze and model data, and ultimately to deliver the most powerful insights.
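
FICO's own interfaces aren't shown here, but the general flow, an optimized data set feeding a decision-tree model, can be sketched generically with scikit-learn; the file and column names are again hypothetical, and this is not the Analytic Modeler API:

```python
# Generic sketch of the refine-then-model flow using scikit-learn,
# not the FICO Analytic Modeler API. File and column names are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

refined = pd.read_csv("minimized_orders_with_outcomes.csv")  # refined data set
X = refined[["unit_cost", "sale_price", "temp_delta"]]
y = refined["margin_declined"]  # binary label: did the margin shrink?

# Hold out a test split, fit a shallow tree, and report holdout accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(f"holdout accuracy: {tree.score(X_test, y_test):.2f}")
```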

Equally important, Big Data Analyzer provides all the capabilities a data manager – Big Data or otherwise – needs to limit and optimize the data collected and stored. These limits can reduce both a business's data collection requirements and its data confidentiality liability.
