By Mike Farrar
Big Data, Big Data, Big Data. It’s all over the popular business press – when the Harvard Business Review devotes its cover to Big Data, you know it has to be trendy. Big Data is going to let you do all sorts of amazing things. It’s going to be a competitive advantage, it’s going to enable efficient new processes, it’s going to let you track customer behavior, it’s going to get the crabgrass out of your lawn, and it’s even going to get your kid to study for her calculus exam.
Okay, okay. It might actually be able to do all of that. No guarantees on the calculus thing.
You need to extract the information locked away in all that Big Data if you’re going to make any use of it. That isn’t easy. If it were, it wouldn’t be Big Data. It’d just be plain old regular data.
The Trick, Part One: Extracting Value from Big Data
First, a couple of words about thermodynamics (bet you didn’t see that coming). There’s this notion called entropy. In a completely entropic system, everything is random wherever you look, which means every part of it looks the same as every other part. It’s just a big humming cloud of noise.
The opposite of entropy is when things aren’t completely random and aren’t all the same everywhere you look. That’s what we look for – the signal in the noise. That signal is information, and information is valuable. The value of data lies in the information you can extract from it.
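To put a number on the noise-versus-signal idea, here is a minimal sketch using Shannon entropy, the information-theory cousin of the thermodynamic notion above (swapped in for illustration; the post itself does not use it). The function name `shannon_entropy` and the two toy datasets are assumptions made up for this example.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Return the Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical toy data: four symbols that all appear equally often ("noise")
noise = ["a", "b", "c", "d"] * 25

# Hypothetical toy data: one symbol dominates, so the data has structure
structured = ["a"] * 97 + ["b", "c", "d"]

print("noise entropy (bits):     ", shannon_entropy(noise))
print("structured entropy (bits):", shannon_entropy(structured))
```

The evenly mixed data scores the maximum possible entropy for four symbols, while the lopsided data scores lower. That gap is a rough stand-in for "something other than noise is going on here."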
This is the curse of Big Data. Big Data isn’t really about the amount of storage space it takes up. It’s about information density. There’s a lot of noise in Big Data, and there isn’t a lot of signal. But since there’s so much data, if you dig through enough of it, you might pick up on that valuable, elusive signal.
Many companies share the predicament where they’ve stored all this data, the server farms are gulping electricity and taking up space, and nobody knows whether there’s actually any signal in there. They’ve stored it on the belief that there’s some signal, but they don’t exactly know how to get any value out of it.
Traditional analytic techniques don’t do so well with Big Data. Simple techniques like crosstabs, or more advanced but well-understood techniques such as factor analysis, are hopelessly resource-intensive to run on a huge dataset. Imagine running a crosstab that examines every single record in your Big Data. You can try it, but your competitors will be making progress while you wait.
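For readers who haven’t run one, a crosstab is just a count of records for each combination of two fields. The sketch below is a bare-bones version in plain Python; the `crosstab` helper and the field names `region` and `bought` are hypothetical, made up for this example. Note that it has to touch every record, which is the part that hurts as the dataset grows.

```python
from collections import Counter

def crosstab(records, row_key, col_key):
    """Count records for each (row value, column value) pair in a single pass."""
    counts = Counter()
    for record in records:
        counts[(record[row_key], record[col_key])] += 1
    return counts

# Hypothetical records; a real dataset would have vastly more of them
records = [
    {"region": "east", "bought": "yes"},
    {"region": "east", "bought": "no"},
    {"region": "west", "bought": "yes"},
    {"region": "east", "bought": "yes"},
]

table = crosstab(records, "region", "bought")
for (region, bought), count in sorted(table.items()):
    print(region, bought, count)
```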
The traditional sensible solution to a dataset that’s too large is working with a smaller sample of the data. Makes perfect sense, but Big Data doesn’t have much information density. Cut your sample too small and you may not be able to pick up on the faint signals of information you seek.
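A little arithmetic shows why small samples and faint signals don’t mix. The sketch below assumes each sampled record independently has the same small chance of carrying signal, which is a simplification. The function names and the one-in-a-million signal rate are made-up numbers for illustration, not figures from any real dataset.

```python
def expected_signal_records(sample_size, signal_rate):
    """Expected number of signal-bearing records in a random sample."""
    return sample_size * signal_rate

def prob_at_least_one(sample_size, signal_rate):
    """Chance that the sample contains at least one signal-bearing record,
    assuming each record is drawn independently."""
    return 1.0 - (1.0 - signal_rate) ** sample_size

# Hypothetical: one record in a million carries the signal we care about
signal_rate = 1e-6

for sample_size in (10_000, 1_000_000, 100_000_000):
    expected = expected_signal_records(sample_size, signal_rate)
    chance = prob_at_least_one(sample_size, signal_rate)
    print(f"sample={sample_size:>11,}  expected hits={expected:.2f}  "
          f"P(at least one)={chance:.4f}")
```

Under these assumptions, the smallest sample will almost always contain no signal at all, and only the largest one reliably contains enough hits to analyze.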
Beyond the struggles with data size, traditional techniques were developed to work best on structured data. Yet much of the new data being stored is unstructured; it doesn’t parse easily or cleanly into traditional databases. URLs can be like this, and any natural-language data is almost inevitably unstructured.
So there we are: stuck with techniques that weren’t built for the kind of data we have, and the stuff is too darn big to work with anyway.
Great. Just great.
But don’t despair. There’s always hope. Human ingenuity triumphs over most obstacles, and we’re dealing with a problem that has a legion of data scientists seriously geeking out. New techniques are coming into play all the time: Non-Negative Matrix Factorization, Bayesian Networks, Restricted Boltzmann Machines, etcetera, etcetera. A nice compilation of techniques (both traditional and newer) can be found at https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms.
The race to push back the frontiers of Big Data will never end as long as it contains information to be exploited.
So much for the first part of The Trick. Now comes the hard part.
Look for our next post soon on The Trick, Part Two.