"A bounty hunting scam joins two men in an uneasy alliance against a third in a race to find a fortune in gold buried in a remote cemetery,” reads the plot summary of 1966's quintessential spaghetti western, The Good, the Bad and the Ugly.
The state of Big Data Analytics can feel like a different kind of spaghetti western, this one titled The Big, the Raw and the Complex.
In this version of the story, we’re still searching for buried gold (actionable insights about what can delight our customers and improve our bottom line), but instead of tracking down one unknown grave in one unnamed cemetery, we’re inspecting all the blades of grass on every hill and dale in our data landscape. If that survey contained a petabyte of information (one million gigabytes, or nearly 5,000 days of high-definition video), then today’s high-end PC would require about three months just to read through that data once.
And while Big is good — more data gives us a chance to build stronger, more refined models — Big can be pretty tough when it demands a year or two of processing time from your solitary computer.
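As a rough check on that read-time figure, the arithmetic can be sketched in a few lines of Python. The sustained read rate of 125 MB/s is an assumption chosen for illustration (roughly one spinning disk), not a number from the text above; faster storage shortens the estimate proportionally.

```python
# Back-of-the-envelope read time for one petabyte on a single machine.
# ASSUMPTION: 125 MB/s sustained sequential read; not a measured figure.
PETABYTE_BYTES = 10**15             # one million gigabytes (decimal units)
READ_BYTES_PER_SEC = 125 * 10**6    # assumed throughput


def read_time_days(total_bytes: int, bytes_per_sec: int) -> float:
    """Return the days needed to read total_bytes once at a constant rate."""
    seconds = total_bytes / bytes_per_sec
    return seconds / 86_400         # 86,400 seconds in a day


days = read_time_days(PETABYTE_BYTES, READ_BYTES_PER_SEC)
print(f"{days:.0f} days, or about {days / 30:.1f} months")
```

Under that assumption a single pass takes a little over 92 days, which is in line with the three-month estimate, and that is before any actual analysis is done.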
The second character is raw data, which doesn’t live by the traditional laws of the land. Classic statistical datasets are deliberately gathered into neatly structured blocks of numbers and categorical labels. But 80 percent of Big Data is nothing like that. It’s raw, untamed and rough around the edges. Email messages, customer calls and video streams are readily understood by us humans, and contain a great deal of potential signal, but they are meaningless to classical empirical algorithms. To efficiently find the gold, we need increasingly sophisticated text and media mining tools to translate those unruly, unstructured sources into better-behaved (if not strictly law-abiding) data.
The third character in our Big Data western is the hardest to wrangle, because it’s so hard to see exactly where he begins and ends, and whether he’s got any decent limb to rope. I’m talking here about the inherent complexity of some Big Data sources — like machine-generated logs, web click-streams, complex financial transactions, and social media connections — which have to be pieced together from various sources, carefully re-collated by time, location, agent, device or relationship, and finally assembled into some new structure that can clearly reveal an often complex underlying pattern.
For example, if a machine in our network went down this afternoon, what preceding signals can help us trace backward, from the server’s last dying whispers, to the intermediate symptoms, back to the ultimate root cause? Where was the first sign of trouble, and when? Could this be the result of a single user’s inadvertent misstep? Or was it a malicious, coordinated attack by many agents? A spurious, rogue process, or a simple hardware failure? Finding any of those clues can take a long time, even for the most skilled data analysts, and with no promise of reward.
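To make the idea of re-collating by time concrete, here is a minimal sketch that merges several log streams into one timeline and walks backward from a failure. Everything in it is hypothetical: the Event record, the events_before helper, the three sample logs and their messages are invented for illustration, and each stream is assumed to be already sorted by timestamp.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    ts: float      # timestamp in seconds (hypothetical units)
    source: str    # which log the event came from
    message: str


def events_before(failure_ts: float, window: float,
                  *streams: Iterable[Event]) -> List[Event]:
    """Merge time-sorted streams and return the events that fall within
    `window` seconds before the failure, newest first."""
    merged = heapq.merge(*streams, key=lambda e: e.ts)
    recent = [e for e in merged if failure_ts - window <= e.ts <= failure_ts]
    return recent[::-1]


# Hypothetical sample logs, each already ordered by time.
server_log = [Event(100.0, "server", "disk latency rising"),
              Event(160.0, "server", "process crashed")]
network_log = [Event(40.0, "network", "routine heartbeat"),
               Event(130.0, "network", "connection resets")]
auth_log = [Event(95.0, "auth", "repeated failed logins")]

trail = events_before(160.0, 90.0, server_log, network_log, auth_log)
for event in trail:
    print(f"{event.ts:>6.1f}  {event.source:<8} {event.message}")
```

Reading the merged trail from the crash backward puts events from different systems side by side, which is the first step toward separating symptoms from causes. Real sources rarely share a clock or a format, so the hard part in practice is the normalization that this sketch takes for granted.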
Imagine a treasure map that’s been torn to bits, mailed separately to a hundred different post offices, translated at each into the local language, and then squirreled away by 100 individual bookkeepers, each with their own brand of filing systems. Happy treasure hunting!
Staring Down These Challenges
But all is not lost. Here are three things data scientists are doing today to confidently generate insight from the untamed Big Data frontier.
- Use modern, reliable, scalable data processing, via Hadoop and its many extensions. The MapReduce programming framework gives us a means to express many data management and mining processes, while Hadoop manages the parallel execution of those jobs across grids of potentially enormous size. With such tools, the impossibly Big becomes tractable.
- Rehabilitate unruly words into well-behaved numbers. Text mining algorithms can find meaningful associations between the language elements and easily quantified outcomes, so that you can translate raw, human-readable text into structured, modeling-friendly features. The best tools allow you to complement the machine’s tireless data crunching with your own knowledge of the landscape.
- Identify the uneasy alliances with link analysis. Having corralled our data into some presentable shape, link analysis tools allow us to discover, mine and visualize the hidden relationships that exist among actors, transactions, devices and events. These are vital tools for the analysis of topics as diverse as social networks and financial crimes.
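The three strategies above can be sketched in miniature with plain Python. This is a toy illustration under stated assumptions, not Hadoop's API or any particular product's method: the MESSAGES records and every function name are hypothetical, the map and reduce steps run in a single process rather than across a grid, bag-of-words counts stand in for text mining, and connected components stand in for link analysis.

```python
from collections import Counter, defaultdict
from itertools import chain

# Hypothetical toy records: (sender, recipient, text).
MESSAGES = [
    ("ann", "bob", "refund late order"),
    ("bob", "cat", "late order shipped"),
    ("dan", "eve", "great service"),
]


def map_words(record):
    """Map step: emit a (word, 1) pair for each word in the message text."""
    _, _, text = record
    return [(word, 1) for word in text.lower().split()]


def reduce_counts(pairs):
    """Shuffle and reduce steps: group the pairs by word and sum the counts."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def to_features(text, vocabulary):
    """Bag of words: one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def linked_groups(edges):
    """Connected components of an undirected graph built from (a, b) pairs."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        group, frontier = set(), [start]
        while frontier:
            node = frontier.pop()
            if node not in group:
                group.add(node)
                frontier.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


word_counts = reduce_counts(chain.from_iterable(map(map_words, MESSAGES)))
vocabulary = sorted(word_counts)
features = [to_features(text, vocabulary) for _, _, text in MESSAGES]
groups = linked_groups((s, r) for s, r, _ in MESSAGES)

print(word_counts)
print(vocabulary, features)
print(groups)
```

Because the map step looks at one record at a time and the reduce step only needs all pairs for a given word, a framework such as Hadoop can spread both across many machines. The feature vectors are the kind of structured input a conventional model expects, and the groups show which senders and recipients are connected, directly or through intermediaries.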
Big Data is by its nature big, raw and complex. But with the appropriate strategies and the right set of tools, we can find the gold in the data … even without the messy gunfights.