Shafi Rahman and Eeshan Malhotra
Until recently, analytic scientists had to resort to complex frameworks and bespoke data designs to solve their big data problems, because sufficiently abstracted frameworks simply did not exist. The advent of modern Big Data infrastructure provides that level of abstraction, so the analytic scientist can focus on efficient algorithms without worrying about the underlying complexities.
Apache Hadoop is a huge blessing. For a newbie trying to get a foot in the Big Data door, though, Hadoop and the MapReduce paradigm can be quite overwhelming. MapReduce offers a powerful abstraction that transcends algorithmic boundaries, but force-fitting every algorithm into this paradigm can lead to inefficient, and sometimes unintuitive, implementations. A judicious choice of Big Data technology plays a critical role in letting the analytic scientist keep their focus on the task at hand, rather than twisting and stretching an algorithm to fit a given mold.
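To make the paradigm concrete, here is the canonical word-count example sketched in plain Python. The function names (`map_phase`, `shuffle`, `reduce_phase`) are ours, not a real Hadoop API; in a real cluster the framework performs the shuffle and distributes the map and reduce steps across machines.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big tools", "big ideas"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

Word counting fits this mold perfectly; the trouble discussed below starts when an algorithm does not decompose so neatly into independent map and reduce steps.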
There are scenarios where Hadoop simply doesn't provide a sufficient level of abstraction, and tools that sit a level above it can greatly simplify algorithm development. For instance, real-time querying of data stored in Hadoop is too complex to code up on your own as map-reduce jobs (even though it is possible to do so). Luckily, there are tools like Hive that make querying such data fairly trivial, using familiar, SQL-like syntax.
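The appeal is the declarative style. As a rough sketch, here is an aggregation expressed as a SQL query, run against SQLite only so the snippet is self-contained; the table and data are invented. A HiveQL statement for the same question would look essentially the same, and Hive would compile it into map-reduce jobs behind the scenes.

```python
import sqlite3

# Toy clickstream table standing in for data sitting in Hadoop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id TEXT, page TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("u1", "home"), ("u1", "cart"), ("u2", "home")])

# One declarative statement replaces a hand-written map, shuffle, and reduce.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
```

Compare that single `GROUP BY` with writing, wiring up, and debugging the equivalent map and reduce functions by hand.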
A similar situation arises when working with network/graph data. MapReduce in its base form is an inefficient way to process such data, and it fails to exploit the natural fit between the structure of a graph and the structure of a distributed computational infrastructure. Tools like Giraph and GraphLab abstract away the map and reduce steps, and instead let you express computation as information passing along the nodes and edges of the graph. This is far more intuitive in the problem space, and many existing algorithms from graph theory translate naturally into this framework.
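The "think like a vertex" style these tools encourage can be sketched in a few lines of Python. This is a toy, single-machine imitation of the superstep model (the graph, function names, and message-passing loop are all ours, not any Giraph or GraphLab API): in each round, active vertices send messages along their edges, and a vertex that receives a better value updates itself and becomes active in the next round. Here it computes single-source shortest paths.

```python
import math

# Toy weighted graph as adjacency maps: vertex -> {neighbor: edge weight}.
graph = {"a": {"b": 1, "c": 4}, "b": {"c": 2}, "c": {}}

def shortest_paths(graph, source):
    """Superstep-style single-source shortest paths."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    active = {source}
    while active:
        messages = {}
        # Each active vertex sends (its distance + edge weight) to neighbors.
        for v in active:
            for nbr, weight in graph[v].items():
                candidate = dist[v] + weight
                if candidate < messages.get(nbr, math.inf):
                    messages[nbr] = candidate
        # A vertex that hears of a shorter path updates and reactivates.
        active = set()
        for v, d in messages.items():
            if d < dist[v]:
                dist[v] = d
                active.add(v)
    return dist

dist = shortest_paths(graph, "a")
```

Notice there is no map or reduce in sight: the algorithm reads almost exactly like its textbook description, which is precisely the point of the higher abstraction.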
Another scenario that demands a departure from plain-vanilla Hadoop is when the algorithm fits into MapReduce, but only by sacrificing efficiency. For instance, when one needs to iterate over the data repeatedly to arrive at an optimal solution, say, training a feed-forward neural network with back-propagation, Hadoop may not be the best choice. Although such algorithms are available ready-made in Mahout, it is worth noting that if an analytic scientist had to design one themselves, plain map-reduce would not be the most effective vehicle. A loop-aware extension of Hadoop, like HaLoop, can provide that efficiency: it maintains a cache of the loop-invariant data, allowing much faster processing across iterations.
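The idea behind loop-invariant caching can be illustrated with a toy in plain Python (the names and the counting trick are invented for illustration; this is an analogy for what HaLoop does, not its API). A chain of plain map-reduce jobs re-reads its input on every pass, while a loop-aware system loads the invariant data once and reuses it.

```python
load_count = 0

def load_points():
    """Stands in for an expensive full scan of the dataset (e.g. from HDFS)."""
    global load_count
    load_count += 1
    return [0.0, 0.2, 0.9, 1.1]

def step(points, center):
    """One refinement step: pull a 'centroid' halfway toward the data mean."""
    return 0.5 * center + 0.5 * (sum(points) / len(points))

# Naive iteration, as chained map-reduce jobs would behave:
# the loop-invariant input is re-read on every pass.
center = 0.0
for _ in range(4):
    center = step(load_points(), center)
naive_loads = load_count

# Loop-aware iteration: load the invariant data once, reuse it each pass.
load_count = 0
cached = load_points()
center = 0.0
for _ in range(4):
    center = step(cached, center)
cached_loads = load_count
```

On a real cluster each "load" is a full pass over distributed data, so turning N reads into one is often the difference between a practical and an impractical iterative job.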
If you're an analytic scientist and the problem space you're working in seems pervasive enough, chances are that a sufficiently abstract tool has already been built for it. So, to devise an efficient solution for your Big Data task, take a moment and look around for the right tool, rather than pulling out your spanner at the first go!