It’s the End of Data Warehousing as We Know It … And I Feel Fine

By Josh Hemann

Every couple of years, someone comes along and declares the end of something. In 1992 Francis Fukuyama declared the end of history, in 1999 Salesforce.com declared the end of software, and just last week, commercial Hadoop provider Cloudera declared that Hadoop would end the data warehousing era. (Although to be fair to CEO Mike Olson, the headline on TechWeb declared it; he wasn’t actually quoted as saying it.)

Does anything truly end? No, it just transforms. We are now in a transformative era for data, data warehousing, analytics and enterprise software in general. Big Data is becoming part of enterprises’ strategic information architecture that deals with data volume, variety, velocity and complexity. And it is forcing changes to many traditional approaches. According to Gartner Group in its 2013 predictions, “this realization is leading organizations to abandon the concept of a single enterprise data warehouse containing all information needed for decisions. Instead they are moving towards multiple systems, including content management, data warehouses, data marts and specialized file systems tied together with data services and metadata, which will become the ‘logical’ enterprise data warehouse.”

Of course, if you had been following this corner of technology before Big Data became a catch-all phrase, you’d know this is not much of a prediction. For example, database pioneer Mike Stonebraker wrote about the end of relational databases back in 2009. What he actually argued was that the “one size fits all” approach from relational database providers was no longer going to cut it, and that new solutions would be needed to better address areas like unstructured text data or fast querying. And Stonebraker didn’t just rant on the topic: he founded three “next gen” database companies – Vertica, VoltDB and SciDB – each aimed at tackling one of the nine major application areas he thought would require specialized data persistence.

OK, so back to Hadoop and the end of something… Low-value data that comes in large volumes very quickly seems like a great candidate for Hadoop-based systems because the cost per stored terabyte is so low. Examples would be clickstream data and web server logs. But as the data becomes more valuable, storing it in Enterprise Data Warehouses (EDWs) that have evolved over the past 30+ years makes a lot of sense, because that entire ecosystem is far better understood and more mature. Furthermore, sometimes the cheap commodity hardware story that makes Hadoop so cost-effective is not sufficient. When Oracle acquired Sun in 2010, it was precisely to deliver better integration of hardware and database software to optimize performance. And Intel recently jumped into the Hadoop fray with its own distribution optimized for its Xeon processors, with Cray announcing just last week a new supercomputing cluster built to run that Intel Hadoop distribution. Hardware actually does matter.
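
To give a rough sketch of what that low-value, high-volume workload looks like (this is illustrative, not production code, and the Apache-style log format plus the mapper.py/reducer.py file names are my assumptions), a Hadoop Streaming job that counts hits per URL in web server logs can be nothing more than two small Python scripts that read stdin and write tab-separated key/value pairs:

    # mapper.py -- emit one "url<TAB>1" pair per request line on stdin.
    # Assumes Apache-style access logs where the request string is the
    # first quoted field, e.g. ... "GET /some/page HTTP/1.1" ...
    import sys

    for line in sys.stdin:
        parts = line.split('"')
        if len(parts) < 2:
            continue
        request = parts[1].split()   # ['GET', '/some/page', 'HTTP/1.1']
        if len(request) >= 2:
            print("%s\t1" % request[1])

    # reducer.py -- sum the counts for each URL; Hadoop delivers the
    # mapper output sorted by key, so a simple running total works.
    import sys

    current_url, count = None, 0
    for line in sys.stdin:
        url, _, n = line.rstrip("\n").partition("\t")
        if url != current_url:
            if current_url is not None:
                print("%s\t%d" % (current_url, count))
            current_url, count = url, 0
        count += int(n or 1)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))

You launch this with the hadoop-streaming jar that ships with whatever distribution you run (the exact path varies), pointing -input at the raw logs in HDFS and -output at a results directory. The appeal is exactly the one above: terabytes of low-value log lines spread across cheap machines, with no schema design or expensive warehouse capacity involved.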

So what is “high-value” data? From my point of view working with businesses on being more customer-centric, high-value data comes from “transactions” mapping to important human actions (e.g., buying something at a store, selling a stock) that you have to persist correctly, immediately. Such data invariably feeds many, many other systems and has numerous scheduled as well as (more importantly?) ad hoc queries run against it.
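
To make “persist correctly, immediately” concrete, here is a minimal sketch using Python’s sqlite3 module as a stand-in for the relational/EDW layer; the sales table and its columns are invented for illustration, not taken from any real system:

    import sqlite3

    # sqlite3 stands in for the relational/EDW layer here; the table and
    # column names are made up for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE sales (
                        sale_id   INTEGER PRIMARY KEY,
                        store_id  INTEGER NOT NULL,
                        sku       TEXT    NOT NULL,
                        amount    REAL    NOT NULL,
                        sold_at   TEXT    NOT NULL)""")

    # The write is atomic: either the whole sale lands, or none of it does.
    with conn:
        conn.execute(
            "INSERT INTO sales (store_id, sku, amount, sold_at) "
            "VALUES (?, ?, ?, datetime('now'))",
            (42, "SKU-1001", 19.99))

    # The same rows then answer ad hoc questions nobody scheduled in advance.
    for store_id, revenue in conn.execute(
            "SELECT store_id, SUM(amount) FROM sales GROUP BY store_id"):
        print(store_id, revenue)

The point is not the toy database but the guarantees: the write either commits in full or not at all, the moment the transaction happens, and the same rows are then open to arbitrary SQL from every downstream team. That combination is what mature EDW ecosystems have spent 30+ years getting right.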

A recent Infochimps presentation provides a nice way to think about this value distinction with the following slide:

[Slide from Infochimps, contrasting Enterprise Data Sources with Non-Traditional Data Sources]

Two things jumped out at me on seeing this slide:

  1. The Enterprise Data Sources box confusingly mixes sources of data with data persistence. One can buy data from Acxiom, so Acxiom is a source, but a Teradata EDW is not the true source of the data. It is just the persistence layer in a much larger IT infrastructure that gets feeds from many sources (Point of Sale systems, inventory management systems, real estate teams, etc.).
  2. Mike Olson wants you to believe that you can increasingly push out the Oracles of the world and replace that persistence capability with Hadoop. But you have to understand the larger Enterprise Data Sources infrastructure and know where all that data is coming from before you can hot-swap in a Hadoop cluster and suddenly save money or get insights.

My view, given where the technology is today, is that the Non-Traditional Data Sources domain is perfect for Hadoop. This is generally high-volume, fast-changing, low-value data that in aggregate can be highly useful, but I really don’t want to pack tons of Facebook “likes” from dead people and barely understandable comments from YouTube videos into my multi-million dollar EDW. All of that lower-value, quickly changing data can go on cheap commodity hardware that uses Hadoop-based software to manage and query it. Conversely, I don’t want to put my high-value customer transaction data, fed from a multi-million dollar Point of Sale system, onto cheap commodity hardware running Hadoop software that is rapidly changing, lacks functionality, and is harder to hire talent for than EDW solutions.

It is not the end of the data warehousing era, just the blurry end as we know it.
