First, a brief personal history of my time in data warehousing.
In 1987, I started working with relational database systems when they were in their infancy. We tried to build OLTP and OLAP systems on single relational databases on client-server platforms. It didn’t go so well. By the early 90s we’d figured out that the reporting environment had to be separated from the transactional environment if we were going to meet performance expectations. Inmon and Kimball came along with their two views of the data warehousing world, and data warehouses for OLAP were replicated from the OLTP and legacy systems to do executive reporting. For about 20 years we’ve been writing ETL to populate the EDW, then pre-aggregating tables and refining BI tools to build sexy reports and dashboards for people who don’t know how to or can’t do the work but need the numbers to make better business decisions. But that process still takes too long, and it won’t handle all the data relevant to the business’s need to know and understand our customers’ likes and dislikes, so that we can provide better service and ultimately entice them to spend more money with us on our products and services.
Time marches on
Enter the latest hardware: super cheap servers with super cheap disk that can be clustered together into massive arrays of storage. Moore’s law seems to be holding as all this technology just keeps getting faster and better and cheaper.
Meanwhile, everybody who matters to marketers is connected in some way to the network that has evolved over these years. The Internet that no one knew what to do with 20 years ago has become a ubiquitous, essential tool for global communications, enabling McLuhan’s Global Village (yes, years ago I actually read McLuhan). Now we can text and tweet to anyone on the planet in seconds, an impossible scenario even half a human generation ago.
Now we can collect transmissions and transactions off every chip, card and device that is connected to this ubiquitous network. All of a sudden we have the Internet of Things. Machines that talk to each other without our aid. Collecting the data we generate. Some of us have a vague idea, but most of us can’t even fathom the breadth of this data content and volume. And all this data to what end? Data that will allow us to monitor our homes, drive our cars, travel the world, design and build our better society. Enough data so that the marketers can predict what we are going to buy and when we are going to buy it before we even know ourselves. Brave New World meets Minority Report.
But there is so much data that we need ways to parse it. Mine it. Searching, sorting, and tweezing out the nuggets that matter from the masses of HTML and snippets of text and escape sequences and smiley-faced emoticons. The separation of the valuable information from the ubiquitous noise. The wheat from the chaff.
I was recently in the Musée d’Orsay in Paris. Our guide took us to this painting:
After the main harvest is over, the peasants come into the field to pick up the few grains of wheat that are left behind.
And that’s what the Big Data world is all about. After you and I are finished with our data and have moved on, someone else will look at it for meaning and hidden value, picking out patterns in the masses of information and discerning whether it has value.
Gleaning
It’s The Next Big Thing in the world of ETL and BI tools that will allow us to manage all the data that is coming at us from all the devices we have accumulated, and have become so dependent upon, in our Jetsons-like world.
Not much has changed in 150 years: we’re still gleaning, only now it’s data instead of wheat.