Recently, I was asked about the Lambda Architecture and how it applies to an organization. I have written two blog posts on Lambda, so I thought this would be a good opportunity to organize them and share my findings in one place.
Building a Lambda Architecture Template
Before I go into the Lambda Architecture, I would like to get ahead of myself by presenting a template that I built. I like the template because it mixes Big Data technologies, traditional relational technologies, and cloud enablement on one page. The template is very easy to use. You will see the components of the Lambda Architecture in it (Speed layer, Batch layer, Serving layer, plus a Cloud Scaling layer that I added to make it cloud-ready), and you will also see some technology suggestions dotted throughout. Please DO NOT feel obligated to use those technologies. You can CUSTOMIZE the template by substituting similar technologies that fit your needs and budget.
Here’s the thought process (the making of) behind the Lambda Architecture template:
The Lambda Framework
The Lambda framework is a useful tool for conceptualizing the flow of data and for facilitating the design of big data solutions. At Twitter, Nathan Marz came up with this framework to address some data velocity requirements and to serve data for decision making. In the original framework, the architecture is broken down into three layers: batch layer, speed layer and serving layer. The layers are explained below.
The speed layer delivers streaming data. Specialized streaming and queuing technologies (such as Storm, Samza, Kafka, and Spark Streaming) are used to detect, extract, process, store, and present streaming data. Streaming data is a subset of all data; it represents data that is created or changed with a maximum latency of one second.
The batch layer delivers a broad range of batch data. The Hadoop MapReduce engine is the backend computing engine that processes this batch data. The batch layer handles and processes immutable data: data is accumulated, appended and stored.
The serving layer defines and plans for how data is being served for analytical purposes. The framework advocates a construct in which batch data is complemented by fast data to provide data that spans very recent activities and activities in the past. Serving that combined data would satisfy the majority of one’s analytics needs. The serving layer delivers the combined data in the form of queries. Essentially, the queries are the logical federation of the fast data and the batch data. Many analytical applications can make use of the queries and consume the combined data.
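To make the three layers concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not how any particular Hadoop or streaming component works: the names `Event`, `batch_view`, `speed_view` and `serve`, the page-view data, and the cutoff timestamp are all hypothetical. The batch view is recomputed from an append-only log, the speed view covers only events newer than the last batch run, and the serving query federates the two.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An immutable fact: one page view at a timestamp (hypothetical schema)."""
    page: str
    ts: int


def batch_view(master_log, up_to_ts):
    """Batch layer: recompute counts from the append-only log up to a cutoff."""
    return Counter(e.page for e in master_log if e.ts <= up_to_ts)


def speed_view(recent_events, after_ts):
    """Speed layer: count only the events that arrived after the last batch run."""
    return Counter(e.page for e in recent_events if e.ts > after_ts)


def serve(page, batch, speed):
    """Serving layer: a query that logically federates batch and fast data."""
    return batch[page] + speed[page]


# Illustrative data only; the log is appended to and never updated in place.
log = [
    Event("home", 1),
    Event("home", 2),
    Event("pricing", 3),
    Event("home", 11),
    Event("pricing", 12),
]
cutoff = 10  # assumed timestamp of the last completed batch run

batch = batch_view(log, cutoff)
speed = speed_view(log, cutoff)
print(serve("home", batch, speed))
```

In a real deployment the two views would live in separate stores and be refreshed on different schedules; the point here is only that the consumer sees a single combined answer.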
Pushing Lambda to the Next Level
While the Lambda framework is a useful tool for architects and engineers to conceptualize the big data stack and its components, the framework has its own limits:
The concepts are great. What about the Hadoop components? How do they fit into the big picture? The framework presents concepts such as fast data, batch data and the need for a serving layer. It does not provide suggestions as to which Hadoop component(s) to use in what situation. This is because the Hadoop environment is fast changing and new Hadoop projects are developed or incubated frequently to bridge gaps in the Hadoop ecosystem.
Perfect! I can architect a Big Data stack. How do I complement my existing data warehousing environment with this Big Data stack? The framework is mainly used in planning Big Data architecture, while in reality the Big Data stack is used alongside the data warehouse and BI to manipulate data of a different nature. Lambda’s fast, batch and serving concepts can potentially be used in the data warehousing environment too, but the framework hasn’t been extended to that space yet.
Traditional end-to-end data management still applies to Big Data. How do you combine the traditional and the Lambda views? While the fast layer, batch layer and serving layer are good starting points for Big Data architecture, architects and engineers also need to plan for the end-to-end journey of the data: data ingestion, data transformation, data consolidation, data provisioning and data analytics. The framework lacks this depth.
Don’t forget I have historical data on slower storage. Can I enrich the serving layer to include historical data too? The framework does not touch on historical data. In theory, historical data can be treated as a subtype of batch data. However, historical data may need its own treatment. Many organizations store historical data on Hadoop storage and in the cloud (private, public or hybrid), so having a dedicated “historical layer” might be a good idea to keep the layers mutually exclusive.
With all those limitations and opportunities in mind, I went on a quest to build on top of the Lambda framework and make it more robust. I integrated the framework with the traditional data warehousing steps: ingestion, transformation, consolidation, provisioning and analytics. The end result is the Lambda Architecture template I presented above. Enjoy the template.