Welcome to part 2 of this month’s Azure Partner Community blog series, about data platforms and advanced analytics. Read part 1.
- Register for the February 18 Azure Partner Community call
- Join the Azure Partners group on Yammer
- Sign up for the Azure Partner email newsletter
- Read other Azure Partner Community blog posts
by Jonathan Gardner
US Partner Technology Strategist for Microsoft Azure
A discussion about data platforms and advanced analytics, this month’s Azure Partner Community blog series focus, must of course include the topic of big data. In my conversations with partners, I usually find that we need to level-set about what “big data” is, and then cover the basics of advanced analytics. In this post, I’ll outline the foundation for these conversations.
I love talking to people about their environments and their data. The environments vary wildly in size and data type. But whether they really have “big data” is related to whether their data have one of these three V’s: Volume, Variety, or Velocity.
For years, organizations have collected vast amounts of data, and the amount they collect continues to grow exponentially.
Scientific data collection offers some striking examples of volume: the amount of data collected can reach into the hundreds of gigabytes, terabytes, or even petabytes. I recently saw this statistic: in 2010, Twitter generated more than 1 TB of tweets daily.
These examples are meant to be extreme, but I have worked with smaller organizations that have hundreds of TB of data. That qualifies as having big data.
Variety refers to the type of data that an organization collects.
An organization may have structured data from its ERP system and unstructured data that it is collecting for brand analysis from social media. These two data sets vary not only by type but by schema as well.
Organizations that want to make sense of these seemingly unrelated data types have big data. With these data types, customers can analyze complex questions. For example, many customers are looking at whether their presence and brand sentiment on Twitter and other social media platforms are affecting sales.
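To make the idea concrete, here is a minimal sketch of that kind of variety analysis using pandas. The data frames, column names, and sentiment scores are all hypothetical: imagine daily sales pulled from an ERP system joined against a daily sentiment score distilled from social media posts.

```python
import pandas as pd

# Hypothetical structured ERP data: daily units sold
sales = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
    "units_sold": [120, 95, 140],
})

# Hypothetical unstructured social data, already reduced to a daily
# average sentiment score (-1 = negative, +1 = positive)
sentiment = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
    "avg_sentiment": [0.62, 0.41, 0.78],
})

# Join the two differently-shaped data sets on date, then check
# whether sentiment tracks sales
combined = sales.merge(sentiment, on="date")
correlation = combined["units_sold"].corr(combined["avg_sentiment"])
print(f"sales/sentiment correlation: {correlation:.2f}")
```

In practice the hard part is the reduction step this sketch skips: turning raw, unstructured social media text into a per-day sentiment number before the join is possible.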
In the context of big data, velocity means that individually small pieces of data are entering the system at a rapid rate.
This is the type of data generated by sensors, Internet of Things devices, or SCADA systems.
These types of environments can generate 100,000 1 KB tuples per second.
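A quick back-of-the-envelope calculation shows why that rate is challenging even though each tuple is tiny. Using the figures above (100,000 tuples per second at 1 KB each):

```python
# Throughput estimate for the sensor workload described above:
# 100,000 tuples/second, 1 KB per tuple
tuples_per_second = 100_000
tuple_size_bytes = 1024  # 1 KB

bytes_per_second = tuples_per_second * tuple_size_bytes
mb_per_second = bytes_per_second / 1024**2
tb_per_day = bytes_per_second * 86_400 / 1024**4

print(f"{mb_per_second:.1f} MB/s sustained")   # ~97.7 MB/s
print(f"{tb_per_day:.1f} TB/day if retained")  # ~8.0 TB/day
```

So a velocity workload quietly becomes a volume workload too: retaining that stream for even a week approaches the hundreds-of-terabytes range discussed earlier.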
Data analytics pipeline and lambda architecture
There continues to be debate about additional ways to define big data, but what I’ve established so far in this discussion allows us to shift focus to how the data are actually processed.
The stages of the data analytics pipeline follow the logical flow of the data: ingest, processing, storage, and delivery. When we discuss the three V's, it is clear that there are many different types of data, and the volume that must be processed can be quite large. Enter lambda architecture.
Lambda architecture was designed to meet the challenge of handling the data analytics pipeline through two avenues: stream-processing and batch-processing methods. These two data pathways merge just before delivery to create a holistic picture of the data. The streaming layer handles data with high velocity, processing them in real time. The batch layer handles large volumes of data, and batch processing can take extended periods of time. By combining the layers, the streaming data fill in the time gap left by the batch layer's latency. The image below illustrates this concept.
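The merge step at the heart of the lambda architecture can be sketched in a few lines. This is a toy illustration with hypothetical page-view counts, not a real serving layer: the batch view holds totals computed up to the last batch run, the real-time view holds counts for events that have arrived since, and a query combines the two.

```python
from collections import defaultdict

# Batch layer output: complete but stale (computed at the last batch run)
batch_view = {"page_a": 1000, "page_b": 750}

# Speed layer output: incremental counts for events since that run
realtime_view = {"page_a": 12, "page_c": 4}

def merged_view(batch, realtime):
    """Serving-layer merge: combine both views so a query sees a
    complete, up-to-date picture despite batch latency."""
    merged = defaultdict(int)
    for view in (batch, realtime):
        for key, count in view.items():
            merged[key] += count
    return dict(merged)

print(merged_view(batch_view, realtime_view))
# {'page_a': 1012, 'page_b': 750, 'page_c': 4}
```

When the next batch run completes, its output absorbs the events the speed layer was covering, and the real-time view is reset, so the merge stays simple additive arithmetic.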
Lambda architecture and Microsoft Azure
With an understanding of lambda architecture, you can see that Microsoft has aligned Azure services to provide tools all along the pipeline. The image below outlines how Azure big data services fit into the lambda architecture.
Getting Started with Microsoft Big Data: Introduction to Big Data