By Michael Olschimke, co-founder and CEO at Scalefree International GmbH
Co-authored with Marc Finger from Scalefree.
Technically reviewed by Ian Clarke and Naveed Hussain – GBB (Cloud Scale Analytics) for EMEA at Microsoft and Jonas De Keuster from VaultSpeed.
In the first article of this blog series, we introduced the requirements of a modern data analytics platform. The foundation for this framework is the Data Vault 2.0 System of Business Intelligence. This article presents the Data Vault 2.0 reference architecture based on data lakes and discusses how we implement it on the Microsoft Azure cloud. However, the architecture is not limited to the Azure cloud: the previous article defined the data analytics platform as a distributed solution that can span multiple environments. For example, our clients have implemented distributed platforms that span multiple cloud regions, implement a multi-cloud scenario, or integrate cloud solutions with on-premises solutions.
In addition, the architecture is not limited to relational databases. Data Vault 2.0 is a concept increasingly used for NoSQL databases where semi-structured and unstructured data is processed.
The Data Vault 2.0 Reference Architecture
The following diagram provides the Data Vault 2.0 reference architecture:
The presented reference architecture uses a data lake for staging purposes and follows the hybrid Data Vault 2.0 architecture. The data from the source systems is loaded using ETL, Python scripts, or pipelines on the Azure platform into the data lake, which persists the data. This is typically done using Parquet or Avro files. From our experience, a functional structure of the data lake, as described in a solution from our knowledge base, has proven to be most effective.
This architecture is preferred over a transient staging area using relational databases for multiple reasons. First, the original idea of the staging area was to reduce the burden on the source system. The problem is that the structure of source systems changes over time. Thus, over a long enough time, structured data looks like semi-structured data.
Changing source structures require reengineering of the relational staging area to capture the data again. On the other hand, filesystems in a data lake, such as the underlying Azure Blob Storage, don’t mind storing more or fewer files, or whether the internal structure of those files changes. Therefore, the burden is not on the source system but on the data analytics team: they have to adjust their data structures in downstream layers, but the source system can always offload its data, no matter what.
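To illustrate why a file-based staging area tolerates structural change, here is a minimal sketch. It rests on several assumptions that are not part of the architecture itself: JSON Lines files stand in for Parquet or Avro so that the example needs only the Python standard library, a local temporary folder stands in for the data lake, and the folder layout (source / entity / load date) and the names land_batch and read_entity are purely illustrative.

```python
import json
import tempfile
from pathlib import Path


def land_batch(lake_root: Path, source: str, entity: str,
               load_date: str, rows: list[dict]) -> Path:
    """Persist one source delivery as-is; no schema is enforced on write."""
    target_dir = lake_root / source / entity / load_date
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0000.jsonl"
    with target.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    return target


def read_entity(lake_root: Path, source: str, entity: str) -> list[dict]:
    """Read all deliveries of an entity and align them to the union of columns."""
    rows = []
    for path in sorted((lake_root / source / entity).glob("*/*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            rows.extend(json.loads(line) for line in fh if line.strip())
    # Downstream code decides how to handle columns that older files lack.
    columns = sorted({key for row in rows for key in row})
    return [{col: row.get(col) for col in columns} for row in rows]


lake = Path(tempfile.mkdtemp())  # hypothetical stand-in for the data lake

# First delivery with the original source structure.
land_batch(lake, "crm", "customer", "2023-01-01",
           [{"customer_no": "C1", "name": "Ada"}])

# The source system later adds a column; landing the data still succeeds.
land_batch(lake, "crm", "customer", "2023-02-01",
           [{"customer_no": "C2", "name": "Bob", "segment": "B2B"}])

aligned = read_entity(lake, "crm", "customer")
```

The write path never inspects the columns, so the source system can always offload its data. The adjustment effort shows up only in read_entity, which is the part owned by the data analytics team.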
Note that one of the source systems in the above diagram is the master data management system (MDM). From the data analytics perspective, the MDM system is a source system that will be absorbed just like any other data source.
Before we discuss the next layers, let’s focus on the final layer: the information mart. Those of you who know what a data mart is in legacy data warehousing already know what an information mart is. It is used to deliver useful information to the end-user, just like a data mart. But if a data mart is used to deliver information and not raw data, why didn’t they call it an information mart? In Data Vault 2.0, the definition of the information mart matches the definition of the data mart in legacy data warehousing; it’s just a name change.
The information mart is defined by the end-user and is often modeled using a dimensional model, such as a star schema or snowflake schema. It could also be a flat-and-wide entity (fully denormalized entity) or a model in third-normal form. It is typically driven and defined by information requirements and not modeled by the source system (there are exceptions which we call “source marts”).
Careful readers of this article might have already realized that there exists a gap between the data lake and the information mart layer. The data lake is functionally oriented, that is, modeled by the source systems. On the other hand, the information mart layer is modeled by the information requirements. Both layers are modeled independently from each other and most often, are not the same.
The gap between the raw data in the data lake, modeled after the source systems, and the information in the information mart, modeled after the information requirements, is bridged by the so-called EDW layer. This layer includes two sub-layers: the Raw Data Vault and the Business Vault layer.
The Raw Data Vault layer models the raw data but breaks it down into the fundamental components of all enterprise data: business keys, relationships between business keys and descriptive data. In a Data Vault model, these fundamental components are modeled using hubs, links, and satellites. We leave the details for a subsequent article in this series but it’s important to know that this layer does not change the content of the data in any way – it just restructures the incoming data set into smaller, underlying components.
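As a rough preview of this restructuring, the following sketch splits one incoming order record into hub, link, and satellite rows. The details are left for the later article, so everything here is an assumption made for illustration: the entity and column names, the use of MD5 hash keys over normalized business keys, and the load_date and record_source metadata columns.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(*business_keys: str) -> str:
    """Derive a key from one or more business keys (MD5 is an assumption here)."""
    normalized = "||".join(key.strip().upper() for key in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def split_order_row(row: dict, record_source: str, load_date: datetime) -> dict:
    """Restructure one raw row into its components without changing any value."""
    meta = {"load_date": load_date, "record_source": record_source}
    customer_hk = hash_key(row["customer_no"])
    order_hk = hash_key(row["order_no"])
    return {
        # Hubs: one row per business key.
        "hub_customer": {"customer_hk": customer_hk,
                         "customer_no": row["customer_no"], **meta},
        "hub_order": {"order_hk": order_hk,
                      "order_no": row["order_no"], **meta},
        # Link: the relationship between the two business keys.
        "link_order_customer": {
            "order_customer_hk": hash_key(row["order_no"], row["customer_no"]),
            "order_hk": order_hk,
            "customer_hk": customer_hk,
            **meta,
        },
        # Satellite: the descriptive data, stored exactly as delivered.
        "sat_order": {"order_hk": order_hk,
                      "order_status": row["order_status"],
                      "amount": row["amount"], **meta},
    }


source_row = {"order_no": "O-1001", "customer_no": "C-100",
              "order_status": "shipped", "amount": "49.90"}
parts = split_order_row(source_row, "webshop",
                        datetime(2023, 1, 1, tzinfo=timezone.utc))
```

Note that the descriptive values in sat_order are copied verbatim from the source row. The sketch only rearranges the data and applies no business logic, which is the point of this layer.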
While breaking the enterprise data into smaller chunks might be an interesting exercise (it is), it doesn’t close the gap between the incoming raw data and the expected information. The only difference is that the raw data is not modeled after the source models anymore. However, the Raw Data Vault model is still functionally oriented because it models the raw data (not the information) in the way the business is using the data. The latter is a reference to the business key: that is how the business user identifies and monitors the business objects, and that is what the raw data describes.
The Raw Data Vault only prepares to bridge the gap by simplifying the next step: the introduction of business rules in the Business Vault. From a modeling perspective, this layer models the results of the business logic (also known as “soft rules”) and, if virtualized using SQL views, the business logic itself in the select clause of the view definition. Typical examples of business rules include the standardization of data, data cleansing, re-computations and alignment of data.
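A virtualized Business Vault entity of this kind could look roughly like the following sketch. It is an illustration under stated assumptions: an in-memory SQLite database stands in for the relational EDW database, and the table sat_customer, the view bv_sat_customer_standardized, and the standardization rule itself are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the relational EDW database
conn.executescript("""
-- Raw Data Vault satellite: descriptive data exactly as delivered.
CREATE TABLE sat_customer (
    customer_hk TEXT,
    load_date   TEXT,
    country     TEXT,
    email       TEXT
);
INSERT INTO sat_customer VALUES
    ('hk1', '2023-01-01', ' de ',    'Ada@Example.com'),
    ('hk2', '2023-01-01', 'Germany', NULL);

-- Business Vault view: the soft rule lives in the select clause.
CREATE VIEW bv_sat_customer_standardized AS
SELECT
    customer_hk,
    load_date,
    CASE UPPER(TRIM(country))
        WHEN 'DE'      THEN 'DE'
        WHEN 'GERMANY' THEN 'DE'
        ELSE 'UNKNOWN'
    END AS country_code,
    LOWER(TRIM(email)) AS email_clean
FROM sat_customer;
""")

# Results of the business rule, ready for consumption downstream.
standardized = conn.execute(
    "SELECT customer_hk, country_code, email_clean "
    "FROM bv_sat_customer_standardized ORDER BY customer_hk"
).fetchall()

# The raw satellite itself is left untouched by the rule.
raw_countries = conn.execute(
    "SELECT country FROM sat_customer ORDER BY customer_hk"
).fetchall()
```

Because the rule is expressed as a view, changing the business logic means redefining the view, while the raw data underneath remains available unchanged.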
The L-shape of the Raw Data Vault is due to the fact that some of the enterprise data is good enough for reporting. This is where the article might seem “unrealistic” to experienced readers, given the data mess in today’s enterprise environments. But yes, there might be clean, useful data out there, and this data might not require any business logic to be applied in reporting. Therefore, the Business Vault is a sparsely modeled layer: a Business Vault entity only exists if a business rule needs to be applied. The resulting information mart entities, such as dimension and fact entities, are derived from entities in both Data Vault layers: the raw data in the Raw Data Vault and the pre-processed data in the Business Vault. To be clear: not all data from the Raw Data Vault must be processed by an entity in the Business Vault to be used in reporting. This sparse approach reduces the effort to build, document, maintain and re-engineer the Business Vault, because in reality, a good amount of enterprise data is useful in reporting as it is.
The similar L-shape of the data lake has other reasons. First, the data lake should have a dual use. It’s not only used by the Data Vault team to build the data analytics platform, but also by the data scientists to develop more ad-hoc solutions. They might not want to use a relational database but use the data lake tools instead (such as Databricks, Spark SQL, etc.). Well, they often change their mind when they realize that the Business Vault provides a library of business rule results ready for consumption. These results can also be joined to the data flows on the data lake and help them to bridge the gap between raw data and information. In the end, we all face the same issues here, and a subsequent article on data science will go into more detail for those readers who are more interested in such ad-hoc solutions.
Another reason for the dual-use is that most enterprises cannot afford having two separate data lakes once they have higher volumes of data. Using the data lake for both the Data Vault and data science essentially means that the data analytics platform includes the data lake and invites both casual users and data scientists to use both raw data and information, or a mixture of both, on the platform.
The next reason for the L-shape of the data lake in the above architecture diagram is that today’s enterprise data doesn’t only contain structured data that fits into relational databases. Instead, enterprise organizations deal with semi-structured data such as JSON and XML data and unstructured data such as PDF documents or images. They are not a good fit for a relational database and stay on the data lake.
But it is also said that there is no unstructured data: instead, we just don’t understand the structures of the data yet. For example, face recognition could be applied to images of persons on the data lake in order to associate them with customer records. By doing so, we extract some structured data from the unstructured image: the customer business key. Or optical character recognition could be used to extract plain text from a PDF document that contains scanned pages. There are two options for dealing with these results in the above architecture: either we store the results of our data lake-centric business logic in the Business Vault or we just link the unstructured data with the structured data. The option to be used depends on the specific use case.
The message queue on top of the architecture diagram is not only used to capture real-time data but also to deliver real-time information. We discuss this architecture in more detail in the next article of this blog series. Data governance and the automation components are both based on (or provide) metadata and thus are supporting components of the architecture.
Implementing Data Vault 2.0 on Azure
The previous section introduced the Data Vault 2.0 architecture. We like to call this a reference architecture because the actual implementation depends on many circumstances: existing components to be used in the project, specific requirements to the infrastructure, and the tool stack to be used all influence the actual architecture in a project.
Therefore, in most projects, the actual architecture looks different from the reference architecture to some degree. As a general rule, the deviations should be minimal and justified, and the actual architecture should allow for adjustments and extensions in the future. For example, even if a client at Scalefree declines the need for real-time data, we always make sure that the architecture can deal with real-time data in the future, when the needs and requirements of the client change. We don’t spend much effort and time (and thus cost) on it; we just take care of the later extensibility in the overall design process of the architecture.
When project teams don’t fully understand the Data Vault 2.0 architecture, they sometimes re-arrange the layers or extend the focus of each layer. The resulting architecture often lacks performance due to its complexity. Therefore, adjustments to the architecture are possible but should be applied with care.
On the Azure platform, the cost to adjust the reference architecture to the tool stack can be greatly reduced by the Azure cloud-scale analytics framework, a set of scalable and repeatable templates provided by the Azure team. The stated goal of the framework is to serve data as a product, provide all the data products from the Azure ecosystem, enforce data governance and security and, above all, put business users and their needs at the center of the solution, not technology.
The framework provides a number of templates, which can be applied to the implementation in a Data Vault 2.0 scenario. The following diagram shows how we typically adjust the Data Vault 2.0 reference architecture using these templates:
The diagram shows that the data lake capabilities are provided by the Data Landing Zone Template. This template provides services to persist data and execute relevant workloads that enable regional deployments, the separation of data ownership, cost management and data sharing within the organization. One of the services deployed by this template is the data lake. In this article, we refer to the data lake as a concept, not the Data Lake service in the Azure cloud (which implements the concept of a data lake). The template will deploy various zones, including a raw data lake zone, curated data lake zone and workspace data lake zone. The raw data lake zone is the one relevant for the data ingestion into the Raw Data Vault later and matches the data lake in the above diagram. The curated data lake zone and workspace data lake zone apply data quality and other transformations which represent business rules (soft rules) that are applied downstream from the Raw Data Vault layer. However, these zones are applicable in a Managed Self-Service BI solution or when dealing with semi-structured or unstructured data that should be processed in the data lake instead of the relational Synapse database (the Raw Data Vault can also be created as a concept on a data lake). We will provide some more details about this approach in a later article on data science with Data Vault 2.0 in the Azure cloud.
The components of the second layer in the Data Vault 2.0 architecture, the EDW layer, are provided by the Data Product Batch template. This template will deploy the relevant databases, first and foremost Azure Synapse. For the remainder of this article series, Azure Synapse will be used to provide examples. However, many clients do use a mix of technologies for the EDW layer, for example a Cosmos DB for semi-structured or graph data and a Synapse database for the structured, relational data. The template will provide a number of different databases, including optional MySQL and PostgreSQL databases.
To provide the real-time capabilities of the Data Vault 2.0 architecture, the Data Product Streaming template is used. This template will deploy the streaming components, including the Azure Event Hubs, IoT Hub and Stream Analytics services, into the environment. Here, it is important to note that both the batch data and the streaming data should be loaded into the same EDW layer (the Synapse database, for example). Event Hubs and IoT Hub are used to transport the message streams towards and within the data analytics platform, while Stream Analytics is used to apply business logic in real-time for the real-time enabled Business Vault. This will be discussed in more detail in the next article on real-time processing with Data Vault 2.0 on Azure.
The Data Product Analytics template provides services to analyze data both in the EDW layer and the data lake. A common scenario is to use AzureML to perform data mining tasks on the structured data in Synapse or the semi-structured or unstructured data in the data lake. Note that the usage is not exclusive: in many cases, both data sets will be combined for such analytical tasks. From a Data Vault 2.0 perspective, the services in this template are used to implement business logic in the Business Vault. The results are typically stored in the relational Business Vault, but unstructured results might also be stored in a NoSQL Business Vault on the data lake.
It is important to note here that the data analytics platform is not limited to a Data Vault use case. Instead, the data lake that is used by the architecture to stage data can also be used for other analytical use cases, for example data mining with AzureML, managed Self-Service BI, or any other data science task.
Network, governance and consumption services are provided by the Data Management Zone template. Microsoft Purview provides data governance capabilities and allows the data analytics team to define glossaries, classify sensitive data and define data assets and their relationships as metadata. Similar metadata is also set up to automate the generation of the Data Vault 2.0 model and the loading procedures. In addition, the template provides services such as Microsoft Power BI for dashboarding. With those templates in mind, the following diagram illustrates a typical technology stack for the Data Vault 2.0 architecture based on the Azure cloud:
Note that the above technology stack should only serve as a blueprint for the platform. The advantage of the Data Vault 2.0 concept is that the solution can easily be extended by additional technologies across different environments.
In this article, we have discussed the Data Vault 2.0 reference architecture and its adaptation to the Microsoft Azure cloud. In our next article, we will take a closer look at data streaming and discuss the options for real-time processing in Data Vault 2.0.
About the Authors
Michael Olschimke is co-founder and CEO at Scalefree International GmbH, a Big-Data consulting firm in Europe, empowering clients across all industries to take advantage of Data Vault 2.0 and similar Big Data solutions. Michael has trained thousands of data warehousing individuals from the industry, taught classes in academia and publishes on these topics on a regular basis. Marc Finger works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016, he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply and facility management sectors. In 2020, he became a Data Vault 2.0 Instructor for Scalefree.