Data warehousing has been around for a long time now. There are many myths and prejudices that have built up over this time, but it remains a critical component for so many businesses.
Data warehouses provide insight into the business, all the way from satisfying regulatory requirements to predicting what will happen in the future and how the business should respond.
Recently, the focus has been more on the ‘data lake’. As the prevalence of data has grown, new technologies to support the petabyte-scale datasets have been developed. Many new big data concepts have been pushing the myth that once you have a data lake, you don’t need a data warehouse.
Even if you don’t buy into the marketing, people are rightly asking if the data lake, with all the capabilities that big data storage provides in the cloud, has superseded the data warehouse.
The prevalence of big data and the ability today to store huge volumes of data cheaply has led to many (if not all) businesses having a data lake. So if this is the case, do advances in the data lake concept mean the data warehouse is obsolete?
In this blog we’ll look at the five major concepts behind this myth. I’ll also discuss why the modern data environment needs both a data lake and a data warehouse to drive better business and consumer outcomes, and how Azure Synapse Analytics removes the technical barriers that have driven this myth.
So first let’s look at the five concepts that have allowed this myth to develop:
1. A data lake provides many benefits over a traditional data warehouse.
It stores all types of data at a granular level, with less of a need to worry about scale and cost. It supports fast data ingest and no ETL bottleneck with schema on read. It’s an environment with open, flexible access and enables new types of analytics over the old model.
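To make “schema on read” concrete, here is a minimal Python sketch. It is an illustration only, not how any particular data lake engine works: the sample events, the `ORDER_SCHEMA` mapping and the `read_with_schema` helper are all hypothetical names invented for this example. Raw records are stored as they arrive, and types and defaults are applied only when someone reads them.

```python
import json

# Raw events land in the "lake" exactly as produced; nothing is enforced on write.
# (Hypothetical sample data for illustration.)
RAW_EVENTS = [
    '{"order_id": 1, "amount": "19.99", "country": "UK"}',
    '{"order_id": 2, "amount": "5.00"}',
    '{"order_id": 3, "amount": "12.50", "country": "DE", "coupon": "SPRING"}',
]

# Hypothetical schema chosen by the reader: field name -> (converter, default).
ORDER_SCHEMA = {
    "order_id": (int, None),
    "amount": (float, 0.0),
    "country": (str, "unknown"),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and apply the schema only at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            value = record.get(field)
            # Missing fields get the default; extra fields are simply ignored.
            row[field] = convert(value) if value is not None else default
        rows.append(row)
    return rows


orders = read_with_schema(RAW_EVENTS, ORDER_SCHEMA)
total = sum(row["amount"] for row in orders)
```

The trade-off is visible even at this scale: ingest is trivial because nothing is validated up front, but every reader has to decide how to handle missing or unexpected fields.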
2. Old perceptions: data warehouses are slow and inflexible, and it takes too long to get anything done.
In reality, this is usually due to business silos and slow processes rather than technology. The data warehouse provides a single source of integrated, conformed, and quality data that provides value and competitive advantage to the business. Because it is managed by IT, who are typically driven by SLAs rather than business need, processes (red tape) and checks (delays) are built in.
Traditional data warehousing systems were more often than not appliance based, where scaling beyond the installed base is costly and slow, and where storage and compute are linked. This meant they didn’t have the flexibility to extend or scale to meet new business requirements.
3. “I don’t need a data warehouse, I can run reports from my data lake using one of the popular big data engines.”
This quote sounds good in theory, but usually falls short in three main areas: predictable cost, performance, and enterprise-grade governance and security.
A data lake usually doesn’t provide the performance the end user needs. True data warehouse technologies like Azure Synapse Analytics have been built to support fast response and high concurrency without degradation.
Data lake reporting engines lack predictable cost: these technologies typically need to scale up to support concurrency and workloads, whereas Azure Synapse Analytics doesn’t need to scale up, thanks to its workload management capabilities. As for compliance, governance and security, these often aren’t applied in the same way on data lakes, due to a combination of processes and technology.
4. Usability and cost
Data lakes are now the de facto landing area for data, and teams are now used to using big data technologies like Spark to manage the data. Even though the benefits of traditional data warehouse technologies are clear, a data warehouse layer is seen as a separate layer that requires different skills, tools and expertise to manage the complexity. Cost-conscious or budget-limited IT departments ask why they need to pay for a further layer of technology to support analytics.
5. The view that self-service BI can now do it all
It is true that heavy governance issues and the associated time delays have meant users have always built their own “departmental” solutions to meet their immediate and critical needs. Today, with powerful BI tools like Power BI, it has never been easier to do self-service business intelligence. Advanced features such as Dataflows in Power BI Premium allow the power user to do it all, enabling an analytics culture.
If we compare traditional IT-driven intelligence with modern self-service BI, we see a big difference in focus. Traditional BI is structured and requirements driven. It’s good at core, well-known reporting but lacks flexibility and the ability to keep pace with the modern fast-paced business.
Modern BI gives the user the ability to do what they want, when they want in a flexible environment. The problem with modern self-service BI is when it is implemented in an uncontrolled manner. People create their own versions of metrics and KPIs. Agility becomes anarchy. It can very quickly lead to data duplication and data drift, with no single version of the truth for the business.
Self-service BI cannot be at the expense of clarity or confidence in the result.
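The drift described above can be sketched in a few lines of Python. Everything here is hypothetical and invented for illustration (the sample orders, the two team functions and the `GOVERNED_METRICS` registry); it is not taken from Power BI or any other tool. Two teams compute “revenue” from the same data and get different answers, while a single governed definition gives everyone the same number.

```python
# Hypothetical sample data shared by both teams.
ORDERS = [
    {"id": 1, "amount": 100.0, "refunded": False},
    {"id": 2, "amount": 50.0, "refunded": True},
    {"id": 3, "amount": 25.0, "refunded": False},
]


def sales_team_revenue(orders):
    """One team's private definition: every order counts."""
    return sum(o["amount"] for o in orders)


def finance_team_revenue(orders):
    """Another team's private definition: refunded orders are excluded."""
    return sum(o["amount"] for o in orders if not o["refunded"])


# A governed alternative: one agreed definition, published once and reused.
GOVERNED_METRICS = {
    "net_revenue": lambda orders: sum(
        o["amount"] for o in orders if not o["refunded"]
    ),
}


def metric(name, orders):
    """Look up the shared definition instead of re-implementing it."""
    return GOVERNED_METRICS[name](orders)


sales_view = sales_team_revenue(ORDERS)      # 175.0
finance_view = finance_team_revenue(ORDERS)  # 125.0
shared_view = metric("net_revenue", ORDERS)  # 125.0 for every caller
```

Neither team’s number is wrong in isolation; the problem is that both are called “revenue”. A conformed layer exists to settle that naming and definition once.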
Utilise the best of data warehouses and data lakes together
What we need is a blend of both practices – both a data lake and a data warehouse. Data warehouses, where data is managed in a controlled manner, provide a single version of the truth in a form that is easily accessible to the end user.
To implement this together with the strengths of the data lake requires the development of a Logical Data Warehouse or Modern Data Warehouse environment.
Part of the problem with the old Data Warehouse concept was that it was built on a single monolithic system, e.g. Teradata, Netezza or Exadata as your data warehouse platform, with the restrictions that these technologies brought.
Pivoting to the other extreme, using only data lake technologies brings a different set of problems. The Modern Data Warehouse ecosystem includes both the data lake and the data warehouse, with different technologies used in a flexible manner and the right tool used for the right job.
Removing the barriers for modern data warehouses
Azure Synapse Analytics enables the modern data warehouse and removes all the technology barriers that were used as an argument against data warehousing in the past.
Utilising the power of the Azure cloud, its integrated offering removes the barriers between data lake storage and data warehouse analytics and concurrency. It provides a limitless analytics service that brings together these traditionally separate environments.
Azure Synapse provides a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs. Azure Synapse Studio provides the interface to orchestration and integration capabilities, Big Data Storage, the Data Warehouse engine, Spark clusters and integrated Power BI and Azure Machine Learning capabilities.
With Azure Synapse, data professionals can query both relational and non-relational data at petabyte-scale using the familiar SQL language wherever it resides – in the Data Lake storage or the Data Warehouse – or through an integrated Spark cluster with their language of choice: Python, Scala, R, .NET, or Spark SQL.
Azure Synapse Analytics allows the implementation of a logical data warehouse concept. It gives users the freedom to query data to meet business needs, using either serverless on-demand or provisioned resources, all at scale, so it is now easy and fast to bring new use cases to the cloud. This is all done by using the “right tool for the job” in an integrated environment. Storage and compute scale independently, as required to meet the business requirements, so you only pay for the resources you need.
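The idea of one SQL query spanning lake and warehouse data can be illustrated with a small, self-contained Python sketch. This is a local stand-in only and does not use or represent the Azure Synapse API: an in-memory SQLite database plays the role of the warehouse, a list of JSON lines plays the role of files in the lake, and the table names (`dim_customer`, `lake_clicks`) are hypothetical. A real engine would read the lake files in place, whereas this sketch copies them into a temporary table at query time.

```python
import json
import sqlite3

# Stand-in for raw files in the data lake (hypothetical sample data).
LAKE_CLICKS = [
    '{"customer_id": 1, "page": "/pricing"}',
    '{"customer_id": 2, "page": "/docs"}',
    '{"customer_id": 1, "page": "/signup"}',
]

# Stand-in for the data warehouse: a curated, conformed dimension table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, segment TEXT)"
)
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)",
    [(1, "enterprise"), (2, "startup")],
)

# Expose the lake data to SQL at query time. This temp table only simulates
# an external table; it is an assumption of the sketch, not a Synapse feature.
conn.execute("CREATE TEMP TABLE lake_clicks (customer_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO lake_clicks VALUES (:customer_id, :page)",
    [json.loads(line) for line in LAKE_CLICKS],
)

# One familiar SQL statement joins raw lake data to the governed dimension.
clicks_by_segment = conn.execute(
    """
    SELECT c.segment, COUNT(*) AS clicks
    FROM lake_clicks AS l
    JOIN dim_customer AS c ON c.customer_id = l.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
    """
).fetchall()
```

The point of the sketch is the shape of the final query: the analyst writes ordinary SQL and does not need to care which side of the join lives in the lake and which in the warehouse.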
Foster a culture of innovation
Coming back to the original concept, the argument that you don’t need a data warehouse once you have a data lake is now made redundant by Azure Synapse Analytics. We can now continue to provide the benefits of a traditional data warehouse. However, instead of persisting with some of the older practices associated with data warehouses, a new data environment can be built upon the data lake. This fosters a culture of innovation and collaboration. It also provides end-users access to the data they need, when they need it.
Azure Synapse Analytics is at the centre of a modern data warehouse architecture, including both a data lake store and modern data warehouse technologies to enable agile working and satisfy the business’s high-value requirements.
Find out more
- Try Azure Synapse Analytics
- Implement a Data Warehouse with Azure SQL Data Warehouse
- How to Operationalise your Data Lake
- Azure Synapse Analytics Overview
- Cheat sheet for Azure Synapse Analytics
About the author
Greg is an accomplished IT professional and Solution Architect with over 25 years’ experience in Big Data, Data Warehousing, BI and Analytics, developed through tenures in organisations such as Microsoft, Teradata, Kognitio, Oracle and Ericsson.
He specialises in Analytical and IDW Architecture, with extensive experience of building solutions, technical design and implementing big data and analytical environments. He has an in-depth knowledge of Big Data and large-scale BI and Data Warehouse implementations in numerous enterprise environments.
Greg’s approach to work is results-driven, utilising extensive solution design and technical delivery experience together with the ability to work through complex solutions to solve difficult challenges. He has been acknowledged throughout his career for delivery rigour coupled with a calm, pragmatic attention to detail and an ability to quickly establish understanding and provide leadership on challenging projects.