Unify your data lakes with HDFS tiering in SQL Server Big Data Clusters
As the volume and variety of data have grown, it has become more common to store data in disparate and diverse data sources. A challenge many organizations face today is how to gain insights from all of their data across these many sources. With SQL Server 2019 Big Data Clusters, we’re extending our data virtualization capabilities even further with a new feature called HDFS tiering.
HDFS tiering allows you to easily integrate and gain insights from all of your data by accessing unstructured data stored in remote data lakes. This is done by mounting a remote HDFS- or S3-compatible data source into your local HDFS data lake.
This new functionality is Microsoft’s latest major contribution to the Apache Hadoop open source project and will be available in the market first in SQL Server 2019 Big Data Clusters.
Before we look closer at HDFS tiering, let’s quickly look at SQL Server Big Data Clusters as a data platform.
SQL Server Big Data Clusters
SQL Server Big Data Clusters is a complete data platform for analytics and AI, with a built-in local HDFS data lake for storing high-volume and/or unstructured data. In the big data cluster, you can use two different compute engines for querying and machine learning: Apache Spark™ and SQL Server.
Currently in SQL Server Big Data Clusters, you can use HDFS tiering to mount the following storage solutions: Azure Data Lake Storage Gen2, AWS S3, Dell EMC Isilon, NetApp StorageGRID, and Pure Storage FlashBlade. We are expanding this list to include other major HDFS/S3-compatible storage solutions, both on-premises and in the cloud.
Now let’s take a closer look at HDFS tiering.
HDFS tiering
HDFS tiering allows you to mount remote storage to your big data cluster and seamlessly gain instant access to the remote data from either Apache Spark™ or SQL Server.
When the mount command is issued, the mount credentials are used to authenticate to the remote storage and copy the remote file and directory metadata, including permissions, to the local HDFS. This operation is relatively quick since only metadata is copied. There is no data movement!
After the mount operation completes, you have immediate access to your remote data. By default, data is cached locally on first read, so subsequent reads of the same data perform better because they are served from the local cache.
The default cache size is two percent of the total storage capacity of the local HDFS data lake, and the cache for a specific mount is emptied when the mount is refreshed or deleted.
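As a quick illustration, once a mount exists (created with the command shown in the next section; the mount path and file name here are hypothetical), the remote namespace is immediately visible through standard HDFS tooling:

```
# List a (hypothetical) mounted directory from a Hadoop client. The entries
# and their permissions come from the metadata copied at mount time, even
# before any file contents have been read.
hdfs dfs -ls /mounts/adls

# Reading a file for the first time fetches it from the remote store and
# caches it locally; subsequent reads are served from the cache.
hdfs dfs -cat /mounts/adls/sales/2019-08.csv | head
```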
Create a mount with a single command
Creating an HDFS tiering mount in SQL Server Big Data Clusters can be done with one command:
azdata bdc hdfs mount create --remote-uri <URI to remote storage> --mount-path <local HDFS path>
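As a concrete sketch, mounting an Azure Data Lake Storage Gen2 container looks roughly like this; the account, container, key, and mount names are placeholders, and the credential format and mount subcommands follow the HDFS tiering documentation:

```
# Supply credentials for the remote store via the MOUNT_CREDENTIALS
# environment variable (an account key is used here; OAuth-based
# credentials are also supported).
export MOUNT_CREDENTIALS="fs.azure.abfs.account.name=<account>.dfs.core.windows.net,
fs.azure.account.key.<account>.dfs.core.windows.net=<account-key>"

# Mount the remote container into the local HDFS namespace.
azdata bdc hdfs mount create --remote-uri abfs://<container>@<account>.dfs.core.windows.net/ --mount-path /mounts/adls

# Mount creation is asynchronous; check the state until the mount is ready.
azdata bdc hdfs mount status --mount-path /mounts/adls

# Refresh the mount to pick up remote changes (this also empties its cache),
# or delete it when it is no longer needed.
azdata bdc hdfs mount refresh --mount-path /mounts/adls
azdata bdc hdfs mount delete --mount-path /mounts/adls
```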
Watch this video for a demo of how HDFS tiering can be used in SQL Server Big Data Clusters.
It has never been this easy to gain instant access to remote data and virtually limitless storage in the cloud from your local big data cluster. However, ease of use is not the only benefit of HDFS tiering:
- Save costs and reduce data movement
Instead of copying large amounts of data from one data lake to another and maintaining additional integration pipelines for data movement, HDFS tiering allows you to leave the data in cheaper object stores and still get fast turnaround with on-demand reads and caching.
- Secure sharing of big data
HDFS tiering makes it easier to securely share your organization's big data across teams so you get the most value out of your data. Upon mounting, the remote permissions are copied to your local data lake, which means they are honored every time the remote data is accessed. In addition, HDFS tiering supports secure mount operations using OAuth access keys to authenticate to the remote data source. Azure Active Directory support for mounting against Kerberos- and Azure Active Directory-joined data sources is coming soon.
- Portability across compute engines
Analyzing all of your data across different data lakes gives you the freedom to use the compute engine that best fits a given use case. In the big data cluster, you can use SQL Server and Apache Spark™ out of the box for your data processing and analysis, and HDFS tiering enables both compute engines to process data in your local and mounted data lakes seamlessly, as shown in the sketch after this list.
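For instance, here is a minimal PySpark sketch, assuming a mount at the hypothetical path /mounts/adls and CSV data with a shared customer_id column in both lakes; the paths and column names are illustrative, not part of the product:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-tiering-demo").getOrCreate()

# Local HDFS data and mounted remote data are read through the same namespace.
local_df = spark.read.csv("/data/clickstream", header=True, inferSchema=True)
remote_df = spark.read.csv("/mounts/adls/sales", header=True, inferSchema=True)

# Join across the two lakes as if they were one. The first read of the remote
# files populates the local cache, so repeated reads are served locally.
local_df.join(remote_df, on="customer_id") \
        .groupBy("region") \
        .count() \
        .show()

spark.stop()
```

The same mounted paths are equally visible to SQL Server, for example as the location of an external table over the data lake.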
Join our customers and experience the benefits of HDFS tiering yourself.
“HDFS tiering has saved us lots of time and money in development costs. We have lots of data stored in Azure Data Lake Storage Gen2. With HDFS tiering we can simply mount to the data in those locations without having to create and maintain a separate integration process.” – Lance Milton, Application Management Advisor – Data Integration at ENGIE North America
To learn more about how you can unify your data lakes with HDFS tiering in SQL Server Big Data Clusters, please visit the HDFS tiering documentation. And if you’re interested in the technical details of how we built this new dynamic mounting functionality in HDFS, we encourage you to read more on the Jira page.