Designing a modern data catalog at Microsoft to enable business insights

Nov 13, 2019

In this digital age, data is a strategic asset, and creating a modern data catalog is key to finding and connecting siloed enterprise data, which enables people to ask important questions and gain business insights. Microsoft used feedback from end users to develop an intuitive data catalog that incorporates discoverability, traceability, and access management. To gain adoption for the catalog, the team generated excitement through a structured marketing campaign that used internal social channels, posters and digital signage.

In this digital age, data is a company’s lifeblood. Growing recognition of data’s true value has spawned new industries like online advertising and e-commerce. Other traditional enterprises add value to their products by using data to gain insights and offer services that their customers need. What does this mean for technical decision makers and their leadership? If you are committed to digitally transforming your company, treat your data as a strategic asset.

For many enterprises, this is more easily said than done. The larger the company, the more barriers data professionals encounter as they try to access needed data:

  • Where is the data in each department? Who owns it?
  • Which tables have curated content?
  • Is the right data being used?
  • Is the data trustworthy?

The promise of using new technologies such as machine learning (ML) to deliver best-in-class user experiences falls short when the underlying data, the fuel that drives the AI modeling engine, lacks quality. As the saying goes: garbage in, garbage out.

This inability to discover and manage data holistically in the enterprise stems from the fact that departments, infrastructure, teams, and data have been traditionally siloed. This silo structure makes it difficult for people to locate the data they need to answer their business questions. Any query that requires data stored outside a single silo involves tracking down the appropriate group, requesting and gaining authorization from the data owner, looking up data in unfamiliar physical tables, and attempting to manually connect information from different sources in a manner that produces the correct output. Most people consider this process so onerous that, often, their important business questions are left unanswered.

This case study describes how we at Microsoft view our data as a strategic asset. Here, we discuss the approach that Microsoft Digital is taking to create a connected data catalog that reduces the time required to discover and use trustworthy data for creating business insights.

Treat your data as a strategic asset

Today, companies can no longer afford to limit their data usage to single, siloed sources. To improve business, technical decision makers like you need connected data that spans departments. Data is the core driver for your success and your competitive advantage, but only when you use it to its full potential. Connecting trusted, curated data from multiple sources will help you better understand your customers and their needs. Access to such information might confer a competitive advantage today, but extracting more meaning from your data will soon be critical to your company’s survival. Enterprises must gain insights into customers’ business challenges, and how well a company addresses those challenges will soon define its success or failure.

As a technical decision maker, how can you help move your organization’s data from siloed, unintuitive tables to an easily discoverable, connected data pool? One that enables the enterprise to gain strategic insights that drive its digital transformation? Further, how do you give people access to this array of data without violating regulations like the General Data Protection Regulation (GDPR), the Sarbanes-Oxley Act (SOX), or the Health Insurance Portability and Accountability Act (HIPAA)?

At Microsoft, a key aspect of how we’re addressing the challenge of making trusted, connected data more readily available to users is by creating a modern data catalog. Our strategy is to develop a catalog that enables discovery of the right connected data and enhances data governance.

The problem with silos

The tech industry has a newfound awareness of the value of data and how it can be used to understand what customers need and want. For machine learning, the value of data depends on its quality and on a team’s ability to find the right data, especially when that data sits across silos in an enterprise.

There are a few barriers to accessing accurate and relevant data:

  • Siloed data across multiple organizations: In Microsoft Digital, data exists across business areas and processes like HR, finance, and supply chain. To find the right data across these organizations, you have to 1) track down the appropriate group; 2) request authorization to view the data; 3) look up data in unfamiliar tables; and 4) put the data together from different sources.
  • Poor data quality: Teams have to focus on cleaning and standardizing the data before they can use it.

Most broadly, there is no data marketplace where teams can find what they need. Microsoft Digital saw this as an opportunity to capitalize on data across organizations and to reduce the time required to discover and use trustworthy data. In line with the mission of creating end-to-end experiences, Microsoft Digital has created a modern data catalog that finds and connects siloed enterprise data.

The modern data catalog: enabling discoverability and governance

At a minimum, a data catalog is an information map that lists the data residing in one or many sources. However, it can be so much more. As part of our mission to modernize our data assets, the Microsoft Digital team has designed a common, cross-department data catalog that connects data sets across the company, reduces the time to discover and use trustworthy data, and helps enforce proper data governance. In the following sections, we discuss some of the key discoverability and governance features we’re building into our data catalog.

Making data more accessible and discoverable

To enable Microsoft Digital to help leadership use our data to ask the big questions—the ones whose answers lead us to strategic insights—we needed to run data searches that can connect data from multiple departments. A data catalog is instrumental in performing such cross-department searches. In addition to listing many disparate data sources, our new data catalog offers an intuitive interface that simplifies the entire data discovery process. Our data catalog:

  • Spans multiple departments, providing a holistic view into virtually any aspect of the company’s operations.
  • Provides “recipes” with instructions that vastly simplify how to use connected data sets for key business scenarios.
  • Displays intuitive logical names (such as Customer) instead of physical table names (such as CUST09) to improve discovery.
  • Highlights quality level, certification, and source-of-truth indicators for each data asset and the corresponding scenarios in which it is accessed to help users assess its appropriateness for use.
  • Expands data to include advanced analytics through ML components and models.
  • Supports real-time scenarios by including discovery of existing data APIs.
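As a rough illustration of the logical-name and quality-indicator features listed above, a catalog entry can be modeled as a small record that pairs a business-friendly name with its physical table. This is a minimal sketch and not our implementation: the `CatalogEntry` fields, the `find_by_logical_name` helper, and every sample value other than Customer and CUST09 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    logical_name: str      # business-friendly name shown to users
    physical_table: str    # table name in the underlying source system
    department: str        # owning department (hypothetical field)
    certified: bool        # whether the asset has passed certification
    source_of_truth: bool  # whether this is the recommended source


# Hypothetical sample data; only "Customer" and "CUST09" come from the text.
CATALOG = [
    CatalogEntry("Customer", "CUST09", "Sales", certified=True, source_of_truth=True),
    CatalogEntry("Customer", "CUST_TMP", "Marketing", certified=False, source_of_truth=False),
    CatalogEntry("Invoice", "FIN_INV_03", "Finance", certified=True, source_of_truth=True),
]


def find_by_logical_name(catalog, term):
    """Return entries whose logical name contains the term (case-insensitive)."""
    needle = term.strip().lower()
    return [entry for entry in catalog if needle in entry.logical_name.lower()]


matches = find_by_logical_name(CATALOG, "customer")
for entry in matches:
    print(entry.logical_name, "->", entry.physical_table, "| certified:", entry.certified)
```

Users search on the logical name only; the physical table and its indicators travel with the result, so they can judge whether the asset suits their scenario.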

Figure 1 illustrates the strategy that we’re employing with our data catalog: we’re expanding its scope of included data and enhancing its functionality so that anyone can use it to ask business questions and gain valuable insights.

Illustration that depicts our data-catalog strategy, which will enable users to get answers to data questions quickly and easily.
Figure 1. Our data-catalog development strategy involves a multiphase approach that ultimately will enable business users to obtain the answers to their data questions with the click of a button
  • Phase 0 is siloed data. Each department’s data is locked into its own silo, delaying discovery of and access to data.
  • Phase 1 is connected data for developers. This is where developers work with a data catalog to identify and access cross-department data to develop data-sharing structures.
  • Phase 2 is quality, curated data for analysts and data scientists. This is where analysts and data scientists use the data-sharing structures to access quality data and follow instructions in “recipes” to define query objects that answer business questions.
  • Phase 3 is ready-made data objects for everyone. This is where business users select the appropriate object to get the answer to a given question. The object automatically accesses the connected data.

What this means for you

Break the silos. Developing a data catalog might not be a priority if you’re limiting your data search to a single set of siloed data, but these simpler queries won’t answer the big questions or support predictive analytics. Using such advanced techniques requires working across departments and incorporating different groups’ data—and that necessitates building a data catalog. Remember: data is the key ingredient that enables people to ask the big questions that ultimately produce the deepest insights. As you work across departments, include as many different data types as possible in your catalog, because you can’t predict which data will produce the next big discovery. At Microsoft, we’re taking a holistic approach, building out a data catalog that provides our data scientists with visibility into our entire rich assortment of data in different storage fabrics, cubes, reports, APIs, and events. This deep, high-quality information is precisely what our AI models need to predict future trends.

Follow the recipe. Instead of having your data catalog present technical lists and unfamiliar data structures that are difficult to decipher, consider designing a more intuitive interface. For example, we’ve built logical models that map to the data assets, as well as “recipes” with easy-to-follow instructions for how to generate the right connected data set for a given business scenario. Building these recipes is a collaborative effort involving technical teams that design the framework/interface, and business-process experts who can identify the appropriate data sources and provide instructions for how these “ingredients” are best used for each scenario.

Applying data governance without stifling discovery

A good data catalog does more than ease discovery. It also helps drive data governance by building controls into the catalog’s data-discovery process to ensure that only people with appropriate need and authority can access sensitive data. This promotes compliance with government regulations through processes, patterns, and tools for data management and governance of Microsoft Digital data assets.

With the right mechanisms in place, a data catalog can even require requestors to fill out additional forms or sign policies that explain data sharing and usage before access is granted. Data catalogs can also incorporate traceability, tracking who accesses what, how often the data is accessed, and how it’s being used. Our data catalog:

  • Gives management a clear understanding of the data we have and what that data represents, which is key to applying appropriate policies to ensure compliance.
  • Includes certification attributes that flag vetted data, then prioritizes data from these certified tables as the recommended source for building connected data sets. This helps us use the highest-quality data while bypassing other tables in the catalog whose information might be less accurate (such as test data or data published by other sources).
  • Spreads data-ownership responsibility across the teams and individuals who understand its business value.
  • Incorporates traceability to monitor usage of data assets and enforce protection.

Technical decision makers can use our data catalog’s dashboards to quickly scan all the organization’s data and review health information. They can then drill into details to identify potential issues that can impact governance (as illustrated in Figure 2) and subsequently take appropriate action for remediation.

Illustration that depicts how we draw on siloed data from across the organization, classify it, and identify the certification, resources, and filters.
Figure 2. We’re designing an intuitive interface that technical decision makers and IT leadership can use to scan the data sources, review health and other conditions, and identify potential issues and quickly remediate them
  1. Selecting the Customer hot spot in the data map displays summary information about the health of the Customer data.
  2. This health table highlights that non-certified asset consumption is higher than desired. Select Consumption of non-certified assets to view consumption details.
  3. You can easily identify which non-certified data is used the most, then take appropriate action.
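The drill-down in steps 2 and 3 amounts to aggregating access records by asset and keeping only the non-certified ones. The sketch below shows one way to compute that from a simple access log. The log format, the asset names, and both function names are assumptions made for illustration; they do not describe the dashboard's actual data model.

```python
from collections import Counter

# Hypothetical access log of (user, asset) pairs.
ACCESS_LOG = [
    ("ana", "CUST09"),
    ("ben", "CUST_TMP"),
    ("ana", "CUST_TMP"),
    ("cho", "CUST_TEST"),
    ("ben", "CUST_TMP"),
    ("cho", "CUST09"),
]

# Hypothetical set of assets that have passed certification.
CERTIFIED_ASSETS = {"CUST09"}


def non_certified_consumption(access_log, certified_assets):
    """Return (asset, access_count) pairs for non-certified assets, most used first."""
    counts = Counter(
        asset for _, asset in access_log if asset not in certified_assets
    )
    return counts.most_common()


def non_certified_share(access_log, certified_assets):
    """Return the fraction of all accesses that hit non-certified assets."""
    if not access_log:
        return 0.0
    hits = sum(1 for _, asset in access_log if asset not in certified_assets)
    return hits / len(access_log)


ranked = non_certified_consumption(ACCESS_LOG, CERTIFIED_ASSETS)
share = non_certified_share(ACCESS_LOG, CERTIFIED_ASSETS)
print(ranked, round(share, 2))
```

The share gives the headline health number, and the ranked list shows which non-certified assets to certify, replace, or retire first.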

What this means for you

Incorporate governance into your data-discovery processes. As you develop your data catalog’s cross-department capabilities, you should consider inserting mechanisms that enhance governance. You can design your data catalog to become the primary tool that helps you identify how your data is being used in the enterprise, and then use that information to enforce security policies that help maintain compliance. At Microsoft, we’re incorporating the following governance capabilities into our data catalog:

  • Allow publishers to self-attest that they’re submitting a trusted data set, which can also be certified by a governance board.
  • Send automatic access requests for restricted data to the data owner as soon as the need for that data is identified.
  • Provide links to forms to fill out if additional authorization for access is required.
  • Surface access policies before providing access to data.
  • Track who’s using sensitive data, how frequently that data is being pulled from the data catalog, and where the data is going.
  • Set retention policies based on how frequently data needs to be accessed, such as an automatic expiration every six months.
  • Use the data catalog’s telemetry to identify areas of improvement for data-management processes, such as redundant data sets.
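Several of the capabilities above (owner approval, surfacing policies before access, and automatic expiration) can be combined into a single access check. The following is a minimal sketch under stated assumptions: the `AccessGrant` record and `can_access` function are hypothetical, six months is approximated as 182 days, and the expiration is applied to the access grant, which is only one possible reading of the retention item.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "every six months" is approximated as 182 days.
ACCESS_LIFETIME = timedelta(days=182)


@dataclass
class AccessGrant:
    user: str
    asset: str
    policy_acknowledged: bool  # user has seen and accepted the access policy
    owner_approved: bool       # data owner has approved the request
    granted_on: date


def can_access(grant, today):
    """Allow access only if both preconditions hold and the grant has not expired."""
    if not (grant.policy_acknowledged and grant.owner_approved):
        return False
    return today < grant.granted_on + ACCESS_LIFETIME


grant = AccessGrant("ana", "Customer", True, True, date(2019, 1, 1))
print(can_access(grant, date(2019, 3, 1)))  # within the grant's lifetime
print(can_access(grant, date(2019, 8, 1)))  # after the grant has expired
```

Keeping the check in one function means every discovery path enforces the same rules, and an expired grant simply sends the user back through the request flow.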

Align each data set’s curation process with its related business process. This requires both a cultural change and a process change. To manage the data catalog successfully, you need to foster partnerships between the developers who define the data-catalog structures and the businesspeople who understand the value of the data and how each data element should be used. Employees who become data owners must understand the value of supplying and maintaining their data. You also need to streamline the data-management process as much as possible. Don’t assume that you can just create a form and expect people to fill it out and then continue to manage their data on an ongoing basis. Instead, take time to review the business and how business users actually orchestrate their data. Then, give users tools that align with that process to move their information into the catalog.

The goal is to simplify data entry and management so people are encouraged to maintain their data. For example, we’ve created auto-scanners that monitor our data lake and automatically register a data file on behalf of the developer the moment it’s added to the data lake. The appropriate businessperson—the data owner—is then notified and asked to fill out details of the new data-table files that have been placed into the data lake for them. We’ve also created a Microsoft Excel export capability to simplify how business owners add metadata to the catalog.
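The auto-scanner idea can be sketched as a pass over the data lake that registers any file the catalog has not seen yet and returns the new entries so that their owners can be notified. This is an illustrative sketch only: it polls a local folder rather than reacting the moment a file lands, and the `scan_and_register` function, the registry layout, and the owner lookup are all hypothetical.

```python
import tempfile
from pathlib import Path


def scan_and_register(lake_root, registry, owner_for):
    """Register files under lake_root that are missing from the registry.

    Returns the newly registered keys so that owners can be notified
    and asked to fill in the remaining metadata.
    """
    newly_registered = []
    for path in sorted(Path(lake_root).rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(lake_root).as_posix()
        if key in registry:
            continue  # already catalogued on an earlier pass
        registry[key] = {"owner": owner_for(key), "metadata_complete": False}
        newly_registered.append(key)
    return newly_registered


def owner_from_folder(key):
    """Hypothetical rule: the top-level folder name identifies the owner."""
    return key.split("/")[0] + "-owner"


# Demo on a temporary folder that stands in for the data lake.
with tempfile.TemporaryDirectory() as lake:
    (Path(lake) / "sales").mkdir()
    (Path(lake) / "sales" / "orders.csv").write_text("id\n1\n")
    registry = {}
    first_pass = scan_and_register(lake, registry, owner_from_folder)
    second_pass = scan_and_register(lake, registry, owner_from_folder)

print(first_pass, second_pass, registry)
```

Because registration happens on the developer's behalf, the only manual step left is the owner completing the metadata, which is flagged here by `metadata_complete`.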

Certify and promote your best data. Find the right balance between including only the best curated data (which would be extremely limiting) and adding everything to your catalog, including non-curated and test tables that can add large amounts of poor-quality data. At Microsoft, we’ve taken a best-of-both-worlds approach. We allow most data to be catalogued, but we also apply several attributes to each data source that describe the nature of the data, including whether it’s certified. Certified data has passed a validation process to ensure that it’s properly curated. Further, both certified data and other higher-quality data are surfaced in the discovery experience (such as search results) so that data seekers always use the highest-quality data in the data catalog.
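One simple way to surface certified and higher-quality data ahead of everything else is to sort discovery results on those attributes. The sketch below assumes each asset carries a boolean `certified` flag and a numeric `quality` score between 0 and 1; the score, the field names, and the sample assets are hypothetical and stand in for whatever attributes your catalog records.

```python
def rank_results(assets):
    """Order results: certified first, then higher quality, then by name."""
    return sorted(
        assets,
        key=lambda a: (not a["certified"], -a["quality"], a["name"]),
    )


# Hypothetical search results for the logical name "Customer".
results = [
    {"name": "CUST_TEST", "certified": False, "quality": 0.2},
    {"name": "CUST09", "certified": True, "quality": 0.9},
    {"name": "CUST_MKT", "certified": False, "quality": 0.7},
]

ranked_names = [asset["name"] for asset in rank_results(results)]
print(ranked_names)
```

Nothing is hidden from the catalog in this approach; lower-quality and test tables remain discoverable but appear after the vetted sources.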

Creating an end-to-end experience

Creating a modern data catalog isn’t just about solving a discovery problem. It should also offer an end-to-end experience that provides access to the data that a user needs. Microsoft Digital’s goal was to empower data consumers to search and find data to solve their business problems, and to enable data scientists to make use of the AI and ML models of other scientists and data owners in their own work.

To create this end-to-end experience, we took an agile, customer-centered approach to design the data catalog based on feedback from users. We began by gathering requirements across Microsoft Digital and doing a competitive analysis of other data catalogs at Microsoft. We met with key users to understand their needs and pain points, which were prioritized when developing the new iteration of the data catalog. Based on feedback from end users, the team iterated on a paper prototype of the data catalog’s interface. This enabled us to validate design decisions and workflows with user feedback. Once we arrived at a design, we built a fully functional prototype to validate and test the user experience and user interface of the catalog. We tested the search experience with pilot subscribers to validate our design decisions and the end-to-end experience.

Our modern data catalog offers multiple views for finding the right data set and getting access to it. These views include:

  • A card view,
  • A list and table view to see large sets of data, and
  • A visual browse view that shows how data is categorized and connected across data sets.

We also added an option for users to filter their data or bookmark searches, which is helpful when users perform multiple queries across sources. With this modern data catalog, engineers, analysts, and data scientists can find the right data for their project and apply machine-learning techniques. Using the modern data catalog also ensures that data consumers are receiving “source of truth” data, meaning they can trust or verify that the data can be used for its intended purpose.

Feedback continues to be a cornerstone of maintaining a modern data catalog. We use automated telemetry about the number of monthly active users and search success rate as well as verbatim comments from users to understand the current solution’s effectiveness. This feedback tells us which elements need to be prioritized in subsequent updates, and which capabilities are most used and valued by customers. Combining telemetry about usage with direct feedback from users gives us a complete picture of what works, and where we need further investment.
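The two telemetry measures mentioned above can be derived from a plain event log. The following sketch is an assumption-laden example rather than a description of our telemetry pipeline: the event fields are hypothetical, and a search is counted as successful when the user clicked a result, which is only one possible definition.

```python
from datetime import date

# Hypothetical telemetry events.
EVENTS = [
    {"user": "ana", "day": date(2019, 10, 3), "type": "search", "clicked_result": True},
    {"user": "ben", "day": date(2019, 10, 9), "type": "search", "clicked_result": False},
    {"user": "ben", "day": date(2019, 10, 11), "type": "view"},
    {"user": "ana", "day": date(2019, 10, 20), "type": "search", "clicked_result": True},
    {"user": "cho", "day": date(2019, 11, 2), "type": "search", "clicked_result": True},
]


def monthly_active_users(events, year, month):
    """Count distinct users with at least one event in the given month."""
    return len(
        {e["user"] for e in events if e["day"].year == year and e["day"].month == month}
    )


def search_success_rate(events):
    """Return the fraction of searches that led to a clicked result, or None."""
    searches = [e for e in events if e["type"] == "search"]
    if not searches:
        return None
    successes = sum(1 for e in searches if e.get("clicked_result", False))
    return successes / len(searches)


october_mau = monthly_active_users(EVENTS, 2019, 10)
success_rate = search_success_rate(EVENTS)
print(october_mau, success_rate)
```

Numbers like these show where usage is rising or where searches fail, while verbatim comments explain why, which is the reason to read the two together.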

Driving awareness through evangelism

Developing a modern data catalog isn’t just about building a solution. It’s also about creating excitement for the catalog and highlighting its value, even before the official launch. As illustrated in Figure 3, the go-to-market strategy was divided into three phases: pre-launch, launch, and post-launch.

Illustration of pre-launch, launch and post-launch major activities of a marketing plan.
Figure 3. The three phases of our communication and marketing plan for the implementation of our modern data catalog and the major activities that happen within each phase

During pre-launch, we identified a key set of features and capabilities that users wanted. To do this, we had to make sure we had enough assets in the catalog and we had to provide catalog users with the training materials they needed for immediate success. We invited potential early adopters to test our modern data catalog and provide feedback to help us work through any kinks. We called on the Data Analytics Working Group, which comprises key leaders who use data every day. Employee feedback helped our team identify common issues to address before the catalog shipped. We also gained new fans and influencers who drove adoption in communities across Microsoft Digital by creating anticipation and awareness of the solution.

During the launch, we created awareness about the modern data catalog by investing in a structured product-launch marketing campaign that took advantage of a range of channels. We strategically created a series of teaser campaigns with previews and quick demos, and we shared them on internal communication channels such as Yammer and Microsoft Teams. In addition to campaigns across Microsoft Digital, we promoted posters, videos, and digital signage on Microsoft campuses in the United States and in India. We tracked how many people had scanned the QR codes on our posters to get a sense of the campaign’s impact.

Post-launch, we continue to support the ongoing success of users with office hours and brown-bag events, where we share the latest best practices and featured demos. Providing these resources ensures that users have the necessary collateral to drive awareness of the data catalog. We also continue to collect telemetry and feedback, which inform future iterations of the catalog.

Best practices

Create conversations between data owners and data users: Previously, conversations about data quality occurred only between team members and business stakeholders. Bringing data producers and data consumers into conversation improves data quality and increases business value. The more they talk, the better they can identify gaps in governance, understand how data is used, and spot any additional security and compliance needs.

Develop empathy for your users: Before proposing a new solution, start by understanding the problems faced by your users. By talking directly with the people who will use your solution, you can eventually create a data catalog that benefits all end users.

Account for the impact of marketing: Microsoft Digital initially projected usage numbers based on the usage of existing data catalogs. The number of users in the first month was ten times what we expected, which reflected the increased interest created by marketing. It’s important to account for the impact of marketing and create a data catalog that can autoscale accordingly.

Create tools to support accurate data tagging: It can be challenging to correctly populate metadata, especially because data publishers have time constraints. Moreover, there is a human element in providing meaningful descriptions of tables and columns. To address this, we implemented telemetry to see where people get stuck in the catalog because of confusing documentation or unclear metadata. Then, we reached out to publishers to outline the problem and support them in accurate data tagging. This ensures that the catalog offers discoverable, usable, and certified data.

Conclusions

Technical decision makers are under immense pressure to help transform their business into a digital company—and the foundation of any digital company is its data. Companies must use their data more holistically, promoting a means for people and AI to synthesize information across the enterprise to gain deeper, more accurate insights that unlock new business capabilities. However, the data commonly siloed in different departments of the enterprise is a major stumbling block—you can’t use what you can’t discover.

Creating a modern data catalog is a foundational aspect of finding and connecting your data. If it includes data from all the enterprise’s departments, a data catalog can disrupt your old business model and deliver insights into the state of your business, into what your customers need, and more. When designed properly, a data catalog can dispel the notion that data discoverability conflicts with governance. Data catalogs that incorporate traceability and other governance features make it easier for your leaders to understand what data they have, how it’s being used, and whether the regulatory mechanisms they have in place are effective.

At Microsoft, our strategy is to build a holistic data catalog that incorporates discoverability, traceability, and access management. Our data catalog connects information from many different areas and ensures that information is curated to help consumers use it appropriately. This ensures that we can ask the big questions, that the right conversations are happening, and that the best AI models are being built. Our connected-data catalog is helping us discover new ways of adding value to our products and services, aligning our capabilities to benefit our customers now and in the future.

Our next steps

We’ll continue enhancing the scope of the data sets in the catalog, adding more metadata to simplify the query process. We began with key business-operations data and “recipes” for data engineers who need connected data sets to support key business-domain questions and create connected experiences. We’re in the process of collaborating with the Azure Data Catalog product team to apply our lessons learned to their product.

As we continue to add data to our catalog, we envision empowering every employee to use and share the data catalog. Our data catalog will be key to helping people find the connected data they need quickly, taking advantage of its functionality to gain insights and create data-driven, intelligent user experiences that continue to propel us along our digital transformation journey.