How Microsoft connects high-quality, discoverable data

Jan 30, 2020

At Microsoft, employees were already aware of the power of using data to create experiences that people love.

But that awareness wasn’t enough to bring the data to life—they needed better tools.

That’s why Michael Lucas set out to transform the company’s internal data catalog. He wanted to make it easier for employees to find data for their work, as well as upload assets for others to use. Lucas also wanted to provide tools for assessing data quality, which is a measurement of the data’s accuracy, consistency, and reliability.

Michael Lucas is leading Microsoft’s effort to redesign the company’s internal data catalog. He is a principal program manager for the Data Team in Microsoft Digital.

“A modern data catalog is a catalyst for intelligent experiences and insights,” says Lucas, a principal program manager for the Data Team in Microsoft Digital. “You can’t do that if you don’t have a foundation of high-quality, discoverable data.”

The company’s retooled modern data catalog has been a boon for employees like Marcela Alvarez Rodriguez, a software engineer on the Microsoft Cloud Business Intelligence and Analytics Team. She spends most of her time developing data analytics platforms for enterprise infrastructure scenarios. She’s also responsible for uploading her team’s SQL and data lake assets to the company’s internal data catalog, but this is only a small part of her work.

“Having a data catalog with a streamlined asset uploading process frees up my time to focus on developing data-driven solutions that increase engineering efficiency,” Rodriguez says. “It also helps our customers know which data we own so they can develop their own solutions.”

When Lucas started to redesign the company’s existing internal catalog, he discovered that it wasn’t designed to help people find data. Instead, it was optimized for a team of software developers who used it to maintain data warehouses.

“If we wanted to create a catalog that worked for developers and data analysts, we knew we had to start from scratch,” Lucas says.

Lucas knew that a successful catalog would answer questions for all end users, from developers to data analysts. To do this, he and his Modern Catalog Team decided to develop a modern data catalog that could be used for end-to-end scenarios.

[Learn how Microsoft developed its modern data catalog.]

Focusing on the user

To identify the key requirements for the data catalog, Lucas interviewed users across the company. Their jobs ranged from traditional data roles such as developers, data scientists, data analysts, and data stewards to common business roles like program managers and business managers. The interviews helped Lucas’s team understand pain points, build prototypes, and run usability tests with data publishers and catalog users, with the goal of a catalog that people could use easily and that would accelerate data discovery for business insights.

The Modern Catalog Team quickly found that data consumers spent most of their time tracking down the appropriate group owner of the data they needed, requesting and receiving authorization from the data owner, and then cleaning the data before being able to use it.

“It was time-consuming for users of our data and challenging for us because we had to answer so many different emails,” Rodriguez says.

It was also difficult for users like Kathy Brustad, a senior data and applied scientist in Worldwide Learning, to assess data quality.

“It was hard to identify the source of truth for certain data assets,” Brustad says. “It was also hard to know how data had been transformed when it flowed from one source to another and finally ended up being used.”

After gathering pain points, Lucas’s team worked with the Data Analytics Working Group, which comprises key principal-level representatives of Microsoft Digital who help shape data policies, to create a prioritized list of high-level requirements for the catalog. This led to the design of a modern data catalog that enables employees to intuitively browse for available assets and share their team’s data using a single site.

Brustad wanted to be able to identify the source of truth for data assets. Using the redesigned data catalog, she can assess data quality and its transformation over time by referencing the quality score, sample list of data, and lineage showing how the data connects to other datasets.
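To make the idea of lineage concrete, here is a minimal sketch of how a catalog entry might carry a quality score and pointers to the datasets it was derived from. The article does not describe the catalog's internal data model, so the `CatalogAsset` class, its field names, the 0-to-1 score scale, and the sample dataset names are all assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogAsset:
    """Hypothetical catalog entry; the field names are illustrative."""
    name: str
    quality_score: float  # assumed scale: 0.0 (worst) to 1.0 (best)
    upstream: list[str] = field(default_factory=list)  # direct source assets


def trace_lineage(catalog: dict[str, CatalogAsset], name: str) -> list[str]:
    """Return every upstream asset reachable from `name`, nearest first."""
    seen: list[str] = []
    queue = list(catalog[name].upstream)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        if current in catalog:
            queue.extend(catalog[current].upstream)
    return seen


def sources_of_truth(catalog: dict[str, CatalogAsset], name: str) -> list[str]:
    """Upstream assets that have no upstream of their own."""
    return [
        n for n in trace_lineage(catalog, name)
        if n in catalog and not catalog[n].upstream
    ]


# Made-up sample data, not real Microsoft assets.
catalog = {
    "raw_sales": CatalogAsset("raw_sales", 0.90),
    "training_roster": CatalogAsset("training_roster", 0.80),
    "cleaned_sales": CatalogAsset("cleaned_sales", 0.95, ["raw_sales"]),
    "seller_training_report": CatalogAsset(
        "seller_training_report", 0.85, ["cleaned_sales", "training_roster"]
    ),
}

lineage = trace_lineage(catalog, "seller_training_report")
origins = sources_of_truth(catalog, "seller_training_report")
```

Walking the `upstream` links answers both of Brustad's questions in this toy model: `lineage` shows how the data flowed into the report, and `origins` lists the assets at the start of that flow, which are the candidates for the source of truth.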

“There are many different places you can find this data, because it’s still getting replicated throughout the company,” she says. “Knowing that there are quality standards in Microsoft’s data catalog gives me a higher level of confidence.”

Prioritizing governance and user feedback

The catalog also integrates governance into the data registration process. If data publishers already have their assets in Azure Data Lake and follow best practices for governance, their assets can automatically be scanned into the catalog.
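As a rough illustration of governance built into registration, the sketch below registers an asset automatically only when its governance metadata is complete. The article does not list the actual best practices or describe the scanning mechanism, so the required fields, the `LakeAsset` class, the `scan` function, and the sample paths are hypothetical, and no real Azure API is called.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed governance fields; the article does not name the real requirements.
REQUIRED_FIELDS = ("owner", "description", "classification")


@dataclass
class LakeAsset:
    """Hypothetical description of an asset already stored in a data lake."""
    path: str
    metadata: dict[str, str] = field(default_factory=dict)


def missing_fields(asset: LakeAsset) -> list[str]:
    """List the required governance fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not asset.metadata.get(f, "").strip()]


def scan(assets: list[LakeAsset]) -> tuple[list[str], dict[str, list[str]]]:
    """Split assets into auto-registered paths and paths needing follow-up."""
    registered: list[str] = []
    needs_review: dict[str, list[str]] = {}
    for asset in assets:
        gaps = missing_fields(asset)
        if gaps:
            needs_review[asset.path] = gaps  # record what the publisher must add
        else:
            registered.append(asset.path)
    return registered, needs_review


# Made-up sample assets.
assets = [
    LakeAsset(
        "/sales/cleaned",
        {
            "owner": "sales-data-team",
            "description": "Cleaned sales records",
            "classification": "internal",
        },
    ),
    LakeAsset("/sales/raw", {"description": "Raw sales export"}),
]

registered, needs_review = scan(assets)
```

In this sketch, the fully described asset is registered without any manual step, while the incomplete one is returned with the list of fields its publisher still needs to supply.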

“We’re able to turn around our analytics a lot faster because we can establish an automatic connection to the source of the data,” Brustad says of the catalog, which she uses to find data to measure the impact of seller training programs and understand changes in seller behavior. “It gives us a 20 to 25 percent gain on the turnaround time.”

The catalog’s connection to Azure Data Lake facilitates the asset upload process for Rodriguez, because her team’s assets are already in Azure SQL Server and Azure Data Lake. The data catalog also improves the experience for consumers of her team’s data.

“It was appealing to have a centralized data catalog that helps customers know what data we have,” Rodriguez says. “Now, we can invite them to go to the data catalog and check out our team’s assets.” The catalog offers visibility not just into the technical information of the assets, but also into key governance metadata, such as compliance adherence and data quality measurements.

The Modern Catalog Team is committed to continuously learning from employees by collecting feedback through telemetry, email, and a feedback button in the modern catalog. This came in handy when Rodriguez couldn’t add her team’s distribution list as an asset owner. She reached out to the team via email, and they provided her an immediate workaround and added the feature to their backlog. Rodriguez has proposed additional features for future iterations of the data catalog, such as a guide for naming and tagging assets and supporting data quality.

The team also collects telemetry data to identify errors in data access and pairs those findings with user interviews to understand what users were trying to do. These ongoing conversations inform future iterations of the catalog.

“It’s an intuitive platform, and the team is always available for feedback,” Rodriguez says. “This makes the process easier.”

Whether it’s used to share data or understand behavior, the modern data catalog is an invaluable tool for employees.

“As someone who works in the data science field, I’m comfortable with going to the data catalog to procure data because I know that the data has been vetted,” Brustad says.
