Data Cleaning

Established: July 1, 2002

Poor data quality is a well-known problem in data warehouses that arises for a variety of reasons such as data entry errors and differences in data representation among data sources. For example, one source may use abbreviated state names while another source may use fully expanded state names. However, high quality data is essential for accurate data analysis. Data cleaning is the process of detecting and correcting errors and inconsistencies in data.

Goal

Typical data cleaning tasks include record matching, deduplication, and column segmentation which often need logic that go beyond using traditional relational queries. This has led to development of utilities for data transformation and cleaning. Such software falls into two broad categories. The first category consists of verticals that provide data cleaning functionality for specific domains, e.g., addresses. By design, these are not generic and hence cannot be applied to other domains. The other category of software is that of ETL tools such as Microsoft SQL Server Integration Services (SSIS) that can be characterized as “horizontal” platforms that are applicable across a variety of domains. These platforms provide a suite of additional data cleaning operators but also including relational operators such as select, project and equi-join.

In the Data Cleaning project, our goal is to define a repertoire of “built-in” operators beyond traditional relational operators with a few core data cleaning operators such that with very less extra code, we can obtain a rich variety of data cleaning solutions. We also investigate their efficient implementation on horizontal ETL engines (such as. SQL Server Integration Services). While for wide usage it is desirable to have generic, domain-neutral operators, it is also important that the operators allow domain specific customizations. Examples of such customizations include plugging in synonyms (e.g., Robert and Bob while dealing with people names), domain-specific tokenization, and token weighting.

The operators that we have identified and implemented so far include

Fuzzy Lookup (FL) that is used as one of the core operators to perform record matching. The matching can be customized by configuring the tokenization, token weighting and providing transformation rules. We address efficiency by designing a similarity index.
Fuzzy Grouping that is used for deduplication. Fuzzy Grouping clusters the pairwise matches produced by Fuzzy Lookup.
Column segmentation that segments input strings based on specification of regular expressions that can also reference tables in a database.

One of the contributions of the project is a Data Profiling tool that is used to analyze the data quality through a variety of statistics ranging from simple statistics such as the number of null values in a column to more complex statistics such as the strength of keys, foreign keys and functional dependencies.

Impact

Fuzzy Lookup and Fuzzy grouping are currently shipping as part of Microsoft SQL Server Integration Services (SSIS)
Fuzzy Lookup is used for geocoding incoming queries in Bing Maps.
Column Segmentation and Fuzzy Grouping are used for de-duplication of product names and descriptions at the back-end of Bing Shopping.
Fuzzy Lookup is used in Microsoft’s internal master data management project that maintains information about Microsoft’s customers to match new customers with existing customers.
The Data Profiling technology ships as part of Microsoft SQL Server Integration Services (SSIS). A part of the technology that discovers foreign keys in a database ships in Microsoft PowerPivot for Excel as part of the SQL Server 2008 R2 release.
We have released a Fuzzy Lookup Add-in for Microsoft Excel 2010 via Microsoft Business Intelligence (BI) Labs that allows users to invoke our approximate matching technology on spreadsheet data in Excel.

If you have questions about this project, please contact the Data Cleaning research team (dcrt@microsoft.com).

People

Arnd Christian König

Principal Researcher

Learn more

Raghav Kaushik

Senior Researcher

Learn more

Surajit Chaudhuri

Technical Fellow, Data Platforms and Analytics

Learn more

Vivek Narasayya

Partner Research Manager

Learn more