Goal
Typical data cleaning tasks include record matching, deduplication, and column segmentation which often need logic that go beyond using traditional relational queries. This has led to development of utilities for data transformation and cleaning. Such software falls into two broad categories. The first category consists of verticals that provide data cleaning functionality for specific domains, e.g., addresses. By design, these are not generic and hence cannot be applied to other domains. The other category of software is that of ETL tools such as Microsoft SQL Server Integration Services (SSIS) that can be characterized as “horizontal” platforms that are applicable across a variety of domains. These platforms provide a suite of additional data cleaning operators but also including relational operators such as select, project and equi-join.
In the Data Cleaning project, our goal is to define a repertoire of “built-in” operators beyond traditional relational operators with a few core data cleaning operators such that with very less extra code, we can obtain a rich variety of data cleaning solutions. We also investigate their efficient implementation on horizontal ETL engines (such as. SQL Server Integration Services). While for wide usage it is desirable to have generic, domain-neutral operators, it is also important that the operators allow domain specific customizations. Examples of such customizations include plugging in synonyms (e.g., Robert and Bob while dealing with people names), domain-specific tokenization, and token weighting.
The operators that we have identified and implemented so far include
- Fuzzy Lookup (FL) that is used as one of the core operators to perform record matching. The matching can be customized by configuring the tokenization, token weighting and providing transformation rules. We address efficiency by designing a similarity index.
- Fuzzy Grouping that is used for deduplication. Fuzzy Grouping clusters the pairwise matches produced by Fuzzy Lookup.
- Column segmentation that segments input strings based on specification of regular expressions that can also reference tables in a database.
One of the contributions of the project is a Data Profiling tool that is used to analyze the data quality through a variety of statistics ranging from simple statistics such as the number of null values in a column to more complex statistics such as the strength of keys, foreign keys and functional dependencies.
Impact
- Fuzzy Lookup and Fuzzy grouping are currently shipping as part of Microsoft SQL Server Integration Services (SSIS)
- Fuzzy Lookup is used for geocoding incoming queries in Bing Maps.
- Column Segmentation and Fuzzy Grouping are used for de-duplication of product names and descriptions at the back-end of Bing Shopping.
- Fuzzy Lookup is used in Microsoft’s internal master data management project that maintains information about Microsoft’s customers to match new customers with existing customers.
- The Data Profiling technology ships as part of Microsoft SQL Server Integration Services (SSIS). A part of the technology that discovers foreign keys in a database ships in Microsoft PowerPivot for Excel as part of the SQL Server 2008 R2 release.
- We have released a Fuzzy Lookup Add-in for Microsoft Excel 2010 via Microsoft Business Intelligence (BI) Labs that allows users to invoke our approximate matching technology on spreadsheet data in Excel.
If you have questions about this project, please contact the Data Cleaning research team (dcrt@microsoft.com).