Example Driven Design of Efficient Record Matching Queries

Surajit Chaudhuri; Bee Chung Chen; Venky Ganti; Raghav Kaushik

VLDB | January 2007

Published by Very Large Data Bases Endowment Inc.

Record matching is the task of identifying records that refer to the same real-world entity. It is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distance) and on external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario. Specifically, the number of options, both in the string matching operations and in the choice of external sources, can be daunting. In this paper, we exploit the availability of positive and negative examples to search through this space and suggest an initial record matching query. Such queries can subsequently be modified by the programmer as needed. We ensure that the record matching queries our approach produces are (1) efficient: they can be run on large datasets by leveraging operations that are well supported by RDBMSs, and (2) explainable: they are easy to understand, so the programmer can modify them with relative ease. We demonstrate the effectiveness of our approach on several real-world datasets.
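To make the idea of an example-driven matching query concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes hypothetical record fields `name` and `zip`, combines one exact predicate with one edit-distance predicate, and picks the similarity threshold that classifies the most labeled example pairs correctly. All function names here are illustrative.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]
Rule = Callable[[Record, Record], bool]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def make_rule(threshold: float) -> Rule:
    """Hypothetical rule: exact match on zip AND approximate match on name."""
    def rule(r: Record, s: Record) -> bool:
        same_zip = r["zip"] == s["zip"]
        close_name = edit_similarity(r["name"].lower(), s["name"].lower()) >= threshold
        return same_zip and close_name
    return rule


def score(rule: Rule, positives: List[Pair], negatives: List[Pair]) -> int:
    """Number of example pairs the rule classifies correctly."""
    matched = sum(1 for r, s in positives if rule(r, s))
    rejected = sum(1 for r, s in negatives if not rule(r, s))
    return matched + rejected


def best_threshold(positives: List[Pair], negatives: List[Pair],
                   candidates: Sequence[float]) -> float:
    """Candidate threshold whose rule agrees with the most examples."""
    return max(candidates, key=lambda t: score(make_rule(t), positives, negatives))
```

A rule of this shape stays explainable because it is a conjunction of two readable predicates. On large tables the approximate predicate would normally be evaluated with a similarity join inside the database rather than a Python loop over all pairs.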
