Learnable Similarity Functions and Their Applications in Information Integration and Clustering

March 27, 2006
Mikhail Bilenko | University of Texas at Austin

Pairwise similarity functions are ubiquitous in data mining and machine learning algorithms. Record linkage, clustering, nearest-neighbor search, and information retrieval are all tasks in which pairwise similarity or distance computations play a central role. Accuracy in these tasks depends critically on how well the similarity function captures the notion of likeness between objects in a given domain. It is therefore desirable to employ similarity functions that can adapt to the domain and task at hand.

We demonstrate the benefits of using learnable similarity functions on two tasks: record linkage and clustering. The goal of record linkage (also known as de-duplication and identity uncertainty) is to identify different database records that describe the same underlying entity. We introduce several learnable string distance functions based on probabilistic models, as well as an adaptive framework for combining them, both of which lead to significant accuracy improvements. The other task we consider is semi-supervised clustering, where we present a probabilistic clustering framework based on Hidden Markov Random Fields that incorporates learnable similarity functions. Finally, we describe how learning similarity functions allows efficient scaling of record linkage and clustering methods to large datasets.

Speaker Details

Mikhail Bilenko is a Ph.D. candidate in the Department of Computer Sciences at the University of Texas at Austin. His research interests are in machine learning and data mining. His recent work has won the Best Research Paper award from ACM SIGKDD, and a patent has been filed at Google based on his summer research.