IMS-Microsoft Research Workshop: Foundations of Data Science – The small clustering problem: When the cluster sizes don’t grow with the data

Information about social entities is often spread across multiple large databases, each degraded by noise, and without unique identifiers shared across databases. Record linkage – reconstructing the actual entities and their attributes – is essential to using big data and is challenging not only for inference but also for computation. Record linkage can be view as a clustering problem, however, many popular generative models for clustering assume that the number of data points in each cluster grows linearly with the number of data points. Examples include finite mixture models (with fixed or random mixing proportions), Dirichlet process mixture models, and Pitman – Yor mixture models. In record linkage, where the database entries corresponding to a single latent individual may be thought of as a cluster, we expect few data points per cluster in general. Moreover, we do not expect the cluster sizes to grow substantially as the dataset size grows. We call this the small clustering assumption and propose a new generative model that, unlike the models listed above, does not violate it. We compare our method with other recently proposed methods in the literature in an application to survey data.

Rebecca C. Steorts
Duke University, Department of Statistical Science