Distributed Entity Resolution for Computational Social Science

Information about social entities is often scattered across multiple databases. Combining that information into one database can bring enormous benefits for analysis, yielding richer and more reliable conclusions. In most practical applications, however, analysts cannot simply link records across databases using unique identifiers, such as social security numbers, either because some databases do not contain them or because they are unavailable due to privacy concerns. In such cases, analysts need methods from statistical and computational science known as entity resolution (also called record linkage or de-duplication) to proceed with analysis. Entity resolution is not only a crucial task for social science and industrial applications, but also a challenging statistical and computational problem in its own right. In this talk, I describe the past and present challenges of entity resolution. More specifically, I will discuss unsupervised Bayesian entity resolution models, which identify duplicate records in the data while quantifying the uncertainty of the entity resolution process. In addition, one can prove tight theoretical bounds on the class of entity resolution models, which support the proposed approach. Finally, I present a distributed extension of this work that scales to millions of records, while crucially incorporating partitions. I will provide results on three real data sets relevant to the computational social sciences, and describe in-progress work in the field of human rights on El Salvador.
Rebecca C. Steorts
Duke University