Web Scale Entity Resolution using Relational Evidence
- Haixun Wang
MSR-TR-2011-30 |
Entity resolution has been extensively studied. Many approaches have been proposed, including using machine learning techniques to derive domain-specific lexical similarity measures, or rank entities’ attributes by their discriminative power, etc. In this paper, we study the problem in the setting of matching two web scale taxonomies. Besides the scale, we address the challenge that the taxonomies may not contain enough context (such as attributes) for entity resolution, and traditional lexical similarity measures result in many false positive matches. To tackle this new task, we explore negative evidence in the structure of the taxonomy, as well as in external data sources such as the web. To integrate positive and negative evidence, we formulate the entity resolution problem as a problem of finding optimal multi-way cuts in a graph. We analyze the complexity of the problem, and propose a Monte Carlo algorithm for finding greedy cuts. We conduct extensive experiments that demonstrate the advantage of our approach.