Abstract

Name disambiguation is an important challenge in data cleaning. In this paper, we focus on the case in which multiple real-world objects (e.g., authors, actors) in a dataset share the same name. We show that Web corpora can be exploited to significantly improve the accuracy (i.e., precision and recall) of name disambiguation. We introduce a novel approach called WebNaD (Web-based Name Disambiguation) that measures and uses the Web connection between different appearances of the same name in the local dataset. Our empirical study, conducted in the context of Libra, an academic search engine that indexes 1 million papers, shows the effectiveness of our approach.