Web Scale Taxonomy Cleansing

  • Zhongyuan Wang ,
  • Haixun Wang

VLDB |

Large ontologies and taxonomies are automatically harvested from web-scale data. These taxonomies tend to be huge, noisy, and contains little context. As a result, cleansing and enriching those large-scale taxonomies becomes a great challenge. A natural way to enrich a taxonomy is to map the taxonomy to existing datasets that contain rich information. In this paper, we study the problem of matching two web scale taxonomies. Besides the scale of the problem, we address the challenge that the taxonomies may not contain enough context (such as attribute values). As existing entity resolution techniques are based directly or indirectly on attribute values as context, we must explore external evidence for entity resolution. Specifically, we explore positive and negative evidence in external data sources such as the web and in other taxonomies. To integrate positive and negative evidence, we formulate the entity resolution problem as a problem of finding optimal multi-way cuts in a graph. We analyze the complexity of the problem, and propose a Monte Carlo algorithm for finding greedy cuts. We conduct extensive experiments and compare our approach with three existing methods to demonstrate the advantage of our approach.