Scaling up Extraction Over Entities (and Relations)

February 18, 2014
Ganesh Ramakrishnan | IIT Bombay

Entity-relationship search at Web scale, or even at the enterprise level, depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog must support thousands of entity types and tens to hundreds of millions of entities. These targets raise many challenges, the major ones being (i) fast and effective entity extractors and disambiguators; (ii) the design of highly compressed data structures in RAM for spotting and disambiguating entity mentions, together with highly compressed disk-based annotation indices; and (iii) the use of annotations and efficient indices for effective and efficient entity-oriented search.

After a brief introduction to our prior work on entity annotation, disambiguation, and entity-based search, we will focus on specific approaches we explored for scaling them up. In particular, we present two of these approaches:

The translation of rule-based annotation into operations on the inverted index, which achieves an order-of-magnitude speedup over the standard document-at-a-time rule-based annotation paradigm (EMNLP 2006, ICDE 2008, Infoscale 2008, CIKM 2008).
The design of RAM data structures for spotting and disambiguating entity mentions (WWW 2012), and of highly compressed disk-based annotation indices (WWW 2011). These data structures cannot be readily built upon standard inverted indices. We present a Web-scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15 GB of RAM and spend about 0.6 core-milliseconds per disambiguation. We also show how the disk-based annotation index enables entity-centric, snippet-oriented search (WWW 2011).
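
To make the first approach concrete, the following is a minimal sketch of what evaluating an annotation rule against an inverted index, rather than scanning each document in turn, might look like. It is a toy illustration and not the method from the cited papers: the function names (`build_positional_index`, `annotate_adjacent`), the whitespace tokenizer, the sample documents, and the rule itself (a title token immediately followed by a surname token) are all assumptions made for this example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each token to {doc_id: list of token positions} (hypothetical helper)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Whitespace tokenization is an assumption made to keep the sketch short.
        for pos, token in enumerate(text.split()):
            index[token][doc_id].append(pos)
    return index


def annotate_adjacent(index, first_terms, second_terms):
    """Evaluate the rule "a token from first_terms immediately followed by a
    token from second_terms" using postings lists only.

    Returns a sorted list of (doc_id, start_pos, end_pos) spans.
    """
    # Union the postings of all second-slot terms: doc_id -> set of positions.
    second_positions = defaultdict(set)
    for term in second_terms:
        for doc_id, positions in index.get(term, {}).items():
            second_positions[doc_id].update(positions)

    spans = set()
    for term in first_terms:
        for doc_id, positions in index.get(term, {}).items():
            followers = second_positions.get(doc_id)
            if not followers:
                # No second-slot term occurs in this document: skip it entirely.
                continue
            for pos in positions:
                if pos + 1 in followers:
                    spans.add((doc_id, pos, pos + 1))
    return sorted(spans)


# Illustrative documents, invented for this sketch.
docs = {
    1: "Dr. Smith met Prof. Jones",
    2: "Smith visited IIT Bombay",
    3: "Prof. Smith and Dr. Jones",
}
index = build_positional_index(docs)
spans = annotate_adjacent(index, {"Dr.", "Prof."}, {"Smith", "Jones"})
```

In this sketch, document 2 is never touched by the rule because it has no postings for either title token. That is the intuition behind the speedup: the work is proportional to the postings of the rule's terms, not to the size of the collection.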