Entity Linking at the Tail: Sparse Signals, Unknown Entities and Phrase Models

WSDM '14 Proceedings of the 7th ACM international conference on Web search and data mining

Published by ACM

Web search is seeing a paradigm shift from keyword-based search to an entity-centric organization of web data. To support web search with this deeper level of understanding, a web-scale entity linking system must include feature extraction that is robust to the diversity of web documents and their varied writing styles and content structure; maintain high-precision linking for “tail” (unpopular) entities that is robust both to confounding entities outside the knowledge base and to entity profiles with minimal information; and represent large-scale knowledge bases with a scalable and powerful feature representation. We have built and deployed a web-scale unsupervised entity linking system for a commercial search engine that addresses these requirements by combining new developments in sparse signal recovery to identify the most discriminative features from noisy, free-text web documents; explicit modeling of out-of-knowledge-base entities to improve precision at the tail; and a new phrase-unigram language model that efficiently captures high-order dependencies in lexical features. Using a knowledge base of 100M unique people from a popular social networking site, we present experimental results in the challenging domain of people-linking at the tail, where most entities have limited web presence. Our experimental results show that this system substantially improves the precision-recall tradeoff over baseline methods, achieving precision over 95% with recall over 60%.
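
To make the idea of explicitly modeling out-of-knowledge-base entities concrete, the following is a minimal sketch and not the system described above. It assumes a plain unigram language model per entity profile, interpolated with an add-one-smoothed background model, and treats the background model alone as a "NIL" candidate: a mention is linked only if some profile explains its context better than the background does. The function names (`unigram_log_prob`, `link`), the interpolation weight `lam`, and the toy data are all hypothetical, and the sketch covers neither the phrase-unigram model nor sparse signal recovery.

```python
import math
from collections import Counter


def unigram_log_prob(tokens, profile_counts, background_counts, lam=0.5):
    """Log-probability of tokens under a profile model mixed with a background model.

    Assumption: linear interpolation with weight `lam` on the profile and
    add-one smoothing on the background, so no token has zero probability.
    """
    profile_total = sum(profile_counts.values())
    background_total = sum(background_counts.values())
    vocab_size = len(background_counts) + 1  # +1 slot for unseen tokens
    score = 0.0
    for tok in tokens:
        p_bg = (background_counts.get(tok, 0) + 1) / (background_total + vocab_size)
        p_prof = profile_counts.get(tok, 0) / profile_total if profile_total else 0.0
        score += math.log(lam * p_prof + (1.0 - lam) * p_bg)
    return score


def link(tokens, profiles, background_counts, lam=0.5):
    """Return the best-scoring entity id, or None if the NIL model wins.

    The NIL candidate is the background model alone (lam = 0), standing in
    for "some entity that is not in the knowledge base".
    """
    nil_score = unigram_log_prob(tokens, Counter(), background_counts, lam=0.0)
    best_id, best_score = None, nil_score
    for entity_id, counts in profiles.items():
        score = unigram_log_prob(tokens, counts, background_counts, lam)
        if score > best_score:  # must strictly beat NIL to be linked
            best_id, best_score = entity_id, score
    return best_id


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    background = Counter({"the": 50, "works": 10, "engineer": 5,
                          "chef": 4, "seattle": 3, "paris": 3})
    profiles = {
        "alice": Counter(["engineer", "seattle", "robotics"]),
        "bob": Counter(["chef", "paris", "pastry"]),
    }
    print(link(["engineer", "seattle"], profiles, background))
    print(link(["the", "works"], profiles, background))
```

In this sketch a context that shares no words with any profile falls back to NIL, because every profile pays the interpolation penalty without gaining any probability mass from its own counts. That is one simple way to trade recall for precision on sparsely described entities; the deployed system may make this decision quite differently.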