Automatic Extraction of Top-k Lists from the Web

Haixun Wang; Hongsong Li

ICDE | January 2013

Published by International Conference on Data Engineering

This paper is concerned with information extraction from top-k web pages, which are web pages that describe the top k instances of a topic of general interest. Examples include “the 10 tallest buildings in the world” and “the 50 hits of 2010 you don’t want to miss”. Compared to other structured information on the web (including web tables), information in top-k lists is larger and richer, of higher quality, and generally more interesting. Top-k lists are therefore highly valuable. For example, they can help enrich open-domain knowledge bases, which support applications such as search and fact answering. In this paper, we present an efficient method that extracts top-k lists from web pages with high performance. Specifically, we extract more than 1.7 million top-k lists from a web corpus of 1.6 billion pages with 92.0% precision and 72.3% recall.