Combining Link and Content Information in Web Search

  • Matthew Richardson,
  • Pedro Domingos

Publication

As the World Wide Web has grown, search engines have become the preferred method for finding information on the Web; it is now almost impossible to find specific information without them. This has given rise to a problem: how can we automatically determine the quality and relevance of a Web page to a particular query? The original search engines used the content of a page to determine its relevance, but it was later found that results could be greatly improved by also incorporating information gleaned from the link structure. In this chapter, we describe two of the best-known algorithms that do this, HITS [17] and PageRank [18], and survey some of their improvements. We then introduce our algorithm, Query-Dependent PageRank, which maintains query-time efficiency while alleviating the problem of topic drift. Experiments on two large subsets of the Web indicate that our algorithm significantly outperforms PageRank in the (human-rated) quality of the pages returned, while remaining efficient enough to be used in today’s large search engines. After presenting these results and a discussion of scalability, we leave the reader with some open questions and possible directions for future work.