The Scalable Hyperlink Store
This talk describes the Scalable Hyperlink Store, a specialized database that gives very fast access to the forward and backward links of very large web graphs. SHS has been designed to scale to the size of the current MSN Search corpus (about 5 billion crawled web pages and 250 billion hyperlinks) and to provide link access times in the microsecond range. I am currently exploring cost-efficient fault-tolerance schemes and ways to support incremental updates to the database.
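To make "forward and backward links" concrete, here is a minimal in-memory sketch of the kind of interface such a store might expose. The class name `TinyLinkStore` and its methods are hypothetical illustrations, not the SHS API, and a Python dictionary comes nowhere near the scale or access times described above.

```python
from collections import defaultdict


class TinyLinkStore:
    """Toy in-memory link store (hypothetical; not the real SHS interface)."""

    def __init__(self):
        # Two indexes over the same edge set, one per direction.
        self._fwd = defaultdict(set)  # page -> pages it links to
        self._bwd = defaultdict(set)  # page -> pages that link to it

    def add_link(self, src, dst):
        """Record a hyperlink from src to dst in both indexes."""
        self._fwd[src].add(dst)
        self._bwd[dst].add(src)

    def forward_links(self, url):
        """Return the pages that url links to, in sorted order."""
        return sorted(self._fwd.get(url, ()))

    def backward_links(self, url):
        """Return the pages that link to url, in sorted order."""
        return sorted(self._bwd.get(url, ()))


# Made-up example URLs.
store = TinyLinkStore()
store.add_link("a.example/1", "b.example/1")
store.add_link("a.example/1", "c.example/1")
store.add_link("c.example/1", "b.example/1")

out_links = store.forward_links("a.example/1")
in_links = store.backward_links("b.example/1")
```

Keeping both directions indexed doubles the storage but makes backward-link lookups as cheap as forward ones, which is what link-analysis algorithms typically need.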
SHS provides infrastructure for conducting research on properties of the web graph, and could be a useful tool for MSN Search. Most interestingly, it could enable a class of search result ranking algorithms known as query-dependent link-based ranking, which have been widely studied in the scientific literature but have not been deployed by major search engines. Our plans for the summer are to implement a variety of such algorithms on top of SHS and to measure their performance and effectiveness.
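As one illustration of query-dependent link-based ranking, the sketch below runs Kleinberg's HITS algorithm on a tiny hand-built link graph. HITS is chosen here only as a well-known example of the class; the abstract does not say which algorithms will be implemented. The function name `hits`, the page names, and the edge list are assumptions made for illustration. In practice the edges would come from the neighborhood of a query's result set, which is where fast forward and backward link lookups matter.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Compute HITS hub and authority scores for a list of (src, dst) links."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to the page.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub score: sum of (updated) authority scores of pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


# Hypothetical neighborhood graph: three pages link to x, one also links to y.
edges = [("p", "x"), ("q", "x"), ("r", "x"), ("p", "y")]
hub, auth = hits(edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph `x` ends up as the strongest authority because it has the most in-links, and `p` as the strongest hub because it links to both authorities.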
Marc Najork is a senior member of the research staff at Compaq Computer Corporation’s Systems Research Center. His current research focuses on high-performance web crawling and web characterization. He was a principal contributor to Mercator, the web crawler used by AltaVista. In the past, he has worked on 3D animation, information visualization, algorithm animation, visual programming languages, and tools for web surfing. He received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 1994.
- Marc Najork