Microsoft opens up online infrastructure to the research community
As more of our lives are conducted online, the volume of data recording our daily activities has exploded. These records, which range from the details of our online social engagements to the graphs that major search engine companies gather as a snapshot of our collective curiosity and knowledge, have largely remained in the hands of the corporations that collect them, out of reach of the research community.
A recent research challenge changed all that. The WSDM Cup, a competition run by Microsoft Research in partnership with the 9th ACM Conference on Web Search and Data Mining (WSDM), gave researchers open access to the data in Microsoft’s academic graph to assess the query-independent importance of scholarly articles in the graph.
“The WSDM Cup is the first time a major commercial search engine, Microsoft Academic powered by Bing, has opened its data to the academic community for research,” said Kuansan Wang, Director of Microsoft Research’s Internet Services Research Center. The graph, a continuously growing collection of millions of pieces of information about scientific publications, authors, institutions, journals, conferences, and fields of study, is the largest such graph in existence.
“We are also opening the graph’s back-end Academic Knowledge API,” said Wang. The service-based API enables researchers to access fresh data from the web crawled by an industrial-grade search engine. “The community can build on top of our baseline system and test innovative ideas.”
The result? Another first: A graph of web data about researchers, built on and contributed to by both internal and external researchers.
The goal of the WSDM Cup was to provide the best static rank values for each publication in the Microsoft Academic Graph. The challenge attracted 80 teams from 34 institutions across 13 countries to compete fiercely over a two-month period.
In a sign of technological advancement and a milestone for computer science research, more than half of the final submissions outperformed the seminal PageRank algorithm, which was popularized by the web search giant Google and is still used by many publishers and independent consulting firms to assess the impact of scholarly research.
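For readers unfamiliar with the baseline, the sketch below shows roughly what a PageRank-style static rank looks like on a citation graph. It is a minimal illustration under stated assumptions, not the system used by Microsoft or by any competing team: the function name `pagerank`, the toy graph, and the parameter values are all hypothetical. Note that it looks only at citation links between papers and ignores authors, venues, and institutions.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Score papers by citation links alone.

    `citations` maps each paper to the list of papers it cites.
    The damping factor and iteration count are illustrative defaults.
    """
    # Collect every paper that appears as a citer or as a cited work.
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    n = len(papers)

    # Start with the score spread evenly over all papers.
    rank = {p: 1.0 / n for p in papers}

    for _ in range(iterations):
        # Papers that cite nothing share their score evenly with everyone.
        dangling = sum(rank[p] for p in papers if not citations.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in papers
        }
        # Each paper passes its score, in equal parts, to the papers it cites.
        for citing, cited in citations.items():
            for target in cited:
                new_rank[target] += damping * rank[citing] / len(cited)
        rank = new_rank
    return rank


# Hypothetical toy graph: A and B cite C, C cites D, D cites nothing.
toy_graph = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
scores = pagerank(toy_graph)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the score depends only on the graph and not on any search query, it can be computed once and stored with each publication, which is what makes it a static rank.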
“The most commonly used measures of importance and impact in scholarship, such as citation counts, Journal Impact Factor, and h-index, are one-dimensional, looking solely at the citations between publications,” said workshop co-chair Alex Wade, Director of Scholarly Communications at Microsoft Research. “But as we look into the richer and more varied relationships between the people, places, and things that make up the scholarly record, new opportunities for ranking and evaluation emerge. A key goal of this challenge was to test whether that heterogeneity can actually lead to improved ranking solutions.”
The top eight teams from the cup’s first phase were invited to participate in Phase 2 of the challenge and to present their approaches at a WSDM conference workshop. During Phase 2, the top eight datasets were used to power the ranker used by Bing for academic queries.
The WSDM Cup Entity Ranking Challenge Workshop brought together researchers in the areas of data mining and large heterogeneous networks. Organized by Northeastern University, Elsevier, and Microsoft Research, the workshop was held this week in San Francisco, showcasing the efforts of the leading teams in the 2016 WSDM Cup.
“People are really excited that Microsoft was willing to share this data with the research community,” said Jevin West, Assistant Professor at the University of Washington, workshop attendee, and a member of the second-place Eigenfactor team. “We’re really grateful Microsoft engaged with the research community on this important problem of finding better ways of searching scientific content. I hope we can continue seeing these collaborations between industry and the research community.”
New challenges and opportunities
The WSDM Cup’s online experiments allowed participants to see how their algorithms performed in front of real users in addition to the traditional static evaluation found in similar competitions. The live competition generated considerable excitement among the internet research crowd—and conference organizers agreed.
Based on the widespread popularity of the WSDM Cup challenge, Microsoft has been invited to partner in two similar competitions this year. Microsoft will run the 2016 KDD Cup in partnership with the newly established Big Scholarly Data Institute in Tsinghua, Beijing. And to strengthen the research community’s understanding and use of online evaluations as a key part of modern information retrieval, Microsoft has partnered with TREC, overseen by the National Institute of Standards and Technology, to support a new OpenSearch track that will be run for the first time at TREC 2016.
“This is the beginning of a whole new era in data access and analysis that will benefit the research community for many years to come,” said Wade.
—Christine Clifton-Thornton, Senior Writer, Microsoft Research