Microsoft Research Blog

Microsoft and Tsinghua University Work Together on Open Academic Data Research

March 22, 2018 | By Microsoft blog editor

In a recent collaboration, Microsoft and China’s Tsinghua University released an academic graph named Open Academic Graph (OAG). This billion-scale academic graph integrates the current Microsoft Academic Graph (MAG) and Tsinghua’s AMiner academic graph. Specifically, it contains metadata for 155 million academic papers from AMiner and 166 million papers from MAG. By consolidating the metadata from both sources, it generates nearly 65 million matching relationships between the two academic graphs [1].

Picture 1: Connections between Tsinghua University AMiner and Microsoft Academic Graph

Constructing the billion-scale OAG is challenging because academic data is distributed heterogeneously across the different academic graphs, because homonyms and synonyms are common, and because data matching must be accurate. Some examples:

  • Heterogeneous data. Because the data comes from different sources, the same entity can be recorded in different ways. For example, an author’s name may appear in different formats, such as Quoc le and Le, Quoc, and a journal or conference may be listed under either its full name or an abbreviation.
  • Name disambiguation. The same name can represent multiple entities. For example, in China, a single common name may be shared by more than 200,000 people. Similarly, one topic, such as “data,” may correspond to multiple articles.
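To make the heterogeneous-format problem concrete, here is a minimal Python sketch of name normalization. It is an illustration only, not the method used to build OAG: the function name `normalize_author_name` is hypothetical, and the rule it applies (reorder "Family, Given" and lowercase) is an assumption that covers just the two formats mentioned above.

```python
import re


def normalize_author_name(raw: str) -> str:
    """Map differently formatted spellings of a name to one comparison key.

    Hypothetical helper: handles only "Given Family" and "Family, Given".
    """
    name = raw.strip()
    if "," in name:
        # "Le, Quoc" -> "Quoc Le": move the part before the comma to the end.
        family, given = name.split(",", 1)
        name = f"{given} {family}"
    # Collapse runs of whitespace and ignore letter case.
    return re.sub(r"\s+", " ", name).strip().lower()
```

With this rule, "Quoc le" and "Le, Quoc" both map to the same key. A rule like this addresses formatting differences only. It does nothing for the disambiguation problem, where two different people legitimately share one name.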

In addition, efficient computation is key to integrating data at the billion scale. AMiner contains 155 million published papers, while MAG’s public data covers 166 million papers. Comparing every record in one graph against every record in the other is generally O(N²), which requires a great deal of computation. We designed a compromise approach that uses a hashing algorithm to improve efficiency. This approach completed matching over approximately 300 million papers automatically, while still ensuring high matching accuracy.
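The post does not say which hashing scheme was used, so the following Python sketch only illustrates the general idea of replacing pairwise comparison with hash lookups. Everything specific in it is an assumption: the record fields (`id`, `title`, `year`), the function names, and the choice of a normalized title plus year as the hash key.

```python
import re
from collections import defaultdict


def title_key(title: str) -> str:
    """Lowercase a title and reduce punctuation and spacing to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def match_papers(aminer_papers, mag_papers):
    """Return (aminer_id, mag_id) pairs whose normalized title and year agree.

    Assumed record shape: dicts with "id", "title", and "year" keys.
    """
    # One pass over MAG builds a hash index from key to paper ids.
    index = defaultdict(list)
    for paper in mag_papers:
        index[(title_key(paper["title"]), paper["year"])].append(paper["id"])

    # One pass over AMiner probes the index instead of scanning all of MAG.
    matches = []
    for paper in aminer_papers:
        key = (title_key(paper["title"]), paper["year"])
        for mag_id in index.get(key, []):
            matches.append((paper["id"], mag_id))
    return matches
```

Building and probing the index each take roughly one pass over the data, so the work grows about linearly with the number of papers rather than quadratically. The trade-off is that exact-key hashing misses records whose titles differ by more than punctuation or case, which is one reason such an approach is a compromise between speed and recall.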

OAG is an important project of the Open Academic Society (OAS), a consortium of 20 global institutions (including Microsoft, Tsinghua, the Allen Institute for Artificial Intelligence, the University of Arizona, the University of Washington, the University of California, Los Angeles, and the Australian National University) that promotes the open sharing of academic data and strengthens academic exchange and cooperation. OAG aims to integrate the world’s academic knowledge graphs, publicly share academic graph data, and provide relevant academic search and mining services. Specifically, OAS activities include:

  • The integration of rich academic knowledge data. At present, the core data of OAG comes from MAG and AMiner. The next step will be to integrate additional academic data, including the semantic data of different types of entities such as authors and papers. Data integration and data mining algorithms will link more entities to more accurate and richer data, including metadata, concept networks, research fields, full text, and author biographical information.
  • Data sharing. By sharing different academic knowledge graphs and the links between them, we hope to benefit academic research in areas such as knowledge graphs, scholar collaboration relationships, and academic topic mining.
  • Service sharing. We want to design more intelligent systems for connecting academic graphs and provide relevant services (such as APIs) to encourage more people to use the services and join open academic communities.

In another collaboration, this time between Microsoft Academic, Tsinghua, and the Documentation and Information Center of the Chinese Academy of Sciences, more than 1,000 students in 400 teams participated in the “Open Academic Precision Portrait Competition.” Students from Peking University, the University of Science and Technology of China, and Harbin Institute of Technology took home the top prizes.

Picture 2: Poster and Awarding Ceremony of “Open Academic Precision Portrait Competition”

One last collaboration of note is a contract between Microsoft and Professor Jie Tang of Tsinghua University, the founder of AMiner, under which AMiner [2] will be deployed on Azure. This, together with OAG, will further enlarge the impact of Azure on academic research.

Picture 3: Global coverage of AMiner

Future collaboration between Microsoft and Tsinghua will include integrating different types of entities (such as authors and conferences) into a large-scale heterogeneous academic graph, publishing more data that links academic graphs, and designing more intelligent systems for connecting them.

References:
[1] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In WSDM 2018, February 5–9, 2018, Marina Del Rey, CA, USA. ACM. (https://dl.acm.org/citation.cfm?doid=3159652.3159706)

[2] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2008). pp. 990-998.
