This post will be updated as examples and data are added.
Given the circumstances surrounding the COVID-19 pandemic, we would like to give an overview of the services we provide, explain the focus of each, and offer working examples showing how best to use our data to generate insight from coronavirus-related scholarly communications.
We would also like to recognize that we are partnering with the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), and the National Library of Medicine (NLM) at the National Institutes of Health to produce an open research dataset of scholarly literature about COVID-19, SARS-COV-2 and the coronavirus group. Please visit the link below to access the dataset that was released on Semantic Scholar today:
Over the past week our team has worked to update our graph with the most recent publications regarding COVID-19. With the support of Bing, we will double our MAG update frequency as well as publish a side stream linking WHO and PubMed publication IDs to MAG IDs. Side stream source code
For anyone who would like to use the releases of our available services for further investigation, we provide the following summaries and examples:
The Microsoft Academic Graph is a heterogeneous graph of academic data. Through a subscription, the content of the graph is distributed as text files whenever a new build is created. This graph can be used as is or merged with other public or private data sets.
Primary use cases: Data mining, long-running analytic processing for academic analytics or business intelligence.
The Microsoft Academic Knowledge Exploration Service (MAKES) and its predecessor, the Project Academic Knowledge API, were created to meet the need for indexing and rapid data retrieval over the MAG data set.
Primary use cases: Fast, top N entity retrieval, on-line scenarios such as dashboards or search applications.
The Microsoft Academic website is offered to the community as both a research tool and an example of what can be built with MAG data and MAKES. We have made our own modifications, just as consumers of MAG and MAKES may choose to do. In this way, we use it as an experimental platform to test hypotheses and new features for both MAG and MAKES.
Primary use cases: Finding relevant research and analytics, and test-driving an end-to-end solution built on the services we provide.
The Microsoft Academic Graph
Our approach to generating the data
As of the current release, the Microsoft Academic Graph (MAG) contains over 233 million publications and their related academic entities, e.g. authors, publishing venues, associated concepts, etc. Microsoft Research has the benefit of our partnership with Bing to crawl the internet to discover research content from around the world and update our graph regularly. Using the power of cloud computing, we have built a pipeline where Bing data and other sources are filtered and analyzed. Using AI techniques, we disambiguate and conflate entities and apply ranking and a taxonomy to them in the graph, as stated in our recent paper: “The project pushes the boundary of machine cognition technology by deploying software agents trained with natural language understanding capabilities to continuously scavenge the Web for research artifacts and, from them, extract up-to-date academic knowledge into a graph based representation called Microsoft Academic Graph (MAG)”. Optimizing this pipeline has also steadily reduced the time needed to create and validate the graph. Currently, we can produce a new version of MAG every week. For a great overview of how MAG is laid out and built, see our recently published paper.
What is contained in the dataset and what makes it unique?
Web scale data collection
As stated, our Bing partnership brings the knowledge of the entire web into our graph. Combining this with the power of cloud computing allows for rapid iteration over immense quantities of data (tens of billions of raw data points) using intelligent agents to objectively curate the graph.
Advanced entity conflation and disambiguation
Our agents are trained to understand and reason over partial and noisy information from documents in diverse data sources. They recognize and assemble semantic objects in the academic domain (e.g. scholarly publications, authors, affiliations, conferences, journals, and fields of study) into the cohesive and evolving knowledge graph (MAG).
Paper citation networks are well known to be sparse, have human bias, and be “cliquey” as researchers often cite papers by their advisors, friends and peers. It is rare to see inter-disciplinary citations even though researchers in disparate disciplines are often solving the same underlying scientific problem. MAG mitigates sparsity and clique issues in the graph by enriching paper-to-paper links across disciplines via a paper similarity system. This system not only uses the citation graph but also the content of each paper via trained language embeddings, as outlined in this paper. The word embedding and citation-based paper recommendations can be found here, in MAG.
Trained word embeddings are also used to generate embeddings for our fields of study, allowing us to quickly tag papers with relevant concepts based on their content. We provide the ability for users to tag their own text documents with our fields of study, using our trained language embeddings, as part of the Microsoft Academic Language Similarity API. This API is made available to anyone upon request, alongside weekly MAG updates.
Field of study tagging and taxonomy learning
MAG is built and organized using field of study tagging and taxonomy learning, which lets consumers of the graph subdivide the data. This is done through concept discovery, concept-document tagging and concept hierarchy generation. A detailed explanation of this process is provided in our recent paper.
In MAG, the fields of study can be found in this stream and their parent-child relationship can be found in this one. The corresponding UMLS IDs and source URLs are available in this stream.
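As a rough illustration of how the parent-child relationships can be used to subdivide the data, the sketch below collects every field that sits underneath a chosen field. It assumes the relationships have already been loaded as (parent ID, child ID) pairs; the function names and the numeric IDs are hypothetical placeholders, not real MAG identifiers or an official loader.

```python
from collections import defaultdict, deque


def build_children_index(parent_child_pairs):
    """Map each parent field ID to the set of its direct child field IDs."""
    children = defaultdict(set)
    for parent_id, child_id in parent_child_pairs:
        children[parent_id].add(child_id)
    return children


def descendants(children, root_id):
    """Return every field reachable from root_id, excluding root_id itself."""
    seen = set()
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        # .get avoids creating empty entries for leaf fields.
        for child in children.get(current, ()):
            if child not in seen and child != root_id:
                seen.add(child)
                queue.append(child)
    return seen


# Placeholder IDs for illustration only; field 4 has two parents here.
pairs = [(1, 2), (1, 3), (2, 4), (3, 4), (5, 6)]
index = build_children_index(pairs)
print(sorted(descendants(index, 1)))  # prints [2, 3, 4]
```

A breadth-first walk with a visited set is used so that a field listed under more than one parent is only collected once.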
See an example of using fields of study below.
Predictive static ranking: Saliency
MAG computes saliency using reinforcement learning (RL) to assess the importance of each entity in the coming years. As MAG sources content from the Web, saliency plays a critical role in telling good content from poor content. The RL algorithm is programmed to predict future citations. Based on the publication and citation activities surrounding the novel coronavirus, MAG has learned that COVID-19-related articles are most likely to be cited in the coming years. See our recent blog post for more details.
Saliency is available in MAG as the “rank” attribute.
MAG is by nature a heterogeneous graph with different types of entities and relations, in which various structural relations correspond to different semantic similarities. For example, two fields of study can be similar in different senses: they might often be studied together (co-appearing in the same papers or venues), or they might co-occur with all types of entities in the graph. We therefore learn multi-sense network representations for entities in MAG and make the Network Similarity (NS) package publicly available. Using the NS package, we can reveal the fields most similar to “COVID-19” and “SARS-COV-2” under different senses.
See an example of Multi-sense similarities below.
Impact of COVID-19 on the Computer Science Research Community – Our team’s research into the impact COVID-19 may have on conferences, authors and the Computer Science field of study. Source code examples: https://github.com/microsoft/mag-covid19-research-examples/tree/master/src/MAG-Samples/impact-of-covid19-on-the-computer-science-research-community
How I built a list of coronavirus-related research papers using the Microsoft Academic Graph – Medium post by Adam Day outlining the process that was used to generate a list of COVID-19 related papers.
Multi-sense network similarity examples:
We show the most similar fields of study to COVID-19 and SARS-COV-2 under three different senses in MAG, powered by the NS package. Take COVID-19 as an example. Under the “copaper” sense, the fields most similar to it are those also discussed in COVID-19 publications, such as SARS-COV-2, H1N1, Ebola, Nipah, and MERS. Under “covenue”, the most similar fields (e.g., Infectious Disease Epidemiology, Index case (patient zero), and Middle East respiratory syndrome coronavirus) are those also studied in the journals or conferences where COVID-19 publications appear. Finally, the “metapath” sense treats two fields as similar if they co-occur with all other types of entities: papers, venues, affiliations, and fields of study. Under this sense, the most similar fields include Viral phylodynamics, Middle East respiratory syndrome coronavirus, Lassa fever, etc. Overall, the three senses yield different sets of similar fields, each revealing a different perspective on the entity in focus, i.e., COVID-19 in this case.
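To make the idea of "senses" concrete, here is a minimal sketch of ranking neighbours when each sense supplies its own embedding for every field. It is not the NS package's interface, which is not described in this post; the use of cosine similarity, the function names, and the toy two-dimensional vectors are all assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(embeddings, target, top_n=3):
    """Rank all other entities by cosine similarity to the target entity."""
    target_vec = embeddings[target]
    scored = [
        (name, cosine(target_vec, vec))
        for name, vec in embeddings.items()
        if name != target
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Toy vectors only: one embedding table per sense, with hypothetical field names.
senses = {
    "copaper": {"target": [1.0, 0.0], "field_a": [0.9, 0.1], "field_b": [0.0, 1.0]},
    "covenue": {"target": [1.0, 0.0], "field_a": [0.1, 0.9], "field_b": [0.8, 0.2]},
}

for sense, embeddings in senses.items():
    nearest = most_similar(embeddings, "target", top_n=1)
    print(sense, [name for name, _ in nearest])
```

With these toy vectors the nearest neighbour differs between the two senses, which mirrors the observation above that each sense surfaces a different set of similar fields.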
Fields of study stamping examples:
Microsoft Academic provides highly related papers as recommended reading to this paper even though this paper currently has no citations. The recommendations therefore rely only on trained word embeddings and the content of the paper. Notice the ability of the system to pick up relevant papers based on the broad concept of “climate and how it affects transmission of viruses”. Word embeddings allow the system to relate terms like “humidity”, “climate”, “tropical”, and “weather” together as well as “influenza”, “coronavirus”, and “virus”.
In the U-SQL sample code below, we show how Fields of Study are used along with publication title and abstract term matching to find papers about COVID-19.
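The selection logic can also be sketched in a few lines of Python. This is not the U-SQL sample itself, only an illustration of the same idea: a paper is kept if it carries one of the coronavirus-related fields of study, or if its title or abstract contains one of the keywords. The field names and keywords are taken from the MAKES query later in this post; the `Paper` class, the sample records, and the simple substring matching are assumptions.

```python
from dataclasses import dataclass, field

# Field of study names and keywords mirror the MAKES query in this post.
COVID_FIELDS = {
    "coronavirus disease 2019",
    "severe acute respiratory syndrome coronavirus 2",
    "betacoronavirus",
    "coronaviridae",
    "coronavirus",
}
COVID_TERMS = ("coronavirus", "coronaviridae", "betacoronavirus")


@dataclass
class Paper:
    """Hypothetical in-memory record; not the MAG schema."""
    paper_id: int
    title: str
    abstract: str = ""
    fields_of_study: set = field(default_factory=set)


def is_covid_related(paper):
    """True if a field of study matches, or a keyword occurs in title/abstract."""
    if {name.lower() for name in paper.fields_of_study} & COVID_FIELDS:
        return True
    text = f"{paper.title} {paper.abstract}".lower()
    # Substring matching is a simplification of real term matching.
    return any(term in text for term in COVID_TERMS)


# Placeholder records for illustration only.
papers = [
    Paper(1, "A placeholder title", fields_of_study={"Coronavirus"}),
    Paper(2, "Notes on a novel coronavirus outbreak"),
    Paper(3, "An unrelated placeholder title", "No matching terms here.", {"graph theory"}),
]

print([p.paper_id for p in papers if is_covid_related(p)])  # prints [1, 2]
```

Combining the two signals matters because a newly published paper may not yet be tagged with a field of study, while an older tagged paper may never mention the keyword in its title.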
Microsoft Academic Knowledge Exploration Service (MAKES) / Project Academic API
MAKES was created in response to our customers’ requests for a non-rate-limited version of our Project Academic Knowledge API. In its basic form, MAKES is a self-hosted REST API that leverages an index of all the entities in the Microsoft Academic Graph (MAG). When you subscribe to MAKES, the required components are delivered to your Azure subscription whenever a new version of MAG is created, typically once every 1 to 2 weeks. A provided tool can then be run to automatically provision MAKES instances in your Azure account.
MAKES Documentation – Documentation for self-hosting a MAKES API
Project Academic Knowledge Documentation – Documentation for the Microsoft hosted API
Querying MAKES / Project Academic API to retrieve papers about coronavirus – The following query can be given to MAKES / Project Academic API to produce a list of papers associated with coronavirus. This selects publications based on the field of study group (coronavirus), family (coronaviridae), genus (betacoronavirus), species (SARS-COV-2/COVID-19) or title/abstract matches on those keywords.
Or(Composite(F.FN=='coronavirus disease 2019'), Composite(F.FN=='severe acute respiratory syndrome coronavirus 2'), Composite(F.FN=='betacoronavirus'), Composite(F.FN=='coronaviridae'), Composite(F.FN=='coronavirus'), W='coronavirus', AW='coronavirus', W='coronaviridae', AW='coronaviridae', W='betacoronavirus', AW='betacoronavirus')
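For readers who want to assemble this expression programmatically, the following sketch builds the same query string from lists of field names and keywords, using plain ASCII single quotes, and then URL-encodes it. The helper names, the base URL, and the choice of requested attributes are assumptions for illustration; consult the MAKES or Project Academic Knowledge documentation for the exact endpoint and parameters of your deployment.

```python
from urllib.parse import urlencode


def build_expression(field_names, keywords):
    """Combine field-of-study and title/abstract word clauses into one Or(...)."""
    clauses = [f"Composite(F.FN=='{name}')" for name in field_names]
    for word in keywords:
        clauses.append(f"W='{word}'")   # word in title
        clauses.append(f"AW='{word}'")  # word in abstract
    return "Or(" + ", ".join(clauses) + ")"


def build_evaluate_url(base_url, expression, attributes="Id,Ti,Y", count=10):
    """URL-encode the expression for an evaluate-style GET request."""
    params = {"expr": expression, "attributes": attributes, "count": count}
    return f"{base_url}/evaluate?{urlencode(params)}"


fields = [
    "coronavirus disease 2019",
    "severe acute respiratory syndrome coronavirus 2",
    "betacoronavirus",
    "coronaviridae",
    "coronavirus",
]
words = ["coronavirus", "coronaviridae", "betacoronavirus"]

expression = build_expression(fields, words)
# Hypothetical host; replace with your own MAKES instance or API endpoint.
url = build_evaluate_url("https://example.invalid/academic", expression)
print(expression)
print(url)
```

Building the expression from lists keeps the field names and keywords in one place, so adding a new term does not require editing a long query string by hand.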
Microsoft Academic website
The Microsoft Academic website is updated when new versions of the graph are released, about once a week. The site is designed to provide top ‘n’ results of search queries through an easy-to-use interface. One advantage of using our website is its improved search functionality, which allows for semantic interpretations and suggestions for your queries.
In the example above, you can see results from a query for coronavirus disease 2019, the topic for the current novel coronavirus. We understand the topic from the query and populate a card to the right of the search results showing a description along with parent and related topics. Selecting a publication from the list takes you to a details page showing the information we have collected about it (publishing venue, authors, institutions, links we have found to the document on the web, and any topics tagged for the publication). The bottom section of the page shows the publications it references, the publications we have found that cite it, and a tab for related publications.
We also provide analytics for each entity type in our graph (publications, authors, topics, conferences, journals, and institutions). Our analytics pages allow you to search for topics and find the top 100 entities, trend data and an overview of the distribution of entity types in the graph.
The Microsoft Academic team is committed to providing the community with any data that can help stem the COVID-19 advance. We hope this blog post has offered some guidance; we will update it as details or related information change.
Appendix A – Links to Microsoft Academic resources
Appendix B – Updates to this post
- 3/16 – Original publication
- 3/18 – Added links to source code examples and data files
- 3/20 – Added link to sample code to search MAG for COVID-19 papers using Fields of Study along with publication title and abstract term matching