Rationalizing Semantic and Keyword Search on Microsoft Academic

Established: February 22, 2016

May 21, 2020

Over the past 6 months we’ve been experimenting with a host of changes to Microsoft Academic’s search experience, and now that the last of those experiments has shipped we’re excited to finally discuss them.

Before we jump in, if you’re interested in a deeper technical analysis of the new capabilities please review the following resources:

No room for interpretation?

From the initial release of Microsoft Academic in 2016, up until 6 months ago, our semantic search algorithm focused on generating results that best matched semantically coherent interpretations of user queries, informed by the Microsoft Academic Graph (MAG).

To better explain, let’s examine the query “covid-19 science”. Traditional search engines based on keyword search (e.g. Google Scholar, Semantic Scholar, Lens.org) do an excellent job of retrieving relevant results that have keyword matches for “covid-19” and variations of “science” (science, sciences, scientific, etc.). Our system, however, prefers to interpret “covid-19” as a shorthand reference (synonym) for the topic “Coronavirus disease 2019 (COVID-19)” and “science” as the journal “Science”, because MAG suggests this interpretation will turn up more highly cited and relevant papers than treating the query as simple paper full-text (title/abstract/body) keywords. This distinction is important, as it allows our semantic search algorithm to leverage semantic inference to retrieve seminal publications that do not strictly contain “covid-19” as a keyword, yet are nevertheless relevant and important.

Even so, we previously allowed only rudimentary keyword matching, namely prefix and literal unigram matching of publication titles (with no support for stemming or spelling correction). Unfortunately, the outcome of this limited keyword matching was frequent encounters with the dreaded “no results” page.

For example, assume you were looking for a paper that you thought was named “heterogeneous network embeddings via deep architectures”. Entering this phrase as a query would result in no suggestions and an error page if executed on the site:

No search results

This is a classic case of users knowing what they want but having difficulty getting an algorithm to understand. A common problem with keyword search is that it puts the burden of choosing the “right” keywords for a query squarely on the shoulders of the user.

Now with our newest search implementation this same query will work exactly as intended:

Paper search result with dropped term

To understand why this now works we first need to explain how our semantic search implementation works.

Ok, maybe a little room for interpretation

To put it simply, we’ve changed our semantic search implementation from a strict form where all terms must be understood to a looser form where as many terms as possible are understood.

The formulation of semantic interpretations (as explained above) remains unchanged, in that the knowledge in MAG still plays the central role in guiding how a query should be interpreted. What has changed is that when a portion of a query is thought to refer to full-text properties (i.e. title, abstract), the algorithm can now dynamically switch to a new scoring function that is more appropriate than literal unigram matching and hence less brittle, as the example above shows.

Going a bit deeper, let’s define what “as many terms as possible are understood” means. By its nature, loose semantic query interpretation produces the interpretations with the highest coverage first and fastest; as interpretations with less coverage (i.e. with terms dropped from consideration) are generated, both relevance and speed decrease. The reason is technical: the search space grows exponentially as the query under consideration becomes less specific. So in practice “as many as possible” is better defined as “as many as possible in a fixed amount of time”.

This means that, depending on variables such as query complexity and service load, the more loosely matched results generated before the fixed timeout (aka the result “tail”) could vary between sessions. However, because the interpretations with the highest coverage are generated first, the results they cover (aka the “head”) are very stable.
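To make the “as many as possible in a fixed amount of time” idea concrete, here is a minimal toy sketch in Python. It is not the Microsoft Academic implementation: it models an interpretation as nothing more than the subset of query terms that is kept, and the function name `loose_interpretations`, the `budget_seconds` parameter and the injectable `clock` are all assumptions made for illustration. What it does show is the behavior described above: the full-coverage interpretation always comes out first, and how far the enumeration gets into lower-coverage interpretations depends on the time budget.

```python
import itertools
import time
from typing import Callable, Iterator, Sequence, Tuple


def loose_interpretations(
    terms: Sequence[str],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Tuple[str, ...]]:
    """Yield kept-term subsets, highest coverage first, until the budget runs out.

    Toy model only: a real interpreter would consult a knowledge graph
    instead of enumerating raw subsets of terms.
    """
    deadline = clock() + budget_seconds
    # Start with every term kept, then allow one more dropped term per round.
    for kept in range(len(terms), 0, -1):
        for subset in itertools.combinations(terms, kept):
            yield subset
            # Checking the deadline after the yield guarantees that the
            # full-coverage "head" is produced even with a tiny budget.
            if clock() >= deadline:
                return


if __name__ == "__main__":
    query = ["heterogeneous", "network", "embeddings", "deep", "architectures"]
    for interpretation in loose_interpretations(query, budget_seconds=0.01):
        print(len(interpretation), interpretation)
```

The number of subsets grows combinatorially as more terms are dropped, which is why the tail of the enumeration is the part that gets cut off first when the budget is tight.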

While this change is a great remedy for queries with full-text matching intent, the loosened interpretation also affects semantic search results: they are no longer as concise as before, because the result “tail” is longer and now includes full-text matches.

As always, an example speaks a thousand words:

Query formulation

BEFORE

Show results matching top interpretations where all query terms are understood, ranked only by paper salience (static rank, aka importance)

AFTER

Show results matching top interpretations where as many query terms as possible are understood, ranked first by number of terms matched then by paper salience
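The difference between the two formulations can be sketched as two sort orders. The Python snippet below is an illustration under assumptions, not the production ranker: the `Result` class, its field names and the sample values are hypothetical, and it assumes that a higher (less negative) static rank means a more salient paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    title: str          # hypothetical placeholder title
    matched_terms: int  # number of query terms understood for this result
    static_rank: float  # assumed: higher (less negative) means more salient


def rank_before(results: List[Result], query_length: int) -> List[Result]:
    """Strict form: keep only results matching every term, order by salience."""
    full = [r for r in results if r.matched_terms == query_length]
    return sorted(full, key=lambda r: r.static_rank, reverse=True)


def rank_after(results: List[Result]) -> List[Result]:
    """Loose form: order by number of matched terms first, then by salience."""
    return sorted(
        results,
        key=lambda r: (r.matched_terms, r.static_rank),
        reverse=True,
    )


if __name__ == "__main__":
    # Made-up candidates for an 8-term query.
    candidates = [
        Result("paper B", matched_terms=6, static_rank=-18.5),
        Result("paper A", matched_terms=8, static_rank=-20.0),
        Result("paper D", matched_terms=4, static_rank=-17.0),
        Result("paper C", matched_terms=6, static_rank=-19.0),
    ]
    print([r.title for r in rank_before(candidates, query_length=8)])
    print([r.title for r in rank_after(candidates)])
```

With these made-up values the strict form returns only “paper A”, while the loose form keeps all four and still places “paper A” first despite its lower static rank.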

Let’s take a closer look at the new “loose” semantic search algorithm, as it comes with a new user interface that illustrates how each search result is understood in the context of the user query:

As mentioned earlier, results are first ranked based on the number of query terms matched. In this case the first result matched all query terms and takes the top spot even though it has a lower static rank (and citation count) than the following two results. Another important item to call out is that when query terms are matched using synonyms, the synonymous terms are shown in parentheses next to the canonical form, e.g. the user typed “z shen” but it was matched to “zhihong shen”.

Here we can see that the new semantic search results are based on “loose” interpretations. In both cases, the query terms “acl 2018” were not understood in the context of the result and are shown crossed out, while the other terms keep the same semantic understanding as in the first result. Additionally, both results have a higher static rank than the first result but are ranked lower because they match less of the query.

As we look further into the tail of results we can see how much of the query can be dropped (in this case 4 of the 8 query terms).

Matching phrases

Historically, Microsoft Academic has supported matching queries to values in a few different ways:

  • Matching exact values, e.g.
    “a web scale system for scientific knowledge exploration” => “a web scale system for scientific knowledge exploration”
  • Matching the beginning of values (aka prefix completions, only available as query suggestions), e.g.
    “a web scale system for scientific” => “a web scale system for scientific knowledge exploration”
  • Literally matching words from the value, e.g.
    “microsoft academic overview” => “an overview of microsoft academic service mas and applications”

In addition, we now support a new form of partial value matching based on phrases. This is a common feature of keyword search, where query interpretation prefers interpretations with closer term proximity. For example, compare the results for the query “deep learning brain images” under simple word matching and under phrase matching:

Top 5 papers using word matching, where results are based on matching words and ranking based on paper static rank:

  • Classification of CT brain images based on deep learning networks
    (Static rank = -18.994, Distance = 4)
  • Unsupervised Deep Feature Learning for Deformable Registration of MR Brain Images
    (Static rank = -19.036, Distance = 8)
  • Application of deep transfer learning for automated brain abnormality classification using MR images
    (Static rank = -19.305, Distance = 10)
  • Age estimation from brain MRI images using deep learning
    (Static rank = -19.727, Distance = 6)
  • Exploring deep features from brain tumor magnetic resonance images via transfer learning
    (Static rank = -20.06, Distance = 13)

Top 5 papers using phrase matching, where results are based on first matching words and then re-ranking based on edit distance between query and value (ignoring stop words):

  • Deep Learning on Brain Images in Autism: What Do Large Samples Reveal of Its Complexity?
    (Static rank = -20.372, Distance = 0)
  • Deep learning of brain images and its application to multiple sclerosis
    (Static rank = -20.534, Distance = 0)
  • Classification of CT brain images based on deep learning networks
    (Static rank = -18.994, Distance = 4)
  • Unsupervised Deep Feature Learning for Deformable Registration of MR Brain Images
    (Static rank = -19.036, Distance = 8)
  • A deep learning-based segmentation method for brain tumor in MR images
    (Static rank = -20.171, Distance = 6)

This new ability to re-rank based on query-value edit distance also allows us to support quoted phrases in queries.

The rules for quoted values are:

  • A quoted value can only be matched to a single field, i.e. title, author name, journal name, etc.:
    Works: “deep learning” (matches field of study)
    Works: “microsoft research” (matches affiliation)
    Doesn’t work: “deep learning microsoft research”
  • For attributes that support partial matching (title, abstract), all quoted words must have a term-based edit distance of zero, ignoring stop words:
    Works: “deep learning brain images”
    Doesn’t work: “brain deep images learning”
  • Queries can contain multiple quoted values, each being evaluated using the rules defined above:
    Works: “deep learning” “microsoft research”
  • A quoted value is treated as a single query term and can be dropped accordingly based on the new search algorithm:
    Doesn’t work: “deep learning at microsoft research rocks!”
    Works: deep learning “at microsoft research rocks!”
  • All terms in a quoted value are normalized in exactly the same fashion as non-quoted terms
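As an illustration of the zero-distance rule for title and abstract matches, the Python sketch below reads “a term-based edit distance of zero, ignoring stop words” as “the quoted terms appear contiguously and in order once stop words are removed”. That reading, the stop-word list and the function names are assumptions made for this example; the post does not specify the actual distance computation or normalization.

```python
import re
from typing import List

# Hypothetical stop-word list; the real one is not given in the post.
STOP_WORDS = frozenset({"a", "an", "and", "for", "in", "its", "of", "on", "the", "to"})


def content_terms(text: str) -> List[str]:
    """Lowercase, split into alphanumeric terms, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def matches_quoted_phrase(phrase: str, value: str) -> bool:
    """True if the phrase's terms occur contiguously and in order in the value."""
    needle = content_terms(phrase)
    haystack = content_terms(value)
    if not needle:
        return False
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


if __name__ == "__main__":
    title = "Deep Learning on Brain Images in Autism: What Do Large Samples Reveal of Its Complexity?"
    print(matches_quoted_phrase("deep learning brain images", title))   # in order
    print(matches_quoted_phrase("brain deep images learning", title))   # shuffled
```

Under these assumptions the first call matches because “on” is ignored, and the second does not because the terms are out of order.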

Support for searching paper abstracts

We have finally added support for a long-requested feature: searching paper abstracts! This is an important addition that significantly expands the reach of our partial-term matching for papers.

Abstracts are treated like all other semantic values, meaning they can be matched implicitly or explicitly using the “abstract:” scope, e.g.:

  • title: “microsoft academic” abstract: “heterogeneous entity graph”
  • “microsoft academic” “heterogeneous entity graph”

Scoped queries

Microsoft Academic has always supported query “hints” that require subsequent terms to match a specific attribute, e.g. the classic “papers about ”, but with our most recent release we now also support colon-delimited scopes.

The rules for scopes are simple: the query term immediately after the scope must be matched with that scope’s attribute type. A query “term” is defined as a single word or a quoted phrase. For example, if you wanted to match papers with “heterogeneous”, “entity” and “graph” in their abstracts but didn’t care whether they appear as a sequence, you would issue the query “abstract: heterogeneous abstract: entity abstract: graph”.

Supported scopes and their corresponding triggers:

Scope         Description                                          Example
abstract:     Match term or quoted value from the paper abstract   abstract: “heterogeneous entity graph comprised of six types of entities”
affiliation:  Match affiliation (institution) name                 affiliation: “microsoft research”
author:       Match author name                                    author: “darrin eide”
conference:   Match conference series name                         conference: www
doi:          Match paper Digital Object Identifier (DOI)          doi: 10.1037/0033-2909.105.1.156
journal:      Match journal name                                   journal: nature
title:        Match term or quoted value from the paper title      title: “an overview of microsoft academic service mas and applications”
topic:        Match paper topic (field of study)                   topic: “knowledge base”
year:         Match paper publication year                         year: 2015
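To show how the scope rule (“the query term immediately after the scope must be matched with that scope’s attribute type”) plays out, here is a small Python sketch that splits a query into (scope, term) pairs. It is a hypothetical client-side illustration, not the parser used by Microsoft Academic: the function name `parse_query`, the regular expression, and the choice to accept both straight and curly quotes are assumptions.

```python
import re
from typing import List, Optional, Tuple

# Scope names taken from the list above.
SCOPES = (
    "abstract", "affiliation", "author", "conference", "doi",
    "journal", "title", "topic", "year",
)

# An optional "scope:" prefix followed by one term, where a term is either
# a quoted phrase (straight or curly quotes) or a single run of non-space text.
_TOKEN = re.compile(
    r'(?:(' + "|".join(SCOPES) + r'):\s*)?'
    r'(?:["“]([^"”]*)["”]|(\S+))',
    re.IGNORECASE,
)


def parse_query(query: str) -> List[Tuple[Optional[str], str]]:
    """Return (scope, term) pairs; scope is None for unscoped terms."""
    pairs = []
    for match in _TOKEN.finditer(query):
        scope, quoted, bare = match.groups()
        term = quoted if quoted is not None else bare
        pairs.append((scope.lower() if scope else None, term))
    return pairs


if __name__ == "__main__":
    print(parse_query("abstract: heterogeneous abstract: entity abstract: graph"))
    print(parse_query("title: “microsoft academic” abstract: “heterogeneous entity graph”"))
    print(parse_query("doi: 10.1037/0033-2909.105.1.156 year: 2015"))
```

Because a scope binds only to the single term that follows it, repeating the scope (as in the first query) is how each word gets scoped individually, while a quoted phrase counts as one term.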

 

Feedback welcome

These changes have been in the works for over 6 months, and as always we’d love to hear your feedback, be it suggestions, critiques, bug reports or kudos. To provide feedback, navigate to Microsoft Academic and click the “feedback” icon in the lower right-hand corner.

Stay tuned in the coming weeks for another search-oriented post about how you can accomplish reference string parsing using Microsoft Academic Services!