Expanding Semantic Search into Biomed with Medical Subject Headings (MeSH)
MeSH is a controlled, hierarchically organized vocabulary that the National Library of Medicine (NLM), part of the National Institutes of Health (NIH), maintains for indexing, cataloging, and searching biomedical databases such as PubMed. Since releasing the new version of Microsoft Academic (MA) nearly five years ago, we have increasingly observed that many user queries are phrased using MeSH terminology. That observation, coupled with the prevalence of biomedical literature in the Microsoft Academic Graph (MAG), led us to pursue the integration of MeSH into MA’s unique semantic search capabilities.
Revisiting semantic search
One of the core differentiating behaviors of MA has always been its emphasis on semantic search. In contrast to keyword search where a search engine performs best when users select the “right” keywords that match how the contents are indexed, semantic search is designed for the cases when it is not clear what the “right” keywords should be. For example, suppose you want to find the most influential publications in artificial intelligence (AI). Using the query “artificial intelligence” with a keyword-based search engine, you will get results where the query terms explicitly appear in the paper title/body, which misses the influential publications on AI that do not contain those specific terms. A semantic search engine like MA, on the other hand, will be able to overcome this limitation.
At the time of writing, the top results for the query “artificial intelligence” on MA are articles that demonstrate the efficacy of deep convolutional neural networks for computer vision. These trend-setting articles do not include “artificial intelligence” anywhere in their titles, abstracts, or even full text, and hence will not be retrieved by keyword search unless additional field of study annotations are also indexed as keywords.
However, there are scenarios that indexing extra keywords cannot address so easily, and this is where our semantic search truly shines.
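The contrast can be sketched with a toy example. The `Paper` class, both search functions, and the two titles are made up for illustration; `topic_search` only models the "field of study annotations indexed as keywords" workaround mentioned above, not how MA actually ranks results.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    title: str
    # Field-of-study annotations attached at indexing time (toy data).
    topics: set = field(default_factory=set)


def keyword_search(papers, query):
    """Return papers whose title literally contains the query string."""
    q = query.lower()
    return [p for p in papers if q in p.title.lower()]


def topic_search(papers, query):
    """Return papers annotated with the query as a field of study."""
    q = query.lower()
    return [p for p in papers if q in {t.lower() for t in p.topics}]


papers = [
    Paper("Deep convolutional networks for image classification (made-up title)",
          {"artificial intelligence", "computer vision"}),
    Paper("Artificial intelligence in education (made-up title)",
          {"artificial intelligence", "education"}),
]

keyword_hits = keyword_search(papers, "artificial intelligence")
topic_hits = topic_search(papers, "artificial intelligence")
print(len(keyword_hits), len(topic_hits))
```

The keyword lookup finds only the paper that spells out the query in its title, while the topic lookup also returns the computer vision paper.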
What are composite attributes?
Composite data relationships are one such example. In a world where talent can move from one institution to another, it is common to see authors with publications affiliated with different institutions. Meanwhile, authors can also collaborate with others from their previous affiliations. A query consisting of an author and an institution can therefore be interpreted as a request to find either the work of the author while affiliated with the institution, or the collaborative work this author has with that institution. We can distinguish these two meanings by modeling the author-affiliation relationship as a composite attribute of a publication. Our API users have always been able to express this nuanced intent using the composite query function, and we are now making the same capability available to our website users.
Take the Turing Award winner Yann LeCun as an example. As a renowned computer scientist, he has had a productive career through AT&T Bell Labs, the Courant Institute at New York University and, most recently, Facebook. Previously, MA treated the query “Yann LeCun New York University” by lumping the search results of both interpretations together. MA users can now use “Yann LeCun while at New York University” to scope the search more narrowly to papers written while the author was affiliated with New York University. As the goal of semantic search is to zoom in on the most relevant result, being able to express more precise intent can help quickly filter the massive result sets that a keyword search engine would produce. For example, MA will serve up only one result for the query “Yann LeCun while at New York University Bell Labs”: a paper that Yann LeCun, while at New York University, coauthored with a Bell Labs researcher. None of the papers he published while he worked at Bell Labs are included in the search results, as shown below (Note: be sure to engage with the query suggestion as explained in MA FAQ):
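A minimal sketch of the idea, assuming a toy in-memory model: every name here (`Authorship`, `while_at`, `either_interpretation`, the placeholder authors and institutions) is hypothetical and is not MA's schema or API. The point is only that a composite attribute keeps the author and affiliation paired, so a query can require both to match on the same authorship.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Authorship:
    # One composite value: the author and the affiliation listed on the paper.
    author: str
    affiliation: str


@dataclass(frozen=True)
class Paper:
    title: str
    authorships: tuple  # tuple of Authorship


def while_at(papers, author, affiliation):
    """Author and affiliation must match within the same authorship pair."""
    return [p.title for p in papers
            if any(a.author == author and a.affiliation == affiliation
                   for a in p.authorships)]


def either_interpretation(papers, author, affiliation):
    """Author and affiliation may match on different authorship pairs."""
    return [p.title for p in papers
            if any(a.author == author for a in p.authorships)
            and any(a.affiliation == affiliation for a in p.authorships)]


papers = [
    Paper("paper 1", (Authorship("author a", "institution x"),)),
    Paper("paper 2", (Authorship("author a", "institution y"),
                      Authorship("author b", "institution x"))),
    Paper("paper 3", (Authorship("author b", "institution x"),)),
]

scoped = while_at(papers, "author a", "institution x")
lumped = either_interpretation(papers, "author a", "institution x")
collaboration_only = [t for t in lumped if t not in scoped]
print(scoped, lumped, collaboration_only)
```

Here "paper 2" appears only under the looser interpretation, because author a wrote it at a different institution with a coauthor from institution x.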
Similarly, the query “Yann LeCun Bell Labs” is now treated as an ambiguous query and will prompt MA to help the user clarify their intent with disambiguating query suggestions:
MeSH as a composite attribute
Composite attributes provide a powerful mechanism for grouping concepts that should be processed together, and one area that further demonstrates their efficacy is the handling of Medical Subject Headings (MeSH).
In the MeSH implementation now available on MA, two basic types of MeSH records are included: the descriptor (aka main heading) and the qualifier (aka subheading). Descriptors characterize the subject matter or content of an article, while qualifiers are used in connection with descriptors to define a particular aspect of a subject.
A good way to understand the differences between descriptors and qualifiers, and our rationale for keeping them as distinct fields in a composite attribute, is through terms that can play either role. Take “mortality” as an example. MA can now differentiate the dual roles this term can play directly in the query suggestion dropdown, where a darker icon marks a descriptor and a lighter icon marks a qualifier:
After clicking the fourth suggestion, which instructs MA to interpret “mortality” as a descriptor, one can see (from the “Top Topics” on the left rail of the search result page) that research on this subject commonly co-occurs with topics in “demography”, “population” and “public health”.
Further down the search result page are new sections for top co-occurring MeSH descriptors, where we can see that mortality is typically studied together with other subjects such as sex (male vs. female), age, and geography. Similarly, the top related MeSH qualifiers show that research articles addressing the subject of mortality commonly come from the areas of epidemiology or etiology, and that the top topics include mortality trends and prevention and control:
In contrast, when asking MA to interpret “mortality” as a qualifier, we can see “mortality” is often an aspect in “internal medicine”, “surgery”, “cardiology” or “cancer” research. Take heart attack (MeSH descriptor “myocardial infarction”) as an example. As MA can now show, this area of research can be studied through many aspects, including “mortality” but also others ranging from “drug treatment” to “complications”:
In this example, if you want to focus on articles about the mortality rate of heart attacks, you can select the first query suggestion “myocardial infarction in relation to mortality”. On the subsequent search result page, all the top-most results will match the “myocardial infarction/mortality” descriptor/qualifier pair, indicated by the highlighted tag:
One important item to note here is the presence of the “*”, which is a MeSH convention to annotate the “major topic” for an article. This major topic flag is used in MA as one of the many signals in determining search result rankings. However, because search rankings are influenced by many factors, it is possible that an article whose major topic matches the query perfectly is ranked lower than others whose major topics are not as tightly matched.
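To make the last point concrete, here is a hedged sketch. The tag syntax (a `/` between descriptor and qualifier, with `*` anywhere in the tag marking a major topic), the `toy_score` function, its bonus weight, and the base relevance values are all assumptions made for illustration; MA's actual ranking signals and weights are not described here.

```python
from typing import NamedTuple, Optional


class MeshTag(NamedTuple):
    descriptor: str
    qualifier: Optional[str]
    major_topic: bool


def parse_mesh_tag(tag: str) -> MeshTag:
    """Split a 'descriptor/qualifier' tag; any '*' marks a major topic (assumed syntax)."""
    major = "*" in tag
    cleaned = tag.replace("*", "").strip()
    descriptor, sep, qualifier = cleaned.partition("/")
    return MeshTag(descriptor.strip().lower(),
                   qualifier.strip().lower() if sep else None,
                   major)


def toy_score(base_relevance: float, tag: MeshTag, major_topic_bonus: float = 0.5) -> float:
    """Combine a stand-in for all other signals with a hypothetical major-topic bonus."""
    return base_relevance + (major_topic_bonus if tag.major_topic else 0.0)


tight = parse_mesh_tag("myocardial infarction/mortality*")
loose = parse_mesh_tag("myocardial infarction/mortality")

# Invented base relevance values: the looser match is stronger on other signals.
score_tight = toy_score(1.0, tight)
score_loose = toy_score(2.0, loose)
print(tight, score_tight, score_loose)
```

With these invented numbers the article without the major-topic flag still scores higher, which mirrors the situation described above where other factors outweigh a perfect major-topic match.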
Moving back to query formulation, similar to the author/affiliation example showcased above, when encountering the ambiguous query “heart attack mortality” MA will now generate two suggestions that reflect distinct interpretations:
The first interpretation generates results explicitly about the mortality of heart attacks. The second query suggestion, however, reflects a larger set of results with articles about the mortality rate for diseases (not specifically heart attacks) but also mentioning heart attacks (e.g. as a preexisting condition). To put it another way, the first interpretation is more specific and the second less specific.
As with author/affiliation metadata, modeling MeSH concepts as composite attributes is what enables this behavior in semantic search. It also allows descriptor and qualifier values to be queried independently of each other.
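The following sketch shows how a term such as “mortality” can be matched in either role, and how the paired and looser interpretations differ. It assumes a toy data model: `MeshHeading`, `matches`, `search`, and the three articles are hypothetical and do not reflect MA's internal representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MeshHeading:
    # One composite value: a descriptor with an optional qualifier.
    descriptor: str
    qualifier: Optional[str] = None


def matches(headings, descriptor=None, qualifier=None):
    """True if a single heading satisfies every field that was given."""
    return any(
        (descriptor is None or h.descriptor == descriptor)
        and (qualifier is None or h.qualifier == qualifier)
        for h in headings
    )


articles = {
    "article 1": [MeshHeading("myocardial infarction", "mortality")],
    "article 2": [MeshHeading("neoplasms", "mortality"),
                  MeshHeading("myocardial infarction", "complications")],
    "article 3": [MeshHeading("mortality", "trends")],
}


def search(descriptor=None, qualifier=None):
    return [t for t, hs in articles.items() if matches(hs, descriptor, qualifier)]


# Descriptor and qualifier must sit on the same heading.
paired = search(descriptor="myocardial infarction", qualifier="mortality")
# Descriptor and qualifier may come from different headings.
unpaired = [t for t, hs in articles.items()
            if matches(hs, descriptor="myocardial infarction")
            and matches(hs, qualifier="mortality")]
# The same term queried in each role on its own.
as_descriptor = search(descriptor="mortality")
as_qualifier = search(qualifier="mortality")
print(paired, unpaired, as_descriptor, as_qualifier)
```

In this toy data, "article 2" is about mortality in another disease and merely mentions myocardial infarction, so it shows up only under the less specific interpretation.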
As MeSH concepts overlap significantly with MA’s existing topics, we’ve also provided new scoping triggers for MeSH so that queries can be more precisely specified:
|mesh:||Match MeSH descriptor and/or qualifier||mesh: heart attack; mesh: heart attack mortality; mesh: heart attack in relation to mortality|
|mesh descriptor||Match MeSH descriptor||mesh descriptor heart attack|
|mesh qualifier||Match MeSH qualifier||mesh qualifier diagnosis|
|abstract:||Match term or quoted value from the paper abstract||abstract: “heterogeneous entity graph comprised of six types of entities”|
|affiliation:||Match affiliation (institution) name||affiliation: “microsoft research”|
|author:||Match author name||author: “darrin eide”|
|conference:||Match conference series name||conference: www|
|doi:||Match paper Document Object Identifier (DOI)||doi: 10.1037/0033-2909.105.1.156|
|journal:||Match journal name||journal: nature|
|title:||Match term or quoted value from the paper title||title: “an overview of microsoft academic service mas and applications”|
|topic:||Match paper topic (field of study)||topic: “knowledge base”|
|year:||Match paper publication year||year: 2015|
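As a rough illustration of the trigger syntax in the table, the sketch below splits a query into a trigger and a value. The trigger strings are copied from the table as written; the `split_scoped_query` function and its prefix-matching rule are assumptions for illustration and say nothing about how MA itself parses queries.

```python
# Trigger prefixes exactly as they appear in the table above.
TRIGGERS = [
    "mesh:", "mesh descriptor", "mesh qualifier", "abstract:", "affiliation:",
    "author:", "conference:", "doi:", "journal:", "title:", "topic:", "year:",
]


def split_scoped_query(query):
    """Return (trigger, value); trigger is None when no known prefix matches."""
    q = query.strip()
    lowered = q.lower()
    # Try longer triggers first so the most specific prefix wins.
    for trigger in sorted(TRIGGERS, key=len, reverse=True):
        if lowered.startswith(trigger):
            # Drop the prefix, then surrounding whitespace and quote marks.
            value = q[len(trigger):].strip().strip('"“”')
            return trigger.rstrip(":"), value
    return None, q


print(split_scoped_query("author: “darrin eide”"))
print(split_scoped_query("mesh descriptor heart attack"))
print(split_scoped_query("heart attack mortality"))
```

An unscoped query such as “heart attack mortality” returns no trigger, which corresponds to the ambiguous case where MA offers disambiguating suggestions instead.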
In closing, we are excited about the addition of MeSH to MA and the opportunities it opens up for the research community. As always, we love getting feedback and try to respond to as much of it as possible. To provide feedback, navigate to Microsoft Academic and click the “feedback” icon in the lower right-hand corner.