Natural language processing (NLP) is a major frontier of artificial intelligence. With the recent statistical and neural revolutions, the research community has made great strides on many fronts, such as machine translation, speech recognition, question answering, and language model pretraining. However, mainstream NLP focuses on general domains such as newswire and the web. By contrast, specialized domains such as biomedicine have received relatively little attention.
With profound disruptions such as the $1,000 human genome and the rapid digitization of medical records, biomedicine faces an unprecedented opportunity to fulfill the dream of precision medicine, where treatments become increasingly effective by being tailored to individual patients. NLP can play a key role in this revolution. The bulk of biomedicine is conducted and recorded in natural language, and the deluge of biomedical information is well beyond manual curation. For example, PubMed adds two biomedical papers every minute: thousands every day and over a million every year. Similarly, every year there are two million new cancer patients in the US and seventeen million worldwide, and a cancer patient's medical record typically comprises hundreds of clinical notes. To create personalized, data-centric precision medicine, it is thus imperative to develop NLP methods that can understand biomedical text and extract knowledge from it.
Table 1: Comparison of common biomedical terms in the vocabularies used by the standard BERT, SciBERT, and PubMedBERT. A check mark indicates that the biomedical term appears in the corresponding vocabulary; otherwise, the term is shattered into small sub-words.
Biomedical text is drastically different from general-domain text. Table 1 compares the vocabularies used by neural language models, as generated from general-domain text (BERT), PubMed and computer science papers (SciBERT), or purely biomedical text (PubMedBERT). Aside from the most frequent ones, biomedical terms are not first-class citizens in general-domain vocabularies.
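To make the sub-word shattering concrete, here is a minimal sketch of WordPiece-style greedy longest-match tokenization. The tiny vocabularies and the example term `naloxone` are illustrative assumptions, not the real BERT or PubMedBERT vocabularies, which contain tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first tokenization, as in WordPiece."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary entry matches at all
        tokens.append(piece)
        start = end
    return tokens

# A hypothetical general-domain vocabulary lacks the whole drug name ...
general_vocab = {"na", "##lo", "##xo", "##ne", "aspirin"}
# ... while a biomedical vocabulary keeps it as a single token.
biomed_vocab = general_vocab | {"naloxone"}

print(wordpiece_tokenize("naloxone", general_vocab))  # ['na', '##lo', '##xo', '##ne']
print(wordpiece_tokenize("naloxone", biomed_vocab))   # ['naloxone']
```

Shattering a term into several pieces forces the model to reassemble its meaning from fragments, which is one reason a domain-specific vocabulary helps biomedical pretraining.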
Biomedical applications also highlight key research areas that have been underexplored in mainstream NLP. In biomedicine, the most valuable information, such as the latest medical findings, typically has little redundancy. Consequently, biomedical NLP must tackle more complex linguistic phenomena. For example, while most information-extraction work focuses on extracting binary relations within single sentences, in biomedicine we need to expand the scope to complex relations involving three or more entities, which typically span multiple sentences.
Biomedical NLP also faces substantial machine learning challenges. The standard supervised paradigm requires annotated examples. In general domains, these can be produced at scale through crowdsourcing. In specialized domains such as biomedicine, crowdsourcing is difficult, as annotation requires domain expertise that crowd workers do not possess.
At Microsoft, we are committed to advancing the state of the art in biomedical NLP, with the ultimate goal of accelerating progress in precision medicine. We are building on our deep research assets in deep learning and biomedical machine reading. In foundational research, we focus on advancing self-supervised learning, such as conducting task-agnostic biomedical language model pretraining and proposing a general framework for task-specific self-supervision. In end-to-end biomedical applications, we have made exciting progress in deep collaborations with Microsoft partners such as JAX and Providence. Most recently, we have created BLURB, a comprehensive benchmark and leaderboard for biomedical NLP, and released our state-of-the-art PubMedBERT models. With these open-source resources, we hope to lower the barriers to entry for NLP researchers aspiring to contribute to biomedicine and help accelerate progress in this vitally important area.