Domain-specific language model pretraining for biomedical natural language processing

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general-domain corpora, such as newswire and web text. Biomedical text is very different from general-domain text, yet biomedical NLP has been relatively underexplored. A prevailing assumption is that even domain-specific pretraining benefits from starting with a general-domain language model.

In this webinar, Microsoft researchers Hoifung Poon, Senior Director of Biomedical NLP, and Jianfeng Gao, Distinguished Scientist, will challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models.

You will begin by learning how biomedical text differs from general-domain text and how biomedical NLP poses substantial challenges that are not present in mainstream NLP. You will also learn about the two paradigms for domain-specific language model pretraining and see how pretraining from scratch significantly outperforms mixed-domain pretraining on a wide range of biomedical NLP tasks. Finally, you will find out about BLURB, our comprehensive benchmark and leaderboard created specifically for biomedical NLP, and see how our biomedical language model, PubMedBERT, sets a new state of the art.

Together, you’ll explore:

How biomedical NLP differs from mainstream NLP
A shift in approach to pretraining language models for specialized domains
BLURB: a comprehensive benchmark and leaderboard for biomedical NLP
PubMedBERT: the state-of-the-art biomedical language model pretrained from scratch on biomedical text

Resource list:

BioMed NLP Group (Group page)
Hanover (Project page)
Deep Learning (Group page)
BLURB (GitHub)
PubMedBERT (GitHub)
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing (Paper)
Hoifung Poon (Profile page)
Jianfeng Gao (Profile page)

This on-demand webinar features a previously recorded Q&A session and open captioning.

Explore more Microsoft Research webinars: https://aka.ms/msrwebinars

Date: October 15, 2020
Speakers: Hoifung Poon, Jianfeng Gao
Affiliation: Microsoft Research

- Hoifung Poon, General Manager, Health Futures
- Jianfeng Gao, Distinguished Scientist & Vice President
Research Lab
- Microsoft Research Lab - Redmond
Group
- Deep Learning Group
- Real-world Evidence
Publication
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
Download
- BLURB

