Towards universal language embeddings

Published March 18, 2019

By Jianfeng Gao, Technical Fellow & Corporate Vice President

Language embedding is a process of mapping symbolic natural language text (for example, words, phrases and sentences) to semantic vector representations. This is fundamental to deep learning approaches to natural language understanding (NLU). It is highly desirable to learn language embeddings that are universal to many NLU tasks.

Two popular approaches to learning language embeddings are language model pre-training and multi-task learning (MTL). The former learns universal language embeddings by leveraging large amounts of unlabeled data. MTL, by contrast, leverages supervised data from many related tasks and benefits from a regularization effect: training on several tasks at once alleviates overfitting to any single task, which makes the learned embeddings more universal across tasks.
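To make the MTL idea concrete, here is a minimal sketch of one way to interleave supervised data from several tasks during training: split each task's data into mini-batches, pool the batches, and shuffle them. The function names, task names and dataset sizes are hypothetical and chosen for illustration only; this is not taken from the MT-DNN code.

```python
import random


def make_batches(examples, batch_size):
    """Split one task's examples into consecutive mini-batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def multitask_schedule(task_data, batch_size, seed=0):
    """Return a shuffled list of (task_name, batch) pairs covering every task."""
    schedule = []
    for task_name, examples in task_data.items():
        for batch in make_batches(examples, batch_size):
            schedule.append((task_name, batch))
    random.Random(seed).shuffle(schedule)
    return schedule


# Hypothetical toy datasets; the task names and sizes are illustrative only.
task_data = {
    "classification": list(range(10)),
    "similarity": list(range(6)),
    "ranking": list(range(4)),
}

schedule = multitask_schedule(task_data, batch_size=4)
for task_name, batch in schedule:
    # A real trainer would compute this task's loss on the batch here and
    # update both the shared layers and this task's own layers.
    print(task_name, len(batch))
```

Because every step updates the shared parameters with a batch from a possibly different task, no single task dominates what the shared layers learn, which is where the regularization effect comes from.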

Researchers at Microsoft have released MT-DNN, a Multi-Task Deep Neural Network model for learning universal language embeddings. MT-DNN combines the strengths of MTL and the language model pre-training of BERT. It outperforms BERT on 10 NLU tasks and sets new state-of-the-art results across many popular NLU benchmarks, including General Language Understanding Evaluation (GLUE), Stanford Natural Language Inference (SNLI) and SciTail.

MT-DNN architecture

MT-DNN extends the model proposed by Microsoft in 2015 by incorporating a pre-trained bidirectional transformer language model, known as BERT, developed by Google AI. The architecture of the MT-DNN model is illustrated in the following figure. The lower layers are shared across all tasks while the top layers are task-specific. The input X, either a sentence or a pair of sentences, is first represented as a sequence of embedding vectors, one for each word, in l_1. Then the transformer-based encoder captures the contextual information for each word and generates the shared contextual embedding vectors in l_2. Finally, for each task, additional task-specific layers generate task-specific representations, followed by operations necessary for classification, similarity scoring, or relevance ranking. MT-DNN initializes its shared layers using BERT, then refines them via MTL.

MT-DNN architecture
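The split between shared lower layers and task-specific top layers can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the MT-DNN implementation: the transformer encoder is replaced by a simple mean-mixing stand-in, the weights are random rather than initialized from BERT, and the leading "[CLS]" token, the vocabulary, the dimension and all function names are assumptions made for the example.

```python
import math
import random

DIM = 4  # toy embedding width (assumption)


def lexicon_encoder(tokens, table):
    """l_1: map each token to its embedding vector."""
    return [table[t] for t in tokens]


def shared_encoder(vectors):
    """l_2 stand-in: add the sequence mean to every vector so each position
    sees some context. A real model would use a transformer encoder here."""
    n = len(vectors)
    mean = [sum(v[d] for v in vectors) / n for d in range(DIM)]
    return [[v[d] + mean[d] for d in range(DIM)] for v in vectors]


def classification_head(contextual, weights):
    """Task-specific layer: softmax over class scores from the first vector."""
    first = contextual[0]
    scores = [sum(w * x for w, x in zip(row, first)) for row in weights]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_head(contextual, weights):
    """Task-specific layer: one real-valued score from the first vector."""
    first = contextual[0]
    return sum(w * x for w, x in zip(weights, first))


rng = random.Random(0)
vocab = ["[CLS]", "a", "dog", "runs"]
# Shared parameters: one embedding table used by every task.
table = {tok: [rng.uniform(-1, 1) for _ in range(DIM)] for tok in vocab}
# Task-specific parameters: each task owns only its head.
heads = {
    "classification": (
        classification_head,
        [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)],
    ),
    "similarity": (similarity_head, [rng.uniform(-1, 1) for _ in range(DIM)]),
}


def forward(tokens, task):
    """Run the shared layers, then the head that belongs to the given task."""
    head_fn, head_weights = heads[task]
    contextual = shared_encoder(lexicon_encoder(tokens, table))
    return head_fn(contextual, head_weights)


tokens = ["[CLS]", "a", "dog", "runs"]
probs = forward(tokens, "classification")
score = forward(tokens, "similarity")
print(probs, score)
```

Both calls to `forward` pass through the same `lexicon_encoder` and `shared_encoder`; only the final head differs, which mirrors the layout described above.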

Domain adaptation results

One way to evaluate how universal the language embeddings are is to measure how quickly they can be adapted to a new task, or how many task-specific labels are needed to get a reasonably good result on that task. More universal embeddings require fewer task-specific labels.

The authors of the MT-DNN paper compared MT-DNN with BERT in domain adaptation, where both models are adapted to a new task by gradually increasing the amount of in-domain data used for adaptation. The results on the SNLI and SciTail tasks are presented in the following table and figure. With only 0.1% of in-domain data (which amounts to 549 samples in SNLI and 23 samples in SciTail), MT-DNN achieves more than 80% accuracy while BERT’s accuracy is around 50%, demonstrating that the language embeddings learned by MT-DNN are substantially more universal than those of BERT.

MT-DNN accuracy as compared with BERT across the SNLI and SciTail datasets.
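The sample counts quoted for the 0.1% setting can be checked with a short calculation. The training-set sizes below are assumptions taken from the public SNLI and SciTail releases rather than from this post, and `subset_size` is a hypothetical helper name.

```python
# Training-set sizes are assumptions based on the public dataset releases,
# not figures stated in this post.
TRAIN_SIZES = {"SNLI": 549_367, "SciTail": 23_596}


def subset_size(total, fraction):
    """Number of training examples in a subset, rounded down."""
    return int(total * fraction)


for name, total in TRAIN_SIZES.items():
    print(name, subset_size(total, 0.001))
# SNLI 549
# SciTail 23
```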

Release news

Microsoft will release the MT-DNN package to the public at https://github.com/namisan/mt-dnn. The release package contains the pre-trained models, the source code and a README that describes, step by step, how to reproduce the results reported in the MT-DNN paper and how to adapt the pre-trained MT-DNN models to new tasks via domain adaptation. We welcome your comments and feedback and look forward to future developments!