Towards universal language embeddings


Language embedding is a process of mapping symbolic natural language text (for example, words, phrases and sentences) to semantic vector representations. This is fundamental to deep learning approaches to natural language understanding (NLU). It is highly desirable to learn language embeddings that are universal to many NLU tasks.

Two popular approaches to learning language embeddings are language model pre-training and multi-task learning (MTL). The former learns universal language embeddings by leveraging large amounts of unlabeled data. MTL instead leverages supervised data from many related tasks and benefits from a regularization effect: training on several tasks at once reduces overfitting to any single task, which makes the learned embeddings more universal across tasks.



Researchers at Microsoft have released MT-DNN, a Multi-Task Deep Neural Network model for learning universal language embeddings. MT-DNN combines the strengths of MTL and the language model pre-training of BERT. It outperforms BERT on 10 NLU tasks and sets new state-of-the-art results on many popular NLU benchmarks, including General Language Understanding Evaluation (GLUE), Stanford Natural Language Inference (SNLI) and SciTail.

MT-DNN architecture

MT-DNN extends the model proposed by Microsoft in 2015 by incorporating a pre-trained bidirectional transformer language model, known as BERT, developed by Google AI. The architecture of the MT-DNN model is illustrated in the following figure. The lower layers are shared across all tasks while the top layers are task-specific. The input X, either a sentence or a pair of sentences, is first represented as a sequence of embedding vectors, one for each word, in l_1. Then the transformer-based encoder captures the contextual information for each word and generates the shared contextual embedding vectors in l_2. Finally, for each task, additional task-specific layers generate task-specific representations, followed by operations necessary for classification, similarity scoring, or relevance ranking. MT-DNN initializes its shared layers using BERT, then refines them via MTL.

Figure: MT-DNN architecture.
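
The split between shared lower layers and task-specific top layers can be illustrated with a small, dependency-free Python sketch. This is not the MT-DNN code: the class names (`SharedEncoder`, `ClassificationHead`, `ScoringHead`, `MultiTaskModel`), the toy vocabulary, and the averaging step that stands in for the transformer encoder are all hypothetical, chosen only to show how one encoder can feed several task heads.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible


def make_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SharedEncoder:
    """Hypothetical stand-in for the shared layers l_1 and l_2."""

    def __init__(self, vocab, dim):
        self.dim = dim
        self.embed = {w: [random.uniform(-0.1, 0.1) for _ in range(dim)] for w in vocab}
        self.unknown = [0.0] * dim
        self.mix = make_matrix(dim, dim)

    def l1(self, tokens):
        # l_1: one embedding vector per word.
        return [self.embed.get(t, self.unknown) for t in tokens]

    def l2(self, word_vectors):
        # l_2: toy "contextual" vectors. Each word is combined with the mean of
        # the whole sequence; the real model uses a transformer encoder here.
        n = len(word_vectors)
        mean = [sum(v[i] for v in word_vectors) / n for i in range(self.dim)]
        mixed = [matvec(self.mix, [a + b for a, b in zip(v, mean)]) for v in word_vectors]
        return [[math.tanh(h) for h in vec] for vec in mixed]

    def encode(self, tokens):
        return self.l2(self.l1(tokens))


class ClassificationHead:
    """Task-specific layer producing label probabilities."""

    def __init__(self, dim, num_labels):
        self.weights = make_matrix(num_labels, dim)

    def __call__(self, contextual):
        # Assumption: the first token's vector summarizes the whole input.
        return softmax(matvec(self.weights, contextual[0]))


class ScoringHead:
    """Task-specific layer producing one score (similarity or relevance)."""

    def __init__(self, dim):
        self.weights = make_matrix(1, dim)

    def __call__(self, contextual):
        return matvec(self.weights, contextual[0])[0]


class MultiTaskModel:
    """One shared encoder, one head per task."""

    def __init__(self, encoder, heads):
        self.encoder = encoder
        self.heads = heads

    def forward(self, task, tokens):
        return self.heads[task](self.encoder.encode(tokens))


vocab = ["[CLS]", "[SEP]", "a", "man", "sleeps", "person", "rests"]
model = MultiTaskModel(
    SharedEncoder(vocab, dim=8),
    {"nli": ClassificationHead(8, num_labels=3), "similarity": ScoringHead(8)},
)
pair = ["[CLS]", "a", "man", "sleeps", "[SEP]", "a", "person", "rests", "[SEP]"]
probs = model.forward("nli", pair)         # three label probabilities
score = model.forward("similarity", pair)  # a single real-valued score
```

Both calls pass through the same `SharedEncoder` instance, so any update to its weights would affect every task, while each head keeps its own parameters. The sketch only runs a forward pass; initializing the encoder from BERT and refining it with MTL are not shown.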

Domain adaptation results

One way to evaluate how universal the language embeddings are is to measure how fast the embeddings can be adapted to a new task, or how many task-specific labels are needed to get a reasonably good result on the new task. More universal embeddings require fewer task-specific labels.

The authors of the MT-DNN paper compared MT-DNN with BERT in domain adaptation, where both models are adapted to a new task by gradually increasing the amount of in-domain data used for adaptation. The results on the SNLI and SciTail tasks are presented in the following table and figure. With only 0.1% of the in-domain data (549 samples in SNLI and 23 samples in SciTail), MT-DNN achieves over 80% accuracy while BERT’s accuracy is around 50%, demonstrating that the language embeddings learned by MT-DNN are substantially more universal than those of BERT.

MT-DNN accuracy compared with BERT on the SNLI and SciTail datasets.
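
The sample counts quoted above follow from simple arithmetic on the training-set sizes. The short Python sketch below reproduces them under two assumptions that this post does not state: the training-set sizes (549,367 examples for SNLI and 23,596 for SciTail, as commonly reported for these datasets) and the schedule of 0.1%, 1%, 10% and 100% used to illustrate "gradually increasing" the data. The function names are hypothetical.

```python
# Assumed training-set sizes (commonly reported figures, not given in this post).
ASSUMED_TRAIN_SIZES = {"SNLI": 549_367, "SciTail": 23_596}


def subset_size(total, per_mille):
    """Samples in a fraction of the data, given in tenths of a percent, rounded down."""
    return total * per_mille // 1000


def adaptation_schedule(total, per_mille_steps=(1, 10, 100, 1000)):
    """Subset sizes for 0.1%, 1%, 10% and 100% of the data (an assumed schedule)."""
    return [subset_size(total, step) for step in per_mille_steps]


schedules = {name: adaptation_schedule(size) for name, size in ASSUMED_TRAIN_SIZES.items()}

for name, sizes in schedules.items():
    print(name, sizes)
# SNLI [549, 5493, 54936, 549367]
# SciTail [23, 235, 2359, 23596]
```

Integer arithmetic is used so that rounding is explicit: 0.1% of each assumed training set rounds down to the 549 and 23 samples mentioned in the text.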

Release news

Microsoft will release the MT-DNN package to the public. The release package contains the pre-trained models, the source code, and a README that describes, step by step, how to reproduce the results reported in the MT-DNN paper and how to adapt the pre-trained MT-DNN models to new tasks via domain adaptation. We welcome your comments and feedback and look forward to future developments!