Microsoft Research Blog

Towards universal language embeddings

March 18, 2019 | By Jianfeng Gao, Partner Research Manager

Language embedding is a process of mapping symbolic natural language text (for example, words, phrases and sentences) to semantic vector representations. This is fundamental to deep learning approaches to natural language understanding (NLU). It is highly desirable to learn language embeddings that are universal to many NLU tasks.

Two popular approaches to learning language embeddings are language model pre-training and multi-task learning (MTL). The former learns universal language embeddings from large amounts of unlabeled data. MTL, in contrast, leverages supervised data from many related tasks and benefits from a regularization effect: because it alleviates overfitting to any specific task, the learned embeddings are more universal across tasks.

Researchers at Microsoft have released MT-DNN, a Multi-Task Deep Neural Network model for learning universal language embeddings. MT-DNN combines the strengths of MTL and the language model pre-training of BERT. It outperforms BERT on 10 NLU tasks, setting new state-of-the-art results across many popular NLU benchmarks, including General Language Understanding Evaluation (GLUE), Stanford Natural Language Inference (SNLI) and SciTail.

MT-DNN architecture

MT-DNN extends the model proposed by Microsoft in 2015 by incorporating a pre-trained bidirectional transformer language model, known as BERT, developed by Google AI. The architecture of the MT-DNN model is illustrated in the following figure. The lower layers are shared across all tasks while the top layers are task-specific. The input X, either a sentence or a pair of sentences, is first represented as a sequence of embedding vectors, one for each word, in l_1. Then the transformer-based encoder captures the contextual information for each word and generates the shared contextual embedding vectors in l_2. Finally, for each task, additional task-specific layers generate task-specific representations, followed by operations necessary for classification, similarity scoring, or relevance ranking. MT-DNN initializes its shared layers using BERT, then refines them via MTL.
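
The layering described above can be made concrete with a toy sketch. This is not the MT-DNN implementation: the mean-mixing step is only a stand-in for the transformer encoder, the weights are random rather than initialized from BERT, and every name here (`SharedEncoder`, `ClassificationHead`, `SimilarityHead`, `DIM`, the use of the first token as a sequence summary) is a hypothetical choice made for illustration. What it does show is the structure: l_1 and l_2 are computed once by shared layers, and each task reads the same l_2 through its own head.

```python
import math
import random

random.seed(0)  # deterministic toy weights

DIM = 4  # hypothetical embedding width; real models use far more dimensions


def rand_matrix(rows, cols):
    """Random weight matrix as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class SharedEncoder:
    """Stand-in for the shared lower layers (l_1 and l_2)."""

    def __init__(self, vocab):
        self.table = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in vocab}
        self.unknown = [0.0] * DIM

    def lexicon(self, tokens):
        # l_1: one embedding vector per word.
        return [self.table.get(t, self.unknown) for t in tokens]

    def contextual(self, l1):
        # l_2: toy context mixing (each vector is blended with the sequence
        # mean). This is NOT a transformer, only a placeholder for one.
        mean = [sum(col) / len(l1) for col in zip(*l1)]
        return [[0.5 * x + 0.5 * m for x, m in zip(vec, mean)] for vec in l1]

    def encode(self, tokens):
        return self.contextual(self.lexicon(tokens))


class ClassificationHead:
    """Task-specific layer: softmax over class logits."""

    def __init__(self, num_classes):
        self.weights = rand_matrix(num_classes, DIM)

    def __call__(self, l2):
        logits = matvec(self.weights, l2[0])  # assumption: first token summarizes X
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


class SimilarityHead:
    """Task-specific layer: a single real-valued similarity score."""

    def __init__(self):
        self.weights = rand_matrix(1, DIM)

    def __call__(self, l2):
        return matvec(self.weights, l2[0])[0]


encoder = SharedEncoder(vocab=["[CLS]", "a", "man", "sleeps", "[SEP]", "person", "rests"])
heads = {"nli": ClassificationHead(num_classes=3), "sts": SimilarityHead()}

tokens = ["[CLS]", "a", "man", "sleeps", "[SEP]", "a", "person", "rests"]
shared = encoder.encode(tokens)   # shared contextual embeddings, computed once
nli_probs = heads["nli"](shared)  # class probabilities for a classification task
sts_score = heads["sts"](shared)  # scalar score for a similarity task
print(len(shared), nli_probs, sts_score)
```

Because the heads only consume `shared`, adding a new task in this sketch means adding one more entry to `heads`; the encoder is untouched, which is the property that lets the shared layers be refined jointly across tasks.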

Figure: MT-DNN architecture.

Domain adaptation results

One way to evaluate how universal the language embeddings are is to measure how quickly they can be adapted to a new task, that is, how many task-specific labels are needed to get a reasonably good result on the new task. More universal embeddings require fewer task-specific labels.

The authors of the MT-DNN paper compared MT-DNN with BERT in domain adaptation, where both models are adapted to a new task using gradually increasing amounts of in-domain data. The results on the SNLI and SciTail tasks are presented in the following table and figure. With only 0.1% of the in-domain data (which amounts to 549 samples in SNLI and 23 samples in SciTail), MT-DNN achieves more than 80% accuracy while BERT’s accuracy is around 50%, demonstrating that the language embeddings learned by MT-DNN are substantially more universal than those of BERT.
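
To make the 0.1% figure concrete, here is a minimal sketch of how such adaptation subsets could be drawn. The training-set sizes in `TRAIN_SIZES` are assumptions taken from the public dataset releases, not from this post; the fraction schedule in `FRACTIONS` beyond 0.1%, the rounding rule (floor), and the helper names `subset_size` and `sample_subset` are likewise hypothetical.

```python
import math
import random

# Assumed training-set sizes (from the public dataset releases, not this post).
TRAIN_SIZES = {"SNLI": 549_367, "SciTail": 23_596}

# Hypothetical schedule of increasing in-domain data fractions.
FRACTIONS = [0.001, 0.01, 0.1, 1.0]


def subset_size(total, fraction):
    """Number of examples in a subset, rounding down (an assumed convention)."""
    return math.floor(total * fraction)


def sample_subset(examples, fraction, seed=0):
    """Draw a random subset of the training examples without replacement."""
    rng = random.Random(seed)
    return rng.sample(examples, subset_size(len(examples), fraction))


for name, total in TRAIN_SIZES.items():
    sizes = [subset_size(total, f) for f in FRACTIONS]
    print(name, sizes)
```

Under these assumptions, the smallest fraction yields the 549 and 23 samples quoted above. Each model would then be fine-tuned on every subset in turn and scored on the same held-out set, so that the accuracy curves differ only in how much in-domain data was used.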

MT-DNN accuracy compared with BERT on the SNLI and SciTail datasets.

Release news

Microsoft will release the MT-DNN package to the public. The release package contains the pre-trained models, the source code, and a README that describes, step by step, how to reproduce the results reported in the MT-DNN paper and how to adapt the pre-trained MT-DNN models to new tasks via domain adaptation. We welcome your comments and feedback and look forward to future developments!
