Microsoft Research Blog

Towards universal language embeddings

March 18, 2019 | By Jianfeng Gao, Partner Research Manager

Language embedding is the process of mapping symbolic natural language text (for example, words, phrases, and sentences) to semantic vector representations. It is fundamental to deep learning approaches to natural language understanding (NLU), and it is highly desirable to learn language embeddings that are universal across many NLU tasks.

Two popular approaches to learning language embeddings are language model pre-training and multi-task learning (MTL). While the former learns universal language embeddings from large amounts of unlabeled data, the latter leverages supervised data from many related tasks and benefits from a regularization effect that alleviates overfitting to any single task, thus making the learned embeddings universal across tasks.

Researchers at Microsoft have released MT-DNN—a Multi-Task Deep Neural Network model for learning universal language embeddings. MT-DNN combines the strengths of MTL and the language model pre-training of BERT, and outperforms BERT on 10 NLU tasks, creating new state-of-the-art results across many popular NLU benchmarks, including General Language Understanding Evaluation (GLUE), Stanford Natural Language Inference (SNLI), and SciTail.

MT-DNN architecture

MT-DNN extends the model proposed by Microsoft in 2015 by incorporating a pre-trained bidirectional transformer language model, known as BERT, developed by Google AI. The architecture of the MT-DNN model is illustrated in the following figure. The lower layers are shared across all tasks while the top layers are task-specific. The input X, either a sentence or a pair of sentences, is first represented as a sequence of embedding vectors, one for each word, in l_1. Then the transformer-based encoder captures the contextual information for each word and generates the shared contextual embedding vectors in l_2. Finally, for each task, additional task-specific layers generate task-specific representations, followed by operations necessary for classification, similarity scoring, or relevance ranking. MT-DNN initializes its shared layers using BERT, then refines them via MTL.
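The shared-versus-task-specific split described above can be sketched in a few lines of plain Python. This is a hypothetical toy stand-in, not the released implementation: in the real model the shared layers are a BERT transformer encoder and each "update" is a gradient step, but the bookkeeping shows the key property of MTL training, namely that every batch updates the shared layers while only same-task batches update a given task head.

```python
import random


class SharedEncoderMTLSketch:
    """Toy sketch of MT-DNN-style multi-task training (hypothetical names).

    Shared layers are updated on every batch; each task-specific head is
    updated only on batches drawn from its own task.
    """

    def __init__(self, tasks):
        self.shared_updates = 0
        self.head_updates = {task: 0 for task in tasks}

    def train_step(self, task, batch):
        # In the real model: forward through the shared transformer layers
        # (l_1, l_2), then through the task-specific head; backpropagation
        # updates both. Here we only count the updates.
        self.shared_updates += 1
        self.head_updates[task] += 1


def mtl_epoch(model, task_batches, seed=0):
    """Merge mini-batches from all tasks, shuffle them, and train on each,
    mirroring the epoch structure of MTL training."""
    merged = [(task, b) for task, batches in task_batches.items()
              for b in batches]
    random.Random(seed).shuffle(merged)
    for task, batch in merged:
        model.train_step(task, batch)
```

For example, an epoch over three SNLI batches and one SciTail batch performs four shared-layer updates but only one SciTail-head update, which is how the shared embeddings come to be shaped by all tasks at once.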

MT-DNN architecture

Domain Adaptation Results

One way to evaluate how universal the language embeddings are is to measure how fast the embeddings can be adapted to a new task, or how many task-specific labels are needed to get a reasonably good result on the new task. More universal embeddings require fewer task-specific labels.
The authors of the MT-DNN paper compared MT-DNN with BERT in domain adaptation, where both models are adapted to a new task by gradually increasing the amount of in-domain data used for adaptation. The results on the SNLI and SciTail tasks are presented in the following table and figure. With only 0.1% of in-domain data (which amounts to 549 samples in SNLI and 23 samples in SciTail), MT-DNN achieves over 80% accuracy while BERT's accuracy is around 50%, demonstrating that the language embeddings learned by MT-DNN are substantially more universal than those of BERT.
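As a sanity check on those sample counts, the quoted figures follow from the standard training-split sizes of the two datasets (549,367 examples for SNLI and 23,596 for SciTail; these sizes are an assumption here, taken from the datasets' published splits rather than from this post):

```python
def adaptation_subset_size(n_train: int, fraction: float) -> int:
    """In-domain examples available when adapting on a given fraction
    of the training split (truncating, as sample counts must be whole)."""
    return int(n_train * fraction)


# Published training-split sizes (assumed, not stated in the post).
SNLI_TRAIN = 549_367
SCITAIL_TRAIN = 23_596

for frac in (0.001, 0.01, 0.1, 1.0):
    print(f"{frac:.1%} of training data -> "
          f"SNLI: {adaptation_subset_size(SNLI_TRAIN, frac)}, "
          f"SciTail: {adaptation_subset_size(SCITAIL_TRAIN, frac)}")
```

At the 0.1% setting this yields exactly the 549 SNLI and 23 SciTail samples cited above.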

MT-DNN accuracy compared with BERT across the SNLI and SciTail datasets.

Release news

Microsoft will release the MT-DNN package to the public. The release package contains the pretrained models, the source code, and a Readme that describes step by step how to reproduce the results reported in the MT-DNN paper, and how to adapt the pre-trained MT-DNN models to new tasks via domain adaptation. We welcome your comments and feedback and look forward to future developments!
