Language is the holy grail of Artificial Intelligence. The progress of Natural Language Processing (NLP) technologies will push the entire AI field forward. Here’s a view into what’s next.
Since the inception of Microsoft Research Asia, NLP has been a key research area in the field of Artificial Intelligence (AI). In the past 20 years, Microsoft Research Asia has developed NLP technologies, including those which have been shipped in Windows, Office, Bing, Microsoft Cognitive Services, Xiaoice, and Cortana.
This past work, which includes research in deep learning applied to machine translation, extractive machine reading comprehension, and grammar check, has achieved parity with human performance on related evaluation tasks.
So what’s next? We believe that the next 10 years will be a golden era in NLP development, for the following reasons:
- Big data will become more easily collected, processed, and archived.
- NLP technology will extend into new applications for search engines, customer support, business intelligence, translation, education, law, finance, and more.
- Robot and IoT requirements will increasingly include text, speech, and vision capabilities.
These trends will stimulate large-scale investment in NLP and attract more talent to work in NLP research and development.
Areas of focus for the next generation of NLP research will include:
- Integrating knowledge and common sense into data-driven learning approaches.
- More attention to low-resource NLP tasks.
- Contextualized modelling and multi-turn dialogue understanding.
- Semantic analysis, leading to NLP that is knowledge-based, grounded in common sense, and explainable.
Why NLP research is key
Natural Language Understanding (NLU) is a research area that uses the computer to analyze and extract key information from natural language sentences and texts, and then perform information retrieval, question-answering, machine translation, and text-generation activities. It is central to progress in many areas of AI, because the goal of AI overall is to make computers and smart devices listen, speak, and understand language; be able to think and solve problems; and even be able to create new things.
Recent progress in NLP includes:
Neural machine translation
Neural machine translation is a process of simulating how a human brain translates.
The task of translation is to convert a source language sentence into a target language sentence and retain the original meaning. When translating, human brains first try to understand the sentence, then form a semantic representation of the sentence in the mind, and finally transform this semantic representation into a sentence in another language. Neural machine translation simulates this human translation process, through two modular processes—encoding and decoding. The encoder is responsible for compressing source language sentences into vector representations in the semantic space, which are expected to contain the semantic information of source language sentences. The decoder generates semantically equivalent sentences of the target language based on semantic vectors provided by the encoder.
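To make the encoder and decoder roles concrete, here is a minimal sketch of the data flow. It is not a real translation system: the class name `ToySeq2Seq`, the plain recurrent update, and the random, untrained weights are all assumptions made for illustration, so the output tokens carry no meaning. It only shows how the source sentence is compressed into one vector and how the decoder generates target tokens from that vector.

```python
import math
import random


def _matrix(rows, cols, rng):
    """Random weight matrix; a trained system would learn these values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def _matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def _rnn_step(w_in, w_rec, x, h):
    """One recurrent update: h' = tanh(W_in x + W_rec h)."""
    return [math.tanh(a + b) for a, b in zip(_matvec(w_in, x), _matvec(w_rec, h))]


class ToySeq2Seq:
    """Hypothetical, untrained encoder-decoder used only to show the data flow."""

    def __init__(self, src_vocab, tgt_vocab, dim=8, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.tgt_vocab = list(tgt_vocab) + ["<eos>"]
        self.src_emb = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)]
                        for w in src_vocab}
        self.tgt_emb = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)]
                        for w in ["<bos>"] + self.tgt_vocab}
        self.enc_in, self.enc_rec = _matrix(dim, dim, rng), _matrix(dim, dim, rng)
        self.dec_in, self.dec_rec = _matrix(dim, dim, rng), _matrix(dim, dim, rng)
        self.out = _matrix(len(self.tgt_vocab), dim, rng)

    def encode(self, src_tokens):
        """Compress the source sentence into a single vector."""
        h = [0.0] * self.dim
        for tok in src_tokens:
            h = _rnn_step(self.enc_in, self.enc_rec, self.src_emb[tok], h)
        return h

    def decode(self, context, max_len=10):
        """Greedily emit target tokens, starting from the encoder's vector."""
        h, prev, output = context, "<bos>", []
        for _ in range(max_len):
            h = _rnn_step(self.dec_in, self.dec_rec, self.tgt_emb[prev], h)
            scores = _matvec(self.out, h)
            prev = self.tgt_vocab[scores.index(max(scores))]
            if prev == "<eos>":
                break
            output.append(prev)
        return output

    def translate(self, src_tokens):
        return self.decode(self.encode(src_tokens))
```

In a real system both halves are trained jointly on parallel sentences, which is what makes the intermediate vector a useful semantic representation.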
The advantage of the neural machine translation model lies in three aspects: end-to-end training, which reduces error propagation among multiple sub-models; the distributed representation of information, which can automatically learn multi-dimensional translation knowledge; and the use of global context information to complete the translation, rather than just using local context. Recurrent neural machine translation is an important foundational model, on which many improvements have been built through either more advanced network structures or novel training methods.
The translation quality of neural machine translation systems keeps improving, with the goal of reaching human-level performance. In 2018, the Chinese-English machine translation system, developed by Microsoft Research Asia in collaboration with the Microsoft Translator product team, reached a translation quality level comparable to human professional translation on the WMT 2017 news test dataset. This system combines four advanced technologies proposed by Microsoft Research Asia: joint training and dual learning, which efficiently use large-scale monolingual training data to improve model training; agreement regularization, which addresses the issue of exposure bias; and a deliberation network, which improves translation quality through two-pass translation that simulates the human translation process.
Human-computer interaction
Human-computer interaction (HCI) aims to build machine intelligence that can communicate with humans by using natural language. Conversation as a Platform (CaaP) is one of the most important concepts for this.
CaaP is a concept proposed by Microsoft CEO Satya Nadella in 2016. Nadella holds that conversation will become the next-generation interface, bringing progress to both the artificial intelligence and device fields.
This concept is important for two reasons. First, conversation-centered apps, such as WeChat and Facebook, have become part of everyone’s life, setting up our expectations for future HCI platforms. Second, a large portion of devices have small screens (such as cell phones) or even no screen (such as some IoT devices). On such devices, natural language presents the most straightforward and natural form of communication. Today, HCI using conversational systems can complete tasks such as buying coffee and booking tickets, and there are several CaaP platforms available for developers around the world to build their own conversation-based HCI systems.
In general, the technologies used for building such HCI systems can be divided into three layers: the chat layer, the search and question/answer (QA) layer, and the task-completion layer. The chat layer, such as Xiaoice, provides chat capability, which can make an HCI system more conversational and increase user engagement. The search and QA layer, such as Bing, provides search and question answering capabilities, which can return accurate search results or provide answers to a user’s questions. The task-completion layer, represented by Cortana, provides task-oriented dialogue capability that can help users complete specific tasks such as restaurant reservations, hotel booking, or weather inquiry; and once it gets to know your personal habits, it can remind you of meetings or suggest somewhere to go. A bot with these three layers can provide a natural and useful user experience.
Machine reading comprehension
Teaching a machine to read and comprehend text is a very important research challenge in natural language understanding.
The goal of machine reading comprehension is to design systems that can answer questions after reading a passage or document. There is a wide range of applications for this, including the ability for search engines to provide intelligent and accurate answers for natural-language queries by reading the relevant documents on the Web. Machine reading comprehension can also be used in personal assistants, such as Cortana, so that Cortana can help answer customer support questions after reading documents (such as user manuals or product descriptions). It can also be used in work environments to help users read and process emails or business documents, and then summarize the relevant information. In the education domain, machine reading comprehension can be used to design tests. In legal circles, it can be used to help lawyers or judges by reading and understanding legal questions. In financial applications, machine reading comprehension can be used to extract information for making better financial decisions.
The recent advances in machine reading comprehension have been furthered by the use of large-scale, manually annotated datasets. The Stanford Question Answering Dataset (SQuAD) is the most widely used benchmark dataset for machine reading comprehension research. Stanford released SQuAD in July 2016 and it consists of 100,000 human-labeled question and answer pairs. The passages in SQuAD are from Wikipedia articles and each passage is annotated with no more than five questions, with answers that are exact sub-spans of each passage. Stanford divides the dataset into training, development, and test sets. The training set and development set are publicly available, while the test set is hidden from both researchers and participants. Participants need to submit their systems to the Stanford team to obtain the results on the test set, which will be updated on the SQuAD leaderboard. As of November 2018, there were more than 100 entries from academic and industry research labs.
The leaderboard indicates that there has been great progress in machine reading comprehension research in the last two years. In January 2018, the R-net system from Microsoft Research Asia was the first system to exceed human parity on the SQuAD dataset in terms of the Exact Match (EM) metric. In early 2018, systems from Alibaba and iFLYTEK also exceeded human parity on EM. In September 2018, nLnet, a system from Microsoft Research Asia, became the first to exceed human parity on both EM and F1 on the SQuAD dataset. Google’s BERT then became the leader.
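For readers unfamiliar with the two metrics, the sketch below shows how EM and F1 are typically computed for span-style answers. It is written in the spirit of the public SQuAD v1.1 evaluation script; the function names and the exact normalization rules here (lowercasing, stripping punctuation and English articles) are our approximation, not the official code.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the predicted span and one gold span."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(metric, prediction, gold_answers):
    """Each question may have several gold answers; keep the best score."""
    return max(metric(prediction, gold) for gold in gold_answers)
```

EM rewards only a perfect span, while F1 gives partial credit: predicting "Eiffel Tower in Paris" against the gold answer "the Eiffel Tower" scores 0 on EM but 2/3 on F1. A system's reported score is the average of these per-question values over the test set.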
The SQuAD dataset provides a great platform and testing ground for the whole research community to develop, verify, and accumulate techniques to benefit the broader research effort in NLP. The technology stacks behind the recent progress of research on machine reading comprehension include end-to-end neural machine reading comprehension models; pretrained models, such as ELMo from AI2 and BERT from Google AI, for machine reading comprehension and natural language processing; and system innovations on network structures, automatic data augmentation, and implementation.
AI creation
Infusing AI into creation processes and democratizing creation for ordinary people.
As early as 2005, Microsoft Research Asia successfully developed the Microsoft Couplet system, with the proposal and support of Dr. Harry Shum, who at the time was director of the lab. Given the user’s input of the first line of a couplet, the system can automatically generate the second sentence of a couplet, as well as the streamer description.
After that, we developed two intelligent AI creation systems: Metrical Poetry and Chinese Character Riddles. For example, in Chinese Character Riddles, the system is able to both solve and generate riddles based on Chinese characters.
In 2017, Microsoft Research Asia developed a system for writing modern poetry and composing music (including lyric generation and melody composition). This song generation system has participated in CCTV 1’s AI program (Machine vs. Human Intelligence). All of this shows that deep learning technology and big data have great potential for mimicking a human’s ability to create, and that they can be used to help artists and others to create.
Taking the capability of lyrics generation as an example, the system will first generate a topic before writing the lyrics. For instance, if the user would like to write a song related to “autumn,” “sundown,” and “sigh with feeling,” the user can add keywords such as “autumn wind,” “flowing year,” “gleaming,” “changing,” and so on. Sequence-to-sequence neural networks are then used to generate the sentences in the lyrics line by line, under the constraint of the topics.
To compose a melody for the lyrics, the system must consider not only the quality of the melody but also the correspondence between the lyrics and the melody: each note must correspond to a word in the lyrics. Given the lyrics as input, we simultaneously generate the lyric-conditional melody and the exact alignment between the generated melody and the given lyrics. More specifically, we develop the melody composition model based on the sequence-to-sequence framework, which is able to jointly produce musical notes and the corresponding alignment.
Hot NLP topics
We summarize the latest NLP technologies into five hot topics:
Hot topic 1: Pre-trained models (or representations)
How machines learn more general and effective pre-trained models (or representations) will continue to be one of the hottest research topics in the NLP area.
One major difficulty faced by many natural language tasks is the limited amount of training data. Today’s researchers are investigating how to learn general and effective pre-trained representations for language understanding, where words and text are represented as vectors. These are useful when task-specific training data are limited.
A Neural Probabilistic Language Model is a foundational work in neural language modeling. In this work, word embeddings are further fed into a neural sequence encoder to encode contextual information. Following this direction, many works, such as Word2vec and GloVe, emerged to further improve the quality of learned word embeddings. One drawback of word embeddings is their lack of context sensitivity: the representation of a word is the same regardless of the context it appears in. Work by Peters et al. on ELMo reveals that context-sensitive representations are already built by the neural language model. Instead of only using word embeddings, ELMo also leverages the sequence encoder from the language model, and the resulting context-sensitive representations bring drastic improvements over traditional word embedding methods. More recently, BERT uses a transformer-based encoder and a masked word approach to train a very large bidirectional representation from a large amount of text, which again brings astounding gains in a variety of tasks.
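The masked word approach can be sketched as a data-preparation step: hide some tokens and ask the model to recover them from the context on both sides. The function below is an illustrative simplification, not BERT's actual preprocessing code. The name `mask_tokens` and the independent per-token sampling are our assumptions. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token follow the recipe described in the BERT paper.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked-word training.

    labels[i] holds the original token at every position the model must
    predict, and None elsewhere.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                         # position is not selected
        labels[i] = token                    # model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK                 # usually: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)    # sometimes: a random token
        # otherwise: keep the original token in place
    return inputs, labels


if __name__ == "__main__":
    sentence = "the bank raised interest rates again".split()
    print(mask_tokens(sentence, vocab=sorted(set(sentence)), seed=3))
```

Because the prediction target can sit anywhere in the sentence, the encoder is free to use both left and right context, which is what makes the learned representation bidirectional.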
In the future, it is worth investigating new network structures, lightweight approaches, as well as incorporating world knowledge and common-sense knowledge to learn general pre-trained representations for language understanding. It is also interesting to see if further scaling up the model size and training on more text can bring further improvements.
Hot topic 2: Transfer learning and multi-task learning
Transfer learning has important and practical significance to NLP tasks that lack enough training data. Multi-task learning uses common knowledge from multiple task supervisions and improves model generalization.
In the era of deep learning, different NLP tasks often share encoders that have a homogeneous network structure, such as an RNN, CNN, or transformer. This makes transfer learning more practical and straightforward. Using pre-trained word embeddings such as Word2vec, ELMo, or BERT, we employ a type of transfer learning in which the knowledge (word embeddings) learnt from a large-scale corpus via a language model is transferred to downstream tasks directly, by initializing the corresponding network layers of downstream task models. Such methods are important for tasks with little training data.
Multi-task learning is another paradigm that can use different task supervisions to improve a target task, by learning common knowledge from all involved tasks. In 2008, Collobert and Weston proposed a deep learning-based, multi-task framework, and it was the first work to combine deep learning and multi-task learning for NLP. In 2018, McCann et al. proposed another multi-task learning framework, which treats all involved tasks as question-answering tasks and trains a unified model for ten NLP tasks. Experiments show that all tasks can benefit by using the common knowledge learnt from different task supervisions. Based on such common knowledge, each specific task can be further fine-tuned.
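For static embeddings such as Word2vec, "initializing corresponding network layers" amounts to copying pre-trained vectors into the downstream model's embedding table. The sketch below shows that step under simple assumptions: `pretrained` is a plain dictionary from word to vector, and the function name and the small random range used for unseen words are hypothetical choices. Contextual models such as ELMo and BERT transfer whole encoders rather than a lookup table, which this sketch does not cover.

```python
import random


def init_embedding_table(task_vocab, pretrained, dim, seed=0):
    """Build the downstream model's embedding table.

    Words found in `pretrained` start from the transferred vector; words
    the pre-trained model never saw get a small random vector instead.
    Returns (table, hits), where table[i] is the vector for task_vocab[i].
    """
    rng = random.Random(seed)
    table, hits = [], 0
    for word in task_vocab:
        vector = pretrained.get(word)
        if vector is not None and len(vector) == dim:
            table.append(list(vector))      # transferred knowledge
            hits += 1
        else:
            table.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return table, hits


if __name__ == "__main__":
    # Toy 3-dimensional vectors, made up for the example.
    pretrained = {"good": [0.2, 0.7, -0.1], "bad": [-0.3, -0.6, 0.1]}
    table, hits = init_embedding_table(["good", "bad", "meh"], pretrained, dim=3)
    print(f"{hits} of {len(table)} rows initialized from pre-trained vectors")
```

After initialization, the table can either be frozen or fine-tuned together with the rest of the downstream model; which works better depends on how much task data is available.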
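One way to picture the "every task is question answering" idea is as a data format: each training example, whatever its task, becomes a (question, context, answer) triple, so a single model can be trained on a mixture of tasks. The sketch below illustrates only that casting step. The class and function names are hypothetical, and the question wordings are illustrative rather than quoted from the framework.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class QAExample:
    question: str
    context: str
    answer: str


def from_translation(source, target, src_lang, tgt_lang):
    question = f"What is the translation from {src_lang} to {tgt_lang}?"
    return QAExample(question, source, target)


def from_sentiment(review, label):
    return QAExample("Is this review negative or positive?", review, label)


def from_summarization(document, summary):
    return QAExample("What is the summary?", document, summary)


def interleave(*task_datasets):
    """Mix examples from all tasks round-robin into one training stream."""
    mixed = []
    for group in zip_longest(*task_datasets):
        mixed.extend(example for example in group if example is not None)
    return mixed


if __name__ == "__main__":
    stream = interleave(
        [from_translation("Guten Morgen", "Good morning", "German", "English")],
        [from_sentiment("A moving, well-acted film.", "positive"),
         from_sentiment("Two hours I will never get back.", "negative")],
    )
    for example in stream:
        print(example)
```

Because every example shares one input and output format, the same encoder and decoder receive supervision from all tasks, which is where the shared "common knowledge" comes from.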
Hot topic 3: Knowledge and common sense
How to utilize knowledge and common sense in natural language understanding has become one of the most important topics in NLP.
With the rapid development of HCI engines (such as chat, QA, and dialogue systems), how to utilize knowledge and common sense in natural language understanding has become one of the most important topics in NLP, as they are essential for conversation engines or other types of HCI engines to understand user queries, manage conversations, and generate responses.
Wikipedia and knowledge graphs (such as Freebase and Satori) are two types of commonly used knowledge bases. Machine reading comprehension (MRC) is a typical NLP task based on Wikipedia, where the MRC model aims to extract an answer from the passage based on the input question. Semantic parsing is another typical NLP task based on a knowledge graph, which aims to convert an input question into a machine-readable and executable logical form. Both tasks are hot topics in NLP.
Commonsense knowledge refers to those facts that all humans are expected to know, such as that lemons are sour and that an elephant is bigger than a butterfly. Many HCI tasks, like QA and dialogue, need common sense to reason and generate responses. However, as most commonsense knowledge is rarely explicitly expressed in textual corpora, NLP models cannot use such knowledge directly. With the rapid development of chat, dialogue, and QA engines, how to build large-scale commonsense knowledge bases and apply them to various NLP tasks has been explored by many researchers in the last two decades.
Hot topic 4: Low-resource NLP tasks
Data augmentation methods are popularly used to enrich the data resource for low-resource NLP tasks, such as introducing domain knowledge (dictionaries and rules) and labeling more useful data with active learning.
For some NLP tasks, such as rare-language translation, domain-specific chatbots and customer service systems, and multi-turn tasks, labeled data is hard to acquire and data sparseness becomes a serious problem. These are called low-resource NLP tasks. To enrich the training data, many data augmentation methods can be used. For example, we can introduce domain knowledge (dictionaries and rules) or leverage active learning to maximize the gain of labeling data. Researchers can also employ semi-supervised and unsupervised methods to use the unlabeled data. Labeled data from other tasks and other languages can also be used with multi-task learning and transfer learning.
Taking machine translation as an example, some rare language translation tasks only have a bilingual dictionary for model training, without any bilingual corpus. Based on this small dictionary of only a few thousand entries, cross-lingual word embedding methods can be used to map the source words and the target words into one semantic space, leveraging a large monolingual corpus. In this semantic space, the source word and the corresponding target word have similar word representations. Based on the cross-lingual word embedding, we can compute the semantic similarity of source and target words, which is used to build a word-based translation table. Together with a trained language model, we can build word-based statistical machine translation (SMT) systems, which are used to translate the monolingual corpus into a pseudo-bilingual corpus and turn the unsupervised translation task into a supervised one. Leveraging the pseudo-bilingual corpus, source-to-target and target-to-source neural translation models can be initialized and boosted with each other by using joint training methods and the large, monolingual corpus.
To improve the translation performance of rare languages, we also propose leveraging the large bilingual corpus between rich languages to boost four translation models for the rare language in a joint generalized EM training framework. Given two rich languages, such as X (Chinese) and Y (English), the rare language Z (such as Hebrew) is treated as a hidden state between X and Y. The translation process from X to Y can be redefined as translating X to Z first, and then translating from Z to Y, and similarly for the direction from Y to X. Based on this, we can use the large bilingual data between X and Y to jointly train four translation models, P(Z|X), P(Y|Z), P(Z|Y), and P(X|Z), with the widely used generalized EM training in an iterative process.
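The step from cross-lingual embeddings to a word-based translation table can be sketched as follows. This is a minimal illustration that assumes the two vocabularies have already been mapped into one space. The function names, the `top_k` cutoff, and the temperature-scaled softmax that turns similarities into probabilities are our assumptions, and the two-dimensional vectors in the usage example are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_translation_table(src_emb, tgt_emb, top_k=2, temperature=0.1):
    """For each source word, keep its top_k nearest target words.

    Similarities are converted to probabilities with a softmax over the
    kept candidates, so each source word's entries sum to 1.
    """
    table = {}
    for src_word, src_vec in src_emb.items():
        scored = sorted(
            ((cosine(src_vec, tgt_vec), tgt_word)
             for tgt_word, tgt_vec in tgt_emb.items()),
            reverse=True,
        )[:top_k]
        weights = [math.exp(sim / temperature) for sim, _ in scored]
        total = sum(weights)
        table[src_word] = [(tgt_word, weight / total)
                           for (_, tgt_word), weight in zip(scored, weights)]
    return table


if __name__ == "__main__":
    # Toy vectors in a shared space, invented for the example.
    src = {"gato": [0.9, 0.1], "perro": [0.1, 0.9]}
    tgt = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "house": [0.7, 0.7]}
    for word, candidates in build_translation_table(src, tgt).items():
        print(word, candidates)
```

A table like this plays the role of the translation model in a word-based SMT system, while the target-side language model decides which candidate fits the surrounding sentence.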
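One way to write down the "hidden state" view is as a marginalization over rare-language sentences. The equations below are a sketch of that reading, not a formula taken from the original work. They assume that, once the intermediate sentence Z is given, the output depends on Z alone.

```latex
P(Y \mid X) = \sum_{Z} P(Z \mid X)\, P(Y \mid Z),
\qquad
P(X \mid Y) = \sum_{Z} P(Z \mid Y)\, P(X \mid Z)
```

Under this reading, maximizing the likelihood of the large X-Y corpus provides a training signal for all four rare-language models even though no sentence Z is ever observed. Since the sum over all possible Z cannot be computed exactly, an iterative, EM-style procedure is a natural fit.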
Hot topic 5: Multi-modal learning
As a typical multi-modal task, visual QA (VQA) has attracted great interest from researchers in both the NLP and computer vision areas.
Before knowing how to speak, infants perceive the world by seeing, listening, and touching. This means language is not the only way to learn about and communicate with the world. Therefore, we should consider both language and other modalities when building artificial general intelligence. This is called multi-modal learning.
As a typical multi-modal task, visual QA (VQA) has attracted great interest from researchers in the NLP and computer vision areas. Given an image and a natural language question, VQA aims to generate the answer to the question, which depends on a deep understanding of both inputs and sufficient interaction between them. Recently, researchers from Microsoft Research Asia presented two VQA approaches at this year’s CVPR and KDD, based on question generation and scene graph generation technologies, respectively. We achieved state-of-the-art results on VQA benchmark datasets, including COCO and VQA 2.0. Besides VQA, video QA is another popular multi-modal learning task. Different from VQA, video QA returns a short video clip as the answer to the input query, which makes search results more vivid. With the rapid development of short music and video social platforms, live streaming apps, and mixed and artificial reality technology, how to understand, search, create, and utilize videos will become one of the keys to the next generation of HCI engines.
We think an ideal NLP framework could be a general-purpose architecture as described below. Note that this is only one typical design; a specific task may call for different design choices and technologies.
As the first step, the framework analyzes the natural language sentence to obtain the word sequence, part-of-speech tags, dependency parse, entities, intents, relations, and so on.
Then, the encoder will transform the information obtained into a semantic expression. In this procedure, the pre-trained word embedding and pre-trained entity embedding naturally bring in contextual information of a word or an entity. Furthermore, the same sentence will be encoded with other task-specific encoders, and the information obtained from these encoders is appended to the encoding of the main task, with appropriate weights, via transfer learning. The additional encodings from the other task-specific encoders will further enrich the encoding of the input sentence.
Next, based on the semantic expression obtained from the above process, we can use a decoder to generate the expected output. Additionally, multi-task learning can be applied to introduce other NLP tasks as complementary resources to help with the main-task learning. If the task involves multi-turn modeling, we will need to record the output of the previous turn into memory and use it for the decoding and inference of the subsequent turns.
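The multi-turn part of this design can be illustrated with a very small sketch: the structured output of each turn is written to memory, and later turns are interpreted against it. Everything here is hypothetical. The `DialogueMemory` class, the `interpret` function, and the slot dictionaries stand in for the decoder output of a real system, which would store learned representations rather than hand-written slots.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueMemory:
    """Keeps the structured output of earlier turns for later ones."""
    turns: list = field(default_factory=list)

    def remember(self, turn_output):
        self.turns.append(dict(turn_output))

    def context(self):
        """Merge all remembered turns; later turns override earlier ones."""
        merged = {}
        for past in self.turns:
            merged.update(past)
        return merged


def interpret(turn_output, memory):
    """Resolve the current turn against memory, then record it."""
    resolved = memory.context()
    resolved.update(turn_output)      # the current turn takes priority
    memory.remember(turn_output)
    return resolved


if __name__ == "__main__":
    memory = DialogueMemory()
    # Turn 1: "Book a restaurant in Beijing."
    print(interpret({"intent": "book_restaurant", "city": "Beijing"}, memory))
    # Turn 2: "Make it tomorrow." The intent and city come from memory.
    print(interpret({"date": "tomorrow"}, memory))
```

The second turn says nothing about restaurants or cities, yet it resolves to a complete request because the first turn's output was carried forward.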
To realize this ideal NLP framework, we will need to implement the following tasks:
- Construct a large-scale commonsense knowledge base and set up effective evaluation tasks to push forward the related research.
- Study more effective expressions of words, phrases, and sentences, and build a more powerful pre-trained network for expressions at different levels.
- Push forward unsupervised learning and semi-supervised learning, by using a limited amount of knowledge to strengthen the learning ability and by building powerful, cross-lingual word-embedding models.
- Leverage the effect of multi-task learning and transfer learning in NLP tasks, and boost the effect of reinforcement learning for typical tasks such as multi-turn dialogue in customer support systems.
- Effectively model discourse, multi-turn conversation, and multi-turn semantic analysis.
- Conduct user modeling and apply it to personalized recommendation and output systems.
- Build an expert system for a specific domain that uses the new generation of reasoning systems, task-completion and conversation systems, and integrates both domain knowledge and commonsense knowledge.
- Develop the explainability of NLP systems by using semantic analysis and knowledge systems.
In the next ten years, NLP research will explode. We can expect that there will be big progress in NLP fundamental research, core technologies, and important applications. As Bill Gates said, “We always overestimate the change that will occur in the next two years and underestimate the change that will occur in the next ten.” This is true for NLP as well.
Let’s imagine what changes NLP will bring us in the next ten years.
- In ten years, machine translation systems will be capable of modeling the context of a sentence and handling new terms. People will use a machine system as a spontaneous interpreter at meetings or presentations.
- An electronic personal assistant will understand your natural-language commands and complete orders for food, flowers, and tickets. You will get used to being served by a robot customer support agent.
- When you climb a mountain, you can tell your phone about your thoughts and upload a photo. Then, your phone will pop up a poem with beautiful language and the photo, and that poem can be sent out to your friends.
- Many news articles will be written by a computer.
- A computer teacher will correct your English pronunciation and polish your sentences through natural conversation.
- A robot will analyze massive documents and provide a data analysis report in a timely manner to help business leaders make decisions.
- News, books, classes, meetings, articles, and goods will be recommended to you by an intelligent recommendation system.
- Robots will help lawyers to find evidence and suggest similar cases. They can also discover the flaws of a contract or write up a legal document.
- And more, limited only by our imaginations.
While some of the above-mentioned scenarios have already emerged, they will become more mature in the next ten years. In the future, NLP and other AI technologies will dramatically change human life. To realize this bright future, we will continue to innovate boldly and solidly advance by balancing research and application. We will create a new generation of technology designed to serve all of human society.