Microsoft Research Blog

Microsoft Research Blog

The Microsoft Research blog provides in-depth views and perspectives from our researchers, scientists and engineers, plus information about noteworthy events and conferences, scholarships, and fellowships designed for academic and scientific communities.

Machine Reading Systems Are Becoming More Conversational

May 3, 2019 | By Microsoft blog editor

A team of researchers from the Natural Language Processing (NLP) Group at Microsoft Research Asia (MSRA) and the Speech Dialog Research Group at Microsoft Redmond are currently leading in the Conversational Question Answering (CoQA) Challenge organized by Stanford University. In this challenge, machines are measured by their ability to understand a text passage and answer a series of interconnected questions that appear in a conversation. Microsoft is currently the only team to have reached human parity in its model performance.

CoQA is a large-scale conversational question-answering dataset that is made up of conversational questions on a set of articles from different domains. The MSRA NLP team previously reached the human parity milestone on single-round question answering using the Stanford Question Answering Dataset (SQuAD). Compared with SQuAD, the questions in CoQA are more conversational and the answers can be free-form text to ensure the naturalness of answers in a conversation.

The questions in CoQA are very short, to mimic human conversation. In addition, every question after the first is dependent on the conversational history, which makes the short questions even more difficult for machines to parse. For example, suppose you asked a system, “Who is the founder of Microsoft?” You need it to understand that you were still speaking on the same subject when you ask the follow-up question, “When was he born?”

A Conversation from the CoQA Dataset. CoQA Paper: https://arxiv.org/abs/1808.07042

A Conversation from the CoQA Dataset. CoQA Paper: https://arxiv.org/abs/1808.07042

To better test the generalization ability of existing models, CoQA collected data from seven different domains: children’s stories, literature, middle and high school English exams, news, Wikipedia, Reddit, and science. The first five are used in the training, development, and test sets, and the last two are used only for the test set. CoQA uses the F1 metric to evaluate performance. The F1 metric measures the average word overlap between the prediction and ground truth answers. In-domain F1 is scored on test data from the same domain as the training set; and out-of-domain F1 is scored on test data from different domains. Overall F1 is the final score on the whole test set.

The method used by the Microsoft researchers employs a special strategy, in which information learned from several related tasks is used to improve the target machine reading comprehension (MRC) tasks. In this multistage, multitask, fine-tuning method, researchers first learn MRC-relevant background information from related tasks under a multitask setting, and then fine-tune the model on the target task. Language modeling is additionally used as an auxiliary task in both stages to help reduce the over-fitting of the conversational question-answering model. Experiments have supported the effectiveness of this method, which is further demonstrated by its strong performance in the CoQA Challenge.

Overview of the multistage multitask fine-tuning model

Overview of the multistage multitask fine-tuning model

According to the CoQA leaderboard, the ensemble system that Microsoft researchers submitted on March 29, 2019 reached 89.9/88.0/89.4 as its respective in-domain, out-of-domain, and overall F1 scores. Human performance on the same set of conversational questions and answers stands at 89.4/87.4/88.8.

This achievement marks a major advance in the effort to have search engines such as Bing and intelligent assistants such as Cortana interact with people and provide information in more natural ways, much like how people communicate with each other. Nonetheless, general machine reading comprehension and question answering remains an unsolved problem in natural language processing. To further push the boundary of machine capability in understanding and generating natural language, the team continues to work on producing even more powerful pre-training models.

Up Next

an illustration of a woman looking at a Windows laptop, next to a man, and a man looking at a cellphone all sitting at a desk

Artificial intelligence, Computer vision, Human language technologies

Making machines recognize and transcribe conversations in meetings using audio and video

The ability to perceive communication signals and make sense of them played an essential role in the evolution of human intelligence. Computing technology is following the same trajectory. Now, computer vision and automatic speech recognition (ASR) technologies have enabled the advent of many artificial intelligence (AI) applications and virtual assistants by allowing machines to see […]

Takuya Yoshioka

Principal Researcher

Artificial intelligence

The KnowRef Coreference Corpus: a resource for training and evaluating common sense in AI

  AI has made major strides in the last decade, from beating the world champion of Go, to learning how to program, to telling fantastical short stories. However, a basic human trait continues to elude machines: common sense. Common sense is a big term with plenty of baggage, but it typically includes shared background knowledge […]

Ali Emami

PhD Candidate, Mila/McGill University

Graphic showing different types of microphone arrays

Artificial intelligence, Human language technologies

New Advancements in Spoken Language Processing

Deep learning algorithms, supported by the availability of powerful Azure computing infrastructure and massive training data, constitutes the most significant driving force in our AI evolution journey. In the past three years, Microsoft reached several historical AI milestones being the first to achieve human parity in the following public benchmark tasks that have been broadly […]

Xuedong Huang

Technical Fellow and Chief Technology Officer of AI Services