Microsoft Research Blog

Microsoft Research Blog

The Microsoft Research blog provides in-depth views and perspectives from our researchers, scientists and engineers, plus information about noteworthy events and conferences, scholarships, and fellowships designed for academic and scientific communities.

Machine Reading Systems Are Becoming More Conversational

May 3, 2019 | By Microsoft blog editor

A team of researchers from the Natural Language Processing (NLP) Group at Microsoft Research Asia (MSRA) and the Speech Dialog Research Group at Microsoft Redmond are currently leading in the Conversational Question Answering (CoQA) Challenge organized by Stanford University. In this challenge, machines are measured by their ability to understand a text passage and answer a series of interconnected questions that appear in a conversation. Microsoft is currently the only team to have reached human parity in its model performance.

CoQA is a large-scale conversational question-answering dataset that is made up of conversational questions on a set of articles from different domains. The MSRA NLP team previously reached the human parity milestone on single-round question answering using the Stanford Question Answering Dataset (SQuAD). Compared with SQuAD, the questions in CoQA are more conversational and the answers can be free-form text to ensure the naturalness of answers in a conversation.

The questions in CoQA are very short, to mimic human conversation. In addition, every question after the first is dependent on the conversational history, which makes the short questions even more difficult for machines to parse. For example, suppose you asked a system, “Who is the founder of Microsoft?” You need it to understand that you were still speaking on the same subject when you ask the follow-up question, “When was he born?”

A Conversation from the CoQA Dataset. CoQA Paper:

A Conversation from the CoQA Dataset. CoQA Paper:

To better test the generalization ability of existing models, CoQA collected data from seven different domains: children’s stories, literature, middle and high school English exams, news, Wikipedia, Reddit, and science. The first five are used in the training, development, and test sets, and the last two are used only for the test set. CoQA uses the F1 metric to evaluate performance. The F1 metric measures the average word overlap between the prediction and ground truth answers. In-domain F1 is scored on test data from the same domain as the training set; and out-of-domain F1 is scored on test data from different domains. Overall F1 is the final score on the whole test set.

The method used by the Microsoft researchers employs a special strategy, in which information learned from several related tasks is used to improve the target machine reading comprehension (MRC) tasks. In this multistage, multitask, fine-tuning method, researchers first learn MRC-relevant background information from related tasks under a multitask setting, and then fine-tune the model on the target task. Language modeling is additionally used as an auxiliary task in both stages to help reduce the over-fitting of the conversational question-answering model. Experiments have supported the effectiveness of this method, which is further demonstrated by its strong performance in the CoQA Challenge.

Overview of the multistage multitask fine-tuning model

Overview of the multistage multitask fine-tuning model

According to the CoQA leaderboard, the ensemble system that Microsoft researchers submitted on March 29, 2019 reached 89.9/88.0/89.4 as its respective in-domain, out-of-domain, and overall F1 scores. Human performance on the same set of conversational questions and answers stands at 89.4/87.4/88.8.

This achievement marks a major advance in the effort to have search engines such as Bing and intelligent assistants such as Cortana interact with people and provide information in more natural ways, much like how people communicate with each other. Nonetheless, general machine reading comprehension and question answering remains an unsolved problem in natural language processing. To further push the boundary of machine capability in understanding and generating natural language, the team continues to work on producing even more powerful pre-training models.

Up Next

Graphic showing different types of microphone arrays

Artificial intelligence, Human language technologies

New Advancements in Spoken Language Processing

Deep learning algorithms, supported by the availability of powerful Azure computing infrastructure and massive training data, constitutes the most significant driving force in our AI evolution journey. In the past three years, Microsoft reached several historical AI milestones being the first to achieve human parity in the following public benchmark tasks that have been broadly […]

Xuedong Huang

Technical Fellow, Speech and Language


Artificial intelligence, Data platforms and analytics

FigureQA: an annotated figure dataset for visual reasoning

Reasoning about figures Almost every scientific publication is accompanied by data visualizations in the form of graphs and charts. Figures are an intuitive aid for understanding the content of documents, so naturally, it is useful to leverage this visual information for machine reading comprehension. To enable research in this domain we built FigureQA, a new dataset composed of […]

Microsoft blog editor

Artificial intelligence

Transfer learning for machine reading comprehension

By Xiaodong He, Principal Researcher, Microsoft Research For human beings, reading comprehension is a basic task, performed daily. As early as in elementary school, we can read an article, and answer questions about its key ideas and details. But for AI, full reading comprehension is still an elusive goal–but a necessary one if we’re going […]

Microsoft blog editor