NewsQA Dataset

Overview

With massive volumes of written text being produced every second, how do we make sure that we have the most recent and relevant information available to us? Microsoft Research Montreal is tackling this problem by building AI systems that can read and comprehend large volumes of complex text in real time.

The purpose of the NewsQA dataset is to help the research community build algorithms that are capable of answering questions requiring human-level comprehension and reasoning skills.

Leveraging CNN articles from the DeepMind Q&A Dataset, we prepared a crowd-sourced machine reading comprehension dataset of 120K Q&A pairs.

  • Documents are CNN news articles.
  • Questions are written by human users in natural language.
  • Answers may be multiword passages of the source text.
  • Questions may be unanswerable.
  • NewsQA is collected using a 3-stage, siloed process:
      • Questioners see only an article’s headline and highlights.
      • Answerers see the question and the full article, then select an answer passage.
      • Validators see the article, the question, and a set of answers that they rank.
  • NewsQA is more natural and more challenging than previous datasets.

Challenges

A significant proportion of questions in NewsQA cannot be solved without reasoning. The reasoning types we have identified in our analysis are as follows:

  • Synthesis: Some answers can only be inferred by synthesizing information distributed across multiple sentences.
  • Paraphrasing: A single sentence in the article might entail or paraphrase the question. Paraphrase recognition may require synonymy and world knowledge.
  • Inference: Some answers must be inferred from incomplete information in the article or by recognizing conceptual overlap. This typically draws on general knowledge.
  • Additionally, some questions have no answer or no unique answer in the corresponding story, so a system must learn to recognize when given information is not sufficient.

See other datasets from Microsoft Montreal:
Frames | FigureQA

Stats

Summary

  • Stories: 12,744
  • Question-Answer Pairs: 119,633
  • Average Words per Article: 616
  • Average Words per Answer: 4.13
  • Human Performance (F1): 74.9%
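
The human performance figure above is an F1 score. This page does not say how it is computed; the following is a minimal sketch, assuming the token-overlap F1 commonly used for span-based reading comprehension. The function name `token_f1` and the lowercased whitespace tokenization are assumptions for illustration, not the official evaluation.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer span.

    Assumption: lowercased whitespace tokenization. An official
    evaluation script may normalize text differently.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match (both say "no answer").
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: a shared token is counted at most as often
    # as it occurs in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that is longer than the reference is penalized on precision.
print(token_f1("published Thursday", "Thursday"))
```

A span-level score like this gives partial credit when a predicted passage overlaps the reference without matching it exactly, which matters when answers are multiword passages.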

Reasoning Statistics

Reasoning mechanisms needed to answer questions in NewsQA based on 500 examples. For each type, we show an example question with the text snippet that contains the answer span, with words relevant to the reasoning type in bold.

Each entry lists the reasoning type and its proportion, followed by an example question (Q) and text snippet (T).

Word Matching (31.6%)
Q: When were the findings published?
T: Both sets of research findings were published Thursday

Paraphrasing (26.8%)
Q: Who is the struggle between in Rwanda?
T: The struggle pits ethnic Tutsis, supported by Rwanda, against ethnic Hutu

Synthesis (17.8%)
Q: Where is Brittanee Drexel from?
T: The mother of a 17-year-old Rochester, New York high school student … says she did not give her daughter permission to go on the trip. Brittanee Marie Drexel’s mom says…

Inference (14.0%)
Q: Who drew inspiration from presidents?
T: Rudy Ruiz says the lives of US presidents can make them positive role models for students.

Ambiguous/Insufficient (9.8%)
Q: Whose mother is moving to the White House?
T: … Barack Obama’s mother-in-law, Marian Robinson will join the Obamas at the family’s private quarters at 1600 Pennsylvania Avenue. [Michelle is never mentioned]

Consensus Statistics

102,841
With Consensus (Including Validated)
(85.96%)

51,630
Validated Answers
Only 51,630 needed to be validated because the others already had agreement during the second step of the collection.

45,381
Validated Answers with Consensus
(87.90%)
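
The two percentages above can be reproduced from the counts on this page. The sketch below assumes the denominators are the total number of question-answer pairs (119,633) and the number of validated answers (51,630); the page does not state the denominators explicitly, and the variable names are illustrative.

```python
# Counts taken from the Summary and Consensus Statistics sections.
TOTAL_QA_PAIRS = 119_633
WITH_CONSENSUS = 102_841            # including validated
VALIDATED = 51_630
VALIDATED_WITH_CONSENSUS = 45_381


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Assumed denominator: all question-answer pairs.
consensus_rate = percent(WITH_CONSENSUS, TOTAL_QA_PAIRS)
# Assumed denominator: only the answers that went through validation.
validated_rate = percent(VALIDATED_WITH_CONSENSUS, VALIDATED)

print(f"With consensus: {consensus_rate:.2f}%")            # 85.96%
print(f"Validated with consensus: {validated_rate:.2f}%")  # 87.90%
```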

[Figures: Story Length Distribution; Question Length Distribution; Question Type Distribution; Overall Answer Length Distribution; Answer Length Distribution per Question]

Download

CNN Stories

Notice: CNN articles are used here by permission from The Cable News Network (CNN). CNN does not waive any rights of ownership in its articles and materials. CNN is not a partner of, nor does it endorse, Microsoft Research Montreal or its activities.

The stories are not owned by Microsoft and can be retrieved from the DeepMind Q&A Dataset.

Questions and Answers

The following package includes only the questions and answers. Instructions on how to combine the stories and answers into one file can be found in the GitHub repo.
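
Conceptually, combining the two sources means looking up the story for each question and locating the answer passage in its text. The sketch below illustrates that idea only. It assumes answers are stored as character offsets into the story text; the `QARecord` type, its field names, the story identifier, and the offset convention are all hypothetical and are not taken from the released files or the repo's tooling.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class QARecord:
    """Hypothetical record layout, for illustration only."""
    story_id: str
    question: str
    # Assumed convention: half-open (start, end) character offsets into
    # the story text; None marks a question with no answer in the story.
    answer_span: Optional[Tuple[int, int]]


def answer_text(record: QARecord, stories: Dict[str, str]) -> Optional[str]:
    """Return the answer passage from the story, or None if unanswerable."""
    if record.answer_span is None:
        return None
    start, end = record.answer_span
    return stories[record.story_id][start:end]


# Toy story reusing the Word Matching example from this page.
text = "Both sets of research findings were published Thursday"
stories = {"story-1": text}

start = text.index("Thursday")
answerable = QARecord("story-1", "When were the findings published?",
                      (start, start + len("Thursday")))
unanswerable = QARecord("story-1", "Who funded the research?", None)

print(answer_text(answerable, stories))
print(answer_text(unanswerable, stories))
```

For the actual file formats and the supported way to merge them, follow the instructions in the GitHub repo.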

 

GitHub Repo

https://github.com/Maluuba/newsqa

Paper

NewsQA: A Machine Comprehension Dataset
