NewsQA Dataset


With massive volumes of written text being produced every second, how do we make sure that we have the most recent and relevant information available to us? Microsoft Research Montreal is tackling this problem by building AI systems that can read and comprehend large volumes of complex text in real time.

The purpose of the NewsQA dataset is to help the research community build algorithms that are capable of answering questions requiring human-level comprehension and reasoning skills.

Leveraging CNN articles from the DeepMind Q&A Dataset, we prepared a crowd-sourced machine reading comprehension dataset of 120K Q&A pairs.

  • Documents are CNN news articles.
  • Questions are written by human users in natural language.
  • Answers may be multiword passages of the source text.
  • Questions may be unanswerable.
  • NewsQA is collected using a 3-stage, siloed process:
      • Questioners see only an article’s headline and highlights.
      • Answerers see the question and the full article, then select an answer passage.
      • Validators see the article, the question, and a set of answers that they rank.
  • NewsQA is more natural and more challenging than previous datasets.
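
The properties above (answers as passages of the source text, questions that may be unanswerable, several answerers per question) can be pictured with a small data model. This is only a minimal sketch: the class and field names are hypothetical and are not the schema of the released files.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnswerSpan:
    """A passage of the article, given as character offsets (assumed convention)."""
    start: int  # inclusive
    end: int    # exclusive


@dataclass
class Record:
    """One question with the spans selected by its answerers (hypothetical layout)."""
    story_id: str
    question: str
    # One entry per answerer; None means the answerer found no answer in the article.
    answers: List[Optional[AnswerSpan]] = field(default_factory=list)


def span_text(article: str, span: Optional[AnswerSpan]) -> Optional[str]:
    """Return the passage a span points at, or None for an unanswerable question."""
    if span is None:
        return None
    return article[span.start:span.end]


article = "Both sets of research findings were published Thursday."
record = Record("story-001", "When were the findings published?")

# Locate the answer passage inside the article instead of hard-coding offsets.
start = article.index("Thursday")
record.answers.append(AnswerSpan(start, start + len("Thursday")))
record.answers.append(None)  # a second answerer who marked the question unanswerable

selected = [span_text(article, span) for span in record.answers]
```

Storing offsets rather than answer strings keeps every answer tied to a location in the article, which matters when the same words occur more than once in a story.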


A significant proportion of questions in NewsQA cannot be solved without reasoning. The reasoning types we have identified in our analysis are as follows:

  • Synthesis: Some answers can only be inferred by synthesizing information distributed across multiple sentences.
  • Paraphrasing: A single sentence in the article might entail or paraphrase the question. Paraphrase recognition may require synonymy and world knowledge.
  • Inference: Some answers must be inferred from incomplete information in the article or by recognizing conceptual overlap. This typically draws on general knowledge.
  • Additionally, some questions have no answer or no unique answer in the corresponding story, so a system must learn to recognize when given information is not sufficient.

See other datasets from Microsoft Montreal:
Frames | FigureQA




Question-Answer Pairs

Average Words per Article

Average Words per Answer

Human Performance (F1)
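
For span-style answers, F1 is commonly computed over the tokens shared by a predicted answer and a reference answer. The sketch below shows that common token-overlap formulation. It is an assumption that the figure reported here was computed in a similar way, and the whitespace tokenization and lowercasing are simplifications.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer string."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


exact = token_f1("Thursday", "Thursday")
partial = token_f1("published Thursday", "Thursday")
```

In the second call the prediction contains one extra token, so precision is 1/2, recall is 1, and F1 is 2/3.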

Reasoning Statistics

Reasoning mechanisms needed to answer questions in NewsQA based on 500 examples. For each type, we show an example question with the text snippet that contains the answer span, with words relevant to the reasoning type in bold.

Reasoning type, proportion, and example:

  • Word Matching (31.6%)
    Q: When were the findings published?
    T: Both sets of research findings were published Thursday

  • Paraphrasing (26.8%)
    Q: Who is the struggle between in Rwanda?
    T: The struggle pits ethnic Tutsis, supported by Rwanda, against ethnic Hutu

  • Synthesis (17.8%)
    Q: Where is Brittanee Drexel from?
    T: The mother of a 17-year-old Rochester, New York high school student … says she did not give her daughter permission to go on the trip. Brittanee Marie Drexel’s mom says…

  • Inference (14.0%)
    Q: Who drew inspiration from presidents?
    T: Rudy Ruiz says the lives of US presidents can make them positive role models for students.

  • Ambiguous/Insufficient (9.8%)
    Q: Whose mother is moving to the White House?
    T: … Barack Obama’s mother-in-law, Marian Robinson will join the Obamas at the family’s private quarters at 1600 Pennsylvania Avenue. [Michelle is never mentioned]
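
Because the proportions are based on 500 examples, each one corresponds to a whole number of questions. A short check of that arithmetic, using only the figures listed above:

```python
# Proportions (in percent) for each reasoning type, as listed above.
proportions = {
    "Word Matching": 31.6,
    "Paraphrasing": 26.8,
    "Synthesis": 17.8,
    "Inference": 14.0,
    "Ambiguous/Insufficient": 9.8,
}
SAMPLE_SIZE = 500  # number of examples the analysis is based on

# Convert each percentage to a number of examples; round() absorbs
# floating-point noise such as 158.00000000000003.
counts = {name: round(pct * SAMPLE_SIZE / 100) for name, pct in proportions.items()}

total_pct = sum(proportions.values())
total_count = sum(counts.values())

for name, count in counts.items():
    print(f"{name}: {count} of {SAMPLE_SIZE}")
```

The five categories cover the whole sample: the percentages sum to 100% and the counts (158, 134, 89, 70, and 49) sum to 500.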

Consensus Statistics

With Consensus (Including Validated)

Validated Answers
Only 51,630 needed to be validated; for the rest, there was already agreement during the second step of the collection.

Validated Answers with Consensus
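
One way to picture the consensus check is as a vote over the spans that answerers selected: if enough of them picked the same span, the question can skip validation. The sketch below is an illustration only. The function name, the `min_votes` threshold, and the exact-match comparison are assumptions, not the rule used to build the dataset.

```python
from collections import Counter
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, hypothetical convention


def needs_validation(spans: List[Span], min_votes: int = 2) -> bool:
    """Return True when no single span was chosen by at least min_votes answerers."""
    if not spans:
        return True
    # most_common(1) yields the span with the highest vote count.
    _, votes = Counter(spans).most_common(1)[0]
    return votes < min_votes


agreed = needs_validation([(46, 54), (46, 54), (36, 54)])   # two answerers match
disputed = needs_validation([(46, 54), (36, 54)])           # no two answerers match
```

A looser comparison, for example treating overlapping spans as matching, would send fewer questions to validators. Which comparison was actually used is not stated here.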

Story Length Distribution

Question Length Distribution

Question Type Distribution

Overall Answer Length Distribution

Answer Length Distribution per Question


CNN Stories

Notice: CNN articles are used here by permission from The Cable News Network (CNN). CNN does not waive any rights of ownership in its articles and materials. CNN is not a partner of, nor does it endorse, Microsoft Research Montreal or its activities.

The stories are not owned by Microsoft and can be retrieved from the DeepMind Q&A Dataset.

Questions and Answers

The following package includes only the questions and answers. Instructions on how to combine the stories and answers into one file can be found in the GitHub repo.


GitHub Repo


NewsQA: A Machine Comprehension Dataset