Microsoft Research Open Data Project: Evolving our standards for data access and reproducible research

Published December 5, 2019

By Vani Mandava , Director, Data Science Outreach

Share this page

Last summer we announced Microsoft Research Open Data—an Azure-based repository-as-a-service for sharing datasets—to encourage the reproducibility of research and make research data assets readily available in the cloud. Among other things, the project started a conversation between the community and Microsoft’s legal team about dataset licensing. Inspired by these conversations, our legal team developed a set of brand new data use agreements (opens in new tab) and released them for public comment on GitHub earlier this year.

Today we’re excited to announce that Microsoft Research Open Data will be adopting these data use agreements for several datasets that we offer.

Diving a bit deeper on the new data use agreements

The Open Use of Data Agreement (opens in new tab) (O-UDA) is intended for use by an individual or organization that is able to distribute data for unrestricted uses, and for which there is no privacy or confidentiality concern. It is not appropriate for datasets that include any data that might include materials subject to privacy laws (such as the GDPR or HIPAA) or other unlicensed third-party materials. The O-UDA meets the open definition: it does not impose any restriction with respect to the use or modiﬁcation of data other than ensuring that attribution and limitation of liability information is passed downstream. In the research context, this implies that users of the data need to cite the corresponding publication with which the data is associated. This aids in findability and reusability of data, an important tenet in the FAIR guiding principles (opens in new tab) for scientific data management and stewardship.

We also recognize that in certain cases, datasets useful for AI and research analysis may not be able to be fully “open” under the O-UDA. For example, they may contain third-party copyrighted materials, such as text snippets or images, from publicly available sources. The law permits their use for research, so following the principle that research data should be “as open as possible, as closed as necessary (opens in new tab),” we developed the Computational Use of Data Agreement (opens in new tab) (C-UDA) to make data available for research while respecting other interests. We will prefer the O-UDA where possible, but we see the C-UDA as a useful tool for ensuring that researchers continue to have access to important and relevant datasets.

Datasets that reflect the goals of our project

The following examples reference datasets that have adopted the Open Use of Data Agreement (O-UDA).

Location data for geo-privacy research

Microsoft researcher John Krumm and collaborators collected GPS data (opens in new tab) from 21 people who carried a GPS receiver in the Seattle area. Users who provided their data agreed to it being shared as long as certain geographic regions were deleted. This work covers key research on privacy preservation of GPS data as evidenced in the corresponding paper, “Exploring End User Preferences for Location Obfuscation, Location-Based Services, and the Value of Location,” which was accepted at the Twelfth ACM International Conference on Ubiquitous Computing (UbiComp 2010). The paper has been cited 147 times, including for research that builds upon this work to further the field of preservation of geo-privacy for location-based services providers.

Hand gestures data for computer vision

Another example dataset is that of labeled hand images and video clips (opens in new tab) collected by researchers Eyal Krupka, Kfir Karmon, and others. The research addresses an important computer vision and machine learning problem that deals with developing a hand-gesture-based interface language. The data was recorded using depth cameras and has labels that cover joints and fingertips. The two datasets included are FingersData, which contains 3,500 labeled depth frames of various hand poses, and GestureClips, which contains 140 gesture clips (100 of these contain labeled hand gestures and 40 contain non-gesture activity). The research associated with this dataset is available in the paper “Toward Realistic Hands Gesture Interface: Keeping it Simple for Developers and Machines,” which was published in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems.

Question-Answer data for machine reading comprehension

Finally, the FigureQA dataset generated by researchers Samira Ebrahimi Kahou, Adam Atkinson, Adam Trischler, Yoshua Bengio and collaborators, introduces a visual reasoning task for research that is specific to graphical plots and figures. The dataset (opens in new tab) has 180,000 figures with 1.3 million question-answer pairs in the training set. More details about the dataset are available in the paper “FigureQA: An Annotated Figure Dataset for Visual Reasoning (opens in new tab)” and corresponding Microsoft Research Blog post. The dataset is pivotal to developing more powerful visual question answering and reasoning models, which potentially improve accuracy of AI systems that are involved in decision making based on charts and graphs.

The data agreements are a part of our larger goals

Microsoft Research Open Data project was conceived from the start to reflect Microsoft Research’s commitment to fostering open science and research and to achieve this without compromising the ethics of collecting and sharing data. Our goal is to make it easier for researchers to maintain provenance of data while having the ability to reference and build upon it.

The addition of the new data agreements (opens in new tab) to Microsoft Research Open Data’s feature set is an exciting step in furthering our mission.

Acknowledgements: This work would not have been possible without the substantial team effort by — Dave Green, Justin Colannino, Gretchen Deo, Sarah Kim, Emily McReynolds, Mario Madden, Emily Schlesinger, Elaine Peterson, Leila Stevenson, Dave Baskin, and Sergio Loscialo.