Microsoft Research Blog

Microsoft Research Open Data Project: Evolving our standards for data access and reproducible research

December 5, 2019 | By Vani Mandava, Director, Data Science Outreach

Last summer we announced Microsoft Research Open Data—an Azure-based repository-as-a-service for sharing datasets—to encourage the reproducibility of research and make research data assets readily available in the cloud. Among other things, the project started a conversation between the community and Microsoft’s legal team about dataset licensing. Inspired by these conversations, our legal team developed a set of brand new data use agreements and released them for public comment on GitHub earlier this year.

Today we’re excited to announce that Microsoft Research Open Data will be adopting these data use agreements for several datasets that we offer.

Diving a bit deeper into the new data use agreements

The Open Use of Data Agreement (O-UDA) is intended for use by an individual or organization that is able to distribute data for unrestricted uses, and for which there is no privacy or confidentiality concern. It is not appropriate for datasets that might include materials subject to privacy laws (such as the GDPR or HIPAA) or other unlicensed third-party materials. The O-UDA meets the open definition: it does not impose any restriction on the use or modification of data other than ensuring that attribution and limitation of liability information is passed downstream. In the research context, this means that users of the data need to cite the publication with which the data is associated. Citation aids the findability and reusability of data, both important tenets of the FAIR guiding principles for scientific data management and stewardship.

We also recognize that in certain cases, datasets useful for AI and research analysis cannot be made fully “open” under the O-UDA. For example, they may contain third-party copyrighted materials, such as text snippets or images, from publicly available sources. The law permits their use for research, so following the principle that research data should be “as open as possible, as closed as necessary,” we developed the Computational Use of Data Agreement (C-UDA) to make data available for research while respecting other interests. We will prefer the O-UDA where possible, but we see the C-UDA as a useful tool for ensuring that researchers continue to have access to important and relevant datasets.
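The preference described above (O-UDA where possible, C-UDA when third-party copyrighted material from public sources is present) can be sketched as a small decision helper. This is only an illustrative reading of the two paragraphs above, not part of either agreement: the `DatasetProfile` fields and the `suggest_agreement` function are hypothetical names, and the `None` result for privacy-regulated data is an assumption standing in for "neither agreement is described here as fitting; ask your legal team."

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetProfile:
    # Hypothetical fields; they paraphrase the criteria in the post.
    has_privacy_regulated_data: bool            # e.g. material subject to GDPR or HIPAA
    has_third_party_copyrighted_material: bool  # e.g. text snippets or images
    third_party_material_is_publicly_available: bool = False


def suggest_agreement(profile: DatasetProfile) -> Optional[str]:
    """Return "O-UDA", "C-UDA", or None when the post gives no clear answer."""
    # The post says the O-UDA is not appropriate for privacy-regulated data
    # and does not say the C-UDA covers it, so this sketch defers to review.
    if profile.has_privacy_regulated_data:
        return None
    # Prefer the fully open agreement whenever nothing restricts the data.
    if not profile.has_third_party_copyrighted_material:
        return "O-UDA"
    # The C-UDA example in the post is copyrighted material from public sources.
    if profile.third_party_material_is_publicly_available:
        return "C-UDA"
    return None


if __name__ == "__main__":
    unrestricted = DatasetProfile(False, False)
    scraped_snippets = DatasetProfile(False, True, True)
    print(suggest_agreement(unrestricted))
    print(suggest_agreement(scraped_snippets))
```

The ordering of the checks mirrors the stated preference: the O-UDA is tried first, and the C-UDA is only a fallback for data that cannot be fully open.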

Datasets that reflect the goals of our project

The following examples reference datasets that have adopted the Open Use of Data Agreement (O-UDA).

Location data for geo-privacy research

Microsoft researcher John Krumm and collaborators collected GPS data from 21 people who carried a GPS receiver in the Seattle area. Users who provided their data agreed to it being shared as long as certain geographic regions were deleted. This work covers key research on privacy preservation of GPS data as evidenced in the corresponding paper, “Exploring End User Preferences for Location Obfuscation, Location-Based Services, and the Value of Location,” which was accepted at the Twelfth ACM International Conference on Ubiquitous Computing (UbiComp 2010). The paper has been cited 147 times, including for research that builds upon this work to further the field of preservation of geo-privacy for location-based services providers.

Hand gestures data for computer vision

Another example is a dataset of labeled hand images and video clips collected by researchers Eyal Krupka, Kfir Karmon, and others. The research addresses an important computer vision and machine learning problem: developing a hand-gesture-based interface language. The data was recorded using depth cameras and has labels that cover joints and fingertips. The two datasets included are FingersData, which contains 3,500 labeled depth frames of various hand poses, and GestureClips, which contains 140 gesture clips (100 of these contain labeled hand gestures and 40 contain non-gesture activity). The research associated with this dataset is available in the paper “Toward Realistic Hands Gesture Interface: Keeping it Simple for Developers and Machines,” which was published in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems.

Question-Answer data for machine reading comprehension

Finally, the FigureQA dataset, generated by researchers Samira Ebrahimi Kahou, Adam Atkinson, Adam Trischler, Yoshua Bengio, and collaborators, introduces a visual reasoning task specific to graphical plots and figures. The dataset has 180,000 figures with 1.3 million question-answer pairs in the training set. More details about the dataset are available in the paper “FigureQA: An Annotated Figure Dataset for Visual Reasoning” and the corresponding Microsoft Research Blog post. The dataset is pivotal to developing more powerful visual question answering and reasoning models, which could improve the accuracy of AI systems that make decisions based on charts and graphs.

The data agreements are a part of our larger goals

The Microsoft Research Open Data project was conceived from the start to reflect Microsoft Research’s commitment to fostering open science and research, and to do so without compromising the ethics of collecting and sharing data. Our goal is to make it easier for researchers to maintain the provenance of data while being able to reference and build upon it.

The addition of the new data agreements to Microsoft Research Open Data’s feature set is an exciting step in furthering our mission.

Acknowledgements: This work would not have been possible without the substantial team effort of Dave Green, Justin Colannino, Gretchen Deo, Sarah Kim, Emily McReynolds, Mario Madden, Emily Schlesinger, Elaine Peterson, Leila Stevenson, Dave Baskin, and Sergio Loscialo.
