Microsoft Research Blog

The Microsoft Research blog provides in-depth views and perspectives from our researchers, scientists and engineers, plus information about noteworthy events and conferences, scholarships, and fellowships designed for academic and scientific communities.

Announcing Microsoft Research Open Data – Datasets by Microsoft Research now available in the cloud

June 21, 2018 | By Vani Mandava, Director, Data Science Outreach

The Microsoft Research Outreach team has worked extensively with the external research community to enable adoption of cloud-based research infrastructure over the past few years. Through this process, we experienced the ubiquity of Jim Gray’s fourth paradigm of discovery based on data-intensive science – that is, almost all research projects have a data component to them. This data deluge also demonstrated a clear need for curated and meaningful datasets in the research community, not only in computer science but also in interdisciplinary and domain sciences.

Today we are excited to launch Microsoft Research Open Data – a new data repository in the cloud dedicated to facilitating collaboration across the global research community. Microsoft Research Open Data, in a single, convenient, cloud-hosted location, offers datasets representing many years of data curation and research efforts by Microsoft that were used in published research studies.

Why we are investing in this

The goal is to provide a simple platform for Microsoft researchers and collaborators to share datasets and related research technologies and tools. Microsoft Research Open Data is designed to simplify access to these datasets, facilitate collaboration between researchers using cloud-based resources, and enable reproducibility of research. We will continue to shape and grow this repository and add features based on feedback from the community.
We recognize that there are dozens of data repositories already in use by researchers and expect that the capabilities of this repository will augment existing efforts.

Figure 1 – Dataset in Microsoft Research Open Data

“This is a game changer for the big data community. Initiatives like Microsoft Research Open Data reduce barriers to data sharing and encourage reproducibility by leveraging the power of cloud computing.”
-Sam Madden, Professor, Massachusetts Institute of Technology

With data growing at an exponential rate, and projected to exceed 150 ZB by 2025, it is now recognized that we need to prioritize bringing processing to the data rather than relying on moving data over Internet bandwidth that is growing at a much slower pace. We believe that there is real utility in providing an option to bring the processing to the data. Therefore, in addition to downloading the data assets, users can also copy datasets directly to an Azure-based Data Science virtual machine, as shown in Figure 2.

Figure 2 – Data copied from microsoftopendata.com to an Azure-based Linux virtual machine
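
To make the case for bringing processing to the data more concrete, here is a rough back-of-the-envelope sketch of how long it takes to move a dataset over a network link. The function name, the 1 TB dataset size, and the link speeds are all hypothetical values chosen for illustration; they are not measurements from Microsoft Research Open Data or Azure, and the calculation assumes a perfectly sustained transfer rate with no protocol overhead.

```python
def transfer_hours(dataset_bytes: float, link_bits_per_second: float) -> float:
    """Return the hours needed to move dataset_bytes over a link.

    Assumes the link sustains its full nominal rate for the whole
    transfer, which real networks rarely do.
    """
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    seconds = dataset_bytes * 8 / link_bits_per_second  # bytes -> bits
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical dataset size and link speeds, for illustration only.
    one_terabyte = 10**12
    links = [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]
    for label, bits_per_second in links:
        hours = transfer_hours(one_terabyte, bits_per_second)
        print(f"1 TB over {label}: {hours:.1f} hours")
```

Under these assumptions, transfer time scales linearly with dataset size, so copying a dataset to a virtual machine that already sits in the same cloud avoids the slowest step for large datasets.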

The Data Science virtual machine comes preloaded with a variety of development tools popular with researchers and practitioners, as can be seen in Figure 3.

Figure 3 – Linux Data Science virtual machine

“I am often asked to share my research data and the public sharing I have done in the past has been popular. Coordinating and cataloging these datasets in one place with Azure will be helpful for both internal and external researchers, giving them easy access, encouraging collaboration, and providing convenient cloud-based access to the wealth of Microsoft Research shared data.”
-John Krumm, Principal Researcher, Microsoft Research AI

Datasets in Microsoft Research Open Data are categorized by their primary research area, as shown in Figure 4. Each dataset is accompanied by links to the related research projects or publications. You can browse available datasets and download them or copy them directly to an Azure subscription through an automated workflow. To the extent possible, the repository meets the highest standards for data sharing to ensure that datasets are findable, accessible, interoperable and reusable. The entire corpus does not contain personally identifiable information. The site will continue to evolve as we get feedback from users.

Figure 4 – Dataset Categories

Microsoft Research Open Data is an outcome of the Microsoft Research Outreach Data science program and was made possible by a collaboration between many teams at Microsoft, Microsoft researchers, our industry partners, and our academic advisors.

We would love to hear your comments and feedback! Please send us a note via the Feedback feature on the site http://microsoftopendata.com and tell us what you think.

 
