NSF Big Data Innovation Hubs collaboration — looking back after one year
By Vani Mandava, Director, Data Science
Significant technical advancements in cloud computing have led to lower infrastructure costs, making possible big storage and big computing. Big data technology, though, requires cross-discipline research within and beyond non-computing domains. This is where domain experts collaborate with computing teams, industry, and government agencies to discover new insights that may fundamentally change the way we think about science. The Big Data Innovation Hubs initiative, launched by the National Science Foundation (NSF) in collaboration with the host institutions, presents one such bold opportunity to focus on vertical-specific, yet interdisciplinary, research.
Microsoft Research entered a partnership with the Big Data Regional Innovation Hubs in June 2016 to further unleash the power of cloud computing to accelerate discovery. A year later, we offer a retrospective overview of the initiative to highlight what has worked and some of the research domains that have benefited from the collaboration.
Microsoft’s commitment of $3 million in cloud computing credits was initially designed to boost the NSF-supported Big Data Spokes and Planning projects. The commitment was recently extended to other projects nominated by the regional Big Data Hub executive directors. The following table lists the projects that received awards totaling about $600,000 in Azure credits during the initial year.
|Cloud computing for neuroscience research||Health||Midwest||Franco Pestilli, assistant professor||Indiana University|
|SEADTrain: Cloud-based, hands-on environment for online data science training||Education||Midwest||Beth Plale, director||Indiana University|
|Cloud infrastructure for vision-based yield mapping||Agriculture||Midwest||Volkan Isler, associate professor||University of Minnesota|
|Comprehensive and large-scale analysis of 10,000 microbiomes to better understand human disease||Health||Northeast||Chirag Patel, assistant professor||Harvard Medical School|
|Accelerating analytical workloads using GPU-CPU coprocessing||
Data Sharing / Computer Science
Sam Madden, professor
Anil Shanbhag, PhD student
|Massachusetts Institute of Technology|
|Crowdsourcing platform for generating gold standard labels for medical data||Health||South||Gari Clifford, professor||Emory University|
|Proteogenomic applications of genetic variations||Genomics||West||Eric Deutsch, senior research scientist||Institute for Systems Biology, Seattle|
|Metroinsight: Smart cities as a system||Smart Cities||West||Bharathan Balaji, postdoctoral scholar||University of California, Los Angeles|
“This partnership in big data and data science with Microsoft is bringing together the expertise from academia, leading-edge infrastructure and technologies from industry, and interesting datasets from the government sector to solve important problems,” said Dr. Jim Kurose, Assistant Director of the National Science Foundation for Computer and Information Science and Engineering. “Microsoft’s significant contribution of cloud resources to the Big Data Hubs and Spokes effort has provided a major boost to the various Spokes projects. We are delighted to have Microsoft as a continued partner in this endeavor, and look forward to working with them, and with all of our industry partners.”
The Microsoft Azure for Research program was launched in 2014 and has directly engaged researchers and funding agencies including driving more than 1,000+ Azure grants and training more than 2,500 researchers in 400 universities in more than 50 countries worldwide. Our partnership with the Big Data Hubs in the United States has empowered the executive directors of the hubs to award cloud credits to research proposals reviewed by NSF for Spoke or Planning grants, or toward hub-supported activities and projects with a need for cloud computing resources. In addition, due to the inherent design of hubs to foster cross-sector collaboration, Azure training workshops are helping reach data science communities beyond academic researchers. More than 200 data scientists from all four regional hubs in the United States participated in intensive day-long, hands-on training workshops to learn how to use Azure in their research.
“The Azure for Research training at UT-Austin was an intense boot camp for data scientists and gave us the introductory tools needed to start building machine learning systems. The opportunities to build on top of Azure’s platform and use its cloud computing resources are invaluable,” said Adam Klivans, associate professor in the computer science department of the University of Texas at Austin and member of the South Big Data Hub.
The Big Data Innovation Hubs are designed to encourage multiple stakeholders, including federal agencies, private industry, academia, state and local governments, nonprofits and foundations to advance and apply data science research and innovation to regional and national challenges. The cross-sector nature of the partnerships is lighting up several verticals that benefit from industry collaboration and technology access to cloud computing. Let’s explore some of these initiatives in diverse verticals, from agriculture to health, to obtain a glimpse of how the Big Data Hub initiative brings together these usually siloed and independent stakeholders to do research on the cloud.
Healthcare and Genomics
Eric Deutsch of the Institute for Systems Biology in Seattle and Andreas Prlic of the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) are co-primary investigators (PI) on a $100,000 planning grant from NSF to increase collaborations in proteogenomic applications of genetic variations. The project brings together experts from the genomic and protein structure communities to help combine genetic variant datasets with the goal of disease diagnosis through proteogenomic data. “We are grateful to Microsoft’s support in our efforts to deploy proteogenomic software tools onto Azure, and to build a framework that enables researchers to transfer data among all these various community tools,” said Deutsch. The team is using Azure through West Big Data Hub credits to run their Trans-Proteomic Pipeline (TPP) on containers in Azure and train students on advanced analytics techniques for using TPP.
“Using Azure, we expect to provide a scalable cloud-based system for aggregating conflicting diagnoses and medical labels, and integrate out the biases and errors inherent in medical decision support,” said Gari Clifford of Emory University. Clifford and Herman Taylor, from Morehouse School of Medicine, are South Big Data Hub co-principal investigators on a $1 million NSF Spoke award that brings together six universities for a project designed to address the fractured nature of healthcare information. The project aims to devise a system that provides a highly efficient way to create ground-truth data that powers classifiers and diagnostic tools that comprise modern healthcare systems.
Chirag Patel, assistant professor at Harvard Medical School, is lead PI for a $1 million Big Data Spoke award for a project titled Comprehensive and Large-scale Analysis of Metagenomic Data to better understand Human Disease. The project is a great example of how the hubs bring together cross-sector collaborations – the project includes five federal agencies, four universities, and a major health system. “We are excited to leverage Azure data science virtual machines to create high throughput meta-analysis pipelines for modeling gene networks within microbial metagenomes. Separately we plan to host the exposomeDB in Azure and allow researchers to work with the data using the Jupyter notebook service,” said Patel. His team is building tools that will integrate over multiple datasets simultaneously with the goal of determining microbiome (including uncultivable organisms and undiscovered genes) correlation to human disease.
The Midwest Big Data Hub is focused on agriculture and brings together experts in fields such as machine learning, precision agriculture, and phenotyping. “Agriculture researchers working in the field are creating new phenotypes, which, when combined with machine learning and low-cost end-to-end solutions, will drive the next wave of disruption that will help feed the growing population of the world,” said Ranveer Chandra from Microsoft Research, Redmond.
Chandra would know — he and his global team spent the past few years building FarmBeats, which aspires to combat the world’s food supply shortage using computer vision and machine learning on cyber-physical systems consisting of sensors and UAVs in remote areas that communicate with Azure on the white space spectrum. The Agriculture-Food Energy Water (Ag-FEW) stakeholders at the “Machine Learning: Farm-to-Table” workshop hosted by the Midwest Big Data Hub, in collaboration with the International Food Security Institute at Illinois, are expected to further this research through new collaborations with top Ag-FEW researchers in the Midwest.
The West Big Data Hub, in collaboration with the three other Big Data Hubs and industry partners, spearheaded a National Transportation Data Challenge, with a kickoff event in Seattle on May 2-3. The kickoff was the first in a series of community problem-solving sessions, demos and hackathons that are being organized across the nation over six months. Several teams across Microsoft contributed ideas on recent and ongoing work on transportation data science.
Microsoft’s engagement with the challenge builds on a foundation of prior work in public safety and metro data science. The challenge’s launch event highlighted a collaboration between Microsoft’s Community Technology Engagement Group in the Corporate, External and Legal Affairs (CELA) team and DataKind Vision Zero, the New York City Department of Transportation, Seattle Department of Transportation and the City of New Orleans’ Office of Performance and Accountability. The project enabled an ecosystem that helped cities assign limited resources to traffic safety issues where they are most needed. In addition, the community event noted how Microsoft Research has been actively collaborating with the City of Bellevue and University of Washington on a project that combines video analytics with machine learning to make roads safer.
The Midwest Big Data Hub (MBDH) is actively pursuing community building around data science in the Midwest of the United States. Beth Plale, Professor, Indiana University, and team are working on project SEADTrain: Cloud-based Hands On Environment for On-line Data Science Training to prototype an Azure based cloud solution for data science hands-on training. SEADTrain allows instructors to publish datasets to Azure through the SEAD curation pipeline where the datasets pick up persistent IDs. Students then carry out their training modules in the Azure based cloud using real data hosted there, an approach to training that substantially lowers the burden to students who would otherwise be burdened by setting up an environment on their desktop, moving data, etc. The larger goal (post pilot) is to demonstrate the effectiveness of the solution, as applied to real world training, by connecting the hands-on training to MOOC style delivery using a MOOC module repository that is being developed in a separate initiative.
Big data hub leadership
All the progress in the past year wouldn’t have been possible without the effort of the executive directors at the helms of each of the hubs. They have worked tirelessly to bring together thousands of data scientists across sectors and disciplines through numerous events and outreach efforts.
It has been a productive year driving the agenda for cloud computing for data science research and impact with the Big Data Innovation Hubs. Microsoft is committed to the partnership with NSF and the hubs and will continue pushing the boundaries to achieve accelerated research outcomes using cloud-based infrastructure and big data services.
“The potential impact of data science platforms to shape generations of research relies on the imagination of collaborators to come together and shape it,” said Joseph Sirosh, corporate vice president of the Cloud AI platform at Microsoft. “Public-private partnerships that encourage a culture of data innovation to solve long-range applied problems, from healthcare to transportation, will be the bedrock of this impact. Our work with the Big Data Hubs, launched by the National Science Foundation in collaboration with universities, enables researchers across fields to achieve concrete data outcomes with Microsoft tools, and it creates a catalyst for the company to build data science tools useful to researchers working in numerous domains.”