Workshop on Data Science Innovation with NSF Big Data Hubs


The Microsoft Research Data Science Outreach Program engages with academic researchers and universities through an ecosystem of resources and partnerships to drive breakthrough impact. Microsoft Research Redmond is pleased to convene a workshop in Redmond to bring together Microsoft researchers plus data scientists with stakeholders from Big Data Innovation Hubs to:

  • Create an environment for better visibility and awareness of Microsoft Research to NSF-funded Big Data Hubs and vice versa
  • Highlight outcomes between Microsoft Research and Big Data Hub-supported projects
  • Highlight collaboration between Microsoft and regional big data projects and explore ideas for collaboration
  • Identify common strategies for advancing data science research and innovation in the US

Please see the Agenda page for detailed session information. Participation is by invitation only.

Program Committee

Vani Mandava, Microsoft Research

René Bastón, Columbia University

Melissa Cragin, UIUC

Meredith Lee, UC Berkeley

Renata Rawlings-Goss, Georgia Tech

Lea Shanley, RENCI



Monday, October 29, 2018

Time (PDT) Session Location
8:00 AM–9:00 AM Breakfast Building 92, Memphis Room
9:00 AM–9:15 AM Welcome
Evelyne Viegas, Director, Artificial Intelligence, Microsoft Research
Vani Mandava, Director Data Science, Microsoft Research
Building 92, Memphis Room
9:15 AM–10:15 AM Session 1: Earth Science
Lucas Joppa, Chief Environmental Officer, Microsoft
Erin Robinson, Executive Director, Earth Science Information Partners
Shashi Shekhar, Distinguished University Professor, University of Minnesota
Building 92, Memphis Room
10:15 AM–10:30 AM Break
10:30 AM–12:00 PM Session 2: Health and Genomics
Geralyn Miller, Director, Microsoft Genomics
Franco Pestilli, Assistant Professor, Psychological and Brain Sciences, Indiana University Bloomington
Gari Clifford, Interim Chair & Associate Professor Biomedical Informatics, Emory University
Braden Tierney, PhD Candidate, Biological and Biomedical Sciences Program, Harvard Medical School
Building 92, Memphis Room
12:00 PM–1:00 PM Lunch (and optional visit to Company Store & Visitor Center) Building 92
1:00 PM–1:30 PM TBA
Adam Fourney, Computer Scientist, Microsoft Research
Besmira Nushi, Researcher, Microsoft Research
Building 92, Memphis Room
1:30 PM–2:00 PM Travel to Building 99
2:00 PM–3:00 PM PhD Summit: Big Data Hub Executive Director Career Panel
René Bastón, Melissa Cragin, Meredith Lee, Renata Rawlings-Goss, Lea Shanley
Moderated by Vani Mandava
Building 99, Room 1919
3:00 PM–3:15 PM Break
3:15 PM–4:30 PM Microsoft Research Lab Tour Building 99
4:30 PM–5:30 PM PhD Poster Session or Informal Networking Building 99 Atrium
5:30 PM–6:00 PM Travel to Dinner
6:00 PM–9:00 PM Dinner at Purple Café & Wine Bar, Bellevue

Tuesday, October 30, 2018

Time (PDT) Session Location
8:00 AM–9:00 AM Breakfast Building 92, Tahoe Room
9:00 AM–9:15 AM Welcome
Vani Mandava, Director Data Science, Microsoft
Building 92, Jolt Room
9:15 AM–9:45 AM Keynote
Eric Horvitz, Technical Fellow, Managing Director, Microsoft Research
Building 92, Jolt Room
9:45 AM–10:15 AM Break
10:15 AM–10:25 AM Thoughts from the National Science Foundation
Meghan Houghton, Strategic Engagements, National Science Foundation
Building 92, Jolt Room
10:30 AM–12:00 PM Session 3: Beacons of Collaboration
Pietro Michelucci, Director, Human Computation Institute
Renata Rawlings-Goss, Data Science, Georgia Institute of Technology
Karen Matthys, Executive Director, Stanford ICME
Ranveer Chandra, Principal Researcher, Microsoft Research
Building 92, Jolt Room
12:00 PM–1:00 PM Lunch (and optional visit to Company Store & Visitor Center)
1:00 PM–2:00 PM Session 4: Data Science for Social Good
Sarah Stone, Co-Executive Director, eScience Institute, University of Washington
Andrew Hoffman, Research Scientist, Data Ecologies Lab (Human Centered Design & Engineering/UW)
Kristin Tolle, Principal Data Scientist, Microsoft Philanthropies
Building 92, Jolt Room
2:00 PM–2:15 PM Wrap-up Building 92, Jolt Room
2:15 PM–2:45 PM Travel to Building 42
3:00 PM–4:30 PM Tour of Azure Cloud Collaboration Center Building 42
4:30 PM Farewell

*Agenda subject to change without notice


A Public Cloud Platform for Large-Scale Data Analysis, Visualization and Sharing of Reproducible Neuroscience Research

Speaker: Franco Pestilli

Neuroscience is at the forefront of science by reaching across disciplinary boundaries and promoting transdisciplinary research. This process can, in principle, facilitate discovery by convergent efforts from theoretical, experimental and cognitive neuroscience, as well as computer science and engineering. To ensure success, mechanisms to guarantee reproducibility of scientific results must be established. Open software development and data sharing are therefore paramount in the quest to achieve reproducibility. We present a platform that addresses challenges of neuroscience reproducibility by providing integrative mechanisms for publishing data and algorithms while embedding them with computing resources to impact multiple scientific communities.

AI for Earth

Speaker: Lucas Joppa

The speed and scale at which climate systems are changing, and the enormity of the human impact of those changes, require a commensurate response in how society monitors, models, and manages climate systems. A key component of that response will emerge from the fundamentals of AI – transforming how we collect data, convert those data into actionable information, and communicate that information across the world. By training increasingly sophisticated algorithms with this unprecedented collection of data on dedicated computational infrastructure, we can combine human and computer intelligence in a way that will allow us to make increasingly informed and optimal choices about today – and tomorrow.

Azure Case Study: A Human/Machine Partnership to Accelerate Biomedical Research

Speaker: Pietro Michelucci

Our greatest opportunity for problem-solving comes not from humans alone or from Artificial Intelligence (AI) alone, but from combining them in distributed networks. Leveraging the complementary abilities of humans and machines allows us to create unprecedented capabilities today. The EyesOnALZ citizen science project accelerates Alzheimer’s disease research by strategically combining machine learning and Crowd AI (human computation) methods. The specific human/machine partnerships that enabled this capability have co-evolved with algorithmic advancements and computing platforms like Azure.

Bad Kappas - the Biggest Problem for Machine Learning for Health?

Speaker: Gari Clifford

Perhaps the most significant barrier to accurate machine learning on medical data is the lack of accurate labels on which to train. PhysioNet has set the gold standard in physiological databases. The last 18 years of PhysioNet/Computing in Cardiology (CinC) Challenges and related work have demonstrated that standard expert labels are far more error-prone than would be expected in a relatively well-established medical field. With such inconsistencies, it is impossible for a diagnostic system to realistically achieve performance measures above 80-90%, which in turn prevents its use without human oversight, an imperative for large-scale analysis. He will discuss solutions, including a voting approach that efficiently combines multiple algorithms (and humans) of varying performance levels to improve both labels and classifier performance.

Building Data Science Capacity at Primarily Teaching Focused Institutions, Minority-Serving Institutions, and Community Colleges

Speaker: Renata Rawlings-Goss

The ability to utilize and understand data is an increasingly critical skill for the evolving 21st-century workforce. Sectors posting data-driven jobs, grants, and opportunities are realizing a critical shortfall in data-literate talent for the positions of today as well as tomorrow. To combat this shortfall, underrepresented groups and schools must be engaged in data science training. This fact has sparked a collective effort to design programs that engage a broader community around STEM and data science. As part of this effort, the South Big Data Hub has created a program, called DataUp, to accelerate data science education across the region.

Data, Predictions, and Decisions in Support of People and Society

Speaker: Eric Horvitz

Eric will share directions and results enabled by the confluence of large-scale data resources, jumps in computational power, and advances in machine learning. He will focus on efforts that leverage learning and inference to assist people with decisions, touching on work in transportation, medicine, and human-machine collaboration.

Data Science for Social Good at the University of Washington eScience Institute

Speaker: Sarah Stone

The UW Data Science for Social Good (DSSG) program partners eScience Institute Data Scientists and Student Fellows from across the country with Project Leads from academia, government, and the private sector to find data-driven solutions to societal challenges. Previous projects span transportation, public health, urban planning, and disaster response. Project-based discussions around ethics, human-centered design and stakeholder collaboration are keystones of our program. Differences in prior experience and training among student fellows can pose a challenge, but become a strength in project work. Our experience supports the notion that DSSG programs can both impact social good and provide data science training for students from diverse disciplinary backgrounds.

Diversity in Data Science: The Women in Data Science Conference and Beyond

Speaker: Karen Matthys

Big Data market revenues are projected to reach $103 billion by 2027. This expanding field presents a great opportunity for women and minorities to take on technical and leadership roles in all sectors. One way that we are addressing this opportunity is through the Women in Data Science Conference (WiDS), which was launched at Stanford just 3 years ago and now reaches over 100,000 people worldwide. This talk will share outcomes from WiDS global collaboration, lessons learned, and future plans. Karen will also cover other interdisciplinary collaborations at Stanford.

Enabling Genomics Discovery in the Cloud

Speaker: Geralyn Miller

This talk will explore why genomics workloads are perfect for the cloud and how Microsoft Research is using the cloud to disrupt genomics discovery.

FarmBeats: Empowering Farmers with Affordable Digital Agriculture Solutions

Speaker: Ranveer Chandra

Data-driven techniques help boost agricultural productivity by increasing yields, reducing losses and cutting down input costs. However, these techniques have seen sparse adoption owing to high costs of manual data collection and limited connectivity solutions. Our solution, called FarmBeats, is an end-to-end IoT & AI platform for agriculture that enables seamless data collection from various sensors, cameras and drones. Our system design explicitly accounts for weather-related power and Internet outages, which has enabled six-month-long deployments on two US farms. This talk will describe the FarmBeats system and outline the AI challenges we are currently addressing for outdoor as well as indoor agriculture.

Giving Credits Where They’re Due: The Value(s) of the Microsoft Research - BD Hubs Partnership

Speaker: Andrew Hoffman

The recent partnership forged between Microsoft Research and the NSF Big Data Regional Innovation Hubs and Spokes furthers a relationship that began as early as 2010. The field of data science has shifted considerably in these intervening years, however, as have the organizational structures tasked with shepherding through data science research, innovation, and education. This presentation provides a biotechnical analysis of this changing landscape. Touching both on high-level science funding policy initiatives and on the more practical work of forging and carrying out collaborations between government, industry, non-profits, and academia to enable cloud-based computational research, it discusses the multiple ways in which participating entities come to valorize these partnerships.

Making Data Matter Together: Opportunities for Collaboration Within Earth Science Data Communities and Strategies to Avoid Overwhelm

Speaker: Erin Robinson

A critical part of effective earth science data and information system interoperability involves collaboration across geographically and temporally distributed communities. The Earth Science Information Partners (ESIP) is a broad-based, distributed community of science, data and information technology practitioners from across science domains, economic sectors and the data lifecycle, primarily based in the United States. Over the last twenty years, ESIP’s open, participatory structure has provided a melting pot for coordinating around common areas of interest like data citation, experimenting with innovative ideas, and capturing and finding best practices and lessons learned from across the network. This talk will provide an overview of relevant activities ESIP is involved with and identify strategies for advancing data science research and innovation through open communities of practice like ESIP and Big Data Innovation Hubs.

Metagenomic Meta-Analysis Illuminates a Vast Universe of Genes in the Human Microbiome

Speaker: Braden Tierney

We do not have a grasp on the scope of the microbiome’s gene content, a question crucial for understanding the role of microbes in host health. To quantify this genetic universe, we undertook a meta-analysis of 3,500 human shotgun-sequencing samples from two body sites, the mouth and gut. We found that prior work has drastically underestimated the genetic richness of human microbiota by tens of millions of genes. These results serve as an explanation for the large heterogeneity of microbiome-derived human phenotypes, a path forward for gene-centric approaches in microbiome studies, and a quantification of the need for larger-scale metagenomic analyses than currently exist.

Thoughts from the National Science Foundation

Speaker: Meghan Houghton

An update on the National Science Foundation’s investments in data science, including through the Big Data Regional Innovation Hubs program and Harnessing the Data Revolution—one of NSF’s 10 Big Ideas for Future Investment.

Using AI for Social Good

Speaker: Kristin Tolle

As Microsoft expands its AI portfolio of products, we have a keen interest in leveraging these technologies to enable people to do more. This is particularly critical in the nonprofit space where organizations are facing the world’s most challenging problems—from ensuring food and water security to enabling safer and more reliable disaster response. The Tech for Social Good team, and in particular AI for Humanitarian Action, is working to build scalable, reusable solutions with nonprofits so that those with similar needs and missions can do more to help others build better lives. AI for Humanity is the third and most recently launched pillar of this applied AI mission to provide solutions to organizations aligned with our mission. This talk will briefly cover our mission and engagement model and discuss some of the solutions we are building for this community.

What Is Special About Spatial Data Science?

Speaker: Shashi Shekhar

Spatial big data, such as trajectories and satellite imagery, have transformed our society via popular applications for navigation, ride-sharing, precision agriculture, public health, and public safety. It is only a start, and bigger opportunities lie ahead. However, classical one-size-fits-all data science methods are grossly inadequate for analyzing spatial data due to severe problems such as gerrymandering and the very high cost of spurious patterns. To overcome the limitations of traditional data science, this presentation will summarize recent developments (spatial Hadoop, spatial statistics, spatial data mining, nano-satellites, high-definition roadmaps) and call for community action to improve data science curriculum and computational platforms.