eScience Workshop 2012

About

The ninth annual Microsoft eScience Workshop was held October 8 and 9 at the Hyatt Regency Chicago in conjunction with the IEEE International Conference on eScience 2012. Discussions and presentations once again related to the theme of eScience in Action. In addition to sessions on a variety of topics, we announced the winner of the Microsoft Research 2012 Jim Gray eScience Award at the workshop. Microsoft Research bestows this annual award on a researcher who has made an outstanding contribution to the field of data-intensive computing.

Jim Gray eScience Award Winner Announced

Antony John Williams was announced as the winner of the 2012 Jim Gray eScience Award at this year’s eScience Workshop. Vice president of strategic development and head of Chemoinformatics for the Royal Society of Chemistry, Antony has pursued a career built on rich experience in experimental techniques, implementation of new nuclear magnetic resonance technologies, research and development, and teaching, as well as analytical laboratory management. He has been a leader in making chemistry publicly available through collective action: his work on ChemSpider helps provide fast text and structure search access to data and links on more than 28 million chemicals, and this resource is freely available to the scientific community and the general public.

About the Workshop

Each year, the Microsoft Research eScience Workshop provides a forum for scientists and researchers to share their experiences and expertise with the academic and research communities. The eScience Workshop fosters collaboration, facilitates the sharing of software components and techniques, and defines rich, open scientific challenges. Microsoft has been actively pursuing research in eScience for more than 10 years; the book, The Fourth Paradigm: Data-Intensive Scientific Discovery, provides a background on its many areas of focus.

Agenda with Abstracts

Monday, October 8, 2012

Welcome

Speaker: Tony Hey, Microsoft Research | slides

Keynote: Defensible Modeling of the Biosphere

Chair: Kristin Tolle, Microsoft Research

Speaker: Drew Purves, Microsoft Research | video | slides

To manage the planet on which we all depend, we need to predict the future outcome of various options. How would biofuel subsidies affect crop prices affect deforestation? CO2 emissions affect climate change affect fire? At present, we cannot make such predictions with any confidence. But, as I’ll show in this talk, a computational approach to environmental science can change that. I’ll explain how we built the first fully data-constrained model of the terrestrial carbon cycle, using Big Data, cloud computing, and machine learning. And I’ll demo similar models for global food production, Amazon deforestation, and bird biodiversity. The prototype tools on which these models have been built—for example, FetchClimate, Filzbach, WorldWide Telescope—are freely available, and will hopefully allow other scientists to adopt a rigorous approach to modeling the complexities of the biosphere.

Open Data for Open Science—Data Interoperability

Chair: Yan Xu, Microsoft Research

Panel: Open Data for Open Science—Data Interoperability | video

Speakers:

  • Robert Gurney, University of Reading
  • Philip Murphy, University of Redlands | slides
  • Karen Stocks, University of California, San Diego | slides
  • Yan Xu, Microsoft Research | slides
  • Ilya Zaslavsky, University of California, San Diego | slides

The goal of cross-domain interoperability is to enable reuse of data and models outside the original context in which these data and models are collected and used and to facilitate analysis and modeling of physical processes that are not confined to disciplinary or jurisdictional boundaries. A new research initiative of the U.S. National Science Foundation, called EarthCube, is developing a roadmap to address challenges of interoperability in the Earth sciences and create a blueprint for community-guided cyberinfrastructure accessible to a broad range of geoscience researchers and students.

The panel will discuss this and related initiatives and projects, focusing on challenges of data discovery, interpretation, access, and integration across domain information systems, assessment of their readiness for cross-domain integration, and technologies enabling interoperability in the geosciences.

General Informatics

Chair: Kristin Tolle, Microsoft Research

Panel: Enabling Multi-Scale Science | video

Speakers:

  • Roberto Cesar, University of Sao Paulo (USP) | slides
  • James Hunt, University of California, Berkeley | slides
  • Claudia Bauzer Medeiros, University of Campinas (UNICAMP) | slides

eScience research increasingly involves the need to facilitate multi-scale problem solving that spans wide ranges in space and time scales. It requires collaboration among researchers and practitioners from multiple disciplines, each with their own orientations towards problem identification, solution formulation, and implementation. The panel aims to discuss some of the challenges of working in multi-scale scenarios.

Panelists will present these challenges from two perspectives: application, and computing approaches. The first perspective will focus on issues such as scientific profiles involved, scales considered, data collected and produced, models and visualization needs. The second viewpoint will consider, among others, characteristics of data and storage structures to accommodate the wide variety of data scales and formats, language/workflow constructs that may facilitate the specification, execution and interaction of models, and interface/interaction primitives.


The Internet of Databases—Generalizing the Archaeo Informatics Approach | video | slides

Speaker: Chris van der Meijden, Ludwig Maximilians University of Munich, Germany

One thing we have learned from our Archaeo-Data-Network is that the meta information of databases needs to be split into two levels. The first level contains a centralized unique ID and very little standard information. The second level of meta information is defined by the archaeo scientist. This can be implemented for any kind of archaeo database, so the network’s extensibility is virtually unlimited. The advantage of this dual meta approach is its flexible connectivity, which makes comprehensive data transparently available for general searching and mining. With this approach, huge, rigid archives can be connected to small, flexible databases for scientific analysis in any scientific domain. Combined with simple authorization management for unpublished data, we see in our system the potential to be the general blueprint for an eScience infrastructure that we call the Internet of databases.
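The two-level split described above can be sketched as a small data structure. This is only an illustration: the abstract gives no schema, so the class names (DatabaseRecord, Registry), the level-one fields (title, owner, published), and the search rule are all assumptions, not the Archaeo-Data-Network's actual design.

```python
from dataclasses import dataclass, field
from typing import Any
import uuid


@dataclass
class DatabaseRecord:
    """One registered database, described by two levels of meta information."""
    # Level 1: centralized unique ID plus a few standard fields (assumed names).
    central_id: str
    title: str
    owner: str
    published: bool = True
    # Level 2: free-form metadata defined by the scientist who owns the database.
    custom: dict[str, Any] = field(default_factory=dict)


class Registry:
    """Hypothetical central index that searches across both metadata levels."""

    def __init__(self) -> None:
        self._records: dict[str, DatabaseRecord] = {}

    def register(self, title: str, owner: str, published: bool = True,
                 **custom: Any) -> DatabaseRecord:
        record = DatabaseRecord(str(uuid.uuid4()), title, owner, published, dict(custom))
        self._records[record.central_id] = record
        return record

    def search(self, term: str, include_unpublished: bool = False) -> list[DatabaseRecord]:
        term = term.lower()
        hits = []
        for record in self._records.values():
            if not record.published and not include_unpublished:
                continue  # simple authorization rule for unpublished data
            haystack = [record.title, record.owner, *map(str, record.custom.values())]
            if any(term in text.lower() for text in haystack):
                hits.append(record)
        return hits


registry = Registry()
registry.register("Bone finds", "Lab A", species="Bos taurus", period="Neolithic")
registry.register("Museum archive", "Museum B", published=False, holdings="ceramics")
print([r.title for r in registry.search("neolithic")])  # prints ['Bone finds']
```

Because level two is an open dictionary, a large archive and a small project database can both register without agreeing on a shared schema beyond the level-one fields.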


Combining Semantic Tagging and Support Vector Machines to Streamline the Analysis of Animal Accelerometry Data | video | slides

Speaker: Nigel Ward, The University of Queensland

Increasingly, animal biologists are taking advantage of low cost micro-sensor technology by deploying accelerometers to monitor the behaviour and movement of a broad range of species. The result is an avalanche of complex tri-axial accelerometer data streams that capture observations and measurements of a wide range of animal body motion and posture parameters. We present a system that supports storing, visualizing, annotating and automatic recognition of activities in accelerometer data streams by integrating semantic annotation and visualization services with Support Vector Machine techniques.

Handling Big Data for the Environmental Informatics

Chair: Yan Xu, Microsoft Research | slides

Panel: Handling Big Data for the Environmental Informatics / Real-Time Environmental Observation, Modeling, and Decision Support | video

Speakers:

  • Jeff Dozier, University of California, Santa Barbara | slides
  • David Maidment, University of Texas, Austin | slides
  • Barbara Minsker, University of Illinois, Urbana-Champaign | slides
  • Chaowei Yang, George Mason University | slides

Earth observations and other environmental data collection methods help us accumulate terabytes to petabytes of data. This poses a grand challenge to informatics for environmental studies. We propose this session to capture the latest developments in several aspects of Big Data collection, processing, and visualization.

With increasing near-real-time availability of embedded and mobile sensors, radar, satellite, and social media, the opportunities to improve understanding, modeling, and management of environmental systems, as well as the built and human systems that interact with environmental systems, are immense.

Active Publications

Chair: Dennis Gannon, Microsoft Research

Active Publications | video

Speakers:

  • Ian Foster, University of Chicago and Argonne National Laboratory | slides
  • Tanu Malik, University of Chicago and Argonne National Laboratory | slides

The e-Science domain brings together scientists, experts, and engineers to build comprehensive, large-scale data and computational cyberinfrastructures. The objective is to advance knowledge discovery in the sciences and establish effective channels of communication between the various disciplines. Software, data, workflows, technical reports, and publications are often the modes of this communication. However, all these modes of communication are currently disconnected from each other.

E-publishing is changing the nature of scientific communication through digital publication repositories and libraries. But the larger and more pertinent issue is connecting these still-static digital e-publication repositories to large amounts of computation, data, derived data, and extracted information.

The Cloud and Big Data

Chair: Kenji Takeda, Microsoft Research | slides

Panel: Cloud Computing – What Do Researchers Want? | video

Speakers:

  • Fabrizio Gagliardi, Microsoft Research | slides
  • Dennis Gannon, Microsoft Research | slides
  • Marty Humphrey, University of Virginia | slides
  • Paul Watson, Newcastle University | slides

Cloud computing for science is seeing take-up in many disciplines, but many researchers are skeptical. In this panel session we will discuss:

  • How researchers are using the cloud today
  • What they want/need for the future
  • Why they might not want to use the cloud

Machine-Assisted Thought

Chair: Harold Javid, Microsoft Research

Machine-Assisted Thought | video | slides

Speaker: Michael J. Kurtz, Harvard-Smithsonian Center for Astrophysics

I suggest that there are two distinct branches of eScience, both fundamentally enabled by the explosion of capabilities inherent in the information age. The first concerns the use of numbers, measurements from arrays of sensors, outputs from simulations, and so forth. The techniques of eScience increase our ability to perceive massive amounts of data by factors of billions or trillions. I call this Machine Assisted Perception.

The second branch of eScience concerns the use of words, the verbal abstractions used by humans to communicate ideas. The new technologies of digital libraries and search engines have already substantially changed the scholarly thought process, and growth in the capabilities of these technologies continues to be rapid. I call this machine/human collaboration Machine Assisted Thought.


DemoFest | video | slides

Chair: Jim Pinkelman, Microsoft Research

Layerscape: Tools for Collaborative Analysis of Complex Data

Presenter: Rob Fatland, Microsoft Research

Layerscape is a set of (combined cloud/desktop) data visualization and collaboration tools provided at no cost by Microsoft Research. We describe how these tools (visualization engine, developer toolkit, RESTful API, Excel add-in, story authoring environment, collaboration/sharing website) can provide researchers and developers a way of addressing data deluge problems commonly faced in geoscience research. As a particular case study, we will discuss unfolding data streams from many sensors operated from autonomous underwater vehicles during a September 2012 experiment conducted by the Monterey Bay Aquarium Research Institute (MBARI) off the California coast. Additional visualizations will also be available for perusal and discussion, and may be freely searched and viewed at the support website.

Globus Online: Research Data Management as a Service

Presenters: Ian Foster, University of Chicago and Argonne National Laboratory; Steve Tuecke, University of Chicago; Vas Vasiliadis, University of Chicago

In millions of labs worldwide, researchers struggle with massive data, advanced software, complex experimental protocols, and burdensome reporting. The emergence of cloud computing offers the opportunity to accelerate discovery and innovation while reducing costs by outsourcing time-consuming information technology tasks from individual labs and institutions to third-party providers. Over the past two years, we have developed a cloud-hosted, high-performance data movement service that is currently used by thousands of researchers at campuses and institutions worldwide. We are expanding the capabilities we offer en route to our goal of delivering a comprehensive research data management solution comprising storage, sharing, cataloging, archiving, and other critical functions as a service. We expect these services will be particularly valuable to those investigators in small and medium-sized laboratories that face significant challenges in developing, deploying, and operating IT infrastructure to support their work.

The Open-Source ISA Metadata Tracking Framework: from Data Curation and Management at the Source, to the Linked Data Universe

Presenter: Eamonn Maguire, University of Oxford

Minimum reporting guidelines, terminologies, and formats (referred to generally as community standards) are increasingly used in the structuring and curation of datasets, enabling data annotation to varying degrees and reproducible research. But how can we enable researchers to make use of existing community standards, maximize curation and sharing, and subsequently reuse richly annotated experimental information? A successful example is provided by the Investigation/Study/Assay (ISA) open source, metadata tracking framework supported by the growing ISA Commons community.

SOLE: Connecting Publications to Large Online Data Repositories

Presenter: Tanu Malik, University of Chicago and Argonne National Laboratory

The exponential growth in the amount of scientific data means that revolutionary measures are needed for data management, analysis and accessibility. Online scientific databases—such as the SkyServer in astronomy, the Protein Data Bank in biology, and the PubChem in chemistry—are important repositories for publishing and accessing large scientific datasets. These databases have also become sources for new scientific research; researchers routinely interact with these repositories to search, download, and analyze relevant datasets. However, these interactions remain largely disconnected with the final outcomes of research, such as publications and journal articles. We will demonstrate components of the Science Object Linking and Embedding (SOLE) system, which aims to create interactive publications and make it easy to capture interactions with the online databases and associate them with publications.

DataUp: A Tool for Documenting and Sharing Scientific Tabular Data

Presenter: Carly Strasser, California Digital Library

DataUp is a project sponsored by Microsoft Research and the Gordon and Betty Moore Foundation, conducted at the University of California Curation Center of the California Digital Library. The project’s goal was to develop tools that help researchers document, organize, preserve, and share their scientific data. We focused on assisting Earth, environmental, and ecological scientists, since these groups historically have not practiced good data stewardship. In this session, we will demonstrate the DataUp add-in for Excel and the DataUp web application. Both the add-in and the web application perform four main tasks:

  • Perform a best practices check to ensure good data organization
  • Help guide the user through creation of metadata for their Excel file
  • Help the user obtain a unique identifier for their dataset
  • Connect the user to a DataONE repository, where their data can be deposited and shared with others
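As an illustration of the first task in the list above, here is a minimal sketch of what a best-practices check on tabular data might look like. The abstract does not say which rules DataUp applies, so the rules below (blank or duplicate column names, empty rows, rows of the wrong length) and the function name check_table are assumptions, not DataUp's implementation.

```python
import csv
import io


def check_table(text: str) -> list[str]:
    """Return a list of human-readable problems found in CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    problems = []
    header = [cell.strip() for cell in rows[0]]
    if any(not name for name in header):
        problems.append("header has blank column names")
    if len(set(header)) != len(header):
        problems.append("header has duplicate column names")
    # Data rows are numbered from 2 so messages match spreadsheet row numbers.
    for number, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            problems.append(f"row {number} is empty")
        elif len(row) != len(header):
            problems.append(f"row {number} has {len(row)} cells, expected {len(header)}")
    return problems


# Hypothetical sample: one blank row and one row missing a value.
sample = "site,date,temp_c\nA,2012-10-08,11.5\n\nB,2012-10-09\n"
issues = check_table(sample)
for issue in issues:
    print(issue)
```

Returning a list of messages rather than raising on the first problem lets a tool show the user everything to fix in one pass.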

Databib: An Online Catalog of Research Data Repositories

Presenter: Michael Witt, Purdue University

Databib is a free, global, online catalog of research data repositories. Librarians and other information professionals have identified and cataloged more than 300 data repositories that can be easily browsed and searched by users or integrated with other platforms or cyberinfrastructure. Databib can help researchers find appropriate repositories to deposit their data, and it gives consumers of data a tool to discover repositories of datasets that meet their research or learning needs. Users can submit new repositories to Databib, which are reviewed and curated by an international board of editors. All information from Databib has been contributed to the public domain using the Creative Commons Zero protocol. Supported machine interfaces and formats include RSS, OpenSearch, RDF/XML, Linked Data (RDFa), and social networks such as Twitter, Facebook, and Google+.

12,000 Human Genomes from Raw Sequence to Result, on Windows and Windows Azure

Presenter: Dong Xie, Oxford University

At the 2010 eScience Workshop, I presented my work “SYSQ – Questionnaire System for Large Scale Depression Study.” Now, two years later, we are finishing the phenotype collection, and these data have already enabled us to publish more than 12 papers in various journals from an epidemiological perspective; the next round of papers, on the complete dataset, is in the making. Meanwhile, every two weeks, we are receiving external hard drives from a sequencing centre (2TB in size each), full of raw genome sequences coming from our patients and controls. These data need to be processed and associated with the phenotype so that we can finally find the gene for depression, after several years of hard work. This task is by no means trivial. The processing pipeline needs to be built from scratch. It puts pressure on both IT and bioinformatics; with limited resources and no previously published work to draw on, one really needs to think outside the box.

OData and Environmental Informatics

Presenter: Yan Xu, Microsoft Research

We will demonstrate how the Open Data Protocol, OData, can be used to release scientific data from silos. The demo will showcase examples of using OData as the glue to seamlessly solve data interoperability problems among heterogeneous data sources.
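To make the idea concrete, here is a minimal sketch of how a client might build an OData query URL. The system query options $filter, $select, and $top are part of the OData protocol; the service root, the entity set name (Observations), the property names, and the helper odata_url are hypothetical and are not taken from the demo.

```python
from typing import Any
from urllib.parse import quote


def odata_url(service_root: str, entity_set: str, **options: Any) -> str:
    """Build an OData query URL from system query options such as filter or top."""
    # OData system query options are prefixed with "$" in the query string.
    query = "&".join(
        f"${name}={quote(str(value), safe=',')}" for name, value in options.items()
    )
    url = f"{service_root.rstrip('/')}/{entity_set}"
    return f"{url}?{query}" if query else url


url = odata_url(
    "https://example.org/odata",   # hypothetical service root
    "Observations",                # hypothetical entity set
    filter="Station eq 'A1' and Temp gt 10",
    select="Station,Time,Temp",
    top=5,
)
print(url)
```

Because every OData source accepts the same query options, the same helper could in principle be pointed at different services by changing only the service root and entity set.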

Tuesday, October 9, 2012

Keynote

Biology: A Move to Dry Labs | video

Chair: Dan Fay, Microsoft Research | slides

Speaker: David Heckerman, Microsoft Research | slides

Since its beginning, the wet lab has been the key driver in biological discovery. Recently, however, more and more science is getting done in dry labs, those where only computational analysis is done. The presentation will include examples, ranging from genomics to vaccine design.

Data Scientists: Part I

Chair: Gail Steinhart, Cornell University

Panel: Educating Data Scientists for Scientific Data | video

Moderator: Gail Steinhart, Cornell University

Teaching Scientific Data Management in Data Science Education and Workforce Development Programs for Science Communities | video | slides

Speaker: Robert R. Downs, Columbia University

Recent popularity of data science has led to increased recognition of the need for education and workforce development in data science. However, definitions of the term, data science, vary and often focus on techniques for data analytics and visualization, omitting scientific data management and related topics associated with data policy, stewardship, and preservation. Scientific data management encompasses a variety of concepts and methods to foster continuing access and long-term stewardship of data for current and future users. Considering the needs for scientific data management knowledge and capabilities to facilitate improved and persistent accessibility and use of scientific data throughout the data lifecycle, instruction on topics in scientific data management is recommended for data science education and workforce development programs for science communities.


Educating Scientists About the Data Life Cycle | slides

Speaker: William Michener, University of New Mexico

The research life cycle is well known and consists of an initial idea or question that, if sound, leads to submission and funding of a proposal, implementation of a study, and, ideally, to one or many publications that advance the state of knowledge. What is less well understood is how the research life cycle is related to the data life cycle. In this presentation we discuss approaches for educating scientists in eight phases of the data life cycle (namely, planning, data acquisition and organization, quality assurance/quality control, data description, data preservation, data exploration and discovery, data integration, and analysis and visualization). Specifically, we will look at the design and approaches used for developing learning modules, instructional material and resources, and an innovative three-week experiential course that enable participants to more efficiently and effectively manage their research data and compete for research funding.


Priorities for Data Curation Education: Data Center Partnerships and Long-Tail Science | video | slides

Speaker: Carole Palmer, University of Illinois at Urbana-Champaign

For science to fully exploit digital data in new and innovative ways, research data will need to be collected, curated, and made accessible and usable across domains. The need for workforce development in data curation systems and services has been recognized for many years, and education programs are beginning to mature. But to continue to build strong programs in this emerging field, current data curation practice and research need to underpin goals for professional education. Having established a specialization in data curation in 2006, we have assessed our program’s progress to date and identified areas in need of further development to respond to trends in e-science. Analysis of student placements shows interesting trends in the institutions hiring data curation specialists and the nature of the positions, and evaluation of internships provided in national data centers has suggested important areas for further investment. In addition, our recent research on disciplinary differences in data sharing and the value of long-tail data in the sciences has direct implications for further development of data curation curriculum.


Educating a New Breed of Data Scientists for Scientific Data Management | video | slides

Speaker: Jian Qin, Syracuse University

Data scientists play active roles in the design and implementation work of four related areas: data architecture, data acquisition, data analysis, and data archiving. While any data- and computing-related academic unit could offer a data science program or curriculum, each has its own flavor: statistics would lean heavily toward data analytics, and computer science toward computational algorithms. The information schools are taking a more holistic approach in educating data scientists. This presentation reports on the data science curriculum development and implementation at Syracuse iSchool, which has been shaped by the quickly changing, data-intensive environment not only for science but also for business and research at large. Research projects that we conducted on scientific data management with participation from the e-science student fellows demonstrate the need for and significance of educating the new breed of data scientists who have the knowledge and skills to take on the work in the four related areas mentioned above.

Citizen Science and Big Data

Chair: Chris Mentzel, Gordon and Betty Moore Foundation


The Utility of a Human/Computer Learning Network For Improving Biodiversity Conservation and Research in eBird | video | slides

Speaker: Carl Lagoze, University of Michigan

We describe our work to improve the quality and utility of citizen science contributions to eBird, arguably the largest biodiversity data collection project in existence. Citizen science (the use of “human sensors”) is especially important in a number of observation-based fields, such as astronomy, ecology, and ornithology, where the scale and geographic distribution of phenomena to be observed far exceed the capabilities of the established research community. Our work is based on the notion of a Human/Computer Learning Network, in which the benefits of active learning (in both the machine learning sense and human learning sense) are cyclically fed back among human and computational participants.


Tools and Techniques for Outreach and Popular Engagement in eScience | video | slides

Speaker: Rafael Santos, Instituto Nacional de Pesquisas Espaciais

Public participation in scientific research takes many forms: participation of volunteers in citizen science projects, monitoring of natural resources and phenomena, volunteering of computational resources for distributed data analysis tasks, and so forth.

In this presentation, we comment on some of the computational tools, techniques, and case studies of applications that enable active public participation in scientific research. Of particular interest are applications that showcase the benefits of letting the public use the professional resources (in other words, the same data and computational resources that the scientists have access to) and return something back to the research behind it, such as applications that go beyond simple publication of scientific data or applications that use novel methods for user engagement. Examples of applications for scientific outreach that use specialized computational tools or techniques, and/or educational approaches, are also discussed.


Big Data Processing on the Cheap | video | slides

Speaker: Joe Hummel, University of California, Irvine

Getting started with big data? Generating more and more data without the hardware resources to process it? This session will help newcomers to “big data” get started processing and visualizing their data, without the need for expensive computing resources. While these techniques may not produce lightning-fast results, you can at least get started with your analysis.

Data Scientists: Part II

Chair: Kenji Takeda, Microsoft Research

What Is a Data Scientist? | video

Speakers:

  • Liz Lyon, UKOLN-DCC, University of Bath UK | slides
  • Kenji Takeda, Microsoft Research

The term, data scientist, is becoming prevalent in science, engineering, business, and industry. We will explore how the term is used in different contexts, segments, and sectors; we will examine the different variants, flavours, and interpretations and try to answer the following questions:

  • What does a data scientist really do?
  • What skills does a data scientist need? How do they acquire them?
  • What tools, technologies and platforms are used by data scientists?
  • How can we build data scientist capacity and capability for the future?

Informatics, Information Science, Computer Science, and Data Science Curricula | video | slides

Speaker: Geoffrey Fox, Indiana University

We describe a possible data science curriculum based on discussions at Indiana University and experience with our Informatics, Computer Science, and Library and Information Science programs. This leads to an interesting breadth of courses and students’ interests, which could address the many job opportunities. We suggest a collaboration to build a MOOC (online) offering with one initial target: minority serving institutions.


Data Science Curricula at the University of Washington eScience Institute | video | slides

Speaker: Bill Howe, University of Washington

The University of Washington eScience Institute is engaged in a number of educational efforts in data science, including certificate programs for professionals, workshops for students in domain science, a new data-oriented introductory programming course, and a data science MOOC to be offered through Coursera in the spring. We consider the tools, techniques, research topics, and skills to be well-aligned with the data-driven discovery emphasis of eScience itself—the only difference is the applications.

We see several benefits in aligning these two areas. For example, students in science majors who are not pursuing research careers become more marketable. In the other direction, working professionals see opportunities to apply their skills to solve science problems—we have recruited volunteers from industry in this way. In this talk, I’ll discuss these activities, review our curriculum, and describe our next steps.


Publishing and eScience | video

Co-Chairs: Mark Abbott, Oregon State University; Jeff Dozier, University of California, Santa Barbara

Scientific Publishing in a Connected, Mobile World | slides

Speaker: Mark Abbott, Oregon State University

New tools for content development and new distribution channels create opportunities for the scientific community, opening new venues for collaboration, review, and self-publication. However, publishing is at the heart of the culture of science, and several centuries of experience with publishing in journals will not simply vanish. Issues of peer review, reproducibility, integrity, and scientific context will need to be addressed before these new tools take hold.

Open access is but one part of this conversation.


How to Collaborate with the Crowd: a Method for “Publishing” Ongoing Work | slides

Speaker: Jeff Dozier, University of California, Santa Barbara, Visiting Researcher, Microsoft Research

The typical model for interdisciplinary research starts with a small-group partnership, typically with colleagues who have known each other for a while. They learn to articulate problems across disciplinary boundaries and discover shared interests. They successfully seek funding, and work together for several years. This model works, but can be cumbersome. An alternative model is to express a sequence of processes and data that integrate to create a suite of data products, and to identify insertion points where expertise from another perspective might be able to contribute to a better solution.


When Provenance Gets Real: Implications of Ubiquitous Provenance for Scientific Collaboration and Publishing | slides

Speaker: James Frew, University of California, Santa Barbara

We expect (or hope?) that the impending standardization of data models, ontologies, and services for information provenance will make scientific collaboration easier and scientific publishing more transparent. We propose a panel of active producers and users of provenance who will address scenarios such as:

  • “I’m a scientist, and this is what I would really like to tell someone with provenance.”
  • “I’m a scientist, and this is what I wish provenance would tell me when I use your data, join your project, or …”
  • “I build systems that capture and/or manage provenance, and this is what I’ve seen scientists actually do when they create and/or use provenance.”

Data Journal Challenge for the Fourth Paradigm-Trust through Data on Environmental Studies and Projects | slides

Speaker: Shuichi Iwata, The Graduate School of Project Design

Landscapes on recent big data issues to bridge environmental studies and social expectations are reviewed to design an e-Journal with data files and models. Data parts are keys to give semantics to original scientific papers, and also double keys for computational models. Structured data with explicit descriptions about their metadata can be managed and their traceability can be realized systematically, step by step. However, almost all available data are unstructured, fragmented, and contain ambiguities and uncertainties. Balances between data quality and freshness/costs/coverage are discussed so as to draw a road map for a data journal, referring to two preliminary case studies on materials data and data from nuclear reactor accidents and problems.

Data Curation

Chair: Kristin Tolle, Microsoft Research

Panel: Scientific Data: the Current Landscape, Challenges, and Solutions | video | slides

Moderator: Carly Strasser, California Digital Library

Speakers:

  • Jeff Dozier, University of California, Santa Barbara
  • Chris Mentzel, Gordon and Betty Moore Foundation
  • William Michener, University of New Mexico
  • Dave Vieglais, The University of Kansas
  • Stephanie Wright, University of Washington

Funders, researchers, and public stakeholders increasingly see the need to better communicate and curate ever expanding bodies of research data. This panel will bring together many of the stakeholders in the scientific data community, including researchers, librarians, and data repositories.

Before the panel commences, we will provide a brief introduction to scientific data to facilitate discussion. We will describe the current landscape of scientific data and its management, including publication, citation, archiving, and sharing of data. We will also describe existing tools for data management. The panel discussion will focus on identifying gaps and unmet needs in order to help chart a path for future policy, service, and infrastructure development.


Novel Approaches to Data Visualization | video

Chair: George Djorgovski, California Institute of Technology


Data Visualization in Virtual Spaces and High Dimensions | slides

Speaker: George Djorgovski, California Institute of Technology

Visualization is a bridge between the quantitative content of data and human intuition and understanding. Effective visualization is a critical bottleneck as the complexity and dimensionality of data increase. I will describe some experiments in collaborative, multi-dimensional data visualization in immersive virtual reality.


CT and Imaging Tools for Windows HPC Clusters and Azure Cloud | slides

Speaker: Darren Thompson, CSIRO (Advanced Scientific Computing)

Computed Tomography (CT) is a non-destructive imaging technique widely used across many scientific, industrial, and medical fields. It is both computationally and data intensive. Our group within CSIRO has been actively developing X-ray tomography and image processing software and systems for GPU-enabled Windows HPC clusters.

A key goal of our systems is to provide our “end users”—researchers—with easy access to the tools, computational resources, and data via familiar interfaces and client applications without the need for specialized HPC expertise. We have recently explored the adaptation of our CT-reconstruction code to the Windows Azure cloud platform, for which we have constructed a working “proof-of-concept” system. However, at this stage, several challenges remain to be met in order to make it a truly viable alternative to our HPC cluster solution.


Work in Progress Toward Enhancing Multidimensional Visualization with Analytical Workflows | slides

Speaker: Dawn Wright, Environmental Systems Research Institute

Big Data, particularly from terrestrial sensor networks and ocean observatories, exceed the processing capacity and speed of conventional database systems and architectures, and require visualization in three and four dimensions in order to understand the Earth processes at play. Successfully addressing the scientific challenges of Big Data requires integrative and innovative approaches to developing, managing, and visualizing extensive and diverse data sets, but is also critically dependent on effective analytical workflows. This talk will present an emerging agenda and work in progress toward this end at Environmental Systems Research Institute.

Announcement of Jim Gray eScience Award Recipient

Host: Tony Hey, Microsoft Research | video (subsequent keynote address also on this video) | slides

Keynote: The Possibilities and Pitfalls of Internet-Based Chemical Data

Chair: Tony Hey, Microsoft Research

Speaker: Antony John Williams, Royal Society of Chemistry | video (Jim Gray Award precedes keynote on this video) | slides

In less than a decade, the Internet has provided us access to enormous quantities of chemistry data. Chemists have embraced the web as a rich source of data and knowledge. However, all that glisters is not gold: while online searches can now provide us access to information associated with many tens of millions of chemicals and allow us to traverse patents, publications, and public-domain databases, the promise of high-quality data on the web needs to be tempered with caution.

In recent years, the crowdsourcing approach to developing curated content has been growing. Can such approaches allow us to bring to bear the collective wisdom of the crowd to validate and enhance the availability of trusted chemistry data online or are algorithms likely to be more powerful in terms of validating data? While it is now possible to search the web by using a query language form natural to chemists—that of “structure searching the web”—increasingly, scientists are likely going to have to accept joint responsibility for the quality of data online for the foreseeable future. Their participation is likely to come through engaging in open science, the provision of data under open licenses, and by offering their skills to the community.

This presentation will provide an overview of the present state of chemistry data online, the challenges and risks of managing and accessing data in the wild, and how an Internet for chemistry continues to expand in scope and possibilities.

Speakers

Mark Abbott

Mark R. Abbott is dean and professor in the College of Earth, Ocean, and Atmospheric Sciences at Oregon State University (OSU). He received his B.S. in Conservation of Natural Resources from the University of California, Berkeley, and his Ph.D. in Ecology from the University of California, Davis. He has been at OSU since 1988 and has been dean of the College since 2001. Prior to coming to OSU, he was a member of the technical staff at the Jet Propulsion Laboratory and a research oceanographer at Scripps Institution of Oceanography. His research focuses on the interaction of biological and physical processes in the upper ocean and relies on both remote sensing and field observations. He is funded by the Office of Naval Research (ONR) to explore advanced computer architectures for use in undersea platforms. He served a six-year term on the National Science Board, which oversees the National Science Foundation. He is vice chair of the Oregon Global Warming Commission, which is leading the state’s efforts in mitigation and adaptation strategies in response to climate change. He is a member of the Board of Trustees for the Consortium for Ocean Leadership. He is president-elect of The Oceanography Society.

Roberto Cesar

Roberto Cesar has been full professor in the Department of Computer Science – IME at the University of São Paulo (USP) since 2008 and is also director of the Bioinformatics Research Center at USP. He graduated in Computer Science from Universidade Estadual Paulista Julio de Mesquita Filho (IBILCE – UNESP), and received his M.S. in Electrical Engineering from Universidade Estadual de Campinas (UNICAMP) and his Ph.D. in Physics from USP. He is a member of the Coordination Area of Computer Science of FAPESP and of the Evaluation Committee Capes (computer science). He has experience in computer science, with emphasis on graphics processing (graphics), acting on the following subjects: computer vision, pattern recognition, image processing, and bioinformatics.

George Djorgovski

George Djorgovski is a professor of Astronomy at the California Institute of Technology (Caltech). He was also a co-director of the Center for Advanced Computing Research at Caltech, and the director of the Meta Institute for Computational Astrophysics, the first professional scientific organization based entirely in virtual worlds. After receiving his Ph.D. from the University of California, Berkeley, he was a Harvard Junior Fellow, before joining the Caltech faculty in 1987. He was a Presidential Young Investigator and an Alfred P. Sloan Foundation Fellow, among a number of other honors and distinctions, and he has authored or co-authored several hundred professional publications. His astrophysical interests include digital sky surveys; exploration of observable parameter spaces; formation and early evolution of quasars, galaxies, and other cosmic structures; and the nature of the dark energy. He was one of the founders of the Virtual Observatory concept, was the chairman of the U.S. National Virtual Observatory Science Definition Team, and is now working on the foundations of the emerging discipline of AstroInformatics. His e-Scientific interests include definition and development of the universal methodology, tools, and frameworks for data-intensive and computationally-enabled science; various aspects of data mining; virtual scientific organizations; and novel approaches to data visualization.

Robert Downs

Robert R. Downs is a senior staff associate officer of research at Columbia University and serves as the senior digital archivist and the acting head of cyberinfrastructure and informatics research and development at the Center for International Earth Science Information Network (CIESIN), a research and data center of the Earth Institute of Columbia University. He has been developing, managing, and conducting research on information systems for more than 20 years and currently focuses on data management and stewardship, data policy, software reuse, digital preservation, and business process design and evaluation.

Downs has served as the principal investigator or co-investigator on various projects, and has authored and co-authored numerous articles for refereed journals and proceedings. He has taught courses in management and computer science, has lectured in workshops on many topics, and has served in leadership positions on working groups, editorial boards, and program committees.

Jeff Dozier

Jeff Dozier has been on the University of California, Santa Barbara (UCSB) faculty since 1974 and was the founding dean of the Bren School. He has led interdisciplinary studies in two areas: one addresses hydrologic science, environmental engineering, and social science in the water environment; the other is in the integration of environmental science and remote sensing with computer science and technology. From 1990 to 1992, he was the senior project scientist for NASA’s Earth Observing System, when the configuration for the system was established. Among Dozier’s honors are the 2009 Jim Gray Award from Microsoft for his achievements in data-intensive science, and his selection as the 2010 Nye Lecturer for the Cryosphere group of the American Geophysical Union. A long-time backcountry skier, mountaineer, and rock climber, he helped lead six expeditions to the Hindu Kush range in Afghanistan and has a dozen first ascents there. The story behind the naming of Dozier Dome in the Sierra Nevada can be found in the Super Topo Climbing Forum.

Rob Fatland

Rob Fatland works at Microsoft Research on applications of technology to information challenges in environmental science. His career has included research in glacier dynamics and seismically-driven surface deformation based on data from synthetic aperture radar satellites. He has also worked on embedded systems technology, developing wireless sensor networks for harsh environments. At Microsoft Research, he works to release research tools, such as Layerscape (a collaboration/visualization system) and SciScope (a search engine for hydrology data), for adoption and use by both academic and operational geoscience communities.

Dan Fay

Daniel Fay is the director of Earth, Energy, and Environment for Microsoft Research Connections, where he works with academic research projects focused on utilizing computing technologies to aid in scientific and engineering research. This includes his teams’ projects in Astronomy and Earth Visualization using the Microsoft Research technologies, WorldWide Telescope and Layerscape.org. Fay has project experience working with high-performance computing, grid computing, collaboration, and visualization tools in scientific research. He was previously the manager of the eScience Program at Microsoft Research, where he started Microsoft’s engagements in eScience, including the Microsoft Research eScience workshop.

Ian Foster

Ian Foster is a computer scientist whose research focuses on the acceleration of discovery in a networked world. Foster co-invented grid computing more than a decade ago, leading the October 2002 issue of Red Herring magazine to dub him “the Gridfather.” Methods and software developed under his leadership underpin many large national and international cyberinfrastructures and have helped advance discovery in such areas as high-energy physics, environmental science, and biomedicine. Grid computing has become the de facto computation standard for data-intensive, multi-institution collaboration and has helped create what has become the “cloud revolution.” Foster continues to develop innovative tools and infrastructure that enable research breakthroughs. His MacArthur Foundation- and National Science Foundation-funded RDCEP (Center for Robust Decision Making on Climate and Energy Policy) project combines the best of modern computational and economic science to guide climate and energy policy. His most recent effort, Globus Online, is a cloud-based service that transforms how researchers deal with big data, from how they manage it to how they mine it to how they share it among their colleagues. Globus Online is the recipient of a 2012 R&D 100 Award, recognizing it as one of the 100 most technologically significant products introduced in the past year.

James Frew

James Frew is an associate professor in the Donald Bren School of Environmental Science and Management at the University of California, Santa Barbara (UCSB), and a principal investigator in UCSB’s Institute for Computational Earth System Science (ICESS). He received his Ph.D. in Geography from UCSB in 1990. His research interests lie in the emerging field of environmental informatics, a synthesis of computer, information, and Earth sciences. He has published in remote sensing, image processing, software architecture, massive distributed data systems, and digital libraries. His current research is focused on geospatial information curation and provenance, novel methods of whole-Earth visualization, and the use of next-generation database management systems to organize and process petabytes of geospatial information.

Fabrizio Gagliardi

Fabrizio Gagliardi joined Microsoft in November 2005 to take responsibility for the company’s Technical Computing Initiative in Europe, the Middle East, Africa, and Latin America. As part of his job, he supports and contributes to the Microsoft Research cloud computing strategy in Europe, including the incubation and the management of a major EU project. Before he joined Microsoft, he had a 30-year-long scientific career at the European Centre for Particle Physics in Geneva, Switzerland, where he held several scientific and senior managerial positions, and worked with four Nobel Prize winners.

Before then and starting at the end of the ‘90s, he was among the pioneers in developing and introducing grid-computing in Europe—this led to projects like EU-DataGrid and Enabling Grids for E-Science (EGEE), of which he was principal investigator and director from 2000 to 2005.

The EGEE project developed and deployed the distributed computing infrastructure that is now used for the analysis and distribution of data coming from the Large Hadron Collider (LHC), which earlier this year demonstrated the existence of the famous “God particle” (Higgs particle). From 2004 to 2005, while still director of EGEE, he contributed to the incubation and launch of more than 10 other EU grid projects—all inspired and supported by the EU EGEE flagship.

Since 2009, Gagliardi has been the chair of the Association for Computing Machinery (ACM) European Council and he also sits on the ACM Distinguished Speakers Programme International Committee.

Dennis Gannon

Dennis Gannon is director of Cloud Research Strategy in the Microsoft Research Connections organization. Prior to this position, he was part of the Microsoft Research Extreme Computing Group and the Technology Policy team. Over the last two years, he has provided cloud resources to more than 90 research projects in 13 countries in collaboration with national research funding agencies. Prior to coming to Microsoft, Gannon was a professor and chairman of Computer Science at Indiana University and the science director for the Indiana Pervasive Technology Labs. Gannon’s research interests include cloud computing, large-scale cyberinfrastructure, distributed computing, computer networks, parallel programming, and computational science.

Robert Gurney

Robert Gurney is professor of Earth Observation Science in the School of Mathematical and Physical Sciences at the University of Reading. His research interests are in using remote sensing and other technology to understand land-atmosphere interactions. He is one of the three co-leads of the NERC Environmental Virtual Observatory pilot. He has had a wide variety of supervisory roles, including being director of the NERC Environmental Systems Science Centre for 18 years, and previously as head of NASA Goddard’s Hydrological Sciences Branch, where he was also deputy project scientist for the Earth Observing System.

David Heckerman

David Heckerman is senior director of the eScience Group at Microsoft Research. Since 1992, he has been a researcher at Microsoft, where he has created applications including the first content-based spam filter and web services for medical diagnosis. His research is in the areas of statistics, machine learning, and artificial intelligence with applications in medical diagnosis, the design of a vaccine for HIV, and the search for genetic causes of disease. He received his Ph.D. and M.D. from Stanford University. His Ph.D. dissertation on automated medical diagnosis received the ACM doctoral dissertation award. David is an Association for Computing Machinery (ACM) Fellow, an Association for the Advancement of Artificial Intelligence (AAAI) Fellow, and a Distinguished Scientist at Microsoft.

Tony Hey

As corporate vice president in Microsoft Research, Tony Hey is responsible for worldwide university research collaborations with Microsoft researchers. Hey is also responsible for the multidisciplinary eScience Research Group within Microsoft Research. Before joining Microsoft, Hey served as director of the U.K.’s e-Science Initiative, managing the government’s efforts to build a new scientific infrastructure for collaborative, multidisciplinary, data-intensive research projects. Before leading this initiative, Hey led a research group in the area of parallel computing and was head of the School of Electronics and Computer Science, and dean of Engineering and Applied Science at the University of Southampton.

Hey is a fellow of the U.K.’s Royal Academy of Engineering and was awarded a Commander of the Order of the British Empire (CBE) for services to science in 2005. He is also a fellow of the British Computer Society, the Institute of Engineering and Technology, the Institute of Physics, and the U.S. American Association for the Advancement of Science. Hey has written books on particle physics and computing and has a passionate interest in communicating the excitement of science and technology to young people. He has co-authored popular books on quantum mechanics and on relativity.

Bill Howe

Bill Howe is the director of Research for Scalable Data Analytics at the University of Washington eScience Institute and holds an affiliate assistant professor appointment in Computer Science and Engineering, where he studies data management, analytics, and visualization systems for science applications. Howe has received two Jim Gray Seed Grant awards from Microsoft Research for work on managing environmental data, and has received paper awards for work in data-intensive computing for science. Howe serves on the program and organizing committees in the area of scientific data management, has authored two book chapters on these topics, and serves on the advisory board for companies and projects related to science data, including the SciDB project. He holds a Ph.D. in Computer Science from Portland State University under David Maier, and a bachelor’s degree in Industrial and Systems Engineering from Georgia Institute of Technology.

Joe Hummel

Joe Hummel is an author, consultant, and tenured professor of Computer Science, with a Ph.D. from the University of California, Irvine, in the field of High Performance Computing (HPC). Joe specializes in teaching computer science to a wide range of audiences around the world, including young children, professional developers, and university faculty. With the collision of HPC and Big Data, Hummel has been developing techniques and curricular materials for helping newcomers work in these challenging areas. He is currently a visiting researcher at the University of California, Irvine, as well as adjunct faculty at the University of Illinois at Chicago and Loyola University Chicago.

James Hunt

James Hunt was trained in environmental engineering at the University of California, Irvine (B.S.), Stanford University (M.S.), and the California Institute of Technology (Ph.D.) and has been in the Civil and Environmental Engineering Department at the University of California, Berkeley, since 1980. His teaching interests emphasize many aspects of water resources engineering, including water treatment and hydrology.

Hunt’s areas of research have included particle dynamics in marine systems, estuarine sediment transport, contaminant transport processes in the subsurface, and hydrologic science. In all instances, initial efforts were constrained by data management challenges of finding the existing data, documenting the source of that data, and then using models as a means of scaling that data from one location to another. With the vast and widely distributed data available in hydrologic sciences, utilization of new methodologies for data analysis and management was essential in undertaking data synthesis and developing scaling relationships for the generalization of results.

Shuichi Iwata

Shuichi Iwata is Emeritus Professor of the University of Tokyo, professor at the Graduate School of Project Design, former president of Committee on Data for Science and Technology (CODATA), editor-in-chief of Data Science Journal, member of Engineering Academy of Japan, and member of the Science Council of Japan. He is now working for Data and Society, making data on science and technology available for everyone through materials design, design science, and data science. He received his Doctor of Engineering in Nuclear Engineering from the University of Tokyo.

Harold Javid

Harold Javid is director of the Microsoft Research Connections regional programs for North America, Latin America, and Australia/New Zealand. His team works with the academic research communities in these regions to build rich collaborations including joint centers in the United States, Brazil, and Chile; faculty summits and other events; and talent development programs such as the Microsoft Research Faculty Fellows program. Javid has a long career in research organizations, working for companies like General Electric, Boeing, and now Microsoft. He has made advances in the application of optimization and computing algorithms in industries such as power, aerospace, and pulp and paper.

Javid is the chair of the Industry Advisory Board of the IEEE Computer Society. He received his Ph.D. in Electrical Engineering at the University of Illinois Urbana-Champaign where he made advances to optimization for multiple time-scale dynamic systems.

Michael Kurtz

Michael Kurtz is an astronomer and computer scientist at the Harvard-Smithsonian Center for Astrophysics in Cambridge, Massachusetts, which he joined after receiving a Ph.D. in Physics from Dartmouth College in 1982. Kurtz is the author or co-author of more than 250 technical articles and abstracts on subjects ranging from cosmology and extra-galactic astronomy, to data reduction and archiving techniques, to information systems and text retrieval algorithms. He is a fellow of the American Physical Society. In 1988, Kurtz conceived what has now become the Smithsonian/NASA Astrophysics Data System, the core of the digital library in astronomy. He has been associated with the project since that time, and was awarded the 2001 Van Biesbroeck Prize of the American Astronomical Society for his efforts.

Carl Lagoze

Carl Lagoze is an associate professor in the School of Information at the University of Michigan. Over the last two decades his research has included a number of projects investigating digital libraries, web science, scientometrics and bibliometrics, and the sociotechnical aspects of cyberinfrastructure and interoperability. His research has been funded by the National Science Foundation, the Mellon Foundation, Microsoft, and the Sloan Foundation.

Elizabeth Lyon

Liz Lyon is director of UKOLN, University of Bath, U.K., where she leads work to promote synergies between digital libraries and open science environments. She is author of major direction-setting reports and articles including Dealing with Data (2007), Open Science at Web-Scale: Optimising Participation and Predictive Potential (2009) and The Informatics Transform: Re-engineering Libraries for the Data Decade (2012).

She is associate director at the Digital Curation Centre in the U.K. and leads the UKOLN Informatics Research Group. In this role, Lyon has led a series of pioneering research data management projects: eBank, eCrystals Federation, Infrastructure for Integration in Structural Sciences (I2S2), SageCite, Patients Participate!, and Research360, all of which explored links between research data, scholarly communications, and open science. She has a doctorate in cellular biochemistry and has worked in various university libraries.

Lyon is a member of the Biotechnology and Biological Sciences Research Council Strategy Panel, exploring data-intensive research and is co-chair of the DataONE International Advisory Board. She regularly gives international keynote addresses, and has spoken on libraries and informatics, research data management, and open science in Europe, United States, Canada, China, and Australia.

Eamonn Maguire

Eamonn Maguire is the lead developer of the ISA infrastructure (isa-tools.org and isacommons.org) at the University of Oxford’s e-Science Research Center. Maguire’s background is in Computer Science (bachelor’s) and Bioinformatics (master’s) and he is undertaking a D.Phil. (Ph.D.) in Computer Science at the University of Oxford focusing on biological data and metadata visualization. Maguire previously worked at the European Bioinformatics Institute from 2008 until 2010.

David R. Maidment

David R. Maidment is the Hussein M. Alharthy Centennial Chair at the University of Texas at Austin, where he has been on the faculty in the Department of Civil, Architectural and Environmental Engineering since 1981. He is a specialist in the application of information systems to hydrology, and was the leader from 2000 to 2011 of the Hydrologic Information Systems project of the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI), which developed a services-architecture for water observations data built around a language, WaterML, that in a revised form, WaterML2, has been adopted by the Open Geospatial Consortium as a global standard for the exchange of water resources time series information. He is presently working with the ESRI and Kisters firms to create World Water Online to link people with water data, maps, and models everywhere.

Tanu Malik

Tanu Malik is a research scientist at the Computation Institute, University of Chicago (UChicago). Her research focuses on the management, performance, and provenance of the scientific data lifecycle. Her recent work focuses on high-performance computing systems and databases, distributed data provenance, and interactive publications.

Prior to joining UChicago, Tanu was a research assistant professor at the Cyber Center and the Indiana Center for Database Systems at Purdue University. She earned her Ph.D. and M.S. in 2008 from the Department of Computer Science, Johns Hopkins University, and a B.Tech. in 1999 from the Indian Institute of Technology, Kanpur. She is a member of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE).

Claudia Medeiros

Claudia Bauzer Medeiros is full professor (Computer Science) at the Institute of Computing, UNICAMP, Brazil. Her main research interests lie in facing the challenges posed by large, real world applications, which require handling distributed and very heterogeneous scientific data sources. In particular, she has coordinated large eScience projects in Brazil, involving applications in agro-environmental planning and biodiversity. In these areas, she has been principal investigator or co-principal investigator in several multi-institutional projects, in cooperation with universities and research labs in Brazil, Germany, and France.

Chris Mentzel

Since 2008, Chris Mentzel has been a program officer in the Science Program at the Gordon and Betty Moore Foundation. Chris is currently developing a strategy for long-term investment in “data-driven discovery” that will enable scientists to turn the scientific data deluge into opportunities to address some of today’s most important research questions.

Chris identifies the people, advanced instrumentation, and information technologies that help solve important data-rich science questions. He is an active member of the broader eScience, Big Data and digital research communities, serving on a number of advisory boards and program committees, and occasionally finds time to engage in more direct technology development, teaching/coaching, new venture strategy, and non-profit management.

Prior to his current role at the Gordon and Betty Moore Foundation, Chris worked as the manager of grants administration and as senior network engineer for the organization. Before that, he also held positions as a systems engineer and a systems integrator at the University of California, Berkeley, and at various Internet consulting firms in the San Francisco Bay Area. He received his Bachelor of Arts in mathematics from the University of California, Santa Cruz and is currently pursuing graduate studies in management science and engineering at Stanford University.

Chris van der Meijden

Chris van der Meijden studied veterinary medicine from 1984 to 1990. He focused on a specialization in Veterinary Informatics from 1995 to 1999. He is currently chief information officer of the Veterinary Faculty of the Ludwig-Maximilians-University in Munich, Germany. His primary research interest is archaeo-informatics.

William Michener

Bill Michener is project director for Data Observation Network for Earth (DataONE), a large DataNet project supported by the National Science Foundation, and is involved in research related to creating information technologies supporting data-intensive science, development of federated data systems, and community engagement and education. He has a Ph.D. in Biological Oceanography from the University of South Carolina and has published extensively in marine science, as well as the ecological and information sciences.

Barbara Minsker

Barbara Minsker is professor of Environmental and Water Resources Systems Engineering and Arthur and Virginia Nauman Faculty Scholar in the Department of Civil and Environmental Engineering and Faculty Affiliate at the National Center for Supercomputing Applications. Her research uses information technology and systems analysis to improve understanding and management of complex environmental systems, with a focus on water and sustainability. She has received numerous awards for her research, including the National Science Foundation CAREER Award, Army Young Investigator Award, Presidential Early Career Award for Scientists and Engineers, the American Society for Civil Engineers’ Walter L. Huber Civil Engineering Research Prize, Xerox Award for Faculty Research, and the University Scholar Award. She earned a B.S. in Operations Research and Industrial Engineering in 1986 and a Ph.D. in Civil and Environmental Engineering in 1995 from Cornell University. She served as a policy consultant to the Environmental Protection Agency from 1986 to 1990, and has been at the University of Illinois since 1996.

Philip Murphy

Philip Murphy is a senior research analyst at the Redlands Institute, University of Redlands. There, he is the principal investigator for the desert tortoise spatial decision support (SDS) / adaptive management system in development with the CEC and U.S. Fish and Wildlife Service (USFWS). At the Institute, he conducts scientific research and technology development, and serves as senior project manager for a number of large, multi-year projects with the USFWS, Department of Defense / Army Corps of Engineers, and other agencies. He is a founding member of the Ecosystem Management Decision Support Consortium and the Spatial Decision Support (Ontology) Consortium, and is the chief executive officer of Infoharvest Inc., a software company that has been creating and selling decision analysis software since 1995.

His current research interests include spatial workflow automation, budgeting prioritization for large portfolios, uncertainty estimation for complex spatial computation systems, conceptual modeling, and decision support for public participation.

Carole Palmer

Carole L. Palmer is director of the Center for Informatics Research in Science and Scholarship (CIRSS) and a professor in the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign. Her research investigates problems in scientific and scholarly information work, development of large-scale digital research collections, and barriers to interdisciplinary inquiry. At CIRSS, she leads a team investigating data curation needs across disciplines and the re-use value of long-tail research data. She is principal investigator (PI) on the Site-Based Data Curation at Yellowstone National Park project (Institute for Museum and Library Services [IMLS]) and co-PI on the Data Conservancy (NSF).

Jim Pinkelman

Jim Pinkelman is currently a senior director in Microsoft Research Connections, where he leads the regional collaborations efforts and serves as business manager. Prior to coming to Microsoft Research, Pinkelman led Microsoft’s U.S. academic outreach efforts to find valuable ways in which Microsoft software and services could be used by technical students and educators both in and out of the classroom.

Before joining Microsoft, Pinkelman served in senior technology roles at technology startup firms in Chicago, Illinois. In 1999, Jim co-authored a book on business intelligence, Microsoft OLAP Unleashed (Macmillan/Sams Publishing). He spent seven years as an officer in the United States Air Force as a project management engineer on space systems. He is currently a member of the Board of Advisors at the University of Washington, Bothell. He is an Accreditation Board for Engineering program evaluator for the Computing Sciences Accreditation Board. He has also served as an adjunct faculty member over the past 15 years, teaching courses in computer programming and statistics. He received a Ph.D. in Mechanical Engineering from the University of Notre Dame, where his area of research was digital signal processing.

Drew Purves

Drew Purves is head of the Computational Ecology and Environmental Science group (CEES) at Microsoft Research Cambridge. Before joining Microsoft, Purves studied ecology at Cambridge University, completed a Ph.D. in ecological modeling at the University of York (UK), and held a five-year postdoc at Princeton. Drew’s research interest is in combining ecological theory with large and varied datasets, via computational statistics, to produce quantitative, predictive models of ecological phenomena. Following Purves’ lead, the CEES group is using this approach to build new models to address global environmental challenges (for example, carbon-climate, food security, wood production, biodiversity and ecosystem function, and pandemics) whilst developing new software tools to enable others to carry out this kind of ecological modeling.

Purves has published more than 30 research papers in top peer-reviewed journals, including Science, Proceedings of the National Academy of Sciences, Proceedings of the Royal Society B, and most of the top ecology-specific journals. In 2012, he was one of 40 “young scientists” worldwide invited to attend the World Economic Forum “Summer Davos” meeting in Tianjin, China. He lectures at Cambridge University and is the treasurer of the British Ecological Society, the world’s oldest ecological society.

Jian Qin

Jian Qin is an associate professor at the School of Information Studies, Syracuse University. Her research publications and teaching areas encompass knowledge modeling and organization, ontologies, metadata, scientific data management, and scientific communication. Qin initiated the Scientific Data Literacy project with funding from the U.S. National Science Foundation in 2007, in which she developed and implemented a course on scientific data management. In the last three years, she has been leading an eScience Librarianship Curriculum Development project funded by the Institute for Museum and Library Services and in partnership with Cornell University Library. This project sprang off a number of scientific data management projects performed by the eScience fellows and project team members. Jian Qin was invited by health sciences library networks to give workshops and by Chinese university libraries to provide consulting services on scientific data management and services. Her research on computational representation of web resources in polymer science was funded by the OCLC Online Computer Library Center in the early days of the metadata movement. She is the co-author of the book Metadata published in 2008. Jian Qin holds a Ph.D. from the University of Illinois at Urbana-Champaign and an M.L.I.S. from the University of Western Ontario.

Rafael Santos

Rafael Santos is a senior technologist at the Associate Laboratory for Computing and Applied Mathematics at the Brazilian Institute for Space Research (LAC/INPE), working with research and development of artificial intelligence, data mining, image processing, and distributed computing systems and applications. He collaborates with research and development in other departments and universities and teaches at the applied computing graduate program at INPE.

He has master’s and Ph.D. degrees from the Kyushu Institute of Technology in Japan, and has been a visiting researcher at the Johns Hopkins University, at the Brazilian National Astrophysics Laboratory, and at the Brazilian Renato Archer IT Center.

Gail Steinhart

Gail Steinhart is research data and environmental sciences librarian and a fellow in Digital Scholarship and Preservation Services, Cornell University Library. Her interests are in research data curation and cyberscholarship. She is responsible for developing and supporting new services for collecting and archiving research data, and serves as a library liaison for environmental science activities at Cornell. She is a member of Cornell University Library’s Data Executive Group and Cornell University’s Research Data Management Service Group, which seek to advance Cornell’s capabilities in the areas of data curation and data-driven research. She holds M.S. degrees in Library and Information Science (Syracuse University) and Ecology and Evolutionary Biology (Cornell University), and worked for nearly 15 years in environmental research before becoming a librarian.

Karen Stocks

Karen Stocks is a biological oceanographer by training, and currently works at the interface of cyberinfrastructure and oceanography, partnering with technical experts to develop and tailor information systems to support oceanographic and biodiversity research. She is employed as a specialist at the San Diego Supercomputer Center and currently serves as the interim director of the Geological Data Center at Scripps Institution of Oceanography, and as the data curator for the Ocean Observatories Initiative.

Stocks completed her Bachelor of Science degree in Wildlife and Fisheries Biology at the University of Massachusetts and her Doctorate in Oceanography at Rutgers University. She has been at the San Diego Supercomputer Center since 2000.

Carly Strasser

Carly Strasser is a marine scientist by training who transitioned from traditional research to more applied topics related to data stewardship. She uses her scientific background to contribute a unique perspective to the field of information science and all things related to research data. Strasser received her Ph.D. in Biological Oceanography in 2008 from the Massachusetts Institute of Technology-Woods Hole Oceanographic Institution (MIT-WHOI) joint program. She completed two post-doctorates on population dynamics and theoretical ecology, and then moved out of research to work with the DataONE project in 2010.

Since joining the University of California Curation Center at the California Digital Library (CDL) in 2011, Strasser has focused primarily on the development of the DataUp tool. She is also involved in the promotion and improvement of other CDL services, including the DMPTool and the Merritt Repository. Her role at CDL is to provide insight into the issues and barriers to data stewardship that prevent researchers from properly managing and archiving their data.

Kenji Takeda

Kenji Takeda is solutions architect and technical manager for the Microsoft Research Connections Europe, Middle-East, and Africa (EMEA) team. He has extensive experience in cloud computing, high performance and high productivity computing, data-intensive science, scientific workflows, scholarly communication, engineering, and educational outreach. He has a passion for developing novel computational approaches to tackle fundamental and applied problems in science and engineering. He was previously co-director of the Microsoft Institute for High Performance Computing, and senior lecturer in Aeronautics, at the University of Southampton, U.K.

Darren Thompson

Darren Thompson is an application support specialist with the Commonwealth Scientific and Industrial Research Organisation’s (CSIRO’s) Advanced Scientific Computing group. His current work focuses on the development of high-performance computing software for X-ray imaging and computed tomography. Prior to joining CSIRO, Thompson worked for the Australian Road Research Board and spent more than 10 years in private industry developing software for traffic analysis and optimization. He holds an honours degree in Computer Science from Monash University in Melbourne, Australia.

Kristin M. Tolle

Kristin M. Tolle, Ph.D., is a director in the Microsoft Research Connections team and a clinical associate professor at the University of Washington’s College of Medicine. Since joining Microsoft, Tolle has been awarded numerous patents and worked for several product teams, including the Natural Language Group, Visual Studio, and Excel. She is also the co-editor, with Tony Hey, of The Fourth Paradigm: Data-Intensive Scientific Discovery. Prior to joining Microsoft, Tolle was a research associate at the University of Arizona Artificial Intelligence Lab. Her present research interests at Microsoft Research include big data, facilitating time to discovery in environmental science, data curation, and data science.

Dave Vieglais

Dave Vieglais is a senior scientist at the Biodiversity Institute of the University of Kansas and Director of Development and Operations for DataONE, where he oversees DataONE development and implementation of architecture, computer science research, and technological evolution through the activities of the working groups and the cyberinfrastructure. Vieglais has extensive experience in developing standards such as the Darwin Core and technical infrastructure for integrating biodiversity information at the global level.

Nigel Ward

Nigel Ward works as data management coordinator within the eResearch Lab at the University of Queensland’s (UQ’s) School of Information Technology and Electrical Engineering, where he manages projects developing infrastructure to collect, manage, and publish UQ research data. Ward also works as deputy director for the National eResearch Collaboration Tools and Resources (NeCTAR) project led by the University of Melbourne. In this role, he manages and co-ordinates NeCTAR’s program of 16 eResearch Tools projects developing cloud-based software tools for the Australian research community.

Ward has technical expertise in distributed systems architectures, persistent identifiers, metadata, usability, accessibility, and formal specification.

Paul Watson

Paul Watson is professor of Computer Science and director of the Digital Institute at Newcastle University, U.K. He also directs the $20 million Digital Economy Hub on Social Inclusion through the Digital Economy. He graduated in 1983 with a B.Sc. in Computer Engineering from Manchester University, followed by a Ph.D. on parallel graph reduction in 1986. In the 1980s, as a lecturer at Manchester University, he was a designer of the Alvey Flagship and Esprit EDS parallel systems. From 1990 to 1995, he worked for ICL as a system designer of the Goldrush MegaServer parallel database server, which was released as a product in 1994.

In August 1995, he moved to Newcastle University, where he has been an investigator on research projects worth more than $60 million. His research interest is in scalable information management with a current focus on cloud computing; most of his research is now based on the e-Science Central cloud platform. Watson is a Chartered Engineer, a Fellow of the British Computer Society, and a member of the UK Computing Research Committee.

Antony John Williams

With the Royal Society of Chemistry (RSC) Cheminformatics team, Antony John Williams, who is vice president of Strategic Development and head of Cheminformatics for RSC, is leading the charge to show how experience, knowledge, insight, and crowdsourced contributions can build a platform to facilitate a semantic web for chemistry. ChemSpider provides the means by which that can be realized now.

Over the past decade, he held many responsibilities at the company Advanced Chemistry Development (ACD/Labs), including directing the development of scientific software applications for spectroscopy and general chemistry and directing marketing efforts, sales, and business development collaborations. His career is built on rich experience in experimental techniques, implementation of new nuclear magnetic resonance (NMR) technologies, walk-up facility management, research and development, manufacturing support, and teaching, as well as analytical laboratory leadership and management.

Born in Wales, Williams earned a B.Sc. with honors from the University of Liverpool followed by a Ph.D. from the University of London in 1988. He then moved to Canada to serve as a postdoctoral scholar at the National Research Council of Canada in Ottawa. He quickly moved into leadership positions as NMR Facility Director at the University of Ottawa, NMR Technology Leader at the Eastman Kodak Company, vice president and chief scientist at Advanced Chemistry Development in Toronto, president of ChemConnector, Inc. and then ChemZoo, Inc., where the ChemSpider project was initiated.

Michael Witt

Michael Witt is the interdisciplinary research librarian and an assistant professor of Library Science at Purdue University. Witt is the editor-in-chief of Databib, which is a searchable directory or catalog of research data repositories. His research at the Distributed Data Curation Center (D2C2) involves the advancement of library science theory and practice to meet the evolving needs of modern, scholarly communication with a focus on research data curation.

Dawn Wright

Dawn Wright joined the Environmental Systems Research Institute (Esri) as chief scientist in 2011. In this role, she aids in formulating and advancing the intellectual agenda for the environmental, conservation, climate, and ocean sciences aspect of Esri’s work, while also representing Esri to the national/international scientific community. Dawn is also a professor of geography and oceanography at Oregon State University in Corvallis. She has more than 16 years of experience in working with geographic information system technology as an ocean scientist, geographer, and educator and has participated in several initiatives around the world to map, analyze, and preserve ocean terrains and ecosystems.

Stephanie Wright

Stephanie Wright is a librarian at the University of Washington Libraries with a background in science librarianship and library assessment. In her current role as data services coordinator, she works with the ResearchWorks Data Services Team to develop a program to support the research data management needs of faculty and students at the University of Washington.

Dong Xie

Dong Xie is a programmer/research assistant at the Wellcome Trust Centre for Human Genetics, Oxford University. For the past 12 years, he has worked on various projects covering microarray/gene expression databases, genotyping databases, phenotype informatics, and more. Recently he has been busy designing a Windows Azure-based software as a service to process the enormous volume of data generated by high-speed sequencing. Furthermore, he would like to combine computer science work on concurrency theory and type theory with gene/transcription control research, so that we might better understand how a cell does massive parallel computation in order to improve programming.

Yan Xu

Yan Xu is a senior research program manager at the Earth, Energy, and Environment group at Microsoft Research. Her research is focused on interdisciplinary computing to engage Microsoft technologies with sciences in the Earth, energy, and environmental research areas. Yan has also been driving the Transform Science effort, which aims to bridge the gaps between scientific research and science education. She joined Microsoft Research in March 2006. Prior to working at Microsoft Research, Yan was a senior software architect and worked for several startup software companies for more than 10 years. Yan received her Ph.D. in Physics from McGill University, Canada.

Chaowei Phil Yang

Chaowei Phil Yang is associate professor at George Mason University. His research interest is in utilizing spatiotemporal principles to optimize computing infrastructure to support environmental science discoveries and applications. He has published more than 100 papers and edited six journal special issues and a book. He founded and co-directs the NASA/GMU Joint Center of Intelligent Spatial Computing for Water/Energy Sciences (CISC). He has received many awards, such as the U.S. Presidential Environment Protection Stewardship Award in 2009. He is leading a group of international leaders from the University of California, Santa Barbara; Harvard; and George Mason University to establish a National Science Foundation Industry & University Cooperative Research Program (I/UCRC) for spatiotemporal thinking, computing, and applications.

Ilya Zaslavsky

Ilya Zaslavsky is director of the Spatial Information Systems Laboratory at the San Diego Supercomputer Center, University of California, San Diego. His research focuses on distributed information management systems, in particular on spatial and temporal data integration, geographic information systems, and spatial data analysis. Zaslavsky received his Ph.D. from the University of Washington (1995) for research on statistical analysis and reasoning models for geographic data. Previously, he received a Ph.D. equivalent from the Russian Academy of Sciences, Institute of Geography, for his work on urban simulation modeling and metropolitan evolution (1990).

Zaslavsky has been leading design and technical development in several cyberinfrastructure projects, including the national-scale Hydrologic Information System, which develops standards, databases, and services for integration of hydrologic observations. He has also developed spatial data management infrastructure as part of several large projects, in domains ranging from neuroscience (digital brain atlases) to geology, disaster response (NIEHS Katrina portal), regional planning, and conservation. Over the last year, he has led the development of a cross-domain interoperability road map for the geosciences, as part of the new National Science Foundation EarthCube initiative.

Videos

Keynote Presentations

Keynote: Defensible Modeling of the Biosphere

Drew Purves

01:03:40

To manage the planet on which we all depend, we need to predict the future outcome of various options. How would biofuel subsidies affect crop prices affect deforestation? CO2 emissions affect climate change affect fire? At present, we cannot make such predictions with any confidence. But, as I’ll show in this talk, a computational approach to environmental science can change that. I’ll explain how we built the first fully data-constrained model of the terrestrial carbon cycle, using Big Data, cloud computing, and machine learning. And I’ll demo similar models for global food production, Amazon deforestation, and bird biodiversity. The prototype tools on which these models have been built—for example, FetchClimate, Filzbach, WorldWide Telescope—are freely available, and will hopefully allow other scientists to adopt a rigorous approach to modeling the complexities of the biosphere.


Keynote: Biology: A Move to Dry Labs

David Heckerman

00:48:06

Since its beginning, the wet lab has been the key driver in biological discovery. Recently, however, more and more science is getting done in dry labs, those where only computational analysis is done. The presentation will include examples, ranging from genomics to vaccine design.


2012 Jim Gray Award / The Possibilities and Pitfalls Internet-Based Chemical Data

Antony John Williams and Tony Hey

01:21:24

2012 Jim Gray eScience Award Presentation

At the Microsoft eScience Workshop 2012, Microsoft Research Connections Vice President Tony Hey introduces the Jim Gray eScience Award and announces this year’s winner, Antony John Williams, who delivers the following presentation.

The Possibilities and Pitfalls Internet-Based Chemical Data

In less than a decade, the Internet has provided us access to enormous quantities of chemistry data. Chemists have embraced the web as a rich source of data and knowledge. However, all that glitters is not gold and—while online searches can now provide us access to information associated with many tens of millions of chemicals, can allow us to traverse patents, publications, and public domain databases—the promise of high quality data on the web needs to be tempered with caution.

In recent years, the crowdsourcing approach to developing curated content has been growing. Can such approaches allow us to bring to bear the collective wisdom of the crowd to validate and enhance the availability of trusted chemistry data online or are algorithms likely to be more powerful in terms of validating data? While it is now possible to search the web by using a query language form natural to chemists—that of ‘structure searching the web’—increasingly, scientists are likely going to have to accept joint responsibility for the quality of data online for the foreseeable future. Their participation is likely to come through engaging in open science, the provision of data under open licenses, and by offering their skills to the community.

This presentation provides an overview of the present state of chemistry data online, the challenges and risks of managing and accessing data in the wild, and how an Internet for chemistry continues to expand in scope and possibilities.

Monday Breakout Sessions

Panel: Open Data for Open Science—Data Interoperability

Ilya Zaslavsky, Karen Stocks, Philip Murphy, Robert Gurney, and Yan Xu

02:04:16

The goal of cross-domain interoperability is to enable reuse of data and models outside the original context in which these data and models are collected and used and to facilitate analysis and modeling of physical processes that are not confined to disciplinary or jurisdictional boundaries. A new research initiative of the U.S. National Science Foundation, called EarthCube, is developing a roadmap to address challenges of interoperability in the earth sciences and create a blueprint for community-guided cyberinfrastructure accessible to a broad range of geoscience researchers and students.

The panel discusses this and related initiatives and projects, focusing on challenges of data discovery, interpretation, access, and integration across domain information systems, assessment of their readiness for cross-domain integration, and technologies enabling interoperability in the geosciences.


Panel: Enabling Multi-Scale Science

Claudia Bauzer Medeiros, James Hunt, and Roberto Cesar

00:51:50

eScience research increasingly involves the need to facilitate multi-scale problem solving that spans wide ranges in space and time scales. It requires collaboration among researchers and practitioners from multiple disciplines, each with their own orientations towards problem identification, solution formulation, and implementation.

The panel discusses some of the challenges of working in multi-scale scenarios. Panelists present these challenges from two perspectives: application, and computing approaches.

  • The first perspective focuses on issues such as scientific profiles involved, scales considered, data collected and produced, models, and visualization needs.
  • The second viewpoint considers, among others, characteristics of data and storage structures to accommodate the wide variety of data scales and formats, language/workflow constructs that may facilitate the specification, execution, and interaction of models, and interface/interaction primitives.

The Internet of Databases—Generalizing the Archaeo Informatics Approach

Chris van der Meijden

00:33:21

One thing we have learned from our Archaeo-Data-Network is that there is a need to split the meta information of databases into two levels. The first level contains a centralized unique ID and a small amount of standard information. The second level of meta information is defined by the archaeo scientist. This can be implemented for any kind of archaeo database, so the network’s extensibility is virtually unlimited. The advantage of this dual meta approach is its flexible connectivity, which makes comprehensive data transparently available for general searching and mining. With this approach, huge, rigid archives can be connected to small, flexible databases for scientific analysis in any scientific domain. Combined with simple authorization management for unpublished data, we see in our system the potential to be the general blueprint for an eScience infrastructure, which we call the Internet of databases.
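As a rough illustration of the two-level meta information described above, the following Python sketch models one possible shape for such a registry. It is not the Archaeo-Data-Network implementation: the class names, the level-one fields (title, maintainer), the search method, and the sample entries are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DatabaseEntry:
    """One database in the network, described at two levels (hypothetical layout)."""
    # Level 1: centralized unique ID plus a few standard fields (field names assumed).
    central_id: str
    title: str
    maintainer: str
    # Level 2: free-form meta information defined by the scientist who owns the database.
    custom_meta: Dict[str, Any] = field(default_factory=dict)


class Registry:
    """Hypothetical central index: stores entries by ID and searches level-2 data."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatabaseEntry] = {}

    def register(self, entry: DatabaseEntry) -> None:
        # The central ID must stay unique across the whole network.
        if entry.central_id in self._entries:
            raise ValueError(f"duplicate central_id: {entry.central_id}")
        self._entries[entry.central_id] = entry

    def search(self, **criteria: Any) -> List[str]:
        # Return the IDs of entries whose level-2 meta matches every criterion.
        return [
            e.central_id
            for e in self._entries.values()
            if all(e.custom_meta.get(k) == v for k, v in criteria.items())
        ]


# Made-up sample entries, for illustration only.
registry = Registry()
registry.register(DatabaseEntry("db-001", "Bone finds archive", "Lab A",
                                {"period": "neolithic", "material": "bone"}))
registry.register(DatabaseEntry("db-002", "Ceramics survey", "Lab B",
                                {"period": "neolithic", "material": "ceramic"}))

neolithic_ids = registry.search(period="neolithic")
bone_ids = registry.search(period="neolithic", material="bone")
```

Keeping the first level small and fixed is what would let a large archive and a small project database sit in the same index, while the open second level leaves each scientist free to describe a database in domain-specific terms.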


Combining Semantic Tagging and Support Vector Machines to Streamline the Analysis of Animal Accelerometry Data

Nigel Ward

00:28:54

Increasingly, animal biologists are taking advantage of low cost micro-sensor technology, by deploying accelerometers to monitor the behaviour and movement of a broad range of species. The result is an avalanche of complex tri-axial accelerometer data streams that capture observations and measurements of a wide range of animal body motion and posture parameters. We present a system which supports storing, visualizing, annotating, and automatic recognition of activities in accelerometer data streams by integrating semantic annotation and visualization services with Support Vector Machine techniques.


Panel: Handling Big Data for the Environmental Informatics / Real-Time Environmental Observation, Modeling, and Decision Support

Barbara Minsker, Chaowei Yang, David Maidment, Jeff Dozier, Jong Lee, and Ting Ting Zhao

01:26:36

Earth observations and other environmental data collection methods help us accumulate terabytes to petabytes of data. This poses a grand challenge to the informatics for environmental studies. We propose this session to capture the latest developments in Big Data collection, processing, and visualization in several aspects.

With increasing near-real-time availability of embedded and mobile sensors, radar, satellite, and social media, the opportunities to improve understanding, modeling, and management of environmental systems, as well as the built and human systems that interact with environmental systems, are immense.


Active Publications

Ian Foster and Tanu Malik

01:11:05

The eScience domain brings together scientists, experts, and engineers to enterprise comprehensive, large-scale data and computational cyberinfrastructures. The objective is to advance knowledge discovery in the sciences and establish effective channels of communication between the various disciplines. Software, data, workflows, technical reports, and publications are often the modes of this communication. However, currently all these modes of communication are disconnected from each other.

E-publishing is changing the nature of scientific communication through digital publication repositories and libraries. But the larger and more pertinent issue is connecting these yet static digital e-publications repositories to large amounts of computation, data, derived data, and extracted information.


Machine Assisted Thought

Michael Kurtz

00:56:19

I suggest that there are two distinct branches of eScience, both fundamentally enabled by the explosion of capabilities inherent in the information age. The first concerns the use of numbers, measurements from arrays of sensors, outputs from simulations, and so forth. The techniques of eScience increase our ability to perceive massive amounts of data by factors of billions or trillions. I call this Machine Assisted Perception.

The second branch of eScience concerns the use of words, the verbal abstractions used by humans to communicate ideas. The new technologies of digital libraries and search engines have already substantially changed the scholarly thought process, and growth in the capabilities of these technologies continues to be rapid. I call this machine/human collaboration Machine Assisted Thought.


Panel: Cloud Computing—What Do Researchers Want?

Dennis Gannon, Fabrizio Gagliardi, Marty Humphrey, and Paul Watson

01:13:40

Cloud computing for science is seeing take-up in many disciplines, but many researchers are skeptical. In this panel session, we discuss:

  • How researchers are using the cloud today
  • What they want/need for the future
  • Why they might not want to use the cloud

DemoFest 2012

Carly Strasser, Dong Xie, Eamonn Maguire, Ian Foster, Jim Pinkelman, Michael Witt, Rob Fatland, Steve Tuecke, Tanu Malik, and Yan Xu

00:12:45

At the 2012 eScience Workshop, DemoFest presenters briefly introduce their topics.

  • Layerscape: Tools for Collaborative Analysis of Complex Data

Presenter: Rob Fatland, Microsoft Research

  • Globus Online: Research Data Management as a Service

Presenter: Ian Foster, University of Chicago and Argonne National Laboratory

  • The Open-Source ISA Metadata Tracking Framework: from Data Curation and Management at the Source, to the Linked Data Universe

Presenter: Eamonn Maguire, University of Oxford

  • SOLE: Connecting Publications to Large Online Data Repositories

Presenter: Tanu Malik, University of Chicago and Argonne National Laboratory

  • DataUp: A Tool for Documenting and Sharing Scientific Tabular Data

Presenter: Carly Strasser, California Digital Library

  • Databib: An Online Catalog of Research Data Repositories

Presenter: Michael Witt, Purdue University

  • 12,000 Human Genomes from Raw Sequence to Result, on Windows and Windows Azure

Presenter: Dong Xie, Oxford University

  • OData and Environmental Informatics

Presenter: Jim Pinkelman (for Yan Xu), Microsoft Research

Tuesday Breakout Sessions

The Utility of Human/Computer Learning Network for Improving Biodiversity Conservation and Research

Carl Lagoze

00:29:54

We describe our work to improve the quality and utility of citizen science contributions to eBird, arguably the largest biodiversity data collection project in existence. Citizen science (the use of “human sensors”) is especially important in a number of observation-based fields, such as astronomy, ecology, and ornithology, where the scale and geographic distribution of phenomena to be observed far exceed the capabilities of the established research community. Our work is based on the notion of a Human/Computer Learning Network, in which the benefits of active learning (in both the machine learning sense and human learning sense) are cyclically fed back among human and computational participants.


Educating Scientists About the Data Life Cycle

William Michener

00:27:12

The research life cycle is well known and consists of an initial idea or question that, if sound, leads to submission and funding of a proposal, implementation of a study and, ideally, to one or many publications that advance the state of knowledge. What is less well understood is how the research life cycle is related to the data life cycle.

In this presentation, approaches for educating scientists in the eight phases of the data life cycle (planning, data acquisition and organization, quality assurance/quality control, data description, data preservation, data exploration and discovery, data integration, and analysis and visualization) are discussed. Specifically, the design and approaches used for developing learning modules, instructional material and resources, and an innovative three-week experiential course that enable participants to more efficiently and effectively manage their research data and compete for research funding are presented.


Teaching Scientific Data Management in Data Science Education and Workforce Development Programs for Science Communities

Robert R. Downs

00:24:35

Recent popularity of data science has led to increased recognition of the need for education and workforce development in data science. However, definitions of the term, data science, vary and often focus on techniques for data analytics and visualization, omitting scientific data management and related topics associated with data policy, stewardship, and preservation.

Scientific data management encompasses a variety of concepts and methods to foster continuing access and long-term stewardship of data for current and future users. Considering the needs for scientific data management knowledge and capabilities to facilitate improved and persistent accessibility and use of scientific data throughout the data lifecycle, instruction on topics in scientific data management is recommended for data science education and workforce development programs for science communities.


Tools and Techniques for Outreach and Popular Engagement in eScience

Rafael Santos

00:29:47

Public participation in scientific research takes many forms: participation of volunteers in citizen science projects, monitoring of natural resources and phenomena, volunteering of computational resources for distributed data analysis tasks, and so forth.

In this presentation, we comment on some of the computational tools, techniques, and case studies of applications that enable active public participation in scientific research. Of particular interest are applications that showcase the benefits of letting the public use the professional resources (in other words, the same data and computational resources that the scientists have access to) and give something back to the research behind it, such as applications that go beyond simple publication of scientific data or applications that use novel methods for user engagement. Examples of applications for scientific outreach that use specialized computational tools or techniques, and/or educational approaches, are also discussed.


Priorities for Data Curation Education: Data Center Partnerships and Long-Tail Science

Carole Palmer

00:27:27

For science to fully exploit digital data in new and innovative ways, research data will need to be collected, curated, and made accessible and usable across domains. The need for workforce development in data curation systems and services has been recognized for many years, and education programs are beginning to mature. But to continue to build strong programs in this emerging field, current data curation practice and research need to underpin goals for professional education.

Having established a specialization in data curation in 2006, we have assessed our program’s progress to date and identified areas in need of further development to respond to trends in e-science. Analysis of student placements shows interesting trends in the institutions hiring data curation specialists and the nature of the positions, and evaluation of internships provided in national data centers has suggested important areas for further investment. In addition, our recent research on disciplinary differences in data sharing and the value of long-tail data in the sciences has direct implications for further development of data curation curriculum.


Big Data Processing on the Cheap

Joe Hummel

00:55:59

Getting started with big data? Generating more and more data without the hardware resources to process it? This session will help newcomers to ‘big data’ get started processing and visualizing their data, without the need for expensive computing resources. While these techniques may not produce lightning-fast results, you can at least get started with your analysis.


Educating a New Breed of Data Scientists for Scientific Data Management

Jian Qin

00:27:21

Data scientists play active roles in the design and implementation work of four related areas: data architecture, data acquisition, data analysis, and data archiving. While any data and computing related academic unit could offer a data science program or curriculum, each has its own flavor: statistics would weigh heavily toward data analytics, and computer science toward computational algorithms. The information schools are taking a more holistic approach in educating data scientists. This presentation reports on the data science curriculum development and implementation at Syracuse iSchool, which has been shaped by the quickly changing, data-intensive environment not only for science but also for business and research at large. Research projects that we conducted on scientific data management with participation from the e-science student fellows demonstrate the need for, and significance of, educating the new breed of data scientists who have the knowledge and skills to take on the work in the four related areas mentioned above.


Publishing and eScience Panel

James Frew, Jeff Dozier, Mark Abbott, and Shuichi Iwata

01:28:22

Scientific Publishing in a Connected, Mobile World

Speaker: Mark Abbott

New tools for content development and new distribution channels create opportunities for the scientific community, opening new venues for collaboration, review, and self-publication. However, publishing is at the heart of the culture of science, and several centuries of experience with publishing in journals will not simply vanish. Issues of peer review, reproducibility, integrity, and scientific context will need to be addressed before these new tools take hold. Open access is but one part of this conversation.

How to Collaborate with the Crowd: a Method for “Publishing” Ongoing Work

Speaker: Jeff Dozier

The typical model for interdisciplinary research starts with a small-group partnership, typically with colleagues who have known each other for a while. They learn to articulate problems across disciplinary boundaries and discover shared interests. They successfully seek funding, and work together for several years. This model works, but can be cumbersome. An alternative model is to express a sequence of processes and data that integrate to create a suite of data products, and to identify insertion points where expertise from another perspective might be able to contribute to a better solution.

When Provenance Gets Real: Implications of Ubiquitous Provenance for Scientific Collaboration and Publishing

Speaker: James Frew

We expect (or hope?) that the impending standardization of data models, ontologies, and services for information provenance will make scientific collaboration easier and scientific publishing more transparent. We propose a panel of active producers and users of provenance who will address scenarios such as:

  • “I’m a scientist, and this is what I would really like to tell someone with provenance.”
  • “I’m a scientist, and this is what I wish provenance would tell me when I use your data, join your project, or …”
  • “I build systems that capture and/or manage provenance, and this is what I’ve seen scientists actually do when they create and/or use provenance.”

Data Journal Challenge for the Fourth Paradigm-Trust through Data on Environmental Studies and Projects

Speaker: Shuichi Iwata, The Graduate School of Project Design

Landscapes of recent big data issues that bridge environmental studies and social expectations are reviewed in order to design an e-Journal with data files and models. Data parts are keys to give semantics to original scientific papers, and also double keys for computational models. Structured data with explicit descriptions about their metadata can be managed and their traceability can be realized systematically, step by step. However, almost all available data are unstructured, fragmented, and contain ambiguities and uncertainties. Balances between data quality and freshness/costs/coverage are discussed so as to draw a road map for a data journal, referring to two preliminary case studies on materials data and data due to nuclear reactor accidents and problems.


What Is a Data Scientist?

Kenji Takeda and Liz Lyon

00:23:38

The term “data scientist” is becoming prevalent in science, engineering, business, and industry. We explore how the term is used in different contexts, segments, and sectors; we examine the different variants, flavors, and interpretations and try to answer the following questions:

  • What does a data scientist really do?
  • What skills does a data scientist need? How do they acquire them?
  • What tools, technologies, and platforms are used by data scientists?
  • How can we build data scientist capacity and capability for the future?

Informatics, Information Science, Computer Science, and Data Science Curricula

Geoffrey Fox

00:27:57

We describe possible data science curricula based on discussions at Indiana University and experience with our Informatics, Computer Science, and Library and Information Science programs. This leads to an interesting breadth of courses and students’ interests, which could address the many job opportunities. We suggest a collaboration to build a MOOC (online) offering with one initial target: minority serving institutions.


Data Science Curricula at the University of Washington eScience Institute

Bill Howe

00:35:14

The University of Washington eScience Institute is engaged in a number of educational efforts in data science, including certificate programs for professionals, workshops for students in domain science, a new data-oriented introductory programming course, and a data science MOOC to be offered through Coursera in the spring. We consider the tools, techniques, research topics, and skills to be well-aligned with the data-driven discovery emphasis of eScience itself—the only difference is the applications.

We see several benefits in aligning these two areas. For example, students in science majors who are not pursuing research careers become more marketable. In the other direction, working professionals see opportunities to apply their skills to solve science problems—we have recruited volunteers from industry in this way. In this talk, I’ll discuss these activities, review our curriculum, and describe our next steps.


Novel Approaches to Data Visualization

Darren Thompson, Dawn Wright, and George Djorgovski

01:19:20

Data Visualization in Virtual Spaces and High Dimensions

Speaker: George Djorgovski

Visualization is a bridge between the quantitative content of data and human intuition and understanding. Effective visualization is a critical bottleneck as the complexity and dimensionality of data increase. I will describe some experiments in collaborative, multi-dimensional data visualization in immersive virtual reality.

CT and Imaging Tools for Windows HPC Clusters and Azure Cloud

Speaker: Darren Thompson

Computed Tomography (CT) is a non-destructive imaging technique widely used across many scientific, industrial, and medical fields. It is both computationally and data intensive. Our group within CSIRO has been actively developing X-ray tomography and image processing software and systems for GPU-enabled Windows HPC clusters.

A key goal of our systems is to provide our “end users”—researchers—with easy access to the tools, computational resources, and data via familiar interfaces and client applications without the need for specialized HPC expertise. We have recently explored the adaptation of our CT-reconstruction code to the Windows Azure cloud platform, for which we have constructed a working “proof-of-concept” system. However, at this stage, several challenges remain to be met in order to make it a truly viable alternative to our HPC cluster solution.

Work in Progress Toward Enhancing Multidimensional Visualization with Analytical Workflows

Speaker: Dawn Wright

Big Data, particularly from terrestrial sensor networks and ocean observatories, exceed the processing capacity and speed of conventional database systems and architectures, and require visualization in three and four dimensions in order to understand the Earth processes at play. Successfully addressing the scientific challenges of Big Data requires integrative and innovative approaches to developing, managing, and visualizing extensive and diverse data sets, but is also critically dependent on effective analytical workflows. This talk will present an emerging agenda and work in progress toward this end at Environmental Systems Research Institute.


Panel: Scientific Data: the Current Landscape, Challenges, and Solutions

Carly Strasser, Chris Mentzel, Dave Vieglais, Jeff Dozier, Stephanie Wright, and William Michener

01:30:17

Funders, researchers, and public stakeholders increasingly see the need to better communicate and curate ever-expanding bodies of research data. This panel will bring together many of the stakeholders in the scientific data community, including researchers, librarians, and data repositories.

Before the panel commences, we will provide a brief introduction to scientific data to facilitate discussion. We will describe the current landscape of scientific data and its management, including publication, citation, archiving, and sharing of data. We will also describe existing tools for data management. The panel discussion will focus on identifying gaps and unmet needs in order to help chart a path for future policy, service, and infrastructure development.