eScience Workshop 2010

About

Microsoft Research—in partnership with the Berkeley Water Center, the Colleges of Engineering and Natural Resources at UC Berkeley, and the Lawrence Berkeley National Laboratory—held the 2010 Microsoft Research eScience Workshop on October 11–13 in Berkeley, California.

Scaling the Science

Current opportunities in the physical and biological sciences and their technological applications require the means to fundamentally understand processes at the molecular level and to extend those processes to predict performance at larger scales. eScience is developing approaches for conducting this scaling and has been essential in addressing fundamental questions in biology and astronomy. While additional applications remain in the basic sciences, these fields have demonstrated pathways for advances in the applied environmental and social sciences where the linkages between scales and disciplines require focused contributions from the eScience community.

This “Scaling the Science” workshop provided opportunities to observe how eScience has provided scaling across various fields and to explore some of the challenges that remain for realizing the ambitions of the fourth paradigm.

About the Workshop

The goal of this seventh annual cross-disciplinary workshop was to bring together scientists from diverse research disciplines to share their research and discuss how computing is transforming their work. The event also included the presentation of the third annual Jim Gray eScience Award to a researcher who has made an especially significant contribution to the field of data-intensive computing.

Primary support for the workshop was provided by Microsoft External Research, headed by Corporate Vice President Tony Hey.

 

Highlights

Jim Gray eScience Award 2010

Each year, Microsoft Research presents the Jim Gray eScience Award to a researcher who has made an outstanding contribution to the field of data-intensive computing. Find out who the recipient of this year’s award is.

Jim Gray eScience Award

From 2007 to 2014, the Jim Gray eScience Award recognized eight researchers for their outstanding work in the field of eScience. Recognizing these pioneers in data-intensive science has helped advance the prestige of the field and strengthen the community.

Past award recipients

2014 award

Paul Watson was awarded the 2014 Jim Gray eScience Award

Dr. Paul Watson is professor of Computer Science and director of the Digital Institute at Newcastle University, UK, where he also directs the $20M RCUK Digital Economy Hub on Social Inclusion through the Digital Economy. As a Lecturer at Manchester University, he was a designer of the Alvey Flagship and Esprit EDS systems. From 1990 to 1995, he worked in industry for ICL as a designer of the Goldrush MegaServer parallel database server. In August 1995, he moved to Newcastle University, where he has been an investigator on a wide range of eScience projects. His research interest is in scalable information management, with a current focus on cloud computing. Professor Watson is a Chartered Engineer and a Fellow of the British Computer Society. Learn more…

2013 award

Tony Hey presents David Lipman with the 2013 Jim Gray eScience Award

David Lipman, M.D., is director of the National Center for Biotechnology Information (NCBI). Under his leadership, NCBI has become one of the world’s premier repositories of biomedical and molecular biology data, providing invaluable information to both the research community and the public. Every day, more than 3 million users access NCBI’s more than 40 databases. Learn more…

2012 award

Antony John Williams receives the Jim Gray eScience Award from Tony Hey at the 2012 Microsoft Research eScience Workshop

Antony John Williams is vice president of strategic development and head of Chemoinformatics for the Royal Society of Chemistry. He has pursued a career built on rich experience in experimental techniques, implementation of new nuclear magnetic resonance (NMR) technologies, research and development, and teaching, as well as analytical laboratory management. He has been a leader in making chemistry publicly available through collective action: his work on ChemSpider helps provide fast text and structure search access to data and links on more than 28 million chemicals, and this resource is freely available to the scientific community and the general public. Learn more…

2011 award

Tony Hey presents Mark Abbott with the Jim Gray eScience Award at the 2011 Microsoft Research eScience Workshop

Mark Abbott is dean and professor in the College of Oceanic and Atmospheric Sciences at Oregon State University. He is also serving a six-year term on the National Science Board, which oversees the National Science Foundation and provides scientific advice to the White House and to Congress. Throughout his career, Mark has contributed to integrating biological and physical science, made early innovations in data-intensive science, and provided educational leadership. Learn more…

2010 award

Phil Bourne accepts the Jim Gray eScience Award from Tony Hey at the 2010 eScience Workshop

Phil Bourne, the recipient of the third annual Jim Gray eScience Award, is a professor in the Department of Pharmacology and Skaggs School of Pharmacy and Pharmaceutical Sciences at the University of California at San Diego. Phil is also the Associate Director of the RCSB Protein Data Bank, an Adjunct Professor at the Burnham Institute, and a past president of the International Society for Computational Biology. “Phil’s contributions to open access in bioinformatics and computational biology are legion, and are exactly the sort of groundbreaking accomplishments in data-intensive science that we celebrate with the Jim Gray Award,” notes Tony Hey, Corporate Vice President of External Research. Learn more…

2009 award

Jeff Dozier accepts the Jim Gray eScience Award from Tony Hey at the 2009 eScience Workshop

Jeff Dozier was presented the 2009 award in recognition of his achievements in advancing environmental science through leading multi-disciplinary research and collaboration. While presenting the award, Tony Hey stated, “Jeff Dozier’s work epitomizes what the Jim Gray eScience Award is all about … using data-intensive computing to accelerate scientific discovery and, ultimately, to help solve some of society’s greatest challenges. By combining environmental science with computer science technologies, Jeff brings a new level of understanding to climate change and its impact on our planet.” Learn about Dozier’s thoughts on environmental science in The Fourth Paradigm: Data-Intensive Scientific Discovery, pages 13–19.

2008 award

Tony Hey presents Carole Goble with the 2008 Jim Gray eScience Award

At the 2008 Microsoft eScience Workshop, the Jim Gray eScience Award was presented to Carole Goble in recognition of her contributions to the development of workflow tools to advance data-centric research. To learn about her work and the role of workflow tools in scientific research, see myExperiment and The Fourth Paradigm: Data-Intensive Scientific Discovery, pages 137–145.

2007 award

Tony Hey presents Alex Szalay with the 2007 Jim Gray eScience Award

The winner of the first Jim Gray eScience Award was Alex Szalay, professor in the Department of Physics and Astronomy at The Johns Hopkins University. Alex was recognized for his foundational contributions to interdisciplinary advances in the field of astronomy and groundbreaking work with Jim Gray.

Using Software to Enhance Healthcare

Johnson & Johnson Pharmaceutical R&D is using Microsoft Research’s Microsoft Biology Foundation to design new chemical compounds that could improve the health and quality of life of patients around the world.

Using Software to Enhance Healthcare

By Rob Knies

October 12, 2010 6:00 AM PT

Researchers at Johnson & Johnson Pharmaceutical Research and Development (J&J PRD) faced a challenge. Over the years, they have built a state-of-the-art platform to enable discovery of small-molecule drugs, but the expanding role of biologics in pharmaceutical research required a new set of tools to handle large-molecule compounds.

Developing such functionality from scratch was a daunting proposition. It would take time and resources while delaying development of novel treatments for debilitating diseases and disorders.

Researchers at Microsoft Research had a solution. Their new, open-source library of bioinformatics functions, the Microsoft Biology Foundation (MBF), part of the Microsoft Biology Initiative, was designed to address just such a challenge. When the J&J PRD researchers learned about this, they immediately became intrigued.

This confluence of need and opportunity occurred in late November 2009. Now, less than a year later, the benefit has become manifestly apparent. Instead of spending costly time building a foundation for the new biological infrastructure, J&J PRD was able to focus on delivering value-added functionality needed to facilitate development of innovative treatments that have the potential of improving the health and quality of life of patients around the world.

“By using MBF, we were able to provide our users with a greater level of functionality in less time for our initial development phase in the large-molecule space,” says Jeremy Kolpak, J&J PRD senior analyst, who will be discussing his team’s MBF deployment during the 2010 eScience Workshop, being held in Berkeley, Calif., from Oct. 11–13. “It allowed us to focus on value-added functionality for our scientists and has helped us adapt to new requests quite easily.”

Such testimony brings a smile to the face of Simon Mercer, director of Health and Wellbeing for External Research, a division of Microsoft Research.

“The principal advantage of MBF,” Mercer says, “is that, because it’s free and open-source, as a programmer, you get a certain amount of prewritten functionality that you can just build on top of. It gives you more time to do the real science, because we’ve already supplied the basics.”

It didn’t take long for J&J PRD to grasp the implications of MBF.

“We were in the process of developing our own infrastructure to work with sequences,” Kolpak explains. “This was part of a larger move in our organization to improve how R&D with large molecules was performed and integrate that process with an existing and mature framework for working with small molecules.

“We have been using MBF from the day we heard of it.”

That is precisely the focus of the Health and Wellbeing effort within External Research: to collaborate openly with the bioinformatics community by applying advanced computing technologies to provide unprecedented insight into disease and human healthcare.

MBF, built on the Microsoft .NET Framework and aimed at making it easier to implement biological applications on the Windows platform, was launched in Boston on July 9 during the 11th annual Bioinformatics Open Source Conference. Since then, thousands of bioinformaticians have downloaded the tool kit.

“There are a lot of biologists who start as post-docs but don’t end up going into biological research themselves,” Mercer says. “They end up managing the data and writing the scientific applications that the biologists need to do research. They can be anywhere on the continuum from full biologists with no computing background to full computer scientists with little or no biological background.

“They work alongside the biological scientists, but they won’t necessarily be those scientists. They’ll write scripts and write programs to help the lab run, and they’ll also probably do some data analysis.”

Companies and academics that pursue such work, naturally, are more concerned with the value they can derive from using software tools than with building the tools themselves.

“I’ve heard it over and over again from executives of different pharmaceutical companies,” Mercer says. “Possibly 90 percent of their software stack has been developed in house but offers them no competitive advantage. The real crown jewels in bioinformatics are relatively small compared with the huge bulk of software they have to maintain.

“They’re often in a situation where they want to exchange data with other pharmaceutical companies on a pre-compete level, and they find that hard, because their processing pipelines are uniquely their own. A lot of commercial companies are looking for things like MBF to adopt as a common platform, so they are using the same tools, analyzing the data in the same way, and they are able to share data sets and cut costs.”

In other words, MBF helps make bioinformaticians’ work a bit simpler. That certainly appears to be the case at J&J PRD.

“We have integrated it into our data-analysis and -visualization platform, Third Dimension Explorer, which has been developed in house,” Kolpak says. “This platform is used in a multitude of different contexts.”

With regard to J&J PRD’s large-molecule exploration, he lists five distinct tasks the integration enables:

  • View sequences with their associated assay data to see how variations across compounds impact targets.
  • Align multiple sequences.
  • View aligned sequences and their associated metadata, such as complementarity-determining regions.
  • Extract and translate regions of sequences.
  • Work with sequences of different formats to provide a generic platform for scientists to import and analyze them in one place.

“The goal,” Kolpak says, “is to capture operations that are performed routinely and make them extremely efficient to execute in one place. But at the same time, we are not trying to replace existing sequence-analysis tools for the more complex and less-used operations.”
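
As a concrete illustration of one such routine operation, the following sketch extracts and translates a region of each sequence in a FASTA file. It uses the open-source Biopython library rather than MBF or Third Dimension Explorer, and the file name and coordinates are purely illustrative.

    # Illustrative sketch of a routine sequence operation (extract and translate a
    # region), using Biopython rather than MBF/3DX; the file name and coordinates
    # are hypothetical placeholders.
    from Bio import SeqIO

    for record in SeqIO.parse("antibody_sequences.fasta", "fasta"):
        region = record.seq[96:129]    # extract a region of interest (made-up coordinates)
        protein = region.translate()   # translate the nucleotide region into amino acids
        print(record.id, protein)

MBF packages this same kind of prewritten building block (parsers, sequence objects, alignment and translation routines) for .NET code, which is what spared the J&J PRD team from writing that layer from scratch.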

At Johnson & Johnson Pharmaceutical R&D, there are hundreds of users of the Third Dimension Explorer tool. The MBF-related development is still being completed and rolled out, but 40 people already are using the enhanced data-analysis platform—and deriving significant benefits.

“It’s hard to quantify the amount of time it has saved us,” Kolpak says, “due to the fact we work with an agile development methodology and, for each iteration, we are finding new functionality in MBF that we can utilize. I would say that, for our initial rollout, which required a large amount of framework implementation, it saved us around three months during a six-month initial development cycle.”

Biological work might not be the first thing that comes to mind when people think about Microsoft, but the company supports such scientists nevertheless.

“Inside Microsoft Research, we’ve done lots of biology,” Mercer says. “It’s not what everybody would expect, but a lot of researchers apply their computer-science research in the biological domain for healthcare. How can you apply Microsoft technologies to scientific research? We often do that through collaborations with academics, where the academic brings the biology, in this case, and Microsoft brings the computer science. Together, hopefully, we advance further than either side would have done independently.

“Eventually, you have to ask yourself the question, ‘Why don’t we just build a platform so that all of the common elements are written once and don’t need to be written again for every single project?’ And once that platform exists, and it’s open-source and free, why not give it away to the community so it can benefit?”

There are specific ways in which MBF can assist in the biological domain, such as with modularity, extensibility, and code maintenance.

“Those sorts of things that professional programmers think of aren’t necessarily the first things in the minds of those who are writing scripts to support a lab,” Mercer continues. “MBF sits in the middle, with prewritten functionality in nice, digestible chunks, very standardized.”

There are quite a few other biological libraries akin to MBF already in use, some of them for a decade or more. But over time, they have grown unwieldy, making it hard to extend them. And they tend to be written in script-based languages that have no type checking. MBF, on the other hand, offers type checking and guarantees, and it’s built atop the common-language runtime, providing the flexibility to handle any of the more than 70 languages that work with .NET, thereby making it easy for a heterogeneous community to use without having to conform to a single language.

“We’ve also wrapped the individual bits of MBF as workflow activities for our Trident workflow workbench,” Mercer adds, “which is also free and downloadable. You don’t even have to be a programmer to use MBF. You can just drag and drop and connect the building blocks together to build workflow pipelines.”

External Research attempts to understand the precise scientific challenge encountered by its MBF partners, a methodology termed scenario-based development that identifies areas where MBF can be made more useful. That methodology will be a key component of the next wave of the tool’s enhancement.

“We’re approaching our partners in the academic community and the commercial world to define those scenarios,” Mercer says, “and that’s what’s driving the direction in MBF v2. We encourage the wider community—people who download the source code, understand it, and start developing their own extensions to support their own science—to participate, because the more of those we get, the more broadly we can develop MBF. It will grow by the actions of the community, to support the science that the community wants to support.”

That, in the example of J&J PRD, is exactly what is happening.

“A lot of what is on our wish list we have been developing in stride,” Kolpak says, “mainly a visualization tool for viewing sequences, in addition to support for some other sequence file formats that contain more than just sequence data. These are all things we plan to contribute back to MBF development.”

And the community at which MBF is focused expects to use open-source code.

“If we want to run a project that would be recognizable and familiar in form to the academic community,” Mercer says, “then that would be a software-development project that is open-source, because open-source is a very common model there. We want to get contributions from as broad a set of people as possible.

“We want scientists to get value out of using Windows,” he concludes. “We want scientists to pick up different tools that we have and understand that they can help them do their research more effectively and reach insights more quickly than they would otherwise manage to do. We’ve got a lot of value to offer in that area.”

The folks at Johnson & Johnson Pharmaceutical Research and Development couldn’t agree more.

“I am a software developer by trade,” Kolpak says, “and by using MBF, I have the confidence that what I am providing our users is not just solid code, but also that the science behind it is accurate.”

Studying the Breathing of the Biosphere

Researchers at University of California, Berkeley, work with Microsoft Research to analyze vast amounts of data without supercomputers.

Keynotes and Presentations

Monday, October 11

Keynote Presentations

UK e-Science: a Jewel or a Thousand Flowers

Malcolm Atkinson, e-Science Institute

The global digital revolution provides a fertile and turbulent ecological environment in which e-Science is a small but vital element. There is a deep history of e-Science, but coining the term and injecting leadership and modest funds had a huge impact. A veritable explosion of activity has led to a global burst of new e-Science species. Our challenge is to understand what will enable them to thrive and yield maximum benefit as the digital revolution continues to be driven by commerce and media.

Webcast

Jim Gray eScience Award Presentation

This year, Microsoft Research presents the Jim Gray eScience Award to a researcher who has made an outstanding contribution to the field of data-intensive computing. The award—named for Jim Gray, a Technical Fellow for Microsoft Research and a Turing Award winner who disappeared at sea in 2007—recognizes innovators whose work truly makes science easier for scientists.

Webcast

Making Open Science Real

Adam Bly, Seed

The future of science is open, not because it ought to be but because it needs to be. Today, science’s potential is hindered by the disconnected nature of the world’s scientific information and the closed architecture of science itself. So how do we get from here to there? How can technology make open science real?

Webcast

Tutorials

Tutorial MT1

Microsoft Biology Foundation: An Open-Source Library of Re-usable Bioinformatics Functions and Algorithms Built on the .NET Platform

Webcast

Tutorial MT2

Scientific Data Visualization using WorldWide Telescope

Webcast

Tutorial MT3

Data-Intensive Research: Dataset Lifecycle Management for Scientific Workflow, Collaboration, Sharing, and Archiving

Webcast

Tutorial MT4

Parallel Computing with Visual Studio 2010 and the .NET Framework 4

Webcast

Sessions

Session MA1 | Senses Across Scales

Webcast

Exploration of Real-Time Provenance-Aware Virtual Sensors Across Scales for Studying Complex Environmental Systems

Yong Liu, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Development and Application of Network of Geosensors for Environmental Monitoring

Rafael Santos, INPE – Brazilian National Institute for Space Research

Session MA2 | Data Analysis Through Visualization

Webcast

BLAST Atlas: A Function-Based Multiple Genome Browser

Lawrence Buckingham, Queensland University of Technology

DIVE: A Data Intensive Visualization Engine

Dennis Bromley, University of Washington

Session MA3 | Health & Wellbeing I

Webcast

Simplifying Oligonucleotide Primer Design Software to Keep Pace with an Ever Increasing Demand for Assay Formats

Kenneth “Kirby” Bloom, Illumina Corporation

Integration of Sequence Analysis into Third Dimension Explorer Leveraging the Microsoft Biology Framework

Jeremy Kolpak, Janssen Pharmaceutical Companies of Johnson & Johnson

Session MA5 | From Environmental Science to Public Policy

Webcast

Achieving an Ecosystem Based Approach to Planning in the Puget Sound

Stephen Stanley, Washington Department of Ecology

Adapting Environmental Science Methods to Public Policy and Decision Support

Rob Fatland, Microsoft Research

Session MA6 | Complex Biological Systems in Action

Webcast

An Interactive Modeling Environment for Systems Biology of Aging

Pat Langley, Arizona State University

Session MA7 | Data-Intensive Science

Webcast

Analyzing the Process of Knowledge Dynamics in Sustainability Innovation: Towards a Data-Intensive Approach to Sustainability Science

Masaru Yarime, University of Tokyo

Data-Intensive Science for Safety, Trust, and Sustainability

Shuichi Iwata, The University of Tokyo

Session MA8 | Health & Wellbeing II

Webcast

BL!P: A Tool to Automate NCBI BLAST Searches and Customize the Results for Exploration in Live Labs Pivot

Vince Forgetta, McGill University

GenoZoom: Browsing the Genome with Microsoft Biology Foundation, Deep Zoom, and Silverlight

Xin-Yi Chua, Queensland University of Technology

Tuesday, October 12

Keynote Presentation

The Reaming of Life

Philip Bourne, University of California, San Diego

Anyone can punch a hole in a piece of metal, but a reamer is needed to accurately size and finish that hole. Digital computers are the reamers of life, bringing together a vast array of disparate bits of data to provide an accurate picture of life that can be smoothly transcended across scales, from molecules to populations. Sounds heady, so why do we not fully understand the molecular basis of cancer? Why can’t we accurately model the impact of an oil spill on marine life? Why can’t we decide whether there is a tree of life or a network of life? “Well tonight we are going to sort it all out, for tonight it’s the reaming of life.”

Webcast

Sessions

Session TM1 | Data from Ocean to Stars

Webcast

Data, Data, Everywhere, nor Any Drop to Drink: New Approaches to Finding Events of Interest in High Bandwidth Data Streams

Mark Abbott, Oregon State University

Extreme Database-centric Computing in Science

Alex Szalay, Johns Hopkins University

Session TM2 | Health & Wellbeing III

Webcast

Model-Driven Cloud Services for Cancer Research

Marty Humphrey, University of Virginia

Cloud-Based Map-Reduce Architecture for Nuclear Magnetic Resonance-Based Metabolomics

Paul Anderson, Wright State University

Session TM3 | Tools to Get Science Done

Webcast

MyExperimentalScience, Extending the “Workflow”

Jeremy Frey, University of Southampton

The Conversion Software Registry

Michal Ondrejcek, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Session TM5 | Cloud Computing and Chemistry

Webcast

oreChem: Planning and Enacting Chemistry on the Semantic Web

Mark Borkum, University of Southampton

Accelerating Chemical Property Prediction with Cloud Computing

Hugo Hiden, Newcastle University

Session TM6 | Health & Wellbeing IV

Webcast

Remote Computed Tomography Reconstruction Service on GPU-Equipped Computer Clusters Running Microsoft HPC Server 2008

Timur Gureyev, Commonwealth Scientific and Industrial Research Organisation (CSIRO)

e-LICO: Delivering Data Mining to the Life Science Community

Simon Jupp, University of Manchester

Session TM7 | Database Diversity

Webcast

SQL is Dead; Long Live SQL: Lightweight Query Services for Ad Hoc Research Data

Bill Howe, University of Washington

SinBiota 2.0 – Planning a New Generation Environmental Information System

João Meidanis, University of Campinas

Session TA1 | Enabling Scientific Discovery

Webcast

Enhancing the Quality and Trust of Citizen Science Data

Jane Hunter, The University of Queensland

Scientist-Computer Interfaces for Data-Intensive Science

Cecilia Aragon, Lawrence Berkeley National Laboratory

Enabling Scientific Discovery with Microsoft SharePoint

Kenji Takeda, University of Southampton

Session TA2 | Health & Wellbeing V

Webcast

Genome-Wide Association of ALS in Finland

Bryan Traynor, National Institute on Aging, National Institutes of Health

A Framework for Large-Scale Modelling of Population Health

John Ainsworth, University of Manchester

GREAT.stanford.edu: Generating Functional Hypotheses from Genome-Wide Measurements of Mammalian Cis-Regulation

Gill Bejerano, Stanford University

Session TA3 | Virtual Research Environments and Collaboration

Webcast

Medici: A Scalable Multimedia Environment for Research

Joe Futrelle, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

BlogMyData: A Virtual Research Environment for Collaborative Visualization of Environmental Data

Andrew Milsted, University of Southampton

RightField: Rich Annotation of Experimental Biology Through Stealth Using Spreadsheets

Matthew Horridge, University of Manchester

Session TA4 | Applications in Digital Humanities

Webcast

musicSpace: Improving Access to Musicological Data

mc schraefel, University of Southampton

Quantifying Historical Geographic Knowledge from Digital Maps

Tenzing Shaw, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Data Intensive Research in Computational Musicology

David De Roure, Oxford e-Research Centre

Session TA5 | Agriculture, Digital Watersheds and Heterogeneous Climate Data

Webcast

Scaling Information on ‘Biosphere Breathing’ from Chloroplast to the Globe

Dennis Baldocchi, University of California-Berkeley

Agrodatamine: Integrating Analysis of Climate Time Series and Remote Sensing Images

Humberto Razente, UFABC

Session TA6 | Health & Wellbeing VI

Webcast

Correction for Hidden Confounders in Genetic Analyses

Jennifer Listgarten, Microsoft Research

BioPatML.NET and Its Pattern Editor: Moving into the Next Era of Biology Software

James Hogan, Queensland University of Technology

Session TA7 | eScience in Systems

Webcast

GRAS Support Network, Its Implementation, Operation, and Use

Fritz Wollenweber, EUMETSAT

Data Intensive Frameworks for Astronomy

Jeffrey Gardner, University of Washington

Session TA8 | Archaeo Informatics

Experiences and Visions on Archaeo Informatics

Christiaan Hendrikus van der Meijden, IT Group, Veterinary Faculty, Ludwig Maximilians University; Peer Kröger and Hans-Peter Kriegel, Database Systems Group, Department of Computer Science, Ludwig Maximilians University

Wednesday, October 13

Keynote Presentations

OpenSource & Microsoft: Beyond Interoperability

Sam Ramji, Apigee

Microsoft’s open source strategy has shifted over the years, from ignore to fight to interoperate. Recently, the company has changed course to use open source as an engine of innovation and growth for core businesses. This talk will cover details of projects that showcase the shifts in strategy and expose the underlying dynamics of open source in the software industry.

Webcast

Scaling the Science

Garrison Sposito, U.C. Berkeley; Mark Stacey, U.C. Berkeley; Stephanie Carlson, U.C. Berkeley; Charlotte Ambrose, NOAA’s National Marine Fisheries Service; James Hunt, U.C. Berkeley

The current opportunities in the physical and biological sciences and their technological applications require the means to fundamentally understand processes at the molecular scale and to extend those processes to predict performance at larger scales. As examples, material science is using resolution at the scale of an atom to predict and design devices that are orders of magnitude larger, and biological processes are dictated by interactions at molecular, cellular, organismal, population, and ecosystem levels. Spatial and temporal scaling across orders of magnitude requires analysis tools that are available for computation, aggregation, and visualization. eScience is developing approaches for conducting this scaling and has been essential in addressing fundamental questions in biology and astronomy. While additional applications remain in the basic sciences, these fields have demonstrated pathways for advances in the applied environmental and social sciences, where the linkages between scales and disciplines require focused contributions from the eScience community. This workshop provides opportunities to observe how eScience has provided the scaling across various fields and to explore some of the challenges that remain.

Webcast

Sessions

Session WM2 | Challenges of Data Standards & Tools

Webcast

Panel: Challenges of Data Standards and Tools

Deb Agarwal, LBNL/UCB; Bill Howe, University of Washington; Alex James, Microsoft; Yong Liu, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign; Maryann Martone, UCSD; Yan Xu, Microsoft Research

Session WM3 | Data and Visualization

Webcast

Scientific Data Sharing and Archiving at UC3/CDL: the Excel Add-in Project and More

John Kunze, California Digital Library/California Curation Center; Tricia Cruse, California Digital Library/California Curation Center

Visualizing All of History with Chronozoom

David Shimabukuro, University of California-Berkeley; Roland Saekow, University of California-Berkeley

Session WM4 | Health & Wellbeing VII

Webcast

Proteome-Scale Protein Isoform Characterization with High Performance Computing

Jake Chen, Indiana University

Answering Biological Questions by Querying k-Mer Databases

Paul Greenfield, CSIRO Mathematics, Informatics and Statistics

Tutorials

Tutorial WT1

CoSBiLab: Enabling Simulation-Based Science

Webcast

Tutorial WT2

Scientific Data Visualization using WorldWide Telescope

Webcast

Tutorial WT3

Data-Intensive Research: Dataset Lifecycle Management for Scientific Workflow, Collaboration, Sharing, and Archiving

Webcast

Tutorial WT4

OData – Open Data for the Open Web

Webcast

Abstracts

Exploration of Real-Time Provenance-Aware Virtual Sensors Across Scales for Studying Complex Environmental Systems

Yong Liu, Alejandro Rodriguez, Joe Futrelle, Rob Kooper, and Jim Myers, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

In this position paper, we present our extended concept of, and preliminary work on, “Real-Time Provenance-Aware Virtual Sensors” across scales for studying complex environmental systems, especially sensor-driven real-time environmental decision support and situational awareness. Real-time provenance-aware virtual sensors can re-publish transformed “data, information, and knowledge” streams as virtual sensor streams, with associated provenance information that describes their causal relationships and derivation history in real time. An early implementation of Open Provenance Model-compliant provenance capture across heterogeneous layers of workflows, system daemons, and user interactions, together with the re-publishing of the provenance-aware virtual sensors, is presented to illustrate the value for environmental systems research and the improved interoperability with the Open Geospatial Consortium’s Sensor Web Enablement standards.

Development and Application of Network of Geosensors for Environmental Monitoring

Rafael Santos, INPE – Brazilian National Institute for Space Research

Some of the goals of the Brazilian National Institute for Space Research relate to research on space and the environment in general and to the development of tools and methods to support that research. One of these research areas is the modeling and study of the interaction between the Earth’s atmosphere and the terrestrial biosphere, which plays a fundamental role in the climate system and in biogeochemical and hydrological cycles through the exchange of energy and mass (for example, water and carbon) between the vegetation and the atmospheric boundary layer.

The main focus of many environmental studies is to quantify this exchange over several terrestrial biomes.

Over natural surfaces like the tropical forests, factors like spatial variations in topography or in the vegetation cover can significantly affect the air flow and pose big challenges for the monitoring of the regional carbon budget of terrestrial biomes.

With this motivation, a partnership involving INPE, FAPESP (the Research Council for the State of São Paulo), Microsoft Research, the Johns Hopkins University, and the University of São Paulo was created to research, develop, and deploy prototypes of environmental sensors (geosensors) in the Atlantic coastal forest and the Amazonian rain forest in Brazil, forming sensor networks with high spatial and temporal resolution, and to develop software tools for data quality control, integration with other sensor data, data mining, visualization, and distribution.

This short talk presents some concepts, approaches, solutions and challenges on the computational aspects of our project.

BLAST Atlas: A Function-Based Multiple Genome Browser

Lawrence Buckingham, Queensland University of Technology

BLAST Atlas is a visual analysis system for comparative genomics that supports genome-wide gene characterization, functional assignment and function-based browsing of one or more chromosomes. Inspired by applications such as the WorldWide Telescope, Bing Maps 3D and Google Earth, BLAST Atlas uses novel three-dimensional gene and function views that provide a highly interactive and intuitive way for scientists to navigate, query and compare gene annotations. The system can be used for gene identification and functional assignment or as a function-based multiple genome comparison tool which complements existing position-based comparison and alignment viewers.

DIVE: A Data Intensive Visualization Engine

Dennis Bromley, Steven Rysavy, David Beck, and Valerie Daggett, University of Washington

Data-driven research is a rapidly emerging commonality throughout scientific disciplines. Recently, with the proliferation of inexpensive commodity computing clusters, synthetic data sources such as modeling and simulation are capable of producing a continuous stream of terascale data. Confronted with this data deluge, domain scientists are in need of data-intensive analytic environments. Dynameomics is a terascale simulation-driven research effort designed to enhance our understanding of protein folding and dynamics through molecular dynamics simulation and modeling. The project routinely involves exploratory analysis of 100+ terabyte datasets using an array of heterogeneous structural biology-specific tools. In order to accelerate the pace of discovery for the Dynameomics project, we have developed DIVE, a framework that allows for rapid prototyping and dissemination of domain-independent (e.g., clustering) and domain-specific analyses in an implicitly iterative workflow environment.

The information in the data warehouse is classified into three categories: raw data, derived data, and state data. Raw data are generated from simulations and models, derived data are produced through tools operating on the raw data, and state data constitute the record of the exploratory workflow, which has the added benefit of capturing the provenance of derived data.

DIVE empowers researchers by simplifying and expediting the overhead associated with shared tool use and heterogeneous datasets. Furthermore, the workflow provides a simple, interactive, and iterative data-oriented investigation paradigm that tightens the hypothesis generation loop. The result is an expressive, flexible laboratory informatics framework that allows researchers to focus on analysis and discovery instead of tool development.

Simplifying Oligonucleotide Primer Design Software to Keep Pace with an Ever Increasing Demand for Assay Formats

Kenneth “Kirby” Bloom, Illumina Corporation

As the pace of research and discovery in biotechnology continues to accelerate rapidly, oligonucleotide primer design software with plug-in algorithm architecture and scalable processing capabilities has become essential. With constantly changing algorithms leveraging a multitude of technologies for employing various chemistries and locus targeting techniques, the ability to manage, maintain, and extend the source code and data repositories became a hurdle for getting new products to market.

This challenge was met by creating a dynamic execution model that enables drag-and-drop component construction through the use of Microsoft Workflow to allow for simplicity and scalability in the application. This architecture had the effect of decreasing the time needed to deliver new assays to market by 60 percent. Identifying a generic workflow pattern to support primer design also helped structure an architecture yielding a more than 700 percent speed improvement and the ability to scale the solution across multiple servers to meet burst demand scenarios.

Integration of Sequence Analysis into Third Dimension Explorer Leveraging the Microsoft Biology Framework

Jeremy Kolpak, Michael Farnum, Victor Lobanov, and Dimitris Agrafiotis, Janssen Pharmaceutical Companies of Johnson & Johnson

Third Dimension Explorer (3DX) is a powerful, internally developed .NET platform designed to address a broad range of data analysis and visualization needs across Johnson & Johnson Pharmaceutical Research & Development. 3DX employs a plugin approach that allows the development of extensions for particular tasks while sharing a common set of core analytic and visualization functionality. This architecture has allowed us to extend the 3DX platform to many areas of pharmaceutical R&D, from early drug discovery (e.g., analysis of chemical structures and their associated biological properties) to mining of electronic medical records.

As 3DX became a foundational system for small molecule pharmaceutical R&D and its use became widespread throughout the company, the need to extend its capabilities to support biologics research quickly emerged. As with small molecules, we followed a two-pronged approach: 1) integrate large molecule discovery data into our existing global discovery data warehouse known as ABCD, and 2) develop a new set of advanced sequence-activity analysis and visualization tools under the 3DX framework to leverage and complement its existing capabilities. The end result is a unique offering: a single data warehouse integrating both small and large molecule data (ABCD), and a single end-user application for mining and visualizing that data (3DX).

However, expanding 3DX’s analysis capabilities to biologics was not a trivial task. Two options were available to us at that time: 1) re-implement the entire infrastructure ourselves from scratch, or 2) attempt to integrate existing tools built on disparate technology platforms. Neither option was appealing: the former because of resource constraints, and the latter because of the inherent maintenance and performance issues. Fortunately, it was at that time that MBF was released in beta, and it offered an excellent foundation for seamless integration into our native .NET platform, providing much of the core functionality needed to meet our researchers’ needs.

The functionality that we developed enables interactive visualization and editing of multiple sequence alignments (via a customized sequence viewer plugin) and integration of data mining and analytic capabilities (e.g., BLAST searching of sequence libraries, multiple sequence alignment, sequence editing and translation, segment extraction, and so forth). While the sequence viewer is most useful when integrated into a general data mining application like 3DX, it was designed as a 3DX-independent extension of the MBF, thus providing a generic platform for viewing sequences and their associated metadata. It is our intention to make it freely available for use under the Microsoft Public License.

Achieving an Ecosystem Based Approach to Planning in the Puget Sound

Stephen Stanley, Susan Shull, and Susan Grigsby, Washington Department of Ecology; Gino Luchetti, King County DNR; Margaret Macleod and Peter Rosen, City of Issaquah; Millie Judge, Lighthouse Natural Resources Consulting, Everett

Watershed research over the past 20 years has recognized that factors controlling the biological and physical functions at the site scale operate over multiple spatial and temporal scales (Naiman and Bilby, 1998; Beechie & Bolton, 1999; Hobbie, 2000; Benda, 2004; Simenstad et al., 2006; King County, 2007). This requires data at mid and broad scales from watersheds encompassing thousands of hectares. However, available data at mid and broad scales are often inaccurate and inconsistent in their coverage. This complicates the effort to understand the mechanistic relationship between the impacts of a land use activity upon a watershed process and site-scale functions and environmental responses, such as low survival of salmonid eggs or flooding. Furthermore, watershed assessments require the integration of knowledge from multiple scientific disciplines; there is a lack of a common language, however, and a mismatch between data sets in terms of forms of knowledge and different levels of precision and accuracy (Benda et al., 2002). As a result, the predictive ability and management utility of watershed assessment tools have been considered low (Beman, 2002). Because of these data/scale and integration issues, state and local governments have not developed a standard system for using watershed information to inform future development patterns in a manner that avoids significant long-range impacts to aquatic ecosystems. These issues have also prevented the public from adequately understanding the important role that broad-scale data could play in protecting and restoring aquatic resources.

To help incorporate watershed data and assessment into local planning efforts, the State of Washington is developing a watershed characterization and planning framework for Puget Sound. This includes methods to assess multiple watershed processes and integrate the results into “decision templates” (Stanley et al., 2009, in review). The templates help interpret and apply the characterization information appropriately.

Adapting Environmental Science Methods to Public Policy and Decision Support

Rob Fatland, Microsoft Research

Dozier and Gail posit a new Science of Environmental Applications, driven more by need than traditional scientific curiosity. I present here a brief elaboration on applying this idea to public policy and decision support based on an example of aquifer management on a small (22 km²) island in Puget Sound. I use the first person plural “we” to imply a community of environmental application problem solvers interested in sharing solutions in the way that scientists share research, from methods to results. In consequence, these remarks concern the sociology of integrating science with decision making, a process with attendant difficulties (today) in both sharing and adopting solutions.

An Interactive Modeling Environment for Systems Biology of Aging

Pat Langley, Arizona State University

In this paper, we describe an interactive environment for the representation, interpretation, and revision of qualitative but explanatory biological models. We illustrate our approach on the systems biology of aging, a complex topic that involves many interacting components. We also report initial experiences with using this environment to codify an informal model of aging. We close by discussing related efforts and directions for future research.

Analyzing the Process of Knowledge Dynamics in Sustainability Innovation: Towards a Data-Intensive Approach to Sustainability Science

Masaru Yarime, University of Tokyo

Sustainability science is an academic field that analyzes the processes of production, diffusion, and utilization of various types of knowledge with long-term consequences for innovation. Three components can be identified in the knowledge dynamics system in society. Knowledge has aspects of content, quantity, quality, and rate of circulation. Actors are characterized by their heterogeneity, linkages and networks, and the interactions among them. Institutions cover a diverse set of entities, ranging from informal ones such as norms and practices to more formal ones including rules and laws. Sustainability science thus deals with dynamic, complex interactions among diverse actors creating, transmitting, and applying various types of knowledge under institutional conditions. Several phases are identified in the production, diffusion, and utilization of knowledge, with different actors involved in each. Gaps and inconsistencies inevitably exist among the phases in terms of the quantity, quality, and rate of knowledge processed, and this effectively constitutes a major challenge in pursuing sustainability on a global scale. The phases of knowledge dynamics include problem discovery, scientific investigation, technological development, diffusion in society, and reactions from stakeholders in society. These phases are analyzed using a data-intensive approach, assembling and integrating a diverse set of data through bibliometric analysis of scientific articles published in academic journals, patent analysis of technologies, life cycle assessment of products, and discourse analysis of mass media. Case studies of innovation in photovoltaic and water treatment technologies are conducted by assembling and integrating various types of data on the different phases of knowledge dynamics. They suggest that gaps and inconsistencies in the knowledge circulation system pose serious challenges to the pursuit of sustainability innovation.

Data-Intensive Science for Safety, Trust, and Sustainability

Shuichi Iwata and Pierre Villars, The University of Tokyo

Thoughts on a “Data Commons” for data-intensive science are reported based on our preliminary studies of data-driven materials design, targeting not only materials themselves but also time-dependent properties such as the aging of engineering products and human bodies and the degradation of environments.

Our methods are not powerful enough to predict the time-dependent properties of complex systems, so we use causality and correlation in data to ensure that safety margins are adequate. Thus, in short, “safety” is confirmed by data, and “trust” is built by adequate margins, again confirmed by data. These subjects are data-intensive from the beginning due to their inherent complexity.

To deal with such complexity proactively and obtain a set of creative, holistic views on each time-dependent system, we propose a “Data Commons” as a platform for collective knowledge. It is to be constructed by addressing the following two challenges:

  1. Horizontal comparative approaches to gain perspectives through a set of two-dimensional maps of deep semantics, as demonstrated by our former project, the LPF (Linus Pauling File)
  2. Vertical converging (heuristic inverse/direct) approaches toward a concrete target, beyond “multi-scale modeling,” as attempted by VEMD (Virtual Experiments for Materials Design) to bridge gaps between data and models while allowing a rich diversity of scenarios

The third challenge is to drive abductive approaches so as to become free from “lock-in,” which can be attained by strategically organizing (1) and (2) through collective knowledge. A paradigm for data-centric science is discussed through a preliminary study along this approach. Commitment to collective knowledge is the key to sustainability.

BL!P: A Tool to Automate NCBI BLAST Searches and Customize the Results for Exploration in Live Labs Pivot

Vince Forgetta and Ken Dewar, McGill University; Moussa S. Diarra, Pacific Agri-Food Research Centre, Agriculture and Agri-Food Canada; Simon Mercer, Microsoft Research

NCBI BLAST is a tool widely used to annotate protein coding sequences. Current limitations in the annotation process are in part dictated by the methodology used. The manual inspection of BLAST results is slow, tedious and limited to static analysis of textual output, while automated analyses typically discard useful information in favor of increased speed and simplicity of analysis. These limitations can be addressed using data exploration and visualization software, such as Live Labs Pivot by Microsoft, a software application that allows for the fluid exploration of large datasets in an intuitive manner. We have created a Microsoft Windows application, BL!P [blip] or BLAST in Pivot, that automates NCBI BLAST searches, fetches associated GenBank records, and converts this information into a Pivot collection. Also, BL!P provides an interface to create customized images for each BLAST match, allowing the user to perform further customizations to meet their data exploration objectives.
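
As a rough, hedged illustration of the first steps BL!P automates (running an NCBI BLAST search and fetching GenBank records for the matches), the sketch below uses Biopython’s NCBI web-service wrappers rather than BL!P itself; the query file, database choices, hit count, and e-mail address are placeholders, and the final Pivot-collection step is not shown.

    # Illustrative sketch (not BL!P itself): run a remote NCBI BLAST search and
    # fetch GenBank/GenPept records for the top hits, using Biopython.
    from Bio import SeqIO, Entrez
    from Bio.Blast import NCBIWWW, NCBIXML

    Entrez.email = "you@example.org"  # placeholder; NCBI requests a contact address

    query = SeqIO.read("query_protein.fasta", "fasta")              # hypothetical query file
    result_handle = NCBIWWW.qblast("blastp", "nr", str(query.seq))  # remote BLAST search
    blast_record = NCBIXML.read(result_handle)

    for alignment in blast_record.alignments[:5]:                   # top five matches only
        handle = Entrez.efetch(db="protein", id=alignment.accession,
                               rettype="gb", retmode="text")
        print(handle.read()[:200])                                  # print the record header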

GenoZoom: Browsing the Genome with Microsoft Biology Foundation, Deep Zoom, and Silverlight

Xin-Yi Chua, Queensland University of Technology; Michael Zyskowski, Microsoft Research

Many current genome browsers face a number of limitations: they do not support smooth, rapid navigation of large-scale data from high to low resolutions; information is limited to a predefined set of genomic data; lengthy setup is required to display a user’s own genome sequences; and they do not support unformatted user annotations. GenoZoom is an investigation that attempts to address these limitations by utilizing the richness enabled by Silverlight [1] and Deep Zoom [2] technologies.

Data, Data, Everywhere, nor Any Drop to Drink: New Approaches to Finding Events of Interest in High Bandwidth Data Streams

Mark Abbott, Ganesh Gopalan, and Charles E. Sears, Oregon State University

The amount of unstructured data gathered and managed annually by organizations within both the research and the business sectors is growing exponentially. Qualitatively, this shift is even more radical, as the conceptual framework for data moves from a historic, disaggregated, and static perspective to one based on assumptions about the potential of dynamic data management and collaboration. Knowledge extraction will require new tools to enable new levels of collaboration, visualization, and synthesis. This is not just scaling up traditional compute workflows to accommodate greater volumes; it is about scaling out to broadly dispersed data and teams that come together to work on specific business and science issues. We are using high-definition (HD) data arrays derived from a range of observing systems and models as streaming data sets. The problem space is defined as detecting, annotating, and classifying events or features in the HD stream, linking these with an XML-based database, and providing web services to a broad range of network-aware devices, not just deskside workstations. We are developing a content-based high-definition video search engine that integrates multiple Microsoft technologies, including a multi-touch interface to query and navigate through video clips, WPF for transitions in the interface, a SQL Server back end with an HTTP endpoint to search through video using MPEG-7, and CLR stored-procedure integration to support MPEG-7 tasks directly within the database. Finding the data “drop” of interest will require new approaches, not simply “scaling up” the hardware and approaches we have used for the past decades. Instead, we must accommodate the “scaling out” of data sources, repositories, and users. Our research explores these new avenues to capture, analyze, visualize, distribute, and present large-scale digital e-science content.

Extreme Database-centric Computing in Science

Alex Szalay, Tamas Budavari, Laszlo Dobos, and Richard Wilton, Johns Hopkins University

Scientific computing is becoming increasingly about analyzing massive amounts of data. In a typical academic environment, managing data sets below 10 TB is easy; above 100 TB, it is very difficult. Databases offer many advantages for the typical patterns required for managing scientific data sets but lack a few important features. Here we present recent projects at JHU aimed at bridging the gap between databases and scientific computing. We have implemented a framework that enables us to execute SQL Server user-defined functions on GPGPUs, implemented a new array data type for SQL Server, and run several science analysis tasks using these features.

Model-Driven Cloud Services for Cancer Research

Marty Humphrey, University of Virginia

The cancer Biomedical Informatics Grid (caBIG) is a virtual network of interconnected data, individuals, and organizations. Overseen by the NIH National Cancer Institute (NCI), caBIG is redefining how research is conducted, care is provided, and patients/participants interact with the biomedical research enterprise. Given its ambitious goal and vision, caBIG faces a huge number of technical and economic challenges. The software underlying caBIG must be user-friendly, scalable, secure, evolvable and evolving, able to find and process the relevant information necessary to the computation at hand, interoperable with other platforms, cost-effective, and so forth. Delivering on these requirements has the potential to be truly transformative, revolutionizing cancer research and transforming patient health care into a highly personalized model.

However, it has been observed that the current software of caBIG is very restrictive—there is a tremendous learning curve, whereby researchers must often become familiar with a whole new set of tools and methodologies (based on Java). caBIG is fundamentally model-driven; however, the current modeling capabilities in caBIG are rigid and ineffective, and many of the potential benefits of a model-driven architecture are not being realized. Infrastructure costs (both with respect to software design/deployment and with respect to running deployed services) are starting to overwhelm caBIG as it seeks to expand.

In our prior work (Microsoft eScience Workshop 2008), we demonstrated how to create a caBIG data service based on ADO.NET Data Services and WCF. In this talk, we demonstrate how we address these challenges through the use of Microsoft SQL Server modeling technologies, the ADO.NET Entity Framework in .NET 4.0, OData, Microsoft Visual Studio 2010, and Windows Azure to deliver model-driven cloud services for cancer research.

Cloud-Based Map-Reduce Architecture for Nuclear Magnetic Resonance-Based Metabolomics

Paul Anderson, Satya Sahoo, Ashwin Manjunatha, Ajith Ranabahu, Nicholas Reo, Amit Sheth, and Michael Raymer, Wright State University; Nicholas DelRaso, Air Force Research Laboratory

The science of metabolomics is a relatively young field that requires intensive signal processing and multivariate data analysis for interpretation of experimental results. We present a scalable scientific workflow approach to data analysis, in which the individual cloud-based services exploit the inherent parallel structure of the algorithms. Two significant capabilities are the adaptation of an open-source workflow engine (Taverna), which provides flexibility in selecting the most appropriate data analysis technique regardless of its implementation details, and the implementation of several common spectral processing techniques in the cloud using a parallel map-reduce framework, Hadoop. Due to its parallel processing architecture and its fault-tolerant file system, Hadoop is ideal for analyzing large spectroscopic data sets.
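To make the map-reduce structure concrete, here is a small conceptual sketch written in C# with LINQ rather than Hadoop's Java API. The binning task and all names are hypothetical stand-ins for the spectral processing steps the abstract mentions; Hadoop distributes the same map, shuffle, and reduce phases across nodes.

```csharp
using System.Collections.Generic;
using System.Linq;

static class MapReduceSketch
{
    // Map: each (chemicalShiftPpm, intensity) pair is assigned to a bin.
    // Reduce: intensities falling in the same bin are summed.
    public static Dictionary<int, double> BinSpectrum(
        IEnumerable<(double ppm, double intensity)> peaks, double binWidth)
    {
        return peaks
            .Select(p => (bin: (int)(p.ppm / binWidth), p.intensity)) // map
            .GroupBy(x => x.bin)                                      // shuffle
            .ToDictionary(g => g.Key, g => g.Sum(x => x.intensity));  // reduce
    }
}
```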

MyExperimentalScience, Extending the 'Workflow'

Jeremy Frey, Andrew Milsted, Danius Michaelides, and David De Roure, University of Southampton

For the past few years there has been a great deal of activity in the preservation and dissemination of “in silico” experiments through the sharing of “workflows”. This term has been used to describe the processes performed by such experiments, but it can also apply to “real” in vitro experiments, by describing the experimental steps performed by the scientist. In the past, these workflows would have been recorded in a paper lab book, so the only way to share them was to write a journal paper around the procedure or to expose full pages of the lab book. With the introduction of Virtual Research Environments (VREs) and Electronic Laboratory Notebooks (ELNs), it is now possible to share these processes.

The MyExperimentalScience project linked the myExperiment platform with the LabBlog ELN. myExperiment is a collaborative environment in which scientists can safely publish their workflows and experimental plans, share them with groups, and find those of others. Workflows, other digital objects, and bundles (called Packs) can now be swapped, sorted, and searched like photos and videos on the web. Unlike Facebook or MySpace, myExperiment fully understands the needs of the researcher and makes it really easy for the next generation of scientists to contribute to a pool of scientific methods, build communities, and form relationships—reducing time-to-experiment, sharing expertise, and avoiding reinvention. myExperiment is now the largest public repository of scientific workflows.

The Conversion Software Registry

Michal Ondrejcek, Kenton McHenry, and Peter Bajcsy, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

We have designed a web-based Conversion Software Registry (CSR) for collecting information about software packages that are capable of file format conversions. The work is motivated by a community need for finding file format conversions that are not discoverable via current search engines, and by the specific need to support systems that can actually perform conversions, such as NCSA Polyglot. In addition, the value of the CSR lies in complementing existing file format registries, such as the Unified Digital Format Registry (UDFR, formerly GDFR) and PRONOM, and in introducing software quality information obtained by content-based comparisons of files before and after conversion. The contribution of this work is in the CSR data model design, which includes file-format-extension-based conversions as well as software scripts, software quality measures, and test-file-specific information for evaluating software quality. We have populated the CSR with the help of National Archives and Records Administration (NARA) staff. The Conversion Software Registry provides multiple search services. As of May 28, 2010, the CSR has been populated with 183,142 conversions, 544 software packages, 1,316 file format extensions associated with 273 MIME types, and 154 PRONOM identifications.

oreChem: Planning and Enacting Chemistry on the Semantic Web

Mark Borkum, Simon Coles, and Jeremy Frey, University of Southampton

This paper presents the oreChem Core Ontology (CO), an extensible ontology for the description of the planning and enactment of scientific methods. Currently, a high level of domain-specific knowledge is required to identify and resolve the implicit links that exist between the digital artefacts that are realised during the enactment of a scientific experiment. This creates a significant barrier to entry for independent parties that wish to discover and reuse the published data. The CO radically simplifies and clarifies the problem of representing a scientific experiment, facilitating the discovery and reuse of the raw, intermediate, and derived results in the correct context. In this paper, we present an overview of the CO and discuss its integration with the eCrystals repository for crystal structures.

Accelerating Chemical Property Prediction with Cloud Computing

Hugo Hiden, Paul Watson, David Leahy, Jacek Cala, Dominic Searson, Vladimir Sykora, and Simon Woodman, Newcastle University

This paper describes the use of cloud computing to accelerate the building of models to predict chemical properties. The chemists in the project have unique software—the Discovery Bus—that automatically builds quantitative structure-activity relationship (QSAR) models from chemical activity datasets. These models can then be used to design better, safer drugs, as well as more environmentally benign products.

Recently, there has been a dramatic increase in the availability of activity data, creating the opportunity to generate new and improved models. Unfortunately, the competitive workflow algorithm used by the Discovery Bus requires large computational resources to process data; for example, the chemists recently acquired some new datasets which would take more than five years to process on their current, single-server infrastructure.

This is potentially an ideal cloud application as large computational resources are required, but only when new datasets become available. Therefore, in the “Junior” project, we have designed and built a scalable, Windows Azure cloud-based infrastructure in which the competitive model-building techniques are explored in parallel on up to 100 nodes. As a result, the rate at which the Discovery Bus can process data has been accelerated by a factor of more than 100, and the new datasets can be processed in weeks rather than years.

Remote Computed Tomography Reconstruction Service on GPU-Equipped Computer Clusters Running Microsoft HPC Server 2008

Timur Gureyev, Yakov Nesterets, Darren Thompson, Alex Khassapov, Andrew Stevenson, Sheridan Mayo, and John Taylor, Commonwealth Scientific and Industrial Research Organisation (CSIRO); Dimitri Ternovski, Trident Software Pty. Ltd.

We describe a complete, integrated, thick-client system for remote computed tomography (CT) reconstruction, simulation, and visualization services utilising computer clusters optionally equipped with multiple graphics processing units (GPUs). All computers in our system, including the user PCs, web servers, file servers, and compute cluster nodes, run flavours of the Windows OS, which greatly simplifies the development, installation, administration, and replication of the system. Our design is also aimed at streamlining and simplifying user interaction with the system, which differentiates it from most software available on today’s compute clusters, which typically requires some familiarity with parallel computing environments from the user. We briefly describe the high-level architectural design of the system, as well as the two-level parallelization of the most computationally intensive modules utilising both the multiple CPU cores and the multiple GPUs available on the cluster. Finally, we present some results on the current system’s performance.

e-LICO: Delivering Data Mining to the Life Science Community

Simon Jupp, James Eales, Rishi Ramgolam, Alan Williams, Robert Stevens, and Carole Goble, University of Manchester; Simon Fischer, Rapid-I GmbH; Jorg-Uwe Kietz, University of Zurich

Life science research is generating a vast amount of data; data are produced at many granularities, from information about molecular interactions to planetary meteorological information. One of the challenges in bioinformatics is how best to provide biologists with the necessary tools and infrastructure to process, analyse, and explore these data.

e-LICO is a project that seeks to develop a collaborative environment using Taverna and myExperiment for scientists to build and share scientific workflows, with a specific focus on support for text and data mining. Data mining is a complicated process, resulting in workflows consisting of several steps for each of data gathering, integration, preparation, modeling, evaluation, and deployment. e-LICO utilizes existing e-science infrastructure (myExperiment, Taverna) along with integrated AI-planning techniques to build data-mining workflows (via case-based planning and hierarchical task-decomposition planning).

SQL is Dead; Long Live SQL: Lightweight Query Services for Ad Hoc Research Data

Bill Howe and Garret Cole, University of Washington

We find that relational databases remain underused in science, despite a natural correspondence between exploratory hypothesis testing and ad hoc “query answering.” The upfront costs to deploy a relational database prevent widespread use by small labs or individuals, while the development time for custom workflows or scripts is too high for interactive Q&A. We are exploring a new way to inject SQL into the scientific method, motivated by these observations:

  • We reject the conventional wisdom that “scientists won’t write SQL.” Rather, we implicate the process of data modeling, schema design, cleaning, and ingest in preventing the uptake of the technology by scientists.
  • We observe that cloud platforms, specifically the Windows Azure platform and Amazon’s EC2 service, drastically reduce the effort required to erect a production-quality database server.
  • We observe that simply sharing examples of SQL queries allows the scientists to self-train, bootstrapping the technological independence needed to allow our work to serve many labs simultaneously.

Guided by these premises, we have built a simple prototype that allows users to upload their data and immediately query it—no schema design, no reformatting, no DBAs, no obstacles. We provide a “starter kit” of SQL queries, translated from English questions provided by the researchers themselves, that demonstrate the basic idioms for retrieving and manipulating data. These queries are saved within the application, and can be copied, modified, and saved anew by the researchers. Beyond these core requirements, we seek novel features to facilitate authoring, sharing, and reuse of SQL statements, as well as analysis and visualization of results. A cloud-based deployment on Windows Azure allows us to establish a global, interdisciplinary corpus of example queries, which we mine to help users find relevant example queries, organize and integrate data, and construct new queries from scratch.
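For readers unfamiliar with the idea of a "starter kit" query, the fragment below shows the general flavor: an English question translated into SQL and executed against an uploaded table. The ocean_casts table, its columns, and the connection string are hypothetical; this is not the prototype's actual schema or API.

```csharp
using System;
using System.Data.SqlClient;

class StarterKitQuery
{
    static void Main()
    {
        // Hypothetical uploaded table: ocean_casts(station, depth_m, oxygen_ml_l).
        // English question: "What is the mean oxygen below 100 m at each station?"
        const string sql = @"
            SELECT station, AVG(oxygen_ml_l) AS mean_oxygen
            FROM ocean_casts
            WHERE depth_m > 100
            GROUP BY station
            ORDER BY mean_oxygen DESC;";

        using (var conn = new SqlConnection("<connection string>"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader["station"], reader["mean_oxygen"]);
        }
    }
}
```

Saved alongside the English question, such a query can be copied, modified, and re-saved by researchers, which is the self-training loop the abstract describes.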

SinBiota 2.0 – Planning a New Generation Environmental Information System

João Meidanis, Pedro Feijão, Cleber Mira, and Carlos Joly, University of Campinas

In March 1999, the State of São Paulo Research Foundation (FAPESP) launched a research program on the characterization, conservation, restoration, and sustainable use of the biodiversity of the state, known as the “BIOTA-FAPESP” Program. Over the years, this program accumulated about 100,000 records of observations and collections of biological material. A new journal was founded, and the program even influenced state laws regarding land use. Along with the program, an information system, called SinBiota, was developed to hold the data generated by its participants.

After ten years, the system is in need of a major reorganization. In this paper, we cover the steps being undertaken to achieve this goal, including consulting IT specialists, listening to the user community, and establishing a multi-phase plan. We also present the current state of affairs, which involves research in areas such as multimedia search, cloud computing, and database scalability, as well as the implementation of a prototype of the new system, in a project jointly funded by FAPESP and Microsoft Research.

Enhancing the Quality and Trust of Citizen Science Data

Jane Hunter and Abdulmonem Alabri, The University of Queensland; Catharine van Ingen, Microsoft Research

The Internet, Web 2.0, and social networking technologies are enabling citizens to actively participate in “citizen science” projects by contributing data to scientific programs via the web. However, the limited training, knowledge, and expertise of contributors can lead to poor-quality, misleading, or even malicious data being submitted. Consequently, the scientific community often perceives citizen science data as low quality and not worthy of being used in serious scientific research. In this paper, we describe a technological framework that combines data quality improvements and trust metrics to enhance the reliability of citizen science data. We describe how trust models can provide a simple and effective mechanism for measuring the trustworthiness of community-generated data. We also describe filtering services that remove unreliable or untrusted data and enable scientists to confidently re-use citizen science data. The resulting software services are evaluated in the context of the Coral Watch project—a citizen science project that uses volunteers to collect comprehensive data on coral reef health.
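As a purely illustrative sketch, and not the framework described in the paper, the following fragment shows one simple way a trust score and a trust-based filter could be computed, assuming a contributor's history of agreement with expert-validated observations is available.

```csharp
using System.Collections.Generic;
using System.Linq;

static class TrustSketch
{
    // Illustrative only: a contributor's trust score is the fraction of their
    // past observations that agreed with expert-validated measurements,
    // smoothed with a small prior so new contributors start near neutral.
    public static double TrustScore(int agreed, int total,
                                    double priorScore = 0.5, double priorWeight = 5.0)
    {
        return (agreed + priorScore * priorWeight) / (total + priorWeight);
    }

    // Observations from contributors below a threshold are filtered out
    // before the data are offered to scientists for re-use.
    public static IEnumerable<T> FilterByTrust<T>(
        IEnumerable<(T observation, double trust)> items, double threshold = 0.7)
    {
        return items.Where(i => i.trust >= threshold).Select(i => i.observation);
    }
}
```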

Scientist-Computer Interfaces for Data-Intensive Science

Cecilia Aragon, Lawrence Berkeley National Laboratory

Many of today’s important scientific breakthroughs are made by large, interdisciplinary collaborations of scientists working in geographically distributed locations, producing and collecting vast and complex datasets. Experimental astrophysics, in particular, has recently become a data-intensive science after many decades of relative data poverty. These large-scale science projects require software tools that support not only insight into complex data but also collaborative scientific discovery. Such projects do not easily lend themselves to fully automated solutions, requiring hybrid human-automation systems that facilitate scientist input at key points throughout the data analysis and scientific discovery process. This paper presents some of the issues to consider when developing such software tools, and describes Sunfall, a collaborative visual analytics system developed for the Nearby Supernova Factory, an international astrophysics experiment and the largest-data-volume supernova search currently in operation. Sunfall utilizes novel interactive visualization and analysis techniques to facilitate deeper scientific insight into complex, noisy, high-dimensional, high-volume, time-critical data. The system combines novel image processing algorithms, statistical analysis, and machine learning with highly interactive visual interfaces to enable collaborative, user-driven scientific exploration of supernova image and spectral data. Sunfall is currently in operation at the Nearby Supernova Factory; it is the first visual analytics system in production use at a major astrophysics project.

Enabling Scientific Discovery with Microsoft SharePoint

Kenji Takeda, Richard Boardman, Steven Johnston, Mark Scott, Leslie Carr, Simon Coles, Simon Cox, Graeme Earl, Jeremy Frey, Philippa Reed, Ian Sinclair, and Tim Austin, University of Southampton

Scientists, researchers, and engineers facing increasing amounts of data must create, execute, and navigate complex workflows, collaborate within and outside their organisations, and share their work with others. In this paper, we demonstrate how the Microsoft SharePoint platform provides an integrated feature set that can be leveraged to significantly improve the productivity of scientists and engineers. We investigate how SharePoint 2010 can be used, and extended, to manage data and workflow in a seamless way and to enable users to share their data with full access control. We describe, in detail, how we have used SharePoint 2010 as the IT infrastructure for a large, multi-user facility, the µ-Vis CT scanning centre. We also demonstrate how we are creating a user-centric data management system for archaeologists, and show how SharePoint 2010 can be integrated into the everyday lives of scientists and engineers for managing and publishing their data through our Materials Data Centre, which provides an easy-to-use data management system from lab bench to journal publication via EPrints.

Genome-Wide Association of ALS in Finland

Bryan Traynor, National Institute on Aging, National Institutes of Health

We performed a genome-wide association study of amyotrophic lateral sclerosis (ALS) in Finland to determine the genetic variants underlying the disease in this population. Finland is an ideal location for performing genetic studies of ALS, because it has one of the highest incidences of the disease in the world and because the population is known to be remarkably genetically homogeneous. We genotyped a cohort of 442 Finnish ALS patients and 521 Finnish control subjects using HumanHap370 arrays, which assay more than 300,000 SNPs across the human genome. This DNA was collected by our colleague Dr. Hannu Laaksovirta, who reviews nearly all patients diagnosed with this fatal neurodegenerative disease in the country. We were pleased to find two highly significant association peaks in our GWAS: one located on chromosome 21 near the SOD1 gene, which is known to have a particularly high prevalence in the Finnish population, and the other located on chromosome 9p21. Together, these two loci account for nearly the entire increased incidence of ALS in Finland.

A Framework for Large-Scale Modelling of Population Health

John Ainsworth, Iain Buchan, Nathan Green, Matthew Sperrin, Richard Williams, Philip Couch, Emma Carruthers, and Eleanora Fichera, University of Manchester; Martin O’Flaherty and Simon Capewell, University of Liverpool

Statistical and informatics methods for synthesising disparate sources of public health evidence are under-developed. This is in part due to the amount of human resource required to synthesise complex evidence, and in part due to a research environment that rewards the study of the independent effects of specific factors on health more than discovering the complexity of health. In particular, it remains difficult to compare the potential impacts of community-based prevention strategies, such as smoking cessation, with clinical treatments, such as lipid-lowering drugs. Thus there is a lack of usefully complex models that might underpin the full appraisal of health policy options by policy-makers. We present a system that enables health care professionals to collaborate on the design of complex models of population health, which can then be used to evaluate and compare the impact of interventions.

GREAT.stanford.edu: Generating Functional Hypotheses from Genome-Wide Measurements of Mammalian Cis-Regulation

Gill Bejerano and Cory Y. McLean, Stanford University

Recent technological advances in DNA sequencing provide an unprecedented view of the regulatory genome in action. We can now sequence all binding events of transcription factors and transcription-associated factors, examine the dynamics of different chromatin marks, assay for nucleosome positioning and open chromatin, and more. However, attempts to interpret these data using computational tools developed for microarray analysis often fall short, leaving researchers to manually scrutinize only handfuls of their copious data.

We developed the Genomic Regions Enrichment of Annotations Tool (GREAT) to provide the first computational tool that correctly analyzes whole-genome cis-regulatory data. Whereas microarray-based methods are forced to consider only binding proximal to genes, GREAT is able to properly incorporate distal binding sites, which greatly enhances the resulting interpretations. Applying GREAT to ChIP-seq data sets of multiple transcription-associated factors in different contexts, we recover many functions of these factors that are missed by existing gene-based tools, and we generate novel hypotheses that can be experimentally tested. GREAT can be similarly applied to any dataset of localized genomic markers enriched for known or putative cis-regulatory function.

GREAT incorporates biological annotations from 20 ontologies and has been made available to the scientific community as an intuitive web tool. Direct submission is also available from the UCSC Genome Browser via the Table Browser.

Medici: A Scalable Multimedia Environment for Research

Joe Futrelle, Luigi Marini, Rob Kooper, Joel Plutchak, Alan Craig, Terry McLaren, and Jim Myers, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Large-scale community collections of images, videos, and other media are a critical resource in many areas of research and education including the physical sciences, biology, medicine, humanities, arts, and social sciences. Researchers face coupled problems in managing large amounts of data, analysis and visualization over such collections, and managing descriptive metadata and provenance information. NCSA is involved in a wide range of projects targeting collections that involve terabytes to petabytes of data, complex image processing pipelines, and rich provenance linking. Based on this experience, we have developed Medici—a general multimedia management environment based on Web 2.0 interfaces, semantic content management, and service/cloud-based workflow capabilities that can support a broad range of high-throughput research techniques and community data management. Medici provides scalable storage and media processing capabilities, simple desktop and web 2.0 user interfaces, social annotations, preprocessing and preview capabilities, dynamically extensible metadata, provenance support, and citable persistent data references. This talk will provide an overview of Medici’s capabilities and use cases in the humanities and microscopy as well as describe core research and development challenges in creating usable systems incorporating rich semantic context derived from distributed automated and manual sources.

BlogMyData: A Virtual Research Environment for Collaborative Visualization of Environmental Data

Andrew Milsted, Jeremy Frey, Jon Blower, and Adit Santokhee, University of Southampton

Understanding and predicting the Earth system requires the collaborative effort of scientists from many different disciplines and institutions. The National Centre for Earth Observation (NCEO) and the National Centre for Atmospheric Science Climate Group (NCAS-Climate) are both high-profile interdisciplinary research centres involving numerous universities and institutes around the UK and many international collaborators. Both groups make use of the latest numerical models of the climate and earth system, validated by observations, to simulate the environment and its response to forcings such as an increase in greenhouse gas emissions. Their scientists must work together closely to understand the various aspects of these models and assess their strengths and weaknesses.

At the present time, collaborations take place chiefly through face-to-face meetings, the scholarly literature and informal electronic exchanges of emails and documents. All of these methods suffer from serious deficiencies that hamper effective collaboration. For practical reasons, face-to-face meetings can be held only infrequently. The scholarly literature does not yet adequately link scientific results to the source data and thought processes that yielded them, and additionally suffers from a very slow turnaround time. Informal exchanges of electronic information commonly lose vital context; for example, scientists typically exchange static visualizations of data (as GIFs or PostScript plots for example), but the recipient cannot easily access the data behind the visualization, or customize the visualization in any way. Emails are rarely published or preserved adequately for future use. The recent adoption of “off the shelf” Wikis and basic blogs has addressed some of these issues, but does not usually address specific scientific needs or enable the interactive visualization of data.

RightField: Rich Annotation of Experimental Biology Through Stealth Using Spreadsheets

Matthew Horridge, Katy Wolstencroft, Stuart Owen, and Carole Goble, University of Manchester; Wolfgang Mueller and Olga Krebs, HITS gGmbH

RightField is an open-source application that provides a mechanism for embedding ontology annotation support for scientific data in Excel spreadsheets. It was developed during the SysMO-DB project to support a community of scientists who typically store and analyse their data using spreadsheets. It helps keep annotation consistent and compliant with community standards whilst making the annotation process quicker and more efficient.

RightField is an open-source, cross-platform Java application that is available for download.

musicSpace: Improving Access to Musicological Data

mc schraefel, David Bretherton, Daniel Smith, and Joe Lambert, University of Southampton

Efforts over the past decade to digitize scholarly musicological materials have revolutionized the research process; however, online research in musicology is now held back by the segregation of data into a plethora of discrete and disparate databases, and by the use of legacy or ad hoc metadata specifications that are unsuited to modern demands. Many real-world musicological research questions are rendered effectively intractable because there is insufficient metadata or metadata granularity, and a lack of data source integration. The “musicSpace” project has taken a dual approach to solving this problem: designing back-end services to integrate (and where necessary surface) available (meta)data for exploratory search from musicology’s key online data providers; and providing a front-end interface, based on the “mSpace” faceted browser, to support rich exploratory search interaction.

We unify our partners’ data using a multi-level metadata hierarchy and a common ontology. By using RDF for this, we make use of the many benefits of Semantic Web technologies, such as the facility to create multiple files of RDF at different times and using different tools, assert them into a single graph of a knowledge base, and query all of the asserted files as a whole. In many cases we were able to directly map a record field from a partner’s dataset to our combined type hierarchy, but in other cases some light syntactic and/or semantic analysis needed to be performed. This small amount of work in the pre-processing stage adds granularity that significantly enriches the data, allowing for more refined filtering and browsing of records via the search UI. Significantly, although all the data we extract is present in the original records, much of it is neither exposed to nor exploitable by the end-user via our data providers’ existing UIs. In musicSpace, however, all data surfaced can be used by the musicologist for the purposes of querying the dataset, and can thus aid the process of knowledge discovery and creation.

Our work offers an effective generalizable framework for data integration and exploration that is well suited for arts and humanities data. Our benchmarks have been (1) to make tractable previously intractable queries, and thereby (2) to accelerate knowledge discovery.

Quantifying Historical Geographic Knowledge from Digital Maps

Tenzing Shaw, Peter Bajcsy, Michael Simeone, and Robert Markley, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

An important question facing historians is how knowledge of different geographic regions varied between nations and over time. This type of question is often answered by examining historical maps created in different regions and at different times, and by evaluating the accuracy of these maps relative to modern geographic knowledge. Our research focuses on quantifying and automating the process of analyzing digitized historical maps in an effort to improve the precision and efficiency of this analysis.

In this paper, we describe an algorithmic workflow designed for this purpose. We discuss the application of this workflow to the problem of automatically segmenting Lake Ontario from French and British historical maps of the Great Lakes region created between the 16th and 19th centuries, and computing the surface area of the lake according to each map. Comparing these areas with the modern figure of 7,540 square miles provides a way of measuring the accuracy of French versus British knowledge of the geography of the Lake Ontario region at different points in time. Specifically, we present the results following the application of our algorithms to 40 historical maps. The procedure we describe can be extended to geographic objects other than Lake Ontario and to accuracy measures other than surface area.
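The area computation step described here can be sketched as follows, assuming a binary segmentation mask and a known map scale in miles per pixel. The 7,540-square-mile modern figure comes from the text above; the function names and inputs are hypothetical, not the authors' workflow.

```csharp
using System;

static class MapAreaSketch
{
    // Given a binary mask (true = pixel segmented as Lake Ontario) and the
    // map scale in miles per pixel, estimate the lake's surface area and its
    // relative error against the modern figure of 7,540 square miles.
    public static (double areaSqMiles, double relativeError) EstimateArea(
        bool[,] lakeMask, double milesPerPixel)
    {
        long lakePixels = 0;
        foreach (bool isLake in lakeMask)
            if (isLake) lakePixels++;

        double area = lakePixels * milesPerPixel * milesPerPixel;
        const double modernArea = 7540.0;
        return (area, Math.Abs(area - modernArea) / modernArea);
    }
}
```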

Data Intensive Research in Computational Musicology

David De Roure, Oxford e-Research Centre; J. Stephen Downie, University of Illinois at Urbana-Champaign; Ichiro Fujinaga, McGill University

The SALAMI (Structural Analysis of Large Amounts of Music Information) project applies computational approaches to the huge and growing volume of digital recorded music that is now available in large-scale resources such as the Internet Archive. It is set to produce a new and very substantive web-accessible corpus of music analyses in a common framework for use by music scholars, students and beyond, and to establish a methodology and tooling which will enable others to add to the resource in the future. The SALAMI infrastructure brings together workflow and Semantic Web technologies with a set of algorithms and tools for extracting features from recorded music which have been developed by the music information retrieval and computational musicology communities over the last decade, and the project uses “controlled crowd sourcing” to provide ground truth annotations of musical works.

Scaling Information on ‘Biosphere Breathing’ from Chloroplast to the Globe

Dennis Baldocchi, Youngryel Ryu, and Hideki Kobayashi, University of California-Berkeley; Catharine van Ingen, Microsoft Research

We describe the challenges of upscaling information on the ‘breathing of the biosphere’ from the scale of the chloroplasts of leaves to the globe. This task—the upscaling of carbon dioxide and water vapor fluxes—is especially challenging because the problem transcends fourteen orders of magnitude in time and space and involves a panoply of non-linear biophysical processes. This talk outlines the problem and describes the set of methods used. Our approach aims to produce information on the ‘breathing of the biosphere’ that is ‘everywhere, all of the time’.

The computational demands of this problem are daunting. At the stand scale, one must simulate the micro-habitat conditions of thousands of leaves, as they are displayed on groups of plants with a variety of angle orientations. Then one must apply the micro-habitat information (e.g., sunlight, temperature, humidity, CO2 concentration) to sets of coupled non-linear equations that simulate photosynthesis, respiration, and the energy balance of the leaves. Finally, this information must be added up.

At the regional to global scales, there is a need to acquire and merge multiple layers of remote sensing datasets at high resolution (1 km) and frequent intervals (daily) to provide the drivers of models that predict carbon dioxide and water vapor exchange. The global data products of ecosystem photosynthesis and transpiration produced with this system have high fidelity when validated against direct flux measurements, and they reveal complex spatial and temporal patterns that will prove valuable for environmental modelers and scientists studying climate change and the carbon and water cycles from local to global scales.
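Structurally, the stand-scale aggregation described above amounts to summing per-leaf fluxes weighted by leaf area. The sketch below shows only that scaffolding: the placeholder flux function stands in for the coupled, non-linear photosynthesis, respiration, and energy-balance equations, and all names are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Placeholder micro-habitat state for one leaf (sunlight, temperature, etc.).
record LeafEnvironment(double ParMicromol, double TemperatureC,
                       double RelativeHumidity, double Co2Ppm, double LeafAreaM2);

static class StandScaleSketch
{
    // Stand-scale flux = sum over leaves of (per-area flux) * (leaf area).
    // leafCo2FluxPerArea stands in for the coupled leaf-level equations.
    public static double StandCo2Flux(IEnumerable<LeafEnvironment> leaves,
                                      Func<LeafEnvironment, double> leafCo2FluxPerArea)
    {
        return leaves.Sum(leaf => leafCo2FluxPerArea(leaf) * leaf.LeafAreaM2);
    }
}
```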

Agrodatamine: Integrating Analysis of Climate Time Series and Remote Sensing Images

Humberto Razente and Maria Camila N. Barioni, UFABC; Daniel Y. T. Chino, Elaine P. M. Sousa, Robson Cordeiro, Santiago A. Nunes, Caetano Traina Jr., José F. Rodrigues Jr., Willian D. Oliveira, and Agma J. M. Traina, University of São Paulo; Luciana A. S. Romani, University of São Paulo & EMBRAPA Informatics; Marcela X. Ribeiro, Federal University of São Carlos; Renata R. V. Gonçalves, Ana H. Ávila, and Jurandir Zullo, CEPAGRI-UNICAMP

Although the scientific community has no doubts about global warming, quantifying and identifying the causes of the average increase in global temperature, and its consequences for ecosystems, remain urgent and of utmost importance. Mathematical and statistical models have been used to predict likely future scenarios and, as an outcome, a large amount of data has been generated. Technological progress has also led to improved sensors for several climate data measurements and for imaging the Earth's surface, contributing even more to the increasing volume and complexity of the data generated. In this context, we present new methods to filter, analyze, and extract association patterns between climate data and data extracted from remote sensing, which aim at aiding agricultural research.

Correction for Hidden Confounders in Genetic Analyses

Jennifer Listgarten, Carl Kadie, and David Heckerman, Microsoft Research; Eric E. Schadt, Pacific Biosciences

Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. One way of getting at such an understanding is to find out which parts of our DNA, such as single-nucleotide polymorphisms, affect particular intermediary processes such as gene expression (eQTL), or endpoints such as disease status (GWAS). Naively, such associations can be identified using a simple statistical test on each hypothesized association. However, a wide variety of confounders lie hidden in the data, leading to both spurious associations and missed associations if not properly addressed. Our work focuses on novel statistical models that correct for these confounders. In particular, we present a novel statistical model that jointly corrects for two particular kinds of hidden structure—population structure (e.g., race, family-relatedness), and microarray expression artifacts (e.g., batch effects)—when these confounders are unknown. We also are working on models that robustly correct for confounders but which are cheap enough to be applied to extremely large data sets.

BioPatML.NET and Its Pattern Editor: Moving into the Next Era of Biology Software

James Hogan, Yu Toh, Lawrence Buckingham, Michael Towsey, and Stefan Maetschke, Queensland University of Technology

Existing XML-based bioinformatics pattern description languages are best seen as subsets or minor extensions of regular-expression-based models. In general, regular expressions are sufficient to solve many pattern-searching problems. However, their expressive power is insufficient to model complex structured patterns such as promoters, overlapping motifs, or RNA stem–loops. In addition, these languages often provide only minimal support for techniques common in bioinformatics, such as mismatch thresholds, weighted gaps, direct and inverted repeats, general similarity scoring, and position weight matrices. In this paper we introduce BioPatML.NET, a comprehensive search library that supports a wide variety of pattern components, ranging from simple motif, regular expression, or PROSITE patterns to their aggregation into more complex hierarchical structures. BioPatML.NET unifies the diversity of pattern description languages and fills a gap in the set of XML-based description languages for biological systems. As modern computational biology increasingly demands the sharing of sophisticated biological data and annotations, BioPatML.NET simplifies data sharing through the adoption of a standard XML-based format for representing pattern definitions and annotations. This approach not only facilitates data exchange, but also allows compiled patterns to be mapped easily onto database tables. The library is implemented in C# and builds upon the Microsoft Biology Foundation data model and file parsers. This paper also introduces an intuitive and interactive editor for the format, implemented in Silverlight 4, allowing drag-and-drop creation and maintenance of biological patterns and their preservation and re-use through an associated repository. (Refer to Appendix Fig 1.0 for a snapshot of the BioPatML Pattern Editor Tool.)
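To illustrate the general idea of declaring a search pattern in XML and evaluating it over a sequence, here is a small self-contained sketch. The XML schema, element names, and the TATA-box example are invented for illustration and are not BioPatML.NET's actual format or API.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class PatternSketch
{
    // Hypothetical XML pattern definition (not BioPatML.NET's schema):
    // a named motif expressed as a regular expression over DNA symbols.
    const string PatternXml =
        "<Pattern name='TATA-box' type='Regex' expression='TATA[AT]A[AT]' />";

    static void Main()
    {
        XElement pattern = XElement.Parse(PatternXml);
        string name = (string)pattern.Attribute("name");
        var regex = new Regex((string)pattern.Attribute("expression"));

        const string sequence = "GGCTATAAATAGGCC";
        foreach (Match m in regex.Matches(sequence))
            Console.WriteLine($"{name} at {m.Index}: {m.Value}");
    }
}
```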

Availability: A demonstration video and the tool are available at the following links (requires the Silverlight 4 plug-in).

GRAS Support Network, Its Implementation, Operation, and Use

Fritz Wollenweber, Francois Montagner, Christian Marquardt, and Yago Andres, EUMETSAT; Maria Lorenzo and Rene Zandbergen, ESOC

This paper will present the GRAS Support Network (GSN) that was put in place to support the processing of the GRAS radio occultation instrument on board the Metop spacecraft. GRAS uses GPS satellite signals received by the instrument to perform retrievals of vertical profiles of refractivity, from which temperature profiles can be computed. The presentation will describe in detail the GRAS processing, the requirements that have to be fulfilled by the GSN, and the design and implementation of the GSN. Examples will be given from the operational use of this system over the past 3 years. Particular emphasis will be given to the details of the global GSN network, its communication links, and the GSN processing center. We will also address future evolutions of this network to cover changing and more demanding user requirements.

Data Intensive Frameworks for Astronomy

Jeffrey Gardner, Andrew Connolly, Keith Wiley, YongChul Kwon, Simon Krughoff, Magdalena Balazinska, Bill Howe, and Sarah Loebman, University of Washington

Astrophysics is addressing many fundamental questions about the nature of the universe through a series of ambitious wide-field optical and infrared imaging surveys (e.g., studying the properties of dark matter and the nature of dark energy) as well as complementary petaflop-scale cosmological simulations. Our research focuses on exploring emerging data-intensive frameworks like Hadoop and Dryad for astrophysical datasets. For observational astronomers, we are delivering new scalable algorithms for indexing and analyzing astronomical images. For computationalists, we are implementing cluster finding algorithms for identifying interesting objects in simulation particle datasets.

Experiences and Visions on Archaeo Informatics

Christiaan Hendrikus van der Meijden, Peer Kröger, and Hans-Peter Kriegel, Ludwig Maximilians University

The main problems in successfully establishing the new scientific branch of archaeo-informatics lie in standardization, in the understanding of advanced informatics (e.g., data mining) within the archaeo sciences, and in setting up data communication infrastructures. Our experiences are based on the development of OSSOBOOK, an intermittently synchronized database system that allows any authorized user to record data offline at the site and later synchronize this new data with a central data collection. Powerful data mining and similarity search tools have been integrated. The next development steps are establishing a standardized minimal electronic finding description and implementing an enhanced database connection interface for data mining communication techniques, to set up an archaeo data network. Another focus is on the modularization, visualization, and simplification of data mining tools. Learn more.

Panel: Challenges of Data Standards and Tools

Deb Agarwal, LBNL/UCB; Bill Howe, University of Washington; Alex James, Microsoft; Yong Liu, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign; Maryann Martone, UCSD; Yan Xu, Microsoft Research

Environmental research involves multiple disciplines and players from academia, industry, and government agencies worldwide. By nature, environmental researchers are challenged with massive and heterogeneous data provided by various sources. If one “grand standard” will not work for dealing with all the required environmental data sources, how do we work together to define and adopt different data standards? What tools are essential for making the standards successful?

Scientific Data Sharing and Archiving at UC3/CDL: the Excel Add-in Project and More

John Kunze and Tricia Cruse, California Digital Library/California Curation Center

The University of California Curation Center (UC3), part of the California Digital Library (CDL), will be working with University of California researchers, the NSF DataONE community, and Microsoft (MS) Research to create open-source MS Excel extensions (“add-ins”) that will make it easier for scientists to record and export spreadsheet data in re-usable ways, fostering integration, new uses, and hence new science. We expect that creating such add-ins for as widely deployed a tool as Excel will help to transform the conduct of scientific research by enabling and promoting data publishing, sharing, and archiving. The Excel add-in project is the primary topic of this talk.

The talk will also address the larger context of this effort, which is one of four “fronts” on which UC3/CDL is working to establish data publishing, sharing, and archiving as common scientific practice. While this is a complex and ambitious undertaking, we hope that by chipping away at these tractable areas, we will reduce the size of the overall challenge. The most direct of these fronts is participation as an NSF DataONE member node, contributing University of California research data to NSF DataNet. We are also a founding member of the global DataCite consortium, which is working to create standards, tools, and incentives for data producers to publish citable datasets. Finally, with support from the Moore Foundation, we are writing up a comparative analysis of current practices across domains for publishing and preserving the methods, techniques, and credits involved in preparing the data used to draw conclusions in the published literature, information that is otherwise lost for want of standard practices for capturing this “appendix” material. We will conclude with a description of the newly released EZID (easy-eye-dee) service for creating and resolving persistent identifiers for data.

Visualizing All of History with Chronozoom

David Shimabukuro, Roland Saekow, and Walter Alvarez, University of California-Berkeley

Our knowledge of human history comprises a truly vast data set, much of it in the form of chronological narratives written by humanist scholars and difficult to deal with in quantitative ways. The last 20 years have seen the emergence of a new discipline called Big History, invented by the Australian historian David Christian, which aims to unify all knowledge of the past into a single field of study. Big History invites humanistic scholars and historical scientists from fields like geology, paleontology, evolutionary biology, astronomy, and cosmology to work together in developing the broadest possible view of the past. Incorporating everything we know about the past into Big History greatly increases the amount of data to be dealt with.

Big History is proving to be an excellent framework for designing undergraduate synthesis courses that attract outstanding students. A serious problem in teaching such courses is conveying the vast stretches of time from the Big Bang, 13.7 billion years ago to the present, and clarifying the wildly different time scales of cosmic history, Earth and life history, human prehistory, and human history. We present “ChronoZoom,” a computer-graphical approach to dealing with this problem of visualizing and understanding time scales, and presenting vast quantities of historical information in a useful way. ChronoZoom is a collaborative effort of the Department of Earth and Planetary Science at UC Berkeley, Microsoft Research, and originally Microsoft Live Labs.

Our first conception of ChronoZoom was that it should dramatically convey the scales of history, and the first version does in fact do that. To display the scales of history from a single day to the age of the Universe requires the ability to zoom smoothly by a factor of ~10^13, and doing this with raster graphics was a remarkable achievement of the team at Live Labs. The immense zoom range also allows us to embed virtually limitless amounts of text and graphical information.

We are now in the phase of designing the next iteration of ChronoZoom in collaboration with Microsoft Research. One goal will be to have ChronoZoom be useful to students beginning or deepening their study of history. We therefore show a very preliminary version of a ChronoZoom presentation of the human history of Italy designed for students, featuring (1) a hierarchical periodization of Italian history, (2) embedded graphics, and (3) an example of an embedded technical article. This kind of presentation should make it possible for students to browse history, rather than digging it out, bit by bit.

At a different academic level, ChronoZoom should allow scholars and scientists to bring together graphically a wide range of data sets from many different disciplines, to search for connections and causal relationships. As an example of this kind of approach, from geology and paleontology, we are inspired by TimeScale Creator.

ChronoZoom, by letting us move effortlessly through this enormous wilderness of time, getting used to the differences in scale, should help to break down the time-scale barriers to communication between scholars.

Proteome-Scale Protein Isoform Characterization with High Performance Computing

Jake Chen and Fan Zhang, Indiana University

The study of proteomes represents significant discovery and application opportunities in post-genome biology and medicine. In this work, we explore the use of high performance computing to characterize novel protein isoforms in tandem mass spectrometry (MS/MS) spectra derived from biological samples. We perform computational proteomics analysis of peptides by searching a new, large peptide database that we custom-built from all possible protein isoforms of a target proteome. There is therefore significantly higher complexity, both at the computational level and at the biological level, involved in the proteome-scale study of these protein isoforms than in standard approaches that involve only normal MS/MS protein search databases.

To discover novel protein isoforms in proteomics data, we developed a high performance computing and data analysis platform to support the following tasks: 1) conversion of raw data to open formats, 2) support for searching spectra and peptide identification, 3) conversion of search engine results to a unified format, 4) statistical validation of peptide and protein identifications, and 5) protein isoform marker annotation. By applying this platform, we show, through human fetal liver and breast cancer case studies, that it can markedly increase computational efficiency to support the identification of novel protein isoforms. Our results show promise for future diagnostic biomarker applications. They also point to new potential for real-time analysis of proteomics data with more powerful cloud computing.

Answering Biological Questions by Querying k-Mer Databases

Paul Greenfield, CSIRO Mathematics, Informatics and Statistics

Short DNA sequences (‘k-mers’) are effectively unique within and across bacterial species. Databases of such k-mers, derived from diverse sets of organisms, can be used to answer interesting biological questions. SQL queries can quickly show how organisms are related and find functions for hypothetical genes. Metagenomic applications include quickly partitioning reads by family and mapping reads onto possibly related reference genomes. Planned work includes functional improvements (searching over amino acid codons, querying over gene functions) and scaling the applications to work well on clusters, and possibly clouds.
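A minimal sketch of the underlying k-mer operations, assuming in-memory strings rather than the SQL databases described above; the k value and the shared-fraction measure are illustrative choices, not the author's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class KmerSketch
{
    // Slide a window of length k across the sequence to enumerate its k-mers.
    public static HashSet<string> Kmers(string sequence, int k)
    {
        var set = new HashSet<string>();
        for (int i = 0; i + k <= sequence.Length; i++)
            set.Add(sequence.Substring(i, k));
        return set;
    }

    // The fraction of shared k-mers gives a crude relatedness signal between
    // two sequences, analogous to the database joins described above.
    public static double SharedFraction(string a, string b, int k = 25)
    {
        var ka = Kmers(a, k);
        var kb = Kmers(b, k);
        if (ka.Count == 0 || kb.Count == 0) return 0.0;
        return (double)ka.Intersect(kb).Count() / Math.Min(ka.Count, kb.Count);
    }
}
```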

Tutorials

Tutorial Abstracts

Session MT1: Microsoft Biology Foundation: An Open-Source Library of Bioinformatics Features Built on the .NET Platform

Mark Smith, JulMar Technology

The Microsoft Biology Initiative (MBI) is an effort in Microsoft Research to bring new technology and tools to the area of bioinformatics and biology. This initiative is comprised of two primary components, the Microsoft Biology Foundation (MBF) and the Microsoft Biology Tools (MBT).

The Microsoft Biology Foundation (MBF) is a language-neutral bioinformatics toolkit built as an extension to the Microsoft .NET Framework, initially aimed at the area of Genomics research. Currently, it implements a range of parsers for common bioinformatics file formats; a range of algorithms for manipulating DNA, RNA, and protein sequences; and a set of connectors to biological web services such as NCBI BLAST. MBF is available under an open source license, and executables, source code, demo applications, and documentation are freely downloadable.

The Microsoft Biology Tools (MBT) are a collection of tools targeted at helping the biology and bioinformatics researcher be more productive in making scientific discoveries. The tools provided here take advantage of the capabilities provided in the Microsoft Biology Foundation, and are good examples of how MBF can be used to create other tools.

This tutorial will provide an overview of the library, details about how to extend and re-use the library, and demonstrations of the tools released that use the library: The MSR Biology Extension for Excel and the MSR Sequence Assembler.

Sessions MT2 and WT2: Scientific Data Visualization using WorldWide Telescope

Dean Guo, Microsoft Corporation

As described in the book, The Fourth Paradigm: Data-Intensive Scientific Discovery, scientific breakthroughs will be increasingly powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.

This tutorial uses three case studies to demonstrate the application of a wide range of technologies: .NET parallel extension on multicores, distributed computing on multiple nodes with Dryad/DryadLINQ, Windows Azure, HPC, data processing automation through workflows, and visualization in WorldWide Telescope. We hope that the techniques and technologies used are applicable in other data-intensive research.

WorldWide Telescope (WWT) enables your computer to function as a virtual telescope, bringing together imagery from the best ground and space-based telescopes in the world. We are working on extending WWT to visualize scientific data on Earth.

The three case studies are: WWT LCAPI (Loosely Coupled API), TeraPixel, and MODISAzure. We will demonstrate how to use WWT to visualize the results from these projects.

  1. LCAPI: The WorldWide Telescope “Loosely Coupled API” uses a RESTful communication style between a standalone application (SA) and a WorldWide Telescope Client (WTC). We will explore using this loosely coupled interface to read time-series event data into the SA, push this data to the WTC Layer Manager, control WTC layer-based data rendering, and control WTC state (location, perspective angles, time, and time rate). From this overview we will explore both what the LCAPI enables and the potential for future directions in visualization.
  2. Terapixel Sky image – creating the largest and clearest image of the sky from the Digitized Sky Survey data. We turned 1,800 pairs of red and blue individual image plates into 1,800 colored plates, adjusted brightness of each pixel on each plate, and stitched and smoothed them together into a terapixel sky image. The image is then visualized by the WorldWide Telescope.
  3. MODISAzure – accessing the vast and varied remote sensing data from MODIS (the Moderate Resolution Imaging Spectroradiometer) on NASA’s Terra satellite and other data sources to study evapotranspiration (ET), which is key to the water balance and hence key to understanding interactions between global climate change and the biosphere. We will demonstrate how we generated monthly time-series ET maps for the state of California from MODISAzure results and visualized them in WWT.

Sessions MT3 and WT3: Data-Intensive Research: Dataset Lifecycle Management for Scientific Workflow, Collaboration, Sharing, and Archiving

Alex Wade, Microsoft Research

Microsoft External Research strongly supports the process of research and its role in the innovation ecosystem, including developing and supporting efforts in open access, open tools, open technology, and interoperability. These projects demonstrate our ongoing work towards producing next-generation documents that increase productivity and empower authors to increase the discoverability and appropriate re-use of their work.

This workshop will provide a deep dive into several freely available tools from Microsoft External Research, and will demonstrate how these can help supplement and enhance current repository offerings. Come learn more about how the Microsoft Research tools can help extend the reach and utility of your repository efforts. Each session will include a hands-on component so that attendees can gain a deeper technical understanding of the available toolset.

Session MT4: Parallel Computing with Visual Studio 2010 and the .NET Framework 4

Stephen Toub, Microsoft Corporation

The Microsoft .NET Framework 4 and Visual Studio 2010 include new technologies for expressing, debugging, and tuning parallelism in managed applications. Dive into key areas of support, including Parallel Language Integrated Query (PLINQ), cutting-edge concurrency views in the Visual Studio profiler, and debugger tool windows for analyzing the state of concurrent code. In addition to exploring such features, we will examine some common parallel patterns prevalent in technical computing and how these features can be used to best implement such patterns.
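As a flavor of the material, the following small PLINQ example parallelizes a CPU-bound query across the available cores; the prime-counting workload is just a placeholder.

```csharp
using System;
using System.Linq;

class PlinqExample
{
    static void Main()
    {
        // Count primes below one million, letting PLINQ partition the range
        // across the available cores; ordering is irrelevant because only
        // the count is needed.
        int primeCount = Enumerable.Range(2, 999_998)
                                   .AsParallel()
                                   .Count(IsPrime);
        Console.WriteLine($"Primes below 1,000,000: {primeCount}");
    }

    static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }
}
```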

Session WT1: CoSBiLab: Enabling Simulation-Based Science

Corrado Priami, University of Trento Centre for Computational and Systems Biology

CoSBiLab is a software platform implementing the new conceptual framework of algorithmic systems biology. It is centered on the idea of representing biological elements as programs and element interactions as message passing between the corresponding programs. This idea guides the programming paradigm supported by the new programming language BlenX. The approach is higher level and provides a component-based view of systems, rather than the reaction-based descriptions usually adopted in ODE or rewriting-system tools. CoSBiLab allows its users to exploit compositionality and stochasticity, addressing concurrency and complexity in a native way.

To make the approach intuitive, CoSBiLab has a tabular interface for modeling systems and for gathering data from databases, so that non-experts can use CoSBiLab; that is, they can program in BlenX without having programming skills. CoSBiLab also has tools that help infer missing data, perform network analysis, and visualize simulation outcomes. In addition to an introduction to the conceptual framework, demos will be provided to help in understanding the software that will support the e-scientists of the future in their work.

For more information, see “Algorithmic Systems Biology,” Communications of the ACM, 52(5):80–88, May 2009.

Session WT4: OData – Open Data for the Open Web

Alex James, Microsoft Corporation

There is a vast amount of data available today and data is now being collected and stored at a rate never seen before. Much, if not most, of this data, however, is locked into specific applications or formats and difficult to access or to integrate into new uses.

The Open Data Protocol (OData) is a web protocol for querying and updating data that provides a way to unlock your data and free it from silos that exist in applications today. OData is being used to expose and access information from a variety of sources including, but not limited to, relational databases, file systems, content management systems and traditional websites.

Join us in this tutorial to learn how OData can enable a new level of data integration and interoperability across a broad range of clients, servers, services, and tools. Bring your laptop and you will have a chance to work OData into your own projects on whatever platform you choose.
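As a taste of what the tutorial covers, the sketch below issues a query against a hypothetical OData endpoint using the standard $filter, $orderby, and $top query options; the URL and the Observations entity set are invented for illustration.

```csharp
using System;
using System.Net;

class ODataClientSketch
{
    static void Main()
    {
        // Hypothetical OData endpoint; $filter, $orderby, and $top are
        // standard OData query options evaluated by the service, so only
        // the requested slice of data crosses the wire.
        const string url =
            "https://example.org/research.svc/Observations" +
            "?$filter=Temperature gt 20&$orderby=Timestamp desc&$top=10";

        using (var client = new WebClient())
        {
            string feed = client.DownloadString(new Uri(url));
            Console.WriteLine(feed);
        }
    }
}
```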

Sessions MT4 and WT4—Part 2: F# for eScience: How to Take Advantage of a Managed Functional Language for Research

Cancelled