eScience Workshop 2007

About

The goal of this cross-disciplinary workshop was to bring together scientists from different areas to share their research and experiences of how computing has shaped their work, to provide new insights, and to change what can be done in science. The focus was on the research and the technologies that have made that research possible.

It is no longer possible to do science without doing computing.

The use of computers creates many challenges as it expands the realm of the possible in scientific research and many of these challenges are common to researchers in different areas. The insights gained in one area may catalyze change and accelerate discovery in many others.

The Microsoft eScience Workshop at RENCI invited contributions from any area of eScience including:

  • Modeling of natural systems
  • Knowledge discovery and merging datasets
  • Science data analysis, mining, and visualization
  • Healthcare and biomedical informatics
  • High performance computing in science
  • Innovations in publishing scientific literature, results, and data
  • The impact of eScience on teaching and learning
  • Applying novel information technologies to disaster management
  • Robotics in science
  • Scientific challenges with no obvious computing solutions

Event Sponsorship

This event was held in partnership with The Renaissance Computing Institute, Chapel Hill, North Carolina. Co-chairs for the workshop were Dan Reed, Director, The Renaissance Computing Institute, and Tony Hey, Corporate Vice President for External Research, Microsoft Corporation.

Workshop Outcomes

Around 260 eScience researchers attended over 50 presentations and viewed a poster session showcasing over 100 projects in areas as diverse as astronomy, malaria, and the use of GPUs for scientific computation. Feedback from the attendees was overwhelmingly positive, and the event also served as a venue for Microsoft groups to meet with researchers and discuss future collaborations.

Abstracts 10/21

Abstracts for Sunday, October 21, 2007

Keynote Presentation

Transforming the Sensing and Numerical Prediction of Thunderstorms through Dynamic Adaptation: People and Technologies Interacting with Weather

Kelvin K. Droegemeier, School of Meteorology, University of Oklahoma

Those who have experienced the devastation of a tornado, the raging waters of a flash flood, or the paralyzing impacts of lake-effect snows understand that mesoscale weather develops rapidly, often with considerable uncertainty with regard to location. Such weather is also locally intense and frequently influenced by processes on both larger and smaller scales. Ironically, few of the technologies used to observe the atmosphere, predict its evolution, and compute, transmit, or store information about it operate in a manner that accommodates the dynamic behavior of mesoscale weather. Radars do not adaptively scan specific regions of thunderstorms; numerical models are run largely on fixed time schedules in fixed configurations; and cyber infrastructure does not allow meteorological tools to run on demand, change configurations in response to the weather, or provide the fault tolerance needed for rapid reconfiguration. As a result, today’s weather technology is highly constrained and far from optimal when applied to any particular situation.

This presentation describes a major paradigm shift now underway in the field of meteorology — away from today’s environment in which remote sensing systems, atmospheric prediction models, and hazardous weather detection systems operate in fixed configurations, and on fixed schedules largely independent of weather — to one in which they can change their configuration dynamically in response to the evolving weather. This transformation involves the creation of adaptive radars, Grid-enabled analysis and forecast systems, and associated cyber infrastructure that operate automatically on demand. In addition to describing the research and technology development being performed to establish this capability within a service oriented architecture, I discuss the associated economic and societal implications of dynamically adaptive weather sensing, analysis and prediction systems.

Support for Large-Scale Science: Grumman Auditorium

Is Anybody Out There?: PetaOp/second Computing for SETI and Radio Astronomy

Dan Werthimer, University of California, Berkeley

I will discuss the possibility of life in the universe, SETI@home, public participation distributed computing, and real-time petaop/sec FPGA-based supercomputing. Next-generation radio telescopes, such as the Allen Telescope Array and the Square Kilometer Array, are composed of hundreds to thousands of smaller telescopes; these large arrays require peta-ops per second of real-time processing. I will describe these telescopes and the motivation for peta-op supercomputing. Such computational requirements are far beyond the capabilities of general-purpose computing clusters (e.g., Beowulf clusters) or supercomputers. Traditionally, instrumentation for radio telescope arrays has been built from highly specialized custom chips, taking ten years to design and debug; such instruments are very expensive, inflexible, and usually out of date before they are working well. I'll present some of the new software tools that make it relatively easy to program FPGAs, as well as some general-purpose open source hardware and software modules we've developed to build a variety of real-time petaop/second supercomputers. More information is available at http://casper.berkeley.edu and http://seti.berkeley.edu.

Parallel Clustering in a Cheminformatics Grid

Geoffrey Fox, Xiaohong Qiu, Huapeng Yuan, Marlon Pierce, David Wild, Rajarshi Guha, Indiana University; Georgio Chrysanthakopoulos, Henrik Frystyk Nielsen, Microsoft

The eScience paradigm for chemical informatics links computational chemistry simulations, large archival databases such as PubChem, and the rapidly growing volumes of data from high-throughput devices. We have built such a Grid for scientific discovery at the interface of biology and chemistry (drug discovery). We expect eScience to need integration of both distributed and parallel technologies; Intel has highlighted the potential importance of data mining applications as synergistic with both the data deluge and the growing power of multicore systems. Our parallel programming model decomposes problems into services, as in traditional eScience approaches, and then uses optimized parallel algorithms for the services. This is consistent with the split between efficiency and productivity layers in the Berkeley approach to parallel computing. We implement the productivity layer with Grid workflows or Web 2.0 mashups on services that use, where needed, high-performance parallel algorithms developed by experts and packaged as a library of services for broad use. We discuss parallel clustering of chemical compounds from NIH PubChem. We chose an improved K-Means clustering that has scalable parallelism and uses annealing on the resolution of the chemical property space to avoid local minima. We use the Microsoft Concurrency and Coordination Runtime (CCR), as it gives good performance at the MPI layer, together with its coupling to the DSS service model, which is a natural platform for the service productivity layer. The parallel overhead, consisting of Windows thread scheduling, memory bandwidth limitations, and CCR synchronization costs, totals 10-15% (a speedup of 7 on an 8-core system) for a realistic PubChem application, with the load imbalance from scheduling being the dominant effect.
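The annealed clustering idea lends itself to a compact illustration. The sketch below is a minimal, single-threaded deterministic-annealing K-Means in Python on a toy two-dimensional property space; the cooling schedule and data are illustrative assumptions, not the authors' CCR/DSS implementation.

```python
# A minimal sketch of deterministic-annealing K-Means on a toy 2D "property
# space".  The cooling schedule and data are illustrative only.
import numpy as np

def annealed_kmeans(points, k, t_start=1.0, t_end=0.01, cooling=0.9, sweeps=20):
    """Soft (Gibbs) assignments at high temperature gradually harden as the
    temperature drops, which helps the clustering escape poor local minima."""
    rng = np.random.default_rng(0)
    centers = points[rng.choice(len(points), k, replace=False)]
    t = t_start
    while t > t_end:
        for _ in range(sweeps):
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            # subtract the row minimum before exponentiating to avoid underflow
            w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / t)
            w /= w.sum(axis=1, keepdims=True)
            centers = (w[:, :, None] * points[:, None, :]).sum(axis=0) \
                      / (w.sum(axis=0)[:, None] + 1e-12)
        t *= cooling
    return centers

rng = np.random.default_rng(1)
pts = np.vstack([rng.standard_normal((200, 2)) + c for c in ([0, 0], [5, 5], [0, 6])])
print(annealed_kmeans(pts, k=3))
```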

A Data Diffusion Approach to Large Scale Scientific Exploration

Ioan Raicu, Yong Zhao, Ian Foster, University of Chicago; Alex Szalay, The Johns Hopkins University

Scientific and data-intensive applications often require exploratory analysis on large datasets, which is often carried out on large scale distributed resources where data locality is crucial to achieve high system throughput and performance. We propose a data diffusion approach that acquires resources for data analysis dynamically, schedules computations as close to data as possible, and replicates data in response to workloads. As demand increases, more resources are acquired and cached to allow faster response to subsequent requests; resources are released when demand drops. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on the application workloads and the performance characteristics of the underlying infrastructure. This data diffusion concept is reminiscent of cooperative Web-caching and peer-to-peer storage systems. Other data-aware scheduling approaches assume static or dedicated resources, which can be expensive and inefficient if load varies significantly. The challenges to our approach are that we need to co-allocate storage resources with computation resources in order to enable the efficient analysis of possibly terabytes of data without prior knowledge of the characteristics of application workloads. To explore the proposed data diffusion, we have developed Falkon, which provides dynamic acquisition and release of resources and the dispatch of analysis tasks to those resources. We have extended Falkon to allow the compute resources to cache data to local disks, and perform task dispatch via a data-aware scheduler. The integration of Falkon and the Swift parallel programming system provides us with access to a large number of applications from astronomy, astrophysics, medicine, and other domains, with varying datasets, workloads, and analysis codes.
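As a rough illustration of the data diffusion idea, the toy scheduler below prefers workers that already cache a task's input and acquires new workers on demand; the class and method names are invented for the sketch and do not reflect Falkon's actual interfaces.

```python
import collections

class DataDiffusionScheduler:
    """Toy data-aware scheduler: dispatch each task to a worker that already
    caches its input file if possible; otherwise acquire a new worker (up to a
    cap) and let it cache the file for later tasks."""

    def __init__(self, max_workers=4):
        self.max_workers = max_workers
        self.caches = collections.defaultdict(set)  # worker id -> cached files
        self.next_worker = 0

    def dispatch(self, task):
        name, input_file = task
        # prefer a worker whose local cache already holds the input
        for worker, files in self.caches.items():
            if input_file in files:
                return worker, "cache hit"
        # otherwise acquire a new worker if the pool has not reached its cap
        if self.next_worker < self.max_workers:
            worker = self.next_worker
            self.next_worker += 1
        else:  # pool exhausted: reuse worker 0 and diffuse the data to it
            worker = 0
        self.caches[worker].add(input_file)
        return worker, "cache miss"

sched = DataDiffusionScheduler()
for task in [("t1", "sky_patch_A"), ("t2", "sky_patch_B"), ("t3", "sky_patch_A")]:
    print(task[0], sched.dispatch(task))
```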

Computational Modeling in the Life Sciences: Redbud A+B

Emergent Geometric Order: A Model of Cell Division in Proliferating Tissue Networks

Radhika Nagpal, Harvard University

In multi-cellular tissues, simple cell behaviors can lead to complex global properties, from wound repair to rapid change in morphology. Understanding the relationship between local cell decisions and system-level behaviors is critical for many reasons: to form and validate cell behavior hypotheses, to predict the effect of aberrant cell behaviors, and to provide new (and sometimes counter-intuitive) insights into tissue behavior. In this talk, I will present our recent work on an abstract model of cell division in the developing fruit fly wing that led to a novel insight into the robustness of proliferating epithelial tissues [1]. Based on time-lapse movies of early wing development, we developed a simple logical model, represented by a first-order Markov chain, of the cell division process and its impact on the graph topology of the tissue network. This mathematical model led to an unexpected prediction: that the stochastic process of cell division will drive the proliferating tissue, as a whole, to adopt a fixed distribution of polygonal cell shapes, regardless of initial tissue topology. This predicted distribution is strongly observed in diverse organisms, not only fruit fly, but also hydra and frogs, suggesting that this may be a fundamental property of proliferating epithelial tissues. Epithelial tissues are ubiquitous throughout the animal kingdom and form many structures in the human body. This work suggests a simple emergent mechanism for regulating cell shape and topology during rapid proliferation, and has many implications for multi-cellular development and disease that we are now investigating.

[1] Gibson, Patel, Nagpal, Perrimon, “The Emergence of Geometric Order in Proliferating Metazoan Epithelia”, Nature 442(7106):1038-41, Aug 31, 2006.
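The fixed-point behavior predicted by the model can be illustrated generically: iterate a Markov transition matrix over polygon classes until the distribution stops changing. The matrix entries below are placeholders chosen only to make the example run, not the transition probabilities derived in [1].

```python
import numpy as np

# Hypothetical transition matrix P[i, j]: probability that a cell currently in
# polygon class i (rows: 4-, 5-, 6-, 7-, 8-sided) ends up in class j after one
# round of division.  The real matrix follows from the division rules in [1].
P = np.array([
    [0.1, 0.5, 0.3, 0.1, 0.0],
    [0.1, 0.3, 0.4, 0.2, 0.0],
    [0.0, 0.3, 0.4, 0.2, 0.1],
    [0.0, 0.2, 0.4, 0.3, 0.1],
    [0.0, 0.1, 0.4, 0.3, 0.2],
])

dist = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # start: every cell 4-sided
for _ in range(100):                          # iterate the chain to its fixed point
    dist = dist @ P
print(dict(zip([4, 5, 6, 7, 8], dist.round(3))))
```

Whatever the starting distribution, repeated application of the chain converges to the same stationary distribution of polygon classes, which is the qualitative prediction tested against fly, hydra and frog tissue in [1].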

Programming Biology

Andrew Phillips, Microsoft Research

This talk presents a programming language for designing and simulating computer models of biological systems. The language is based on a computational formalism known as the pi-calculus, and the simulation algorithm is based on standard kinetic theory of physical chemistry. The language will first be presented using a simple graphical notation, which will subsequently be used to model and simulate a couple of intriguing biological systems, namely a genetic oscillator and a key pathway of the immune system. One of the benefits of the language is its scalability: large models of biological systems can be programmed from simple components in a modular fashion. The first system is a genetic oscillator built from simple computational elements. We use probabilistic analysis of our model to characterize the parameter space in which regular oscillations are obtained, and validate our calculations through simulation. We also explore different levels of abstraction for our model by exploiting the modularity of our approach, which allows increasing levels of detail to be included without changing the overall structure of the program. Our design principles could in future be used to engineer robust genetic oscillators in living cells. The second system is an executable computational model of MHC Class I Antigen Presentation. This is a key pathway of our immune system, which is able to detect the presence of potentially harmful intruders in our cells, such as viruses or bacteria. By simulating and analyzing this model, we gain some insight into how the pathway functions and offer an explanation for some of the variability present in the human immune system.
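The kinetics underlying such simulations are commonly computed with a Gillespie-style stochastic simulation algorithm; a bare-bones Python version for a single protein being produced and degraded (with made-up rates) might look like the following. It is a sketch of the general approach, not the simulator described in the talk.

```python
import random

def gillespie(rates, x0, t_end):
    """Minimal Gillespie SSA for a birth-death process: protein is produced at
    a constant rate and degraded in proportion to its copy number."""
    k_prod, k_deg = rates
    t, x, trace = 0.0, x0, [(0.0, x0)]
    while t < t_end:
        a_prod, a_deg = k_prod, k_deg * x      # reaction propensities
        a_total = a_prod + a_deg
        if a_total == 0:
            break
        t += random.expovariate(a_total)       # time to next reaction
        if random.random() < a_prod / a_total: # choose which reaction fires
            x += 1
        else:
            x -= 1
        trace.append((t, x))
    return trace

trace = gillespie(rates=(5.0, 0.1), x0=0, t_end=100.0)
print("final copy number:", trace[-1][1])      # fluctuates around k_prod/k_deg = 50
```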

A Multiset-based Model of Agents: Computability and Robustness

Matteo Cavaliere, Radu Mardare, Sean Sedwards, MSR-UNITN CoSBi

We present a modeling framework and computational paradigm called Colonies of Synchronizing Agents (CSAs), which abstracts the intracellular and intercellular mechanisms of biological tissues. Our motivation is to describe complex biological systems in a formal way, such that it is possible to model, analyze and predict their properties. From these analyses we may also gain insight to inform the creation of new computational devices and techniques. The core model is based on a multiset of agents (which can be thought of as populations of cells or molecules) in a common environment. Each agent has contents in the form of a multiset of atomic objects (i.e., chemicals or the properties of individual molecules) which are updated by rewriting rules. Hence, the model has a certain elegant simplicity, being essentially a multiset of multisets, acted upon by multiset rewriting rules. Rules may act on individual agents (thus representing intracellular action) or may synchronize the contents of pairs of agents (representing intercellular action). An extended model includes Euclidean space and rules to facilitate the movement of agents within the space. The extended model also includes rules to control agent division and agent death, thus providing a full repertoire of common cell and molecule primitive behaviors. The formal basis of the model allows us to investigate static and dynamic properties of CSAs using tools from computer science, e.g., from automata theory, logic and game theory. In this way we hope to model and investigate complex biological phenomena, such as exist in the immune system and morphogenesis. In particular, we are interested in robust pattern formation, which is the basis of complexity in nature.
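A toy rendering of the core construction, a multiset of multisets updated by rewriting rules, is sketched below; the specific rules and object names are invented for illustration and are not drawn from the CSA formalism itself.

```python
from collections import Counter
import random

# Each agent is a multiset (Counter) of atomic objects.
agents = [Counter({"a": 3, "b": 1}), Counter({"b": 2}), Counter({"a": 1})]

def apply_internal(agent):
    """Intracellular rule  a -> b : rewrite one 'a' inside a single agent."""
    if agent["a"] > 0:
        agent["a"] -= 1
        agent["b"] += 1

def apply_sync(a1, a2):
    """Synchronization rule  (b, a) -> (c, c) : fires only if the first agent
    holds a 'b' and the second an 'a', updating both agents at once."""
    if a1["b"] > 0 and a2["a"] > 0:
        a1["b"] -= 1; a1["c"] += 1
        a2["a"] -= 1; a2["c"] += 1

random.seed(0)
for _ in range(10):                      # nondeterministic evolution of the colony
    if random.random() < 0.5:
        apply_internal(random.choice(agents))
    else:
        a1, a2 = random.sample(agents, 2)
        apply_sync(a1, a2)

print([dict(a) for a in agents])
```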

Scholarly Publication: Bellflower A+B

The SCOPE System: Scientific Compound Object Publishing for eScience

Jane Hunter, Kwok Cheung, University of Queensland

Scientists are under increasing pressure to publish their raw data, derived data and methodology along with their traditional scholarly publication, in open archives. The goal is to enable the verification and repeatability of results by other scientists in the field and hence encourage the re-use of research data and a reduction in the duplication of research effort. Many scientists and scientific communities would be willing to do this, if they had simple, efficient tools and the underlying infrastructure to streamline the process. Currently there are relatively few tools to support these new forms of scientific publishing and those that do exist are not integrated with existing repository infrastructure. In this presentation, I will describe a system that we have been developing to streamline the process of authoring, publishing and sharing compound scientific objects with built-in provenance information. SCOPE is a graphical authoring tool that assists scientists with the tasks of: creating a compound scientific publication package (as an OAI-ORE named graph); attaching a brief metadata description and a Creative Commons license to the object; and publishing it to a Fedora repository for discovery, re-use, peer-review or e-learning. The SCOPE system comprises:

  • A Provenance Explorer window which uses RDF graphs generated from laboratory and scientific workflow systems (e.g., myTea, Kepler, Taverna) to visualize provenance.
  • A publishing window into which the author drags and drops nodes from the provenance explorer.

External objects (retrieved via an IE web browser) may also be dragged and dropped into the window to form new nodes. Relationships between nodes can either be dynamically inferred (using the Algernon inference engine) or (in the case of external objects) manually defined by the author who draws and tags new links.
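For readers unfamiliar with compound-object packaging, the sketch below builds a small provenance-linked aggregation as an RDF graph using the rdflib Python library; the vocabulary terms and URIs are placeholders for illustration rather than the actual SCOPE or OAI-ORE vocabularies.

```python
# Sketch of packaging a compound scientific object as an RDF graph with rdflib
# (pip install rdflib).  The property names below are placeholders.
from rdflib import Graph, Namespace, URIRef, Literal

EX = Namespace("http://example.org/scope/")
g = Graph()

package = URIRef("http://example.org/packages/experiment-42")
raw     = URIRef("http://example.org/data/raw.csv")
derived = URIRef("http://example.org/data/derived.csv")
paper   = URIRef("http://example.org/pubs/paper.pdf")

g.add((package, EX.aggregates, raw))
g.add((package, EX.aggregates, derived))
g.add((package, EX.aggregates, paper))
g.add((derived, EX.derivedFrom, raw))          # provenance link
g.add((paper,   EX.reportsOn, derived))
g.add((package, EX.license, Literal("CC-BY")))

print(g.serialize(format="turtle"))
```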

OAI Object Reuse and Exchange: Interoperability for eScience

Carl Lagoze, Cornell University; Herbert Van de Sompel, Los Alamos National Laboratory

Information objects used in eScience are frequently compound in nature consisting of a variety of media resources with rich inter-relationships. Infrastructure to support the expression and exchange of information about these compound information objects is essential for the deployment of eScience across disciplines. The Open Archives Initiative – Object Reuse and Exchange (OAI-ORE) Project is developing standards to facilitate discovery, use and re-use of these new types of compound scholarly resources by networked services and applications. The Andrew W. Mellon Foundation, Microsoft, and the NSF fund OAI-ORE. In this talk, we will introduce a preliminary version of the core OAI-ORE data models and protocols. These models are based on the notion of bounded, named aggregations of web resources in which the resources and their relationships are typed. Protocols cover discovering, retrieving representations, and updating the constituency of such aggregations. This allows, for example, other web resources, including other compound objects, to reference these aggregations to express citation, annotation, provenance, and other relationships vital to the scholarly process. We also introduce use cases illustrating possible applications of the OAI-ORE work. The eChemistry project, funded by Microsoft, will deploy an OAI-ORE-based infrastructure for exchange of molecular-centric information and the linkage of that information to researchers, experiments, publications, etc. A digital preservation prototype illustrates how the Web-centric approach of OAI-ORE could empower the Internet Archive to readily archive compound information objects.

arXiv.org ePrint Web Service Application Programming Interface

Julius Lucks, Cornell University; Simeon Warner, Cornell Information Science; Thorsten Schwander, arXiv.org; Paul Ginsparg, Cornell University

The Cornell University e-print arXiv is a document submission and retrieval system that is heavily used by the physics, mathematics and computer science communities. It has become the primary means of communicating cutting-edge manuscripts on current and ongoing research. The open-access arXiv e-print repository is available worldwide, and presents no entry barriers to readers, thus facilitating scholarly communication. Manuscripts are often submitted to the arXiv before they are published by more traditional means. In some cases they may never be submitted or published elsewhere, and in others, arXiv-hosted manuscripts are used as the submission channel to traditional publishers such as the American Physical Society, and newer forms of publication such as the Journal of High Energy Physics and overlay journals. The primary interface to the arXiv has been human-oriented HTML web pages. In this talk, we outline the design of a web-service interface to the arXiv, permitting programmatic access to e-print content and metadata. We discuss design considerations of the interface that facilitate new and creative use of the vast body of material on the arXiv by providing a low barrier to entry for application developers. We outline mock applications that will greatly benefit from this interface, including an alternative arXiv human interface that is designed to preserve contextual information when performing searches. We finish with an invitation to participate in the growing developer community surrounding the arXiv web-services interface.
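For context, a minimal client for an arXiv-style Atom query interface might look like the sketch below; the endpoint and parameter names follow the arXiv API as it exists today, which may differ in detail from the design outlined in this 2007 talk.

```python
# Minimal client for an arXiv-style Atom query interface (network access needed).
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def search_arxiv(query, max_results=5):
    params = urllib.parse.urlencode({
        "search_query": query,
        "start": 0,
        "max_results": max_results,
    })
    url = "http://export.arxiv.org/api/query?" + params
    with urllib.request.urlopen(url) as response:
        feed = ET.parse(response)
    for entry in feed.findall(ATOM + "entry"):
        yield (entry.findtext(ATOM + "id"),
               entry.findtext(ATOM + "title", "").strip())

for eprint_id, title in search_arxiv("all:quantum computing"):
    print(eprint_id, title)
```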

Sensors and Mapping: Grumman Auditorium

From Raw Measurements to Knowledge Discovery in Environmental Wireless Sensor Networks

Andreas Terzis, Alex Szalay, Katalin Szlavecz, Johns Hopkins University

It is possible today to build large wireless sensor networks (WSNs) for observing the natural environment. For example, a network currently under deployment by the authors can generate 1.7 GB of raw sensor measurements per year. However, scale is not the only challenge these datasets present. They are also incomplete, noisy and highly redundant. Measurements are taken at finite locations with finite frequency, potentially missing events of limited spatial and temporal scope. The sensors used in these networks are prone to errors and failures. Finally, collected data present the proverbial needle-in-the-haystack problem: scientists are interested in subtle signals superimposed on diurnal and seasonal cycles. While similar problems have been explored in other domains, the added challenge posed in the context of WSNs is that these problems have to be solved in an online and distributed way. Off-line processing of previously collected data is inadequate, since delivering uninteresting data is very expensive: they consume precious resources until they are deposited in archival storage. Moreover, closed-loop sensor networks cannot afford to operate with incomplete or faulty data, since they make decisions (e.g., engaging actuators) that depend on previous measurements. We present statistical techniques, based on principal component analysis, for detecting and classifying novel events in environmental wireless sensor networks. These events are defined as deviations from underlying normal trends. Our techniques can be used to dynamically adjust the network's behavior and to detect faulty sensors. The proposed mechanisms are lightweight enough to be implemented on existing motes. We argue that our approach is the first step towards sensor networks that can discover the underlying structures of the phenomena they monitor.
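The principal-component approach to novelty detection can be sketched compactly: learn the "normal" subspace from past measurements and flag samples with large reconstruction error. The sketch below uses synthetic data and is not the authors' mote implementation.

```python
import numpy as np

def fit_pca(X, n_components):
    """Principal components of the (samples x sensors) matrix X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]            # top principal directions

def novelty_scores(X, mean, components):
    """Residual energy after projecting onto the 'normal' subspace; large
    residuals flag measurements that deviate from the learned trend."""
    centered = X - mean
    projected = centered @ components.T @ components
    return np.linalg.norm(centered - projected, axis=1)

rng = np.random.default_rng(1)
t = np.linspace(0, 4 * np.pi, 500)
normal = np.column_stack([np.sin(t + p) for p in (0.0, 0.3, 0.6)])  # 3 correlated sensors
normal += 0.05 * rng.standard_normal(normal.shape)

mean, comps = fit_pca(normal, n_components=1)
event = normal.copy()
event[250] += [0.0, 2.0, 0.0]                 # inject a local anomaly on the second sensor
scores = novelty_scores(event, mean, comps)
print("most anomalous sample:", scores.argmax())  # expect 250
```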

Visualization of Fossil Records Using Microsoft Virtual Earth

Jeff Gehlhausen, Stephanie Puchalski, Mehmet Dalkilic, Claudia Johnson, Erika Elswick, Indiana University

A growing debate about global warming has forced reexamination of the fossil record in novel ways. We examine the chiton, approaching the problem by creating a visualization and mining system for the organism, specifically for where it was discovered. Our visualization tool makes use of a novel Virtual Earth clustering technique, originally published on viavirtualearth.com, that efficiently groups items based on the zoom level in Virtual Earth. As the mouse is used to zoom into the canvas of Virtual Earth, more detailed plot data points are shown, and as the mouse is used to zoom out, the data points are aggregated into clusters so that less data is plotted on the map. Since the clusters are grouped by location and zoom level, data is simply being compacted in order to improve performance. Much of the application's functionality is derived from the JavaScript MouseOver functionality. As different plots are placed on the map according to the different phylogenetic filters, a MouseOver event presents the user with a popup box displaying the name and ID of the chiton. The popup also shows how many records are contained in the cluster. There may be many records at a point; the next and previous buttons on the popup box allow the user to cycle through the different records. Additionally, more detailed time and phylogenetic information is presented below the map. The data points are placed on the map according to different phylogenetic constraints; currently the user can select different Order, Suborder, and Family taxonomic data regarding the chiton dataset. The application uses the ASP.NET AJAX Toolkit controls to query the server and retrieve the correct taxonomic data for its parent in the phylogenetic hierarchy after an event is fired due to a selection in the drop-down list.
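The zoom-dependent clustering can be approximated by a simple grid aggregation, bucketing records into cells whose size depends on the zoom level; this generic Python sketch is not the viavirtualearth.com algorithm itself, and the record coordinates are invented.

```python
from collections import defaultdict

def cluster_for_zoom(points, zoom):
    """Aggregate (id, lat, lon) fossil records into grid cells whose size shrinks
    as the zoom level grows, so zooming in reveals individual records."""
    cell_size = 180.0 / (2 ** zoom)           # coarser cells at low zoom
    clusters = defaultdict(list)
    for record_id, lat, lon in points:
        key = (round(lat / cell_size), round(lon / cell_size))
        clusters[key].append(record_id)
    return clusters

chitons = [("C1", 48.4, -123.3), ("C2", 48.5, -123.4), ("C3", -33.9, 151.2)]
for zoom in (2, 8):
    clusters = cluster_for_zoom(chitons, zoom)
    print(f"zoom {zoom}: {len(clusters)} map pins ->", dict(clusters))
```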

SEAMONSTER: Sensor Webs in Environmental Science and Education

Dennis Fatland, Microsoft; Matt Heavner, Eran Hood, Cathy Connor, University of Alaska Southeast

The South East Alaska MOnitoring Network for Science, Telecommunications, Education, and Research (SEAMONSTER) is a NASA-sponsored smart sensor web project designed to support collaborative environmental science with near-real-time recovery of large volumes of environmental data. The Year One geographic focus is the Lemon Creek watershed near Juneau, Alaska, with expansion planned for subsequent years up into the Juneau Icefield and into the coastal marine environment of the Alexander Archipelago and the Tongass National Forest. Implementation is motivated by problems in hydrology and cryosphere science as well as by gaps in the relationship between science, technology, and education. We describe initial results from 2007, the underlying system architecture, and the project's initiation of inquiry-driven classroom learning.

Databases: Redbud A+B

Data Integration for Genome-wide Association Studies of Human Diseases

Qi Sun, Lalit Ponnala, Cornell University

Managing data for high-throughput genomics studies requires researchers to deal with issues including the integration of heterogeneous data sets, the use of tools for data access, and the issues arising around confidentiality. We have been working with researchers in Dr. Ron Crystal's group at Weill Cornell Medical School to develop a data processing pipeline for their COPD project (Chronic Obstructive Pulmonary Disease is the 4th leading cause of death in the United States). Microsoft SQL Server 2005 was used as the database engine for this project. We developed a schema that can accommodate the heterogeneous multi-media clinical data and the high-throughput genotyping data from multiple platforms. The built-in SQL Server encryption functions were used for storing sensitive patient information. On the client side, users can enter and retrieve data through VSTO add-ins for Excel, as most researchers are already familiar with using Excel. Our experience showed that the SOAP web service and the Excel-based client applications can be a versatile solution for data integration in high-throughput genomics and proteomics projects.

rCAD - RNA Comparative Analysis Database

Robin Gutell, Weijia Xu, University of Texas at Austin; Stuart Ozer, Microsoft

Comparative studies of RNA sequences can decipher the structure, function and evolution of cellular components. The tremendous increase in available sequences and related biological information creates opportunities to improve the accuracy and detail of these studies while presenting new computational challenges for performance and scalability. To fully utilize this large increase in knowledge, the information must be organized for efficient retrieval and integrated for multi-dimensional analysis. With this, biologists are able to invent new comparative sequence analysis protocols that will yield new and different structural and functional information. There is therefore a constant need to reinvent existing turn-key computational solutions to accommodate the increasing volume of data and new types of information. Managing sequences in a relational database provides an effective, scalable method to access large amounts of data of different types through mature database machinery, and a simplified programming interface for analyzing stored information through SQL. Based on Microsoft SQL Server, we have designed and implemented the RNA Comparative Analysis Database (rCAD), which supports comparative analysis of RNA sequence and structure and unites, for the first time in a single environment, the multiple dimensions of information necessary for alignment viewing, sequence metadata, structural annotations, structure prediction studies, structural statistics of different motifs, and phylogenetic analysis. This system provides a queryable environment that supports efficient updates and rich analytics. We will show how the performance and scalability of basic analysis tasks, such as covariation analysis, can be improved using rCAD. We will also demonstrate the flexibility of using rCAD to form SQL solutions for innovative and complicated analysis problems.
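Covariation analysis of the kind mentioned above typically scores pairs of alignment columns by their mutual information; a small self-contained sketch (with a toy alignment, outside any database) is shown below.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Covariation score between two alignment columns: the mutual information
    of their paired base frequencies."""
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in pij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Toy alignment: columns 0 and 3 covary (Watson-Crick style), column 2 is constant.
alignment = ["GAAC", "CAAG", "GUAC", "CUAG", "GCAC", "CCAG"]
cols = list(zip(*alignment))
print("MI(0,3) =", round(mutual_information(cols[0], cols[3]), 3))   # high
print("MI(1,2) =", round(mutual_information(cols[1], cols[2]), 3))   # zero
```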

Data Cubes for Eco-science - Can One Size Fit All?

Catharine Van Ingen, Microsoft Research; Deb Agarwal, Berkeley Water Center

Many ecological and hydrological science collaborations are starting to use relational databases to collect, curate and archive their data. Data cube (OLAP) technology can be used in combination with a relational database to simply and efficiently compute aggregates of temporal, spatial and other data dimensions commonly used for data analysis. Over the last year, we’ve built a number of data cubes to support different ecological science goals. This talk explores the commonalities between these cubes and the differences along with some of the reasons for each. While we are hand crafting each cube today, our goal is a methodology that produces a family of cubes that can be used across a number of scientific investigations and related disciplines.
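In miniature, a data cube is a set of precomputed aggregates over chosen dimensions; the pandas sketch below shows the same rollup idea on toy flux records, and is only an analogy for the relational/OLAP cubes discussed here, with invented column names and values.

```python
import pandas as pd

# Toy flux-tower style records: one row per (site, time) measurement.
records = pd.DataFrame({
    "site":  ["A", "A", "B", "B", "A", "B"],
    "year":  [2005, 2005, 2005, 2006, 2006, 2006],
    "month": [6, 7, 6, 6, 7, 7],
    "carbon_flux": [1.2, 1.5, 0.8, 0.9, 1.7, 1.1],
})

# A pivot table over (site, year) is the same rollup a cube precomputes,
# here with a mean aggregate along the temporal and spatial dimensions.
cube = pd.pivot_table(records, values="carbon_flux",
                      index="site", columns="year", aggfunc="mean")
print(cube)
```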

Algorithms: Bellflower A+B

Efficient Methods for Enabling Genome-wide Computing

Wei Wang, Fernando Pardo Manuel de Villena; University of North Carolina

With the realization that a new model population was needed to understand human diseases with complex etiologies, a genetically diverse reference population of mice called the Collaborative Cross (CC) was proposed. The CC is a large, novel panel of recombinant inbred (RI) lines that combines the genomes of genetically diverse founder strains to capture almost 90% of the known variation present in laboratory mice and that is designed specifically for complex trait analysis. The CC becomes the focal point for cumulative and integrated data collection, giving rise to the detection of networks of functionally important relationships among diverse sets of biological and physiological phenotypes and a new view of the mammalian organism as a whole and interconnected system. The volume and diversity of the data offer unique challenges, whose solutions will advance both our understanding of the underlying biology and the tools for computational analysis. The data will eventually contain high-density SNPs (Single Nucleotide Polymorphisms), or even whole genome sequences, for hundreds of CC lines and millions of phenotypic measurements (molecular and physiological) and other derived variables. New data mining and knowledge discovery techniques are needed for efficient and comprehensive analysis. In collaboration with geneticists, we are developing novel and scalable data management and computational techniques to enable high throughput genetic network analysis, real-time genome-wide exploratory analysis, and interactive visualization. The methods are designed to support instant access and computation for any user-specified regions and enable fast and accurate correlation calculation and retrieval of loci with high linkage disequilibrium. The outcome is a fast SNP query engine that allows for large permutation evaluation and association tests and interactive visualization.
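The linkage-disequilibrium retrieval mentioned above reduces, per SNP pair, to a small calculation; one common approximation is the squared correlation of genotype vectors, sketched below with invented genotypes.

```python
import numpy as np

def ld_r2(snp_a, snp_b):
    """Linkage-disequilibrium r^2 between two biallelic SNPs, computed here as
    the squared Pearson correlation of 0/1/2 genotype vectors, a common
    approximation used in genome-wide scans."""
    a = np.asarray(snp_a, dtype=float)
    b = np.asarray(snp_b, dtype=float)
    r = np.corrcoef(a, b)[0, 1]
    return r * r

# genotypes for 8 mice at two loci (0, 1, 2 = copies of the minor allele)
locus1 = [0, 0, 1, 1, 2, 2, 0, 2]
locus2 = [0, 0, 1, 1, 2, 2, 1, 2]
print("r^2 =", round(ld_r2(locus1, locus2), 3))
```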

Fast Algorithms for Particle Physics and Protein Folding

Alexander Gray, Georgia Institute of Technology

The FASTlab develops novel algorithms and data structures for making the analysis of massive scientific datasets possible, with state-of-the-art machine learning and statistical methods and whatever other operations on data lie on our scientist collaborators' critical paths. I'll describe two of our latest projects. (1) A central long-standing problem in protein folding is the determination of an approximating energy function which is both tractable and accurate enough to achieve realistic folds. Working with Jeff Skolnick, one of the leaders in the field, we are approaching the problem via machine learning rather than chemical theory alone. Using a customized machine learning method and fast algorithms allowing the use of massive datasets of protein conformations, we appear to be outperforming state-of-the-art hand-built energy functions in preliminary qualitative results, and we believe we have only begun exploring this new paradigm. (2) Starting next summer, the Large Hadron Collider will generate 40 million data points per second, continuously for 15 years. Since most of this data must be discarded, much activity will surround the decisions about which events to keep, a.k.a. trigger tuning. We are working with the ATLAS detector team on new data structures for fast high-dimensional range querying, which will allow interactive trigger tuning on the physicist's desktop, and ultimately a scheme for automatic trigger tuning. We are excited about the possibility of computer science playing a critical role in the world's largest scientific experiment.
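The range-querying idea behind interactive trigger tuning can be illustrated with an off-the-shelf kd-tree; the sketch below (with random "events" and invented feature names) shows the kind of window query involved, not the ATLAS-specific data structures under development.

```python
import numpy as np
from scipy.spatial import cKDTree

# Each "event" is a point in a feature space (e.g. transverse momentum, missing
# energy, jet multiplicity).  A kd-tree answers "how many events fall inside
# this window?" quickly, which is the kind of query interactive tuning needs.
rng = np.random.default_rng(0)
events = rng.random((100_000, 3))            # simulated events, 3 features each
tree = cKDTree(events)

# count events within a radius of a candidate trigger point
candidate = np.array([0.8, 0.1, 0.5])
nearby = tree.query_ball_point(candidate, r=0.05)
print("events inside the window:", len(nearby))
```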

Experiences with Distributed and Parallel Matlab on CCS

Anne Trefethen, Daniel Goodman, Stef Salvini; Oxford University

Matlab has become one of the essential computational tools for many engineers and scientists. The environment enables quick development of applications with integrated visualization and many toolboxes aimed at specific algorithmic or application areas. One concern for Matlab users has become the issue of tackling larger-scale problems and utilizing multiple processors, be they clusters of processors or systems with multicore processors. In response to these user requirements, The MathWorks has developed a toolbox for distributed Matlab that enables applications suited to embarrassingly parallel or loosely coupled computations, and that increasingly supports more fine-grained applications. We will give an initial report on our use of this environment for a number of applications on the Microsoft CCS, considering the ease of integration with the system and providing a view of future tools and techniques for enhancing the existing system.

Abstracts 10/22

Abstracts for Monday, October 22, 2007

Plenary Presentation

Computing: The Future of Science and Innovation

Daniel A. Reed, Chancellor’s Eminent Professor, University of North Carolina at Chapel Hill, Director, Renaissance Computing Institute

Ten years – a geological epoch on the computing time scale. Looking back, a decade brought the web and consumer email, digital cameras and music, broadband networking, multifunction cell phones, WiFi, HDTV, telematics, multiplayer games, electronic commerce and mainstream computational science. It also brought spam, identity theft, software insecurity, globalization, information warfare, blurred work-life boundaries, distributed sensors and inexpensive storage and clusters. What will another decade of technology advances bring to scientific discovery? As Yogi Berra famously noted, “It’s hard to make predictions, especially about the future.” Without doubt, though, scientific discovery via computing is moving rapidly from a world of homogeneous parallel systems to a world of distributed software, virtual organizations and high-performance, deeply parallel systems. In addition, a tsunami of new experimental and computational data and a suite of increasingly ubiquitous sensors pose equally vexing problems in data analysis, transport, visualization and collaboration. This talk describes a Renaissance vision and approach to solving some of today’s most challenging scientific and societal problems using powerful new computing tools likely to emerge over the next decade.

Environmental eScience

COVE: A Collaborative Ocean Visualization Environment

Keith Grochow, University of Washington

It is clear that in order to better understand our planet we need to better understand the oceans. To accomplish this, the University of Washington is currently creating the design for an unprecedented oceanographic sensor system off the coast of the northwestern United States. This will provide a continuous presence on the ocean floor and throughout the water column in areas of scientific interest. To most effectively compare and create designs for the basic infrastructure and specific experiments, we have developed COVE, an interactive environment that allows a broad range of scientists and engineers to work together in this activity. Through a combination of bathymetry and data visualization, interactive layout tools, and workflow integration, COVE provides an intuitive shared environment to quickly and cheaply test ideas and compare various approaches. We have deployed the system across the current design team with positive results and are investigating community outreach and educational scenarios for our work.

An Adaptive Programming Framework for Data and Event Driven eScience

Dennis Gannon, Indiana University; Beth Plale, Indiana University

This presentation describes work on a programming framework that combines rule-based event monitoring and adaptive workflow systems. The domain of application for this work spans a wide variety of problems in eScience in which external events, such as those detected by sensors, human actions, or database updates, must automatically trigger computationally significant actions. The results of these actions may feed back into the event stream and trigger additional actions, or they may adaptively alter the path of other computations. The philosophy underlying this work is driven by an observation on the iterative nature of knowledge discovery. In data-driven application domains, knowledge acquisition is often a discovery process carried out by successive refinement. A scientist has an initial hunch about the behavior of a process or system and progresses to an answer through a process that combines and repeats discovery and hypothesis testing. The work will be validated by its applicability to three diverse use cases. The first involves monitoring Doppler radar streams for severe storm signatures and automatically launching tornado forecast workflows. The second involves the iterative exploration of ligand-protein binding workflows in the drug discovery pipeline. In both of these cases the computational and data analysis actions are non-trivial and require distributed resources. The third use case addresses the issue of maintaining and optimizing underlying computational resources and adapting the workflows of other active queries in the system in response to significant events. The proposed programming model is a reactive rule model, which we conjecture can very fruitfully leverage complex event detection and adaptive distributed workflow.
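A reactive rule model of this flavor can be caricatured in a few lines: rules pair an event predicate with an action, and a dispatcher applies them to an incoming stream. The rule conditions, event fields, and thresholds below are invented for illustration and are not part of the framework described in the talk.

```python
# Toy event-condition-action loop.
rules = []

def rule(predicate):
    """Register an event-condition-action rule."""
    def register(action):
        rules.append((predicate, action))
        return action
    return register

@rule(lambda e: e["type"] == "radar" and e["vorticity"] > 0.9)
def launch_forecast(event):
    print(f"severe signature at {event['site']}: launching forecast workflow")

@rule(lambda e: e["type"] == "db_update")
def rescore_ligands(event):
    print(f"database updated ({event['table']}): re-running docking workflow")

def dispatch(event_stream):
    for event in event_stream:
        for predicate, action in rules:
            if predicate(event):
                action(event)          # actions may emit new events in turn

dispatch([
    {"type": "radar", "site": "KTLX", "vorticity": 0.95},
    {"type": "db_update", "table": "ligands"},
    {"type": "radar", "site": "KFWS", "vorticity": 0.2},
])
```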

Applications of eScience in Environmental Sciences

Robert Gurney, University of Reading; Jon Blower, University of Reading; Ned Garnett, NERC

Environmental science is undergoing a revolution. The availability of frequent global observations, particularly from satellites, the availability of high performance computing to model the environment at high spatial and temporal resolutions with processes represented explicitly, and the ability to organize, compare, visualize and exchange observations and model results give us for the first time an ability to predict both natural and human-induced changes and to put error budgets on the predictions. eScience is vital for this revolution, allowing both data and models to be shared, and allowing large model ensembles to be run easily. Several examples will be given, drawing on UK experience with its environmental eScience programs. A particular focus is on climate prediction, where climate models with reduced resolution can be run on clusters as ensembles to understand the errors in predictions, either by running very-many-member ensembles, or by running for very long periods to understand changes over geologic time. It has been found, for instance, that climate over the next century could warm by up to 12°C, or not at all, despite the generally published warming being predicted at between about 2-5°C with smaller ensembles. Other results, from confronting ocean models with observations, help us to understand ocean overturning and its variability in the 20th century, to aid predictions of coastal waters globally, and to clarify the role of Arctic sea ice in the global system. All of these projects have led to internationally leading sets of publications. We will also look forward to where the next computing developments are needed to sustain these environmental developments, and discuss how this dialogue between environmental and computer scientists can be strengthened.

Knowledge Modeling and Discovery

In Support of eScience: The Shift from Information Retrieval to Information Synthesis

Catherine Blake, UNC Chapel Hill

Cyber infrastructure provides students, scientists and policy makers with an unprecedented quantity of information; for example, in biomedicine PubMed adds 12,000 new citations each week, and the top chemistry journals publish more than a hundred thousand articles in a single year. Despite advances in information access, the quantity of information far exceeds human cognitive processing capacity. Consider a breast cancer scientist who must sift through the 12,600 articles published during the 28 months required to conduct a systematic review, a process used to resolve conflicting evidence. In addition to quantity, evidence related to the complex research questions posed by scientists transcends traditional disciplinary boundaries and thus requires a multi-, inter-, or trans-disciplinary approach. I will describe how recent advances in natural language processing, specifically in recognizing textual entailment and in generating multi-document summaries, can enable new kinds of e-science. This next generation of information tools recognizes the contradictions and redundancies that are inevitable in the information-intensive environment in which a scientist operates. Using existing systems that account for complex interdependencies between scientific articles as examples, I will show how these systems embody the shift from information retrieval to information synthesis. I will conclude with preliminary results from Claim Jumper, a system that captures the spirit of gold miners searching for nuggets of knowledge in a new frontier, and reflects a scientist's transition through traditional disciplinary boundaries. Given a topic, query and set of articles, Claim Jumper generates a fluent, well-organized summary from the published literature that accounts for redundancy.

Computational Discovery of Explanatory Process Models

Pat Langley, Arizona State University

Most research on computational knowledge discovery has focused on descriptive models that only summarize data and utilized formalisms developed in AI or statistics. In contrast, scientists typically aim to develop explanatory models that make contact with background knowledge and use established scientific notations. In this talk, I present an approach to computational discovery that encodes scientific models as sets of processes that incorporate differential equations, simulates these models’ behavior over time, incorporates background knowledge to constrain model construction, and induces the models from time-series data. I illustrate this framework on data and models from a number of fields, including ecology, environmental science, and biochemistry. In addition, I report on recent extensions that draw upon additional knowledge to reduce search, that combine models to lower generalization error, and that handle data sets with missing observations. Moreover, rather than aiming to automate construction of such models, I describe our efforts to embed these methods in an interactive software environment that lets scientist and computer jointly create and revise explanations of observed phenomena. This talk describes joint work with Kevin Arrigo, Stuart Borrett, Matthew Bravo, Will Bridewell, and Ljupco Todorovski.
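A stripped-down version of the process-modeling loop, simulating a candidate differential-equation model and scoring it against a time series, is sketched below; the logistic "process", the synthetic data, and the crude grid search stand in for the richer induction machinery described in the talk.

```python
import numpy as np
from scipy.integrate import odeint

def logistic_process(x, t, growth_rate, capacity):
    """One candidate 'process': logistic growth of a population."""
    return growth_rate * x * (1.0 - x / capacity)

def model_error(params, t_obs, x_obs):
    """Fit quality of a candidate model against an observed time series."""
    trajectory = odeint(logistic_process, x_obs[0], t_obs, args=tuple(params))
    return float(np.mean((trajectory.ravel() - x_obs) ** 2))

# Synthetic "observations" generated from known parameters plus noise.
t_obs = np.linspace(0, 20, 50)
truth = odeint(logistic_process, 1.0, t_obs, args=(0.4, 60.0)).ravel()
x_obs = truth + np.random.default_rng(0).normal(0, 1.0, truth.shape)

# A grid search over process parameters stands in for the induction step.
candidates = [(r, k) for r in (0.2, 0.4, 0.8) for k in (40.0, 60.0, 80.0)]
best = min(candidates, key=lambda p: model_error(p, t_obs, x_obs))
print("best (growth_rate, capacity):", best)
```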

Visualizing Research Connections: Enhancing Creativity in Hypothesis Generation on the Web

Sherrilynne Fuller, University of Washington

The introduction of sophisticated web search engines has greatly improved the ability of scientists to identify relevant research; however, the retrieved set of articles, even when utilizing the advanced search capabilities of search engines such as Google Scholar or PubMed, overwhelms even the most diligent researcher. Tools are needed to help scientists rapidly tease out relevant research findings that will contribute to their picture of potential directions for future research and will thus enhance the hypothesis generation process. Specialized query and visualization tools are needed which support extraction and navigation of research findings and enhance a natural question/answer approach, which is a critical aspect of the hypothesis generation process. For example: given a connection between x and y, what else do we know about other connections to each that suggest a mechanism for action? At the present time web search engines do not extensively leverage findings from critical information retrieval research regarding document structure, or research about the behavior of scientists, particularly in the areas of hypothesis generation and scientific creativity. A review of relevant research findings from the University of Washington and elsewhere will be presented and potential directions for future work will be discussed.

Synthetic Biology

A Novel Computational Method to Infer Signal Transduction Networks from Indirect Causal Evidence

Bhaskar DasGupta, UIC

Our (Albert, DasGupta, Dondi, Kachalo, Sontag, Zelikovsky, Westbrooks) work proposes a novel computational method to solve the biologically important problem of signal transduction network synthesis from indirect causal evidence. This is a significant and topical problem because there are currently no high-throughput experimental methods for constructing signal transduction networks, and the understanding of many signaling processes is limited to the knowledge of the signal(s) and of key mediators' positive or negative effects on the whole process. We illustrate the biological usability of our software by applying it to a previously published signal transduction network and by using it to synthesize and simplify a novel network corresponding to activation-induced cell death in large granular lymphocyte leukemia. Our methodology serves as an important first step in formalizing the logical substrate of a signal transduction network, allowing biologists to simultaneously synthesize their knowledge and formalize their hypotheses regarding a signal transduction network. Therefore we expect that our work will appeal to a broad audience of biologists. The novelty of our algorithmic methodology based on non-trivial combinatorial optimization techniques makes it appealing to a broad audience of computational biologists as well. The relevant software NET-SYNTHESIS is freely downloadable.

Some references: R. Albert, B. DasGupta, R. Dondi, S. Kachalo, E. Sontag, A. Zelikovsky and K. Westbrooks, A Novel Method for Signal Transduction Network Inference from Indirect Experimental Evidence, Journal of Computational Biology, 14 (7), 927-949, 2007. R. Albert, B. DasGupta, R. Dondi and E. Sontag, Inferring (Biological) Signal Transduction Networks via Transitive Reductions of Directed Graphs, to appear in Algorithmica.
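The transitive-reduction step referenced above has a simple graph-theoretic core, sketched below with networkx on a toy signalling graph; this is a generic sketch, not the NET-SYNTHESIS implementation, and it assumes the graph is acyclic.

```python
import networkx as nx

def transitive_reduction(g):
    """Remove every edge (u, v) that is implied by a longer directed path from
    u to v; for a DAG this yields the unique transitive reduction."""
    reduced = g.copy()
    for u, v in list(g.edges()):
        reduced.remove_edge(u, v)
        if not nx.has_path(reduced, u, v):   # edge was essential, put it back
            reduced.add_edge(u, v)
    return reduced

# Toy signalling graph: the direct edge A -> C is implied by A -> B -> C.
g = nx.DiGraph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(sorted(transitive_reduction(g).edges()))   # [('A', 'B'), ('B', 'C'), ('C', 'D')]
```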

A Syntactic Model to Design and Verify Synthetic Genetic Constructs Derived from Standard Biological Parts

Jean Peccoud, Virginia Bioinformatics Institute; Yizhi Cai, Virginia Bioinformatics Institute

The sequence of artificial genetic constructs is composed of multiple functional fragments, or genetic parts, involved in different molecular steps of gene expression mechanisms. Biologists have deciphered structural rules that the design of genetic constructs needs to follow in order to ensure successful completion of the gene expression process, but these rules have not been formalized, making it challenging for non-specialists to benefit from the recent progress in gene synthesis. We show that context-free grammars (CFG) can formalize these design principles. This approach provides a path to organizing libraries of genetic parts according to their biological functions, which correspond to the syntactic categories of the CFG. It also provides a framework for the systematic design of new genetic constructs consistent with the design principles expressed in the CFG. Using parsing algorithms, this syntactic model enables the verification of existing constructs. We illustrate these possibilities by describing a CFG that generates the most common architectures of genetic constructs in E. coli. GenoCAD allows biologists to experiment with the algorithms outlined in this presentation.
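The grammar-based verification idea can be shown on a deliberately tiny grammar; the part categories, part names, and production rules below are simplified placeholders, not GenoCAD's actual grammar.

```python
# Tiny context-free grammar for a common expression-cassette architecture.
GRAMMAR = {
    "cassette":   [["promoter", "rbs", "gene", "terminator"]],
    "promoter":   [["pLac"], ["pTet"]],
    "rbs":        [["B0034"]],
    "gene":       [["gfp"], ["lacZ"]],
    "terminator": [["B0015"]],
}

def parses(symbol, tokens):
    """Return every suffix of `tokens` left over after deriving `symbol`."""
    if symbol not in GRAMMAR:                      # terminal part
        return [tokens[1:]] if tokens and tokens[0] == symbol else []
    remainders = []
    for production in GRAMMAR[symbol]:
        partial = [tokens]
        for rhs_symbol in production:
            partial = [rest for toks in partial for rest in parses(rhs_symbol, toks)]
        remainders.extend(partial)
    return remainders

def is_valid_construct(parts):
    return [] in parses("cassette", list(parts))   # fully consumed => valid

print(is_valid_construct(["pLac", "B0034", "gfp", "B0015"]))   # True
print(is_valid_construct(["pLac", "gfp", "B0034", "B0015"]))   # False (RBS after gene)
```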

Designing Synthetic Genomes with BioStudio, and Analyzing Their Social Networks Using Graph Algorithms

Joel Bader, Johns Hopkins University; Jef Boeke, Johns Hopkins School of Medicine; Sarah Richardson, Johns Hopkins School of Medicine

Eukaryotic genomes, including ours, have many parts — introns, transposons, redundant genes — whose functions remain obscure or appear unnecessary. To understand how life works, we are redesigning the yeast genome by refactoring its organization and inserting 'debug code' for downstream functional tests. Producing the synthetic genome requires a team of designers who are supported by BioStudio, an integrated design environment (IDE) modeled on IDEs and revision control systems for software developers. The synthetic life will be programmed to jettison genome chunks under wet-lab triggers, producing a combinatorial regression test of gene dependencies. The gene dependency network resembles a social network or WWW network, except that edges are more akin to social antipathy rather than friendship. We will present initial results of new data mining algorithms we have developed to analyze antipathy networks, including graph diffusion algorithms similar to PageRank but adapted for negative edge weights, and a variational Bayes method for fuzzy clustering of gene modules. Portions of this work were supported through the Microsoft eScience program.
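A generic sketch of diffusion on a signed network, a PageRank-like iteration in which negative edges push scores down, is given below; the matrix, restart parameter, and normalization are assumptions made for illustration, not the authors' algorithm.

```python
import numpy as np

def signed_diffusion(adjacency, seed, restart=0.15, iters=100):
    """PageRank-style diffusion on a signed interaction network: scores spread
    along edges each step (negative edges subtract), with a restart term that
    keeps mass anchored at the seed genes."""
    norm = np.abs(adjacency).sum(axis=1, keepdims=True)
    walk = adjacency / np.where(norm == 0, 1, norm)   # normalize, keep edge signs
    score = seed.astype(float)
    for _ in range(iters):
        score = restart * seed + (1 - restart) * walk.T @ score
    return score

# Toy signed interaction matrix for 4 genes: +1 cooperative, -1 antagonistic.
A = np.array([[ 0,  1, -1,  0],
              [ 1,  0,  0, -1],
              [-1,  0,  0,  1],
              [ 0, -1,  1,  0]], dtype=float)
seed = np.array([1.0, 0.0, 0.0, 0.0])                 # diffuse from gene 0
print(signed_diffusion(A, seed).round(3))
```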

E-Neuroscience

A Virtual Fly Brain

Jano Van Hemert, National e-Science Centre; Douglas Armstrong, University of Edinburgh; Malcolm Atkinson, National e-Science Centre

Research into animal and human health covers a vast array of biological components and functions. Yet strategies to simulate biological systems across multiple levels, by integrating many components and modeling their interaction, are largely undeveloped. We will explore how this challenge can be approached by considering how to build a virtual fly brain. This offers a new proving ground for collaboration between e-Scientists, biologists and neuroinformaticists. Mental health accounts for 11% of the global disease burden; it is growing rapidly, yet it is one of the most challenging areas for drug discovery and development. Realistic models that capture the processes of the human brain would provide new insights into the diagnosis and treatment of certain disorders. However, to achieve this, we need to begin by working from much simpler models. The brain of Drosophila contains in the region of 100,000 neurons; it provides perhaps the simplest brain capable of what we would consider complex behavior, much of which offers insight into animal and human cognition. The genome was sequenced in 2000 and efforts to improve its functional annotation are highly integrated (www.flybase.org). Of the estimated 12,000 Drosophila genes, more than 2,000 are conserved in human disease indications. In order to bring together the many disciplines, the e-Science Institute of the UK has sponsored a theme to establish a point of focus for bioinformatics and neuroinformatics in Drosophila, such that gaps in the current databases, biological domain and modeling/simulation efforts can be identified and translated into new projects. In the context of e-Science, the project will serve as a test bed for a new service-oriented platform to enable a distributed data integration and data mining infrastructure, which will be developed in a European project.

Algorithms for Petascale Analysis of Neural Circuitry

Hanspeter Pfister, IIC, Harvard University; Michael Cohen, Microsoft Research; Jeff Lichtman, Harvard University; Clay Reid, Harvard University; Alex Colburn, Washington University

Determining the detailed connections in brain circuits is a fundamental unsolved problem in neuroscience. Understanding this circuitry will enable brain scientists to confirm or refute existing models, develop new ones, and come closer to an understanding of how the brain works. However, advances in image acquisition have not yet been matched by advances in algorithms and implementations that will be capable of enabling the analysis of neural circuitry. The primary challenges are:

  1. The robust alignment of high resolution images of 2D slices of neural tissue to construct three dimensional volumes
  2. Interactive visualization of volumetric petascale data
  3. Automatic segmentation of neural structures such as axons and synapses to define neural circuit elements
  4. Network analysis to enable the comparison of neural circuits

The Harvard Center for Brain Science (CBS) and the Harvard Initiative in Innovative Computing (IIC) in collaboration with Microsoft Research are addressing these challenges by developing algorithms and tools to enable petascale analysis of neural circuits. In particular, we are developing algorithms capable of reconstructing large 3D volumes from collections of 2D images, scalable interactive visualization tools, semi-automatic segmentation methods for neural circuitry, and novel graph matching tools for connectivity analysis of neural circuits.
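One standard building block for the slice-alignment challenge is phase correlation, which recovers a pure translation between two images from their cross-power spectrum; the numpy sketch below illustrates it on synthetic data and ignores rotation, scaling, and the nonlinear warping real tissue sections require.

```python
import numpy as np

def phase_correlation_shift(fixed, moving):
    """Estimate the (row, col) translation by which `moving` is shifted relative
    to `fixed`, using the normalized cross-power spectrum (translation only)."""
    f = np.fft.fft2(fixed)
    m = np.fft.fft2(moving)
    cross_power = np.conj(f) * m
    cross_power /= np.abs(cross_power) + 1e-12
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(correlation.argmax(), correlation.shape)
    # peaks beyond the midpoint correspond to negative shifts (FFT wrap-around)
    return tuple(p - s if p > s // 2 else p for p, s in zip(peak, fixed.shape))

rng = np.random.default_rng(0)
slice_a = rng.random((128, 128))
slice_b = np.roll(slice_a, shift=(5, -7), axis=(0, 1))   # slice_b is slice_a shifted
print(phase_correlation_shift(slice_a, slice_b))          # expect (5, -7)
```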

Bootstrapping the Practical Semantic Web for eNeuroscience: Alzheimer Disease Research Hypotheses in RDF

Tim Clark, Harvard IIC

Alzheimer Disease (AD) and other neurodegenerative disorders (Parkinson's, Huntington's, ALS, etc.) are the poster children for science on the semantic web. Progress in these fields is dependent upon coordination and integration of knowledge developed in many research subspecialties, from genetics to brain imaging. Despite a massive increase in the quantity of information generated as research findings in AD over the past 20 years, there is still no consensus on the etiology of the disease. Integrating the findings of many disparate specialist fields into testable hypotheses, and evaluating competing hypotheses against one another, is still extremely challenging. Our group has developed an ontology of scientific discourse adapted for practical use by neuroscientists, with an annotation tool for organizing knowledge around semantically structured hypotheses. This tool has enabled us, working with leading AD researchers and science editors, to compile many of the core hypotheses developed by scientists in the field for presentation to the research community in semantic web format, with their supporting evidence, relationships to other hypotheses at the level of claims, and other key related information. This model of wrapping scientific context, in the form of semantic metadata, around more traditional digital content on the web is extensible across many biomedical research disciplines. We believe it will enable more rapid and efficient progress towards curing several devastating diseases.

Development of NeuroElectroMagnetic Ontologies (NEMO): A Framework for Mining Brainwave Ontologies

Dejing Dou, University of Oregon

Event-related potentials (ERP) are brain electrophysiological patterns created by averaging electroencephalographic (EEG) data, time-locking to events of interest (e.g., stimulus or response onset). In this work, we propose a generic framework for mining and developing domain ontologies and apply it to mine brainwave (ERP) ontologies. The concepts and relationships in ERP ontologies can be mined according to the following steps: pattern decomposition, extraction of summary metrics for concept candidates, hierarchical clustering of patterns for classes and class taxonomies, and clustering-based classification and association rule mining for relationships (axioms) of concepts. We have applied this process to several dense-array (128-channel) ERP datasets. Results suggest good correspondence between mined concepts and rules, on the one hand, and patterns and rules that were independently formulated by domain experts, on the other. Data mining results also suggest ways in which expert-defined rules might be refined to improve ontology representation and classification results. The next goal of our ERP ontology mining framework is to address some long-standing challenges in conducting large-scale comparison and integration of results across ERP paradigms and laboratories. Along this direction, we are conducting two research projects: i) semantic data modeling and query answering based on ERP ontologies and ii) mapping discovery across multi-modality ontologies (i.e., surface space vs. source space). In a more general context, this work illustrates the promise of an interdisciplinary research program, which combines data mining, neuro-informatics and ontology engineering to address real-world problems. This talk is mainly based on our paper in KDD 2007.
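The hierarchical-clustering step of the pipeline can be illustrated with SciPy on synthetic pattern metrics; the features, values, and cluster count below are invented for the sketch and do not come from the ERP datasets described here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for the "summary metrics per pattern" step: each row describes an
# ERP pattern by (peak latency in ms, mean amplitude in microvolts, region index).
rng = np.random.default_rng(0)
early_frontal = rng.normal([100, 2.0, 1], [10, 0.3, 0.1], size=(20, 3))
late_parietal = rng.normal([300, 5.0, 3], [15, 0.5, 0.1], size=(20, 3))
patterns = np.vstack([early_frontal, late_parietal])

# Hierarchical clustering groups patterns into candidate ERP "concepts".
tree = linkage(patterns, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])   # expect two groups of 20
```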

Social Networking Tools for Science

The Future Scientific Information System

Michael Kurtz, Smithsonian

Over approximately the past two decades, the promise of computer-network-based data and information systems has become a reality exceeding all but the most optimistic predictions. The once separate concepts of measured data, processed data, scientific paper, scientific journal, author, reader, publisher, and library are now merging in often unexpected ways. In the next couple of decades, standardized data formats, standardized data reduction capabilities, and deep standardized mark-up will combine with automated workflow systems to form scientific communication units (journal articles or their successors) with profound capabilities for additional discovery. Our literature will become alive. Clearly, once we create a system in which journal articles permit reader-mediated discovery, we will also have a system that permits automated discovery by software agents, and meta-discovery by meta software agents looking at the output of the “lower level” agents, and so on. In this talk I will discuss these changes from a structural communications point of view, complementary to the data/database-centric view of Szalay and Gray (Nature 440, 413, 2006).

Blogs, Logs, and Pods - a Way to a Smart Laboratory

Jeremy Frey, University of Southampton

“Data, data everywhere, but not any time to think” is a possible mantra for the problems of scientific data overload. The CombeChem Project (www.combechem.org) takes a holistic approach to the undertaking of scientific laboratory research with a view to improving the quality, accessibility, and re-use of chemical information. The project is investigating the use of e-Science technologies based on the idea of Publication@Source. This approach highlights the researcher’s responsibility to collect the scientific data with the fullest possible context from the start of the research process and to ensure that none of the material or context is lost as the data is processed, refined, analyzed, and disseminated. Part of the CombeChem project investigated how e-Science could provide the mechanisms to support this ideal of laboratory research in a globalized and multidisciplinary world. I will illustrate how the ideas of e-Science, together with current collaborative tools such as blogs and wikis, can be applied to provide a “Smart Laboratory” environment that works with researchers to improve the quality of research. Examples will be given of the use of tablet PCs as Electronic Laboratory Notebooks (ELNs) that enable the recording of semantically rich statements about the research process, and of the use of blogs as laboratory notebooks for collaborative research. Examples will also be given from the laboratory perspective, where e-Science technology has been used to enhance remote monitoring and control of smart laboratories by elevating laboratory equipment to first-class members of the networked community and converting simple equipment into “Blogjects”. Once the information has been collected, techniques are needed to view, integrate, and review the information in the ELNs and blogs, and I will discuss how the Semantic Web and Web 2.0 can play their part in this.

Using the Web for Science, and Science for the Web

James Hendler, RPI

Work in creating infrastructures for scientific collaboration has largely given way to the generalized technologies of the Web, with scientists now relying on blogs, “overlay” journals, and scientific wikis. The next Web technology likely to affect scientists is the emerging Semantic Web. In the UK, an investment in eScience produced a number of approaches that used this technology to provide better data integration, management of scientific workflows, and the provenancing of information in scientific systems. The use of ontologies in scientific applications matched Semantic Web technologies well, and projects showed how this technology could be applied to the needs of scientists. These techniques have been gaining wider visibility through their integration into large efforts such as the new “Encyclopedia of Life.” At the same time that the Web has been having a significant impact on science, the inverse has also been true. Major breakthroughs in new applications for the Web have come about by using scientific tools to analyze the Web and identify techniques scalable to Web sizes. The best known of these is the PageRank algorithm that gave rise to Google. Other efforts have included architecture and engineering work to help steer Web development, and social science efforts to understand why some technologies and/or efforts have scaled well (such as Wikipedia) while others have not achieved the critical mass to succeed. We present both the capabilities that new Web technologies will provide to science and the need for a better science of the Web. We argue that more interaction between the traditional sciences and the emerging “web scientists” will lead to new synergies with revolutionary impact both on the use of the Web in science and on the science of the Web. This talk includes joint work with T. Berners-Lee, W. Hall, N. Shadbolt, and D. Weitzner.

Semantic Eco-Blogging: Toward a Global Human Sensor Net

Joel Sachs, Cynthia Parr, UMBC; David Wang, University of Maryland; Andriy Parafiynyk, Microsoft; Timothy Finin, UMBC

Eco-blogs are becoming popular amongst both amateur nature lovers and working biologists. Subject matter varies, but entries typically include date, location, observed taxa, and a description of behavior. These observations can be an important part of the ecological record, especially in domains (such as invasive species science) where amateur reporting plays an important role, and in the study of environmental response to climate change. To enable our goal of a human sensor net, we have developed SPOTter, a Firefox extension that enables the easy creation of RDF data by citizen scientists. SPOTter is not tied to a particular blogging platform, and can be used both to add semantic markup to one’s own blog posts and to annotate posts or images on other websites, such as Flickr. Once RDF is generated, we can apply much of the machinery we have developed as part of the SPIRE project. This includes Swoogle, our Semantic Web search engine; Tripleshop, our distributed dataset constructor; and ETHAN, our evolutionary trees and natural history ontology. We are then able to issue queries like: What was the northernmost spotting of the Emerald Ash Borer last year? Show all sightings of invasive plants in California. We experiment with our approach on the Fieldmarking blog. We also expressed in RDF the 1,200 observations from the first Blogger Bioblitz and, through integration with other ontologies, were able to respond to ad-hoc queries. Our talk will demonstrate how eco-blog posts end up on the Semantic Web, where they can be integrated with existing natural history information and queried. We will illustrate how scientists can share data by annotating it with RDF, publishing it via plug-ins to popular software, and making it accessible via new tools and Web mashups. Issues of provenance and reliability will also be addressed.
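
To make the query scenario concrete, here is a small hedged sketch of how a "northernmost sighting" question could be answered over RDF observations; the vocabulary and data are invented and are not SPIRE's actual schema.

```python
# Invented observation vocabulary and data; illustrates the query style only.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

OBS = Namespace("http://example.org/obs#")
g = Graph()

sightings = [  # (taxon, latitude, date): made-up observations
    ("Emerald Ash Borer", 41.5, "2007-06-02"),
    ("Emerald Ash Borer", 44.9, "2007-07-15"),
    ("Emerald Ash Borer", 43.1, "2007-05-20"),
]
for i, (taxon, lat, date) in enumerate(sightings):
    s = OBS[f"sighting{i}"]
    g.add((s, RDF.type, OBS.Sighting))
    g.add((s, OBS.taxon, Literal(taxon)))
    g.add((s, OBS.latitude, Literal(lat, datatype=XSD.decimal)))
    g.add((s, OBS.observedOn, Literal(date, datatype=XSD.date)))

q = """
PREFIX obs: <http://example.org/obs#>
SELECT ?s ?lat WHERE {
  ?s a obs:Sighting ;
     obs:taxon "Emerald Ash Borer" ;
     obs:latitude ?lat .
} ORDER BY DESC(?lat) LIMIT 1
"""
for row in g.query(q):
    print("northernmost sighting:", row.s, "at latitude", row.lat)
```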

Shared Tasks and Shared Infrastructure

Enhancement, Deployment, and Generalization of Metadata Technologies

Steve Harris, Oxford University; Jim Davies, Oxford University Computing Laboratory; Charles Crichton, Oxford University Computing Laboratory; Peter Maccallum, CR-UK Cancer Research Institute, Cambridge; Lorna Morris, CR-UK Cancer Research Institute, Cambridge

The Software Engineering Group at Oxford are developing a model- and metadata-driven architecture for research informatics. The architecture is being evaluated for use in large-scale clinical trials on both sides of the Atlantic, and is being integrated with the NCI cancer Biomedical Informatics Grid (caBIG); it is being enhanced and generalized for a wider range of applications, and for use in other scientific disciplines. We present the achievements to date, and the lessons learnt in developing frameworks for semantics-driven data acquisition and processing. We report on the deployment of the (open) architecture on widely-used commercial technology – Microsoft SharePoint, Office, and .NET services – aimed at the expectations and requirements of a wide range of stakeholders. We discuss extensions of the approach to the model-driven development of laboratory information systems, including semantic annotation of tissue collections and microarray data. We discuss techniques for identifying candidate semantic metadata elements from existing artifacts (developed in collaboration with the Veteran’s Health Administration) and the deployment of federated metadata registries (developed in collaboration with caBIG). We explain how the approach may be generalized to cover semantics-driven data acquisition and processing in other disciplines.

SenseWeb: Shared Macro-scopes for Scientific Exploration

Aman Kansal, Microsoft Research; Feng Zhao, Microsoft Research; Suman Nath, Microsoft Research

Many advances in science come from observing the previously unobserved. However, developing, deploying, and maintaining the instrumentation required to observe the phenomenon under investigation is a significant overhead for scientists. In most cases, scientists are restricted to collecting data using limited individual resources. As a first step to overcoming this limitation, central archives for sharing data have emerged, so that data collected in individual experiments can be re-used by others. We take the next step in this direction: we build an infrastructure, SenseWeb, to enable sharing the sensing instrumentation itself among multiple teams. The key idea is as follows. A scientist deploys sensors to observe a phenomenon, say soil moisture, at their site. The sensors are shared over SenseWeb. Other scientists interested in soil moisture can conduct experiments using these sensors through SenseWeb. Further, other ecologists may deploy similar sensors at their sites and share them. The scientist can now use SenseWeb to access not only her own sensors but also these other similar ones. What emerges is a “macro-scope” of shared sensors measuring the phenomenon at a scale that no single scientist could instrument alone. New experiments are enabled, providing new insights by probing a phenomenon from multiple sites. The barrier to discovery is reduced, as many experiments can begin without deployment overhead. SenseWeb addresses challenges in supporting highly heterogeneous sensors, each with its own capability, precision, or sharing willingness. It is built for scalability, allowing multiple concurrent experiments to access common resources. Its map-based web interface provides data visualization. Our prototype is currently used by nearly a dozen research teams to share sensors observing phenomena ranging from coral ecosystems to urban activity.

Distributed Annotation of High Resolution Biological Images

Eric Rouchka, University of Louisville; Yetu Yachim, University of Louisville

Background: Biological imaging techniques, coupled with the affordability of large-scale storage systems, have made it possible to construct databases of high-resolution images. It is not uncommon for such images to exceed 500 MB in size. Conventional approaches for viewing and manipulating these images have typically been reserved for desktop applications that tend to be slow and resource-intensive. While this sort of approach may be acceptable for single-user applications, the internet has made it possible for geographically dispersed research teams to form. Bandwidth bottlenecks do not allow for effective real-time sharing of these high-resolution images without loss of detail due to image compression. Results: We have created a system, YMAGE, for the storage, distribution, and shared annotation of high-resolution images. Users of the YMAGE system will be able to create and connect to YMAGE-registered servers distributed across the internet. YMAGE maintains the full resolution of the images by requesting and sending only the viewable region of the image, which can be changed using zooming utilities. The initial application of YMAGE is for in-situ hybridization images, such as those created through the Allen Brain Atlas project. However, the extensibility of YMAGE allows high-resolution images of any nature to be shared across geographically distributed research groups without loss of information due to image compression. YMAGE users log in to a shared user database where they are validated. Each image can be assigned as belonging to a group of users, including a public group. Users are able to view the annotations assigned to each image by various research groups and to add their own annotations as well.
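
The region-based request strategy can be illustrated with a small hedged sketch (not the YMAGE protocol): given a viewport and zoom level, compute only the tile indices that need to be fetched.

```python
# Illustrative tile computation; the tiling scheme is assumed, not YMAGE's actual design.
import math

def tiles_for_viewport(view_x, view_y, view_w, view_h, tile_size=256, zoom=1.0):
    """Return (col, row) tile indices covering a viewport given in full-resolution
    image coordinates. zoom < 1 means the image is downsampled before tiling,
    so fewer tiles cover the same on-screen region."""
    x0, y0 = view_x * zoom, view_y * zoom            # viewport in zoomed-image coords
    x1, y1 = (view_x + view_w) * zoom, (view_y + view_h) * zoom
    first_col, first_row = int(x0 // tile_size), int(y0 // tile_size)
    last_col = int(math.ceil(x1 / tile_size)) - 1
    last_row = int(math.ceil(y1 / tile_size)) - 1
    return [(c, r) for r in range(first_row, last_row + 1)
                   for c in range(first_col, last_col + 1)]

# A 1024x768 window onto a very large image, viewed at 25% zoom:
needed = tiles_for_viewport(20000, 15000, 1024, 768, zoom=0.25)
print(f"{len(needed)} tiles requested instead of the whole image:", needed)
```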

Biomedical Informatics Research Network: A National Collaboratory Fostering a New Biomedical Culture and Infrastructure to Hasten the Derivation of New Understanding and Treatment of Disease

Jeffrey Grethe, Univ. of California, San Diego; Mark Ellisman, NCMIR, University of California, San Diego

The Biomedical Informatics Research Network (BIRN) promotes advances in biomedical and health care research through the development and support of a cyber infrastructure that facilitates data sharing and fosters a new biomedical collaborative culture. Sponsored by the NIH’s National Center for Research Resources, BIRN’s infrastructure consists of a cohesive implementation of key information technologies and applications specifically designed to support biomedical scientists in conducting their research. By intertwining the concurrent revolutions occurring in biomedicine and information technology, BIRN is enabling researchers to participate in large-scale, cross-institutional research studies where they are able to acquire, share, analyze, mine, and interpret data acquired at multiple sites using advanced processing and visualization tools. Some core components of this infrastructure, designed around a flexible large-scale grid model, include: a scalable and powerful data integration environment that allows users to access multiple databases as if they were a single database; the use and development of ontologies and data exchange standards; and a user portal that provides a common user interface, encouraging greater collaboration among researchers and offering access to a powerful suite of biomedical tools. The growing BIRN consortium currently involves more than 40 research sites that participate in one or more BIRN-related projects. The BIRN Coordinating Center is orchestrating the development and deployment of key infrastructure components for immediate and long-range support of the biomedical and clinical research being pursued. Building on this foundation, the NIH has recently released Program Announcements that encourage researchers to use the BIRN infrastructure to share data and tools or to use the infrastructure to federate significant data sets.

Abstracts 10/23

Abstracts for Tuesday, October 23, 2007

Plenary Presentation

HIV Vaccine Design

David Heckerman, Microsoft Research

I will describe several challenges in the design of an HIV vaccine and show how we have addressed them with statistical models in combination with high-performance computing. The statistical models we use are generative models, sometimes called graphical models or Bayesian networks. I will also discuss how these models can be used for genome-wide association studies—the search for connections between our DNA and disease. Finally, I will talk about how our work with scientists has led to improvements in statistical methods for learning generative models from data.
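
As a deliberately simplified, hedged illustration of the association-study question (not the generative graphical models described in the talk), the sketch below runs a per-variant contingency-table test on synthetic genotype and phenotype data.

```python
# Simplified stand-in for an association test; data and method are illustrative only.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n_people, n_variants = 500, 100

genotypes = rng.integers(0, 3, size=(n_people, n_variants))  # 0/1/2 minor-allele counts
phenotype = rng.integers(0, 2, size=n_people)                # case/control status

for v in range(3):  # test the first few variants
    table = np.zeros((3, 2))
    for g, p in zip(genotypes[:, v], phenotype):
        table[g, p] += 1                                     # genotype-by-phenotype counts
    chi2, pval, _, _ = chi2_contingency(table)
    print(f"variant {v}: chi2={chi2:.2f}, p={pval:.3f}")
```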

Data Management and Standardization

A Repository-based Framework for Capture, Management, Curation, and Dissemination of Research Data

Simon Coles, University of Southampton

Based on the e-Bank-UK and Repository for the Laboratory (R4L) projects, a working model for a scientific data capture, management, curation, and dissemination framework will be presented. The eCrystals repository has been constructed on an institutional repository platform and has been configured to ingest small-molecule crystallographic data generated by the UK National Crystallography Service, whilst the R4L repository supports a range of different types of analytical chemistry data. This model addresses the escalating data-deluge problem by integrating digital library technologies with both the research laboratory and established publication and dissemination routes. The institutional model provides a potential mechanism for the long-term archival and availability of information, enabling an institution to capture its research data output through integration into the laboratory environment. The repository ingest process ensures full capture of laboratory data and effective metadata creation at the point it is generated. A private archive provides effective management of the data, whilst an embargo procedure allows dissemination of results through a public archive in a timely manner. A schema for the dissemination of crystallographic data has been devised through consultation with the community, which enables effective harvesting by data centers and third-party aggregator services. The use of persistent identifiers provides a mechanism to permanently link the conventional scholarly article with its associated underlying dataset. Current work is investigating the issues associated with the construction of a federation of data repositories (institutional and subject-based) and its long-term integration into the publishing and chemical information provision processes.

CARMEN: Managing Data in Scientific Discovery

Tom Jackson, University of York; Jim Austin, University of York

We describe work within the CARMEN e-Science project which is addressing the challenge of creating & managing experimental data and methods within the context of Neuroscience research. The traditional research approach of testing a hypothesis and publishing the results is hampered in situations where others need to build on the results and need access to the data or original methods. CARMEN addresses this issue by allowing scientists to share data & methods within a collaborative Grid environment. The CARMEN platform (a CAIRN) is a Grid-based, shared data and services repository. Central to the data management challenge is providing the capability to data-mine large time-series data sets; the initial application of the system is spike train data from nerve cell recordings. We are examining how data can be represented effectively for diverse experimental methods. Also central to the investigations is providing generalized methods which allow users to publish and share their software services on the CAIRN for wider collaborative research. Initial investigations have shown that MATLAB is a common programming platform. Hence, we aim to provide an interactive MATLAB environment on the CAIRN, with a library of dynamically deployable services. A sister project, DAME, developed a distributed signal search engine, called Signal Data Explorer (SDE), which provides a platform for managing, viewing and searching time-series data. Interoperability between this and other software services is being investigated in the project. Currently, SDE invokes search services on data nodes using the PMC technology (Pattern Match Controller), allowing data to be searched remotely. We will generalize this function to allow any process to be run on remote data. CARMEN is keen to seek early engagement with Grid communities to facilitate constructive evaluation of the proposed approach.

AnIML - Analytical Information Markup Language, an International Effort toward an XML Standard for Analytical Chemistry

Mark Bean, GlaxoSmithKline

A retrospective on the creation of XML standards based on XML schema must include discussion of selecting a home for the standard, obtaining consensus, maintaining momentum over the years, and extensions of the standard in unexpected directions. AnIML XML standard development was hosted by the ASTM E13.15 Committee on Analytical Data and IUPAC, both international standards bodies, partially funded by the National Institute of Standards and Technology (NIST), but created by the efforts of volunteers across the globe with analytical domain and XML expertise. Merely scheduling meetings spanning a nine-hour time difference presented a challenge, and both wiki and net meeting technologies proved invaluable. AnIML consists of a flexible core XML schema which can be stretched around any analytical data set according to rules specified by extensible Technique Definition documents (LC, UV, NMR, MS, etc.) created by domain experts. These Technique Definitions can be extended to meet vendor- or industry-specific needs without breaking the core schema. This allows vendor-neutral applications (generic AnIML viewers) to be written. AnIML draws on experience with prior standards (JCAMP, ANDI). We will illustrate with examples from the core schema, Technique Definition documents, and AnIML XML files containing LCMS data.

Scientific Workflow

Data Management Challenges of Large-Scale, Data-Intensive Scientific Workflows

Ewa Deelman, USC-Information Sciences Institute

Many scientific applications, such as those in astronomy, earthquake science, gravitational-wave physics, and others, have embraced workflow technologies to do large-scale science. Workflows enable researchers to collaboratively design, manage, and obtain results from computations that involve hundreds of thousands of steps, access terabytes of data, and generate similar amounts of intermediate and final data products. Although workflow systems are able to facilitate the automated generation of data products, many issues remain to be solved. These issues take different forms across the workflow lifecycle. During workflow creation, appropriate input data need to be discovered. During workflow mapping and execution, data need to be staged into and out of the computational resources. As data are produced, they need to be archived with enough metadata and provenance information that they can be interpreted and shared among collaborators. This talk will describe the workflow lifecycle and discuss the issues related to data management at each step. Examples of challenge problems will be given in the context of the following applications: CyberShake, an earthquake science computational platform; Montage, an astronomy application; and LIGO’s binary inspiral search, a gravitational-wave physics application. These computations, represented as workflows, are running on today’s national cyber infrastructure, such as the OSG and the TeraGrid, and use workflow technologies such as Pegasus and DAGMan to map high-level workflow descriptions onto the available resources and execute the resulting computations. The talk will describe the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on the current cyber infrastructure. Particular emphasis will be given to issues related to the management of data throughout the workflow lifecycle.

myExperiment: Social Networking for Workflow-using eScientists

Carole Goble, University of Manchester; David De Roure, The University of Southampton

Workflows are scientific objects in their own right, to be exchanged and reused. myExperiment is an initiative to create a social networking environment for workflow workers. myExperiment is also planned as a marketplace to “shop” for workflows and services; a gateway to other environments; and a platform to launch workflows. We are currently beta testing the first phase of myExperiment, the social network, amongst a group of Life Scientists who develop Taverna workflows (www.mygrid.org.uk). We present the motivation for myExperiment and sketch the proposed capabilities. We report on the technical, political, and sociological issues and our experiences so far. Our greatest challenge is how to work with the inherent self-interest of the scientist to gain trusted and enthusiastic participation in an inherently altruistic activity that relies on the network effects of many members.

Scientific Workflows as Configurable, Resilient Data Transducers

Bertram Ludaescher, Shawn Bowers, Timothy McPhillips, Daniel Zinn; University of California, Davis

Interest in scientific workflows has grown considerably in recent years. It is now generally recognized that workflows enable scientists to harness IT in new ways, thus promising to dramatically accelerate scientific discovery in the future. Advantages of scientific workflows over other solutions include workflow automation, optimization, and result reproducibility (via a provenance framework). An often overlooked, but crucially important, area is the modeling and design of scientific workflows. We believe that better support for rapid development, adaptation, and evolution of workflows is on the critical path to widespread adoption of this technology: e.g., workflow designs should be resilient to procedural changes (task insertion, removal, modification) and schema changes (in inputs and outputs). To this end, we compare the models of computation (MoCs) underlying different kinds of workflows, i.e., current scientific workflows and more traditional business workflows. Our comparison of MoCs includes task dependency graphs (DAGs) common in Grid workflows, Petri nets, which are foundational for most business workflow approaches, dataflow process networks found in a number of scientific workflow systems, and XML stream processing models, among others. We argue that in many scientific applications, data coherence is a crucial but often neglected aspect that is rarely found in current MoCs, resulting in MoCs (such as vanilla Petri nets and process networks) that are unnecessarily cumbersome and not resilient to change. To overcome such shortcomings, we propose a simple hybrid MoC that elegantly combines features from several MoCs and paradigms. In essence, our hybrid MoC views a scientific workflow as a configurable, pipelining data transducer over XML data streams. By exploiting an assembly-line paradigm, data coherence and resilience to changes are achieved as well.
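
A toy sketch of the transducer idea (our illustration, not the authors' formal MoC) is shown below: each step is a streaming stage that touches only the fields it declares and passes everything else through, which is the data-coherence property being argued for.

```python
# Toy streaming-transducer pipeline over dictionary records (stand-in for XML items).
def read_stream(records):
    for rec in records:
        yield dict(rec)                                  # records flow one at a time

def normalize(stream, field):
    for rec in stream:
        rec[field] = rec[field].strip().lower()          # touch only the declared field
        yield rec                                        # all other fields pass through

def filter_by(stream, field, value):
    for rec in stream:
        if rec.get(field) == value:
            yield rec

raw = [
    {"species": "  Quercus alba ", "site": "A", "count": 3},
    {"species": "acer rubrum", "site": "B", "count": 5},
    {"species": "Quercus alba", "site": "B", "count": 2},
]

# Steps can be inserted, removed, or reordered without rewriting the others.
pipeline = filter_by(normalize(read_stream(raw), "species"), "species", "quercus alba")
for rec in pipeline:
    print(rec)
```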

Data Mining

Carolina ChemBench (C-ChemBench): A Web-based Cheminformatics Expert System for the Analysis and Prediction of Biological Screening Data

Alexander Tropsha, UNC-Chapel Hill; Julia Grace, UNC-Chapel Hill; Hao Xu, UNC-CH; Tongan Zhao, UNC-CH; Chris Grulke; Berk Zafer; Diane Pozefsky, UNC-CH; Weifan Zheng, NCCU

The NIH’s Roadmap includes the Molecular Library Initiative (MLI) and the PubChem repository of biological assays of chemical compounds. In 2005 the MLI formed the Molecular Library Screening Centers Network (MLSCN). As of May 2007, there were 256 MLSCN bioassays deposited in PubChem for over 140,000 chemicals, making PubChem already the largest publicly available repository of bioactivity data. It promises to be comparable to, if not exceeding, the largest bioinformatics databases. The Carolina Center for Exploratory Cheminformatics Research (CECCR) was founded with Roadmap funding in 2006 to develop research cheminformatics tools and software to address the data mining and knowledge discovery challenges created by the MLI and PubChem projects. We have developed and deployed a prototype cheminformatics web server called C-ChemBench. It includes modules designed to address the needs of all constituent groups of chemical biology and drug discovery specialists, i.e., computational chemists (Model Development Module), biologists (Predictions Module), chemists (Library Design Module), and bioinformaticians (CECCR Base Module). We shall discuss several cheminformatics-specific data mining and knowledge discovery technologies (such as Quantitative Structure Activity Relationship modeling) for biological assay data analysis and provide several successful examples of applications. Our technologies (which also rely on distributed computing) afford robust and validated models capable of accurate prediction of properties for molecules not included in the training sets. This focus on knowledge discovery and property forecasting brings C-ChemBench forward as the major data-analytical and decision support cheminformatics server in support of experimental chemical biology research.
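
For readers unfamiliar with QSAR-style modeling, here is a generic, hedged sketch of the descriptors-to-prediction workflow on synthetic data; the model choice and data are ours and do not reflect C-ChemBench's validated modeling protocols.

```python
# Generic QSAR-style sketch on synthetic descriptors; illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n_compounds, n_descriptors = 300, 20

X = rng.normal(size=(n_compounds, n_descriptors))                       # molecular descriptors
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.2, size=n_compounds)   # assumed activity

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("held-out R^2:", round(model.score(X_test, y_test), 2))
print("predicted activity for one new compound:", model.predict(X_test[:1])[0])
```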

Mining the Sky in Real Time: Automated Detection, Classification, Dissemination, and Follow-up of Transient Astronomical Events

George Djorgovski, Caltech; Roy Williams, Caltech; Ashish Mahabal, Caltech; Andrew Drake, Caltech; Matthew Graham, Caltech; Ciro Donalek, Caltech; Eilat Glikman, Caltech

We describe an example of real-time mining of massive data streams, taken from the rapidly developing field of time-domain astronomy and synoptic digital sky surveys. A typical scientific process consists of an iterative loop of measurements, their analysis, additional measurements implied by the analysis, and so on. Typically this occurs on time scales of months or years. But if the relevant time scales of the phenomena under study are in the range of minutes to hours, the process must be automated, with no humans in the loop – especially if the data flux and volume are in the TB or PB range. We describe a system to discover, classify, and disseminate astronomical transient events, which involves a robotic telescope network with feedback: automatically requested follow-up observations are folded back into an iterative analysis and classification of observed events. These may include a variety of cosmic explosions (e.g., supernovae, cataclysmic variables), inherently variable objects (e.g., stars, quasars), moving objects (asteroids, dwarf planets), and possibly even some previously unknown types of objects and phenomena. Rapid and automated follow-up is essential for their physical understanding and scientific use. The system operates within a broader Virtual Observatory framework, and it includes a variety of computational components, from data reduction pipelines to federated archives, web services, machine learning, etc. It represents a test bed for many technologies needed for a full scientific exploitation of current and forthcoming synoptic sky surveys. It also has broader relevance for other situations that require automated and rapid data mining of massive data streams.
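
A highly simplified, hedged sketch of the detection step (not the authors' pipeline) is given below: each new flux measurement is compared against a per-source baseline, and strong deviations are flagged as transient candidates for follow-up.

```python
# Toy transient flagging on synthetic photometry; thresholds and data are invented.
import numpy as np

def detect_transients(baseline_flux, new_flux, threshold_sigma=5.0):
    """Return indices of sources whose new flux deviates strongly from their baseline."""
    mu = baseline_flux.mean(axis=1)
    sigma = baseline_flux.std(axis=1) + 1e-9       # guard against zero variance
    deviation = np.abs(new_flux - mu) / sigma
    return np.where(deviation > threshold_sigma)[0]

rng = np.random.default_rng(3)
baseline = rng.normal(100.0, 1.0, size=(1000, 20))  # 1000 sources, 20 prior epochs
latest = rng.normal(100.0, 1.0, size=1000)
latest[42] += 30.0                                   # inject a synthetic outburst

print("candidates for follow-up:", detect_transients(baseline, latest))  # should flag index 42
```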

Towards Text Mining Terabytes of Text Documents

Firat Tekiner, University of Manchester; Sophia Ananiadou, University of Manchester

The continuing rapid growth of data and knowledge expressed in the scientific literature has spurred huge interest in text mining (TM). The individual researcher cannot easily keep up with the literature in their domain, and knowledge silos further prevent integration and cross-disciplinary knowledge sharing. The National Centre for Text Mining (NaCTeM) offers TM services to the academic community, allowing users to apply TM techniques to a variety of problems in their areas of interest. NaCTeM is entering a new phase, where the goal is to move from processing abstracts to full texts and to data-mine the voluminous results to discover relationships yielding new knowledge. The expansion to new domains and the increase in scale will massively increase the amount of data to be processed by the Centre (from gigabytes to terabytes). In this work, we investigate approaches that use high performance computing (HPC) to tackle the data deluge for large-scale TM applications. Although TM applications are data-independent, handling large text collections becomes an issue when full-text data is considered, because of the problem sizes involved. Each step in the TM pipeline adds further information to the initial raw text, and the data size increases as processing progresses. The initial work focuses on tagging and parsing of text using TM applications, and scaling up to 64-128 processors has been achieved. However, when scaling to a larger number of processors, data and work distribution will also be an issue due to the unstructured nature of the data. In addition, we aim to create a framework to move and handle the large amounts of data exchanged between the many processes in the TM pipeline. We will discuss the challenges encountered when mining large numbers of texts and the future work needed to text-mine full papers.
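
The data-parallel decomposition described above can be sketched in a few lines; the "tagger" here is a trivial stand-in for a real TM pipeline stage, and the corpus is invented.

```python
# Minimal sketch of distributing independent documents over worker processes.
from multiprocessing import Pool

def tag_document(doc):
    """Toy tagger: mark capitalized tokens as candidate entities."""
    return [(tok, "ENTITY" if tok[:1].isupper() else "O") for tok in doc.split()]

documents = [
    "NaCTeM provides text mining services",
    "Full papers are much larger than abstracts",
] * 4  # stand-in for a large corpus

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        tagged = pool.map(tag_document, documents, chunksize=2)
    print(len(tagged), "documents tagged; first:", tagged[0])
```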

Posters

Open Source eScience Geospatial Visualization Using .NET Technology

Patrick Hogan, NASA

The need for massive communication and dynamic sharing of scientific data has never been greater than it will be in the world that awaits our children. The ability to integrate, analyze, and exchange both local and global information is critical to maximizing our understanding of our circumstances, whether for ground-truthing of satellite data (Earth’s carbon budget), coalescing field data for regional projections (North Africa to North India locust intervention), or simply innovative analyses coming from world-wide access to global data, and whether it be on behalf of academia, governments, or enfranchised individuals from the global community. This realm of scientific understanding needs the kind of innovation that comes from coding environments that provide the greatest opportunity for the development of solution-based technology. Competition in this realm should be based purely on results engendered by access to the scientific data. The .NET programming environment provides a compelling solution for scientific endeavors to maximize solution-based analyses and it also equally serves the geospatial visualization technology needed to effectively share this information.

Optimizing Life Sciences Data Transfer to Mobile Devices

Greg Quinn, University of California, San Diego

Within the past few years, numerous cell phone platforms have come to market that provide more than sufficient technical capability to enable advanced information visualization. Accompanying these advances in telecommunications hardware is the increasing maturity and capability of Smart Phone operating systems such as Windows Mobile 6.0. This has led to people from all walks of life increasingly depending on their cell phones to provide not only telecommunications functionality but also Internet-based information access and entertainment. Here we describe work in progress to use the Windows Communication Foundation capability in the .NET Framework version 3.0 to efficiently serve bioinformatics data on-the-fly to Smart Phone devices running the Windows Mobile operating system. We will also discuss the use of binary-formatted data transfer as a means to increase the download and processing efficiency of Protein Data Bank (PDB) data stored in a Microsoft SQL Server database.
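
The size advantage of binary transfer can be shown with a small hedged sketch; the record layout below is our own invention for illustration, not the PDB file format or the actual wire format used by the service.

```python
# Invented fixed-width record layout for atom coordinates; illustrative only.
import struct

atoms = [  # (serial, x, y, z) in angstroms: made-up coordinates
    (1, 11.104, 6.134, 2.033),
    (2, 12.560, 6.071, 1.899),
    (3, 13.245, 7.420, 2.106),
]

record = struct.Struct("<i3f")                     # 4-byte serial + three 4-byte floats
payload = b"".join(record.pack(*a) for a in atoms)
text_equivalent = "\n".join("%6d %8.3f %8.3f %8.3f" % a for a in atoms)

print("binary payload:", len(payload), "bytes")
print("text payload:  ", len(text_equivalent.encode()), "bytes")

# The client can unpack records incrementally as they stream in.
for serial, x, y, z in record.iter_unpack(payload):
    print(serial, x, y, z)
```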

Smart Irrigation Control based on Cognitive Wireless Sensor Networks

Supratik Mukhopadhyay, Utah State University; Krishna Shenai, University of Toledo; Ramesh Bharadwaj, NRL

World demand for fresh water is increasing, and competition for the allocation of water between the urban and agricultural sectors is rapidly growing in arid and semi-arid climates. This has brought an emphasis on intensive water management to achieve greater system efficiencies, especially in irrigated agriculture in arid regions such as the western US. Further, studies by the FAO (Food and Agriculture Organization) and others predict that in the coming 20 years, this competition for water will present potentially serious economic, political, and social problems for much of the population in both the urban and rural areas of developing countries, especially in the arid and semi-arid regions of the world. We present a novel irrigation control system to intelligently and reliably manage large soil and water ecological systems for environmental and agricultural applications. Reliability is an important concern in the precise monitoring and control of soil and water properties, since any malfunction can result in financial as well as environmental disaster. Our controller consists of novel sensors and uses state-of-the-art distributed information fusion and networking technologies for multi-zone implementation. It integrates intelligent sensor coordination and data fusion techniques to access, retrieve, process, and communicate with disparate wireless sensors in an ad-hoc manner to deliver reliable dynamic decisions and provide adequate information management. Our approach drastically reduces hardware cost, by almost a factor of 10, and removes the main bottleneck in irrigation control arising from wired sensors. In addition, it provides a smart control mechanism, with formal reliability guarantees, that is reconfigurable at runtime in response to changing requirements.

Programming in the Large: Integrating Simulation and Visualization

Christoph Hoffmann, Voicu Popescu, Purdue University

Visualization is a core task in scientific computations, and in interdisciplinary settings it becomes even more important given the need to communicate insights across the disciplinary expertise within the team. We explain how to integrate state-of-the-art finite element analysis (FEA) and visualization systems. Instead of replicating the functionality of one system in the other, we federate the systems by automated translation of FEA results into a form suitable for the animation/visualization system. This includes bridging the gap between different geometry conceptualizations, inverting and visually concretizing abstractions convenient for FEA, deriving visualization strategies that scale with the number of simulation elements and states, and placing the simulation results in the context of the surrounding scene. We demonstrate our approach with the recently completed simulation and animation of the crash of AA-11 into the North Tower of the World Trade Center, a video that has been downloaded more than 1.3M times to date. We discuss some of the research issues that arose and describe some of the benefits for the FEA when high-end visualization is considered part of the effort. In the broader context, our work finds applications in VR training, in forensics, and in communicating with a wide audience outside of the scientific community.

Declarative and Efficient Querying on Biological Datasets

Jignesh Patel, University of Michigan

Modern life sciences explorations often need to analyze and manage large volumes of complex biological data. Unfortunately, existing life sciences applications often employ awkward procedural querying methods and use query evaluation algorithms that do not scale as the data size increases. For example, data is often stored in flat files and queries are expressed and evaluated by programs written in Python. The perils of employing such procedural querying methods are well known to a database audience, namely a) severely limiting the ability to rapidly express complex queries, and b) often resulting in very inefficient query plans as sophisticated query optimization and evaluation methods are not employed. The problem is likely to get worse in the future as many life sciences datasets are growing at a rate faster than Moore’s Law. Furthermore, the queries that scientists want to pose are also rapidly increasing in their complexity. The focus of this talk is on a database approach to querying biological datasets. The talk describes ongoing work in the Periscope project in which we are developing a system for declarative and efficient querying on biological graphs and sequence databases. This talk will also highlight how these database methods allow a scientist to work in a loop of a) first posing queries, b) viewing the results, c) then refining and reposing a modified query, and d) continuing through this iterative process until an answer has been found. The efficiency of the system enables the scientist to explore even large biological databases in real time.
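
The contrast between procedural and declarative querying can be made concrete with a small hedged sketch (using SQLite purely for illustration, not Periscope itself): the same question is answered by filtering in application code and by stating the condition declaratively, leaving the plan to the engine.

```python
# Procedural filtering vs. a declarative query; schema and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proteins (id TEXT, organism TEXT, length INTEGER)")
conn.execute("CREATE INDEX idx_org ON proteins(organism)")
conn.executemany("INSERT INTO proteins VALUES (?, ?, ?)", [
    ("P1", "E. coli", 320), ("P2", "H. sapiens", 1210), ("P3", "E. coli", 98),
])

# Procedural style: fetch everything, filter in application code.
procedural = [row for row in conn.execute("SELECT * FROM proteins")
              if row[1] == "E. coli" and row[2] > 100]

# Declarative style: state the condition; the optimizer chooses how to evaluate it.
declarative = conn.execute(
    "SELECT id, length FROM proteins WHERE organism = ? AND length > ?",
    ("E. coli", 100)).fetchall()

print(procedural, declarative)
```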

Creating and Querying Workflows by Analogy

Claudio Silva, Juliana Freire, Carlos Scheidegger, David Koop, Huy Vo; University of Utah

Workflow systems have recently emerged as an alternative to the ad-hoc approaches to constructing computational tasks widely used in the scientific community. These systems can capture complex analysis processes at various levels of detail and systematically capture the provenance information necessary for reproducibility, result publication, and sharing. Although the benefits of using workflow systems are well known, the fact that workflows are hard to create and maintain has been a major barrier to wider adoption of the technology in the scientific domain. Constructing complex analysis processes requires expertise both in the domain of the data being explored and in using a number of different analysis and visualization tools. Furthermore, the path from “data to insight” requires a laborious trial-and-error process, where users successively assemble, modify, and execute multiple workflows. We advocate a data-centric view of workflow-based computational processes, where the workflows and information about their evolution are stored, along with their impact on the data they manipulate. This information captures detailed provenance of the steps followed in exploratory processes. We propose a new framework that lets users explore and re-use this detailed provenance information through intuitive interfaces. Our framework consists of two key components: a query-by-example interface whereby users query workflows through the same familiar interface they use to create them; and a mechanism for semi-automatically creating and refining workflows by analogy, without requiring users to directly manipulate or edit the workflow specifications. In this talk, we will describe the framework and demonstrate its use in VisTrails (www.vistrails.org), a publicly available open-source system.

Scientific and Technological Challenges in Developing a Real-Time Syndromic Surveillance System

Vicki Hertzberg, Douglas Lowery-North, Walter Orenstein, James Buehler, Lance Waller, Eugene Agichtein; Emory University

Rapid detection of disease outbreaks and response to cases is an important public health function. Definitive diagnoses and subsequent reporting can lag initial case presentation by days or weeks, a critical weakness in outbreak detection. In addition, timely notification of outbreaks to healthcare providers by a central public health authority is also crucial. However, the best strategies for such notification have not been determined. We describe here the potential for developing a real-time syndromic surveillance (SS) system using three healthcare systems in a large urban area with reciprocal interface from the state PH agency. These systems cover patients presenting in the hospital emergency departments (four adult, three pediatric) and primary care clinics as well as related laboratory and radiology orders. This system presents many scientific and technological challenges. How can we best integrate data sets within and between systems rapidly? Is there benefit to monitoring the health status of a particularly vulnerable population comprising one of the hospitals? What tools are necessary to detect “blips” suggesting events of interest? Can we automate epidemiologic investigation of such events? Can we apply performance improvement tactics to reduce waste and improve value in SS data collection, analysis, and reporting? How can free text records, such as dictations, be utilized to improve sensitivity and positive predictive value of SS? How can we best give meaningful real time feedback to clinicians regarding PH alert information? What is the most valuable information to provide to these clinicians? What are the most valuable actions for providers to accomplish with such information? Should space be reserved in electronic?

RAY: A System Supporting Multiple Contending Scanning Queries on Large Scientific Data Sets

Robert Grossman, Dave Hanley, University of Illinois; Jennifer Schopf, Argonne National Laboratory

Many applications perform queries to large scientific data sets that involve scanning the entire data set in the sense that each record must be checked to see if a given condition is satisfied. In contrast, there is often an implicit assumption by the database developers that latency must be optimized, and an expectation that data is indexed in such a way that a relatively small amount of the data needs to be retrieved in order to satisfy the query. We are interested in the case seen by applications including SDSS, BLAST, and others in which there are multiple contending scanning queries and the end user wishes to optimize total throughput. In this paper, we define a system called RAY that collects scanning queries as they arrive, presents them with the entire database chunk by chunk, and releases them after the entire database has been scanned, thereby increasing the performance of multiple contending scanning queries by reducing the number of aggregate disk reads. We present experimental studies using a large astronomy data set from the Sloan Digital Sky Survey and realistic queries from that experiment that touch varying amounts of data, from 100% down to 20%. We show that RAY is significantly faster than directly passing the queries to the database. When 100% of the data is touched this can be true even when there is no contention, and for less data touched in the scan, RAY can achieve better performance for as few as 2 or 3 contending scanning queries.
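
The scan-sharing idea can be illustrated with a toy hedged sketch (not the RAY implementation): pending scanning queries are gathered, the data set is read chunk by chunk exactly once, and every query is evaluated against each chunk as it passes by.

```python
# Toy scan sharing over invented sky-survey records.
def shared_scan(chunks, queries):
    """chunks: iterable of record lists; queries: dict name -> predicate."""
    results = {name: [] for name in queries}
    for chunk in chunks:                     # one pass over the data serves all queries
        for record in chunk:
            for name, predicate in queries.items():
                if predicate(record):
                    results[name].append(record)
    return results

chunks = [  # hypothetical records: (object_id, magnitude, redshift)
    [(1, 17.2, 0.05), (2, 21.9, 1.30)],
    [(3, 19.4, 0.40), (4, 22.5, 2.10)],
]
queries = {
    "bright": lambda r: r[1] < 20.0,
    "high_z": lambda r: r[2] > 1.0,
}
print(shared_scan(chunks, queries))
```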

Controlled Sharing of Scientific Data using SecPAL

Marty Humphrey, Sang-Min Park, Jun Feng, Norm Beekwilder, Glenn Wasson, Jason Hogg, Brian LaMacchia, Blair Dillaway; University of Virginia

Access control policy languages today are generally one of two extremes: either extremely simplistic, or overly complex and challenging for even security experts to use. In this presentation, we explicitly identify requirements for an access control policy language for scientific data and then consider six specific data access use-cases that have been problematic in multi-institutional collaborations: attribute-based access, role-based access, “role-deny” access, impersonation-based access, delegation-based access, and capability-based access. We evaluate the Microsoft Research Security Policy Assertion Language (SecPAL) against those requirements, specifically in the context of these six use-cases involving GridFTP.NET. We find that while some of these six use-cases are individually possible via existing authorization systems, we believe that SecPAL uniquely offers a single approach that meets the requirements of a multi-institutional access control policy language, thereby creating support for a wide range of expanded scenarios for controlled sharing of scientific data.

Science 2.0

Bora Zivkovic, Public Library of Science

Online technologies are fundamentally changing the world of science: how research is performed, how science is taught and communicated, and how scientists’ networks are formed. The meteoric rise in the number, quality, and prestige of Open Access journals, the rising interest in Open Notebook Science, the proliferation of science blogs, the increased use of existing social networks (e.g., Facebook), and the formation of science-specific networks (e.g., Postgenomic, Connotea) all contribute to big changes in the structure of the scientific enterprise that upset the traditional model.

Online Notes-Taking-Sharing System

C. Augusto Casas, St Thomas Aquinas College

Taking notes is the most common activity of students in the classroom. College students’ use of technology has increased significantly in the last several years. Students now attend class armed with PDAs, laptops, and especially cell phones. These last devices are more than telephones: cell phones include calculators, web browsers, instant messaging software, phone books, digital cameras, video players, and games. Research conducted by the author found that students can benefit academically from such technology. More specifically, class experiments demonstrated that when students use personal computers to take and share notes, class participation and test scores increase. Microsoft Office Live Meeting was used as the underlying technology. Lectures were given to students divided into two groups. One group shared notes with Live Meeting. The other group took notes individually. A day after the lecture, both groups took the same test. The experiment was conducted multiple times with different pools of students. Results showed that students using the notes-taking-sharing system were more actively engaged in class and scored better on the test. The results were consistent across all groups tested. With the Live Meeting system, each student was assigned a section of an online whiteboard. Each student took notes in her/his area while looking at the notes taken by classmates. At the end of the lecture, students who used the online system could save and keep a copy of the online whiteboard. The experiments showed that students are more likely to engage in class and less likely to be distracted by other activities when they are working within this collaborative environment. The next research phase intends to determine whether such a system helps disadvantaged students.

Understanding Computational Requirements for Preservation and Reconstruction of Computer-Assisted Decision Processes

Peter Bajcsy, Sang-Chul Lee, NCSA/UIUC

We discuss the problem of understanding computational requirements for the preservation of computer-aided decisions. Computer-aided decisions increasingly impact our society. These decisions have to be documented semi-automatically, and the electronic records have to be appraised and understood in terms of their preservation and reconstruction cost. Currently there is no simulation framework that can support understanding and forecasting of computational requirements for preservation purposes. Our objective has been to develop such an exploratory simulation framework, one that allows archivists and other users to explore and evaluate computational costs as a function of several key preservation variables of appraised records. Thus, the application of our simulation framework is in supporting investigations of preservation tradeoffs and improving appraisals of electronic records. We first outline a prototype of such simulation software, called Image Provenance To Learn (IP2Learn), which has been developed for a class of computer-aided decisions based on visual image inspection. The current software enables users to explore some of the tradeoffs related to (1) information granularity (category and level of detail), (2) representation of provenance information, (3) compression, (4) encryption, (5) watermarking and steganography, (6) the information gathering mechanism, and (7) the final report content (level of detail) and its format. The simulation software consists of Image Viewer (visual inspection of images), Event Tracker (information gathering), Event Reviewer (decision reconstruction), and Final Report Editor (semi-automatic report generation). We will also illustrate example tradeoff studies using IP2Learn for a specific image inspection task.

Rapid Adoption of Visualization Cyber infrastructure in the Atmospheric Sciences Classroom

David Lee, Perry Samson, Erik Hofer, University of Michigan

In early 2007 the Department of Atmospheric, Oceanic and Space Sciences (AOSS) and the School of Information (SI) at the University of Michigan collaborated on the installation of a 50-million-pixel OptIPortal, or tiled display, utilizing OptIPuter technologies for applications spanning high-resolution image exploration to multi-modal atmospheric visualizations. In addition to research and persistent display tasks, the OptIPortal was incorporated into the undergraduate curriculum by requiring students to use the display to demonstrate their understanding of principles in the atmospheric sciences. This presentation discusses the rapid adoption of ultra-high-resolution visualization cyber infrastructure in a classroom setting. The AOSS student group demonstrated the ability to effectively utilize advanced cyber infrastructure using the interfaces provided by a software stack, enabling them to rapidly prototype compelling applications that take advantage of the high-resolution display despite the technical complexity of the system. Utilizing these tools, the students produced projects ranging from conventional PowerPoint presentations, to distributed and parallel rendering of movie files, to dynamic multi-modal and multi-resolution weather visualizations to aid in the prediction or understanding of atmospheric phenomena. In analyzing their achievements, observations of and interactions with the student group provided insight into how the OptIPuter software driving the tiled display enabled students to rapidly prototype meaningful visualizations aiding their course projects. Considering these results, we are optimistic that these experiences point to the feasibility and utility of introducing OptIPortals to the classroom, as well as to lessons for the next generation of control software for high-resolution displays.

Grid2Win: Porting gLite to Windows-based Platforms

Fabio Scibilia, Dario Russo, INFN-Catania

The grid paradigm has emerged as the next step in the evolution of distributed computing. The gLite middleware (http://www.glite.org) is one of the most popular grid middleware stacks; it is developed in the context of the EGEE project (http://www.eu-egee.org), which has built the largest grid infrastructure for e-Science in the world. At present, gLite essentially runs on Linux platforms, which has up to now kept Microsoft Windows users and applications out of the EGEE infrastructure. The aim of the Grid2Win project is to port basic gLite services to run under MS Windows, to give Windows users access to grid facilities and to make possible the integration of Windows applications with the grid. Among all gLite services, we focus on the User Interface (UI), the set of command-line tools used to access grid resources, and the Computing Element (CE), the grid service managing the computing power of the grid. Each CE wraps a Local Resource Management System (LRMS) exploiting its computing power. Using Cygwin as a POSIX emulation environment, we successfully ported the gLite User Interface to run under MS Windows XP and developed a GUI on top of it. Moreover, we ported the Torque/MAUI (free release of the PBS job scheduler) based CE as the first Windows CE. Encouraged by these results, we also successfully integrated Microsoft Compute Cluster Server (CCS) into gLite as the first Windows-native LRMS recognized by gLite. The presentation will report on the activities carried out so far as well as on future plans.

ChemXSeer: An eChemistry Web Search Engine and Repository

Lee Giles, Prasenjit Mitra, Levent Bolelli, Xiaonan Lu, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, James Z. Wang, Karl Mueller, William Brouwer, James Kubicki, Barbara Garrison, Joel Bandstra, Pennsylvania State University

In chemistry, the growth of data has been explosive, and timely, effective information and data access is critical. We propose the NSF-funded ChemXSeer architecture, a portal for academic researchers in environmental chemistry, which integrates the scientific literature with experimental, analytical, and simulation datasets. ChemXSeer will comprise information crawled from the web, manually submitted scientific documents and user-submitted datasets, as well as scientific documents and metadata provided by major publishers. Information crawled by ChemXSeer from the web and user-submitted data will be publicly accessible, whereas access to publisher resources can be provided by linking to their respective sites. Thus, instead of being a fully open search engine and repository, ChemXSeer will be a hybrid, limiting access to some resources. ChemXSeer intends to offer some unique aspects of search not yet present in other scientific search services. We are developing algorithms for the extraction of tables, figures, equations, and formulae from scientific documents, enabling users to search on those fields. ChemXSeer intends to provide search features including: full-text search; author, affiliation, title, and venue search; figure and table search; equation and formula search; citation and acknowledgement search; and citation linking and statistics. For dataset search, we are developing tools that automatically annotate published data representations such as figures, and that permit researchers to annotate their datasets by providing both document-level and attribute-level metadata in OAI-PMH format to facilitate searching data more effectively at both the attribute and semantic levels, browsing datasets, and linking to existing scientific literature and other datasets.

Design and Synthesis of Minimal and Persistent Protein Complexes

David Green, Steven Skiena, Stony Brook University

A major problem in synthetic biology is the tendency of bacterial systems to eliminate any genes that do not directly benefit the organism, as a result of natural selection favoring shorter genome lengths, which can be replicated more quickly. We are working on advances in computational protein and gene design that directly address this problem. We have previously demonstrated an algorithm capable of creating the shortest nucleotide sequence that encodes two given proteins, taking advantage of multiple reading frames and the redundancy of the genetic code. We also have expertise in computational approaches to the redesign of proteins to satisfy particular functions. We are currently working to integrate these technologies toward two particular goals. The first involves interleaving an antibiotic resistance gene with a particular protein whose expression is desired. Challenging bacteria containing this construct with the appropriate antibiotic will create a selective pressure to keep the inserted gene; because the sequence of the protein of interest overlaps this coding sequence, deletion of the desired protein from the genome will be avoided. Secondly, we are developing methods to directly reduce the coding length for a given protein, taking a two-step approach: (1) redesign a multi-domain protein consisting of a single polypeptide sequence into a protein complex; (2) overlap the coding sequences of the two components, leading to a substantially reduced length of DNA that codes for a functionally equivalent protein. Our approach integrates protein design, coding-sequence optimization, and validation in an experimental context to address a major problem in the long-term viability of synthetic biological networks. We will present our initial results in targeting these problems.

Computational Biology Applications Suite for High Performance Computing (BioHPC.net)

Jaroslaw Pillardy, Cornell University

One of the challenges of High Performance Computing (HPC) is user accessibility. At the Cornell University Computational Biology Service Unit, which is also a Microsoft HPC institute, we have developed a computational biology application suite that allows researchers from biological laboratories to submit their jobs to a parallel cluster through an easy-to-use web interface. Through this system, we provide users with popular bioinformatics tools including BLAST, HMMER, InterProScan, and MrBayes, among others. The system is flexible and can be easily customized to include other software. It is also scalable; the installation on our servers currently processes approximately 10,000 job submissions per year, many of them requiring massively parallel computations. It also has a built-in user management system which can limit software and/or database access to specified users. TAIR, the major database of the plant model organism Arabidopsis, and SGN, the international tomato genome database, both use our system for storage and data analysis. The suite will be released along with its source code this year. The system consists of a web server running the interface (ASP.NET, C#), a Microsoft SQL Server database (ADO.NET), a compute cluster running Microsoft Windows, an ftp server, and a file server. Users can interact with their jobs and data via a web browser, ftp, or e-mail. Remote HPC clusters can be accessed via the JSDL protocol. The interface is accessible at http://BioHPC.net/.

Accelerating Scientific Computations using a GPU: Fast N-Body Simulation with CUDA

Jan Prins, University of North Carolina, Chapel Hill; Lars Nyland, Mark Harris, Nvidia Corp.

Acceleration of computational kernels on a GPU is becoming simpler thanks to improved GPU programming models. We examine the all-pairs computational kernel for N-body simulation and its implementation using the NVIDIA CUDA programming model. We show how the parallelism available in the all-pairs kernel can be expressed in the CUDA model and how various parameters can be chosen to effectively engage the full resources of the first GPU to support CUDA, the NVIDIA GeForce 8800. We report on the performance of a familiar N-body kernel for astrophysical simulations. For this problem the GeForce 8800 calculates over 10 billion interactions per second, performing 100 integration time steps per second to simulate a system with 10,000 bodies. At 20 flops per interaction, this corresponds to a sustained performance in excess of 200 gigaflops, close to the theoretical peak performance of the GeForce 8800 GPU. The all-pairs approach is typically used as a kernel to determine the forces in close-range interactions; it is then combined with a faster method based on a far-field approximation of longer-range forces, which is valid only between parts of the system that are well separated. In all cases, a fast all-pairs kernel is essential to the overall performance of the N-body simulation.
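
For readers unfamiliar with the all-pairs kernel, the sketch below shows the basic computation in NumPy. This is a simplified, hypothetical illustration of the same mathematics, not the CUDA implementation described above; the CUDA version tiles this loop structure across thread blocks and shared memory.

    # Minimal all-pairs gravitational N-body step in NumPy (illustrative only;
    # the presentation's implementation uses CUDA on the GeForce 8800).
    import numpy as np

    def all_pairs_accel(pos, mass, soft=1e-2):
        """pos: (N, 3) positions, mass: (N,) masses, soft: softening length."""
        # r_ij = pos_j - pos_i for every pair, shape (N, N, 3)
        diff = pos[np.newaxis, :, :] - pos[:, np.newaxis, :]
        dist2 = (diff ** 2).sum(axis=-1) + soft ** 2          # softened |r_ij|^2
        inv_d3 = dist2 ** -1.5
        np.fill_diagonal(inv_d3, 0.0)                         # no self-interaction
        # a_i = sum_j m_j * r_ij / |r_ij|^3   (G = 1 units)
        return (diff * (mass[np.newaxis, :] * inv_d3)[:, :, np.newaxis]).sum(axis=1)

    def step(pos, vel, mass, dt=1e-3):
        """One integration step; 10,000 bodies means 10^8 pairwise interactions."""
        acc = all_pairs_accel(pos, mass)
        vel += acc * dt
        pos += vel * dt
        return pos, vel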

Virtual Institute for Integrative Biology (VIIB): an eScience Paradigm for Latin America

David Holmes, Life Science Foundation; Fernando González-Nilo, Center for Bioinformatics and Molecular Simulation; Raúl Isea, Apartado Postal 40336

This presentation examines the case of the Virtual Institute for Integrative Biology (VIIB) as a Latin American paradigm for achieving global collaborative eScience. Biology has emerged as one of the major areas of focus of scientific research worldwide, providing new challenges in eScience and grid computing. Whereas major efforts to meet these challenges have been mounted in various parts of the world, less appears to have been accomplished in Latin America, and the VIIB was developed to fill this need. The scientific agenda of the VIIB includes the construction and operation of databases for comparative genomics of particular relevance to Latin America, bioinformatics services, and protein simulations for biotechnological and medical applications. Human resource development through shared teaching, co-sponsored students, and seminars is also an integral component of the collaborative effort. eScience challenges include connectivity concerns, high performance computing (HPC) limitations, development of a customized Grid framework, language issues, maintenance of open access without compromising security, and the dissemination of scientific and technical information. Finally, it was recognized that computational frameworks and flexible workflows are required to efficiently exploit shared resources without creating impediments for users who have little interest in the underlying information technology (IT). Overall, the VIIB has proved an effective way for small teams to transcend the critical mass problem, to overcome geographic limitations, and to harness the power of large-scale collaborative science; as such, it may prove a useful model for promoting additional eScience initiatives in Latin America and other emerging regions.

eScience in Biomedical Engineering Research: Cancer Modeling and Simulation

Nahuel Olaiz, Esteban Mocskos, Mariano Perez Rodriguez, Lucas Colombo, Alejandro Soba, Cecilia Suarez, Graciela Gonzalez, University of Buenos Aires; Luis Nuñez, Argonne National Laboratory; Marcelo Risk, Guillermo Marshall, University of Buenos Aires

Here we describe an application in biomedical engineering. In cancer drug treatment, nothing can reach tumor cells without passing through the vessel wall and the interstitial matrix. Physicochemical and physiological barriers can hinder the main transport mechanisms, leading to heterogeneous accumulation of the therapeutic agent and leaving some cells untreated. The use of electric currents in chemotherapy greatly enhances drug transport and delivery. Cancer electrochemical treatment consists of the passage of an electric current, either direct (EChT) or micro-/nano-pulsed (ECT), through two or more electrodes inserted locally in the tumor tissue. Extreme pH changes at the tissue level (EChT) or the creation of porous membrane channels at the cell level, facilitating penetration of anticancer drugs into the cell (ECT), are the main tumor regression mechanisms. We study tumor drug transport for cancer treatment with nanoparticles (loaded with therapeutic agents) during EChT and ECT through a combined modeling methodology: in vivo with BALB/c mice bearing a subcutaneous tumor, in vitro with multi-cellular spheroids and collagen gels, and in silico using the Nernst-Planck, Poisson, and Navier-Stokes equations for ion transport, electric field distribution, and fluid flow, respectively. The main goal is to find nanoparticle/drug combinations, electric field intensities, and pulse frequencies that optimize tumor treatment. In this interdisciplinary approach we use a web-based I-Labs environment for confocal and fluorescence microscopy image processing, and HPC on a low-latency cluster under the MS CCS platform. Preliminary results suggest that using charged nano-drugs and tuned electric fields significantly increases drug delivery.
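
For reference, the textbook forms of the three coupled equations named above are shown below (notation: c_i ionic concentration, z_i valence, D_i diffusivity, phi electric potential, u fluid velocity, p pressure; the exact terms and boundary conditions in the authors' model may differ):

    \[
    \frac{\partial c_i}{\partial t}
      = \nabla \cdot \left( D_i \nabla c_i
        + \frac{z_i F}{RT}\, D_i c_i \nabla \phi
        - c_i \mathbf{u} \right)
      \quad \text{(Nernst-Planck: ion transport)}
    \]
    \[
    \nabla^2 \phi = -\frac{F}{\varepsilon} \sum_i z_i c_i
      \quad \text{(Poisson: electric potential)}
    \]
    \[
    \rho \left( \frac{\partial \mathbf{u}}{\partial t}
        + \mathbf{u} \cdot \nabla \mathbf{u} \right)
      = -\nabla p + \mu \nabla^2 \mathbf{u} + \mathbf{f},
    \qquad \nabla \cdot \mathbf{u} = 0
      \quad \text{(incompressible Navier-Stokes: fluid flow)}
    \]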

Measuring Circadian Activity Rhythms for Home Healthcare: Clinical Potentials and Home Automation Benefit

Gilles Virone

This summary presents custom software for Automatic Measurement of Circadian Activity Deviation, called SAMCAD. The primary goal of this software is to extract, from raw activity data collected through passive monitoring, Circadian Activity Rhythms (CAR), or home human behaviors, for the various populations who may benefit from home assistive technology. Based on a pattern mining algorithm, SAMCAD establishes the life rhythm of a resident in approximately three weeks of empirical observations, then tracks any behavioral changes that may occur during daily life at home. Early clinical trials show the potential to detect chronic pathologies such as urinary infections, or to evaluate cognitive decline or rehabilitation treatments. Knowledge of life habits, given by CAR-derived activity patterns based on the user's presence in each room, also permits setting up various home automation functions such as power management. For example, half-duplex radio transmissions, which are heavily used during long-term in-home wireless activity monitoring in sensor networks, can be efficiently regulated for energy saving by mapping the motes' behavior to the resident's behavior, while preserving a high quality of monitoring. The detection of deviations from these home behaviors, part of the CAR model, can also be useful for privacy, to reinforce rule-based systems dealing with dynamic Role-Based Access Control. Privileges to access personal medical data belong first to patients. However, they may be willing to automatically provide permissions to caregivers in short-term at-risk situations (falls, cardiac arrests), or in longer-term situations involving an abnormal CAR behavioral context. Such behavioral anomalies, which may be indicative of cognitive decline, can be used to warn caregivers for investigation.
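
As a rough illustration of the kind of computation involved (hypothetical code, not the SAMCAD algorithm itself, which is not detailed here), the sketch below builds an hourly room-presence profile from a few weeks of passive-sensor events and scores how strongly a new day deviates from it:

    # Illustrative circadian-activity-profile sketch (hypothetical; not SAMCAD).
    # Events are (day, hour, room) tuples derived from passive presence sensors.
    from collections import defaultdict

    def hourly_profile(events, rooms, n_days):
        """Fraction of observed days each room is occupied at each hour."""
        counts = defaultdict(int)                 # (hour, room) -> days seen occupied
        seen = set()
        for day, hour, room in events:
            if (day, hour, room) not in seen:
                seen.add((day, hour, room))
                counts[(hour, room)] += 1
        return {(h, r): counts[(h, r)] / n_days for h in range(24) for r in rooms}

    def deviation_score(day_events, profile, rooms):
        """Mean absolute difference between one day's occupancy and the profile."""
        occupied = {(h, r) for _, h, r in day_events}
        diffs = [abs((1.0 if (h, r) in occupied else 0.0) - profile[(h, r)])
                 for h in range(24) for r in rooms]
        return sum(diffs) / len(diffs)

    # A day whose deviation_score exceeds, say, the mean plus two standard
    # deviations of the training days could be flagged for caregiver review.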

Computational Insights into the Social Life of Zebras

Tanya Berger-Wolf, University of Illinois at Chicago; Daniel Rubenstein, Princeton University; Mayank Lahiri, Chayant Tantipathananandh, University of Illinois at Chicago; David Kempe, University of Southern California; Habiba Habiba, University of Illinois at Chicago; Jared Saia, University of New Mexico

Computation has fundamentally changed the way we study nature. Recent breakthroughs in data collection technology, such as GPS and other mobile sensors, are giving biologists access to data about wild populations that are orders of magnitude richer than any previously collected. Such data offer the promise of answering some of the big ecological questions about animal populations. Unfortunately, in this domain our ability to analyze data lags substantially behind our ability to collect it. In particular, interactions among individuals are often modeled as social networks, where nodes represent individuals and an edge exists if the corresponding individuals interacted during the observation period. The model is essentially static in that the interactions are aggregated over time and all information about the time and ordering of social interactions is discarded. We show that such traditional social network analysis methods may lead to incorrect conclusions on dynamic data about the structure of interactions and the processes that spread over those interactions. We have extended computational methods for social network analysis to explicitly address the dynamic nature of interactions among individuals, developing techniques for identifying persistent communities, finding influential individuals, and extracting patterns of interaction in dynamic social networks. We will present our approach and demonstrate its applicability by analyzing interactions among zebra populations and identifying how the structure of interactions changes with demographic status.
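
The contrast between the aggregated static network and the dynamic view can be made concrete with a small sketch (hypothetical code using the networkx library; the authors' own community-tracking algorithms are considerably more involved):

    # Static aggregation vs. per-timestep graphs of observed interactions
    # (illustrative only; not the authors' community-tracking method).
    import networkx as nx

    # Each observation: (time_step, individual_a, individual_b)
    observations = [(0, "z1", "z2"), (0, "z2", "z3"), (1, "z1", "z3"), (2, "z2", "z3")]

    # Traditional static network: all time information is discarded.
    static = nx.Graph()
    static.add_edges_from((a, b) for _, a, b in observations)

    # Dynamic view: one graph per time step, preserving the order of interactions.
    timesteps = sorted({t for t, _, _ in observations})
    dynamic = {t: nx.Graph() for t in timesteps}
    for t, a, b in observations:
        dynamic[t].add_edge(a, b)

    # A process spreading from z1 starting at t=1 can never travel over the t=0
    # edges, even though the static graph suggests z1, z2, z3 are all connected.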

Time-Space Continuity of Daily Maps of Fractional Snow Cover and Albedo from MODIS

Jeff Dozier, James Frew, University of California, Santa Barbara

Using reflectance values from the seven MODIS “land” bands at 250 m or 500 m resolution, along with a 1 km cloud product, we estimate the fraction of each 500 m pixel that is covered by snow, along with the albedo of that snow. These products are then used in hydrologic models in several mountainous basins. The daily products have data gaps and errors because of cloud cover and sensor viewing geometry. Rather than make users interpolate and filter these patchy daily maps without completely understanding the retrieval algorithm and instrument properties, we use the daily time series in an intelligent way to improve the estimate of the measured snow properties for a particular day. We use a combination of noise filtering, snow/cloud discrimination, and interpolation and smoothing to produce our best estimate of the daily snow cover and albedo. We consider two modes: a “predictive” mode, in which we estimate the snow-covered area and albedo on a given day using only the data up to that day, and a “retrospective” mode, in which we reconstruct the history of the snow properties for a previous period.
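
A toy version of the gap-filling step for a single pixel might look like the following (hypothetical sketch; the actual product combines noise filtering, snow/cloud discrimination, and viewing-geometry handling not shown here):

    # Toy gap-filling of a daily fractional-snow-cover series for one pixel
    # (illustrative only). NaN marks cloud-obscured or otherwise bad retrievals.
    import numpy as np

    def retrospective_fill(daily_fsc, window=5):
        """Linearly interpolate across gaps, then apply a centered moving average
        (retrospective mode: uses observations from both before and after a gap)."""
        x = np.arange(daily_fsc.size)
        good = ~np.isnan(daily_fsc)
        filled = np.interp(x, x[good], daily_fsc[good])
        kernel = np.ones(window) / window
        return np.convolve(filled, kernel, mode="same")

    def predictive_estimate(daily_fsc, day):
        """Predictive mode: use only observations up to 'day' (here, last valid value)."""
        history = daily_fsc[: day + 1]
        good = np.where(~np.isnan(history))[0]
        return history[good[-1]] if good.size else np.nan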

A Swiss-Army Knife for Parallel Sequence-Search in Biomedical Informatics

Jeremy Archuleta, Wuchun Feng, Eli Tilevich, Virginia Polytechnic Institute and State University

The biomedical and life sciences communities make heavy use of BLAST (Basic Local Alignment Search Tool) to characterize an unknown sequence by comparing it against a database of known sequences. The similarity between pairs of sequences enables biologists to detect evolutionary relationships and infer biological properties of the unknown sequence; for example, it can be used for phylogenetic profiling, bacterial genome annotation, and pathogen detection. Unfortunately, BLAST has proven too slow to keep up with the current rate of sequence acquisition: searching for a given sequence against the nucleotide database takes nearly three times longer today than it did in 2004, despite faster hardware. Thus, we created mpiBLAST, a novel parallelization of BLAST that runs on many OS platforms, including Microsoft Windows. mpiBLAST can deliver super-linear speed-up and scale to tens of thousands of processors thanks to an array of integrated features, including database and query segmentation, advanced job scheduling and load balancing, and parallel I/O. Currently, mpiBLAST v1.4 delivers a 305-fold speedup when running on a 128-processor cluster. By abstracting the execution characteristics of sequence-search algorithms such as BLAST, mpiBLAST has evolved to efficiently transform any given serial sequence-search tool into a parallel one, thus delivering the above performance to an entire class of sequence-search algorithms. This new version of mpiBLAST (v2.0) achieves this by utilizing “mixing layers” to separate functionality into complementary modules and “refined roles” within each layer to improve the inherently modular design, thus enhancing maintainability and extensibility, e.g., allowing advanced algorithmic features to be developed and incorporated while routine maintenance of the code base continues.
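
The database-segmentation idea at the heart of this approach can be conveyed in a short sketch (hypothetical Python; mpiBLAST itself is an MPI-based implementation whose scheduling and parallel I/O go far beyond this):

    # Sketch of database segmentation for parallel sequence search (illustrative;
    # not mpiBLAST's implementation). Each worker searches the query against its
    # own database fragment, and per-fragment hits are merged afterwards.
    def read_fasta(path):
        """Yield (header, sequence) records from a FASTA file."""
        header, seq = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(seq)
                    header, seq = line[1:], []
                else:
                    seq.append(line)
            if header is not None:
                yield header, "".join(seq)

    def segment(records, n_fragments):
        """Round-robin assignment of database records to fragments."""
        fragments = [[] for _ in range(n_fragments)]
        for i, rec in enumerate(records):
            fragments[i % n_fragments].append(rec)
        return fragments

    # fragments = segment(read_fasta("nt.fasta"), n_fragments=128)
    # Each of the 128 workers then runs the serial search tool on its fragment,
    # and the per-fragment results are merged and re-ranked by score.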

Global Climate Warming in the Machine Room

Wuchun Feng, Virginia Polytechnic Institute and State University

For decades now, the notion of performance has been synonymous with speed. For example, while the performance of supercomputers running our n-body cosmology code has improved nearly 10,000-fold since 1992, the performance per watt has improved only 300-fold and the performance per square foot only 65-fold. The “mere” 300-fold increase in performance per watt implies that supercomputers are not making advances in power efficiency as significant as their advances in performance; likewise, the relatively minuscule 65-fold increase in performance per square foot (or, alternatively, performance per square meter) means that advances in space efficiency, when compared to performance, have been virtually non-existent. These smaller gains in efficiency oftentimes result in the design and construction of new machine rooms and, in some cases, require the construction of entirely new buildings. Unfortunately, this particular focus has led to the emergence of supercomputers that consume egregious amounts of electrical power and produce so much heat that extravagant cooling facilities must be constructed to ensure proper operation. In addition, the emphasis on speed as the performance metric has adversely affected other performance metrics, e.g., reliability. As a consequence, all of the above has contributed to an extraordinary increase in the total cost of ownership (TCO) of a supercomputer. Therefore, we espouse the importance of being green in high-performance computing and argue for a complementary list to the TOP500: The Green500 List.
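
The implication of these ratios is easy to check with a back-of-the-envelope calculation (the derived growth factors below are this editor's arithmetic, not figures from the talk):

    # Back-of-the-envelope check of the efficiency ratios quoted above.
    perf_gain = 10_000        # overall performance improvement since 1992
    perf_per_watt_gain = 300  # improvement in performance per watt
    perf_per_sqft_gain = 65   # improvement in performance per square foot

    power_growth = perf_gain / perf_per_watt_gain    # ~33x more electrical power
    space_growth = perf_gain / perf_per_sqft_gain    # ~154x more machine-room space

    print(f"Power draw grew roughly {power_growth:.0f}x; "
          f"floor space roughly {space_growth:.0f}x.")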

E-Malaria: Getting into the Blood of Young Scientists

Jeremy Frey, University of Southampton

The e-Malaria project aimed to bring together 16-18 year old school students with university researchers to explain aspects of computational drug design using the example of the hunt for new anti-malarial drugs. Malaria kills a child every thirty seconds, and 40% of the world’s population lives in countries where the disease is endemic. Resistance to existing drugs is increasing, and with global warming the range of the malaria-carrying mosquitoes is expected to increase, so there is a growing need for new drug compounds. The challenge presented to the school students was to use a distributed drug search and selection system, via a web interface, to design potential drugs to act on the DHFR enzyme. The project makes use of industrial code for the docking study (“GOLD” from CCDC) and as such offers valuable lessons in how to achieve the integration of industrial programs into a “free” outreach environment. The results of the trials are displayed in an accessible manner, giving students an opportunity for discussion and debate with both peers and university researchers, and to learn about computational drug design and chemistry in general. The initial outreach project was extended to provide a similar challenge for undergraduate chemists as part of a chemical informatics course. For this course, more complex design and modeling challenges were devised that used the same e-Malaria core programs but at a level appropriate to more advanced chemical skills. The types of problems devised will be illustrated in the presentation.

Xbox Science: Video Games Where Everybody Wins!

Leonard McMillan, University of North Carolina at Chapel Hill

What if solving nature’s puzzles was entertaining as well as fulfilling? Would you rather play a first-person shooter, or be the first person to figure out a gene’s function? Or is it possible to do both? This is the challenge that I gave a class of graduate students. We explored the potential of game interfaces, game-design principles, and game production approaches for constructing bioinformatics tools. You might ask why. 1) Set-top supercomputers. The most powerful computer in most homes today is a video-game console. Today’s machines boast multiple cores and 100+ MFlop performance with high-end graphics. Moreover, at $299, they represent one of the best MFlop-per-dollar ratios in history. 2) Most bioinformatics applications stink. Typical bioinformatics tools require their user to be literate in statistics, computer science, and biology. Imagine if, in order to drive a car, you had to simultaneously be a test driver, a mechanic, and a combustion engineer. This is what is expected of today’s biologists. Lab software focuses on function and features rather than usability. In contrast, video game manuals are seldom read. Is it possible to build scientific tools that are usable by anyone? Can we make them fun? 3) Leverage an insatiable resource. Can we harness the minds and reflexes of the billion-plus gamers worldwide to find cures for disease, with the incentive of being a high scorer rather than securing drug-patent rights? Many of the tasks confronted by biologists amount to combinatorial puzzles, not unlike the game “Bejeweled”. A biologist may spend years searching for patterns within a gene expression array. What if hundreds of gamers joined in and explored these datasets in parallel? In this talk, I will share our experiences in writing video games with a purpose, including discussions of some of the underlying biology as well as game demonstrations.

Green Computing: A Power-Aware Run-Time System for Datacenter Environments

Wuchun Feng, Virginia Polytechnic Institute and State University

Since the advent of the computer, performance has always been defined with respect to speed. As a consequence, microprocessor vendors have not only doubled the number of transistors (and speed) every 18-24 months, but they have also doubled the power densities. Consequently, keeping a datacenter environment functioning properly requires continual cooling and exhaust, resulting in substantial operational costs; for example, the annual cost of powering and cooling computer servers worldwide is fast approaching the annual spending on new machines. In addition, the increase in power densities has led to a decrease in system reliability, and thus to lost productivity. To address these problems in the datacenter, we present a power-aware scheduling algorithm that automatically and transparently adapts voltage and frequency settings to achieve significant power reduction and energy savings with minimal impact on the performance of datacenter workloads. We evaluate our power-aware scheduling algorithm on actual AMD- and Intel-based platforms, which support PowerNow! and demand-based switching, respectively. For sequential and parallel scientific workloads in datacenters, the energy savings average 20% and 25%, respectively, with maximum energy savings reaching as high as 70%. The energy savings for business workloads in datacenters are even higher, given their transaction-based execution profiles.
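
A highly simplified model of the underlying voltage/frequency trade-off is sketched below (hypothetical code under the common rough assumption that dynamic power scales approximately with the cube of frequency; the actual algorithm adapts settings transparently at run time using PowerNow! and demand-based switching rather than this toy rule):

    # Toy DVFS model (illustrative only): pick the lowest frequency whose predicted
    # slowdown stays within a user-specified bound, assuming power ~ f^3.
    def choose_frequency(freqs_ghz, cpu_bound_fraction, max_slowdown=0.05):
        f_max = max(freqs_ghz)
        best = f_max
        for f in sorted(freqs_ghz):
            # Only the CPU-bound fraction of the workload slows down when f drops;
            # memory- and I/O-bound time is assumed unaffected in this toy model.
            slowdown = cpu_bound_fraction * (f_max / f - 1.0)
            if slowdown <= max_slowdown:
                best = f
                break
        return best

    def relative_energy(f, f_max, cpu_bound_fraction):
        """Energy relative to running at f_max (power ~ f^3, time stretches as above)."""
        time = 1.0 + cpu_bound_fraction * (f_max / f - 1.0)
        power = (f / f_max) ** 3
        return power * time

    f = choose_frequency([1.0, 1.8, 2.0, 2.4], cpu_bound_fraction=0.1)
    print(f, relative_energy(f, 2.4, 0.1))   # large savings for memory-bound work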

Frontiers in metadata management for e-Science applications: the S-OGSA approach

Oscar Corcho, Paolo Missier, Pinar Alper, Sean Bechhofer, Carole Goble; University of Manchester

eScience applications are usually characterized by their distributed and knowledge-intensive nature, which poses interesting new metadata management challenges, such as metadata distribution across application components, access control, and evolution. Given the role of metadata in these applications, we think it should be treated as a first-class entity, coexisting with other entities in the system (Web services, datasets, sensors, documents, etc.). This shift in the treatment of metadata allows the above challenges to be addressed appropriately. This is what we propose in the S-OGSA architecture (Semantically-enriched Open Grid Service Architecture, originally proposed as a semantic extension of Grid applications), and what we have implemented in its supporting reference technological infrastructure. In S-OGSA, metadata can refer to any first-class entity that an application is dealing with (services invoked by a workflow engine, datasets, sensors, scientific documents, etc.), and it can be represented in multiple forms (natural language documentation, user-defined tags, ontology instances, etc.). Metadata is stored in metadata containers, called Semantic Bindings, which are linked to the entities they refer to and which can be accessed either independently or jointly, regardless of their physical distribution. Access control can be applied at different levels of granularity, since Semantic Bindings may contain small or large pieces of metadata from a specific resource, and metadata lifetime can be managed by means of appropriate event-driven notification mechanisms that trigger transitions between metadata states. We describe the main design principles of S-OGSA and how they can be applied in different e-Science scenarios, with examples from a prototype developed in the domain of satellite image quality analysis.

Model and Architecture for Policy-Based Governance

Munindar Singh, Yathiraj Udupi, North Carolina State University

Collaboration among peers is common in large-scale scientific computing (as in production grids). Often, resources (e.g., data, compute servers) need to be shared among multiple parties in a manner that respects both the overall needs of the collective and those of the individual; the familiar example of preemptive scheduling is a case in point. Currently, computational support for collaborative resource sharing is inadequate. A common approach is to apply policy engines, which poses two challenges. One, when autonomous peers interact, a centralized policy engine cannot make decisions for all of them. Two, current approaches lack a deep conceptual model of how collaboration takes place in scientific computing (or in service engagements broadly). We define governance as the process by which peers achieve agreement about how they will administer themselves. We contrast governance with management, which (as in the current mindset) applies to a superior managing his or her subordinates and is clearly inapplicable among peers. We have developed a conceptually well-grounded approach to governance that models organizations based upon our formalization of commitments. Each organization is defined in terms of the standing commitments among its members, and these commitments constrain the members’ behaviors. Organizations can enter into contracts with one another. Our conceptual model includes a rich vocabulary by which interactions among peers (such as for administering organizations) can be captured and appropriate policies stated for each peer, satisfying both collective and individual needs. This is how we achieve policy-based governance. A multi-agent prototype demonstrates our model and architecture. Our research seeks to capture important technical properties of policy-based governance. This presentation summarizes work previously reported at AAAI 06 and SCC 06 and 07.
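
As a rough illustration of how a commitment might be represented computationally (a hypothetical sketch, not the authors' formal model or multi-agent prototype), a commitment can be captured as a small structured object whose fields constrain a peer's behavior:

    # Minimal commitment structure (illustrative sketch only).
    from dataclasses import dataclass, field

    @dataclass
    class Commitment:
        debtor: str        # the peer who makes the commitment
        creditor: str      # the peer to whom the commitment is made
        antecedent: str    # condition under which the commitment becomes unconditional
        consequent: str    # what the debtor commits to bring about
        active: bool = True

    @dataclass
    class Organization:
        members: set = field(default_factory=set)
        commitments: list = field(default_factory=list)

        def allowed(self, peer, action):
            """A peer's action is constrained by its standing commitments."""
            return all(not (c.debtor == peer and c.active
                            and action == "withdraw_" + c.consequent)
                       for c in self.commitments)

    org = Organization(members={"labA", "labB"})
    org.commitments.append(
        Commitment("labA", "labB", "labB shares dataset", "grant_cluster_time"))
    print(org.allowed("labA", "withdraw_grant_cluster_time"))   # False: violates a commitment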

High Performance Computing Mortgage Pricing Project

Richard Buttimer, The University of North Carolina at Charlotte

Mortgages are one of the major fixed-income investment classes in the U.S. They are held by financial institutions, pension funds, mutual funds, and hedge funds. They are also frequently held in the investment portfolio of non-financial firms. Mortgages are an extremely complex financial instrument for a variety of reasons: they are long-lived, they are extremely interest rate sensitive, and they have embedded within them the borrower’s options to default and prepay. In practice, mortgage pricing is nearly always done through very lengthy and computationally-intensive Monte Carlo simulation. Microsoft, RENCI, and UNC Charlotte are working together to develop a mortgage pricing system utilizing the Microsoft Hosted High Performance Computing system. This system will initially be used in advanced MBA courses. Students in these courses will be assigned the task of managing simulated mortgage portfolios similar to those held by large money-center banks. They will utilize the pricing model to determine not only the prices of the securities they hold, but also their risk characteristics. The system will also provide prices and risk characteristics for a variety of alternative investment and hedging vehicles. This system will provide the students with a near “real world” mortgage portfolio management experience. Microsoft, RENCI, and UNC Charlotte will each gain experience with hosted high-performance computing applications. Although the system will initially utilize a publicly-available model, the Office of Thrift Supervision (OTS) regulatory model, the model could potentially be expanded to be a commercially viable system.
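
To give a flavor of why mortgage pricing is so computationally demanding, the sketch below prices a drastically simplified mortgage by Monte Carlo simulation of interest-rate paths with a crude rate-driven prepayment rule. It is purely illustrative: it is neither the OTS regulatory model nor the system being built, and every parameter and rule in it is an assumption.

    # Drastically simplified Monte Carlo mortgage pricing sketch (illustrative only).
    import numpy as np

    def price_mortgage(balance=100_000, coupon=0.06, years=30, r0=0.05,
                       sigma=0.01, n_paths=5_000, seed=0):
        rng = np.random.default_rng(seed)
        n = years * 12
        mc = coupon / 12
        payment = balance * mc / (1 - (1 + mc) ** -n)   # level monthly payment

        values = np.zeros(n_paths)
        for p in range(n_paths):
            r, bal, discount, pv = r0, float(balance), 1.0, 0.0
            for _ in range(n):
                # Toy short-rate dynamics: a random walk floored at zero.
                r = max(r + sigma * rng.standard_normal() / np.sqrt(12), 0.0)
                discount /= 1 + r / 12
                interest = bal * mc
                principal = min(payment - interest, bal)
                # Crude prepayment rule: extra principal returned when market rates
                # fall well below the mortgage coupon (refinancing incentive).
                prepay = (bal - principal) * (0.05 if r < coupon - 0.02 else 0.002)
                cash = interest + principal + prepay
                bal -= principal + prepay
                pv += discount * cash
                if bal <= 1e-6:
                    break
            values[p] = pv
        return values.mean()

    # print(price_mortgage())   # even this toy version needs millions of path steps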

Informative Robotic Sensing for Environmental Applications

Amarjeet Singh, Maxim Batalin, William Kaiser; University of California, Los Angeles

Networked InfoMechanical Systems (NIMS) provide a family of robotic platforms for diverse environment monitoring applications. We provide an overview of these systems and their applicability through several real-world sensing campaigns that provided scientists with data at a scale and resolution not previously possible. This new class of observational methods is also supported by experimental design that optimizes measurement fidelity by combining knowledge of measurement objectives, phenomena models, and system constraints. We have developed and demonstrated the generally applicable Iterative experimental Design for Environmental Applications (IDEA) methods and systems to efficiently use distributed sensing and computing for understanding the high spatial and temporal variability associated with environmental applications. Next, we model the observed natural system as a Gaussian Process and present a resource-cost-aware informative path planning approach. In this approach, we compute a set of most informative observation locations that can be visited by the mobile robot under an upper bound on the robot’s resource capacity, such as limited sensing time or limited battery capacity. For this NP-hard problem, we provide strong approximation guarantees for the single-robot scenario and extend the approach to multiple robots with near-optimal approximation guarantees. The NIMS family of sensing systems, together with a systematic experimental design approach that also involves phenomena modeling, enabled the first high-resolution imaging of several important scientific phenomena such as contaminant concentration and algal bloom dynamics. This work is currently being applied to survey entire river systems in interdisciplinary investigations, providing scientists with important new characterizations of primary national water resources.
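
A small sketch of the underlying idea, greedily choosing observation locations that most reduce Gaussian Process predictive variance under a budget, is given below (hypothetical code; the authors' algorithms additionally handle travel-path constraints and provide the formal approximation guarantees mentioned above, which this greedy toy does not):

    # Greedy budgeted selection of informative observation locations under a GP
    # prior (illustrative only; ignores travel-path constraints).
    import numpy as np

    def rbf_kernel(A, B, length=1.0, var=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return var * np.exp(-0.5 * d2 / length ** 2)

    def posterior_variance(candidates, chosen, noise=1e-3):
        prior_var = np.diag(rbf_kernel(candidates, candidates)).copy()
        if not chosen:
            return prior_var
        S = candidates[chosen]
        K_ss = rbf_kernel(S, S) + noise * np.eye(len(chosen))
        K_cs = rbf_kernel(candidates, S)
        reduction = np.einsum("ij,jk,ik->i", K_cs, np.linalg.inv(K_ss), K_cs)
        return prior_var - reduction

    def greedy_plan(candidates, budget):
        chosen = []
        for _ in range(budget):                  # budget = max number of observations
            var = posterior_variance(candidates, chosen)
            for c in chosen:                     # never revisit a location
                var[c] = -np.inf
            chosen.append(int(np.argmax(var)))
        return chosen

    grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
    print(greedy_plan(grid, budget=4))           # spread-out, high-uncertainty sites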

Enhanced kNN-QSAR Modeling of Aquatic Toxicity of Diverse Organic Compounds Tested by Fathead Minnows

Lin Ye, Hao Zhu, Alexander Golbraikh, Alexander Tropsha, University of North Carolina at Chapel Hill

Predictive models for acute fish toxicity (96-hour fathead minnow LC50) have been developed. A dataset of 587 molecules with experimentally determined LC50 values was compiled. The entire dataset was randomly divided into a modeling set (470 compounds) and an external validation set (117 compounds), and this procedure was repeated ten times to generate 10 modeling-validation set pairs. Molecular descriptors were calculated with the Dragon and MolConnZ software for all compounds in every subset. Each modeling set was split into multiple training-test sets using a diversity sampling approach. QSAR models were developed for the individual training sets by kNN methods, and the resulting models were validated using the respective test sets. Models that satisfied the cutoff (both leave-one-out cross-validation Q2 for the training set and linear-fit R2 for the test set greater than 0.6) were kept. All successful models were used to make a consensus prediction for the external validation set. The statistical results of all 10 external validation experiments were similar (R2 ranging from 0.67 to 0.83, Mean Absolute Error (MAE) ranging from 0.46 to 0.66). The results improved after removing outliers of the modeling-set compounds in chemical space before model development: for the external validation sets, R2 ranged between 0.76 and 0.82, and MAE between 0.41 and 0.44.
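
The modeling protocol described above can be summarized in a short sketch (hypothetical code using scikit-learn's plain kNN regressor as a stand-in for the authors' variable-selection kNN implementation; descriptor calculation with Dragon/MolConnZ is outside its scope):

    # Sketch of the modeling-set / external-validation protocol with a consensus
    # of kNN models (illustrative; not the authors' variable-selection kNN).
    import numpy as np
    from sklearn.model_selection import train_test_split, cross_val_predict
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import r2_score

    def consensus_knn(X, y, n_splits=10, q2_cut=0.6, r2_cut=0.6, seed=0):
        # Modeling set vs. external validation set (roughly 80/20, as in the abstract).
        X_model, X_ext, y_model, y_ext = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        accepted = []
        for k in range(n_splits):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X_model, y_model, test_size=0.25, random_state=k)
            model = KNeighborsRegressor(n_neighbors=5)
            q2 = r2_score(y_tr, cross_val_predict(model, X_tr, y_tr, cv=len(y_tr)))  # LOO
            model.fit(X_tr, y_tr)
            r2 = r2_score(y_te, model.predict(X_te))
            if q2 > q2_cut and r2 > r2_cut:           # keep only validated models
                accepted.append(model)
        if not accepted:
            return None
        preds = np.mean([m.predict(X_ext) for m in accepted], axis=0)  # consensus
        return r2_score(y_ext, preds), np.mean(np.abs(preds - y_ext))  # R2 and MAE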

Context-aware Optimized Sensing of Physiological Signals

Winston Wu, Maxim Batalin, William Kaiser, University of California, Los Angeles

Recent advancement in micro sensor technology permits miniaturization of conventional physiological sensors. Combined with low-power, energy-aware embedded systems and low power wireless interfaces, these sensors now enable patient monitoring in home and workplace environments in addition to the clinic. Low energy operation is critical for meeting typical long operating lifetime requirements. Important challenges appear as some of these important physiological sensors, such as electrocardiographs (ECG), introduce large energy demand because of the need for high sampling rate and resolution, and also introduce limitations due to reduced convenience of user wearability. Energy usage of the distributed sensor node systems may be reduced by activating and deactivating sensors according to real-time measurement demand. Indeed, as will be described, not all the physiological sensors are required at all times in order to achieve high certainty diagnostics. Our results show that with proper adaptive measurement scheduling, an ECG signal from a subject may be needed for analysis only at certain times, such as during or after an exercise activity. This demonstrates that autonomous systems may rely on low energy cost sensors combined with real time computation to determine patient context and apply this information to properly schedule use of high cost sensors, for example, ECG sensor systems. We have implemented a wearable system based on standard widely-used handheld computing hardware components. This system relies on a new software architecture and an embedded inference engine developed for these standard platforms. The performance of the system is evaluated using experimental data sets acquired for subjects wearing this system during an exercise sequence. This same approach can be used in context-aware monitoring of diverse physiological signals in a patient’s daily life.
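
The scheduling principle can be illustrated in a few lines (hypothetical sketch; the real system uses an embedded inference engine on a wearable platform rather than this simple threshold rule):

    # Toy context-aware sensor scheduling (illustrative only): keep the cheap
    # accelerometer always on and switch the costly ECG on only around exercise.
    def schedule_ecg(accel_magnitudes, on_threshold=1.5, cooldown_samples=300):
        """Return a list of booleans: whether the ECG should be active at each sample."""
        ecg_on, remaining, schedule = False, 0, []
        for a in accel_magnitudes:
            if a > on_threshold:              # activity detected: high-value ECG window
                ecg_on, remaining = True, cooldown_samples
            elif ecg_on:
                remaining -= 1                # keep recording briefly after activity ends
                if remaining <= 0:
                    ecg_on = False
            schedule.append(ecg_on)
        return schedule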

Using Low-Cost A-GPS Cell Phones and Web Mapping Applications for Multi-Jurisdictional Emergency Response Mobilizations

Uma Shama, Lawrence Harman, Juozas Baltikauskas, Daniel Fitch, Glen Kidwell; Bridgewater State College

We document the collaboration of the GeoGraphics Laboratory at Bridgewater State College and the Town of Brewster (MA) Fire and Rescue Department to develop a low-cost automatic vehicle location (AVL) system using commercial-off-the-shelf (COTS), military-specification cell phones and web mapping applications to provide situational awareness and post-action analysis for emergency response command and control personnel in a mobilization involving multiple jurisdictions. Using open-source software, a program was written to send assisted-GPS (A-GPS) data at very high refresh rates (2-4 seconds) using inexpensive data-only cell phones and standard Internet communications. The web mapping application provides a rich, no-cost display of the AVL data on the public-domain web service http://www.geolabvirtualmaps.com/ (Southeastern MA Emergency Response), with the capacity to add custom features defined by the local emergency response and emergency management personnel. It is hosted on Microsoft Virtual Earth but uses GeoRSS standards for creating the points, lines, and areas of geographic objects added to the application. It also provides a dynamic reverse geocoding feature that displays the nearest street address on the vehicle location label of the web display for emergency response commanders. The system was tested as part of the Fourth of July Provincetown (MA) Fireworks Mobilization involving ambulances and emergency response personnel from six towns. This presentation will describe the design features, a geospatial analysis of the mobilization, and the debriefing of the mobilization commander. This assessment will critique the performance of the technology before, during, and after the mobilization.

The Virtual Space Interaction Test Bed (VISIT)

Thomas Finholt, Erik Hofer, David Lee; University of Michigan

The School of Information at the University of Michigan recently launched the Virtual Space Interaction Test bed (VISIT) project. VISIT demonstrates a number of “ultra-resolution” collaboration capabilities. Using OptIPortals of varying sizes (e.g., arrays of commodity LCD displays coupled with computing clusters and high performance networking), VISIT supports visualization of images and data at very high resolution (currently 50 megapixels) alongside uncompressed HD video of distant collaborators. Previous use of OptIPortals has emphasized collocated collaboration and visualization. A key feature of VISIT is the distributed installation of OptIPortals to enable distant collaboration. Requirements for distant collaboration are quite different. For example, with limited or reduced shared visual access, it is necessary to create or simulate many of the cues used in shared spaces to coordinate conversation and to orient to common visual references. Therefore, VISIT explores the use of multi-modal sensor data, artifacts (e.g., shared electronic posters), and visual cues to allow distributed collaborators to use OptIPortals both to conduct their scientific work better and to improve awareness of the availability and presence of remote colleagues. This model of OptIPortal use emphasizes socio-technical aspects of the technology, seeking to produce gains in scientific understanding by improving the process of collaboration as well as through the introduction of advanced visualization capabilities. Therefore, a key goal of VISIT is evaluation of use in terms of its impact on the creation and maintenance of social network ties among scientists, research performance (e.g., time to produce publications), and usability.

Enabling Pivot Charts on Massive Multidimensional Datasets

Mehrdad Jahangiri, Cyrus Shahabi; University of Southern California

Spreadsheets allow us to perform complex data analysis on scientific datasets. However, they cannot operate efficiently on the large multidimensional datasets generated by current data acquisition methods. Current scientific practice is to store the original data in databases or on ftp sites and then manually generate a smaller subset of the data (by sampling, aggregating, or categorizing). This time-consuming process suffers from one major drawback: by losing the detailed information and working with a second-hand dataset, we conduct a biased study of the data, verifying known hypotheses rather than being surprised by unknown facts. One of the most heavily exercised functionalities of spreadsheets is generating meaningful plots over the data; however, to the best of our knowledge, no other work has studied plots as “queries” on large datasets. A plot query summarizes how a fact changes over a set of attributes and is visually represented in various forms of charts. The valuable insight provided by these queries comes from the illustrated relationship among the plot points. Thus it is essential to preserve this relationship in approximate or progressive answering, rather than conserving the accuracy of each individual plot point. Here, we propose a wavelet-based technique that exploits I/O sharing across plot points to evaluate the query progressively and efficiently. The intuition comes from the fact that we can decompose a plot query into two sets of aggregate and slice-and-dice queries; subsequently, we can effectively compute both, as investigated in our earlier studies. Our technique is not only efficient as an exact algorithm but also very effective as an approximation method in cases of limited query time or storage space. We believe this study can lead us toward building interactive pivot charts on massive multidimensional datasets.
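
The flavor of progressive, wavelet-based plot approximation can be sketched with a plain one-dimensional Haar transform (hypothetical code; the authors' technique is far more sophisticated about I/O sharing on multidimensional data):

    # Progressive approximation of a plot (a series of aggregate values) from its
    # Haar wavelet coefficients, largest coefficients first (illustrative only).
    import numpy as np

    def haar_forward(x):
        x = x.astype(float).copy()
        coeffs, n = [], x.size
        while n > 1:
            avg = (x[:n:2] + x[1:n:2]) / 2.0
            diff = (x[:n:2] - x[1:n:2]) / 2.0
            coeffs.append(diff)
            x[: n // 2] = avg
            n //= 2
        coeffs.append(x[:1].copy())          # overall average last
        return coeffs

    def haar_inverse(coeffs):
        x = coeffs[-1].copy()
        for diff in reversed(coeffs[:-1]):
            up = np.empty(2 * x.size)
            up[0::2] = x + diff
            up[1::2] = x - diff
            x = up
        return x

    def progressive_plot(series, n_coeffs):
        """Reconstruct the plot keeping only the n_coeffs largest coefficients."""
        coeffs = haar_forward(np.asarray(series))
        flat = np.concatenate(coeffs)
        keep = np.argsort(np.abs(flat))[::-1][:n_coeffs]
        mask = np.zeros_like(flat)
        mask[keep] = 1.0
        flat = flat * mask
        out, i = [], 0                        # re-split into per-level arrays
        for c in coeffs:
            out.append(flat[i : i + c.size])
            i += c.size
        return haar_inverse(out)

    # Length must be a power of two for this simple transform.
    print(progressive_plot([4, 6, 10, 12, 8, 6, 5, 5], n_coeffs=3))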

An Infrastructure for Combining Geospatial Research with Computationally Intensive Social Sciences

Tiberiu Stef-Praun, Ian Foster, Computation Institute/University of Chicago; Robert Townsend, Economics Dept/University of Chicago

We report on a project that seeks to scale up this approach to larger quantities of data, more computationally demanding analytic methods, and a larger population of economist and student users. At the core of this project is an infrastructure that integrates spatial data services for organizing, accessing, analyzing, and displaying spatial data, and computational services that allow for the distributed processing of models on Grid-enabled resources. Integration via Web Services allows users to pose questions that are answered by extracting data from GIS data sources, running substantial computations on that data and depositing derived data back into the spatial data store.

The Data Playground: A Data-Driven Workflow Specification Environment

Carole Goble, Andrew Gibson, Matthew Gamble, Katy Wolstencroft; The University of Manchester; Tom Oinn, The European Bioinformatics Institute

Workflow environments like Taverna (http://www.mygrid.org.uk/) are great for scientists who have a clear understanding of their task and goals. However, a significant amount of bioinformatics does not have such well defined goals. We present the Data Playground, an environment designed to encourage the uptake of workflow systems in bioinformatics through more intuitive interaction by focusing the user on their data rather than on the processes. A prototype plug-in for the Taverna workflow environment shows how we can promote the creation of workflow fragments by automatically converting the users’ interactions with data and Web Services into a more conventional workflow specification. We claim that this exploratory mode is more natural to users, and enables workflow development by example.

Combinatorial QSAR Analysis of Histone Deacetylase Inhibitors and QSAR-based Database Mining

Hao Tang, Alexander Tropsha, Simon Wang, The University of North Carolina at Chapel Hill; Alan Kozikowski, University of Illinois at Chicago; Bryan Roth, The University of North Carolina at Chapel Hill

Histone deacetylases (HDAC) play a critical role in transcription regulation, and small-molecule HDAC inhibitors are emerging agents for treating cancer and other cell-proliferation diseases. Several previous reports have used 3D Quantitative Structure-Activity Relationship (QSAR) modeling to assess the possibility of computer-based drug mining for HDAC inhibitors. We employed a variable-selection k-Nearest Neighbor (kNN) approach and a Support Vector Machines (SVM) approach to generate QSAR models for 59 chemically diverse compounds with inhibitory activity against class I histone deacetylase. MOE- and MolConnZ-based 2D descriptors were combined with the kNN and SVM approaches independently to improve the predictive power of the models. Rigorous model validation approaches were employed, including randomization of target activity (Y-randomization test) and assessment of model predictive power by consensus prediction on two external datasets. Highly predictive QSAR models were generated, with leave-one-out cross-validation R2 (q2) values for the training set and R2 values for the test set as high as 0.81 and 0.80, respectively, with the MolConnZ/kNN approach, and 0.94 and 0.81, respectively, with the MolConnZ/SVM approach. The validated QSAR models were then used to mine four chemical databases comprising a total of over 3 million compounds, resulting in 48 consensus hits, including two reported HDAC inhibitors not included in the original data set.

Provenance in Kepler-based Scientific Workflow Systems

Meiyappan Nagappan, North Carolina State University; Ilkay Altintas, San Diego Supercomputer Center; George Chin, Pacific Northwest National Lab; Daniel Crawl, San Diego Supercomputer Center; Terence Critchlow, Pacific Northwest National Lab; David Koop, University of Utah; Jeffrey Ligon, North Carolina State University; Bertram Ludaescher, University of California, Davis; Pierre Mouallem, North Carolina State University; Norbert Podhorszki, University of California, Davis; Claudio Silva, University of Utah; Mladen Vouk, North Carolina State University

Scientific workflow management systems are used to automate scientific discovery. The increasing complexity of such workflows, and sometimes legal requirements, are fueling a demand for more run-time and historical information about workflow processes, outputs, environments, and so on. A properly constructed framework for collecting run-time and provenance information can help manage, integrate, and display the needed information. In this paper we present the provenance system developed by the Scientific Process Automation group of the Department of Energy Scientific Data Management Enabling Technology Center. The solution builds on the successful Kepler scientific workflow system by integrating Kepler with a standard LAMP (Linux, Apache, MySQL, PHP) environment to provide a very flexible and readily deployable (K)LAMP scientific workflow support environment for e-Science. The solution is sufficiently modular to allow the use of other workflow engines and other component solutions. This paper discusses the architecture of the solution, its deployment, and some of the principal challenges it addresses: how to collect provenance information in a standardized and seamless way with minimal overhead, how to store this information permanently so that the scientist can come back to it at any time, and how to present this information to the user in a logical manner. Part of the challenge also lies in the privacy and strict security policies that apply to Department of Energy (DOE) national laboratories.

Discovery of Novel Geranylgeranyltransferase Inhibitors through Virtual Database Mining

Yuri Peterson, Duke University; Simon Wang, The University of North Carolina at Chapel Hill; Patrick Casey, Duke University; Alexander Tropsha, The University of North Carolina at Chapel Hill

Geranylgeranyltransferase inhibitors (GGTIs) are small-molecule drugs that inhibit the C20 lipid modification of CaaX-motif proteins. Attenuating the function of these proteins is expected to provide therapeutic benefit in cancer, inflammation, multiple sclerosis, viral infection (HepC/HIV), apoptosis, angiogenesis, rheumatoid arthritis, atherosclerosis (vascular disease), psoriasis, glaucoma, and diabetic retinopathy. However, only two chemical scaffolds for GGTIs are publicly known at present. We have developed combinatorial quantitative structure-activity relationship (QSAR) models for 48 known GGTIs, using the k-nearest neighbor (kNN), automated lazy learning (ALL), and partial least squares (PLS) methods. The models were rigorously validated against several statistical criteria, including randomization of the target property (Y-randomization), verification of the training-set models’ predictive power on test sets, and the establishment of the models’ applicability domain. The validated QSAR models were used to mine major publicly available chemical databases, including the National Cancer Institute database of ca. 250,000 compounds, the Maybridge database of ca. 54,000 compounds, the ChemDiv database of ca. 630,000 compounds, the WDI database of ca. 59,000 compounds, and the ZINC 7.0 database of ca. 6,500,000 compounds. These searches resulted in multiple consensus hits and revealed several new chemical scaffolds for GGTIs, which have since been validated by biological assays and recently patented. This study illustrates that the combined application of predictive QSAR modeling and database mining can provide an important avenue for rational computer-aided drug discovery.

A Disease Search Engine for Early Incidence Warning and Monitoring

Hanan Samet, Jagan Sankaranarayanan, Michael Lieberman, Adam Phillippy; University of Maryland

eScience techniques can be used to understand the source and spread of disease epidemics and to contain future outbreaks, thereby possibly reducing the potentially massive toll on human life in underdeveloped nations. Even though epidemiological information is available for many pathogenic microbes, incidence reports are scattered and difficult to summarize. We have built a system to automatically extract, classify, and organize incidence reports based on geographic location and type for analysis by domain experts. Documents from the U.S. National Library of Medicine (www.pubmed.gov) and the World Health Organization (www.who.int) have been tagged according to their spatial and temporal relationships to specific disease occurrences and presented graphically via a map interface. This work leverages our experience with the SAND Spatial Browser and Spreadsheet to provide spatial and textual search capabilities on the web (e.g., documents on “influenza” near “Hong Kong”). Users can also see the phrases in the documents that satisfy the query, facilitating easy verification as well as dismissal of false positives due to errors in identifying geographical references, which are difficult to avoid. The user interface also provides the ability to restrict the search results to a particular time period. In addition, newspaper articles have been tagged and indexed to bolster the surveillance of ongoing epidemics, while examining past epidemics with our system leads to improved understanding of the sources and spreading mechanisms of infectious diseases. In our paper, we describe the design of our system, which combines state-of-the-art technologies from different areas of computer science, and demonstrate its operation and usefulness.

Collection Processing and Comparative Studies in GPFlow

James Hogan, Paul Roe; Queensland University of Technology

Modern scientific enquiry, particularly in bioinformatics, is increasingly characterized by fine-grained comparative analyses over large data sets. Such studies require the automation of software tools to operate across multiple data values, and sensible strategies for managing the explosion of outputs that may result. Modern scientific workflow systems, therefore, must provide support for these activities and for the active involvement of the user in selection, combination, and filtering. In this talk we present a new version of the GPFlow scientific workflow system which provides extensive support for collection processing, but does so in a manner largely transparent to the user and which avoids the need for the scientist to take direct control of operational plumbing. GPFlow is a novel, web-accessible workflow system which makes large-scale comparative studies accessible without programming and eases the transition from small-scale experimentation to serious large-scale analyses. In a typical comparative study, several tools and services are used in concert, and all must be lifted to operate across sets of values to implement the analysis, with some components drawing upon outputs from multiple precursors. Data must be combined and filtered at each end of the process. The model and its implementation are presented in the context of a core bioinformatics problem: the search for regulatory motifs. The model is novel in allowing a workflow on a single data value to be automatically lifted to operate on a set of values. Users may thus prototype on the small scale and execute on the large, a process which requires no changes to the underlying workflow. The model follows our previous work in supporting combined interactive and batch operation.
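
The "lifting" idea, turning a workflow step on one value into one that operates over a collection, can be conveyed in a tiny sketch (hypothetical code; GPFlow performs this transparently within its workflow engine, not via explicit Python functions):

    # Sketch of lifting a single-value workflow step to a collection (illustrative only).
    def lift(workflow_step):
        """Turn a function on one value into one that maps over a collection."""
        def lifted(values, **kwargs):
            return [workflow_step(v, **kwargs) for v in values]
        return lifted

    def find_motifs(sequence, motif="TATA"):
        """Toy stand-in for a motif-search step: start positions of the motif."""
        return [i for i in range(len(sequence)) if sequence.startswith(motif, i)]

    # The same step prototyped on one sequence runs unchanged on many:
    print(find_motifs("GGTATACCTATA"))
    print(lift(find_motifs)(["GGTATACC", "TATATATA"]))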

The North Carolina State University Virtual Computing Laboratory: Providing an Efficient eScience Environment

Eric Sills, Sam Averitt, Michael Bugaev, Aaron Peeler, Henry Schaffer, Josh Thompson, Mladen Vouk; North Carolina State University

North Carolina State University has developed a computational and application resource brokering, differentiation, and delivery system called the Virtual Computing Laboratory (VCL). VCL allows a common hardware infrastructure to be shared by a range of applications, from CAD packages to HPC workloads. Initially, VCL virtualized the STEM computing environments to deliver the applications students needed for their course work and research to their personal computing devices, rather than at a physical computer lab on campus. As development of VCL progressed, hardware resources flowed back and forth between production Linux cluster nodes serving typical HPC workloads and nodes providing on-demand student-computing applications on various operating systems. Demand curves for these two uses tend to be out of phase, with student computing demand building as the academic semester progresses and HPC demand peaking after the end of exams. This allows much better utilization of the hardware resources. VCL has been in production use at North Carolina State University for about three years. The flexibility of VCL has proven essential in easily supporting specialized university research computing demands, and our experience is that VCL-based hardware and application management provides a much greater service at a considerably lower cost per unit of service. In addition, VCL provides the various standard and customized services with much less intervention by central IT staff than was previously necessary. This paper discusses the details of the VCL architecture, its economics, security, and versatility.

Environmental Monitoring through Acoustics using a Network of Smartphones

Richard Mason, Binh Pham, Paul Roe, Queensland University of Technology

Sound is a rich medium carrying a wealth of information that is tractable for analysis. The natural environment is rich in sounds; potentially, fauna, weather, and machinery can be located and recognized. Environmentalists use sound to measure the health of the environment by monitoring key species, such as birds, which are early indicators of environmental change. We have designed a sensor network based on smart phones for monitoring environmental change. The platform comprises smart phones running a custom application for recording bird song. Sensors are managed in an autonomic fashion to ensure that they operate reliably and efficiently for long periods of time. Recorded birdsong is uploaded to a relational database through a 3G telephony network. The nature of acoustic sensing means that large volumes of data are collected, so data communication and optimization are important. Sensor recording can be remotely controlled through a web service interface. The sound data stored in the database are analyzed to recognize different birds and bird calls using a neural network, with a novel noise reduction technique applied prior to identification. The analyses potentially enable the location, the type of bird, and bird behavior (through bird call) to be determined. From this, temporal and spatial profiles of bird behavior can be studied and the effects of environmental change assessed. A field study is being undertaken at Brisbane airport, where a second runway is being constructed. Brisbane airport is located in an environmentally valuable wetland area, which is the habitat for much wildlife including the rare Lewin’s Rail. The study aims to address a number of questions regarding this bird using acoustic sensor networks. The sensors will provide valuable information on the birds’ habits as well as a measure of the impact of the new runway’s construction.

Accurate Differentiation of Docking Decoys using Quantitative Structure-Binding Affinity Relationship (QSBAR) Classification Models with ENTess Chemical Geometrical Descriptors

Jui-Hua Hsieh, Simon Wang, Shuxing Zhang, Alexander Tropsha; The University of North Carolina at Chapel Hill

Molecular docking has become a common technique in structure-based drug design. Although the state-of-the-art search algorithms implemented in docking software can generate native-like poses in the binding sites, the performance of the scoring functions is still unsatisfactory. The failure to correlate the key interactions with binding affinities leads to “geometric decoys”: poses deviating by more than 3.0 angstrom RMSD from the native pose but with better energy scores (Shoichet BK et al., J. Med. Chem. 2005, 48, 3714-3728). k-Nearest Neighbor (kNN) binary QSAR models, generated from 264 protein-ligand complexes in the Protein Data Bank using ENTess descriptors, were applied to four geometric decoy datasets: Thrombin, Dihydrofolate Reductase (DHFR), Thymidylate Synthase (TS), and Acetylcholinesterase (AChE).

Shared Genomics - Accessible HPC for Medical Genomic Research

David Hoyle, Iain Buchan, Peter Crowther; University of Manchester

Microarray technology for genome-wide Single Nucleotide Polymorphism (SNP) genotyping provides a unique opportunity to study complex diseases. This opportunity also brings computational and knowledge management challenges: the statistical analysis of the raw data presents a computational bottleneck, motivating the need for a High Performance Computing (HPC) based solution. Statistical analysis of the raw data produces an equally large volume of derived data. Making sense of this derived data requires integrating the statistical analyses with information already known to the research community, such as SNP location, gene regulation, relevant biochemical pathways, etc. Leveraging this community knowledge allows us to filter the statistical analyses and focus upon the most important genetic determinants of the diseases. The community knowledge exists in the form of the individual expertise of scientists and of information deposited in distributed databases and knowledge repositories. Easy access to both HPC infrastructure and community knowledge will be crucial for accelerating new research findings from genome-wide SNP studies. At NIBHI we have begun to develop, in collaboration with Microsoft, the necessary HPC infrastructure. The HPC facility will be accessed via a SharePoint portal site, providing a shared environment through which collaborating scientists exchange results, analyses, comments, and documents. Statistical analyses are run on the HPC infrastructure by initiating workflows from the portal site. Access to community knowledge will be provided through automatic retrieval of annotation data from distributed sources. This can be performed via integration with existing bioinformatics workflow management systems, such as Taverna, that allow us to re-use workflows calling web services that access the knowledge repositories.
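
The per-marker statistical tests that create the bottleneck mentioned above are embarrassingly parallel, which is what makes an HPC back end attractive. The following sketch, a plain chi-square test on genotype counts distributed over a local worker pool, is only a schematic stand-in for the project's actual statistical pipeline and its SharePoint/Taverna workflow machinery; all names and data are illustrative.

```python
import numpy as np
from multiprocessing import Pool
from scipy.stats import chi2_contingency

def snp_association(args):
    """Chi-square test on a 2x3 genotype-count table for one SNP.

    genotypes: array of 0/1/2 minor-allele counts per subject.
    phenotype: array of 0/1 case-control labels.
    """
    snp_id, genotypes, phenotype = args
    table = np.zeros((2, 3))
    for g, p in zip(genotypes, phenotype):
        table[p, g] += 1
    chi2, p_value, _, _ = chi2_contingency(table)
    return snp_id, p_value

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    phenotype = rng.integers(0, 2, size=500)
    jobs = [(f"snp{i}", rng.integers(0, 3, size=500), phenotype)
            for i in range(10_000)]
    with Pool() as pool:                # each worker handles a slice of SNPs
        results = pool.map(snp_association, jobs)
    print(sorted(results, key=lambda r: r[1])[:5])   # most significant SNPs
```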

Simulating Air Quality and Other Wind Engineering Applications with an Urban Landscape

Alan Huber, The University of North Carolina at Chapel Hill

High-fidelity local-scale Computational Fluid Dynamics (CFD) simulation of pollutant concentrations within roadway and urban landscapes is feasible using current high performance computing. Local-scale CFD simulations are able to account rigorously for topographical details such as terrain variations and building structures in urban areas. Solar or anthropogenic heating may be added to terrain and building surfaces. Real human environments may be directly simulated to support urban planning and response to emergency situations. There is a wide range of potential applications in which computational wind engineering will become routine in the coming years as computing hardware and software continue to grow and expand the frontiers for application. This presentation will briefly review the history of developments in computational environmental fluid dynamics. Driven by advancing computational hardware and software, modern fluid dynamics has evolved greatly since Sir Isaac Newton's physical equations and the development of the Navier-Stokes equations for fluid flow. The Navier-Stokes equations are the general basis for all CFD applications, from weather prediction to vehicular aerodynamics. Example applications developed over the past few years while the author was employed with the US Environmental Protection Agency are now being pursued as adjunct research faculty at the University of North Carolina, using the critically needed computing capacity of RENCI's Topsail computing system. In particular, simulations of the air transport of pollutant emissions within the Madison Square Garden area of New York City will be demonstrated. The virtual environment for midtown Manhattan has been developed to support planning and response to potential accidental emissions or intended terror activities. The age of direct local-scale environmental simulation has arrived.
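
For reference, the incompressible Navier-Stokes momentum and continuity equations that underlie such CFD applications can be written, in standard notation with velocity u, pressure p, density rho, kinematic viscosity nu, and a body force f (through which buoyancy from heated surfaces can enter), as:

```latex
\frac{\partial \mathbf{u}}{\partial t}
  + (\mathbf{u}\cdot\nabla)\mathbf{u}
  = -\frac{1}{\rho}\nabla p + \nu \nabla^{2}\mathbf{u} + \mathbf{f},
\qquad
\nabla \cdot \mathbf{u} = 0 .
```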

QSAR Modeling of Blood-Brain Barrier Permeability of Diverse Organic Compounds

Liying Zhang, Hao Zhu, Alexander Tropsha; The University of North Carolina at Chapel Hill

We have developed robust QSAR models of Blood-Brain Barrier (BBB) permeability using k-Nearest Neighbors (kNN) and Support Vector Machines (SVM) approaches with molecular topological descriptors. The modeling set of 159 compounds was divided into an external evaluation set (15 compounds) and multiple training and test sets (the remaining 144 compounds). The consensus QSAR model accuracies were q2=0.91 and R02=0.68 for the self-validation and external evaluation sets, respectively. These models were applied to additional external evaluation sets consisting of 99 drugs (from the WOMBAT-PK dataset) and 267 organic compounds classified as permeable (BBB+) or non-permeable (BBB-), and the best prediction accuracies were 82.5% and 59.0%, respectively. Noticeable improvements in prediction accuracy were achieved after applying an applicability domain threshold to the prediction of the evaluation sets: the accuracy for the first external evaluation set increased to R02=0.75 and for both of the additional external sets to 100%. The resulting models can be used to guide the design of pharmaceutically relevant chemical libraries towards drug-like compounds with optimal BBB permeability.
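
A kNN prediction combined with an applicability-domain filter of the general kind used above can be sketched in a few lines. The Euclidean distance in descriptor space and the distance-based domain cutoff below are common conventions chosen purely for illustration, not the authors' exact protocol, and the data are random stand-ins.

```python
import numpy as np

def knn_predict(x, train_X, train_y, k=3):
    """Predict activity as the mean over the k nearest training compounds."""
    d = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(d)[:k]
    return train_y[nearest].mean(), d[nearest].mean()

def predict_with_domain(x, train_X, train_y, k=3, z=0.5):
    """Return a prediction only if x falls inside the applicability domain.

    The domain here is: mean distance to the k nearest neighbors no larger
    than (mean + z * std) of all pairwise training distances, one common
    convention used only for illustration.
    """
    pred, mean_d = knn_predict(x, train_X, train_y, k)
    pair_d = np.linalg.norm(train_X[:, None] - train_X[None, :], axis=2)
    cutoff = pair_d.mean() + z * pair_d.std()
    return pred if mean_d <= cutoff else None   # None means outside the domain

rng = np.random.default_rng(3)
train_X = rng.normal(size=(144, 20))   # 144 training compounds, 20 descriptors
train_y = rng.normal(size=144)         # e.g. logBB values
print(predict_with_domain(rng.normal(size=20), train_X, train_y))
```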

Combinatorial QSAR Modeling of Chemical Toxicants Tested against Tetrahymena pyriformis

Hao Zhu, Alexander Tropsha, The University of North Carolina at Chapel Hill

Selecting suitable quantitative structure-activity relationship (QSAR) approaches for a specific toxicity endpoint is one of the critical issues in developing robust predictive computational toxicity models. To this end, we have compiled an aqueous toxicity dataset containing 1,093 unique compounds tested in the same laboratory over several years against Tetrahymena pyriformis. A modeling set consisting of 644 compounds randomly selected from the original set was distributed to five chemoinformatics groups, each using its own QSAR approaches and descriptors for model development. The remaining 449 compounds in the original set were used as an evaluation set to test the predictive power of the individual models. In total, our virtual collaboratory generated 11 different validated QSAR toxicity models for the training set. The best models had a Leave One Out (LOO) cross-validation correlation coefficient R2 (q2) of 0.93 for the training set and a correlation coefficient R2 for the external evaluation set as high as 0.83. The results demonstrated that evaluating models based only on the statistical parameters obtained for the modeling set may mislead the selection of externally predictive models. We have developed a consensus model based on the average of the prediction results of all 11 models. The consensus model gave the best prediction accuracy for the training and external evaluation sets, as high as 0.95 (q2) and 0.86 (R2), respectively. An applicability domain can also be applied to balance prediction accuracy against chemistry space coverage, depending on the user's error tolerance level.
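
The consensus prediction and the quoted statistics have simple closed forms; written in the usual QSAR notation (an illustrative convention, not formulas given in the abstract):

```latex
\hat{y}^{\mathrm{cons}}_i = \frac{1}{11}\sum_{m=1}^{11} \hat{y}^{(m)}_i,
\qquad
q^2 = 1 - \frac{\sum_i \bigl(y_i - \hat{y}^{\mathrm{LOO}}_i\bigr)^2}
               {\sum_i (y_i - \bar{y})^2},
\qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}.
```

Here \hat{y}^{(m)}_i is model m's prediction for compound i, \hat{y}^{\mathrm{LOO}}_i is the leave-one-out prediction, and \bar{y} is the mean observed toxicity.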

Ad Hoc Scientific Workflows through Data-Driven Service Composition

David Chiu, Gagan Agrawal; The Ohio State University

Scientific domains increasingly involve data that can be obtained from the deep Web, while other datasets exist in low-level formats. At the same time, an increasing number of Web and grid services are being made available. This leads to an interesting question: “Can we query low-level and deep Web data by automatically composing services and creating workflows?” Our work is driven by a collaboration with the geodetic sciences, funded by an NSF grant for Cyber infrastructure for Environmental Observatories. Specifically, geospatial data is known to have:

  * Large volumes: data may be collected in a continuous manner.
  * Low-level formats: data is normally stored in native low-level formats, rather than in databases.
  * High dimensionality: high dimensionality inherently implies nontrivial complexity in processing certain types of data.
  * Heterogeneous data sources: disparate sources can collect and represent the same information with different accuracy, precision, and format.
  * Spatio-temporal domain: since geospatial data is highly volatile, rigorous maintenance of descriptors such as location and date is imperative for providing accurate information.

We propose a system that automatically constructs ad hoc workflows for answering high-level queries based on both service and data availability. A specific contribution of this work is the so-called “data-driven” capability, in which we provide a framework to capture and utilize the information redundancy present in heterogeneous data sources. We will use “machine-interpretable metadata” to understand and parse low-level datasets and use them with the services.

A Novel Approach to Structure-based Pharmacophore Search Using Computational Geometry and Shape Matching Techniques

Jerry Ebalunode, North Carolina Central University; Zheng Ouyang, University of Illinois at Chicago; Jie Liang, Weifan Zheng, North Carolina Central University

The structure-based drug design methods are typified by docking technologies that have been widely adopted by the pharmaceutical industry for virtual screening and library design. They are often the computational tools of choice for both lead generation and lead optimization. However, despite many reports of successful applications of off-the-shelf docking tools, serious issues remain unsolved in terms of the accuracies of docking poses and affinity scores. Recently, more intuitive and computationally more efficient structure-based methods have been reported that seek to find effective means to utilize experimental structure information without employing detailed docking calculations. These tools can (should) be coupled with efficient HTS technologies to improve the probability of success in the discovery process. For example, LigandScout has been successfully applied in several virtual and experimental HTS projects. We report the development of a new method that employs a rigorous computational geometry method and a deterministic geometric casting algorithm to derive the negative image of a binding site. Once the negative image of the binding site is generated, a variety of computer vision methods can be applied to compare and match small organic molecules with the shape of the negative image. We report the detailed computational protocol and its validation using known biologically active compounds extracted from the WOMBAT database. Models derived for selected targets are used to perform the virtual screening experiments to obtain the enrichment data for various methods. It is found that our new approach (Shape4 for shape pharmacophore) affords significantly better enrichment of hits than other methods studied in this work.

The Challenges for eScience with the Pan-STARRS Sky Surveys

Nick Kaiser, Jim Heasley, Eugene Magnier, Alex Szalay; University of Hawaii

The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) will use giga-pixel CCD cameras on multi-aperture telescopes to survey the sky in the visible and near infra-red bands. A single telescope system (PS1) has been deployed on Maui and a four-telescope system (PS4) will be sited on Mauna Kea on the Big Island of Hawaii. These systems will survey the sky repeatedly and will generate petabytes of image data and catalogs of billions of stars and galaxies. The images will be combined to generate a very sensitive multi-color image of the static sky, and differences between images will provide a massive database for “time domain astronomy”; the study of moving, transient or variable objects. In addition to the challenge of building the telescopes and detectors, the project is faced with the formidable challenges of processing the image data in near real time and making the catalog data accessible via relational databases in order to facilitate the eScience that this project promises. This talk will describe the scale and content of the data products and will outline the designs of the image processing and database and archiving systems.

The eScience program at the University of Copenhagen

Eric Jul, Brian Vinter; University of Copenhagen

The University of Copenhagen has established an eScience graduate degree program and an eScience center to further develop and enhance research in eScience. While it is possible to take many eScience-related courses in most degree programs at the University, the University feels that establishing a separate eScience degree puts a much stronger emphasis on eScience. The new program has achieved solid backing from all departments of the Faculty of Natural Sciences. At the workshop I, as Director of eScience Studies, would welcome the chance to present the approach that the University of Copenhagen has taken to promote the new eScience graduate degree program, and the motivation for establishing an eScience center that draws faculty members from many different areas of the Natural Sciences. As far as we know, our program is one of the very first to offer students a cross-disciplinary eScience degree while also letting them interact with researchers at a dedicated eScience research center. At the workshop, the motivation and rationale for the program will be presented and the specific core courses will be described.

Analysis and Characterization of Reactive Cysteines in Protein Structures and Within Cellular Signal Transduction Networks

Stan Thomas, Freddie Salsbury, Jr., Stacy Knutson, Leslie Poole, Jacquelyn Fetrow; Wake Forest University

Protein post-translational modifications play key biological roles by modifying the structure and function of proteins. A common example is that of protein phosphorylation in signal transduction, metabolism and cellular differentiation. Analysis of phosphorylation sites has led to a better understanding of kinase substrate specificity, methods for site prediction and a combined experimental/computational approach resulting in a better understanding of the yeast phosphoproteome. Cysteine sulfenic acid (Cys-SOH) is a catalytic intermediate at enzyme active sites, a sensor for cellular stress, a regulator of transcription factors and an intermediate in redox signaling. The cysteine post-translational modification to sulfenic acid is not random; features at or near the cysteine control its reactivity. To identify the features responsible for the propensity of certain cysteines to be modified to sulfenic acid, a list of 47 proteins (containing 49 Cys-SOH sites) was compiled. Modifiable cysteines are found in proteins from many structural and functional classes. The site itself is not located in any one type of secondary structure. To further identify residues affecting cysteine reactivity, sites were analyzed using both functional profiling and electrostatic analysis. The combined approach reveals mechanistic determinants not obvious from sequence comparison alone. The longterm goals of this work are: 1) to combine structural and electrostatic feature analysis to predict Cys-SOH modification sites; 2) to include other modifications and distinguish between types of reactive cysteines; 3) to create a publicly accessible database of known and potential modification sites. The database would link sequence, structure, chemical and biological data to allow researchers to assess the effects of mutations or the possibility of oxidative cysteine modifications in proteins.

Data Placement Services for eScience Workflows

Ann Chervenak, University of Southern California

Data management for eScience applications is a challenging problem. Data-intensive scientific applications may produce and consume terabytes of data, which must be staged into and out of the high-performance computing resources on which the application’s computational analyses run. These analyses are often represented as scientific workflows that consist of millions of interdependent tasks. Workflow management systems are increasingly used to manage the dependencies among these computational tasks and the movement of data sets that are produced or consumed during task execution. The placement of data sets on storage resources can have a significant impact on the performance of eScience workflows. For example, if data sets are placed near high-performance computing resources, they can be staged efficiently into computations that execute on those resources; moving data sets off computational resources quickly when task execution is complete can also improve performance. In this talk, we consider the use of policy-driven Data Placement Services to improve the performance of eScience workflows. We are studying a variety of placement policies that seek to place data sets in ways that are advantageous for scientific workflow execution. Our research focuses on the relationship between data placement services and workflow management systems, with the goal of making data placement largely asynchronous with respect to workflow execution, thus reducing the need for on-demand data staging by the workflow system. The workflow system can also provide hints to the data placement service system about the order in which data are accessed. Using two existing services, the Data Replication Service for staging data and the Pegasus workflow management system, we demonstrate that intelligent data placement has the potential to significantly improve the performance of eScience workflows.
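
One way to make placement largely asynchronous while still honoring workflow hints is to keep a prestaging queue ordered by expected access time. The sketch below is a schematic illustration of that idea with hypothetical names; it is not the Data Replication Service or Pegasus API.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class PlacementRequest:
    expected_access_time: float          # hint supplied by the workflow system
    dataset: str = field(compare=False)
    target_site: str = field(compare=False)

class DataPlacementService:
    """Asynchronously prestages data sets in hinted access order."""

    def __init__(self, transfer_fn):
        self.queue = []                  # min-heap keyed on expected access time
        self.transfer_fn = transfer_fn   # e.g. a wrapper around a copy tool

    def hint(self, dataset, target_site, expected_access_time):
        heapq.heappush(self.queue,
                       PlacementRequest(expected_access_time, dataset, target_site))

    def run_once(self):
        """Stage the most urgently needed data set, if any."""
        if self.queue:
            req = heapq.heappop(self.queue)
            self.transfer_fn(req.dataset, req.target_site)

# Hypothetical usage: the workflow planner hints at upcoming inputs.
svc = DataPlacementService(lambda d, s: print(f"staging {d} -> {s}"))
svc.hint("run42/input.h5", "hpc-site-A", expected_access_time=120.0)
svc.hint("run42/calib.dat", "hpc-site-A", expected_access_time=60.0)
svc.run_once()   # stages calib.dat first, since it is needed sooner
```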

Mapping the Early Universe with a Next Generation Radio Telescope of Silicon and Software

Lincoln Greenhill, Harvard-Smithsonian Center for Astrophysics; Daniel Mitchell, Steven Ord, Randall Wayth; Smithsonian Astrophysical Observatory

The universe began with the expansion and cooling of the Big Bang, with particles eventually combining to form a dark sea of atomic hydrogen. Over time, gravity drew material together, giving rise to the earliest stars, black holes, and galaxies. Intense ultraviolet radiation, over time, heated and then destroyed the neutral hydrogen. The “dark sea” parted, and the era of reionization, which lasted a billion years, brought about the most important structures in the universe we know. Yet we have only vague notions of how the universe evolved during this time. The best way to study reionization is to map the evolving distribution of hydrogen. The Mileura Wide-field Array (MWA) will do this for the first time; it is a new-concept, digital radio “camera” in which the traditional telescope optics of lenses and reflectors are effectively replaced by software and high performance computers. The MWA computer pipeline will absorb 128 gigabits of data per second in real time (24×7), execute calibration and Fourier-transform image construction on the fly, and accumulate reduced data to enable output at a manageable few hundred TB per year, a 1000x reduction. This is one of the larger computing challenges in radio astronomy, and would have been impractical to attempt without recent computing advances. I will describe known MWA computing challenges, with emphasis on throughput and I/O, pipeline parallelization, possible application of GPUs, use of instrument simulations in algorithm and software development, scaling to future instruments, and collaboration thus far with the IIC.
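
A rough back-of-the-envelope check of the quoted figures: 128 gigabits per second is 16 gigabytes per second, so a year of continuous operation corresponds to roughly half an exabyte of raw data, and a 1000x reduction lands in the few-hundred-terabyte range cited above.

```latex
16\ \mathrm{GB/s}\times 3.15\times 10^{7}\ \mathrm{s/yr}
\approx 5\times 10^{8}\ \mathrm{GB}\approx 500\ \mathrm{PB/yr},
\qquad
\frac{500\ \mathrm{PB/yr}}{1000}\approx 500\ \mathrm{TB/yr}.
```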

Building Next-generation CyberCollaboratory for Environmental Observatories

Yong Liu, James Myers, Barbara Minsker, Joe Futrelle, Steve Downey; National Center for Supercomputing Applications; Il-hwan Kim, University of Michigan; Esa Rantanen, National Center for Supercomputing Applications

Providing community-scale infrastructure while enabling innovation by individual researchers is a central challenge for eScience efforts. Since 2004, the CyberCollaboratory, which is built on top of the open source Liferay portal framework, has been part of the efforts at the National Center for Supercomputing Applications to build national cyber infrastructure to support collaborative research in environmental engineering and sciences. The CyberCollaboratory was used by the Collaborative Large-scale Engineering Analysis Network for Environmental Research (CLEANER), now the WATer and Environmental Research Systems (WATERS) Network, project office and by several CLEANER/WATERS test bed projects. Of more than 400 registered users, over 100 have been actively involved in the CyberCollaboratory. However, users have also reported usability issues. For example, users working in multiple groups found it difficult to get an overview of all of their activities and found differences in group layouts confusing. Users also found the standard account creation and group management processes cumbersome and wanted a better sense of presence and social networks within the portal. Keeping the document repository up-to-date as editing was performed on local files and as files were transmitted via email was another concern. As a result of this feedback and of discussions with representatives from the CUAHSI (Consortium of Universities for the Advancement of Hydrologic Science) community, new design and development efforts were initiated in early 2007. This paper reviews the usability feedback and potential design changes and provides a summary of the changes made to the CyberCollaboratory.

Leveraging OGC Sensor Web Enablement and Open Source Enterprise Service Bus for Real-Time Urban Digital Watershed Data Integration and Dissemination

Yong Liu, National Center for Supercomputing Applications; David Fazio, US Geological Survey; Tarek Abdelzaher, University of Illinois at Urbana-Champaign; Barbara Minsker, National Center for Supercomputing Applications

The value of real-time hydrologic data dissemination, including river stage, stream flow, and precipitation, for operational storm water management efforts is particularly high for communities where flash flooding is common and costly. Ideally, such data would be presented within a watershed-scale geospatial context to portray a holistic view of the watershed. Recent efforts to provide unified access to hydrological data have concentrated on creating new SOAP-based web services and common data formats (e.g., WaterML and the Observations Data Model) for data access (e.g., HIS and HydroSeek). The OGC Sensor Web Enablement (SWE) initiative proposes a revolutionary concept; however, these efforts do not facilitate dynamic data integration/fusion among heterogeneous sources, or data filtering and support for workflows or domain-specific applications. We propose a lightweight integration framework that extends SWE with an open source Enterprise Service Bus (e.g., Mule) as a backbone component to dynamically transform, transport, and integrate both heterogeneous sensor data sources and simulation model outputs. We will report our progress on building such a framework, in which multiple agencies' sensor data and hydro-model outputs will be integrated with map layers and disseminated in a geospatial browser (e.g., Virtual Earth). Our project is the result of collaboration between the National Center for Supercomputing Applications, the US Geological Survey, the Illinois Water Science Center, and the Computer Science Department at the University of Illinois at Urbana-Champaign, and is funded by the Adaptive Environmental Infrastructure Sensing and Information Systems initiative.

Semantically Aware On-line Community for Biomedical Researchers

Sudeshna Das, Alister Lewis-Bowen, Louis Weitzman, Tim Clark; Harvard University

We are developing a reusable framework for on-line communities of biomedical researchers. Although there is a growing number of biological knowledge bases, the vast majority of biological information and various resources used by the community (such as cell lines, antibodies, etc.) reside in laboratory notebooks and heterogeneous databases. The context of the data is rarely captured, and information exchange among researchers is usually accomplished via emailing of documents or conversations. Moreover, community websites publishing on-line materials rarely, if ever, link them to the biological information or resources, whereby key knowledge is lost. We are developing the framework as a Drupal (http://www.drupal.org/) distribution integrated with an RDF triple store and some associated Java components. Drupal is a popular content management system and is widely used by various communities to develop their websites. The framework will allow easy publishing of online materials. In addition, the framework will have semantic underpinnings to capture the relationships between research articles, biological entities, profiles of experts, etc. We will use an extension of the SWAN ontology (Clark and Kinoshita, 2007) as our knowledge schema. Our goal is to organize and repurpose on-line material in communities by defining and capturing semantic relationships to existing knowledge repositories. Such a knowledge base will enable richer and more powerful interactions among the many sub-disciplines within the scientific community.
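
Capturing such relationships in an RDF triple store amounts to asserting typed links between resources. The snippet below uses the rdflib library with hypothetical namespace, resource, and property names; the real schema would follow the SWAN ontology extension mentioned above.

```python
from rdflib import Graph, Namespace, URIRef, Literal

# Hypothetical community namespace; the actual schema would extend SWAN.
COMM = Namespace("http://example.org/community#")

g = Graph()
article = URIRef("http://example.org/articles/pmid-12345678")
cell_line = URIRef("http://example.org/resources/cell-line/HEK293")
expert = URIRef("http://example.org/people/jane-doe")

g.add((article, COMM.usesResource, cell_line))   # article -> lab resource
g.add((article, COMM.hasAuthor, expert))         # article -> expert profile
g.add((cell_line, COMM.label, Literal("HEK293")))

# Query: which resources does the article rely on?
for _, _, resource in g.triples((article, COMM.usesResource, None)):
    print(resource)
```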

Sharing Digital Science

David De Roure, The University of Southampton; Carole Goble, University of Manchester

Most computer users are familiar with the practice of sharing individual files, such as text, photos, videos and music, using social tools – Wikis, blogs and social networking sites like Flickr, YouTube and Facebook. Scientists are beginning to share information this way too. However, scientists commonly work with collections of digital items which include experimental plans, documentation, data, results, logs of runs, ‘housekeeping’ information, etc. myExperiment (http://myexperiment.org) is a social space for sharing scientific workflows and associated information – a way for scientists to share reusable pieces of scientific practice. In contrast to photo-sharing on Flickr or videos on YouTube, the basic unit of sharing in myExperiment is not a single file but rather a package of components that make up an experiment – what we call an Encapsulated myExperiment Object (EMO), and others have called Reproducible Research Objects. Notionally an EMO is a folder containing the various assets associated with an experiment. In the scientific context there are stringent requirements with respect to versioning, ownership, intellectual property and the maintenance of provenance information. We have looked at emerging practice in sharing “pieces of science” in the scientific and scholarly lifecycle, from social sites to digital repositories. myExperiment provides simple and extensible support to better understand requirements as new collaborative practice emerges. In this presentation, we will describe the characteristics of EMOs and present our initial design solution which supports the requirements of encapsulation and preserves our principles of simplicity and interoperability.

Simple Standards-based Grids

Andrew Grimshaw, University of Virginia

Providing transparency, and thus minimizing the effort required by users to integrate and use their code and data in a grid, is both practical and desirable. The lack of easy data integration and access within a grid is a major barrier for a large number of potential grid users, because they physically cannot change their code (the code is commercial or they do not have the source code) or because they do not have the time to devote to performing the necessary integration. At a macro level it is desirable to remove such burdens from end users: the time they devote to grid integration activities is time taken away from working in their area of expertise, their science and research, while lowering the integration effort will encourage more users to take advantage of the benefits that data and compute grid systems offer. This talk will focus on the data grid capabilities of the Genesis II grid system. Genesis II is an open implementation of grid standards emerging from the Open Grid Forum. Specifically, Genesis II implements WS-Naming, the HPC-Profile, OGSA-BES, OGSA-ByteIO, RNS, and the draft OGSA Express Authentication Profile suite.

Computationally-intensive Tasks in Medical Imaging Informatics

William Horsthemke, Daniela Raicu, Jacob Furst; DePaul University

Medical imaging informatics addresses initiatives to improve the performance of clinical radiology. These efforts range from managing images for reading by radiologists to computer-aided diagnosis. Many projects require significant image processing to extract image features for use in diagnosis or as reference queries for retrieving other images with similar characteristics. The effectiveness of such projects often depends on having large image data sets. Given the computational complexity of many image processing techniques and the number and size of medical images, medical imaging informatics tools are limited by hardware resources. Many tasks can be parallelized or adapted to distributed processing as available on grid-based technology, such as image-processing feature extraction, dataset storage, content-based image retrieval (CBIR), and computer-aided diagnosis (CAD). We propose using grid technologies for three specific medical imaging tasks: 1) automatic segmentation of liver tissue in computed tomography (CT) of the abdomen, 2) CBIR for retrieving lung nodule cases in CT, and 3) classification of tumors in mammography images. Each task has a significant requirement for image processing to extract low-level features; the feature independence, as well as the presentation of data as a grid of pixels, provides excellent opportunities to use grid technology. The high-level algorithms built on extracted image features (segmentation, similarity measures, and machine learning, respectively) can be run in parallel in a number of different ways: across image slices, across the number of retrieved images, and across independent machine learning steps. A focus on grid-enabled techniques will permit inclusion of computationally complex algorithms and larger datasets than otherwise acceptable for the near-real-time performance requirement of clinically usable medical imaging applications.
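
The slice-level parallelism mentioned above maps directly onto a worker pool: each slice is processed independently and the per-slice features are gathered at the end. The sketch below is a generic illustration with simple intensity statistics standing in for real texture or shape features, and a random array standing in for a CT volume.

```python
import numpy as np
from multiprocessing import Pool

def slice_features(ct_slice):
    """Toy per-slice features; a real pipeline would extract texture,
    shape, and other low-level descriptors here."""
    return float(ct_slice.mean()), float(ct_slice.std()), float(ct_slice.max())

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    volume = rng.normal(loc=40, scale=20, size=(50, 256, 256))  # 50 axial slices
    with Pool() as pool:                      # one task per slice
        features = pool.map(slice_features, list(volume))
    print(len(features), features[0])
```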

Methods for Automated, Real-Time, Public Health Disease Surveillance in Metropolitan Atlanta Using Computerized Integration, Knowledge Management, and Analysis of Multiple Data Streams

Douglas Lowery-North, Eugene Agichtein, James Buehler, Walter Orenstein, Lance Waller, Vicki Hertzberg; Emory University

Disease surveillance remains a challenging, though essential, public health function. Laws mandate that physicians and laboratories report cases or clusters of specific notifiable diseases to public health authorities, and failure to report these incidents in a timely or complete manner may lead to belated recognition of public health threats and lost opportunities for investigation and intervention. We identified how a real-time, automated system could improve public health disease surveillance across three large healthcare systems in Metro Atlanta through the integration and knowledge management of the prediagnostic manifestations of disease (syndromic surveillance) from prehospital, outpatient, and inpatient data sources; the incorporation of laboratory and imaging diagnoses; and a bidirectional interface between the state public health agency and these three healthcare systems. Developing a real-time surveillance system presents many scientific and technological challenges, including: identification of subpopulations of special interest for disease surveillance; knowledge management technologies allowing forecasting based on diverse information gathered from different sources; utilization of free-text records, such as dictations, to improve the responsiveness, sensitivity, and positive predictive value of the surveillance system; development of the analytical tools necessary to detect events of interest; application of performance improvement tactics to improve the value of data collection, analysis, and reporting, and to reduce waste associated with false positives; automation of initial epidemiologic event investigations; and feedback in response to epidemic threats.

Lowering Adoption Barriers for Scientific Workflow Systems

Luigi Marini, Rob Kooper, Peter Bajcsy, James Myers; National Center for Supercomputing Applications

As current scientific workflow systems reach technical maturity, new challenges arise in the areas of usability and user access to advanced functionalities. The mismatch between the expertise of domain scientists and the technical knowledge required to use scientific workflow systems via visual programming is becoming more prominent. While domain scientists greatly benefit from using scientific workflow systems, the adoption barriers are non-trivial. In our development of the Cyberintegrator workflow system, we have investigated an exploratory, macro-recording-style interface as an alternative to visual programming. A macro-recording interface provides a more natural, step-by-step model that makes workflow creation easier. The scientist can focus on available data sets and relevant analytical tools, while the system records the overall workflow. With traditional workflow systems forcing the scientist to focus too much on the lower level engineering details, keeping track of the higher level scientific process can become a challenge. We have explored ways to make the use of support tools required for lower level data manipulation (loading, translation and visualization) more transparent to the scientist. The resulting interfaces have a stronger focus on science and support both scientific and engineering views of workflows. Since scientific research is often done in a community setting, simple ways to capture and share personal annotations in workflow editors would be extremely useful. We have looked into the addition of a community annotation system, which allows easy sharing of annotations about data, tools and workflows. We discuss issues encountered and design choices made when trying to lower adoption barriers of scientific workflow systems. We include examples from our experience designing and implementing the Cyberintegrator.

A Novel Scalable Informatics Infrastructure for Predictive Health Advances

Eva Lee, Qifeng Lin, Kyungduck Cha, Calton Pu, Georgia Institute of Technology; Lynn Cummingham, Kenneth Brigham, The Emory / Georgia Tech Predictive Health Institute

The Emory / Georgia Tech Predictive Health Institute is a new model of healthcare that focuses on maintaining health, rather than treating disease. Through meta-analysis of multiple heterogeneous attributes (e.g., biological, genetic, clinical, behavioral, and environmental), PHI researchers seek to identify and measure risks and mechanisms of disease, and ultimately to promote health maintenance. When there is a potential health problem, predictive health aims to intervene at the very earliest indication, based on an individual's personal profile, and restore normal function. A fundamental component of the PHI scientific mission is a scalable and extensible informatics framework (SEIF). In this talk, we will present our design and development of SEIF. SEIF is built using a 3-tier architecture that includes 3 major engines: the database server (DBS); the model interpreter (MI); and the information protection, propagation, and access module (PWEB). DBS incorporates distributed clinical/translational data, participant surveys, and complex images using various databases including Oracle, MySQL, sequential files, and novel in-house models (e.g., for complex metabolomics data). MI employs semantics, relational and data mappings, and performs code generation to accommodate the evolutionary nature and heterogeneity of data, new data types, and national standards. The dynamic capability and flexibility of automatic code generation allows for re-organizing, re-loading, and re-querying of meta/heterogeneous data, and is of paramount importance. PWEB offers secure multi-tier privileged user login. PHI participants, health partners, and researchers have different levels of data access requirements, and each is allowed to perform the necessary functionality through a web portal. Various features and the scalability of SEIF for broad usage will be discussed.

Customizing Windows Workflow to Enable Design of Control, Data, and Conditional Flow for eScience Applications

Furrukh Khan, The Ohio State University

Applications that enable scientists to visually design the control flow (flowchart), as well as the dataflow and conditional logic flow, for interruptible programs (workflows) in their own Domain Specific Languages have obvious applications in eScience. The Windows Workflow (WF) runtime provides a light-weight and powerful engine for running interruptible programs that can be automatically persisted and tracked by WF; however, a designer that can be used to visually construct control flow as well as dataflow and conditional logic flow is lacking at present. The WF designer allows only control flow to be visually designed; it cannot be used to wire together dataflow or conditional expressions. Nor can the stock designer be used to design WF programs in browser-based applications. Fortunately, one can exploit various extensibility points to craft domain-specific custom designers and loaders that interface directly with the WF runtime, thus bypassing the stock loader and designer. We first introduce the audience to workflows, then we discuss the powerful extensibility features of the WF runtime and demonstrate how we have leveraged these features to implement our own custom designer and loader. Scientists can use these to visually design and wire together not only flow of control (flowchart) but also complex dataflow and conditional logic. The custom designer can be implemented as a desktop or web application that can further exploit Ajax technology for responsive browser-based eScience applications. Finally, we show how our designer for Windows WF is being used by scientists in the domain of human cancer research.

Integration of Ecohydrologic and Geomorphic Processes Within a Distributed Watershed Model: Applications to the Prediction of Ecosystem Patterns, Runoff Production and Landslide Risk

Lawrence Band, Sdhyok Shin, Taehee Hwang; University of North Carolina at Chapel Hill; Mark Reed, Matts Rynge, Lisa Stillwell; Renaissance Computing Institute; Jonathan Goodall, University Of South Carolina; Kenneth Galluppi, Renaissance Computing Institute

We describe a project that is developing and applying integrated ecohydrologic and geomorphic process models with mesoscale climate simulation to predict spatially distributed soil moisture, saturation, flash flood, and landslide potential in southern Appalachian catchments. Landslides and flash floods are both major landscape-forming processes and significant hazards in this region. Landslide risk depends on local topographic and soil conditions, long-term changes in canopy cover and root structure, as well as transient moisture and saturation conditions from individual and recent storm events. Recent increases in tropical storm intensity, development, and road construction in these mountainous areas may be increasing these hazards, as evidenced by major property damage and fatalities in the set of tropical storms experienced in the last few years. The modeling approach links GIScience-based ecohydrological and geomorphic process models that incorporate catchment patterns in soils, canopy conditions including root structure, and hillslope hydrologic routing, resulting in the development of space/time patterns of soil moisture, runoff, and the critical pore pressures that induce debris avalanches. Long-term simulations are first used to develop spatially distributed ecosystem properties including canopy cover, LAI, and root biomass. The potential for a forecast system is explored by driving the model with Land Data Assimilation System (LDAS) meteorological fields in near real time, then substituting WRF high-resolution forecasts when major events are approaching. We use study catchments in the Coweeta basin.

Innovative Computerized Methods for the Epidemiological Assessment of Disease Exposure Using Social Networks in the Emergency Department

Douglas Lowery-North, Eldad Haber; Emory University; Susanne Hardy, Philadelphia College of Osteopathic Medicine; Christopher Vaughns, Georgia Institute of Technology; Vicki Hertzberg, Emory University

The threat of highly pathogenic avian influenza, and of a resulting pandemic, has added a renewed sense of urgency to the scientific community's search for ways to recognize, prevent, and control the spread of disease. The goal of our research is to develop innovative epidemiological tools, using modeling and simulation, that can predict the propagation paths and outcomes of infectious disease from the original exposure to an ill patient in the emergency department, which is a major opportunity for the spread of infection. Such models, based in social network theory, will evaluate the probability of spreading disease through staff-patient, patient-patient, and staff-staff interactions. Coupled with clinical data, accurate measurements of contact among patients and hospital staff provide reliable estimates of the context, duration, and distance of these contacts. One way to measure these interactions is through the use of radiofrequency identification (RFID) technology. Here we describe a study to evaluate the use of RFID technology to obtain location data for patients and staff in a large, urban ED, with computer algorithms processing the location data and developing network models. We present a number of scientific and technological challenges related to defining "contact", the level of precision in terms of location and frequency required to develop a robust model, how to use trilateration/multilateration to obtain location information, which facility design factors impact the data collection, and how we assess the impact of personal protective equipment use.
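
Multilateration from estimated tag-to-reader ranges reduces to a small least-squares problem once the reader positions are known. The sketch below linearizes the range equations against a reference reader, a standard textbook approach offered only to illustrate the geometry, not the study's actual positioning algorithm; all coordinates are made up.

```python
import numpy as np

def multilaterate(readers, distances):
    """Estimate a 2-D position from distances to fixed reader positions.

    readers:   (n, 2) array of known reader coordinates, n >= 3
    distances: (n,)  array of estimated ranges to each reader
    Linearizes the range equations against reader 0 and solves the
    resulting overdetermined system by least squares.
    """
    readers = np.asarray(readers, dtype=float)
    d = np.asarray(distances, dtype=float)
    x0, d0 = readers[0], d[0]
    A = 2.0 * (readers[1:] - x0)
    b = (np.sum(readers[1:] ** 2, axis=1) - d[1:] ** 2
         - np.sum(x0 ** 2) + d0 ** 2)
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos

readers = [(0, 0), (10, 0), (0, 10), (10, 10)]       # reader positions (m)
true_pos = np.array([3.0, 7.0])
dists = np.linalg.norm(np.array(readers) - true_pos, axis=1)
print(multilaterate(readers, dists))                 # approximately [3. 7.]
```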

Tracking Environmental Change through the Data Resources of the Bird-monitoring Community

Mirek Riedewald, Rich Caruana, Daniel Fink, Wesley Hochachka, Steven Kelling, Art Munson, Ben Shaby, Daria Sorokina; Cornell University

The Avian Knowledge Network (AKN) is a collaboration between the Cornell Lab of Ornithology and researchers from Cornell's departments of computer science and statistics. Our team is accumulating one of the largest and most comprehensive biodiversity data sets in existence. Data is contributed by many partner organizations, including the US Geological Survey, Point Reyes Bird Observatory, and Bird Studies Canada. Additionally, the AKN is harvesting a variety of environmental attributes, including habitat and human population demography, to create an enormous data resource, currently with over 35 million bird observations, each linked to more than 1000 environmental attributes. Ultimately, our goal is to use this resource to synthesize biologically useful information for conservation and science. We summarize the challenges we faced, how we have addressed them so far, and what still needs to be done. One major challenge was to make the data available to a broad audience; our solution involved defining the Bird Monitoring Data Exchange and setting up a federated architecture based on Grid technologies, accessible through simple Web interfaces. Another challenge is to use AKN data to study biodiversity. We are approaching it at two levels. The first is to use powerful non-parametric supervised learning techniques to build models that make accurate predictions of organism distribution and abundance as a function of environmental effects. The other is to identify the environmental features determining distribution and abundance, in order to discover their effects on bird populations, by analyzing the learned models with novel data mining techniques that can handle massive high-dimensional data.

A Computerized Application for the Prediction, Prevention, Diagnosis, and Management of Bottlenecks and Gridlock in Healthcare Systems

Susanne Hardy, Philadelphia College of Osteopathic Medicine; Vicki Hertzberg, Emory University; Marilyn Margolis, Rehman Meghani, Jamie York, Emory Healthcare; Douglas Lowery-North, Emory University    

Emergency Departments (EDs) provide a vital safety net function for healthcare, public health, and disaster preparedness in the US. In recent years, the ability of EDs to accomplish these missions has been threatened by severe crowding. This functional decline has come at the same time that the demand for the unique services of EDs has increased. Many EDs utilize electronic patient tracking systems that offer little more than an electronic version of the traditional patient location chalkboard. While these tools provide an excellent mechanism for the management of individual patients, they fail to function as a tool for department-level management, and EDs continue to rely upon human resources to intuit system flow from this representation. Even systems that incorporate dashboard indicators have done little more than create tachometers for specific processes within the ED. We have developed an automated, computerized function that can predict, prevent, diagnose, and manage ED crowding. This program draws on the data displayed on the patient-tracking boards in two urban EDs and makes decisions about the deployment of resources to alleviate bottlenecks and relieve crowding. Still, many challenges exist related to the construction of the human-computer interface, the methods for applying knowledge management technologies to the engine, the mechanisms for incorporating artificial intelligence capabilities into this functionality, the means to develop more robust predictive abilities, and the ability to apply and interface this technology to other flow-dependent areas within healthcare systems.

The Optiputer Microscopy Demonstrator

David Wallom, Angus Kirkland; University of Oxford; Mark Ellisman; University of California, San Diego    

The Optiputer Microscopy demonstrator builds on the capabilities of the materials group at the University of Oxford (Angus Kirkland) and the Biosciences group at the University of California, San Diego (Mark Ellisman), each of which needs specific microscopy capabilities to further enable its research. This project will demonstrate how appropriate infrastructure can enable remote science experiments and provide new science capabilities by building on existing knowledge and facilities. The instruments at San Diego and Oxford represent state-of-the-art instrumentation in, respectively, intermediate voltage and aberration-corrected geometries. Both instruments have nearly identical local hardware configurations and similar external interfaces. In combination, they therefore represent a unique opportunity to link biological and materials science technical expertise across two bespoke instruments. We are constructing an infrastructure consisting of the microscopes at San Diego and Oxford, together with lambda networks (UKLight and StarLight), appropriate data storage and shared data management schemes, and integration of local computational resources. This will include a medium-sized CCS cluster at Oxford as well as high-definition visualization using a tile-wall system at each participating site. The data system has been specifically designed to allow easy sharing of stored images as well as real-time processing of collected images.

The National Science Foundation’s Cyber-Enabled Discovery and Innovation (CDI) and e-Science: What are the Challenges?

Maria Zemankova, National Science Foundation    

Microsoft has been supporting eScience Workshops and relevant research for several years. The National Science Foundation's (NSF) FY 2008 Budget Request to Congress includes $52 million to support the first year of an initiative on Cyber-enabled Discovery and Innovation (CDI) with the objective to “Broaden the Nation's capability for innovation by developing a new generation of computationally based discovery concepts and tools to deal with complex, data-rich, and interacting systems” [www.nsf.gov/about/budget/fy2008]. The European Commission supports a study on eScience Digital Repositories (eSciDR) to drive the development and use of digital repositories in the EU in all areas of science [http://www.e-scidr.eu/]. The 2007 IEEE International Conference on eScience and Grid Computing is meeting in Bangalore, India [www.escience2007.org/index.asp]. Google Earth has “put the world's geographic information at your fingertips” [http://earth.google.com], the Sloan Digital Sky Survey is “Mapping the Universe” [www.sdss.org/] and bringing the possibility of making a discovery to researchers, amateur star-gazers, and kids around the world, CERN [http://public.web.cern.ch/] is studying the particles the universe is made of, proteomics researchers around the world [www.wwpdb.org] are trying to decipher what we are made of, etc. Sensors are busy collecting more global change data, data mining algorithms are churning out new discoveries, and Scientometrics [www.springerlink.com/content/101080] is trying to help us understand all the existing knowledge that we are expanding at a staggering rate. It looks like all is well, but is it? We will discuss some challenges of eScience and the steps NSF is taking in addressing them, and we would also like to elicit suggestions from the global eScience community.

The Creation and Implications of Robotic Tool-users

Lloyd Williams, Thomas Horton, Robert St. Amant; North Carolina State University

For many years, tool-using behavior was considered a benchmark by which “intelligent” organisms could be identified. While this special status of tool use has lessened over the years, examining how animals use tools remains a standard practice in the study of biological cognition. The strong link between “intelligent” behavior and tool-use has led us to examine modeling these types of behaviors in robots. We consider robots not simply as assistants to intelligent humans, but also potentially as models of biological organisms. The development of such robotic models provides insight into the mechanisms that support tool use in humans and other animals, and serves as a test bed for exploring theories of cognition from psychology, neurobiology, and related fields. There are practical implications to any insight we gain into the theoretical underpinnings of tool-use. While robot actors are seen in a wide range of environs, from manufacturing floor to laboratory, for the most part they do little more than simply follow instruction sets and are severely limited in their ability to respond to more dynamic environments. The mechanisms we are employing to model tool use, while still preliminary, are relevant to extending robots’ capabilities beyond highly constrained tasks. Our work aims to create a robot architecture capable of interpreting visual information in the context of tool-using behaviors, recording its experiences, and building semantic networks that represent conceptual relationships between actions and the properties of objects in the environment. We have developed a proof-of-concept architecture on the Sony Aibo, a non-anthropomorphic mobile robot with grasping ability that allows it to solve the familiar “monkey and bananas” problem by using a tool to touch an out-of-reach object.

Managing Interpolated Data in Environmental Science Applications

Karl Aberer, Yongluan Zhou; Ecole Polytechnique Federale de Lausanne

The emergence of novel sensing devices and sensor network technologies provides a whole new opportunity for global environmental studies and environment-related decision making. The Swiss Experiment is a newly initiated multidisciplinary project aimed at building a large scale platform, to support field investigations of environmental processes, which is based on new sensor and data management technology. This talk focuses on some data management issues that occur in our practical study. In environmental monitoring, the raw measurement data are typically transformed into an interpolated grid before performing analyses, such as visualization, simulation etc. The typical interpolation models used by the scientists include deterministic ones, such as triangulation, and statistical ones, such as kriging. The interpolated grid can be considered as a view over the raw data tables, however with much higher data density. Directly storing the resulting views would incur a data explosion problem, while computing views on the fly from scratch would be too unresponsive. To enhance the storage, maintenance and querying efficiency, we identify the static parts of the intermediate computational results of the interpolation models and choose to store and maintain them in a database instead of the dynamically changing final interpolated values. The final values of the grid points can be computed on demand in an efficient way. This technique can also be applied in efficiently computing the interpolation view over real-time streaming data. Building a data warehouse over all the historical data can tremendously help the scientists to perform their analysis. Here an interpolated view should be used as the fact table to feed the cube. Again, the static intermediate results are identified and stored to optimize the storage usage, the maintenance cost as well as the query performance.
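
The separation between static and dynamic parts described above is easy to see in a weight-based interpolator: as long as station locations and the output grid stay fixed, the weight matrix can be computed once and stored, and only a matrix-vector product has to be redone as new measurements stream in. The sketch below uses inverse-distance weights as a simple stand-in for kriging weights and invented station/grid geometry; it illustrates the idea rather than the project's actual models.

```python
import numpy as np

def idw_weights(stations, grid_points, power=2.0, eps=1e-9):
    """Precompute the (n_grid x n_stations) inverse-distance weight matrix.

    This is the 'static' part: it depends only on geometry, so it can be
    stored in the database and reused for every new batch of measurements.
    """
    d = np.linalg.norm(grid_points[:, None, :] - stations[None, :, :], axis=2)
    w = 1.0 / (d + eps) ** power
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
stations = rng.uniform(0, 100, size=(30, 2))          # 30 sensor locations
gx, gy = np.meshgrid(np.arange(0, 100, 5.0), np.arange(0, 100, 5.0))
grid = np.column_stack([gx.ravel(), gy.ravel()])

W = idw_weights(stations, grid)                       # computed once, stored

# The 'dynamic' part: each new round of measurements is a cheap product.
measurements = rng.normal(loc=10, scale=2, size=30)   # e.g. temperatures
interpolated = W @ measurements                       # values on the grid
print(interpolated.shape)                             # (400,)
```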

The Chronopolis Digital Preservation Datagrid: Archival Policy Management in Support of Long-lived eScience

David Minor, Robert McDonald; San Diego Supercomputer Center

In recent years there has been growing discussion about the infrastructure needed to support distributed or federated preservation environments. This data infrastructure, known variously as a component of e-science or cyber infrastructure, would provide the base on which to organize, preserve, and make accessible over time the intellectual capital that is being created via research in science and engineering. The San Diego Supercomputer Center, along with the U.C. San Diego Libraries, the National Center for Atmospheric Research, and the University of Maryland Institute for Advanced Computer Science, have formed a collaborative partnership called Chronopolis. The underlying goal of the Chronopolis partnership is the creation of a digital preservation environment to curate this intellectual capital at a national scale and provide science with a long-term preservation infrastructure. As a first step toward developing a working Chronopolis prototype, the partner sites have begun development of replicated collections between themselves and several other institutions. In addition to working on the mechanics of large-scale data replication, each site is developing its own policies for collection management with the goal of creating policies that can interact independently as well as cross-institutionally. This iterative process is a first step-towards a model cross-institutional strategy that will eventually extend to anyone working in a Chronopolis preservation environment. Our poster will highlight current preservation policies and procedures across the Chronopolis Preservation Datagrid. It will clearly display how the institutions are interacting with each other and the relationships with our initial partner sites. It will also show how the infrastructure is being used in scientific collections such as data from the National Virtual Observatory.

Indexing with Space-filling Curves in Databases

Tamas Budavari, Alex Szalay, Gyorgy Fekete, The Johns Hopkins University; Gerard Lemson, Max Planck Institute for Astrophysics; Istvan Csabai, Laszlo Dobos, Eotvos Lorand University; Jose Blakeley, Microsoft

We present a general indexing scheme for multi-dimensional data sets fine-tuned for relational databases. Our approach is to utilize appropriate hierarchical space-filling curves, and organize the data sets accordingly. Spatial queries of complicated shapes defined by exact mathematical equations are first approximated by unions of cells in the particular hierarchical pixelization scheme that in turn translate into efficient range queries in SQL. The expensive evaluation of the actual mathematical constraints is only performed on a tiny subset of the data at the boundary of the query shapes. The technique has proven to work beautifully in practice for various topological manifolds such as the 3D Euclidean space and the surface of the sphere, and is easily generalized for other problems. Our C# implementations running in the CLR of SQL Server 2005 are currently enabling scientists to routinely search terabytes of astronomy data from the state-of-the-art multicolor observations and N-body simulations, namely the Sloan Digital Sky Survey and the Millennium Run.
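
The general idea, mapping each point to a key on a space-filling curve, indexing that key, and turning a spatial query into a union of key ranges with the exact geometric test applied only to the returned rows, can be illustrated with a simple Z-order (Morton) curve. The hierarchical curves and SQL used in the actual system (sphere pixelizations running inside SQL Server) are more sophisticated; the table and column names below are hypothetical.

```python
def interleave16(x, y):
    """Morton (Z-order) key for 16-bit integer coordinates."""
    key = 0
    for bit in range(16):
        key |= ((x >> bit) & 1) << (2 * bit)
        key |= ((y >> bit) & 1) << (2 * bit + 1)
    return key

def cover_ranges(xmin, xmax, ymin, ymax, level=4):
    """Approximate a rectangle by key ranges of coarse Z-order cells.

    Each cell at `level` (cell side 2**level) maps to one contiguous block
    of 4**level full-resolution keys; adjacent blocks are merged.
    """
    size = 1 << level
    blocks = set()
    for cx in range(xmin >> level, (xmax >> level) + 1):
        for cy in range(ymin >> level, (ymax >> level) + 1):
            start = interleave16(cx * size, cy * size)
            blocks.add((start, start + size * size - 1))
    ranges = []
    for lo, hi in sorted(blocks):                 # merge contiguous blocks
        if ranges and lo == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], hi)
        else:
            ranges.append((lo, hi))
    return ranges

ranges = cover_ranges(100, 140, 200, 260)
# These ranges become cheap indexed predicates; the exact geometric test
# is then applied only to the rows they return, mirroring the approach above.
sql = " OR ".join(f"zkey BETWEEN {lo} AND {hi}" for lo, hi in ranges)
print(f"SELECT * FROM objects WHERE {sql}")
```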

Providing Online Access to High-Resolution LiDAR Topography Datasets for Earth Science

Chaitan Baru, San Diego Supercomputer Center; Ramon Arrowsmith, Chris Crosby; Arizona State University; Parag Namjoshi, Viswanath Nandigam; San Diego Supercomputer Center

The GEON project (www.geongrid.org) has developed a system to provide online access to large high-resolution LiDAR topography datasets. This system is available as a portlet in the GEON portal (http://portal.geongrid.org/lidar) and is in use for a number of earth science studies. Example applications of these data include mapping of active faults in California to better understand earthquake potential, studies of landscape development in coastal California, and validation of satellite remote sensing data. Currently, the GEON LiDAR portlet serves 4 different data sets totaling over 7 billion data points and approximately 2 TB. This system has also been selected as the primary distribution pathway for LiDAR data to be acquired by the GeoEarthScope component of the NSF-funded EarthScope project (which will entail more than 20 billion additional points and a significantly larger user community). The current implementation, using DB2 with spatial indexing on a 32-way IBM P690, is being migrated to parallel DB2 on a Linux cluster, where we are experimenting with data partitioning strategies for spatial data. Over 90 researchers have been actively using the LiDAR portal. We have analyzed user access patterns and plan to apply this information to database tuning, pre-computing derived products, and developing other strategies to improve overall access times. We will present the current implementation and future plans, and the use of high performance computing to serve LiDAR and other remote sensing data to the research community.

Astroinformatics: The New eScience Paradigm for Astronomy Research and Education

Kirk Borne, George Mason University

The growth of data volumes in science is reaching epidemic proportions. To cope with that flood, data-driven science is becoming a new research paradigm, on a par with theory and experimentation. This concept was introduced by Jim Gray as the new science of X-Informatics. Informatics is the discipline of organizing, accessing, mining, analyzing, and visualizing data for scientific discovery. We will describe astroinformatics, the new paradigm for astronomy research and education, focusing on existing eScience infrastructure (such as the National Virtual Observatory) as well as new eScience education initiatives. The latter include the new undergraduate program in Data Sciences at George Mason University, through which students are trained in eScience tools to discover and access large distributed data repositories, to conduct meaningful scientific inquiries into the data, to mine and analyze the data, and to make data-driven scientific discoveries. The data flood is also in full force outside of the sciences. The application of data mining, knowledge discovery, and e-discovery tools to these growing data repositories is essential to the success of agencies, economies, and scientific disciplines. Consequently, many scientific disciplines are developing sub-disciplines that are information-rich and data-based, to such an extent that these are now recognized as stand-alone research disciplines and academic programs on their own merits. The latter include bioinformatics and geoinformatics, but will soon include eScience, astroinformatics, health informatics, and data science. We will compare these and then focus on the new discipline of astroinformatics as key to the future success of astronomy and astrophysics research. We will describe this within the context of the new CODATA initiative ADMIRE (Advanced Data Methods and Information technologies for Research and Education).

3D Object Matching Using 3D Euclidean Integral Invariant Signature

Shuo Feng, Djamila Aouada, Hamid Krim; North Carolina State University

In this paper, we propose a new 3D object representation and matching algorithm. A 3D object may be viewed as a 2D surface, and the Global Geodesic Function (GGF) of any point on the surface is defined as a normalized integrated distance from this point to all other points on the surface. With the help of the GGF, a 3D object may be represented by a set of level curves of the GGF. This representation is invariant to rigid body movement, and an object will be represented by the same set of curves under isometric transformations. The representation also takes advantage of another nice property of the GGF: no reference point is required. Although a 3D object is represented by the same set of curves under rigid body movement, the curves may still undergo translation, rotation, scaling, and isometric transformations. Comparing curves under these transformations is a great challenge in the object matching stage. We propose a novel Integral Invariant Signature, which may eliminate the effect of translation, rotation, scaling, and isometric transformations. The variations of a space curve under isometric transformations are mapped into the same signature curve, and the comparison is dramatically simplified. Since integration may smooth out zero-mean noise, the integral invariant signature is insensitive to noise. Other advantages of Integral Invariant Signatures, such as independence from parameterization (curve sampling) and initial point selection, also help to simplify the matching procedure and improve the matching performance. We pick a subset of 25 models from 5 object models with articulating parts from the McGill 3D Shape Benchmark to evaluate the matching performance, and promising results are shown in the paper.
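The sketch below is a rough, hedged illustration of the GGF-based representation, approximating geodesic distances by shortest paths over the mesh edge graph; the function names and the banding into level sets are ours, not the authors' implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def global_geodesic_function(vertices, edges):
    """Approximate GGF: for each vertex, the integrated (summed) graph-geodesic
    distance to all other vertices, rescaled to [0, 1]; no reference point needed."""
    n = len(vertices)
    i, j = edges[:, 0], edges[:, 1]
    w = np.linalg.norm(vertices[i] - vertices[j], axis=1)          # edge lengths
    graph = csr_matrix((np.r_[w, w], (np.r_[i, j], np.r_[j, i])), shape=(n, n))
    d = shortest_path(graph, method='D', directed=False)           # geodesic approximation
    g = d.sum(axis=1)
    return (g - g.min()) / (g.max() - g.min())

def level_bands(ggf, n_levels=16):
    """Group vertices into level sets of the GGF; the band boundaries play the
    role of the level curves used to represent (and later match) the object."""
    return np.digitize(ggf, np.linspace(0, 1, n_levels + 1)[1:-1])
```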

Data Management, Analysis and Decision Making Services for Scientists

David Leahy, Paul Watson; Newcastle University

Scientists carry out experiments which generate extensive volumes of raw data and then apply analytical techniques to reduce the data to a form that simplifies comparison of experimental results under varying conditions. A further level of analysis is applied to draw conclusions as to the relationship between these variables and the summarized experimental results. They use the patterns uncovered to hypothesize new phenomena and to make decisions. In the drug discovery domain, variables of interest include the impact of different disease states on the behavior of tissues and the effects of treatment with chemical substances (i.e., real and potential drugs). Data analysis develops understanding, such as which biological components are implicated in disease or how the structure of a chemical is related to its impact on a biological system. For this understanding to translate into value it should also inform decision making, which in this case could be, "will this chemical be a successful drug?" The talk presents two examples of the process of data management and analysis through to decision making and describes the underlying architecture to support this. It builds on work underway at Newcastle in two areas: CARMEN, an infrastructure "in the clouds" for supporting scientific research and collaboration, and the Discovery Bus, a novel "Competitive Workflow" system for facilitating decision making.

Distributed Computing Solutions for Petabyte-scale Data Analysis in Particle Physics Experiments

Ying Ying Li, University of Cambridge; Karl Harison, University of Cambridge; Michael Parker, University of Cambridge; Vassily Lyutsarev, Microsoft Research; Andrei Tsaregorodtsev, CPPM CNRS-IN2P3

Particle physics studies the fundamental building blocks of nature and the interactions between them, with current understanding embodied in the subject's Standard Model. The Large Hadron Collider (LHC), the world's highest-energy particle accelerator, starts operation at the European Laboratory for Particle Physics (CERN), Geneva, in 2008, and will be the key testing ground for the Standard Model over the next decade or more. The four main LHC experiments, involving thousands of physicists from around the world, each need to analyze data volumes of the order of petabytes per year, about a factor of 10,000 higher than in the previous generation of CERN collider experiments. Processing of these massive amounts of data relies on the use of globally distributed computing resources, made available in the context of international Grid projects. This presentation illustrates the solutions developed for optimizing use of these resources, taking one experiment, LHCb, as an example. In particular, details are given of the experiment's workload-management system, DIRAC, and Grid user interface, Ganga. With DIRAC, sites offering resources launch agents that pull processing requests (jobs) from a central server. The system is being used successfully to coordinate the running of many thousands of jobs per day on over 6000 CPUs, distributed across more than 80 sites and 4 continents. Ganga is a job-management framework that provides a uniform interface for accessing multiple processing systems, making it trivial to switch from tests on local batch queues to a full-scale analysis on a Grid-based system. DIRAC and Ganga both have a component architecture that readily allows customization for applications outside of particle physics. Ganga, for example, has been used in activities as diverse as software regression testing, drug…
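The pull model is the architecturally interesting piece: instead of a broker pushing work to sites, each site runs a lightweight agent that asks the central task queue for work it can satisfy. The toy loop below is a hedged sketch of that pattern only; the URLs, JSON fields, and endpoints are invented for illustration and are not DIRAC's actual protocol or API:

```python
import time
import subprocess
import requests

MATCH_URL = "https://example.org/taskqueue/match"     # hypothetical central server
REPORT_URL = "https://example.org/taskqueue/report"   # hypothetical status endpoint

def run_agent(site, tags):
    """Pilot-style agent: advertise local capacity, pull a matching job, run it, report back."""
    while True:
        resp = requests.post(MATCH_URL, json={"site": site, "tags": tags}, timeout=30)
        job = resp.json()
        if not job:                                   # nothing matched: back off and retry
            time.sleep(60)
            continue
        rc = subprocess.call(job["command"], shell=True)
        requests.post(REPORT_URL,
                      json={"job_id": job["id"],
                            "status": "Done" if rc == 0 else "Failed"},
                      timeout=30)

run_agent(site="EXAMPLE.Site.org", tags=["x86_64", "8GB"])
```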

PAS: A Wireless-enabled Personal Assistance System for Independent Living

Jennifer Hou, Zheng Zeng, Sammy Yu, Wook Shin; University of Illinois; Stanley Birge, Washington University in Saint Louis

The aging of baby boomers is creating social and economic challenges. As the population ages there will be an increasing demand on health care resources. Fortunately, advances in sensing, localization, event monitoring, and wireless communications technologies make possible the non-obtrusive supervision of the basic needs of frail elderly people and thereby replicate services of on-site health care providers. It is postulated that implementation of a cost-effective, secure, and open personal assistance system (PAS) that provides real-time interaction between elderly people and remote care providers can delay their transfer to skilled nursing facilities and improve the quality of their lives. We have been in the process of designing, developing, and deploying such a wireless-based software infrastructure. PAS exploits inexpensive, "off the shelf" technologies to help elderly people maintain the capability of independent living through (i) time-based reminders of daily activities sent from healthcare providers through the Internet to the home environment, (ii) monitoring of physiological functions and delivery of the resulting data through the Internet to healthcare providers/clinicians, (iii) non-intrusive localization and tracking of residents with small sensor devices, and (iv) a fall detection and response system to track the impact/orientation of residents and to provide audio communications with the health care provider in case of need. To enhance the robustness and ubiquity of PAS we are also exploring the use of cell phones as both the wireless modem and the local intelligence for data aggregation and acquisition. We are currently working with geriatricians at Washington University in Saint Louis to evaluate, through a randomized clinical trial comparing PAS to the standard of care, the delay PAS achieves in the transition from independent living to a higher level of skilled nursing care.

CANDI - Retrospective on an N-tiered, .Net Remoting, Vendor-Neutral Application Suite for Liquid Chromatographic-Mass Spectrometric Analysis of Small Molecule Purity and Identity

Mark Bean, GlaxoSmithKline

Instrument vendor-independence is a worthwhile software goal, as learning and effectively using software from multiple vendors is non-trivial, expensive, and impractical for hundreds of chemists in their daily work. Such independence allows us to select instruments based on performance rather than software familiarity. There are two solutions to this goal: vendors could adopt a common file format (see the paper by this author on AnIML, an XML standard), or the vendor software could be isolated on servers in an N-tiered application architecture. In an N-tiered architecture, software dependencies (Oracle client, vendor software, PDF creators) are installed on application servers and a shell is created around them in a single service accessible from anywhere on the network. This is the familiar Internet architecture, but it is just as easily implemented for thick Windows clients making remote procedure calls to Windows services. This offers added benefits out of the box, such as multi-threading (for multi-processor servers) and scalability across server sets, both of which can improve performance of processor-intensive scientific applications. A useful addition for application servers is a mechanism whereby new versions and bug fixes can be hot-swapped without restarting the central services, and whereby clients automatically download and run the latest version of the software on startup. This paper discusses the creation and maintenance of a pure application server, using the CANDI software to illustrate some impressive architectural advantages.
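To make the "shell around vendor software" idea concrete, the sketch below exposes a server-side processing function over a simple RPC endpoint, with Python's standard XML-RPC machinery standing in for .NET Remoting; the port, function name, and return fields are hypothetical, and the vendor call itself is a placeholder:

```python
from xmlrpc.server import SimpleXMLRPCServer

def process_lcms_run(raw_file_path):
    """Shell around vendor-specific processing that is installed only on this
    application server (vendor libraries, Oracle client, PDF creator, ...).
    Here the body is a placeholder returning an empty result structure."""
    return {"file": raw_file_path, "purity": None, "identity_confirmed": None}

server = SimpleXMLRPCServer(("0.0.0.0", 8400), allow_none=True)
server.register_function(process_lcms_run)   # thick clients invoke this remotely
server.serve_forever()
```

A client anywhere on the network could then call `xmlrpc.client.ServerProxy("http://appserver:8400").process_lcms_run(path)` without any vendor software installed locally.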

Lessons Learned from the Deployment of a Hydrologic Science Observations Data Model

David Valentine, Ilya Zaslavsky; San Diego Supercomputer Center

The CUAHSI Hydrologic Information System project is developing information technology infrastructure to support hydrologic science. The CUAHSI Observations Data Model (ODM) is a data model to store hydrologic observations data in a system designed to optimize data retrieval for integrated analysis of information collected by multiple investigators. ODM v1 (Tarboton et al., 2007) provides a distinct view into what information the community has determined is important to store, and what data views the community needs. As we began to work with ODM v1, we discovered problems with the approach of tightly linking the community views of data to the database model. ODM v1 was difficult to populate, and the large size of the model hindered the ability to populate the data model and database. Different development groups took different approaches to handling the complexity, from populating the ODM with a bare minimum of constraints to creating a fully constrained data model. This made the integration of different tools difficult. In the end, we decided to utilize the fully populated model, which ensures maximum compatibility with the data sources. Groups also discovered that while the data model's central concept was optimized for retrieval of individual observations, in practice the concept of a data series is better for managing data, yet there is no link between data series and data values in ODM v1. We are beginning to develop ODM v2 as a series of profiles. By utilizing profiles, we intend to make the core information model smaller, more manageable, and simpler to understand and populate. We intend to keep the community semantics, improve the linkages between data series and data values, and enhance data retrieval.

Tarboton, et al. (2007). CUAHSI Community Observations Data Model (ODM), Version 1.0. Retrieved from http://water.usu.edu/cuah.si/odm/files/ODM1.pdf

eQTL: Inferring Associations Between Genes and Gene Expression Phenotypes

Jinze Liu, University of Kentucky

A central focus of genetics is the genetic basis of phenotypic traits and their variation. The recent proliferation of high-throughput bio-technologies has enabled the collection of a wealth of data describing the genetic makeup and phenotypic traits of a given biological system. For example, genome-wide SNP (Single Nucleotide Polymorphism) data and gene expression data may be collected for multiple strains of mice to describe their genotypic variation and phenotypic variation, respectively. Expression Quantitative Trait Locus (eQTL) mapping seeks to identify genes whose genotypic variations are associated with the expression variations. This approach has the potential to dissect the genetic basis of gene expression, which can be further utilized to infer causal relationships between modulator and modulated genes. Existing eQTL methods suffer from a lack of systematic statistical modeling of genome-wide linkages and/or are extremely demanding in computational power. We present an approximate Bayesian eQTL method. The Bayesian method can produce precise statements about the posterior densities of linkages between an expression trait and the genetic makeup of a gene. While the method improves on existing approaches, it introduces new computational challenges for large-scale eQTL studies. We employ Laplace's method to approximate the integration of the likelihood over nuisance parameters, and this has proven to be accurate and especially computationally efficient for eQTL analysis.
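For reference, a generic statement of Laplace's method as used to integrate a likelihood over nuisance parameters; the symbols (theta for the linkage parameters of interest, eta for the nuisance parameters) are our own shorthand, not the authors' notation:

```latex
% Laplace approximation of a marginal likelihood over nuisance parameters \eta:
% \hat{\eta} maximizes h(\eta) = \log[ L(y \mid \theta, \eta)\,\pi(\eta) ],
% d = \dim(\eta), and H(\hat{\eta}) is the Hessian of h at \hat{\eta}.
\[
  \int L(\mathbf{y} \mid \theta, \eta)\,\pi(\eta)\, d\eta
  \;\approx\;
  L(\mathbf{y} \mid \theta, \hat{\eta})\,\pi(\hat{\eta})\,
  (2\pi)^{d/2}\,\bigl|-\mathbf{H}(\hat{\eta})\bigr|^{-1/2}
\]
```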

e-Social Science in a Nutshell

Rob Procter, Peter Halfpenny, Alex Voss; National Centre for e-Social Science

Among research priorities identified in a recent review of UK social science are globalization, population change and understanding individual behavior. The nature of these problems calls for collaboration across traditional disciplinary boundaries, and their complexity and scale demand more powerful research tools. At the same time, the social sciences are on the verge of what is likely to be a fundamental and decisive shift in data collection methods as they seek to unlock the research value of "born digital" data such as administrative and transactional records. The National Centre for e-Social Science (NCeSS) was established by the UK Economic and Social Research Council (ESRC) in 2004 as its key contribution to the UK e-Science programs. The Centre's objective is to enable social scientists to make best use of emerging eScience technologies in order to address the key challenges in their substantive research fields in new ways. In pursuit of this, NCeSS aims to stimulate the uptake and use across the UK social science research community of distributed computational resources, data infrastructures and collaboration mechanisms by coordinating a program of e-Social Science research, making available information, training, advice and support to the social research community, and leading the development of an e-Infrastructure for the Social Sciences that will provide new resources and tools for social research. NCeSS is also responsible for providing advice to the ESRC on the future strategic direction of e-Social Science. In this presentation, we will review the progress NCeSS has made to date in achieving its objective and outline its roadmap for future research and development of methodologies, tools and infrastructure.

Tackling the Barriers to Adoption of e-Research

Rob Procter, Alex Voss, Peter Halfpenny, Marzieh Asgari-Targhi; National Centre for e-Social Science

As part of a study to investigate and tackle barriers to adoption of e-Infrastructure, we have been conducting a review of project documentation, reports and academic papers in the field with the aim of establishing a typology of barriers to uptake and candidate responses to tackle them. Underlying this is the expectation that there are ways of dimensioning the problem space so as to reveal recurring patterns in adoption processes; that these barriers will be "typical" in a number of different ways, e.g., typical in a particular domain, for a given technology, for specific stakeholders, etc. Of course, the real value of this study lies in how it may prospectively afford the adoption of e-Infrastructure rather than simply explain its history. The concern here is how to make our findings re-usable by a broad range of e-Infrastructure users, both current and future. What is needed is a format that allows us to capture the different dimensions of our typology, linking what we recognize as "typical" to concrete examples so that users can navigate the space between a clear conceptual framework and a set of pertinent examples of barriers and concrete responses to them. While a simple wiki served the purpose of data collection adequately at the beginning, we are now finding that as we populate this space, a more structured and dynamic approach is required to reflect the complex relationships found. We will report on this initial phase of our data collection, further steps towards our own empirical work and the development of a rich representation of our findings. We will also talk about plans to make our work sustainable by fostering a community process that we hope will eventually carry on an active reflection within the e-Infrastructure user community about the state of adoption and effective ways forward towards realizing the ambitious goals of e-Research.

Optimal Reduction and Classification of 3D Data Based on Characteristic Curvature

Djamila Aouada, Hamid Krim; North Carolina State University

During the last decade, 3D data acquisition techniques have developed very quickly, contributing to a substantial increase in the available 3-dimensional data. This explosive growth has created a natural need for efficient and automatic classification methods. We propose to base our classification technique on characterizing each 3D object by a single parameter "R", referred to as its characteristic resolution. "R" is empirically defined as the minimal number of points that correctly represent the shape of an object. "R" is a powerful means of reducing the computational cost while maintaining nearly the same quality of representation. Indeed, the initial number of points constituting a mesh may often be reduced by up to a factor of ten. Moreover, using flat-norm-like measures, we show that "R" is directly related to the curvature information of each shape. Hence, our classification technique relies on this property, which is unique to each shape. We present promising results carried out on a sample dataset of 120 objects.
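One hedged way to make the "minimal number of points" idea operational is to keep halving a point cloud until the reduced set no longer reproduces the original shape within a tolerance; the error measure and halving schedule below are illustrative stand-ins, not the flat-norm-based construction the authors use:

```python
import numpy as np
from scipy.spatial import cKDTree

def representation_error(original, reduced):
    """Largest distance from any original point to its nearest point in the reduced set."""
    distances, _ = cKDTree(reduced).query(original)
    return distances.max()

def characteristic_resolution(points, tol, seed=0):
    """Empirical 'R': smallest point count reached before the shape degrades beyond tol."""
    rng = np.random.default_rng(seed)
    current = points
    while len(current) > 4:
        candidate = current[rng.choice(len(current), size=len(current) // 2, replace=False)]
        if representation_error(points, candidate) > tol:
            break                     # halving further would distort the shape
        current = candidate
    return len(current)
```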

Using Instrument Simulation to Quantify Experimentation

Cory Quammen, Russell Taylor; University of North Carolina, Chapel Hill

Increasingly, computers take on crucial roles in processing and analyzing results from experimental science. In many applications, one such role involves removing artifacts in a signal produced by the sensing device that captured the signal. Such applications typically use a model of the sensing device's effect to remove the artifacts, producing a "restored" signal. Inferences about the object or process under study are then made by analyzing the restored signal. In contrast to a restoration approach, we propose to reverse the procedure by using computer simulation to generate the signal a sensing device would produce when observing a hypothesized model of the object under study. Differences between the simulated signal and an experimentally obtained signal can be used in an optimization loop to derive the set of model parameters that best explains the experimental signal. In this talk, I will describe an implementation of this methodology for understanding biological images from confocal microscopes.
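A minimal sketch of the simulate-and-compare loop, assuming a hypothetical single-spot object model and a Gaussian blur standing in for the microscope's point-spread function; the parameterization and the use of Nelder-Mead are illustrative choices, not the talk's actual implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.optimize import minimize

def render_object(params, shape=(64, 64)):
    """Hypothesized object model: one fluorescent disc with (x, y, radius, brightness)."""
    x0, y0, r, amp = params
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    return amp * (((xx - x0) ** 2 + (yy - y0) ** 2) <= r ** 2)

def simulate_instrument(scene, psf_sigma=2.0):
    """Forward model of the sensing device: blur the scene by an assumed PSF."""
    return gaussian_filter(scene.astype(float), psf_sigma)

def misfit(params, experimental):
    """Discrepancy between the simulated and experimentally obtained signals."""
    return np.sum((simulate_instrument(render_object(params)) - experimental) ** 2)

# experimental = ...  image acquired on the real instrument
# fit = minimize(misfit, x0=[32, 32, 5, 1.0], args=(experimental,), method="Nelder-Mead")
# fit.x then holds the model parameters that best explain the experimental signal.
```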

Using Concepts and Entailment for Passage Retrieval from Biomedical Literature

Catherine Blake, University of North Carolina, Chapel Hill; Nassib Nassar, Renaissance Computing Institute

Scientists in healthcare and biomedical informatics have never had as much information available in electronic form as they do today. The increased variety of approaches for information retrieval and extraction offers the potential to combine different techniques and provide scientists with new ways of accessing information; but the real contribution to e-science will occur when the next generation of information tools is consistent with the workflows used by scientists in a specific discipline. One area that holds much promise is recent work that focuses on retrieving relevant passages and entities from an article, rather than an entire document. In this presentation, we will explore the degree to which a concept representation and methods that recognize textual entailment aid passage retrieval performance. Our approach combines concepts from the Unified Medical Language System (UMLS) with a syntax representation that has shown success in recognizing textual entailment. We use the Genomics TREC collection of 160,000 documents and 50 topics that biologists considered important. The standard measures of precision (the number of accurately retrieved passages divided by the number of passages retrieved), recall (the number of accurately retrieved passages divided by the total number of relevant passages), and mean average precision (MAP) (the precision averaged over levels of recall and over topics) are used to evaluate results. One of the key motivations behind this work is the historically low precision values measured in passage retrieval, and the need to investigate alternative approaches that increase relative precision. In addition, the scope of the UMLS enables us to explore the impact of different vocabularies on performance and may inform both manual and automated methods of ontology construction.
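For concreteness, the standard formulation of these measures for a single topic, and their aggregation into MAP, is sketched below (this is the usual TREC-style definition, not code from the authors' system):

```python
def precision_recall_ap(ranked, relevant):
    """ranked: retrieved passage ids in rank order; relevant: set of relevant ids."""
    hits, precisions_at_hits = 0, []
    for k, pid in enumerate(ranked, start=1):
        if pid in relevant:
            hits += 1
            precisions_at_hits.append(hits / k)       # precision at each relevant hit
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    average_precision = sum(precisions_at_hits) / len(relevant) if relevant else 0.0
    return precision, recall, average_precision

def mean_average_precision(runs):
    """runs: one (ranked, relevant) pair per topic; MAP is the mean of the per-topic APs."""
    return sum(precision_recall_ap(r, rel)[2] for r, rel in runs) / len(runs)
```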

Monitoring and Managing Scientific Workflows through Dashboards

Mladen Vouk, Scott Klasky, Roselyne Barreto, Terence Critchlow, Ayla Khan, Jeffery Ligon, Pierre Mouallem, Meiyappan Nagappan, Norbert Podhorszki, Leena Kora

A dashboard for petascale multi-scale simulation displays pertinent information about the simulation in an intuitive form, so that application scientists can easily monitor and retrieve vital information. Our vision for a petascale simulation dashboard displays the most interesting information from the simulation and combines enough provenance information that one can inspect not only the simulation, but also the machines used. This not only allows the user to monitor the simulations and machines, but also to interact with them, for example to adaptively show parts of the simulations and display results from queries that inspect the status of the workflow. In this paper we will discuss some of the general principles behind dashboard design for scientific workflows. Our work concentrated mainly on the following: a) monitoring large supercomputing resources and clusters; b) monitoring jobs on these large resources; c) submitting jobs, editing input files, and interacting with remote resources; d) organizing and displaying the simulations one runs on these resources, including the capture of annotations to describe each simulation; e) monitoring the simulation itself, in real time and for later post-processing; f) displaying scientific information on dashboards; and finally g) the methods for interacting with running simulations and how this interacts with specialized workflows for controlling simulations. One of the key features of our dashboard is the recording of annotations in a database, capturing the provenance. This allows the dashboard to be integrated with an electronic scientific notebook in which scientists track all of the elements of a simulation. These include the following: graphs of the saved data ((xy, contour, 3D) + time), annotations of these graphs, mappings of these data onto other graphs the user needs to compare against, and external data against which the user compares the simulation, including outside experimental data for validation. The main pieces behind our dashboard …

Climate Prediction and Regional Modeling with PRECIS (Providing Regional Climates for Impacts Studies)

Richard Jones, David Wallom, Carl Christensen, Myles Allen, Milo Thurston, Tolu Aina, Simon Wilson

The aim of the Climateprediction.net PRECIS regional modeling experiment is to provide a public distribution of a physically based modeling system allowing detailed assessments of future climate change for any region by continuously coupling a coarse-resolution global climate model with a high-resolution regional climate model. Global Climate Models (GCMs) describe the important physical processes that make up the climate system but tend to have coarse resolutions of up to a few hundred kilometers. Impacts, vulnerability, and adaptation need to be studied on much finer scales. Regional Climate Models (RCMs) have the potential to improve the representation of the climate information and dynamics that are important for assessing a country's vulnerability to climate change. PRECIS is designed as a practical and flexible regional climate model (RCM) which allows scientists to run regional simulations on their own PCs. It is intended for use by non-Annex I countries, which have minimal computing resources available for climate change studies. A public-resource distributed computing version of this system would allow assessment of likely ranges of detailed future climate changes over any region of the globe. This approach would employ the volunteer computing paradigm, which also lends itself well to public education and outreach endeavors.

Segmentation of Colorectal Cancer MRI Images

Michael Brady, Niranjan Joshi, Andrew Blake, Vicente Grau, Fergus Gleeson, Anne Trefethen

We report recent progress on a Microsoft-sponsored project that is based on a collaboration between Microsoft Cambridge and Oxford University. The project has an application focus: more accurate delineation of key anatomical structures in MRI images of the colorectum in order to assess cancer staging and the feasibility of carrying out a resection. The project also has a more generic image analysis component: the analysis of existing segmentation algorithms and the development of a novel synthesis that combines their best features. Colorectal MRI images provide a tough environment in which to develop algorithms that can reliably and accurately segment structures such as the mesorectum (for surgical assessment) and lymph nodes (for staging). We are analyzing three well-known techniques for image segmentation: level sets, Hidden Markov Measure Fields (HMMF), and graph cuts. Image noise, partial volume effects (mixed-tissue voxels resulting from low spatial sampling in the image), and other forms of uncertainty lead us to Bayesian methods, for which the estimation of probability density functions (PDFs) of intensities, local phase structures, or other image representations is a fundamental requirement. Noting that histograms perform poorly when given few samples of a distribution, and that kernel methods work well but are computationally intensive when optimized, we have developed a non-parametric PDF estimation scheme (NP-Windows) and extended it to handle the partial volume effect by an inequality-constrained least squares method. The resulting NP-Windows-ICLS algorithm has been incorporated into the region term of a level sets segmentation algorithm. We have used the monogenic signal (local energy, phase, and orientation) to provide a range of features for the level set algorithm. The resulting method gives very accurate results on a range of clinical data. We outline the next steps, toward relating the work to HMMF and graph cuts models.
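The histogram-versus-kernel trade-off that motivates NP-Windows can be seen with a few lines of standard density estimation; this is purely illustrative and is not the NP-Windows algorithm itself (the sample size, intensities, and bandwidth choice are arbitrary):

```python
import numpy as np
from scipy.stats import gaussian_kde

samples = np.random.normal(loc=120.0, scale=8.0, size=40)    # few voxel intensities

# Histogram estimate of the PDF: fast, but ragged with only 40 samples.
hist_density, bin_edges = np.histogram(samples, bins=15, density=True)

# Kernel estimate: much smoother, but each evaluation sums over all samples,
# which is the computational cost the abstract alludes to.
kde = gaussian_kde(samples)
grid = np.linspace(90.0, 150.0, 200)
kde_density = kde(grid)
```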

iLearning and eHomeStudy: Multimedia Training and Assessments for Field Survey Staff

Charles Loftis, Nanthini Ganapathi; RTI International

Survey data collection projects strive to collect high quality data from survey respondents. The quality of the data collected is greatly dependent upon the effectiveness of field interviewers (FIs) in conducting in-person screenings and interviews. Training FIs and subsequently assessing their knowledge of project protocol, methods and interviewing techniques is critical to the overall success of any data collection effort. For large surveys, as the number of FIs increases, the cost of in-person training can become prohibitively large. As a cost-effective solution to increase the quality of the field data, we developed a suite of web- and media-based training and assessment tools called iLearning and eHomeStudy for training field staff. Besides saving the project costs associated with in-person training, we are also able to provide refresher trainings throughout the year. This application also enables FIs to view standardized training courses at their convenience and at their own pace. This paper describes the technical details, key features and benefits of this application suite, and also includes some details on user satisfaction and future directions.

The DRYAD Repository: Transforming Scientific Publishing and Data Discovery Via the Convergence of Open Access and eScience

Sarah Carrier, Jed Dube, Jane Greenberg, Hilmar Lapp, Abbey Thompson, Todd Vision and Hollie White; University of North Carolina, Chapel Hill

The DRYAD repository aims to support the preservation, discovery, sharing, use, and reuse of scientific data objects supporting published research in the field of evolutionary biology. Dryad is supported by a collaboration involving NESCent (The National Evolutionary Synthesis Center) and the Metadata Research Center (MRC) at the School of Information and Library Science, University of North Carolina at Chapel Hill. Dryad exemplifies the transformation of scientific publishing and data discovery motivated by the convergence of open access and eScience. Dryad seeks to balance a need for low barriers, which invite contribution from the wide range of scientists participating in the field of evolutionary biology, with a series of sophisticated, higher-level goals supporting data synthesis required to advance the field of evolutionary biology. In order to meet these goals, we have defined Dryad’s functional requirements. We conducted a survey of selected leading digital data and resource repository initiatives and held two stakeholder workshops (December ’06, and May ’07), with scientists (targeted depositors and users), representatives of major evolutionary biology journals and scientific societies, and metadata and digital library experts. Based on this input, we have developed Phase I of Dryad’s metadata architecture. To gather additional input we are developing a survey and a use case study that will provide data on evolutionary biologists’ experiences with and perceptions of open data repositories and the professional sharing of scientific data. This work will further inform Dryad’s future architecture. Here we present Dryad’s functional requirements, the underlying repository architecture, and the research methodologies and protocols for our forthcoming survey and use case study.

Models for the Science of Learning in Collaboratories

Miriam Heller, University of Southern California; Anthony E. Kelly, George Mason University; John Cherniavsky, Arlene de Strulle; National Science Foundation

Hundreds of collaboratories have emerged to transcend distance and time constraints and allow communities of research scientists and engineers to interact, share data, digital libraries and computational resources, and exploit remote instrumentation. Budget levels have ranged from $500,000 to $11,000,000 per collaboratory, possibly motivating the transformation of collaboratories from objects for research into objects of research. For instance, NSF's Science of Collaboratories project studied over 200 collaboratories to identify sustainable, generalizable technologies for collaboration in science research. Some collaboratories claim to include virtual learning environments; the Science of Collaboratories database included thirty with learning features. Davenport (2005) notes, though, with regard to collaboratories, "…learning is not traditionally discussed or included in research proposals as a research activity." An analogous Science of Learning Collaboratories program demands consideration, especially if collaboratories are to achieve effective integration of research and learning. Key to understanding, assessing, and optimizing emerging learning collaboratory features is a set of new Models of Educational Inquiry (MEIs). The following eight MEIs are proposed and described in this poster to facilitate scholarly research of learning in a distributed, networked, collaborative environment: Curricular Content, Cyber-Learning, Teaching, Assessment and Evaluation, Educational Policy, Educational Research Design, Educational Technologies and Learning Environments, and Communities of Learning and Teaching.