Patrick Hogan, NASA
The need for massive communication and dynamic sharing of scientific data has never been greater than it will be in the world that awaits our children. The ability to integrate, analyze, and exchange both local and global information is critical to maximizing our understanding of our circumstances, whether for ground-truthing of satellite data (Earth’s carbon budget), coalescing field data for regional projections (North Africa to North India locust intervention), or simply innovative analyses coming from world-wide access to global data, and whether it be on behalf of academia, governments, or enfranchised individuals from the global community. This realm of scientific understanding needs the kind of innovation that comes from coding environments that provide the greatest opportunity for the development of solution-based technology. Competition in this realm should be based purely on results engendered by access to the scientific data. The .NET programming environment provides a compelling solution for scientific endeavors to maximize solution-based analyses and it also equally serves the geospatial visualization technology needed to effectively share this information.
Greg Quinn, University of California, San Diego
Within the past few years, numerous cell phone platforms have come to market that provide more than sufficient technical capability to enable advanced information visualization. Accompanying these advances in telecommunications hardware is the increasing maturity and capability of Smart Phone operating systems such as Windows Mobile 6.0. This has led to the increasing dependence of people from all walks of life on their cell phones to provide not only telecommunications functionality but also Internet-based information access and entertainment capability. Here we describe work in progress to utilize the Windows Communication Foundation capability in the .NET Framework version 3.0 to efficiently serve bioinformatics data on-the-fly to Smart Phone devices running the Windows Mobile operating system. We will also discuss the use of binary-formatted data transfer as a means to increase the download and processing efficiency of Protein Data Bank (PDB) data stored in a Microsoft SQL Server database.
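As a rough illustration of why binary-formatted transfer can be more compact than text, the following minimal Python sketch packs atom coordinates into fixed-width records. The record layout (a 32-bit serial number plus three 32-bit floats), the function names, and the sample coordinates are all assumptions made for this example; the abstract does not describe the actual wire format or the .NET code used.

```python
import struct

# Hypothetical fixed-width record: atom serial (uint32) and x, y, z (float32 each).
ATOM_RECORD = struct.Struct("<Ifff")


def pack_atoms(atoms):
    """Serialize (serial, x, y, z) tuples into one compact byte string."""
    return b"".join(ATOM_RECORD.pack(serial, x, y, z) for serial, x, y, z in atoms)


def unpack_atoms(payload):
    """Decode the byte string back into a list of (serial, x, y, z) tuples."""
    return list(ATOM_RECORD.iter_unpack(payload))


# Made-up sample coordinates, used only to compare the two encodings.
atoms = [(1, 11.104, 6.134, -6.504), (2, 12.560, 6.256, -6.321)]
payload = pack_atoms(atoms)
text = "\n".join(f"{s} {x:.3f} {y:.3f} {z:.3f}" for s, x, y, z in atoms)

print(len(payload), "bytes as binary;", len(text.encode("ascii")), "bytes as text")
```

A fixed-width binary record also avoids parsing numbers from text on the device, which is one plausible source of the processing savings the abstract mentions.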
Supratik Mukhopadhyay, Utah State University; Krishna Shenai, University of Toledo; Ramesh Bharadwaj, NRL
World demand for fresh water is increasing, and competition for allocation of water between the urban and agricultural sectors is rapidly growing in arid and semi-arid climates. This has brought an emphasis on intensive water management to achieve greater system efficiencies, especially in irrigated agriculture in arid regions such as the western US. Further, studies by the FAO (Food and Agriculture Organization) and others predict that in the coming 20 years, this competition for water will present potentially serious economic, political, and social problems for much of the population in both the urban and rural areas of developing countries, especially in the arid and semi-arid regions of the world. We present a novel irrigation control system to intelligently and reliably manage large soil and water ecological systems for environmental and agricultural applications. Reliability is an important concern in precise monitoring and control of soil and water properties, since any malfunction can result in financial as well as environmental disaster. Our controller consists of novel sensors and uses state-of-the-art distributed information fusion and networking technologies for multi-zone implementation. It integrates intelligent sensor coordination and data fusion techniques to access, retrieve, process, and communicate with disparate wireless sensors in an ad-hoc manner to deliver reliable dynamic decisions and provide adequate information management. Our approach reduces the hardware cost by almost a factor of 10 and removes the main bottleneck in irrigation control arising from wired sensors. In addition, it provides a smart control mechanism with formal reliability guarantees that is reconfigurable at runtime in response to changing requirements.
Christoph Hoffmann, Voicu Popescu, Purdue University
Visualization is a core task in scientific computations, and in interdisciplinary settings it becomes even more important in view of the need to communicate insights across disciplinary expertise in the team. We explain how to integrate state-of-the-art finite element analysis (FEA) and visualization systems. Instead of replicating functionality of one system in the other, we federate the systems by automated translation of FEA results into a form suitable for the animation/visualization system. This includes bridging the gap between different geometry conceptualizations, inverting and visually concretizing abstractions convenient for FEA, deriving visualization strategies that scale with the number of simulation elements and states, and placing the simulation results in the context of the surrounding scene. We demonstrate our approach with the recently completed simulation and animation of the crash of AA-11 into the North Tower of the World Trade Center, a video that has been downloaded more than 1.3M times to date. We discuss some of the research issues that arose and describe some of the benefits for the FEA when high-end visualization is considered part of the effort. In the broader context, our work finds applications in VR training, in forensics, and in communicating with a wide audience outside of the scientific community.
Jignesh Patel, University of Michigan
Modern life sciences explorations often need to analyze and manage large volumes of complex biological data. Unfortunately, existing life sciences applications often employ awkward procedural querying methods and use query evaluation algorithms that do not scale as the data size increases. For example, data is often stored in flat files and queries are expressed and evaluated by programs written in Python. The perils of employing such procedural querying methods are well known to a database audience, namely a) severely limiting the ability to rapidly express complex queries, and b) often resulting in very inefficient query plans as sophisticated query optimization and evaluation methods are not employed. The problem is likely to get worse in the future as many life sciences datasets are growing at a rate faster than Moore’s Law. Furthermore, the queries that scientists want to pose are also rapidly increasing in their complexity. The focus of this talk is on a database approach to querying biological datasets. The talk describes ongoing work in the Periscope project in which we are developing a system for declarative and efficient querying on biological graphs and sequence databases. This talk will also highlight how these database methods allow a scientist to work in a loop of a) first posing queries, b) viewing the results, c) then refining and reposing a modified query, and d) continuing through this iterative process until an answer has been found. The efficiency of the system enables the scientist to explore even large biological databases in real time.
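The contrast between procedural and declarative querying can be sketched with a small example. The Python snippet below uses the standard-library sqlite3 module and a made-up protein interaction table; the table, its contents, and the function name are assumptions for illustration only, and plain SQL stands in here for whatever query language the Periscope project actually provides.

```python
import sqlite3

# In-memory database holding a tiny, made-up interaction graph (edges a -> b).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interaction (a TEXT, b TEXT)")
conn.executemany(
    "INSERT INTO interaction VALUES (?, ?)",
    [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")],
)
# An index lets the query optimizer avoid scanning every row.
conn.execute("CREATE INDEX idx_interaction_a ON interaction (a)")


def common_partners(conn, x, y):
    """Return proteins that both x and y interact with, stated declaratively."""
    rows = conn.execute(
        "SELECT i.b FROM interaction AS i "
        "JOIN interaction AS j ON i.b = j.b "
        "WHERE i.a = ? AND j.a = ? ORDER BY i.b",
        (x, y),
    ).fetchall()
    return [row[0] for row in rows]


print(common_partners(conn, "P1", "P2"))
```

The query states what is wanted and leaves the evaluation strategy to the database engine, so refining it in the pose, view, and refine loop described above means editing one statement rather than rewriting a file-parsing program.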
Claudio Silva, Juliana Freire, Carlos Scheidegger, David Koop, Huy Vo; University of Utah
Workflow systems have recently emerged as an alternative to ad-hoc approaches to constructing computational tasks widely used in the scientific community. These systems can capture complex analysis processes at various levels of detail and systematically capture the provenance information necessary for reproducibility, result publication, and sharing. Although the benefits of using workflow systems are well known, the fact that workflows are hard to create and maintain has been a major barrier to wider adoption of the technology in the scientific domain. Constructing complex analysis processes requires expertise both in the domain of the data being explored and in using a number of different analysis and visualization tools. Furthermore, the path from “data to insight” requires a laborious trial-and-error process, where users successively assemble, modify, and execute multiple workflows. We advocate a data-centric view of workflow-based computational processes, where the workflows and information about their evolution are stored, along with their impact on the data they manipulate. This information captures detailed provenance of the steps followed in exploratory processes. We propose a new framework that lets users explore and re-use this detailed provenance information through intuitive interfaces. Our framework consists of two key components: a query-by-example interface for querying workflows, whereby users query workflows through the same familiar interface they use to create them; and a mechanism for semi-automatically creating and refining workflows by analogy, without requiring users to directly manipulate or edit the workflow specifications. In this talk, we will describe the framework and demonstrate its use in VisTrails (www.vistrails.org), a publicly available open-source system.
Vicki Hertzberg, Douglas Lowery-North, Walter Orenstein, James Buehler, Lance Waller, Eugene Agichtein; Emory University
Rapid detection of disease outbreaks and response to cases is an important public health function. Definitive diagnoses and subsequent reporting can lag initial case presentation by days or weeks, a critical weakness in outbreak detection. In addition, timely notification of outbreaks to healthcare providers by a central public health authority is also crucial. However, the best strategies for such notification have not been determined. We describe here the potential for developing a real-time syndromic surveillance (SS) system using three healthcare systems in a large urban area with reciprocal interface from the state PH agency. These systems cover patients presenting in the hospital emergency departments (four adult, three pediatric) and primary care clinics as well as related laboratory and radiology orders. This system presents many scientific and technological challenges. How can we best integrate data sets within and between systems rapidly? Is there benefit to monitoring the health status of a particularly vulnerable population comprising one of the hospitals? What tools are necessary to detect “blips” suggesting events of interest? Can we automate epidemiologic investigation of such events? Can we apply performance improvement tactics to reduce waste and improve value in SS data collection, analysis, and reporting? How can free text records, such as dictations, be utilized to improve sensitivity and positive predictive value of SS? How can we best give meaningful real time feedback to clinicians regarding PH alert information? What is the most valuable information to provide to these clinicians? What are the most valuable actions for providers to accomplish with such information? Should space be reserved in electronic?
Robert Grossman, Dave Hanley, University of Illinois; Jennifer Schopf, Argonne National Laboratory
Many applications perform queries to large scientific data sets that involve scanning the entire data set in the sense that each record must be checked to see if a given condition is satisfied. In contrast, there is often an implicit assumption by the database developers that latency must be optimized, and an expectation that data is indexed in such a way that a relatively small amount of the data needs to be retrieved in order to satisfy the query. We are interested in the case seen by applications including SDSS, BLAST, and others in which there are multiple contending scanning queries and the end user wishes to optimize total throughput. In this paper, we define a system called RAY that collects scanning queries as they arrive, presents them with the entire database chunk by chunk, and releases them after the entire database has been scanned, thereby increasing the performance of multiple contending scanning queries by reducing the number of aggregate disk reads. We present experimental studies using a large astronomy data set from the Sloan Digital Sky Survey and realistic queries from that experiment that touch varying amounts of data, from 100% down to 20%. We show that RAY is significantly faster than directly passing the queries to the database. When 100% of the data is touched this can be true even when there is no contention, and for less data touched in the scan, RAY can achieve better performance for as few as 2 or 3 contending scanning queries.
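The core idea of serving several scanning queries from a single pass over the data can be sketched as follows. This minimal Python example is an illustration under stated assumptions, not the RAY implementation: the function names, the integer records, and the chunking are hypothetical, and the sketch does not model queries arriving while a scan is already in progress.

```python
def shared_scan(chunks, queries):
    """Read each chunk once and evaluate every pending query against it."""
    results = {name: [] for name in queries}
    chunk_reads = 0
    for chunk in chunks:
        chunk_reads += 1  # one read serves all queries
        for name, predicate in queries.items():
            results[name].extend(r for r in chunk if predicate(r))
    return results, chunk_reads


def independent_scans(chunks, queries):
    """Baseline: each query scans the whole data set by itself."""
    results = {}
    chunk_reads = 0
    for name, predicate in queries.items():
        hits = []
        for chunk in chunks:
            chunk_reads += 1  # every query re-reads every chunk
            hits.extend(r for r in chunk if predicate(r))
        results[name] = hits
    return results, chunk_reads


# Twelve made-up records split into three chunks, with three contending queries.
chunks = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]
queries = {
    "even": lambda r: r % 2 == 0,
    "big": lambda r: r >= 8,
    "small": lambda r: r < 3,
}

shared_results, shared_reads = shared_scan(chunks, queries)
solo_results, solo_reads = independent_scans(chunks, queries)
print(shared_reads, "chunk reads shared;", solo_reads, "chunk reads independent")
```

Both strategies return the same answers; the shared scan reads each chunk once regardless of how many queries are waiting, which is the reduction in aggregate disk reads the abstract describes.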
Marty Humphrey, Sang-Min Park, Jun Feng, Norm Beekwilder, Glenn Wasson, Jason Hogg, Brian LaMacchia, Blair Dillaway; University of Virginia
Access control policy languages today generally fall at one of two extremes: either extremely simplistic, or overly complex and challenging for even security experts to use. In this presentation, we explicitly identify requirements for an access control policy language for scientific data and then consider six specific data access use-cases that have been problematic in multi-institutional collaborations: attribute-based access, role-based access, “role-deny” access, impersonation-based access, delegation-based access, and capability-based access. We evaluate the Microsoft Research Security Policy Assertion Language (SecPAL) against those requirements, specifically in the context of these six use-cases involving GridFTP.NET. While some of these six use-cases are individually possible via existing authorization systems, we believe that SecPAL uniquely offers a single approach that meets the requirements of a multi-institutional access control policy language, thereby creating support for a wide range of expanded scenarios for controlled sharing of scientific data.
Bora Zivkovic, Public Library of Science
Online technologies are fundamentally changing the world of science: how research is performed, how science is taught and communicated, and how scientists’ networks are formed. The meteoric rise in the number, quality and prestige of Open Access journals, the rise in interest in Open Notebook Science, the proliferation of science blogs, the increased use of existing social networks (e.g., Facebook) and the formation of science-specific networks (e.g., Postgenomic, Connotea) all contribute to big changes in the structure of the scientific enterprise which upset the traditional model.
C. Augusto Casas, St Thomas Aquinas College
Taking notes is the most common activity of students in the classroom. College students’ use of technology has increased significantly in the last several years. Students now attend class armed with PDAs, laptops and especially cell phones. These last devices are more than telephones. Cell phones include calculators, web browsers, instant messaging software, phone books, digital cameras, video players, and games. Research conducted by the author found that students can benefit academically from such technology. More specifically, class experiments demonstrated that when students use personal computers to take and share notes, class participation and test scores increase. Microsoft Office Live Meeting was used as the underlying technology. Lectures were given to students divided into two groups. One group shared notes with Live Meeting. The other group took notes individually. A day after the lecture both groups took the same test. The experiment was conducted multiple times with different pools of students. Results showed that students using the note-taking and sharing system were more actively engaged in class and scored better on the test. The results were consistent across all groups tested. With the Live Meeting system, each student was assigned a section of an online whiteboard. Each student took notes in her/his area while looking at the notes taken by classmates. At the end of the lecture, students who used the online system could save and keep a copy of the online whiteboard. The experiments showed that students are more likely to engage in class and less likely to be distracted by other activities when they are working within this collaborative environment. The next research phase intends to determine whether such a system helps disadvantaged students.
Peter Bajcsy, Sang-Chul Lee, NCSA/UIUC
We discuss the problem of understanding computational requirements for preservation of computer-aided decisions. Computer-aided decisions increasingly impact our society. These decisions have to be documented semi-automatically and the electronic records have to be appraised and understood in terms of the preservation and reconstruction cost. Currently there is no simulation framework that could support understanding and forecasting of computational requirements for preservation purposes. Our objective has been to develop such an exploratory simulation framework that allows archivists and other users to explore and evaluate computational costs as a function of several key preservation variables of appraised records. Thus, the application of our simulation framework is in supporting investigations of preservation tradeoffs and improving appraisals of electronic records. We first outline prototype simulation software called Image Provenance To Learn (IP2Learn) that has been developed for a class of computer-aided decisions based on visual image inspection. The current software enables users to explore some of the tradeoffs related to (1) information granularity (category and level of detail), (2) representation of provenance information, (3) compression, (4) encryption, (5) watermarking and steganography, (6) information gathering mechanism, and (7) final report content (level of detail) and its format. The simulation software consists of Image Viewer (visual inspection of images), Event Tracker (information gathering), Event Reviewer (decision reconstruction), and Final Report Editor (semi-automatic report generation). We will also illustrate example tradeoff studies using IP2Learn for a specific image inspection task.
David Lee, Perry Samson, Erik Hofer, University of Michigan
In early 2007 the Department of Atmospheric, Oceanic and Space Sciences (AOSS) and the School of Information (SI) at the University of Michigan collaborated on the installation of a 50 million pixel OptIPortal, or tiled display, utilizing OptIPuter technologies for applications spanning high-resolution image exploration to multi-modal atmospheric visualizations. In addition to research and persistent display tasks, the OptIPortal was incorporated into the undergraduate curriculum by requiring students to use the display in demonstrating their understanding of principles in atmospheric sciences. This presentation discusses the rapid adoption of ultra high resolution visualization cyberinfrastructure in a classroom setting. The AOSS student group demonstrated the ability to effectively utilize advanced cyberinfrastructure using the interfaces provided by a software stack, enabling them to rapidly prototype compelling applications that take advantage of the high resolution display despite the technical complexity of the system. Utilizing these tools, the students produced projects ranging from conventional PowerPoint presentations, to distributed and parallel rendering of movie files, to dynamic multi-modal and multi-resolution weather visualizations to aid in the prediction or understanding of atmospheric phenomena. Observations of and interactions with the student group provided insight into how the OptIPuter software driving the tiled display enabled students to rapidly prototype meaningful visualizations aiding their course projects. Considering these results, we are optimistic that these experiences point to the feasibility and utility of introducing OptIPortals to the classroom, as well as to lessons for the next generation of control software for high resolution displays.
Fabio Scibilia, Dario Russo, INFN-Catania
The grid paradigm has emerged as the next step in the evolution of distributed computing. The gLite middleware (http://www.glite.org) is one of the most popular grid middleware stacks and it is developed in the context of the EGEE project (http://www.eu-egee.org), which built the largest grid infrastructure for e-Science in the world. At present, gLite essentially runs on Linux platforms and this has up to now kept Microsoft Windows users and applications out of the EGEE infrastructure. The aim of the Grid2Win project is to port basic gLite services to run under MS-Windows, to give Windows users access to grid facilities and to make possible the integration of Windows applications with the grid. Among all gLite services, we focus on the User Interface (UI), which is the set of command line tools to access the grid resources, and the Computing Element (CE), which is the grid service managing the computing power of the grid. Each CE wraps a Local Resource Management System (LRMS) exploiting its computing power. Using Cygwin as a POSIX emulation environment, we successfully ported the gLite User Interface to run under MS-Windows XP and developed a GUI on top of it. Moreover, we ported the Torque/MAUI (free release of the PBS job scheduler) based CE as the first Windows CE. Encouraged by the results obtained, we also successfully managed to integrate Microsoft Compute Cluster Server (CCS) into gLite as the first Windows native LRMS recognized by gLite. The presentation will review the activities carried out so far as well as the future plans.
Lee Giles, Prasenjit Mitra, Levent Bolelli, Xiaonan Lu, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, James Z. Wang, Karl Mueller, William Brouwer, James Kubicki, Barbara Garrison, Joel Bandstra, Pennsylvania State University
In chemistry, the growth of data has been explosive, and timely, effective information and data access is critical. We propose the NSF-funded ChemXSeer architecture, a portal for academic researchers in environmental chemistry, which integrates the scientific literature with experimental, analytical and simulation datasets. ChemXSeer will comprise information crawled from the web, manual submission of scientific documents and user submitted datasets, as well as scientific documents and metadata provided by major publishers. Information crawled by ChemXSeer from the web and user submitted data will be publicly accessible, whereas access to publisher resources can be provided by linking to their respective sites. Thus, instead of being a fully open search engine and repository, ChemXSeer will be a hybrid, limiting access to some resources. ChemXSeer intends to offer some unique aspects of search not yet present in other scientific search services. We are developing algorithms for the extraction of tables, figures, equations and formulae from scientific documents, enabling users to search on those fields. ChemXSeer intends to provide search features including: full text search; author, affiliation, title and venue search; figure and table search; equation and formulae search; citation and acknowledgement search; and citation linking and statistics. For dataset search, we are developing tools that automatically annotate published data representations such as figures, and that permit researchers to annotate their datasets by providing both document-level and attribute-level metadata in OAI-PMH format to facilitate searching data more effectively at both the attribute and semantic levels, browsing datasets, and linking to existing scientific literature and other datasets.
David Green, Steven Skiena, Stony Brook University
A major problem in synthetic biology is the tendency of bacterial systems to eliminate any genes that do not directly benefit the organism, as a result of natural selection favoring shorter genome lengths, which can be replicated more quickly. We are working on advances in computational protein and gene design that directly address this problem. We have previously demonstrated an algorithm capable of creating the shortest nucleotide sequence that encodes two given proteins, taking advantage of multiple reading frames and the redundancy of the genetic code. We also have expertise in computational approaches to the redesign of proteins to satisfy particular functions. We are currently working to integrate these technologies in achieving two particular goals. The first involves the interleaving of an antibiotic resistance gene with a particular protein whose expression is desired. Challenging bacteria containing this construct with the appropriate antibiotic will lead to a selective pressure to keep the inserted gene; as the sequence of the protein of interest overlaps this coding sequence, the deletion of the desired protein from the genome will be avoided. Secondly, we are developing methods to directly reduce the coding length for a given protein, taking a two-step approach: (1) redesign a multi-domain protein consisting of a single polypeptide sequence into a protein complex; (2) overlap the coding sequences of the two components, leading to a substantially reduced length of DNA that codes for a functionally equivalent protein. Our approach integrates protein design, coding-sequence optimization, and validation in an experimental context to address a major problem in the long-term viability of synthetic biological networks. We will present our initial results in targeting these problems.
Jaroslaw Pillardy, Cornell University
One of the challenges of High Performance Computing (HPC) is user accessibility. At the Cornell University Computational Biology Service Unit, which is also a Microsoft HPC institute, we have developed a computational biology application suite that allows researchers from biological laboratories to submit their jobs to the parallel cluster through an easy-to-use web interface. Through this system, we are providing users with popular bioinformatics tools including BLAST, HMMER, InterproScan, MrBayes and others. The system is flexible and can be easily customized to include other software. It is also scalable; the installation on our servers currently processes approximately 10,000 job submissions per year, many of them requiring massively parallel computations. It also has a built-in user management system which can limit software and/or database access to specified users. TAIR, the major database of the plant model organism Arabidopsis, and SGN, the international tomato genome database, are both using our system for storage and data analysis. The suite will be released along with its source code this year. The system consists of a web server running the interface (ASP.NET C#), Microsoft SQL Server (ADO.NET), a compute cluster running Microsoft Windows, an ftp server and a file server. Users can interact with their jobs and data by a web browser, ftp or e-mail. Remote HPC clusters can be accessed via the JSDL protocol. The interface is accessible at http://BioHPC.net/.
Jan Prins, University of North Carolina, Chapel Hill; Lars Nyland, Mark Harris, Nvidia Corp.
Acceleration of computational kernels using a GPU is becoming simpler using improved GPU programming models. We examine the all-pairs computational kernel for N-body simulation and its implementation using the NVIDIA CUDA programming model. We show how the parallelism available in the all-pairs computational kernel can be expressed in the CUDA model and how various parameters can be chosen to effectively engage the full resources of the first GPU to support the CUDA model, the NVIDIA GeForce 8800 GPU. We report on the performance of a familiar N-body kernel for astrophysical simulations. For this problem the GeForce 8800 calculates over 10 billion interactions per second performing 100 integration time steps per second to simulate a system with 10,000 bodies. At 20 flops per interaction, this corresponds to a sustained performance in excess of 200 gigaflops. This is close to the theoretical peak performance of the GeForce 8800 GPU. The all-pairs approach is typically used as a kernel to determine the forces in close-range interactions. The all-pairs method is then combined with a faster method based on a far-field approximation of longer range forces, which is only valid between parts of the system that are well separated. In all cases, a fast all-pairs kernel is essential to the overall performance of the n-body simulation.
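For readers unfamiliar with the all-pairs kernel, the following minimal Python sketch shows the computation in serial form and reproduces the throughput arithmetic quoted above. It is an illustration only, not the CUDA implementation from the talk: the function name, the softening term (a common device in astrophysical codes, assumed here to avoid division by zero), and the unit gravitational constant are assumptions.

```python
import math


def all_pairs_accelerations(positions, masses, softening=1e-3):
    """Gravitational acceleration on each body from every body (G = 1 assumed)."""
    eps2 = softening * softening  # softening keeps the i == j term finite
    accelerations = []
    for xi, yi, zi in positions:
        ax = ay = az = 0.0
        for (xj, yj, zj), mj in zip(positions, masses):
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            dist2 = dx * dx + dy * dy + dz * dz + eps2
            scale = mj / (dist2 * math.sqrt(dist2))  # m_j / r^3
            ax += dx * scale
            ay += dy * scale
            az += dz * scale
        accelerations.append((ax, ay, az))
    return accelerations


# Throughput arithmetic using the figures quoted in the abstract.
bodies = 10_000
steps_per_second = 100
flops_per_interaction = 20
interactions_per_second = bodies * bodies * steps_per_second
gigaflops = interactions_per_second * flops_per_interaction / 1e9

acc = all_pairs_accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])
print(interactions_per_second, gigaflops)
```

The double loop makes the N-squared cost explicit: 10,000 bodies give 10^8 pairwise interactions per time step, so 100 steps per second corresponds to 10^10 interactions per second and, at 20 flops each, 200 gigaflops. In a GPU version each outer-loop iteration is independent, which is the parallelism the abstract refers to.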
David Holmes, Life Science Foundation; Fernando González-Nilo, Center for Bioinformatics and Molecular Simulation; Raúl Isea, Apartado Postal 40336
This presentation examines the case of the Virtual Institute for Integrative Biology (VIIB) as a Latin American paradigm for achieving global collaborative eScience. Biology has emerged as one of the major areas of focus of scientific research worldwide, providing new challenges in eScience and grid computing. Whereas major efforts to meet these challenges have been mounted in various parts of the world, less appears to have been accomplished in Latin America, and the VIIB was developed to fill this need. The scientific agenda of the VIIB includes: construction and operation of databases for comparative genomics of particular relevance to Latin America, bioinformatics services and protein simulations for biotechnological and medical applications. Human resource development through shared teaching, co-sponsored students and seminars is also an integral component of the collaborative effort. eScience challenges include: connectivity concerns, high performance computing (HPC) limitations, development of a customized Grid framework, language issues, maintenance of open access without compromising security and the dissemination of scientific and technical information. Finally, it was recognized that computational frameworks and flexible workflows were required to efficiently exploit shared resources without causing impediments to the user who has little interest in the underlying information technology (IT). Overall, the VIIB has proved an effective way for small teams to transcend the critical mass problem, to overcome geographic limitations and to harness the power of large scale, collaborative science; as such, it may prove a useful model for promoting additional eScience initiatives in Latin America and other emerging regions.
Nahuel Olaiz, Esteban Mocskos, Mariano Perez Rodriguez, Lucas Colombo, Alejandro Soba, Cecilia Suarez, Graciela Gonzalez, University of Buenos Aires; Luis Nuñez, Argonne National Laboratory; Marcelo Risk, Guillermo Marshall, University of Buenos Aires
Here we describe an application in biomedical engineering. In cancer tumor drug treatment nothing can reach tumor cells without passing through the vessel wall and the interstitial matrix. Physicochemical and physiological barriers could hinder the main transport mechanisms, thus leading to heterogeneous therapeutic agent accumulation and some cells remaining untreated. Use of electric currents in chemotherapy greatly enhances drug transport and delivery. Cancer electrochemical treatment consists of the passage of an electric current, whether direct (EChT) or micro-/nano-pulsed (ECT), through two or more electrodes inserted locally in the tumor tissue. Extreme pH changes at tissue level (EChT) or the creation of membrane porous channels at the cell level (facilitating penetration of anticancer drugs into the cell, ECT) are the main tumor regression mechanisms. We study tumor drug transport for cancer treatment with nanoparticles (loaded with therapeutic agents) during EChT and ECT through a combined modeling methodology: in vivo with BALB/c mice bearing a subcutaneous tumor, in vitro with multi-cellular spheroids and collagen gels, and in silico using the Nernst-Planck, Poisson and Navier-Stokes equations for ion transport, electric field distribution and fluid flow, respectively. The main goal is to find nanoparticle/drug combinations, electric field intensities and pulse frequencies that optimize tumor treatment. In this interdisciplinary approach we use web-based I-labs for confocal and fluorescent microscopy image processing, and HPC on a low latency cluster under the MS CCS platform. Preliminary results suggest that using nano charged drugs and tuned electrical fields significantly increases drug .
This summary presents custom Software for Automatic Measurement of Circadian Activity Deviation, called SAMCAD. The primary goal of this software is to extract, from raw activity data collected through passive monitoring, Circadian Activity Rhythms (CAR), or home human behaviors, for various types of populations who may benefit from a home assistive technology. Based on a pattern mining algorithm, SAMCAD establishes the life rhythm of a resident in approximately three weeks from empirical observations, then tracks any behavioral changes that may occur during daily life at home. Early clinical trials show the potential to detect chronic pathologies such as urinary infections or to evaluate cognitive decline or rehabilitation treatments. Knowledge of life habits, given by a derived type of CAR activity pattern based on the user's presence in every room, also makes it possible to set up various home automation functions such as power management. For example, half duplex radio transmissions, which are heavily used during long-term in-home wireless activity monitoring in sensor networks, can be efficiently regulated for energy saving by mapping the motes’ behavior to the resident's behavior, while preserving a high quality of monitoring. Detecting deviations in these home behaviors, part of the CAR model, can also be useful in the field of privacy to reinforce rule-based systems dealing with dynamic Role Based Access Control. Privileges to access personal medical data belong first to patients. However, they may be willing to automatically grant permissions to caregivers in short-term at-risk situations (falls, cardiac arrests), or in longer situations involving an abnormal CAR behavioral context. Such behavioral anomalies, which may be indicative of cognitive decline, can be used to alert caregivers so that they can investigate.
Tanya Berger-Wolf, University of Illinois at Chicago; Daniel Rubenstein, Princeton University; Mayank Lahiri, Chayant Tantipathananandh, University of Illinois at Chicago; David Kempe, University of Southern California; Habiba Habiba, University of Illinois at Chicago; Jared Saia, University of New Mexico
Computation has fundamentally changed the way we study nature. Recent breakthroughs in data collection technology, such as GPS and other mobile sensors, are giving biologists access to data about wild populations that are orders of magnitude richer than any previously collected. Such data offer the promise of answering some of the big ecological questions about animal populations. Unfortunately, in this domain, our ability to analyze data lags substantially behind our ability to collect it. In particular, interactions among individuals are often modeled as social networks where nodes represent individuals and an edge exists if the corresponding individuals have interacted during the observation period. The model is essentially static in that the interactions are aggregated over time and all information about the time and ordering of social interactions is discarded. We show that, on dynamic data, such traditional social network analysis methods may lead to incorrect conclusions about the structure of interactions and the processes that spread over those interactions. We have extended computational methods for social network analysis to explicitly address the dynamic nature of interactions among individuals. We have developed techniques for identifying persistent communities and influential individuals, and for extracting patterns of interactions in dynamic social networks. We will present our approach and demonstrate its applicability by analyzing interactions among zebra populations and identifying how the structure of interactions changes with demographic status.
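To illustrate why discarding time ordering can mislead, here is a minimal sketch that compares reachability in the aggregated (static) network with reachability along time-respecting sequences of interactions. It is not the authors' method; the function names and the three-individual example are hypothetical, and interactions sharing a time stamp are simply processed in list order.

```python
from collections import defaultdict

def static_reachable(interactions, source):
    """Individuals reachable from source when time stamps are ignored."""
    neighbors = defaultdict(set)
    for a, b, _t in interactions:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def temporal_reachable(interactions, source):
    """Individuals that something starting at source can reach when each
    interaction is only usable at its own time, in chronological order."""
    informed = {source}
    for a, b, _t in sorted(interactions, key=lambda e: e[2]):
        if a in informed or b in informed:
            informed.update((a, b))
    return informed

# Hypothetical data: B met C at time 1, then A met B at time 2.
interactions = [("B", "C", 1), ("A", "B", 2)]
print(sorted(static_reachable(interactions, "A")))    # ['A', 'B', 'C']
print(sorted(temporal_reachable(interactions, "A")))  # ['A', 'B']
```

In the static view A appears connected to C, yet nothing that starts at A can reach C, because the B-C interaction happened before A met B.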
Jeff Dozier, James Frew, University of California, Santa Barbara
Using reflectance values from the 7 MODIS “land” bands with 250 or 500m resolution, along with a 1km cloud product, we estimate the fraction of each 500m pixel that snow covers, along with the albedo of that snow. Such products are then used in hydrologic models in several mountainous basins. The daily products have data gaps and errors because of cloud cover and sensor viewing geometry. Rather than make users interpolate and filter these patchy daily maps without completely understanding the retrieval algorithm and instrument properties, we use the daily time series in an intelligent way to improve the estimate of the measured snow properties for a particular day. We use a combination of noise filtering, snow/cloud discrimination, and interpolation and smoothing to produce our best estimate of the daily snow cover and albedo. We consider two modes: one is the “predictive” mode, whereby we estimate the snow-covered area and albedo on that day using only the data up to that day; the other is the “retrospective” mode, whereby we reconstruct the history of the snow properties for a previous period.
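The difference between the two modes can be sketched on a single pixel's daily time series. This is a simplified stand-in under stated assumptions, not the retrieval algorithm described above: gaps (cloudy days) are marked `None`, the "predictive" estimate just carries the last valid value forward, and the "retrospective" estimate linearly interpolates across each gap. Function names and sample values are hypothetical.

```python
def predictive_fill(series):
    """Estimate each day from data up to that day only:
    carry the most recent valid observation forward."""
    out, last = [], None
    for value in series:
        if value is not None:
            last = value
        out.append(last)
    return out

def retrospective_fill(series):
    """Estimate each day from the whole record: linearly interpolate
    between the valid observations on either side of a gap."""
    out = predictive_fill(series)  # also covers gaps at the end of the record
    valid = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(valid, valid[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            out[i] = (1 - w) * series[left] + w * series[right]
    if valid:
        for i in range(valid[0]):
            out[i] = series[valid[0]]  # gap at the start: back-fill
    return out

# Hypothetical fractional snow cover for one pixel; None marks a cloudy day.
daily = [0.8, None, None, 0.2, None]
print(predictive_fill(daily))
print(retrospective_fill(daily))
```

The retrospective estimate can use the later observation to infer a gradual decline through the gap, which the predictive estimate cannot know about on the day itself.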
Jeremy Archuleta, Wuchun Feng, Eli Tilevich, Virginia Polytechnic Institute and State University
The biomedical and life sciences communities make heavy use of BLAST (Basic Local Alignment Search Tool) to characterize an unknown sequence by comparing it against a database of known sequences. The similarity between pairs of sequences enables biologists to detect evolutionary relationships and infer biological properties of the unknown sequence. For example, it can be used for phylogenetic profiling, bacterial genome annotation, and pathogen detection. Unfortunately, BLAST has proven to be too slow to keep up with the current rate of sequence acquisition. Searching for a given sequence against the nucleotide database takes nearly three times longer today than it did in 2004 despite faster hardware. Thus, we created mpiBLAST, a novel parallelization of BLAST that runs on many OS platforms, including Microsoft Windows. mpiBLAST can deliver super-linear speed-up and scale to tens of thousands of processors due to an array of integrated features including database and query segmentation, advanced job scheduling and load balancing, and parallel I/O. Currently, mpiBLAST v1.4 delivers 305-fold speedup when running on a 128-processor cluster. By abstracting the execution characteristics of sequence-search algorithms such as BLAST, mpiBLAST has evolved to efficiently transform any given serial sequence-search tool into a parallel one, thus delivering the above performance to an entire class of sequence-search algorithms. This new version of mpiBLAST (v2.0) achieves this by utilizing “mixing layers” to separate functionality into complementary modules and “refined roles” within each layer to improve the inherently modular design, thus enhancing maintenance and extensibility, e.g., allowing advanced algorithmic features to be developed and incorporated while routine maintenance of the code base continues.
Wuchun Feng, Virginia Polytechnic Institute and State University
For decades now, the notion of performance has been synonymous with speed. For example, although the performance of supercomputers running our n-body cosmology code has improved nearly 10,000-fold since 1992, the performance per watt has improved only 300-fold and the performance per square foot only 65-fold. The “mere” 300-fold increase in performance per watt implies that supercomputers are not making advances in power efficiency as significant as those in performance; likewise, the relatively minuscule 65-fold increase in performance per square foot (or alternatively, performance per square meter) means that advances in space efficiency, when compared to performance, have been virtually non-existent. These smaller gains in efficiency oftentimes result in the design and construction of new machine rooms, and in some cases, require the construction of entirely new buildings. Unfortunately, the focus on speed has led to the emergence of supercomputers that consume egregious amounts of electrical power and produce so much heat that extravagant cooling facilities must be constructed to ensure proper operation. In addition, the emphasis on speed as the performance metric has adversely affected other performance metrics, e.g., reliability. As a consequence, all of the above has contributed to an extraordinary increase in the total cost of ownership (TCO) of a supercomputer. Therefore, we espouse the importance of being green in high-performance computing and even argue for a complementary list to the TOP500: The Green500 List.
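The figures quoted above imply how much more power and floor space the faster machines need. The short calculation below is a back-of-the-envelope sketch that assumes the three improvement factors describe the same systems over the same period; the helper name is hypothetical.

```python
def implied_growth(performance_gain, efficiency_gain):
    """If performance grew by performance_gain while performance per unit of
    a resource grew by efficiency_gain, the resource consumed grew by the ratio."""
    return performance_gain / efficiency_gain

power_growth = implied_growth(10_000, 300)  # roughly 33 times more power
space_growth = implied_growth(10_000, 65)   # roughly 154 times more floor space
print(round(power_growth, 1), round(space_growth, 1))
```

Under that assumption, a 10,000-fold speedup came with roughly 33 times the electrical power and roughly 154 times the floor space.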
Jeremy Frey, University of Southampton
The e-Malaria project aimed to bring together 16- to 18-year-old school students with university researchers to explain aspects of computational drug design, using the example of the hunt for new anti-malarial drugs. Malaria kills a child every thirty seconds, and 40% of the world’s population lives in countries where the disease is endemic. Resistance to existing drugs is increasing, and with global warming the range of the malaria-carrying mosquitoes is expected to increase, so there is a growing need for new drug compounds. The challenge was presented to school students, who were to use a distributed drug search and selection system via a web interface to design potential drugs to act on the DHFR enzyme. The project makes use of industrial code for the docking study (“GOLD” from CCDC) and as such presents valuable lessons in how to achieve the integration of industrial programs into a “free” outreach environment. The results of the trials are displayed in an accessible manner, giving students an opportunity for discussion and debate both with peers and university researchers, to learn about computational drug design and Chemistry in general. The initial outreach project was extended to provide a similar challenge for undergraduate chemists as part of a chemical informatics course. For this course more complex design and modeling challenges were devised that used the same e-Malaria core programs, but at a level relevant to more advanced chemical skills. The types of problems devised will be illustrated in the presentation.
Leonard McMillan, University of North Carolina at Chapel Hill
What if solving nature’s puzzles was entertaining as well as fulfilling? Would you rather play a first-person shooter, or be the first person to figure out a gene’s function? Or is it possible to do both? This is the challenge that I gave a class of graduate students. We explored the potential of game interfaces, game-design principles, and game production approaches for constructing bioinformatics tools. You might ask why. 1) Set-top Supercomputers. The most powerful computer in most homes today is a video-game console. Today’s machines boast multiple cores and 100+ MFlop performance with high-end graphics. Moreover, at $299, they represent one of the best MFlop per dollar ratios in history. 2) Most bioinformatics applications stink. Typical bioinformatics tools require their user to be literate in statistics, computer science, and biology. Imagine if, in order to drive a car, you had to simultaneously be a test-driver, mechanic, and combustion engineer. This is what is expected of today’s biologists. Lab software focuses on function and features rather than usability. In contrast, video game manuals are seldom read. Is it possible to build scientific tools that are usable by anyone? Can we make them fun? 3) Leverage an insatiable resource. Can we harness the minds and reflexes of the billion-plus gamers worldwide to find cures for disease with incentives of being a high scorer rather than securing drug-patent rights? Many of the tasks confronted by biologists amount to combinatorial puzzles, not unlike the game “Bejeweled”. A biologist may spend years searching for patterns within a gene expression array. What if hundreds of gamers joined in, and explored their datasets in parallel? In this talk, I will share our experiences in writing video games with a purpose. This will include discussions of some of the underlying biology, as well as game demonstrations.
Wuchun Feng, Virginia Polytechnic Institute and State University
Since the advent of the computer, performance has always been defined with respect to speed. As a consequence, microprocessor vendors have not only doubled the number of transistors (and speed) every 18-24 months, but they have also doubled the power densities. Consequently, keeping a datacenter environment functioning properly requires continual cooling and exhaust, thus resulting in substantial operational costs, e.g., the annual cost of powering and cooling computer servers worldwide is fast approaching the annual spending on new machines. In addition, the increase in power densities has led to a decrease in system reliability, thus leading to lost productivity. To address these problems in the datacenter, we present a power-aware scheduling algorithm that automatically and transparently adapts its voltage and frequency settings to achieve significant power reduction and energy savings with minimal impact on the performance of datacenter workloads. We evaluate our power-aware scheduling algorithm on actual platforms based on AMD and Intel processors, which support PowerNow! and demand-based switching, respectively. For sequential and parallel scientific workloads in datacenters, the energy savings average 20% and 25%, respectively, with maximum energy savings reaching as high as 70%. The energy savings for business workloads in datacenters are even higher given their transaction-based execution profiles.
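The idea of trading a small, bounded slowdown for energy savings can be sketched as follows. This is an illustrative model only, not the scheduling algorithm evaluated above: it assumes that only the CPU-bound fraction of run time scales with clock frequency, that dynamic power is proportional to frequency times voltage squared, and that the list of frequency/voltage settings (GHz, volts) is hypothetical.

```python
def relative_time(freq, f_max, cpu_fraction):
    """Run time at freq relative to run time at f_max; only the
    CPU-bound fraction of the work slows down with the clock."""
    return cpu_fraction * (f_max / freq) + (1.0 - cpu_fraction)

def relative_energy(freq, volt, f_max, v_max, cpu_fraction):
    """Energy relative to running at (f_max, v_max), assuming
    dynamic power proportional to freq * volt**2."""
    power = (freq * volt ** 2) / (f_max * v_max ** 2)
    return power * relative_time(freq, f_max, cpu_fraction)

def pick_setting(settings, cpu_fraction, max_slowdown=0.05):
    """Lowest-energy (freq, volt) pair whose slowdown stays within max_slowdown."""
    f_max, v_max = max(settings)
    allowed = [(f, v) for f, v in settings
               if relative_time(f, f_max, cpu_fraction) <= 1.0 + max_slowdown]
    return min(allowed,
               key=lambda s: relative_energy(s[0], s[1], f_max, v_max, cpu_fraction))

# Hypothetical (GHz, volts) operating points.
settings = [(2.0, 1.5), (1.6, 1.3), (1.2, 1.1), (0.8, 0.9)]
print(pick_setting(settings, cpu_fraction=1.0))  # CPU-bound work stays at full speed
print(pick_setting(settings, cpu_fraction=0.1))  # memory- or I/O-bound work can slow the clock
```

Under this model, workloads that spend most of their time waiting on memory or I/O lose little performance at a lower clock, which is where the savings come from.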
Oscar Corcho, Paolo Missier, Pinar Alper, Sean Bechhofer, Carole Goble; University of Manchester
eScience applications are usually characterized by their distributed and knowledge-intensive nature, which poses interesting new metadata management challenges, such as metadata distribution across application components, access control, evolution, etc. Given the role of metadata in these applications, we think that it should be treated as a first class entity, coexisting with other entities in the system (Web services, datasets, sensors, documents, etc.). This shift in the treatment of metadata makes it possible to deal appropriately with these challenges. This is what we propose in the S-OGSA architecture (which stands for Semantically-enriched Open Grid Service Architecture, originally proposed as a semantic extension of Grid applications), and what we have implemented in its supporting reference technological infrastructure. In S-OGSA, metadata can refer to any first-class entity that an application is dealing with (services invoked by a workflow engine, datasets, sensors, scientific documents, etc.), and it can be represented in multiple forms (natural language documentation, user-defined tags, ontology instances, etc.). Metadata is stored in metadata containers, called Semantic Bindings, which are linked to the entities that they refer to and which can be accessed either independently or jointly in a system, regardless of their physical distribution. Access control can be applied with different levels of granularity, since Semantic Bindings may contain small or large pieces of metadata from a specific resource, and metadata lifetime can be managed by means of appropriate event-driven notification mechanisms that trigger transitions between metadata states. We describe the main design principles of S-OGSA and how they can be applied in different eScience scenarios, with examples of a prototype developed in the domain of satellite image quality analysis.
Munindar Singh, Yathiraj Udupi, North Carolina State University
Collaboration among peers is common in large-scale scientific computing (as in production grids). Often, resources (e.g., data, compute servers) need to be shared among multiple parties in a manner that respects both the overall needs of the collective and the individual. The famous example of preemptive scheduling is a case in point. Currently, computational support for collaborative resource sharing is inadequate. A common approach is to apply policy engines. This poses two challenges. One, when autonomous peers interact, a centralized policy engine cannot make decisions for all of them. Two, current approaches lack a deep conceptual model of how collaboration takes place in scientific computing (or service engagements broadly). We define Governance as the process by which peers achieve agreement about how they will administer themselves. We contrast governance with management, which (as the current mindset) applies to a superior managing his or her subordinates — clearly inapplicable among peers. We have developed a conceptually well-grounded approach for Governance. This models organizations based upon our formalization of commitments. Each organization is defined in terms of the standing commitments among its members. These commitments constrain the members’ behaviors. Organizations can enter into contract with one another. Our conceptual model includes a rich vocabulary by which interactions among peers (such as for administering organizations) can be captured, and appropriate policies stated for each peer to satisfy both collective and individual needs. This is how we achieve policy-based governance. A multi-agent prototype demonstrates our model and architecture. Our research seeks to capture important technical properties of policy-based governance. This presentation summarizes work previously reported in AAAI 06 and SCC 06 and 07.
Richard Buttimer, The University of North Carolina at Charlotte
Mortgages are one of the major fixed-income investment classes in the U.S. They are held by financial institutions, pension funds, mutual funds, and hedge funds. They are also frequently held in the investment portfolio of non-financial firms. Mortgages are an extremely complex financial instrument for a variety of reasons: they are long-lived, they are extremely interest rate sensitive, and they have embedded within them the borrower’s options to default and prepay. In practice, mortgage pricing is nearly always done through very lengthy and computationally-intensive Monte Carlo simulation. Microsoft, RENCI, and UNC Charlotte are working together to develop a mortgage pricing system utilizing the Microsoft Hosted High Performance Computing system. This system will initially be used in advanced MBA courses. Students in these courses will be assigned the task of managing simulated mortgage portfolios similar to those held by large money-center banks. They will utilize the pricing model to determine not only the prices of the securities they hold, but also their risk characteristics. The system will also provide prices and risk characteristics for a variety of alternative investment and hedging vehicles. This system will provide the students with a near “real world” mortgage portfolio management experience. Microsoft, RENCI, and UNC Charlotte will each gain experience with hosted high-performance computing applications. Although the system will initially utilize a publicly-available model, the Office of Thrift Supervision (OTS) regulatory model, the model could potentially be expanded to be a commercially viable system.
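As a rough illustration of the Monte Carlo approach mentioned above, the sketch below values a level-payment fixed-rate mortgage by discounting its cash flows along simulated short-rate paths. It is a toy under loud assumptions, not the OTS regulatory model: the rate follows a simple random walk floored at zero, and the borrower's prepayment and default options, which make real mortgage pricing hard, are omitted. All names and parameter values are hypothetical, and the coupon rate must be positive.

```python
import math
import random

def monthly_payment(balance, annual_rate, n_months):
    """Level payment that amortizes balance over n_months (annual_rate > 0)."""
    r = annual_rate / 12
    return balance * r / (1 - (1 + r) ** -n_months)

def mc_price(balance, coupon, n_months, r0, vol, n_paths, seed=0):
    """Average present value of the scheduled payments over simulated rate paths."""
    rng = random.Random(seed)
    pay = monthly_payment(balance, coupon, n_months)
    total = 0.0
    for _ in range(n_paths):
        rate, discount, value = r0, 1.0, 0.0
        for _month in range(n_months):
            discount /= 1 + rate / 12          # discount one more month
            value += pay * discount
            # Toy rate dynamics: annualized volatility vol, floored at zero.
            rate = max(0.0, rate + vol * math.sqrt(1 / 12) * rng.gauss(0, 1))
        total += value
    return total / n_paths

# With no volatility and a discount rate equal to the coupon, the loan prices at par.
print(mc_price(100_000.0, 0.06, 360, r0=0.06, vol=0.0, n_paths=1))
print(mc_price(100_000.0, 0.06, 360, r0=0.06, vol=0.01, n_paths=200))
```

Adding path-dependent prepayment and default behavior multiplies the work per path, which is why production models of this kind are typically run on high-performance computing systems.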
Amarjeet Singh, Maxim Batalin, William Kaiser; University of California, Los Angeles
Networked InfoMechanical Systems (NIMS) provide a family of robotic platforms for diverse environment monitoring applications. We provide an overview of these systems and their applicability through several real world sensing campaigns that provided scientists with data at a scale and resolution that was not previously possible. The new class of observational methods is also supported by experimental design that optimizes measurement fidelity by combining knowledge of measurement objectives, phenomena models, and system constraints. We have developed and demonstrated the generally applicable Iterative experimental Design for Environmental Applications (IDEA) methods and systems to efficiently use distributed sensing and computing for understanding the high spatial and temporal variability associated with environmental applications. Next, we model the observed natural system as a Gaussian Process and present a resource-cost-aware informative path planning approach. In this approach, we compute a set of most informative observation locations that can be visited by the mobile robot subject to an upper bound on the robot's resource capacity, such as limited sensing time or limited battery capacity. For this NP-hard problem, we provide strong approximation guarantees for the single robot scenario and extend the approach to multiple robots with a near-optimal approximation guarantee. The NIMS family of sensing systems, together with a systematic experimental design approach that also involves phenomena modeling, enabled the first high resolution imaging of several important scientific phenomena such as contaminant concentration and algal bloom dynamics. This work is currently being applied to survey entire river systems in interdisciplinary investigations providing scientists with important new characterization of primary national water resources.
Lin Ye, Hao Zhu, Alexander Golbraikh, Alexander Tropsha, University of North Carolina at Chapel Hill
Predictive models for acute fish toxicity (96 hour fathead minnow LC50) have been developed. A dataset consisting of 587 molecules with experimentally determined LC50 values was compiled. The entire dataset was randomly divided into a modeling set (470 compounds) and an external validation set (117 compounds), and this procedure was repeated ten times to generate 10 modeling-validation set pairs. Molecular descriptors were calculated by Dragon and MolConnZ software for all compounds in every subset. Each modeling set was split into multiple training-test sets using a diversity sampling approach. QSAR models were developed for individual training sets by kNN methods and the resulting models were validated using the respective test sets. The models that satisfied the cutoff (both leave-one-out cross-validation Q2 for the training set and linear fit R2 for the test set greater than 0.6) were kept. All the successful models were used to make the consensus prediction of the external validation set. The statistical results of all 10 external validation experiments were similar (R2 ranging from 0.67 to 0.83, Mean Absolute Error (MAE) ranging from 0.46 to 0.66). The results were improved by removing outliers of the modeling set compounds in the chemical space before model development: for the external validation sets, R2 ranged between 0.76 and 0.82, and MAE between 0.41 and 0.44.
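The validation protocol above can be sketched in a few functions. This is a simplified illustration, not the authors' software: the split is purely random, kNN works on a single hypothetical descriptor rather than Dragon or MolConnZ descriptors, no variable selection or diversity sampling is performed, and all function names are assumptions.

```python
import random

def split_dataset(n_total=587, n_model=470, seed=0):
    """Randomly divide compound indices into modeling and external validation sets."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_model], indices[n_model:]

def knn_predict(train, query_x, k=3):
    """Predict activity as the mean of the k nearest training points (1-D descriptor)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query_x))[:k]
    return sum(y for _x, y in nearest) / len(nearest)

def loo_q2(train, k=3):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / total sum of squares."""
    mean_y = sum(y for _x, y in train) / len(train)
    press = sum((y - knn_predict(train[:i] + train[i + 1:], x, k)) ** 2
                for i, (x, y) in enumerate(train))
    total = sum((y - mean_y) ** 2 for _x, y in train)
    return 1.0 - press / total

def linear_fit_r2(observed, predicted):
    """R2 of the linear fit between observed and predicted values
    (the squared Pearson correlation)."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

def accept(q2, r2, cutoff=0.6):
    """Keep a model only if both statistics exceed the cutoff."""
    return q2 > cutoff and r2 > cutoff

# Ten modeling/validation pairs, one per random seed.
pairs = [split_dataset(seed=s) for s in range(10)]
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

Requiring both statistics to pass guards against models that fit the training set well but do not predict the held-out test set.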
Winston Wu, Maxim Batalin, William Kaiser, University of California, Los Angeles
Recent advancement in micro sensor technology permits miniaturization of conventional physiological sensors. Combined with low-power, energy-aware embedded systems and low power wireless interfaces, these sensors now enable patient monitoring in home and workplace environments in addition to the clinic. Low energy operation is critical for meeting typical long operating lifetime requirements. Important challenges appear as some of these important physiological sensors, such as electrocardiographs (ECG), introduce large energy demand because of the need for high sampling rate and resolution, and also introduce limitations due to reduced convenience of user wearability. Energy usage of the distributed sensor node systems may be reduced by activating and deactivating sensors according to real-time measurement demand. Indeed, as will be described, not all the physiological sensors are required at all times in order to achieve high certainty diagnostics. Our results show that with proper adaptive measurement scheduling, an ECG signal from a subject may be needed for analysis only at certain times, such as during or after an exercise activity. This demonstrates that autonomous systems may rely on low energy cost sensors combined with real time computation to determine patient context and apply this information to properly schedule use of high cost sensors, for example, ECG sensor systems. We have implemented a wearable system based on standard widely-used handheld computing hardware components. This system relies on a new software architecture and an embedded inference engine developed for these standard platforms. The performance of the system is evaluated using experimental data sets acquired for subjects wearing this system during an exercise sequence. This same approach can be used in context-aware monitoring of diverse physiological signals in a patient’s daily life.
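The scheduling idea can be illustrated with a deliberately simple rule: a low-cost activity reading decides when the high-cost ECG is switched on. This is a hypothetical threshold rule standing in for the embedded inference engine described above; the threshold, the recovery window, and the sample values are all assumptions.

```python
def schedule_ecg(activity_levels, threshold=1.5, recovery_samples=2):
    """Return one boolean per sample: True where the high-cost ECG should be on.
    The ECG runs while the low-cost activity reading exceeds threshold and
    for recovery_samples samples after the activity ends."""
    decisions, remaining = [], 0
    for level in activity_levels:
        if level > threshold:
            remaining = recovery_samples   # restart the recovery window
            decisions.append(True)
        elif remaining > 0:
            remaining -= 1
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions

# Hypothetical accelerometer-derived activity levels, one per time step.
levels = [0.2, 0.3, 2.0, 2.4, 0.4, 0.3, 0.2, 0.1]
plan = schedule_ecg(levels)
print(plan)
print(sum(plan) / len(plan))  # fraction of time the ECG is powered
```

In this example the ECG is powered for half of the samples, covering the exercise period and a short recovery window, instead of running continuously.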
Uma Shama, Lawrence Harman, Juozas Baltikauskas, Daniel Fitch, Glen Kidwell; Bridgewater State College
We document the collaboration of the GeoGraphics Laboratory at Bridgewater State College and the Town of Brewster (MA) Fire and Rescue Department to develop a low-cost automatic vehicle location (AVL) system using commercial-off-the-shelf (COTS) military-specification cell phones and web mapping applications to provide situational awareness and post-action analysis for emergency response command and control personnel in a mobilization involving multiple jurisdictions. Using open-source software, a program was written to send assisted global positioning system (A-GPS) data at very high refresh rates (2-4 seconds) using inexpensive data-only cell phones and standard Internet communications. The web mapping application provides a rich no-cost display of the AVL data on the public domain web service http://www.geolabvirtualmaps.com/ (Southeastern MA Emergency Response) with the capacity to add custom features defined by the local emergency response and emergency management personnel. It is hosted on Microsoft Virtual Earth but uses GeoRSS standards for creating points, lines and areas for geographic objects added to the application. It also provides a dynamic reverse geo-coding feature that displays the nearest street address on the vehicle location label of the web display for emergency response commanders. The system was tested as a part of the Fourth of July Provincetown (MA) Fireworks Mobilization involving ambulances and emergency response personnel from six towns. This presentation will provide the design features, a geo-spatial analysis of the mobilization and a debriefing of the mobilization commander. This assessment will critique the performance of the technology before, during and after the mobilization.
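For readers unfamiliar with GeoRSS, the sketch below builds a minimal Atom entry fragment carrying a GeoRSS-Simple point, whose content is latitude and longitude separated by a space. It only illustrates the encoding; it is not the project's software. The vehicle name, coordinates, and time stamp are hypothetical, and a complete Atom feed would need further required elements such as an entry id.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
GEORSS = "http://www.georss.org/georss"
ET.register_namespace("atom", ATOM)
ET.register_namespace("georss", GEORSS)

def vehicle_entry(vehicle_id, lat, lon, timestamp):
    """Serialize one vehicle position as an Atom entry with a georss:point."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = vehicle_id
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = timestamp
    # GeoRSS-Simple point: "latitude longitude" in decimal degrees.
    ET.SubElement(entry, f"{{{GEORSS}}}point").text = f"{lat:.5f} {lon:.5f}"
    return ET.tostring(entry, encoding="unicode")

# Hypothetical vehicle report.
print(vehicle_entry("Ambulance 1", 41.7601, -70.0828, "2000-01-01T00:00:00Z"))
```

A mapping client that understands GeoRSS can place each such entry on the map and refresh it as new positions arrive.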
Thomas Finholt, Erik Hofer, David Lee, University of Michigan
The School of Information at the University of Michigan recently launched the Virtual Space Interaction Test bed (VISIT) project. VISIT demonstrates a number of “ultra-resolution” collaboration capabilities. Using OptIPortals of varying sizes (e.g., arrays of commodity LCD displays coupled with computing clusters and high performance networking), VISIT supports visualization of images and data at very high resolution (currently 50 megapixels) alongside uncompressed HD video of distant collaborators. Previous use of OptIPortals has emphasized collocated collaboration and visualization. A key feature of VISIT is distributed installation of OptIPortals to enable distant collaboration. Requirements for distant collaboration are much different. For example, with limited or reduced shared visual access, it is necessary to create or simulate many of the cues used in shared spaces to coordinate conversation and to orient to common visual references. Therefore, VISIT explores the use of multi-modal sensor data, artifacts (e.g., shared electronic posters), and visual cues to allow distributed collaborators to use OptIPortals both to conduct their scientific work better and to improve awareness of the availability and presence of remote colleagues. This model of OptIPortal use emphasizes socio-technical aspects of the technology, seeking to produce gains in scientific understanding by improving the process of collaboration, as well as through the introduction of advanced visualization capabilities. Therefore, a key goal of VISIT is evaluation of use in terms of the impact on creation and maintenance of social network ties among scientists, research performance (e.g., time to produce publications), and usability.
Mehrdad Jahangiri, Cyrus Shahabi; University of Southern California
Spreadsheets allow us to perform complex data analysis on scientific datasets. However, they cannot operate efficiently on large multidimensional datasets generated by the current data acquisition methods. Current science practice is to store the original data in databases or ftp sites and then manually generate a smaller subset of the data (by sampling, aggregating, or categorizing). Yet, this time-consuming process suffers from one major drawback. By losing the detailed information and working with the second-hand dataset, we conduct a biased study of the data by verifying our known hypotheses rather than being surprised by unknown facts. One of the most frequently used functions of spreadsheets is generating meaningful plots over the data. However, to the best of our knowledge no other work has studied plots as “queries” on large datasets. A Plot query summarizes how a fact changes over a set of attributes and is visually represented in various forms of charts. The valuable insight provided by these queries comes from the illustrated relationship among the plot points. Thus it is essential to preserve this relationship in approximate or progressive answering rather than conserving the accuracy of each individual plot point. Here, we propose a wavelet-based technique that exploits I/O sharing across plot points to evaluate the query progressively and efficiently. The intuition comes from the fact that we can decompose a plot query into two sets of aggregate and slice-and-dice queries. Subsequently, we can effectively compute both as investigated in our earlier studies. Our technique is not only efficient as an exact algorithm but also very effective as an approximation method in case of limited query time or storage space. We believe this study can proactively lead us toward building an interactive pivot chart on massive multidimensional datasets.
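The appeal of wavelets for progressive answering can be seen in a small sketch using an unnormalized Haar decomposition. This is a textbook illustration rather than the technique proposed above (it does not model I/O sharing or multidimensional data); the function names are hypothetical and the input length is assumed to be a power of two.

```python
def haar_transform(values):
    """Unnormalized Haar decomposition: overall average first, then
    difference coefficients from coarsest to finest."""
    coeffs, current = [], list(values)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]
        details = [(a - b) / 2 for a, b in pairs]
        coeffs = details + coeffs      # finer details go toward the end
        current = averages
    return current + coeffs

def haar_inverse(coeffs):
    """Rebuild the series from coefficients produced by haar_transform."""
    current, pos = coeffs[:1], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        current = [v for avg, d in zip(current, details) for v in (avg + d, avg - d)]
        pos += len(details)
    return current

def progressive_plot(values, n_coeffs):
    """Approximate all plot points using only the n_coeffs coarsest coefficients."""
    coeffs = haar_transform(values)
    kept = coeffs[:n_coeffs] + [0.0] * (len(coeffs) - n_coeffs)
    return haar_inverse(kept)

points = [2, 4, 6, 8]
for n in (1, 2, 4):
    print(n, progressive_plot(points, n))
```

Each additional coefficient refines every affected plot point at once, so the overall shape of the plot emerges early and sharpens as more coefficients are read.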
Tiberiu Stef-Praun, Ian Foster, Computation Institute/University of Chicago; Robert Townsend, Economics Dept/ University of Chicago
We report on a project that seeks to scale up this approach to larger quantities of data, more computationally demanding analytic methods, and a larger population of economist and student users. At the core of this project is an infrastructure that integrates spatial data services for organizing, accessing, analyzing, and displaying spatial data, and computational services that allow for the distributed processing of models on Grid-enabled resources. Integration via Web Services allows users to pose questions that are answered by extracting data from GIS data sources, running substantial computations on that data and depositing derived data back into the spatial data store.
Carole Goble, Andrew Gibson, Matthew Gamble, Katy Wolstencroft; The University of Manchester; Tom Oinn, The European Bioinformatics Institute
Workflow environments like Taverna (http://www.mygrid.org.uk/) are great for scientists who have a clear understanding of their task and goals. However, a significant amount of bioinformatics does not have such well defined goals. We present the Data Playground, an environment designed to encourage the uptake of workflow systems in bioinformatics through more intuitive interaction by focusing the user on their data rather than on the processes. A prototype plug-in for the Taverna workflow environment shows how we can promote the creation of workflow fragments by automatically converting the users’ interactions with data and Web Services into a more conventional workflow specification. We claim that this exploratory mode is more natural to users, and enables workflow development by example.
Hao Tang, Alexander Tropsha, Simon Wang, The University of North Carolina at Chapel Hill; Alan Kozikowski, University of Illinois at Chicago; Bryan Roth, The University of North Carolina at Chapel Hill
Histone deacetylases (HDAC) play a critical role in transcription regulation. Small molecule HDAC inhibitors are an emerging target for treating cancer and other cell proliferation diseases. Several previous reports have studied 3D Quantitative Structure-Activity Relationship (QSAR) to assess the possibility of computer based drug mining for HDAC inhibitors. We employed the variable-selection k-Nearest Neighbor (kNN) and Support Vector Machines (SVM) approaches to generate QSAR models for 59 chemically diverse compounds with inhibition activity on class I histone deacetylase. MOE and MolConnZ based 2D descriptors were combined with kNN and SVM approaches independently to improve the predictability of models. Rigorous model validation approaches were employed including randomization of target activity (Y-randomization test) and assessment of model predictability by consensus prediction on two external datasets. Highly predictive QSAR models were generated with leave-one-out cross validation R2 (q2) values for the training set and R2 values for the test set as high as 0.81 and 0.80, respectively, with the MolConnZ/kNN approach and 0.94 and 0.81, respectively, with the MolConnZ/SVM approach. Validated QSAR models were then used to mine four chemical databases which included a total of over 3 million compounds resulting in 48 consensus hits, including two reported HDAC inhibitors not included in the original data set.
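The leave-one-out cross-validated q2 reported above can be sketched for a plain kNN regressor as follows. This is a minimal illustration, assuming Euclidean distance on the descriptors, an unweighted neighbor average, and the usual definition q2 = 1 - PRESS / SS; it omits the variable selection step used in the study, and the function names are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Predict activity as the mean activity of the k nearest training compounds."""
    ranked = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def loo_q2(xs, ys, k):
    """Leave-one-out q2: 1 - PRESS / total sum of squares around the mean."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Hold compound i out and predict it from all the others.
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        prediction = knn_predict(rest_x, rest_y, xs[i], k)
        press += (ys[i] - prediction) ** 2
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - press / total
```

A Y-randomization test would repeat the same calculation after shuffling `ys`; a sound model should then yield a much lower q2.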
Meiyappan Nagappan, North Carolina State University; Ilkay Altintas, San Diego Supercomputing Center; George Chin, Pacific Northwest National Lab; Daniel Crawl, San Diego Supercomputing Center; Terence Critchlow, Pacific Northwest National Lab; David Koop, University of Utah; Jeffrey Ligon, North Carolina State University; Bertram Ludaescher, University of California, Davis; Pierre Mouallem, North Carolina State University; Norbert Podhorszki, University of California, Davis; Claudio Silva, University of Utah; Mladen Vouk, North Carolina State University
Scientific workflow management systems are used to automate scientific discovery. The increasing complexity of such workflows, and sometimes legal reasons, is fueling a demand for more run-time and historical information about the workflow processes, outputs, environments, etc. A properly constructed framework for collecting run-time and provenance information can help manage, integrate and display the needed information. In this paper we present the provenance system developed by the Department of Energy Scientific Data Management Enabling Technology Center’s Scientific Process Automation group. The solution adds to the successful Kepler scientific workflow support system by integrating Kepler with a standard LAMP (Linux, Apache, MySQL, PHP) environment to provide a very flexible and readily deployable (K)LAMP scientific workflow support environment for e-science. The solution is sufficiently modular to allow use of other workflow engines and other component solutions. This paper discusses the architecture of the solution, its deployment and some of the principal challenges it is solving: how to collect provenance information in a standardized and seamless way and with minimal overhead, how to store this information in a permanent way so that the scientist can come back to it at any time, and how to present this information to the user in a logical manner. A further part of the challenge is the privacy policies and strict security policies that apply to Department of Energy (DoE) national laboratories.
Yuri Peterson, Duke University; Simon Wang, The University of North Carolina at Chapel Hill; Patrick Casey, Duke University; Alexander Tropsha, The University of North Carolina at Chapel Hill
Geranylgeranyltransferase inhibitors (GGTIs) are small molecule drugs that inhibit C20 lipid modification of CaaX motif proteins. Attenuating function of these proteins will provide therapeutic benefit in cancer, inflammation, multiple sclerosis, viral infection (HepC/HIV), apoptosis, angiogenesis, rheumatoid arthritis, atherosclerosis (vascular disease), psoriasis, glaucoma and diabetic retinopathy. However, there are only two publicly known chemical scaffolds available for GGTIs at present. We have developed combinatorial quantitative structure-activity relationship (QSAR) models for 48 known GGTIs, using the k-nearest neighbor (kNN) method, the automated lazy learning (ALL) method and the partial least squares (PLS) method. The models were rigorously validated based on several statistical criteria, including the randomization of the target property (Y-randomization), the verification of the training set models’ predictive power using test sets, and the establishment of the models’ applicability domain. The validated QSAR models were used to mine major publicly available chemical databases, including the National Cancer Institute database of ca. 250,000 compounds, the Maybridge database of ca. 54,000 compounds, the ChemDiv database of ca. 630,000 compounds, the WDI database of ca. 59,000 compounds, and the ZINC 7.0 database of ca. 6,500,000 compounds. These searches resulted in multiple consensus hits and revealed several new chemical scaffolds for GGTIs. They have been validated by biological assays and patented recently. This study illustrates that the combined application of predictive QSAR modeling and database mining may provide an important avenue for rational computer-aided drug discovery.
Hanan Samet, University of Maryland; Jagan Sankaranarayanan, University of Maryland; Michael Lieberman, University of Maryland; Adam Phillippy, University of Maryland
eScience techniques can be used to understand the source and spread of disease epidemics to contain future outbreaks, thereby possibly reducing the potentially massive toll on human life in underdeveloped nations. Even though epidemiological information is available for many pathogenic microbes, incidence reports are scattered and are difficult to summarize. We have built a system to automatically extract, classify and organize incidence reports based on geographic location and type for analysis by domain experts. Documents from the U.S. National Library of Medicine (www.pubmed.gov) and the World Health Organization (www.who.int) have been tagged according to their spatial and temporal relationships to specific disease occurrences, and presented graphically via a map interface. This work has leveraged our experience with the SAND Spatial Browser and Spreadsheet to provide spatial and textual search capabilities on the web (e.g., documents on “influenza” near “Hong Kong”). Users can also see the phrases in the documents that satisfy the query, thereby facilitating easy verification as well as dismissal of false positives due to errors in identification of geographical references, which are difficult to avoid. The user interface also provides the ability to restrict the search result to a particular time period. In addition, newspaper articles have been tagged and indexed to bolster the surveillance of ongoing epidemics, while examining past epidemics using our system leads to improved understanding of the sources and spreading mechanisms of infectious diseases. In our paper, we will describe the design of our system which combines state of the art technologies from different areas of computer science and demonstrate the working and the usefulness of our system.
James Hogan, Paul Roe; Queensland University of Technology
Modern scientific enquiry, particularly in bioinformatics, is increasingly characterized by fine-grained comparative analyses over large data sets. Such studies require the automation of software tools to operate across multiple data values, and sensible strategies for managing the explosion of outputs which may result. Modern scientific workflow systems, therefore, must provide support for these activities and for the active involvement of the user in selection, combination and filtering. In this talk we present a new version of the GPFlow scientific workflow system which provides extensive support for collection processing, but does so in a manner largely transparent to the user, and which avoids the need for the scientist to take direct control of operational plumbing. GPFlow is a novel, web-accessible workflow system which makes large-scale comparative studies accessible without programming, and eases the transition from small-scale experimentation to large scale serious analyses. In a typical comparative study, several tools and services are used in concert, and all must be lifted to operate across sets of values to implement the analysis, with some components drawing upon outputs from multiple precursors. Data must be combined and filtered at each end of the process. The model and its implementation are presented in the context of a core Bioinformatics problem – the search for regulatory motifs. The model is novel in allowing a workflow on a single data value to be automatically lifted to operate on a set of values. Users may thus prototype on the small-scale and execute on the large, a process which requires no changes to the underlying workflow. The model follows our previous work in supporting combined interactive and batch operation.
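The notion of lifting a single-value workflow step to operate on sets can be sketched in a few lines. This is only an illustration of the general idea, not GPFlow's implementation: the `lift` helper, the cross-product combination rule and the toy motif-counting step are all assumptions made for the example.

```python
from itertools import product


def lift(step):
    """Wrap a single-value step so that every argument may be a collection.

    The lifted step runs the original step on every combination of inputs
    (a cross product, which is an assumption of this sketch) and returns
    (inputs, result) pairs so that outputs can be filtered afterwards.
    """
    def lifted(*collections):
        return [(combo, step(*combo)) for combo in product(*collections)]
    return lifted


def count_motif(sequence, motif):
    """Toy single-value step: count non-overlapping motif occurrences."""
    return sequence.count(motif)


# Prototype on one value, then run the unchanged step across sets of values.
single = count_motif("ACGTACGT", "ACG")
results = lift(count_motif)(["ACGTACGT", "TTTT"], ["ACG", "TT"])
hits = [inputs for inputs, count in results if count > 0]
```

The step itself is untouched when moving from the single call to the lifted call, which mirrors the claim that scaling up requires no change to the underlying workflow.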
Eric Sills, Sam Averitt, Michael Bugaev, Aaron Peeler, Henry Schaffer, Josh Thompson, Mladen Vouk; North Carolina State University
North Carolina State University has developed a computational and application resource brokering, differentiation, and delivery system called Virtual Computing Laboratory (VCL). VCL allows sharing of a common hardware infrastructure by a range of applications from CAD packages. Initially, VCL virtualized the STEM computing environments to deliver applications students needed for their course work and research via their personal computing devices rather than at a physical computer lab on campus. As development of VCL progressed, the hardware resources flowed back and forth between serving typical HPC workloads as production Linux cluster nodes and providing on-demand student-computing applications on various operating systems. Demand curves for these two uses tend to be out of phase, with student computing demand building as the academic semester progresses and HPC demand peaking following the end of exams. This allows much better utilization of the hardware resources. VCL has been in production use at North Carolina State University for about three years. Flexibility of VCL has proven to be essential in easily supporting specialized university research computing demands, and our experience is that VCL-based hardware and application management provides a much greater service at a considerably lower cost per unit of service. In addition, VCL provides the various standard and customized services with much less intervention of the central IT staff than previously necessary. This paper discusses the details of the VCL architecture, economics, security and versatility.
Richard Mason, Binh Pham, Paul Roe, Queensland University of Technology
Sound is a rich medium carrying a great deal of information that is tractable for analysis. The natural environment is rich in sounds; potentially fauna, weather, and machinery can be located and recognized. Environmentalists use sound to measure the health of the environment by monitoring key species such as birds, which are early indicators of environmental change. We have designed a sensor network based on smart phones for monitoring environmental change. The platform comprises smart phones running a custom application for recording bird song. Sensors are managed in an autonomic fashion to ensure that they operate reliably and efficiently for long periods of time. Recorded birdsong is uploaded to a relational database through a 3G telephony network. The nature of acoustic sensing means that large volumes of data are collected, so data communication and optimization are important. Sensor recording can be remotely controlled through a web service interface. Sound data stored in a database is analyzed to recognize different birds and bird calls using a neural network. A novel noise reduction technique is employed prior to identification. The analyses potentially enable the location, type of bird and bird behavior (through bird call) to be known. From this, temporal and spatial profiles of bird behavior can be studied and the effects of environmental change can be known. A field study is being undertaken at Brisbane airport where a second runway is being constructed. Brisbane airport is located in an environmentally valuable wetland area, which is the habitat for much wildlife including the rare Lewin’s Rail. This study aims to address a number of questions regarding this bird using acoustic sensor networks. The sensors will provide valuable information on the birds’ habits as well as a measure of the impact of the new runway’s construction.
Jui-Hua Hsieh, Simon Wang, Shuxing Zhang, Alexander Tropsha; The University of North Carolina at Chapel Hill
Molecular docking has become a common technique in structure-based drug design. Although state-of-the-art search algorithms implemented in the docking software can generate native-like poses in the binding sites, the performance of the scoring functions is still unsatisfactory. The failure to correlate the key interactions with binding affinities leads to the “geometric decoys”, poses deviating more than 3.0 angstrom RMSD from the native pose but with better energy scores (Shoichet BK. et al. J. Med. Chem. 2005, 48, 3714-3728). k-Nearest Neighbor (kNN) binary QSAR models generated from 264 protein-ligand complexes in the Protein Data Bank using ENTess descriptors were applied to four geometric decoy datasets, namely Thrombin, Dihydrofolate Reductase (DHFR), Thymidylate Synthase (TS) and Acetylcholine Esterase (AchE).
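The geometric decoy criterion quoted above can be written down directly. The sketch below assumes that both poses list the same atoms in the same order, that RMSD is computed in the receptor frame without superposition, and that a lower score means a better energy; these conventions and the function names are assumptions of the example, not details taken from the abstract.

```python
import math


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses given as (x, y, z) tuples."""
    if not pose_a or len(pose_a) != len(pose_b):
        raise ValueError("poses must be non-empty and contain the same atoms")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(pose_a, pose_b))
    return math.sqrt(squared / len(pose_a))


def is_geometric_decoy(pose, pose_score, native, native_score, cutoff=3.0):
    """True if the pose is far from the native pose yet scores better (lower)."""
    return rmsd(pose, native) > cutoff and pose_score < native_score
```

For example, a two-atom pose translated by 4 angstroms along z has an RMSD of 4.0 from the native pose and counts as a decoy if its score is lower than the native score.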
David Hoyle, Iain Buchan, Peter Crowther; University of Manchester
Microarray technology for genome-wide Single Nucleotide Polymorphism (SNP) genotyping provides a unique opportunity to study complex diseases. This opportunity also presents computational and knowledge management challenges, and the statistical analysis presents a computational bottleneck in processing the raw data, motivating the need for a High Performance Computing (HPC) based solution. Statistical analysis of the raw data produces an equally large volume of derived data. Making sense of this derived data requires integrating the statistical analyses with information already known to the research community, such as SNP location, gene regulation, relevant biochemical pathways etc. Leveraging this community knowledge allows us to filter the statistical analysis and focus upon the most important genetic determinants of the diseases. The community knowledge exists in the form of individual expertise of scientists and information deposited in distributed databases and knowledge repositories. Easy access to both HPC infrastructure and community knowledge will be crucial for accelerating new research findings from genome-wide SNP studies. At NIBHI we have begun to develop, in collaboration with Microsoft, the necessary HPC infrastructure. The HPC facility will be accessed via a SharePoint portal site, providing a shared environment through which collaborating scientists exchange results, analyses, comments and documents. Running the statistical analyses on the HPC infrastructure is executed by initiating workflows from the portal site. Access to community knowledge will be done through automatic retrieval of annotation data from distributed sources. This can be performed via integration with existing bioinformatics workflow management systems, such as Taverna, that allow us to re-use workflows calling web services accessing the knowledge repositories.
Alan Huber, The University of North Carolina at Chapel Hill
High-fidelity local-scale Computational Fluid Dynamics (CFD) simulation of pollutant concentrations within roadway and urban landscapes is feasible using current high performance computing. Local-scale CFD simulations are able to account rigorously for topographical details such as terrain variations and building structures in urban areas. Solar or anthropogenic heating may be added to terrain and building surfaces. Real human environments may be directly simulated to support urban planning and response to emergency situations. There is a wide range of potential applications where computational wind engineering will become routine in coming years as computing hardware and software continue to grow and expand the frontiers for application. This presentation will briefly review the history of developments of computational environmental fluid dynamics. Owing to advances in computational hardware and software, modern-day fluid dynamics has evolved considerably since Sir Isaac Newton’s physical equations and the development of the Navier-Stokes equation for fluid flow. The Navier-Stokes equation is the general basis for all CFD applications, for example, from weather prediction to vehicular aerodynamics. Example applications that the author developed over the past few years while employed with the US Environmental Protection Agency are now being applied, in the author’s role as adjunct research faculty of the University of North Carolina, using the critically needed computing capacity of RENCI’s Topsail computing system. In particular, simulations of the air transport of pollutant emissions within the Madison Square Garden area of New York City will be demonstrated. The virtual environment for midtown Manhattan has been developed to support planning and response to potential accidental emissions or intended terror activities. The age of direct local-scale environmental simulation has arrived.
Liying Zhang, Hao Zhu, Alexander Tropsha; The University of North Carolina at Chapel Hill
We have developed robust QSAR models of Blood-Brain Barrier (BBB) permeability using k-Nearest Neighbors (kNN) and Support Vector Machines (SVM) approaches and molecular topological descriptors. The modeling set of 159 compounds was divided into external evaluation set (15 compounds) and multiple training and test sets (the remaining 144 compounds). The consensus QSAR model accuracies were q2=0.91 and R02=0.68 for self-validation and external evaluation sets, respectively. These models were applied to additional external evaluation sets consisting of 99 drugs (from the WOMBAT-PK dataset) and 267 organic compounds classified as permeable (BBB+) or non-permeable (BBB-), and the best prediction accuracies were 82.5% and 59.0%, respectively. Noticeable improvements in prediction accuracy were achieved after applying applicability domain threshold for the prediction of evaluation sets: the accuracy for the first external evaluation set increased to R02=0.75 and for both of the additional external sets to 100%. The resulting models can be used to guide the design of pharmaceutically relevant chemical libraries towards drug-like compounds with optimal BBB permeability.
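An applicability domain threshold of the kind mentioned above is often defined from distances within the training set. The sketch below uses one common formulation, threshold = mean + Z * standard deviation of nearest-neighbor distances among training compounds, with Z = 0.5 as a default; the abstract does not state which formulation or parameter values were used, so both are assumptions, as are the function names.

```python
import math
from statistics import mean, pstdev


def domain_threshold(train, z=0.5):
    """Distance cutoff: mean + z * std of nearest-neighbor distances in training."""
    nearest = []
    for i, compound in enumerate(train):
        nearest.append(min(math.dist(compound, other)
                           for j, other in enumerate(train) if j != i))
    return mean(nearest) + z * pstdev(nearest)


def in_domain(train, query, threshold):
    """A query is predicted only if it lies close enough to some training compound."""
    return min(math.dist(query, compound) for compound in train) <= threshold
```

Compounds that fall outside the domain are left unpredicted, which is why accuracy on the remaining compounds can rise while coverage drops.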
Hao Zhu, Alexander Tropsha, The University of North Carolina at Chapel Hill
Selecting suitable quantitative structure-activity relationships (QSAR) approaches for a specific toxicity endpoint is one of the critical issues for the development of robust predictive computational toxicity models. To this end, we have compiled an aqueous toxicity dataset containing 1,093 unique compounds tested in the same laboratory over several years against Tetrahymena pyriformis. A modeling set consisting of 644 compounds randomly selected from the original set was distributed to five chemoinformatics groups to use their own QSAR approaches and descriptors for model development. The remaining 449 compounds in the original set were used as an evaluation set to test the predictive power of individual models. In total, our virtual collaboratory generated 11 different validated QSAR toxicity models for the training set. The best models had the Leave One Out (LOO) cross-validation correlation coefficient R2(q2) = 0.93 for the training set and the correlation coefficient R2 for the external evaluation sets as high as 0.83. The results demonstrated that evaluating the models based only on the statistical parameters obtained for the modeling set may mislead the selection of externally predictive models. We have developed a consensus model based on the average of the prediction results of all 11 models. The consensus model resulted in the best prediction accuracy for the training and external evaluation sets, as high as 0.95 (q2) and 0.86 (R2), respectively. The utilization of the applicability domain could be included to balance the prediction accuracy with the chemistry space coverage based on the requirement of the users with respect to the error tolerance level.
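The consensus step described above, averaging the predictions of all individual models, can be sketched as follows. The example assumes every model predicts every compound and takes R2 to be the squared Pearson correlation between observed and predicted values; the abstract does not spell out either detail, and the function names are hypothetical.

```python
def consensus(predictions_by_model):
    """Average the per-compound predictions across all models."""
    return [sum(column) / len(column) for column in zip(*predictions_by_model)]


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)
```

Scoring `consensus(...)` against the evaluation set with `r_squared` alongside each individual model gives the kind of comparison the abstract reports.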
David Chiu, Gagan Agrawal; The Ohio State University
Scientific domains increasingly involve data that can be obtained from the deep Web, while having other datasets in low-level formats. At the same time, an increasing number of Web or grid services are being made available. This leads to an interesting question, “Can we query low-level and deep Web data by automatically composing services and creating workflows?” Our work is driven by a collaboration with geodetic sciences, funded by an NSF grant for Cyberinfrastructure for Environmental Observatories. Specifically, geospatial data is known to have: – Large Volumes: data may be collected in a continuous manner, – Low-level Format: data is normally stored in native low-level format, rather than in databases. – High Dimensionality: high dimensionality inherently alludes to nontrivial complexity for processing certain types of data. – Heterogeneous Data Sources: disparate data sources can collect and represent the same information with different accuracy and format, all of which offer various precision and accuracy but are ultimately used to describe the same information. – Temporal-Spatio Domain: since geospatial data is highly volatile, rigorous maintenance of descriptors such as location and date is imperative to providing accurate information. We propose a system that automatically constructs ad hoc workflows for answering high-level queries based on both service and data availability. A specific contribution of this work is the so-called “data-driven” capability in which we provide a framework to capture and utilize information redundancy that is present in heterogeneous data sources. We will use “machine-interpretable metadata” to be able to understand and parse low-level datasets and use them with the services.
Jerry Ebalunode, North Carolina Central University; Zheng Ouyang, University of Illinois at Chicago; Jie Liang, Weifan Zheng, North Carolina Central University
The structure-based drug design methods are typified by docking technologies that have been widely adopted by the pharmaceutical industry for virtual screening and library design. They are often the computational tools of choice for both lead generation and lead optimization. However, despite many reports of successful applications of off-the-shelf docking tools, serious issues remain unsolved in terms of the accuracies of docking poses and affinity scores. Recently, more intuitive and computationally more efficient structure-based methods have been reported that seek to find effective means to utilize experimental structure information without employing detailed docking calculations. These tools can (should) be coupled with efficient HTS technologies to improve the probability of success in the discovery process. For example, LigandScout has been successfully applied in several virtual and experimental HTS projects. We report the development of a new method that employs a rigorous computational geometry method and a deterministic geometric casting algorithm to derive the negative image of a binding site. Once the negative image of the binding site is generated, a variety of computer vision methods can be applied to compare and match small organic molecules with the shape of the negative image. We report the detailed computational protocol and its validation using known biologically active compounds extracted from the WOMBAT database. Models derived for selected targets are used to perform the virtual screening experiments to obtain the enrichment data for various methods. It is found that our new approach (Shape4 for shape pharmacophore) affords significantly better enrichment of hits than other methods studied in this work.
Nick Kaiser, Jim Heasley, Eugene Magnier, Alex Szalay; University of Hawaii
The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) will use giga-pixel CCD cameras on multi-aperture telescopes to survey the sky in the visible and near infra-red bands. A single telescope system (PS1) has been deployed on Maui and a four-telescope system (PS4) will be sited on Mauna Kea on the Big Island of Hawaii. These systems will survey the sky repeatedly and will generate petabytes of image data and catalogs of billions of stars and galaxies. The images will be combined to generate a very sensitive multi-color image of the static sky, and differences between images will provide a massive database for “time domain astronomy”; the study of moving, transient or variable objects. In addition to the challenge of building the telescopes and detectors, the project is faced with the formidable challenges of processing the image data in near real time and making the catalog data accessible via relational databases in order to facilitate the eScience that this project promises. This talk will describe the scale and content of the data products and will outline the designs of the image processing and database and archiving systems.
Eric Jul, Brian Vinter; University of Copenhagen
The University of Copenhagen has started a graduate program in eScience and has established an eScience center to further develop and enhance research in eScience. The university has recognized the importance of eScience and has therefore established a graduate degree in eScience. While it is possible to take many eScience related courses in most degree programs at the University, the University feels that by establishing a separate eScience degree, a much stronger emphasis can be put on eScience. The new program has achieved solid backing from all departments of the Faculty of Natural Sciences. At the workshop I, as Director of eScience Studies, would welcome the chance to present the approach that the University of Copenhagen has taken to promote the new eScience graduate degree program – and the motivation for establishing an eScience center that draws faculty members from many different areas of the Natural Sciences. As far as we know, our program is one of the very first to offer students a cross-disciplinary program in which they can, at the same time, interact with researchers at a dedicated eScience research center. At the workshop, the motivation and rationale for the program will be presented and the specific core courses will be described.
Stan Thomas, Freddie Salsbury, Jr., Stacy Knutson, Leslie Poole, Jacquelyn Fetrow; Wake Forest University
Protein post-translational modifications play key biological roles by modifying the structure and function of proteins. A common example is that of protein phosphorylation in signal transduction, metabolism and cellular differentiation. Analysis of phosphorylation sites has led to a better understanding of kinase substrate specificity, methods for site prediction and a combined experimental/computational approach resulting in a better understanding of the yeast phosphoproteome. Cysteine sulfenic acid (Cys-SOH) is a catalytic intermediate at enzyme active sites, a sensor for cellular stress, a regulator of transcription factors and an intermediate in redox signaling. The cysteine post-translational modification to sulfenic acid is not random; features at or near the cysteine control its reactivity. To identify the features responsible for the propensity of certain cysteines to be modified to sulfenic acid, a list of 47 proteins (containing 49 Cys-SOH sites) was compiled. Modifiable cysteines are found in proteins from many structural and functional classes. The site itself is not located in any one type of secondary structure. To further identify residues affecting cysteine reactivity, sites were analyzed using both functional profiling and electrostatic analysis. The combined approach reveals mechanistic determinants not obvious from sequence comparison alone. The long-term goals of this work are: 1) to combine structural and electrostatic feature analysis to predict Cys-SOH modification sites; 2) to include other modifications and distinguish between types of reactive cysteines; 3) to create a publicly accessible database of known and potential modification sites. The database would link sequence, structure, chemical and biological data to allow researchers to assess the effects of mutations or the possibility of oxidative cysteine modifications in proteins.
Ann Chervenak, University of Southern California
Data management for eScience applications is a challenging problem. Data-intensive scientific applications may produce and consume terabytes of data, which must be staged into and out of the high-performance computing resources on which the application’s computational analyses run. These analyses are often represented as scientific workflows that consist of millions of interdependent tasks. Workflow management systems are increasingly used to manage the dependencies among these computational tasks and the movement of data sets that are produced or consumed during task execution. The placement of data sets on storage resources can have a significant impact on the performance of eScience workflows. For example, if data sets are placed near high-performance computing resources, they can be staged efficiently into computations that execute on those resources; moving data sets off computational resources quickly when task execution is complete can also improve performance. In this talk, we consider the use of policy-driven Data Placement Services to improve the performance of eScience workflows. We are studying a variety of placement policies that seek to place data sets in ways that are advantageous for scientific workflow execution. Our research focuses on the relationship between data placement services and workflow management systems, with the goal of making data placement largely asynchronous with respect to workflow execution, thus reducing the need for on-demand data staging by the workflow system. The workflow system can also provide hints to the data placement service system about the order in which data are accessed. Using two existing services, the Data Replication Service for staging data and the Pegasus workflow management system, we demonstrate that intelligent data placement has the potential to significantly improve the performance of eScience workflows.
Lincoln Greenhill, Harvard-Smithsonian Center for Astrophysics; Daniel Mitchell, Steven Ord, Randall Wayth; Smithsonian Astrophysical Observatory
The universe began with the expansion and cooling of the Big Bang, with particles eventually combining to form a dark sea of atomic hydrogen. Over time, gravity drew material together, giving rise to the earliest stars, black holes, and galaxies. Intense ultraviolet radiation, over time, heated and then destroyed the neutral hydrogen. Then the “dark sea” parted and the era of reionization, which lasted a billion years, brought about the most important structures in the universe we know. Yet we have only vague notions of how the universe evolved during this time. The best way to study reionization is to map the evolving distribution of hydrogen. The Mileura Wide-field Array (MWA) will do this for the first time; it is a new-concept, digital, radio “camera” in which the traditional telescope optics of lenses and reflectors are effectively replaced by software and high performance computers. The MWA computer pipeline will absorb in real time 128 gigabits of data per second (24×7), execute calibration and Fourier transform image construction on the fly, and accumulate reduced data to enable output at a manageable few hundred TB per year, a 1000x reduction. This is one of the larger computing challenges in radio astronomy, and would have been impractical to attempt without recent computing advances. I will describe known MWA computing challenges, with emphasis on throughput and I/O, pipeline parallelization, possible application of GPUs, use of instrument simulations in algorithm and software development, scaling to future instruments, and collaboration thus far with the IIC.
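The data volumes quoted above can be checked with a short calculation. The sketch assumes decimal units (1 gigabit = 10^9 bits, 1 TB = 10^12 bytes), continuous operation throughout a year of 365.25 days, and a flat 1000x reduction; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # continuous (24x7) operation


def yearly_volume_tb(gigabits_per_second, reduction_factor=1.0):
    """Yearly data volume in decimal terabytes after the given reduction."""
    bytes_per_second = gigabits_per_second * 1e9 / 8
    bytes_per_year = bytes_per_second * SECONDS_PER_YEAR
    return bytes_per_year / reduction_factor / 1e12


raw_tb = yearly_volume_tb(128)            # unreduced input stream
reduced_tb = yearly_volume_tb(128, 1000)  # after the 1000x reduction
```

Under these assumptions the raw stream amounts to roughly 505,000 TB (about 505 PB) per year, and the reduced output to roughly 505 TB per year, which agrees with the stated figure of a few hundred TB.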
Yong Liu, James Myers, Barbara Minsker, Joe Futrelle, Steve Downey; National Center for Supercomputing Applications; Il-hwan Kim, University of Michigan; Esa Rantanen, National Center for Supercomputing Applications
Providing community-scale infrastructure while enabling innovation by individual researchers is a central challenge for eScience efforts. Since 2004, the CyberCollaboratory, which is built on top of the open source Liferay portal framework, has been part of the efforts of the National Center for Supercomputing Applications to build national cyber infrastructure to support collaborative research in environmental engineering and sciences. The CyberCollaboratory was used by the project office of the Collaborative Large-scale Engineering Analysis Network for Environmental Research (CLEANER), which is now the WATer and Environmental Research Systems (WATERS) network, and by several CLEANER/WATERS test bed projects. Of more than 400 registered users, over 100 were actively involved in the CyberCollaboratory. However, users have also reported usability issues. For example, users working in multiple groups found it difficult to get an overview of all of their activities and found differences in group layouts confusing. Users also found the standard account creation and group management processes cumbersome and wanted a better sense of presence and social networks within the portal. Keeping the document repository up to date as editing was performed on local files and as files were transmitted via email was another concern. As a result of this feedback and discussions with representatives from the CUAHSI (Consortium of Universities for the Advancement of Hydrologic Science) community, new design and development efforts were initiated in early 2007. This paper reviews the usability feedback and potential design changes and provides a summary of the changes made to the CyberCollaboratory.
Yong Liu, National Center for Supercomputing Applications; David Fazio, US Geological Survey; Tarek Abdelzaher, University of Illinois at Urbana-Champaign; Barbara Minsker, National Center for Supercomputing Applications
The value of disseminating real-time hydrologic data, including river stage, stream flow, and precipitation, for operational storm water management is particularly high for communities where flash flooding is common and costly. Ideally, such data would be presented within a watershed-scale geospatial context to portray a holistic view of the watershed. Recent efforts to provide unified access to hydrological data have concentrated on creating new SOAP-based web services and common data formats (e.g., WaterML and the Observation Data Model) for data access (e.g., HIS and HydroSeek). The OGC Sensor Web Enablement (SWE) initiative proposes a revolutionary concept; however, these efforts do not facilitate dynamic data integration/fusion among heterogeneous sources, data filtering, or support for workflows or domain-specific applications. We propose a lightweight integration framework that extends SWE with an open source Enterprise Service Bus (e.g., Mule) as a backbone component to dynamically transform, transport, and integrate both heterogeneous sensor data sources and simulation model outputs. We will report our progress on building such a framework, in which sensor data from multiple agencies and hydro-model outputs with map layers will be integrated and disseminated in a geospatial browser (e.g., Virtual Earth). Our project is the result of collaboration between the National Center for Supercomputing Applications, the US Geological Survey, the Illinois Water Science Center, and the Computer Science Department at the University of Illinois at Urbana-Champaign and is funded by the Adaptive Environmental Infrastructure Sensing and Information Systems initiative.
Sudeshna Das, Alister Lewis-Bowen, Lousi Weitzman, Tim Clark; Harvard University
We are developing a reusable framework for on-line communities of biomedical researchers. Although there is a growing number of biological knowledge bases, the vast majority of biological information and the various resources used by the community (such as cell lines, antibodies, etc.) reside in laboratory notebooks and heterogeneous databases. The context of the data is rarely captured, and information exchange among researchers is usually accomplished via emailed documents or conversations. Moreover, community websites publishing on-line materials rarely, if ever, link them to the biological information or resources, so key knowledge is lost. We are developing the framework as a Drupal (http://www.drupal.org/) distribution integrated with an RDF triple store and some associated Java components. Drupal is a popular content management system and is widely used by various communities to develop their websites. The framework will allow easy publishing of online materials. In addition, the framework will have semantic underpinnings to capture the relationships between research articles, biological entities, profiles of experts, etc. We will use an extension of the SWAN ontology (Clark and Kinoshita, 2007) as our knowledge schema. Our goal is to organize and repurpose on-line material in communities by defining and capturing semantic relationships to existing knowledge repositories. Such a knowledgebase will enable richer and more powerful interactions among the many subdisciplines within the scientific community.
David De Roure, The University of Southampton; Carole Goble, University of Manchester
Most computer users are familiar with the practice of sharing individual files, such as text, photos, videos and music, using social tools – Wikis, blogs and social networking sites like Flickr, YouTube and Facebook. Scientists are beginning to share information this way too. However, scientists commonly work with collections of digital items which include experimental plans, documentation, data, results, logs of runs, ‘housekeeping’ information, etc. myExperiment (http://myexperiment.org) is a social space for sharing scientific workflows and associated information – a way for scientists to share reusable pieces of scientific practice. In contrast to photo-sharing on Flickr or videos on YouTube, the basic unit of sharing in myExperiment is not a single file but rather a package of components that make up an experiment – what we call an Encapsulated myExperiment Object (EMO), and others have called Reproducible Research Objects. Notionally an EMO is a folder containing the various assets associated with an experiment. In the scientific context there are stringent requirements with respect to versioning, ownership, intellectual property and the maintenance of provenance information. We have looked at emerging practice in sharing “pieces of science” in the scientific and scholarly lifecycle, from social sites to digital repositories. myExperiment provides simple and extensible support to better understand requirements as new collaborative practice emerges. In this presentation, we will describe the characteristics of EMOs and present our initial design solution which supports the requirements of encapsulation and preserves our principles of simplicity and interoperability.
Andrew Grimshaw, University of Virginia
Providing such transparency and thus minimizing the effort required by users to integrate and use their code and data in the grid is both practical and desirable. The lack of easy data integration and access within a grid is a major barrier to a large number of potential grid users, either because they physically cannot change their code (the code is commercial or they do not have the source code) or because they do not have the time to devote to performing the necessary integration. At a macro level it is desirable to remove such burdens from end users because time they devote to grid integration activities is time taken away from working in their area of expertise, their science and research. Lowering the integration effort will also encourage more users to take advantage of the benefits that data and compute grid systems offer. This talk will focus on the data grid capabilities of the Genesis II grid system. Genesis II is an open implementation of grid standards emerging from the Open Grid Forum. Specifically, Genesis II implements WS-Naming, the HPC-Profile, OGSA-BES, OGSA-ByteIO, RNS, and the draft OGSA Express Authentication Profile suite.
William Horsthemke, Daniela Raicu, Jacob Furst; DePaul University
Medical imaging informatics addresses initiatives to improve the performance of clinical radiology. These efforts range from managing images for reading by radiologists to computer-aided diagnosis. Many projects require significant image processing to extract image features for use in diagnosis or as reference queries for retrieving other images with similar characteristics. The effectiveness of such projects often depends on having large image data sets. Given the computational complexity of many image processing techniques and the number and size of medical images, medical imaging informatics tools are limited by hardware resources. Many tasks, such as image processing feature extraction, dataset storage, content-based image retrieval (CBIR), and computer-aided diagnosis (CAD), can be parallelized or adapted to the distributed processing available with grid-based technology. We propose using grid technologies for three specific medical imaging tasks: 1) automatic segmentation of liver tissue in computed tomography (CT) of the abdomen, 2) CBIR for retrieving lung nodule cases in CT, and 3) classification of tumors in mammography images. Each task has a significant requirement for image processing to extract low-level features; the independence of these features, together with the presentation of data as a grid of pixels, offers excellent opportunities to use grid technology. The high-level algorithms built on extracted image features (segmentation, similarity measures, and machine learning, respectively) can be run in parallel in a number of different ways: by image slice, by number of retrieved images, and by independent machine learning step. A focus on grid-enabled techniques will permit the inclusion of computationally complex algorithms and larger datasets than would otherwise be acceptable given the near-real-time performance requirement of clinically usable medical imaging applications.
Douglas Lowery-North, Eugene Agichtein, James Buehler, Walter Orenstein, Lance Waller, Vicki Hertzberg; Emory University
Disease surveillance remains a challenging, though essential, public health function. Laws mandate that physicians and laboratories report cases or clusters of specific notifiable diseases to public health authorities, and failure to report these incidents in a timely or complete manner may lead to belated recognition of public health threats and lost opportunities for investigation and intervention. We identified the potential for a real-time, automated system to improve public health disease surveillance, using three large healthcare systems in Metro Atlanta, through the integration and knowledge management of the prediagnostic manifestations of disease (syndromic surveillance) from prehospital, outpatient, and inpatient data sources; the incorporation of laboratory and imaging diagnoses; and a bidirectional interface between the state public health agency and these three healthcare systems. Developing a real-time surveillance system presents many scientific and technological challenges, including: identification of subpopulations of special interest for disease surveillance; knowledge management technologies allowing forecasting based on diverse information gathered from different sources; utilization of free text records, such as dictations, to improve responsiveness, sensitivity and positive predictive value of the surveillance system; development of the analytical tools necessary to detect events of interest; application of performance improvement tactics to improve the value of data collection, analysis, and reporting, and to reduce waste associated with false positives; automation of initial epidemiologic event investigations; and feedback in response to epidemic threats.
Luigi Marini, Rob Kooper, Peter Bajcsy, James Myers; National Center for Supercomputing Applications
As current scientific workflow systems reach technical maturity, new challenges arise in the areas of usability and user access to advanced functionalities. The mismatch between the expertise of domain scientists and the technical knowledge required to use scientific workflow systems via visual programming is becoming more prominent. While domain scientists greatly benefit from using scientific workflow systems, the adoption barriers are non-trivial. In our development of the Cyberintegrator workflow system, we have investigated an exploratory, macro-recording-style interface as an alternative to visual programming. A macro-recording interface provides a more natural, step-by-step model that makes workflow creation easier. The scientist can focus on available data sets and relevant analytical tools, while the system records the overall workflow. With traditional workflow systems forcing the scientist to focus too much on the lower level engineering details, keeping track of the higher level scientific process can become a challenge. We have explored ways to make the use of support tools required for lower level data manipulation (loading, translation and visualization) more transparent to the scientist. The resulting interfaces have a stronger focus on science and support both scientific and engineering views of workflows. Since scientific research is often done in a community setting, simple ways to capture and share personal annotations in workflow editors would be extremely useful. We have looked into the addition of a community annotation system, which allows easy sharing of annotations about data, tools and workflows. We discuss issues encountered and design choices made when trying to lower adoption barriers of scientific workflow systems. We include examples from our experience designing and implementing the Cyberintegrator.
Eva Lee, Qifeng Lin, Kyungduck Cha, Calton Pu, Georgia Institute of Technology; Lynn Cummingham, Kenneth Brigham, The Emory / Georgia Tech Predictive Health Institute
The Emory / Georgia Tech Predictive Health Institute (PHI) is a new model of healthcare that focuses on maintaining health, rather than treating disease. Through meta-analysis of multiple heterogeneous attributes (e.g. biological, genetic, clinical, behavioral, and environmental) PHI researchers seek to identify and measure risks and mechanisms of disease, and ultimately to promote health maintenance. When there is a potential health problem, predictive health aims to intervene at the very earliest indication, based on an individual’s personal profile, and restore normal function. A fundamental component of the PHI scientific mission is a scalable and extensible informatics framework (SEIF). In this talk, we will present our design and development of SEIF. SEIF is built using a 3-tier architecture that includes 3 major engines: the database server (DBS); the model interpreter (MI); and the information protection, propagation and access module (PWEB). DBS incorporates distributed clinical/translational data, participant surveys, and complex images using various databases including Oracle, MySQL, sequential files, and novel in-house models (e.g. for complex metabolomics data). MI employs semantics, relational and data mappings, and performs code generation to accommodate the evolutionary nature and heterogeneity of data, new data types and national standards. The dynamic capability and flexibility of automatic code generation allows for re-organizing, re-loading, and re-querying of meta/heterogeneous data, and is of paramount importance. PWEB offers secure multi-tier privileged user login. PHI participants, health partners, and researchers have different levels of data access requirements, and each is allowed to perform the necessary functionality through a web portal. Various features and scalability of SEIF for broad usage will be discussed.
Furrukh Khan, The Ohio State University
Applications that enable scientists to visually design the control flow (flowchart) as well as dataflow and conditional logic flow for interruptible programs (workflows) in their own Domain Specific Languages have obvious applications in eScience. The Windows Workflow (WF) runtime provides us with a lightweight and powerful engine for running interruptible programs that can be automatically persisted and tracked by WF; however, a designer that can be used to visually construct control flow as well as dataflow and conditional logic flow is lacking at present. The WF designer allows only control flow to be visually designed; it cannot be used to wire together dataflow or conditional expressions. The stock designer also cannot be used to design WF programs in browser-based applications. Fortunately, one can exploit various extensibility points to craft domain-specific custom designers and loaders that interface directly with the WF runtime, thus bypassing the stock loader and designer. We first introduce the audience to workflows, then discuss the powerful extensibility features of the WF runtime and demonstrate how we have leveraged these features to implement our own custom designer and loader. Scientists can use these to visually design and wire together not only flow of control (flowchart) but also complex dataflow and conditional logic. The custom designer can be implemented as a desktop or a web application that can further exploit Ajax technology for responsive browser-based eScience applications. Finally, we show how our designer for Windows WF is being used by scientists in the domain of human cancer research.
Lawrence Band, Sdhyok Shin, Taehee Hwang; University of North Carolina at Chapel Hill; Mark Reed, Matts Rynge, Lisa Stillwell; Renaissance Computing Institute; Jonathan Goodall, University Of South Carolina; Kenneth Galluppi, Renaissance Computing Institute
We describe a project that is developing and applying integrated ecohydrologic and geomorphic process models with mesoscale climate simulation to predict spatially distributed soil moisture, saturation, flash flood and landslide potential in southern Appalachian catchments. Landslides and flash floods are both major landscape forming processes and significant hazards in this region. Landslide risk is dependent on local topographic and soil conditions, long term changes in canopy cover and root structure, as well as transient moisture and saturation conditions from individual and recent storm events. Recent increases in tropical storm intensity, development and road construction in these mountainous areas may be increasing these hazards, as evidenced by major property damage and fatalities in the set of tropical storms experienced in the last few years. The modeling approach links GISci based ecohydrological and geomorphic process models that incorporate catchment patterns in soils, canopy conditions including root structure, and hill slope hydrologic routing that results in the development of space/time patterns of soil moisture, runoff and critical pore pressures that induce debris avalanches. Long term simulations are first used to develop spatially distributed ecosystem properties including canopy cover, LAI and root biomass. The potential for a forecast system is explored by driving the model with Land Data Assimilation System (LDAS) meteorological fields in near real time, then substituting WRF high resolution forecasts when major events are approaching. We use study catchments in the Coweeta.
Douglas Lowery-North, Eldad Haber; Emory University; Susanne Hardy, Philadelphia College of Osteopathic Medicine; Christopher Vaughns, Georgia Institute of Technology; Vicki Hertzberg, Emory University
The threat of highly pathogenic avian influenza and a resulting pandemic has added a renewed sense of urgency to the scientific community’s search for ways to recognize, prevent, and control the spread of disease. The goal of our research is to innovate epidemiological tools using modeling and simulation that can predict propagation paths and outcomes of infectious disease from the original exposure to an ill patient in the emergency department (ED), which is a major opportunity for the spread of infection. Such models, based in social network theory, will evaluate the probability of spreading disease through staff-patient, patient-patient and staff-staff interactions. Coupled with clinical data, accurate measurements of contact between and within patients and hospital staff provide reliable estimates of the context, duration, and distance of these contacts. One way to measure these interactions is through the use of radiofrequency identification (RFID) technology. Here we describe a study to evaluate the use of RFID technology to obtain location data for patients and staff in a large, urban ED, with computer algorithms processing the location data and developing network models. We present a number of scientific and technological challenges related to defining “contact”, the level of precision in terms of location and frequency required to develop a robust model, how to use trilateration/multilateration to obtain location information, which facility design factors impact the data collection, and how we assess the impact of personal protective equipment use.
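To illustrate the trilateration step mentioned in this abstract, the following minimal sketch recovers a 2D position from three fixed readers and three measured distances. It is a textbook linearization, not the study's own algorithm; the reader coordinates, the noise-free distances, and the function name `trilaterate` are all assumptions made for the example.

```python
import math


def trilaterate(anchors, distances):
    """Estimate (x, y) from three anchor positions and distances to each.

    Subtracting the first circle equation from the other two removes the
    quadratic terms and leaves a 2x2 linear system in x and y.
    """
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = distances

    a11, a12 = 2 * (x1 - x2), 2 * (y1 - y2)
    a21, a22 = 2 * (x1 - x3), 2 * (y1 - y3)
    b1 = d2**2 - d1**2 - x2**2 - y2**2 + x1**2 + y1**2
    b2 = d3**2 - d1**2 - x3**2 - y3**2 + x1**2 + y1**2

    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not unique")

    # Cramer's rule for the 2x2 system.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y


# Hypothetical reader layout (metres) and a tag actually located at (3, 4).
readers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_position = (3.0, 4.0)
measured = [math.dist(true_position, r) for r in readers]
estimate = trilaterate(readers, measured)
print(estimate)
```

With real RFID measurements the distances are noisy and more than three readers are typically available, in which case a least-squares fit over all readers (multilateration) would replace the exact 2x2 solve.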
Mirek Riedewald, Rich Caruana, Daniel Fink, Wesley Hochachka, Steven Kelling, Art Munson, Ben Shaby, Daria Sorokina; Cornell University
The Avian Knowledge Network (AKN) represents a collaboration between the Cornell Lab of Ornithology and researchers from Cornell’s departments of computer science and statistics. Our team is accumulating one of the largest and most comprehensive biodiversity data sets in existence. Data is contributed by many partner organizations, including the US Geological Survey, Point Reyes Bird Observatory, and Bird Studies Canada. Additionally, the AKN is harvesting a variety of environmental attributes, including habitat and human population demography, to create an enormous data resource, currently with over 35 million bird observations, each linked to more than 1000 environmental attributes. Ultimately, our goal is to use this resource to synthesize biologically useful information for conservation and science. We summarize the challenges we faced, how we have addressed them so far, and what still needs to be done. One major challenge was to make the data available to a broad audience, and our solution involved defining the Bird Monitoring Data Exchange and setting up a federated architecture based on Grid technologies, accessible through simple Web interfaces. Another challenge is to use AKN data to study biodiversity. We are approaching it at two levels. The first is to use powerful non-parametric supervised learning techniques to build models that make accurate predictions of organism distribution and abundance as a function of environmental effects. The other is to identify the environmental features determining distribution and abundance, and so discover their effects on bird populations, by analyzing the learned models with novel data mining techniques that can handle massive high-dimensional data.
Susanne Hardy, Philadelphia College of Osteopathic Medicine; Vicki Hertzberg, Emory University; Marilyn Margolis, Rehman Meghani, Jamie York, Emory Healthcare; Douglas Lowery-North, Emory University
Emergency Departments (EDs) provide a vital safety net function for healthcare, public health, and disaster preparedness in the US. In recent years, the ability of EDs to accomplish these missions has been threatened by severe crowding. This functional decline has come at the same time that the demand for the unique services of EDs has increased. Many EDs utilize electronic patient tracking systems that offer little more than an electronic version of the traditional patient location chalkboard. While these tools provide an excellent mechanism for the management of individual patients, they fail to function as a tool for department-level management, and EDs continue to rely upon human resources to intuit system flow from this representation. Even systems that incorporate dashboard indicators have done little more than create tachometers for specific processes within the ED. We have developed an automated, computerized function that can predict, prevent, diagnose, and manage ED crowding. This program interpolates from the data displayed on the patient-tracking boards in two urban EDs, and makes decisions about the deployment of resources to alleviate bottlenecks and relieve crowding. Still, many challenges remain, related to the construction of the human-computer interface, the methods for applying knowledge management technologies to the engine, the mechanisms for incorporating artificial intelligence capabilities into this functionality, the means to develop more robust predictive abilities, and the ability to apply and interface this technology to other flow-dependent areas within healthcare systems.
David Wallom, Angus Kirkland; University of Oxford; Mark Ellisman; University of California, San Diego
The Optiputer Microscopy demonstrator builds on the capabilities of the materials group at the University of Oxford (Angus Kirkland) and the Biosciences group at the University of California, San Diego (Mark Ellisman), each of which needs specific microscopy capabilities to advance its research. This project will demonstrate how appropriate infrastructure can enable remote science experiments and provide new science capabilities by building on existing knowledge and facilities. The instruments at San Diego and Oxford represent state-of-the-art instrumentation in intermediate voltage and aberration-corrected geometries, respectively. Both instruments have nearly identical local hardware configurations and similar external interfaces. In combination, they therefore represent a unique opportunity to link biological and materials science technical expertise across two bespoke instruments. The infrastructure will consist of the microscopes at San Diego and Oxford, together with lambda networks (UKLight and StarLight), appropriate data storage and shared data management schemes, and integrated local computational resources. This will include a medium-sized CCS cluster at Oxford as well as high-definition visualization using a tile-wall system at each participating site. The data system has been specifically designed to allow easy sharing of stored images as well as real-time processing of collected images.
Maria Zemankova, National Science Foundation
Microsoft has been supporting eScience Workshops and relevant research for several years. The National Science Foundation’s (NSF) FY 2008 Budget Request to Congress includes $52 million to support the first year of an initiative on Cyber-enabled Discovery and Innovation (CDI) with the objective to “Broaden the Nation’s capability for innovation by developing a new generation of computationally based discovery concepts and tools to deal with complex, data-rich, and interacting systems” [www.nsf.gov/about/budget/fy2008]. The European Commission supports a study on eScience Digital Repositories (eSciDR) to drive the development and use of digital repositories in the EU in all areas of science [http://www.e-scidr.eu/]. The 2007 IEEE International Conference on eScience and Grid Computing is meeting in Bangalore, India [www.escience2007.org/index.asp]. Google Earth has “put the world’s geographic information at your fingertips” [http://earth.google.com], the Sloan Digital Sky Survey is “Mapping the Universe” [www.sdss.org/] and bringing the possibility of making a discovery to researchers, amateur star-gazers, and kids around the world, CERN [http://public.web.cern.ch/] is studying the particles the universe is made of, proteomics researchers around the world [www.wwpdb.org] are trying to decipher what we are made of, etc. Sensors are busy collecting more global change data, data mining algorithms are churning out new discoveries, and Scientometrics [www.springerlink.com/content/101080] is trying to help us understand all the existing knowledge that we are expanding at a staggering rate. Looks like all is well, but is it? We will discuss some challenges of eScience and the steps NSF is taking to address them, and we would also like to elicit suggestions from the global eScience community.
Lloyd Williams, Thomas Horton, Robert St. Amant; North Carolina State University
For many years, tool-using behavior was considered a benchmark by which “intelligent” organisms could be identified. While this special status of tool use has lessened over the years, examining how animals use tools remains a standard practice in the study of biological cognition. The strong link between “intelligent” behavior and tool-use has led us to examine modeling these types of behaviors in robots. We consider robots not simply as assistants to intelligent humans, but also potentially as models of biological organisms. The development of such robotic models provides insight into the mechanisms that support tool use in humans and other animals, and serves as a test bed for exploring theories of cognition from psychology, neurobiology, and related fields. There are practical implications to any insight we gain into the theoretical underpinnings of tool-use. While robot actors are seen in a wide range of environs, from manufacturing floor to laboratory, for the most part they do little more than simply follow instruction sets and are severely limited in their ability to respond to more dynamic environments. The mechanisms we are employing to model tool use, while still preliminary, are relevant to extending robots’ capabilities beyond highly constrained tasks. Our work aims to create a robot architecture capable of interpreting visual information in the context of tool-using behaviors, recording its experiences, and building semantic networks that represent conceptual relationships between actions and the properties of objects in the environment. We have developed a proof-of-concept architecture on the Sony Aibo, a non-anthropomorphic mobile robot with grasping ability that allows it to solve the familiar “monkey and bananas” problem by using a tool to touch an out-of-reach object.
Karl Aberer, Youngluan Zhou; Ecole Polytechnique Federale de Lausanne
The emergence of novel sensing devices and sensor network technologies provides a whole new opportunity for global environmental studies and environment-related decision making. The Swiss Experiment is a newly initiated multidisciplinary project aimed at building a large scale platform, to support field investigations of environmental processes, which is based on new sensor and data management technology. This talk focuses on some data management issues that occur in our practical study. In environmental monitoring, the raw measurement data are typically transformed into an interpolated grid before performing analyses, such as visualization, simulation etc. The typical interpolation models used by the scientists include deterministic ones, such as triangulation, and statistical ones, such as kriging. The interpolated grid can be considered as a view over the raw data tables, however with much higher data density. Directly storing the resulting views would incur a data explosion problem, while computing views on the fly from scratch would be too unresponsive. To enhance the storage, maintenance and querying efficiency, we identify the static parts of the intermediate computational results of the interpolation models and choose to store and maintain them in a database instead of the dynamically changing final interpolated values. The final values of the grid points can be computed on demand in an efficient way. This technique can also be applied in efficiently computing the interpolation view over real-time streaming data. Building a data warehouse over all the historical data can tremendously help the scientists to perform their analysis. Here an interpolated view should be used as the fact table to feed the cube. Again, the static intermediate results are identified and stored to optimize the storage usage, the maintenance cost as well as the query performance.
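The idea of storing the static intermediate results of an interpolation model, rather than the interpolated values themselves, can be sketched as follows. The example swaps in inverse distance weighting as a simple deterministic stand-in for the triangulation and kriging models named in the abstract: the weights depend only on sensor and grid geometry, so they are computed once and reused for every new set of readings. The function names, the power parameter, and the sample coordinates are assumptions made for illustration, and sensor positions are assumed to be distinct.

```python
import math


def idw_weights(sensors, grid_points, power=2.0):
    """Precompute normalized inverse-distance weights (the static part).

    Returns one row of weights per grid point, one weight per sensor.
    """
    table = []
    for gx, gy in grid_points:
        raw = []
        for sx, sy in sensors:
            dist = math.hypot(gx - sx, gy - sy)
            raw.append(None if dist == 0 else 1.0 / dist**power)
        if None in raw:
            # Grid point coincides with a sensor: use that reading exactly.
            row = [1.0 if w is None else 0.0 for w in raw]
        else:
            total = sum(raw)
            row = [w / total for w in raw]
        table.append(row)
    return table


def interpolate(weights, readings):
    """Compute grid values on demand from stored weights and fresh readings."""
    return [sum(w * r for w, r in zip(row, readings)) for row in weights]


# Hypothetical layout: two sensors and two grid points.
sensors = [(0.0, 0.0), (2.0, 0.0)]
grid = [(0.0, 0.0), (1.0, 0.0)]

stored = idw_weights(sensors, grid)            # stored once in the database
first = interpolate(stored, [10.0, 20.0])      # readings at time t
second = interpolate(stored, [12.0, 18.0])     # readings at time t + 1
print(first, second)
```

Only the weight table and the raw readings need to be kept; the dense grid is never materialized, which addresses the data explosion concern while avoiding recomputation of the geometry-dependent part on every query.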
David Minor, Robert McDonald; San Diego Supercomputer Center
In recent years there has been growing discussion about the infrastructure needed to support distributed or federated preservation environments. This data infrastructure, known variously as a component of e-science or cyber infrastructure, would provide the base on which to organize, preserve, and make accessible over time the intellectual capital that is being created via research in science and engineering. The San Diego Supercomputer Center, along with the U.C. San Diego Libraries, the National Center for Atmospheric Research, and the University of Maryland Institute for Advanced Computer Studies, has formed a collaborative partnership called Chronopolis. The underlying goal of the Chronopolis partnership is the creation of a digital preservation environment to curate this intellectual capital at a national scale and provide science with a long-term preservation infrastructure. As a first step toward developing a working Chronopolis prototype, the partner sites have begun development of replicated collections between themselves and several other institutions. In addition to working on the mechanics of large-scale data replication, each site is developing its own policies for collection management with the goal of creating policies that can interact independently as well as cross-institutionally. This iterative process is a first step towards a model cross-institutional strategy that will eventually extend to anyone working in a Chronopolis preservation environment. Our poster will highlight current preservation policies and procedures across the Chronopolis Preservation Datagrid. It will clearly display how the institutions are interacting with each other and the relationships with our initial partner sites. It will also show how the infrastructure is being used in scientific collections such as data from the National Virtual Observatory.
Tamas Budavari, Alex Szalay, Gyorgy Fekete, The Johns Hopkins University; Gerard Lemson, Max Planck Institute for Astrophysics; Istvan Csabai, Laszlo Dobos, Eotvos Lorand University; Jose Blakeley, Microsoft
We present a general indexing scheme for multi-dimensional data sets fine-tuned for relational databases. Our approach is to utilize appropriate hierarchical space-filling curves, and organize the data sets accordingly. Spatial queries of complicated shapes defined by exact mathematical equations are first approximated by unions of cells in the particular hierarchical pixelization scheme that in turn translate into efficient range queries in SQL. The expensive evaluation of the actual mathematical constraints is only performed on a tiny subset of the data at the boundary of the query shapes. The technique has proven to work beautifully in practice for various topological manifolds such as the 3D Euclidean space and the surface of the sphere, and is easily generalized for other problems. Our C# implementations running in the CLR of SQL Server 2005 are currently enabling scientists to routinely search terabytes of astronomy data from the state-of-the-art multicolor observations and N-body simulations, namely the Sloan Digital Sky Survey and the Millennium Run.
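The way a cover of hierarchical cells turns into range queries can be sketched as follows. This is an illustration only, not the authors' C# implementation: it assumes a 2D quadtree ordered by a Z-order (Morton) curve and a disk-shaped query, and the function names `morton`, `cell_range`, and `cover_disk` are hypothetical. Every cell at any level maps to one contiguous range of deepest-level codes, so the cover becomes a list of `BETWEEN`-style ranges; only the ranges from boundary cells need the exact geometric test.

```python
def morton(ix, iy, level):
    """Interleave the low `level` bits of ix and iy into a Z-order index."""
    code = 0
    for bit in range(level):
        code |= ((ix >> bit) & 1) << (2 * bit)
        code |= ((iy >> bit) & 1) << (2 * bit + 1)
    return code

def cell_range(ix, iy, level, depth):
    """Inclusive range of depth-level codes covered by a cell at `level`."""
    shift = 2 * (depth - level)
    low = morton(ix, iy, level) << shift
    return low, low + (1 << shift) - 1

def cover_disk(cx, cy, r, depth, max_level):
    """Approximate a disk by quadtree cells (requires max_level <= depth).

    Returns (full, partial): ranges lying entirely inside the disk, and
    ranges from boundary cells that still need the exact constraint.
    """
    full, partial = [], []

    def visit(ix, iy, level):
        size = 1 << (depth - level)
        x0, y0 = ix * size, iy * size
        x1, y1 = x0 + size, y0 + size
        # Distance from the disk centre to the nearest point of the cell.
        dx = max(x0 - cx, 0, cx - x1)
        dy = max(y0 - cy, 0, cy - y1)
        if dx * dx + dy * dy > r * r:
            return  # cell is completely outside the query shape
        # Distance from the disk centre to the farthest corner of the cell.
        fx = max(abs(cx - x0), abs(cx - x1))
        fy = max(abs(cy - y0), abs(cy - y1))
        if fx * fx + fy * fy <= r * r:
            full.append(cell_range(ix, iy, level, depth))
        elif level == max_level:
            partial.append(cell_range(ix, iy, level, depth))
        else:
            for child_x in (2 * ix, 2 * ix + 1):
                for child_y in (2 * iy, 2 * iy + 1):
                    visit(child_x, child_y, level + 1)

    visit(0, 0, 0)
    return sorted(full), sorted(partial)

full, partial = cover_disk(cx=8, cy=8, r=5, depth=4, max_level=4)
print(len(full), "ranges need no further test;", len(partial), "need the exact test")
```

In a relational setting each range would become a predicate on an indexed code column, with the expensive constraint evaluated only for rows matched by the `partial` ranges.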
Chaitan Baru, San Diego Supercomputer Center; Ramon Arrowsmith, Chris Crosby; Arizona State University; Parag Namjoshi, Viswanath Nandigam; San Diego Supercomputer Center
The GEON project (www.geongrid.org) has developed a system to provide online access to large high-resolution LiDAR topography datasets. This system is available as a portlet in the GEON portal (http://portal.geongrid.org/lidar) and is in use for a number of earth science studies. Example applications of these data include mapping of active faults in California to better understand earthquake potential, studies of landscape development in coastal California, and validation of satellite remote sensing data. Currently, the GEON LiDAR portlet serves 4 different data sets totaling over 7 billion data points and approximately 2TB. This system has also been selected as the primary distribution pathway for LiDAR data to be acquired by the GeoEarthScope component of the NSF-funded EarthScope project (which will entail more than 20 billion additional points and a significantly larger user community). The current implementation, using DB2 with spatial indexing on a 32-way IBM P690, is being migrated to parallel DB2 on a Linux cluster, where we are experimenting with data partitioning strategies for spatial data. Over 90 researchers have been actively using the LiDAR portal. We have analyzed user access patterns and plan to apply this information to database tuning, pre-computing derived products, and developing other strategies to improve overall access times. We will present the current implementation and future plans, and the use of high performance computing to serve LiDAR and other remote sensing data to the research community.
Kirk Borne, George Mason University
The growth of data volumes in science is reaching epidemic proportions. To cope with that flood, data-driven science is becoming a new research paradigm, on a par with theory and experimentation. This concept was introduced by Jim Gray as the new science of X-Informatics. Informatics is the discipline of organizing, accessing, mining, analyzing, and visualizing data for scientific discovery. We will describe astroinformatics, the new paradigm for astronomy research and education, focusing on existing eScience infrastructure (such as the National Virtual Observatory) as well as new eScience education initiatives. The latter includes the new undergraduate program in Data Sciences at George Mason University, through which students are trained in eScience tools to discover and access large distributed data repositories, to conduct meaningful scientific inquiries into the data, to mine and analyze the data, and to make data-driven scientific discoveries. The data flood is also in full force outside of the sciences. The application of data mining, knowledge discovery, and e-discovery tools to these growing data repositories is essential to the success of agencies, economies, and scientific disciplines. Consequently, many scientific disciplines are developing sub-disciplines that are information-rich and data-based, to such an extent that these are recognized stand-alone research disciplines and academic programs on their own merits. The latter include bioinformatics and geoinformatics, but will soon include eScience, astroinformatics, health informatics, and data science. We will compare these and then focus on the new discipline of astroinformatics as key to the future success of astronomy and astrophysics research. We will describe this within the context of the new CODATA initiative ADMIRE (Advanced Data Methods and Information technologies for Research and Education).
Shuo Feng, Djamila Aouada, Hamid Krim; North Carolina State University
In this paper, we propose a new 3D object representation and matching algorithm. A 3D object may be viewed as a 2D surface, and the Global Geodesic Function (GGF) of any point on the surface is defined as a normalized integrated distance from this point to all other points on the surface. With the help of GGF, a 3D object may be represented by a set of level curves of GGF. This representation is invariant to rigid body movement, and an object will be represented by the same set of curves under isometric transformation. This representation also takes advantage of another nice property of GGF: it requires no reference point. Although a 3D object is represented by the same set of curves under rigid body movement, the curves may still undergo translation, rotation, scaling, and isometric transformations. Comparing curves under these transformations is a great challenge in the object matching stage. We propose a novel Integral Invariant Signature, which may eliminate the effect of translation, rotation, scaling, and isometric transformations. The variations of a space curve under isometric transformations are mapped into the same signature curve, and the comparison is dramatically simplified. Since integration may smooth out zero-mean noise, the integral invariant signature is insensitive to noise. Other advantages of Integral Invariant Signatures, such as independence of parameterization (curve sampling) and of initial point selection, also help to simplify the matching procedure and improve the matching performance. We pick a subset of 25 models from 5 object models with articulating parts from the McGill 3D Shape Benchmark to evaluate the matching performance, and promising results are shown in the paper.
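The definition of the GGF can be made concrete with a rough discrete sketch. It assumes the surface is given as a weighted graph of mesh vertices, approximates geodesic distance by shortest paths along edges (Dijkstra's algorithm), and normalizes by the largest total; the normalization, the graph, and the function names are assumptions made for illustration, not details taken from the paper.

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra's algorithm: edge-path distances from `source` to all vertices."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def global_geodesic_function(graph):
    """Sum of approximate geodesic distances to all other vertices, scaled to [0, 1]."""
    totals = {u: sum(shortest_paths(graph, u).values()) for u in graph}
    largest = max(totals.values())
    return {u: t / largest for u, t in totals.items()}

# Hypothetical toy "surface": five vertices in a chain with unit edge lengths.
chain = {
    0: [(1, 1.0)],
    1: [(0, 1.0), (2, 1.0)],
    2: [(1, 1.0), (3, 1.0)],
    3: [(2, 1.0), (4, 1.0)],
    4: [(3, 1.0)],
}
ggf = global_geodesic_function(chain)
print(ggf)
```

Vertices with equal GGF values (here the two ends, and the two vertices next to them) fall on the same level set, and no vertex had to be chosen as a reference point.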
David Leahy, Paul Watson; Newcastle University
Scientists carry out experiments which generate extensive volumes of raw data and then apply analytical techniques to reduce the data to a form that simplifies comparison of experimental results under varying conditions. A further level of analysis is applied to draw conclusions as to the relationship between these variables and the summarized experimental results. They use the patterns uncovered to hypothesize new phenomena and to make decisions. In the drug discovery domain, variables of interest include the impact of different disease states on the behavior of tissues and the effects of treatment with chemical substances (i.e., real and potential drugs). Data analysis develops understanding, such as which biological components are implicated in disease or how the structure of a chemical is related to its impact on a biological system. For this understanding to translate into value it should also inform decision making, which in this case could be, “will this chemical be a successful drug?” The talk presents two examples of the process of data management and analysis through to decision making and describes the underlying architecture to support this. It builds on work underway at Newcastle in two areas: CARMEN, an infrastructure “in the clouds” for supporting scientific research and collaboration, and the Discovery Bus, a novel “Competitive Workflow” system for facilitating decision making.
Ying Ying Li, University of Cambridge; Karl Harison, University of Cambridge; Michael Parker, University of Cambridge; Vassily Lyutsarev, Microsoft Research; Andrei Tsaregorodtsev, CPPM CNRS-IN2P3
Particle physics studies the fundamental building blocks of nature and the interactions between them, with current understanding embodied in the subject’s Standard Model. The Large Hadron Collider (LHC), the world’s highest-energy particle accelerator, starts operation at the European Laboratory for Particle Physics (CERN), Geneva, in 2008, and will be the key testing ground for the Standard Model over the next decade or more. The four main LHC experiments, involving thousands of physicists from around the world, each need to analyze data volumes of the order of petabytes per year, about a factor of 10,000 higher than in the previous generation of CERN collider experiments. Processing of these massive amounts of data relies on the use of globally distributed computing resources, made available in the context of international Grid projects. This presentation illustrates the solutions developed for optimizing use of these resources, taking one experiment, LHCb, as an example. In particular, details are given of the experiment’s workload-management system, DIRAC, and Grid user interface, Ganga. With DIRAC, sites offering resources launch agents that pull processing requests (jobs) from a central server. The system is being used successfully to coordinate the running of many thousands of jobs per day on over 6000 CPUs, distributed across more than 80 sites and 4 continents. Ganga is a job-management framework that provides a uniform interface for accessing multiple processing systems, making it trivial to switch from tests on local batch queues to a full-scale analysis on a Grid-based system. DIRAC and Ganga both have a component architecture that readily allows customization for applications outside of particle physics. Ganga, for example, has been used in activities as diverse as software regression testing, drug.
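The pull model described in this abstract, in which agents at the sites ask a central server for work rather than having work pushed to them, can be sketched in a few lines. This is a toy illustration only: the classes `CentralTaskQueue` and `SiteAgent`, the CPU-hour matching rule, and the site and job names are hypothetical and are not part of the DIRAC API.

```python
from collections import deque

class CentralTaskQueue:
    """Hypothetical central server holding jobs until an agent asks for one."""

    def __init__(self):
        self._jobs = deque()

    def submit(self, job_id, required_cpu_hours):
        self._jobs.append((job_id, required_cpu_hours))

    def pull(self, available_cpu_hours):
        """Hand out the oldest job that fits the agent's resources, if any."""
        for index, (job_id, need) in enumerate(self._jobs):
            if need <= available_cpu_hours:
                del self._jobs[index]
                return job_id
        return None

class SiteAgent:
    """Hypothetical agent launched at a site; it requests work when it is free."""

    def __init__(self, site, cpu_hours):
        self.site = site
        self.cpu_hours = cpu_hours
        self.completed = []

    def run_once(self, queue):
        job_id = queue.pull(self.cpu_hours)
        if job_id is not None:
            self.completed.append(job_id)  # stand-in for running the job
        return job_id

queue = CentralTaskQueue()
queue.submit("reco-001", required_cpu_hours=8)
queue.submit("sim-002", required_cpu_hours=2)

small_site = SiteAgent("site-A", cpu_hours=4)
large_site = SiteAgent("site-B", cpu_hours=16)
print(small_site.run_once(queue))  # sim-002: the first job that fits site-A
print(large_site.run_once(queue))  # reco-001
print(small_site.run_once(queue))  # None: the queue is empty
```

Because the agent states what it can offer at the moment it asks, the server never has to track the state of every site, which is one common motivation for pull scheduling.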
Jennifer Hou, Zheng Zeng, Sammy Yu, Wook Shin; University of Illinois; Stanley Birge, Washington University in Saint Louis
The aging of baby boomers is creating social and economic challenges. As the population ages there will be an increasing demand on health care resources. Fortunately, advances in sensing, localization, event monitoring, and wireless communication technologies make possible the non-obtrusive supervision of the basic needs of frail elderly people, and thereby replicate the services of on-site health care providers. It is postulated that implementation of a cost-effective, secure, and open personal assistance system (PAS) that provides real-time interaction between elderly people and remote care providers can delay their transfers to skilled nursing facilities and improve the quality of their lives. We have been in the process of designing, developing, and deploying such a wireless-based software infrastructure. PAS exploits inexpensive, “off the shelf” technologies to assist elderly people to maintain the capability of independent living through (i) time-based reminders of daily activities from healthcare providers through the Internet to the home environment, (ii) monitoring of physiological functions and its delivery through the Internet to healthcare providers/clinicians, (iii) non-intrusive localization and tracking of residents with small sensor devices, and (iv) a fall detection and response system to track impact/orientation of residents and to provide audio communications with the health care provider in case of need. To enhance the robustness and ubiquity of PAS we are also exploring use of cell phones as both the wireless modem and the local intelligence for data aggregation and acquisition. We are currently working with geriatricians at Washington University in Saint Louis to evaluate PAS, with respect to the delay achieved in transitioning from independent living to a higher level of skilled nursing care, in a randomized clinical trial comparing PAS to standard of care.
Mark Bean, GlaxoSmithKline
CANDI – Retrospective on an N-tiered, .Net Remoting, Vendor-Neutral Application Suite for Liquid Chromatographic-Mass Spectrometric Analysis of Small Molecule Purity and Identity
Instrument vendor-independence is a worthwhile software goal, as learning and effectively using software from multiple vendors is non-trivial, expensive, and impractical for hundreds of chemists in their daily work. Such independence allows us to select instruments based on performance rather than software familiarity. There are two ways to reach this goal: vendors could adopt a common file format (see paper by this author on AnIML, an XML standard), or the vendor software could be isolated on servers in an N-tiered application architecture. In an N-tiered architecture, software dependencies (Oracle client, vendor software, PDF creators) are installed on application servers and a shell is created around them in a single service accessible from anywhere on the network. This is the familiar Internet architecture, but it is just as easily implemented for thick Windows clients making remote procedure calls to Windows services. This offers added benefits out of the box, such as multi-threading (for multi-processor servers) and scalability across server sets, both of which can improve the performance of processor-intensive scientific applications. A useful addition for application servers is a mechanism whereby new versions and bug fixes can be hot swapped without restarting the central services, and whereby clients automatically download and run the latest version of software on startup. This paper discusses creation and maintenance of a pure application server, using the CANDI software to illustrate some impressive architectural advantages.
David Valentine, University of San Diego; Ilya Zaslavsky, University of San Diego, Supercomputer Center
The CUAHSI Hydrologic Information System project is developing information technology infrastructure to support hydrologic science. The CUAHSI Observations Data Model (ODM) is a data model for storing hydrologic observations data in a system designed to optimize data retrieval for integrated analysis of information collected by multiple investigators. ODM v1 (Tarboton et al., 2007) provides a distinct view into what information the community has determined is important to store, and what data views the community needs. As we began to work with ODM v1, we discovered a problem with the approach of tightly linking the community views of data to the database model. ODM v1 was difficult to populate, and the large size of the model hindered the ability to populate the data model and database. Different development groups took different approaches to handling the complexity, from populating the ODM with a bare minimum of constraints to creating a fully constrained data model. This made the integration of different tools difficult. In the end, we decided to utilize the fully populated model, which ensures maximum compatibility with the data sources. Groups also discovered that, while the data model's central concept was optimized for retrieval of individual observations, in practice the concept of a data series is better suited to managing data, yet there is no link between data series and data values in ODM v1. We are beginning to develop ODM v2 as a series of profiles. By utilizing profiles, we intend to make the core information model smaller, more manageable, and simpler to understand and populate. We intend to keep the community semantics, improve the linkages between data series and data values, and enhance data retrieval.
Tarboton et al., 2007. CUAHSI Community Observations Data Model (ODM), Version 1.0.
Retrieved from: http://water.usu.edu/cuah.si/odm/files/ODM1.pdf.
Jinze Liu, University of Kentucky
A central focus of genetics is the genetic basis of phenotypic traits and their variation. The recent proliferation of high-throughput bio-technologies has enabled the collection of a wealth of data describing the genetic makeup and phenotypic traits of a given biological system. For example, genome-wide SNP (Single Nucleotide Polymorphism) data and gene expression data may be collected for multiple strains of mice to describe their genotypic variation and phenotypic variation, respectively. Expression Quantitative Trait Locus (eQTL) mapping seeks to identify genes whose genotypic variations are associated with the expression variations. This approach has the potential to dissect the genetic basis of gene expression, which can be further utilized to infer causal relationships between modulator and modulated genes. Existing eQTL methods suffer from a lack of systematic statistical modeling of genome-wide linkages and/or are extremely demanding in computational power. We present an approximate Bayesian eQTL method. The Bayesian method can produce precise statements about the posterior densities of linkages between an expression trait and the genetic makeup of a gene. While the method improves on existing approaches, it introduces new computational challenges for large-scale eQTL studies. We employ Laplace’s method to approximate the integration of the likelihood over nuisance parameters, and this has proven to be accurate and especially computationally efficient for eQTL analysis.
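Laplace's method replaces an integral of the form ∫ exp(h(x)) dx by exp(h(x̂)) · sqrt(2π / −h″(x̂)), where x̂ is the mode of h. The following one-dimensional sketch shows the mechanics on a textbook integrand whose exact value is known; it is not the eQTL likelihood from the abstract, and the function name `laplace_integral`, the finite-difference step, and the starting point are illustrative choices.

```python
import math

def laplace_integral(log_f, start, step=1e-4, iterations=50):
    """Approximate the integral of exp(log_f(x)) dx with Laplace's method.

    The mode is found by Newton's method using finite-difference
    derivatives; log_f is assumed smooth with a single interior maximum.
    """
    def d1(x):
        return (log_f(x + step) - log_f(x - step)) / (2 * step)

    def d2(x):
        return (log_f(x + step) - 2 * log_f(x) + log_f(x - step)) / step ** 2

    x = start
    for _ in range(iterations):
        x -= d1(x) / d2(x)          # Newton update towards the mode
    curvature = -d2(x)              # positive at a maximum
    return math.exp(log_f(x)) * math.sqrt(2 * math.pi / curvature)

# Check on the integral of x**n * exp(-x) over x > 0, which equals n!.
n = 10
approx = laplace_integral(lambda x: n * math.log(x) - x, start=5.0)
exact = math.factorial(n)
print(approx / exact)  # within about 1% of 1 (this is Stirling's approximation)
```

The cost is one optimization plus one curvature evaluation per integral, instead of a numerical quadrature or sampling run, which is the reason the approach scales to many trait and marker pairs.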
Rob Procter, Peter Halfpenny, Alex Voss; National Centre for e-Social Science
Among research priorities identified in a recent review of UK social science are globalization, population change and understanding individual behavior. The nature of these problems calls for collaboration across traditional disciplinary boundaries, and their complexity and scale demand more powerful research tools. At the same time, the social sciences are on the verge of what is likely to be a fundamental and decisive shift in data collection methods as they seek to unlock the research value of “born digital” data such as administrative and transactional records. The National Centre for e-Social Science (NCeSS) was established by the UK Economic and Social Research Council (ESRC) in 2004 as its key contribution to the UK e-Science programs. The Centre’s objective is to enable social scientists to make best use of emerging eScience technologies in order to address the key challenges in their substantive research fields in new ways. In pursuit of this, NCeSS aims to stimulate the uptake and use across the UK social science research community of distributed computational resources, data infrastructures and collaboration mechanisms by coordinating a program of e-Social Science research, making available information, training, advice and support to the social research community, and leading the development of an e-Infrastructure for the Social Sciences that will provide new resources and tools for social research. NCeSS is also responsible for providing advice to the ESRC on the future strategic direction of e-Social Science. In this presentation, we will review the progress NCeSS has made to date in achieving its objective and outline its roadmap for future research and development of methodologies, tools and infrastructure.
Rob Procter, Alex Voss, Peter Halfpenny, Marzieh Asgari-Targhi; National Centre for e-Social Science
As part of a study to investigate and tackle barriers to adoption of e-Infrastructure, we have been conducting a review of project documentation, reports and academic papers in the field with the aim of establishing a typology of barriers to uptake and candidate responses to tackle them. Underlying this is the expectation that there are ways of dimensioning the problem space so as to reveal recurring patterns in adoption processes; that these barriers will be “typical” in a number of different ways, e.g., typical in a particular domain, for a given technology, for specific stakeholders, etc. Of course, the real value of this study lies in how it may prospectively afford the adoption of e-Infrastructure rather than simply explain its history. The concern here is how to make our findings re-usable by a broad range of e-Infrastructure users, both current and future. What is needed is a format that allows us to capture the different dimensions of our typology, linking what we recognize as “typical” to concrete examples so that users can navigate the space between a clear conceptual framework and a set of pertinent examples of barriers and concrete responses to them. While a simple wiki served the purpose of data collection adequately at the beginning, we are now finding that as we populate this space, a more structured and dynamic approach is required to reflect the complex relationships found. We will report on this initial phase of our data collection, further steps towards our own empirical work and the development of a rich representation of our findings. We will also talk about plans to make our work sustainable by fostering a community process that we hope will eventually carry on an active reflection within the e-Infrastructure user community about the state of adoption and effective ways forward towards realizing the ambitious goals of e-Research.
Djamila Aouada, Hamid Krim; North Carolina State University
During the last decade, 3D data acquisition techniques have developed very quickly, contributing to a significant increase in the available three-dimensional data. This explosive growth has created a natural need for efficient and automatic classification methods. We propose to base our classification technique on simply characterizing each 3D object by one parameter “R”, referred to as the characteristic resolution. “R” is empirically defined as the minimal number of points that correctly represent the shape of an object. “R” is very effective in reducing the computational cost for nearly the same quality of representation. Indeed, the initial number of points constituting a mesh may often be reduced by up to a factor of ten. Moreover, using flat norm-like measures, we show that “R” is directly related to the curvature information of each shape. Hence, our classification technique relies on this unique property of each shape. We present promising results obtained on a sample dataset of 120 objects.
Cory Quammen, Russell Taylor; University of North Carolina, Chapel Hill
Increasingly, computers take on crucial roles in processing and analyzing results from experimental science. In many applications, one such role involves removing artifacts in a signal produced by the sensing device that captured the signal. Such applications typically use a model of the sensing device’s effect to remove the artifacts, producing a “restored” signal. Inferences about the object or process under study are then made by analyzing the restored signal. In contrast to a restoration approach, we propose to reverse the procedure by using computer simulation to generate the signal a sensing device would produce when observing a hypothesized model of the object under study. Differences between the simulated signal and an experimentally obtained signal can be used in an optimization loop to derive a set of model parameters that best explain the experimental signal. In this talk, I will describe an implementation of this methodology for understanding biological images from confocal microscopes.
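The simulate-and-compare loop can be illustrated with a deliberately small one-dimensional sketch. Everything in it is an assumption made for illustration: the object is a box of unknown width, the sensing device is modeled as a known three-tap blur, the "measurement" is synthetic, and an exhaustive search over candidate widths stands in for a real optimizer.

```python
def simulate(width, kernel, length=32, start=8):
    """Forward model: a box of `width` samples blurred by the device kernel."""
    obj = [1.0 if start <= i < start + width else 0.0 for i in range(length)]
    half = len(kernel) // 2
    signal = []
    for i in range(length):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < length:
                total += weight * obj[j]
        signal.append(total)
    return signal

def fit_width(measured, kernel, candidates):
    """Pick the model parameter whose simulated signal best matches the data."""
    def error(width):
        simulated = simulate(width, kernel)
        return sum((s - m) ** 2 for s, m in zip(simulated, measured))
    return min(candidates, key=error)

# Hypothetical blur kernel and a synthetic measurement of a width-6 object
# with a small alternating perturbation standing in for noise.
kernel = [0.25, 0.5, 0.25]
measured = [v + 0.02 * (-1) ** i for i, v in enumerate(simulate(6, kernel))]

print(fit_width(measured, kernel, range(1, 16)))  # prints 6
```

Note that the measured signal is never deblurred: the inference is made entirely by comparing it with signals generated from hypothesized objects.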
Catherine Blake, University of North Carolina, Chapel Hill; Nassib Nassar, Renaissance Computing Institute
Scientists in healthcare and biomedical informatics have never had as much information available in electronic form as they do today. The increased variety of approaches for information retrieval and extraction offers the potential to combine different techniques and provide scientists with new ways of accessing information; but the real contribution to e-science will occur when the next generation of information tools is consistent with the work flows used by scientists in a specific discipline. One area that holds much promise is recent work that focuses on retrieving relevant passages and entities from an article, rather than an entire document. In this presentation, we will explore the degree to which a concept representation and methods that recognize textual entailment will aid passage retrieval performance. Our approach combines concepts from the Unified Medical Language System (UMLS) with a syntax representation that has shown success in recognizing textual entailment. We use the Genomics TREC collection of 160,000 documents and 50 topics that biologists considered important. The standard measures of precision (the number of accurately retrieved passages divided by the number of passages retrieved), recall (the number of accurately retrieved passages divided by the total number of relevant passages), and mean average precision (MAP) (the precision of correctly retrieved passages averaged over each level of recall) are used to evaluate results. One of the key motivations behind this work is the historically low precision values measured in passage retrieval, and the need to investigate alternative approaches that increase relative precision. In addition, the scope of the UMLS enables us to explore the impact of different vocabularies on performance and may inform both manual and automated methods of ontology construction.
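The three evaluation measures named above can be written down compactly. The sketch below uses the common item-level textbook definitions and hypothetical passage identifiers; the scoring actually used for the TREC Genomics passage task may differ in detail, so this is an illustration of the measures rather than a reimplementation of the track's evaluation.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one topic; `retrieved` holds unique items."""
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Average of per-topic average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical ranked results and relevance judgments for two topics.
topic_a = (["p1", "p7", "p3", "p9"], {"p1", "p3", "p5"})
topic_b = (["q2", "q4"], {"q4"})

print(precision_recall(*topic_a))                  # precision 0.5, recall 2/3
print(average_precision(*topic_a))                 # (1/1 + 2/3) / 3
print(mean_average_precision([topic_a, topic_b]))  # mean over the two topics
```

Average precision rewards placing relevant passages early in the ranking, which is why it is reported alongside the set-based precision and recall.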
Mladen Vouk, Scott Klasky, Roselyne Barreto, Terence Critchlow, Ayla Khan, Jeffery Ligon, Pierre Mouallem, Meiyappan Nagappan, Norbert Podhorszki, Leena Kora
A dashboard for petascale multi-scale simulation displays pertinent information about the simulation in an intuitive form so that application scientists can easily monitor and retrieve vital information. Our vision for a petascale simulation dashboard displays the most interesting information from the simulation and combines enough provenance information so that users can inspect not only their simulations, but also the machines used. This not only allows the user to monitor the simulations and machines, but also to interact with them to perhaps adaptively show parts of the simulations and display results from queries to inspect the status of the workflow. In this paper we will discuss some of the general principles behind dashboard design for scientific workflows. Our work concentrated mainly on the following: a) monitoring large supercomputing resources and clusters; b) monitoring jobs on these large resources; c) submitting jobs, editing input files, and interacting with remote resources; d) organizing and displaying simulations one runs on these resources, including the capturing of annotations to describe the simulation; e) monitoring the simulation itself in real-time and for later post processing; f) displaying scientific information on dashboards; and finally g) the methods for interacting with running simulations and how this interacts with specialized workflows for controlling simulations. One of the key features of our dashboard allows recording of annotations in a database, capturing the provenance. This allows the dashboard to be integrated with an electronic scientific notebook in which scientists can track all of the elements of a simulation. These include the following: graphs of data (xy, contour, 3D) plus the time the data were saved, annotations of these graphs, mapping of this data with other graphs the user needs to compare with, and external data the user compares the simulation with, including outside experimental data for validation.
The main pieces behind our dashboard …
Richard Jones, David Wallom, Carl Christensen, Myles Allen, Milo Thurston, Tolu Aina, Simon Wilson
The aim of the Climateprediction.net PRECIS regional modeling experiment is to provide a public distribution of a physically based modeling system allowing detailed assessments of future climate change for any region by continuously coupling a coarse-resolution global climate model and a high-resolution regional climate model. Global Climate Models (GCMs) describe the important physical processes that make up the climate system but tend to have a coarse resolution, with grid scales of up to a few hundred kilometers. Impact, vulnerability and adaptation studies need to be carried out on much finer scales. Regional Climate Models (RCMs) have the potential to improve the representation of the climate information and dynamics, which is important for assessing a country’s vulnerability to climate change. PRECIS is designed as a practical and flexible RCM which allows scientists to run regional simulations on their own PCs. It is intended for use by non-Annex I countries which have minimal computing resources available for climate change studies. A public-resource distributed computing version of this system would allow assessment of likely ranges of detailed future climate changes over any region of the globe. This approach would employ the volunteer computing paradigm, which also lends itself well to public education and outreach endeavors.
Michael Brady, Niranjan Joshi, Andrew Blake, Vicente Grau, Fergus Gleeson, Anne Trefethen
We report recent progress on a Microsoft-sponsored project that is based on collaboration between Microsoft Cambridge and Oxford University. The project has an application focus: more accurate delineation of key anatomical structures in MRI images of the colorectum in order to assess cancer staging and the feasibility of carrying out a resection. The project also has a more generic image analysis component: the analysis of existing segmentation algorithms and the development of a novel synthesis that combines their best features. Colorectal MRI images provide a tough environment in which to develop algorithms that can reliably and accurately segment structures such as the mesorectum (for surgical assessment) and lymph nodes (for staging). We are analyzing three well known techniques for image segmentation: level sets, Hidden Markov Measure Fields (HMMF), and graph cuts. Image noise, partial volume effects (mixed tissue voxels as a result of low spatial sampling in the image), and other forms of uncertainty lead us to Bayesian methods, for which the estimation of probability density functions (PDFs) of intensities, local phase structures, or other image representations is a fundamental requirement. Noting that histograms perform poorly when given few samples of a distribution, and that kernel methods work well but are computationally intensive when optimized, we have developed a non-parametric PDF estimation scheme (NP-Windows) and extended it to handle the partial volume effect by an inequality constrained least squares method. The resulting NP-Windows-ICLS algorithm has been incorporated into the region term of a level sets segmentation algorithm. We have used the monogenic signal (local energy, phase, and orientation) to provide a range of features for the level set algorithm. The resulting method gives very accurate results on a range of clinical data. We outline the next steps, toward relating the work to HMMF and graph cuts models.
Charles Loftis, Nanthini Ganapathi; RTI International
Survey data collection projects strive to collect high-quality data from survey respondents. The quality of the data collected depends greatly on how effectively field interviewers (FIs) conduct in-person screenings and interviews. Training FIs and subsequently assessing their knowledge of project protocol, methods, and interviewing techniques is critical to the overall success of any data collection effort. For large surveys, as the number of FIs increases, the cost of in-person training can become prohibitively large. As a cost-effective solution to increase the quality of the field data, we developed a suite of web- and media-based training and assessment tools, called iLearning and eHomeStudy, for training field staff. Besides saving the project the costs associated with in-person training, the suite allows us to provide refresher training throughout the year. It also enables FIs to view standardized training courses at their convenience and at their own pace. This paper describes the technical details, key features, and benefits of this application suite, and includes some details on user satisfaction and future directions.
Sarah Carrier, Jed Dube, Jane Greenberg, Hilmar Lapp, Abbey Thompson, Todd Vision and Hollie White; University of North Carolina, Chapel Hill
The Dryad repository aims to support the preservation, discovery, sharing, use, and reuse of scientific data objects supporting published research in the field of evolutionary biology. Dryad is supported by a collaboration involving NESCent (The National Evolutionary Synthesis Center) and the Metadata Research Center (MRC) at the School of Information and Library Science, University of North Carolina at Chapel Hill. Dryad exemplifies the transformation of scientific publishing and data discovery motivated by the convergence of open access and eScience. Dryad seeks to balance a need for low barriers, which invite contribution from the wide range of scientists participating in the field of evolutionary biology, with a series of sophisticated, higher-level goals supporting the data synthesis required to advance the field. In order to meet these goals, we have defined Dryad’s functional requirements. We conducted a survey of selected leading digital data and resource repository initiatives and held two stakeholder workshops (December ’06 and May ’07) with scientists (targeted depositors and users), representatives of major evolutionary biology journals and scientific societies, and metadata and digital library experts. Based on this input, we have developed Phase I of Dryad’s metadata architecture. To gather additional input, we are developing a survey and a use case study that will provide data on evolutionary biologists’ experiences with and perceptions of open data repositories and the professional sharing of scientific data. This work will further inform Dryad’s future architecture. Here we present Dryad’s functional requirements, the underlying repository architecture, and the research methodologies and protocols for our forthcoming survey and use case study.
Miriam Heller, University of Southern California; Anthony E. Kelly, George Mason University; John Cherniavsky, Arlene de Strulle; National Science Foundation
Hundreds of collaboratories have emerged to transcend distance and time constraints and allow communities of research scientists and engineers to interact, share data, digital libraries, and computational resources, and exploit remote instrumentation. Budget levels have ranged from $500,000 to $11,000,000 per collaboratory, possibly motivating the transformation of collaboratories from objects for research into objects of research. For instance, NSF’s Science of Collaboratories project studied over 200 collaboratories to identify sustainable, generalizable technologies for collaboration in science research. Some collaboratories claim to include virtual learning environments. The Science of Collaboratories database included thirty with learning features. Davenport (2005) notes, though, with regard to collaboratories, “…learning is not traditionally discussed or included in research proposals as a research activity.” An analogous Science of Learning Collaboratories program demands consideration, especially if collaboratories are to achieve effective integration of research and learning. Key to understanding, assessing, and optimizing emerging learning collaboratory features is a set of new Models of Educational Inquiry (MEI). The following eight MEIs are proposed and described in this poster to facilitate scholarly research on learning in a distributed, networked, collaborative environment: Curricular Content, Cyber-Learning, Teaching, Assessment and Evaluation, Educational Policy, Educational Research Design, Educational Technologies and Learning Environments, and Communities of Learning and Teaching. Finholt, T. Collaboratories. In B. Cronin (Ed.). Annual Review of Information Science and Technology. Washington, DC: American Society for Information Science and Technology, 2001, 73-108. Bos, N., Zimmerman, et al.