Chair: Catharine van Ingen
Life Under Your Feet: A Wireless Soil Ecology Sensor Network
Katalin Szlavecz & Andreas Terzis, The Johns Hopkins University
Wireless sensor networks (WSNs) have the potential to revolutionize soil ecology by providing abundant data gathered at temporal and spatial granularities previously impossible. In this talk we will outline some of the open questions in soil ecology today and elaborate on the potential of WSNs to provide the data to answer these questions.
In the second part of the talk we will present an experimental network for soil monitoring that we developed and deployed for a period of one year in an urban forest in Baltimore. Each node in our network collects soil moisture and temperature measurements every minute and stores them in local memory. All collected measurements are retrieved by a sensor gateway and inserted into a database in both raw and calibrated form. Stored measurements are subsequently made available to third-party applications through a web services interface.
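As a minimal sketch of the gateway's insert path, the following stores each reading in both raw and calibrated form; the linear calibration coefficients and table layout are invented for illustration, not those of the actual deployment:

```python
import sqlite3

# Hypothetical linear calibration; the real coefficients would come from
# per-sensor lab calibration, not these illustrative values.
def calibrate(raw, gain=0.0325, offset=-0.40):
    """Convert a raw ADC reading to volumetric soil moisture (fraction)."""
    return gain * raw + offset

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE readings (
    node_id INTEGER, ts TEXT, raw REAL, calibrated REAL)""")

def store(node_id, ts, raw):
    # Insert the measurement in both raw and calibrated form,
    # as the abstract describes.
    db.execute("INSERT INTO readings VALUES (?, ?, ?, ?)",
               (node_id, ts, raw, calibrate(raw)))

store(7, "2006-06-01T00:01:00", 20.0)
row = db.execute("SELECT raw, calibrated FROM readings").fetchone()
```

Keeping the raw value alongside the calibrated one lets measurements be re-calibrated later if the sensor model is revised.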
At a high level this first deployment was a scientific success, exposing variations in the local soil micro-climate not previously observed. However, it also points to a number of challenging problems that must be addressed before sensor networks can fulfill their potential of being predictable and robust instruments empowering scientists to observe phenomena that were previously out of reach. We will close the talk by discussing how we plan to address these challenges in the second deployment of our network that we are currently designing.
The WaveScope Data Management System
Samuel Madden, Massachusetts Institute of Technology
WaveScope is a data management and continuous sensor data processing system that integrates relational database and signal processing operations into a single system. WaveScope is motivated by a large number of signal-oriented streaming sensor applications, such as: preventive maintenance of industrial equipment; detection of fractures and ruptures in various structures; in situ animal behavior studies using acoustic sensing; network traffic analysis; and medical applications such as anomaly detection in EKGs. These target applications use a variety of embedded sensors, each sampling at fine resolution and producing data at high rates, ranging from hundreds to hundreds of thousands of samples per second. Though there has been some work in the sensor network community on applications that do this kind of signal processing (for example, shooter localization, industrial equipment monitoring, and urban infrastructure monitoring), these applications are typically custom-built and do not provide a reusable high-level programming framework suitable for easily building new signal processing applications with similar functionality. This talk will discuss how WaveScope supports these types of applications in a single, unified framework, providing both high run-time performance and easy application development, and will illustrate how several scientific applications are built in the WaveScope framework.
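The flavor of mixing signal operators with relational ones over a stream can be sketched in a few lines; the windowed-RMS operator and threshold selection below are illustrative stand-ins, not WaveScope's actual API:

```python
def rms_windows(stream, window=4):
    """Signal-processing operator: RMS energy over fixed-size windows."""
    buf = []
    for sample in stream:
        buf.append(sample)
        if len(buf) == window:
            yield (sum(x * x for x in buf) / window) ** 0.5
            buf = []

def threshold_filter(stream, limit):
    """Relational-style selection applied to the derived stream."""
    return (v for v in stream if v > limit)

# Quiet samples followed by a burst of activity.
samples = [0.0, 0.0, 0.0, 0.0, 3.0, 4.0, 3.0, 4.0]
alerts = list(threshold_filter(rms_windows(samples), limit=1.0))
```

Composing the two generators gives a single pull-based pipeline, loosely analogous to chaining a signal operator into a relational selection in one query plan.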
Challenges in Building a Portal for Sensors World-Wide
Feng Zhao & Suman Nath, Microsoft Research
SensorMap is a portal web site for real-time real-world sensor data. It allows data owners to easily make their data available on the map. The platform also transparently provides mechanisms to archive and index data, to process queries, to aggregate and present results on a geo-centric web interface based on Windows Live Local. In this talk, I will describe the architecture of SensorMap, key challenges in building such a portal, and current status and experience. I will also highlight how such a portal can help eScience research.
Transforming Ocean and Earth Sciences with Distributed Submarine Sensor Networks
John R. Delaney, University of Washington
Interactive, internet-linked sensor-robotic networks are the next-generation approach to enabling long-term 24/7/365 surveillance of major remote or dangerous processes that are central to the habitability of our planet. Continuous, real-time information from the environment, specifically from the ocean basins, will launch rapid growth in our understanding of the habitats and behavior of known and novel life forms, climate change, assessment and management of living and non-living marine resources, elements of homeland defense, erupting underwater volcanoes, major earthquake timing and intensity, and mitigation of natural disasters.
The NEPTUNE ocean observatory program will be a leader in this approach. The observatory’s 1400-mile network of heavily instrumented fiber-optic/power cable will convert a major sector of the Juan de Fuca tectonic plate and its overlying ocean off the coasts of Washington, Oregon, and British Columbia into an internationally accessible interactive, real-time natural laboratory reaching millions of users or viewers via the Internet.
Thousands of physical, chemical, and biological sensors distributed across the seafloor, throughout the ocean above, and within the seabed below, may be linked to partially or fully autonomous robotic platforms that are integrated into interactive networks connected via the Internet to land-based users. NEPTUNE is being designed to provide scientists, educators, policy makers, and the public with unprecedented forms of novel information about a broad host of natural and human-induced processes operating within the ocean basins. Data management and visualization challenges include handling large volumes of multidisciplinary data streams; assimilating real-time data into models; and providing data discovery and visualization tools that enable collaborative discovery by groups of researchers.
Chair: Simon Mercer
The Chemical Informatics and Cyberinfrastructure Collaboration: Building a Web Service Infrastructure for Chemical Informatics
Marlon Pierce & David Wild, Indiana University
At the Indiana University School of Informatics we are developing a web-service, workflow, and smart-client infrastructure to allow the intelligent querying, mining, and use of drug discovery information. As the volume and diversity of sources of chemical, biological, and other information related to drug discovery have grown, it has become increasingly difficult for scientists to use this information effectively. In this presentation we will discuss our approach to harnessing the information available, including the use of literature, chemical databases, biological information, and information generated by computational tools such as docking. We will give examples of workflows that bring together tools and information in new ways, and discuss our efforts to develop innovative interaction tools and interfaces that let scientists map their information needs onto these workflows.
What’s Your Lab Doing in My Pocket? Supporting Mobile Field Studies with Xensor for Smartphone
Henri ter Hofte, Telematica Instituut, the Netherlands
Smartphones tend to travel along with people in everyday life, wherever they are and whatever they are doing. This literally puts these devices in an ideal position to capture several aspects of phenomena, such as a person's location and proximity to others. Xensor for Smartphone is an extensible toolkit that exploits the hardware sensors and software capabilities of Windows Mobile 5.0 smartphones to capture objective data about human behavior and its context (such as location, proximity, and communication activities), together with objective data about application usage and highly subjective data about user experience (such as needs, frustrations, and other feelings). The aim is to provide social science with a research instrument for gaining much more detailed and dynamic insight into social phenomena and their relations. In turn, these outcomes can inform the design of successful mobile context-aware applications.
In this talk, we present and demonstrate the support Xensor for Smartphone provides in various phases of a scientific study: configuration, deployment, data collection and analysis. We also highlight how we used various Microsoft technologies (including Windows Mobile, .NET Compact Framework, SQL Mobile and SQL Server) in an occasionally-connected smart client architecture to implement the Xensor for Smartphone system.
ProDA’s Smart Client for On-Line Scientific Data Analysis
Cyrus Shahabi, University of Southern California
In the past three years, we have designed and developed a system called ProDA (for Progressive Data Analysis), which deploys wavelet transformation and web-services technology for efficient and transparent analysis of large multidimensional data in Online Scientific Applications (OSA).
Two types of processing are needed by OSA. First, a set of data-intensive operations needs to be performed on terabytes of data (e.g., sampling and aggregation) to prepare a relevant subset of data for further analysis. Second, visualization and deeper analysis of this subset occur in a more interactive mode of operation. For the first set of tasks, we moved the operations as close to the data as possible, both to avoid unnecessary and costly data transmission and to enable fast queries by pre-aggregating data as wavelets. Hence, with ProDA, we used the .NET Framework to develop a set of customized web services that perform typical scientific data analysis tasks efficiently in the wavelet domain, close to the data. The second set of tasks is best performed using the user's favorite tools already provided by the client platform (such as a spreadsheet application). Therefore, ProDA's web-enabled smart client, implemented in C#, allows both transparent access to the second-tier web services and smooth invocation of client-side tools. This architecture also allows mobile OSA users to cache data and perform ad-hoc data analysis tasks on the cached data while disconnected from their huge data repositories.
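As a rough sketch of the wavelet-domain idea (not ProDA's actual code), the unnormalized Haar transform below pre-aggregates a signal into averages and details; keeping only the largest-magnitude coefficients yields a progressively refinable approximate answer:

```python
def haar(signal):
    """Unnormalized Haar decomposition (length must be a power of 2).
    Returns [overall average, details coarsest ... finest]."""
    coeffs, details = list(signal), []
    while len(coeffs) > 1:
        pairs = list(zip(coeffs[0::2], coeffs[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        coeffs = [(a + b) / 2 for a, b in pairs]
    return coeffs + details

def inverse_haar(coeffs):
    """Exact inverse of haar()."""
    vals, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(vals)]
        pos += len(vals)
        vals = [x for v, d in zip(vals, details) for x in (v + d, v - d)]
    return vals

def top_k_approx(coeffs, k):
    """Progressive answer: keep only the k largest-magnitude coefficients."""
    keep = set(sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                      reverse=True)[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

signal = [5.0, 3.0, 2.0, 2.0, 4.0, 4.0, 1.0, 1.0]
coeffs = haar(signal)                          # what the server would store
rough = inverse_haar(top_k_approx(coeffs, 4))  # coarse answer, refinable
exact = inverse_haar(coeffs)                   # full-precision reconstruction
```

Because sums and averages over dyadic ranges touch only O(log N) coefficients, storing the data in the wavelet domain makes such aggregates fast without reading the raw terabytes.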
We deployed ProDA in two different application domains: Earth Data Analysis (sponsored by JPL) and Oil Well Sensor Data Analysis (sponsored by Chevron). In this talk we will emphasize and demonstrate ProDA’s utility in the Chevron application.
Function Express Gold: A caBIG™ Grid-aware Microarray Analysis Application
Rakesh Nagarajan, Washington University
It is becoming increasingly apparent that a majority of human diseases, including tumorigenesis, are the product of multi-step processes, each involving the complex interplay of a multitude of genes acting at different levels of the genetic program. To study such complex diseases, many analyses on the genomic scale are possible in the post-human-genome-sequencing era. Foremost among these is the microarray experiment, in which an investigator has the ability to monitor the expression of all genes in a particular tissue. However, most end-user physician-scientists find the task of analyzing data generated from microarray experiments daunting, since considerable computing power and expertise are required. To directly address this growing need, the National Cancer Institute has recently started the cancer Biomedical Informatics Grid (caBIG™, at https://cabig.nci.nih.gov/) initiative to create a “network or grid connecting individuals and institutions to enable the sharing of data and tools, creating a World Wide Web of cancer research.” Using caBIG™ data and analytical services, we propose to develop Function Express Gold (FE Gold), a caGrid-aware Microsoft Smart Client microarray analysis application. In our approach, all grid sources will be accessed using web-service adapters. Namely, FE Gold will acquire microarray and gene annotation data using caGrid data services, and this data will then be filtered, normalized, and mined using caGrid analytical services. Using the acquired microarray data, analysis results, and gene annotation information, FE Gold will be able to use local computing power for graphical display and analysis even when the client is not connected to the Internet. When network connectivity is available, FE Gold will check for annotation data updates at the server end in a seamless fashion. Finally, new releases as well as bug fixes will be distributed to all clients using the Background Intelligent Transfer Service.
Chair: Jim French
Sorting in Space
Hanan Samet, University of Maryland
The representation of spatial data is an important issue in computer graphics, computer vision, geographic information systems, and robotics. A wide variety of representations is currently in use. Recently there has been renewed interest in hierarchical data structures such as quadtrees, octrees, and R-trees. The key advantage of these representations is that they provide a way to index into space; in fact, they are little more than multidimensional sorts. They are compact and, depending on the nature of the spatial data, they save space as well as time and also facilitate operations such as search. In this talk we give a brief overview of hierarchical spatial data structures and related research results. In addition we demonstrate the SAND Browser (http://www.cs.umd.edu/~brabec/sandjava) and the VASCO JAVA applet (http://www.cs.umd.edu/~hjs/quadtree), which illustrate these methods.
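The remark that these structures are little more than multidimensional sorts can be illustrated with a Z-order (Morton) code, the bit-interleaving trick that underlies many quadtree linearizations; this sketch is illustrative and not taken from the SAND or VASCO code:

```python
def morton(x, y, bits=16):
    """Interleave the bits of x and y to form a Z-order (Morton) index.
    Sorting points by this key clusters them along a space-filling curve,
    so nearby points in 2D tend to be nearby in the sorted order."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x supplies the even bits
        code |= ((y >> i) & 1) << (2 * i + 1)   # y supplies the odd bits
    return code

points = [(3, 5), (1, 1), (6, 2), (2, 7)]
points.sort(key=lambda p: morton(*p))  # a one-dimensional "sort" of 2D space
```

Each prefix of the interleaved bits names a quadtree cell, which is exactly why a plain sort on this key doubles as a spatial index.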
Geospatial Infrastructure Goes to the Database
Tamás Budavári & Alex Szalay, The Johns Hopkins University
Jim Gray & Jose Blakeley, Microsoft
We present a novel approach to dealing with geospatial information inside the database. The design is based on our own lightweight spatial framework, developed for representing complex shapes on the surface of the unit sphere independent of coordinate systems or projections. This C# library is not only capable of formally describing regions of interest very accurately, but also features the full set of logical operations on regions (such as union and intersection), as well as precise area calculation.
The internal mathematical representation is tuned for flexibility and fast point-in-region searches regardless of the area coverage. Leveraging the CLR capabilities of Microsoft's SQL Server 2005, we surface most of the functionality of the spherical class library to SQL. The SQL routines use a custom serializer for storing the shapes in binary blobs inside the database. For very fast searches, we materialize and index in SQL both bounding circles and an adaptive approximation based on the Hierarchical Triangular Mesh.
Efficient Search Index for Spherically Distributed Spatial Data in a Relational Model
Gyorgy Fekete, The Johns Hopkins University
We discuss a project to develop a system for rapid data storage and retrieval that uses the Hierarchical Triangular Mesh (HTM) to perform fast indexing over a spherical spatial domain, accelerating the storage and retrieval of data about the Earth and sky. Spatial searches over the sky are the most frequent queries on astrophysics data, and as such are central to the National Virtual Observatory (NVO) effort and beyond.
The library has applications in astronomy and earth science. The goal is to speed up queries that involve an object (such as an observation or location) and a region of interest of arbitrary shape (such as a political boundary or satellite track). In a very large database, one wants to minimize the number of calculations needed to decide whether an object meets a spatial search criterion. We use an HTM index-based method to build, on the fly, a coarse covermap of the query region, which is then used to eliminate most of the objects that are clearly outside the region. False positives that pass the coarse test are removed with more precise, albeit more time-consuming, calculations.
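A much-simplified stand-in for this two-stage test (cheap coarse filter first, exact geometry second) can be sketched as follows; here the coarse filter is merely a latitude band guaranteed to contain a spherical-cap query region, whereas HTM uses a hierarchy of spherical triangles:

```python
import math

def to_vec(lon, lat):
    """Unit vector for a point given in degrees."""
    lo, la = math.radians(lon), math.radians(lat)
    return (math.cos(la) * math.cos(lo),
            math.cos(la) * math.sin(lo),
            math.sin(la))

def in_cap_exact(p, center, radius_deg):
    """Precise point-in-region test for a spherical cap via the dot product."""
    dot = sum(a * b for a, b in zip(to_vec(*p), to_vec(*center)))
    return dot >= math.cos(math.radians(radius_deg))

def coarse_reject(p, center, radius_deg):
    """Cheap conservative filter: any point of the cap lies within
    radius_deg of the center's latitude, so points outside that band
    are surely outside. (HTM uses indexed trixels instead.)"""
    return abs(p[1] - center[1]) > radius_deg

center, radius = (0.0, 40.0), 5.0   # cap: 5 degrees around (lon 0, lat 40)
objects = [(0.0, 41.0), (0.0, 60.0), (3.0, 40.0), (120.0, 40.0)]

candidates = [p for p in objects if not coarse_reject(p, center, radius)]
matches = [p for p in candidates if in_cap_exact(p, center, radius)]
```

The coarse stage may let false positives through (here, the point at longitude 120) but never drops a true match, which is the essential property of the HTM covermap.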
A challenging problem, cross matching, is to find data on the same object in separate archives. Simple boxing with rectilinear constraints is inadequate because such boxes are singular at the poles, unstable near them, and the actual shapes of areas of interest do not always fit neatly within a box. Furthermore, because of constraints imposed by instruments, engineering, and so on, scientists may need to define their own irregularly shaped query regions.
With the recent advances in the worldwide Virtual Observatory effort, we now have a standard Extensible Markup Language (XML) data model for space-time data. This data model also provides a new standard way to express spherical polygons as search criteria. Two outcomes of this project are (1) a layer that enables our search engine to run inside relational Structured Query Language (SQL) databases, whether Open Source or commercial (such as SQL Server), and (2) an access method that participates as a first-class citizen in relational database queries. The toolkit is implemented in a highly portable framework in the C# programming language, which allows seamless integration with relational database engines and Web services and, in particular, makes it possible to develop a full Web-service implementation of the library that can be accessed through remote calls.
Using Databases to Store the Space-Time Histories of Turbulent Flows
Randal Burns, ShiYi Chen, Laurent Chevillard, Charles Meneveau, Eric Perlman, Alex Szalay, Ethan Vishniac & Zuoli Xiao, Johns Hopkins University
We describe a new environment for large-scale turbulence simulations that uses a cluster of database nodes to store the complete space-time history of fluid velocities. This allows rapid access to high-resolution data that were traditionally too large to store and too computationally expensive to produce on demand. The system performs the actual experimental analysis inside the database nodes, which allows data-intensive computations to be performed across a large number of nodes with relatively little network traffic. Currently, we have a limited-scale prototype system running actual turbulence simulations and are in the process of establishing a production cluster with high-resolution data. We will discuss our design choices, computing environment, and initial results with load balancing a data-intensive, migratory workload.
Chair: Ed Lazowska
Tools for Distributed Observatory Management
Mike Godin, Monterey Bay Aquarium Research Institute
A collection of browser-based tools for collaboratively managing an ocean observatory has been developed and used in the multi-institutional, interdisciplinary Adaptive Sampling and Prediction (ASAP) field experiment, which took place in a 100×100 km region around Monterey Bay in the summer of 2006. The ASAP goal was to optimize data collection and analysis by adapting a 20×40 km array of up to twelve underwater robots sampling to depths of 500 meters. In near real time, researchers assimilated robotic observations into three independent, simultaneously running four-dimensional ocean models, predicting ocean conditions for the robots over the next few days.
ASAP required the continuous participation of numerous researchers located throughout North America. Powerful exercises that guided development of the required collaboration tools were “virtual experiments,” wherein simulated robots sampled a simulated ocean, generating realistic data files that experimenters could visualize and modelers could assimilate. Over the course of these exercises, the Collaborative Ocean Observatory Portal (COOP) evolved, with tools for centralizing, cataloging, and converting observations and predictions into common formats; generating automated comparison plots; querying the data set; and creating and organizing scientific content for the portal. Centralizing data in common formats allowed researchers to manipulate data without relying on the data generators' expertise, and to query data with the Metadata Oriented Query Assistant (MOQuA). Collaborators could produce specialized products and link to them through the collaborative portal, making the experimental process more interdisciplinary and interactive.
Collaboration and data-handling tools will be even more important for future observatories, which will require 24-hour-a-day, 7-day-a-week interactions over many years. As demonstrated in the successful field experiment, these tools allowed scientists to manage an observatory coherently, collaboratively, and remotely. Lessons learned from operating these tools before, during, and after the field experiment provide an important foundation for future collaborative ventures.
Scalable Techniques for Scientific Visualization
Claudio T. Silva, University of Utah
Computers are now extensively used throughout science, engineering, and medicine. Advances in computational geometric modeling, imaging, and simulation allow researchers to build models of increasingly complex phenomena and thus to generate unprecedented amounts of data. These advances require a substantial improvement in our ability to visualize large amounts of data and information arising from multiple sources. Effectively understanding and making use of the vast amounts of information being produced is one of the greatest scientific challenges of the 21st century. Our research at the Scientific Computing and Imaging (SCI) Institute at the University of Utah has focused on innovative, scalable techniques for large-scale 3D visualization. In this talk, I will review the state of the art in high-performance visualization technology, including out-of-core, streaming, and GPU-based techniques that are used to drive a range of display devices, including large-scale display walls. I will conclude with an outline of how large-scale visualization fits into an eScience research agenda.
Keith Grochow, University of Washington
We are designing an oceanographic workbench that contains a suite of features for scientists at UW and MBARI involved with ocean observatories (the NEPTUNE and MARS projects, respectively). At its core is a fast, multi-resolution terrain engine that can incorporate a broad range of bathymetric data sets. Over this we can overlay multiple images, textures, color gradients, and measurement grids to help scientists visualize the observatory site environment. For site management, there is an intuitive drag-and-drop interface to add, position, and determine interactions of instruments, as well as cabling requirements, over time. Site metrics such as cost, power needs, and bandwidth are automatically updated on the screen during these editing sessions. In addition to site management, the system provides a 3D data visualization environment based on the pivot-table model. We allow the user to interactively move between different views of the selected data sets to analyze and visualize information about the site. The system runs on both Windows and Macintosh environments, leverages the advanced graphics capabilities of current hardware, and provides extensions for working with external analysis engines such as Matlab. Initial feedback on this tool has been very positive and we expect to move to broader user trials this fall. We would be happy to give a presentation and demo of the system at the conference.
High-Performance Computing and Visual Interaction with Large Protein Datasets
Amitabh Varshney, University of Maryland
Proteins comprise a vast family of biological macromolecules whose structure and function make them vital to all cellular processes. Understanding the relationship between protein structure and function, and the ability to predict a protein's role given its sequence or structure, is the central problem in proteomics and the greatest challenge for structural biologists in the postgenomic era. The computation and visualization of various protein properties is vital to this effort. We are addressing this challenge using a two-pronged strategy: (a) The emergence of multi-core CPUs and GPUs heralds the beginning of a new era in high-performance parallel computing. Multi-core CPUs and multi-core GPUs provide us with a set of complementary computational models: a traditional von Neumann model and a newer streaming computational model. We have characterized the kinds of applications that are well suited to each and have systematically explored the mapping of computation in one specific domain, proteomics, to each. Our work has focused on mapping various protein properties, such as solvent-accessible surfaces and electrostatics, to the heterogeneous MIMD/SPMD computation pathways of a CPU-GPU commodity cluster environment; (b) We are working on tightly coupling the computation and visualization of large-scale proteins to allow user-assisted computational steering on large-area, high-resolution tiled displays. Because visual comprehension is greatly aided by interactive visualization, abstraction, and lighting, we are also exploring techniques to enhance the comprehensibility of large-scale datasets, including protein datasets. We are currently targeting protein ion channels in our research. Ion channels are a special class of proteins that are embedded in the lipid bilayer of cell membranes and are responsible for a wide variety of functions in humans.
Improper functioning of ion-channels is believed to be the cause behind several ailments including Alzheimer’s disease, stroke, and cystic fibrosis.
Chair: Mark Wilkinson
Semantic Empowerment of Life Science Applications
Amit Sheth, University of Georgia
Life science research today deals with highly heterogeneous as well as massive amounts of data. We can realize the exciting potential of this data if we have more automated ways to integrate and analyze it, leading to insight and discovery: understanding cellular components, molecular functions, and biological processes, and, more importantly, the complex interactions and interdependencies between them.
This talk will demonstrate some of the efforts in:
- building large life science ontologies (GlycO, an ontology for the structure and function of glycopeptides, and ProPreO, an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications
- entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (such as mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data
- semantic web services and registries, leading to better discovery and reuse of scientific tools, and the composition of scientific workflows that process high-throughput data and can be adaptive

Results presented here are from the NSF-funded Semantic Discovery project and the NIH-funded NCRR Integrated Technology Resource for Biomedical Glycomics, in collaboration with the CCRC, UGA. Primary contributors include William S. York, Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas and Cory Henson.
Knowledge For the Masses, From the Masses
Mark Wilkinson, University of British Columbia, Canada
Knowledge Acquisition (KA) has historically been an expensive undertaking, particularly when applied to specific expert domains. Traditional KA methodology generally consists of a trained knowledge engineer working directly with one or more domain experts to encode their specific individual understanding of the domain into a formal logical framework. Since domain experts are expensive, generally have little time for such an exercise, and represent only one viewpoint, we suggest that a more representative knowledge model (ontology) can be constructed more cheaply using a mass-collaborative methodology.
A prototype methodology, the iCAPTURer, was deployed at a cardiovascular and pulmonary disease conference in 2005. The iCAPTURer, and its second generation follow-up, revealed that template-based “chatterbot”-like interfaces could rapidly accumulate and validate knowledge from a large volunteer expert community. A question remained, however, as to the utility and/or quality of the resulting ontology.
Examination of existing standards for ontology evaluation revealed a lack of objective, philosophically grounded, and automatable approaches. As such, it was necessary to design a metric appropriate for evaluating the mass-collaborative ontologies we were creating. In this presentation we will discuss the iCAPTURer mass-collaboration methodology and possible extensions to it. We will then discuss the various categories of ontology evaluation metrics, including a novel epistemologically grounded method developed in our laboratory, and examine the strengths and weaknesses of each. Finally, we will show the results of our evaluation methodology as applied to the Gene Ontology, one of the most widely used ontologies in bioinformatics.
A Data Management Framework for Bioinformatics Applications
Dan Sullivan, Virginia Bioinformatics Institute
The Cyber-infrastructure (CI) group at the Virginia Bioinformatics Institute has established functional CI systems in the areas of bioinformatics and computational biology, with a focus on infectious diseases. Specifically, the CI projects include the Pathogen Portal project, the PathoSystems Resource Integration Center, and the Proteomics Data Center. The bioinformatics resources developed by the CI group include tools for the curation of genomes and PathoSystems, database systems for organizing the high-throughput data generated from the study of PathoSystems biology, and software systems for analysis and visualization of the data. Integration across multiple domains is essential to enhance the functionality of CI systems. To this end, the group has formulated an integration framework based on four dimensions: data flows, schema structures, database models, and levels of systems biology. This presentation focuses on the data flow dimension and describes mechanisms for coordinating the use of multiple sources of data, including database federation, Web services, and client-level integration. It will also include a discussion of data provenance. Examples and use cases are drawn from projects underway at the Virginia Bioinformatics Institute.
Chair: Stuart Ozer
Building a Secure, Comprehensive Clinical Data Warehouse at the Veterans Health Administration
Jack Bates, Veterans Health Administration; Stuart Ozer, Microsoft Research
The U.S. Veterans Health Administration maintains one of the most advanced electronic health records systems in the world (VISTA), spanning a base of over 5 million active patients across a network of over 1,200 clinics and hospitals. This year an enterprise-wide Data Warehouse was launched at the VHA, with a charter to extract historical and daily data from VISTA and other sources and assemble it into a comprehensive database covering all aspects of patient care. Already populated with more than a billion historical vital signs, the warehouse is now being loaded with pharmacy and outpatient-encounter clinical information. New subject areas are being integrated continually, and the warehouse will eventually contain terabytes of data spanning areas as diverse as inpatient and outpatient care and administrative and financial data.
The Data Warehouse is designed to support clinical research, generate national and regional metrics, and improve the quality of care throughout the VHA. In our talk we discuss the database design (our use of multiple star schemas, partitioned fact tables, and conformed dimensions) as well as the common principles we use to extract data from the VISTA system. We will describe the state-of-the-art hardware environment hosting both the large database and the extraction tools. Inevitably, data quality issues are discovered when actual historical data are extracted into the Warehouse for the first time, and we will present examples of how these problems have been resolved. We also review the research opportunities and some of the early results enabled by this environment, and explain how the database design process has been able to accommodate the needs of both researchers and management. There are challenges inherent in bringing sensitive information together into a system that must be accessible for research queries but must also protect patient confidentiality and adhere to HIPAA requirements; we describe how the VHA Data Warehouse has pursued this balance.
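As a toy illustration of the star-schema idea (the table and column names are invented, not the VHA's), a fact table of vital-sign measurements joins to conformed dimension tables to produce a regional metric:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Dimension tables: descriptive attributes, one row per entity.
CREATE TABLE dim_patient (patient_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
-- Fact table: one narrow row per measurement, keyed to the dimensions.
CREATE TABLE fact_vitals (patient_key INTEGER, date_key INTEGER,
                          systolic_bp REAL);
""")
db.executemany("INSERT INTO dim_patient VALUES (?, ?)",
               [(1, "Northeast"), (2, "Northeast"), (3, "Pacific")])
db.executemany("INSERT INTO dim_date VALUES (?, ?)", [(1, 2005), (2, 2006)])
db.executemany("INSERT INTO fact_vitals VALUES (?, ?, ?)",
               [(1, 1, 120.0), (2, 1, 140.0), (3, 2, 130.0)])

# A regional metric: average systolic blood pressure per region.
rows = db.execute("""
    SELECT p.region, AVG(f.systolic_bp)
    FROM fact_vitals f JOIN dim_patient p USING (patient_key)
    GROUP BY p.region ORDER BY p.region
""").fetchall()
```

Because every fact table shares the same (conformed) patient and date dimensions, the same query shape works across subject areas, and the large fact tables can be partitioned by date without touching the dimensions.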
Analysis of Protein Folding Dynamics
David Beck & Catherine Kehl, University of Washington
The Protein Data Bank (PDB) is an important repository of experimentally derived, static protein structures that have stimulated many important scientific discoveries. While the utility of static physical representations of proteins is not in doubt, as these molecules are fluid in vivo, there is a larger universe of knowledge to be tapped regarding the dynamics of proteins. Thus, we are constructing a complementary database composed of molecular dynamics (MD) simulation structures for representatives of all known protein topologies or folds. We are calling this effort Dynameomics. For each fold a representative protein is simulated in its native (i.e., biologically relevant) state and along its complete unfolding pathway. There are approximately 1130 known non-redundant folds, of which we have simulated the first 250, representing about 75% of all known proteins. We are data-mining the resulting 15 terabytes of data (not including solvent) for patterns and general features of protein dynamics and folding across all folds, in addition to identifying important phenomena related to individual proteins. The data are stored in Microsoft SQL Server’s OLAP (On-Line Analytical Processing) implementation, Analysis Services. OLAP’s design is appropriate for modeling MD simulations’ inherently highly multi-dimensional data in ways that traditional relational tables are not. In particular, OLAP databases are optimized for analyses rather than transactions. The multi-dimensional expressions (MDX) query language seems to be well suited for writing complex analytical queries. This application of Microsoft’s OLAP technology is a novel use of traditional financial data management tools in the science sector.
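The multidimensional roll-ups that an OLAP cube optimizes can be illustrated with a toy stand-in. The dimensions and the RMSD measure below are invented placeholders, not the Dynameomics schema; real Analysis Services cubes would be queried in MDX:

```python
# A toy "cube" over MD-simulation data: cells keyed by
# (fold, temperature, time) dimensions, with one numeric measure.
from collections import defaultdict

cells = {  # (fold, temperature_K, time_ns) -> rmsd in angstroms
    ("all-alpha", 298, 1): 1.1, ("all-alpha", 298, 2): 1.3,
    ("all-alpha", 498, 1): 4.0, ("all-beta", 298, 1): 1.6,
}

def roll_up(cells, axis):
    """Average the measure over all dimensions except `axis`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, value in cells.items():
        sums[key[axis]] += value
        counts[key[axis]] += 1
    return {k: sums[k] / counts[k] for k in sums}

by_fold = roll_up(cells, axis=0)   # aggregate away temperature and time
print(by_fold)
```

An OLAP engine pre-aggregates and indexes exactly these kinds of slices so that queries across billions of cells stay interactive.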
Advanced Software Framework for Comparative Analysis of RNA Sequences, Structures and Phylogeny
Kishore Doshi, The University of Texas at Austin; Stuart Ozer, Microsoft Research
A basic principle in Molecular Biology is that the three-dimensional structure of macromolecules such as proteins and RNAs dictates their function. Thus, the ability to predict the structure of an RNA or protein from its sequence represents one of the grand challenges in Molecular Biology today. Comparing RNA sequences from diverse organisms spanning the tree of life has resulted in the extremely accurate determination of some RNA structures. For example, the Ribosomal RNA (rRNA) structures were predicted using fewer than 10,000 sequences. This analysis, while very successful, can be significantly enriched by expanding it to include the 500,000+ Ribosomal RNA sequences which have been identified in Genbank as of August 2006, as well as new sequences which are continually appearing. A significant impediment to analyzing large RNA sequence datasets such as the rRNA is the lack of software tools capable of efficiently manipulating large datasets.
We are developing a comprehensive information technology infrastructure for the comparative analysis of RNA sequences and structures. One of the biggest challenges in developing software for comparative analysis is handling the memory-intensive nature of alignment construction and analysis. In-memory footprints for large RNA sequence alignments can eclipse 50GB in some cases. Our solution is based on a simple concept: co-locate the computational analysis with the data. Using Microsoft SQL Server 2005, T-SQL and C#-based stored procedures, we have successfully prototyped the integration of RNA sequence alignment storage with the most common RNA comparative analysis algorithms in a relational database system. We intend to scale up this prototype into a fully featured public repository and eventually deliver web services for the comparative analysis of RNA sequences. In this talk, we will present a short background on RNA comparative analysis, and then focus on our framework architecture, ending with a brief demonstration of our functional prototype.
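The "co-locate computation with the data" idea can be sketched with SQLite's user-defined functions standing in for SQL Server CLR stored procedures. The `gc_content` routine and table names are illustrative, not part of the project's actual codebase:

```python
# An analysis routine registered inside the database engine, so
# sequences are processed where they are stored rather than being
# shipped out to a separate analysis program.
import sqlite3

def gc_content(seq: str) -> float:
    """Fraction of G/C bases -- a trivial stand-in for a real analysis."""
    return sum(1 for b in seq.upper() if b in "GC") / len(seq)

con = sqlite3.connect(":memory:")
con.create_function("gc_content", 1, gc_content)
con.execute("CREATE TABLE rrna (acc TEXT, seq TEXT)")
con.executemany("INSERT INTO rrna VALUES (?, ?)",
                [("A1", "GGCCAA"), ("A2", "AUAU")])

# The analysis runs inside the query, next to the stored alignments.
rows = con.execute(
    "SELECT acc, gc_content(seq) FROM rrna ORDER BY acc").fetchall()
print(rows)
```

For 50GB alignments the payoff is that the engine's buffer management and parallelism do the heavy lifting, instead of a client process trying to hold the whole alignment in memory.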
Indexing and Visualizing Large Multidimensional Databases
Istvan Csabai, Eötvös Loránd University, Hungary
Scientific endeavors such as large astronomical surveys generate databases on the terabyte scale. These databases, usually multidimensional, must be visualized and mined in order to find interesting objects or to extract meaningful and qualitatively new relationships. Many statistical algorithms required for these tasks run reasonably fast when operating on small sets of in-memory data, but take noticeable performance hits when operating on large databases that do not fit into memory. We utilize new software technologies to develop and evaluate fast multi-dimensional indexing schemes that inherently follow the underlying, highly non-uniform distribution of the data: one of them is hierarchical binary space partitioning; the other is sampled flat Voronoi partitioning of the data.
Our working database is the 5-dimensional magnitude space of the Sloan Digital Sky Survey with more than 250 million data points. We use this to show that these techniques can dramatically speed up data mining operations such as finding similar objects by example, classifying objects or comparing extensive simulation sets with observations. We are also developing tools to interact with the multi-dimensional database and visualize the data at multiple resolutions in an adaptive manner.
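A minimal sketch of the sampled flat Voronoi partitioning idea, under simplifying assumptions (random points standing in for the 5-D magnitude space, and no neighboring-cell search): draw a sample of points as cell centers, bucket every object by its nearest center, and answer find-similar queries by scanning only the query's cell.

```python
# Sampled Voronoi partitioning as a coarse multi-dimensional index.
import random
from collections import defaultdict

def nearest_center(p, centers):
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(p, centers[i])))

random.seed(0)
points = [tuple(random.random() for _ in range(5)) for _ in range(1000)]
centers = random.sample(points, 20)          # sampled cell centers

index = defaultdict(list)                    # Voronoi cell -> members
for p in points:
    index[nearest_center(p, centers)].append(p)

# Find-similar-by-example: only the query's cell is scanned.
query = points[0]
cell = index[nearest_center(query, centers)]
print(len(cell), "candidates instead of", len(points))
```

Because the centers are drawn from the data itself, dense regions get more, smaller cells, which is how the partition adapts to a highly non-uniform distribution.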
Database Support For Unstructured Tetrahedral Meshes
Stratos Papadomanolakis, Carnegie Mellon University
Computer simulation is crucial for numerous scientific disciplines, such as fluid dynamics and earthquake modeling. Modern simulations consume large amounts of complex multidimensional data and produce an even larger output that typically describes the time evolution of a complex phenomenon. This output is then “queried” by visualization or other analysis tools. We need new data management techniques in order to scale such tools to the terabyte data volumes available through modern simulations.
We present our work on database support for unstructured tetrahedral meshes, a data organization typical for simulations. We develop efficient query execution algorithms for three important query types for simulation applications: point, range and feature queries. Point and range queries return one or more tetrahedra that contain a query point or intersect a query range respectively, while feature queries return arbitrarily shaped sets of tetrahedra (such as a mesh surface). We propose Directed Local Search (DLS), a query processing strategy based on mesh topology: we maintain connectivity information for each tetrahedron and use it to “walk” through connected mesh regions, progressively computing the query answer. DLS outperforms existing multidimensional indexing techniques that are based on geometric approximations (like minimum bounding rectangles), because the latter cannot effectively capture the geometric complexity in meshes. Furthermore, DLS can be easily and efficiently implemented within modern DBMSs without requiring exotic new index structures and complex pre-processing.
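The topology walk behind DLS can be illustrated in a deliberately simplified setting: unit grid squares stand in for tetrahedra, and each cell knows its face neighbors. Starting from a seed cell, we repeatedly step to the neighbor whose center is closest to the query point until the current cell contains it. This is a sketch of the idea only, not the paper's algorithm:

```python
# A simplified Directed Local Search on a grid "mesh".
def contains(cell, q):
    cx, cy = cell
    return cx <= q[0] < cx + 1 and cy <= q[1] < cy + 1

def neighbors(cell, n=8):
    cx, cy = cell
    cand = [(cx - 1, cy), (cx + 1, cy), (cx, cy - 1), (cx, cy + 1)]
    return [(x, y) for x, y in cand if 0 <= x < n and 0 <= y < n]

def dls_point_query(seed, q):
    cell, steps = seed, 0
    while not contains(cell, q):
        # Walk through the mesh topology toward the query point.
        cell = min(neighbors(cell),
                   key=lambda c: (c[0] + 0.5 - q[0]) ** 2
                               + (c[1] + 0.5 - q[1]) ** 2)
        steps += 1
    return cell, steps

cell, steps = dls_point_query(seed=(0, 0), q=(6.3, 2.7))
print(cell, steps)
```

The walk touches only cells along the path, which is why connectivity-based search can beat bounding-rectangle indexes that must inspect many false candidates in geometrically complex meshes.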
Building a Data Management Platform for the Scientific and Engineering Communities
José A. Blakeley, Brian Beckman, Microsoft; Tamás Budavári, The Johns Hopkins University; Gerd Heber, Cornell University
The convergence of database systems, file systems and programming language technologies is blurring the lines between records and files, directories and tables, and programs and query languages that deal with in-memory arrays as well as with persisted tables. Relational database systems have been extended to support XML, large binary objects directly as files, and are incorporating runtime systems (such as Java VM and .NET CLR) to enable scientific models, programs and libraries (such as LAPACK) to run close to the data. Scientific file formats such as HDF5 and NetCDF define their content using higher level semantic models (such as UML and Entity Relationship). Programming languages are incorporating native, declarative, set-oriented query capabilities (such as LINQ/XLINQ), which will enable support for cost-based query optimization techniques. Programming languages are also integrating transactions with exception handling to enable more reliable programming patterns. Practitioners have learned that neither file aggregates (HDF, NetCDF) nor RDBMS alone present a one-size-fits-all solution to the most common data management problems facing the scientific and engineering communities. However, the convergence of the technologies mentioned offers a unique opportunity to build a data management and data integration platform that will embrace their strengths, creating new paradigms that will revolutionize scientific programming and data modeling in the next decade. Based on our combined experience in building an industry-leading relational DBMS and use cases drawn from typical scientific and engineering applications in astronomy and computational materials science, we propose the architecture of a unified data management platform for the computational science and engineering communities.
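The language-integrated query idea (LINQ) can be approximated in Python terms; the catalog records below are invented for illustration. The point is that the query is a declarative, set-oriented expression in the host language, which is what makes cost-based optimization and push-down to a store possible in principle:

```python
# A declarative, set-oriented query over in-memory records.
from dataclasses import dataclass

@dataclass
class Observation:
    obj_id: int
    band: str
    magnitude: float

catalog = [Observation(1, "r", 17.2), Observation(1, "g", 18.0),
           Observation(2, "r", 21.5), Observation(3, "r", 16.4)]

# Say *what* you want, not how to loop over files or tables.
bright_r = sorted(o.obj_id for o in catalog
                  if o.band == "r" and o.magnitude < 20.0)
print(bright_r)
```

The same expression shape applies whether the data lives in memory, in an HDF5 aggregate, or in a relational table, which is exactly the convergence the talk argues for.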
Chair: Winston Tabb
Next-Generation Implications of Open Access
Paul Ginsparg, Cornell University
True open access to scientific publications not only gives readers the possibility to read articles without paying a subscription, but also makes the material available for automated ingestion and harvesting by 3rd parties. Once articles and associated data become universally treatable as such computable objects, openly available to 3rd party aggregators and value-added services, what new services can we expect, and how will they change the way that researchers interact with their scholarly communications infrastructure? I will discuss straightforward applications of existing ideas and services, including clustering, citation analysis, collaborative filtering, external database linkages, and other forms of automated markup, and then will speculate on as yet unrealized modes of harvesting and creating new knowledge.
Long Term Data Storage
Paul Ginsparg, Cornell University
In August 2006 NASA announced it had lost the original moon landing video transmissions, dramatizing the risks for long term storage of data. Perhaps less noticed than the conventional problem of misplacing some boxes was that “The only known equipment on which the original analogue tapes can be decoded is at a Goddard centre set to close in October, raising fears that even if they are found before they deteriorate, copying them may be impossible” (Sydney Morning Herald). Today the risk that a format will become obsolete, or that nobody will remember what the data format is, exceeds the risk that a box of stuff will be lost. The quantity of data spewing from electronic sensors, and the storage of this data in formats not intelligible by humans, make long-term preservation something to consider from the beginning. The tendency for data to be stored in projects with short-term funding rather than institutions which accept long term responsibility is not helping. Against that, we have the great advantage that any digital copy is equivalent for future use. What should be done? Technical suggestions might include: 1) University libraries and archives taking a greater role in data storage; 2) Encouraging public data standards for complex data; 3) Expanding efforts like LOCKSS and encouraging their diversification into data storage as well as journal and book preservation; 4) Agreeing on a formal description of a query language so that websites representing the “dark web” can provide a machine-interpretable and standardized explanation of what kind of queries they accept. Perhaps more important, however, are some non-technical issues, such as agreeing on a formal description for digital rights management controls and creating a conference and/or journal devoted to scientific triumphs found by analyzing old data to raise the scholarly interest and prestige of data preservation.
Digital Data Preservation and Curation: A Collaboration Among Libraries, Publishers and the Virtual Observatory
Robert Hanisch, Space Telescope Science Institute
Astronomers are producing and analyzing data at ever more prodigious rates.
NASA’s Great Observatories, ground-based national observatories, and major survey projects have archive and data distribution systems in place to manage their standard data products, and these are now interlinked through the protocols and metadata standards agreed upon in the Virtual Observatory (VO).
However, the digital data associated with peer-reviewed publications is only rarely archived. Most often, astronomers publish graphical representations of their data but not the data themselves. Other astronomers cannot readily inspect the data to either confirm the interpretation presented in a paper or extend the analysis. Highly processed data sets reside on departmental servers and the personal computers of astronomers, and may or may not be available a few years hence.
We are investigating ways to preserve and curate the digital data associated with peer-reviewed journals in astronomy. The technology and standards of the VO provide one component of the necessary technology. A variety of underlying systems can be used to physically host a data repository, and indeed this repository need not be centralized. The repository, however, must be managed and data must be documented through high quality, curated metadata. Multiple access portals must be available: the original journal, the host data center, the Virtual Observatory, or any number of topically-oriented data services utilizing VO-standard access mechanisms.
Chair: Shirley Cohen
Automation of Large-scale Network-Based Scientific Workflows
Mladen A. Vouk, North Carolina State University
Comprehensive, end-to-end, data and workflow management solutions are needed to handle the increasing complexity of processes and data volumes associated with modern distributed scientific problem solving, such as ultra-scale simulations and high-throughput experiments. The key to the solution is an integrated network-based framework that is functional, dependable, fault-tolerant, and supports data and process provenance.
Such a framework needs to make application workflows dramatically easier to develop and use, so that scientists’ efforts can shift away from data management and application development to scientific research and discovery. An integrated view of these activities is provided by the notion of Scientific Workflows—a series of structured activities and computations that arise in scientific problem-solving. This presentation discusses long-term practical experiences of the U.S. Department of Energy Scientific Data Management Center with automation of large scientific workflows using modern workflow support frameworks. Several case studies in the domains of astrophysics, fusion, and bioinformatics that illustrate the reusability, substitutability, extensibility, customizability, and composability principles of scientific process automation are discussed. Solution fault-tolerance, ease of use, data and process provenance, and framework interoperability are given special attention. Advantages and disadvantages of several existing frameworks are compared.
Using Flowcharts to Script Scientific Workflows
Furrukh Khan, The Ohio State University
We note that the flowchart is a fundamental artifact in scientific simulation code. Unfortunately, even though the flowchart is initially used by scientists to model the simulation, it is not preserved as an integral part of the code. We argue that by mapping flowcharts to workflows and leveraging Microsoft Workflow Foundation (WF), the flowchart can be separated out of the implementation code as a “first class” citizen. This separation can have a profound impact on the future maintainability and transparency of the code. Furthermore, WF provides the components required by scientists to build systems for dynamically visualizing, monitoring, tracing, and altering the simulations. We also note that projects for developing, running, and maintaining complex scientific simulations are often based on distributed teams. These collaborations not only involve human-to-human workflows but also scenarios where the low-lying simulation flowcharts (separated out as first class citizens) take part in higher level human workflows. We argue that the current version of Microsoft SharePoint Server, with integral support for WF, serves as an ideal portal for these collaborations. It provides scientists with services such as security, role-based authentication, team membership, discussion lists and implementation of member-to-member workflows. Furthermore, by using the Microsoft technology Windows Communication Foundation (WCF), systems can be built that securely connect the low-lying simulation workflows (running as WCF Web Services) to high level human workflows so that simulations can be visualized within the context of SharePoint. We also show that Atlas, another Microsoft technology, can be used in synergy with WF to provide highly responsive, platform-agnostic (Windows, Linux, Mac) browser-based smart clients in the context of SharePoint.
We give examples and preliminary results from the computational electromagnetics domain based on our recently started project in collaboration with the ElectroScience Laboratory at the Ohio State University.
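The "flowchart as a first-class citizen" idea can be sketched in miniature: the simulation's control flow is data (a dict of named steps) executed by a generic engine, so it can be inspected, traced, or altered without touching the step implementations. The step names and the toy physics are invented for illustration:

```python
# Control flow as data: each step returns the name of the next node.
def initialize(state):
    state["field"] = 1.0
    return "step"

def step(state):
    state["field"] *= 0.5              # toy "simulation" update
    state["t"] = state.get("t", 0) + 1
    return "check"

def check(state):
    return "done" if state["t"] >= 3 else "step"

flowchart = {"initialize": initialize, "step": step, "check": check}

def run(flowchart, start="initialize"):
    state, node, trace = {}, start, []
    while node != "done":
        trace.append(node)             # monitoring/tracing for free
        node = flowchart[node](state)
    return state, trace

state, trace = run(flowchart)
print(state["field"], trace[:4])
```

Because the flowchart is an inspectable object rather than buried control flow, a portal could render it, show the live trace, or swap a node while the simulation runs, which is the role WF plays in the actual system.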
Scientific Workflows: More e-Science Mileage from Cyberinfrastructure
Bertram Ludaescher, University of California, Davis
We view scientific workflows as the domain scientist’s way to harness cyberinfrastructure for e-Science. Through various collaborative projects over the last couple of years, we have gained first-hand experience in the challenges faced when trying to realize the vision of scientific workflows. Domain scientists are often interested in “end-to-end” frameworks which include data acquisition, transformation, analysis, visualization, and other steps. While there is no lack of technologies and standards to choose from, a simple, unified framework combining data and process-oriented modeling and design for scientific workflows has yet to emerge.
Using experiences from continuing projects as well as from a recently awarded collaboration with a leading group in ChIP-chip analysis workflows (Chromatin ImmunoPrecipitation followed by genomic DNA microarray analysis), we highlight the requirements and design challenges typical of many large-scale bioinformatics workflows: Raw and derived data products come in many forms and from different sources, including custom scripts and specialized packages, e.g., for statistical analysis or data mining. Not surprisingly, the process integration problems are not solved by “making everything a web service”, nor are the data integration problems solved by “making everything XML”. The real workflow challenges are more intricate and will not go away by the adoption of any easy, one-size-fits-all silver-bullet solution or standard. The problems are further compounded by the scientists’ need to compare results from multiple workflow runs, employing various alternative (and often brand-new) analysis methods, algorithms, and parameter settings.
We describe ongoing work to combine various concepts and techniques (models of computation and provenance, actor- and flow-oriented programming, higher-order components, adapters, and hybrid types) into a coherent overall framework for collection-oriented scientific workflow modeling and design. The initial focus of our work is not on optimizing machine performance (e.g., CPU cycles or memory resources), but on optimizing a more precious resource in scientific data management and analysis: human (i.e., scientists’) time.
Chair: Yan Xu
Building Lab Information Management Systems
Qi Sun, Cornell University
At the bioinformatics core-facility for Cornell University, we are managing data for multiple genomics and proteomics laboratories. In the last few years, we have established a working model of using Microsoft SQL Server as the database system, ASP.NET as the user interface and Windows Compute Cluster as the data analysis platform. Here we will present two lab information management systems (LIMS), Pathogen Tracker (http://www.pathogentracker.net) and PPDB (http://ppdb.tc.cornell.edu), representing two of the fastest growing biological research fields: genetic diversity and proteomics. The Pathogen Tracker software is a collaboration with the Cornell Food Safety Laboratory. It includes a database and an ASP.NET web application written with Visual Basic 2005. It is being used as a tool for information exchange on bacterial subtypes and strains and for studies on bacterial biodiversity and strain diversity.
The system has a user management system, and enables the research community to contribute their data to this database through the web, allows open data exchange, and facilitates large scale analyses and studies on bacterial biodiversity. PPDB is a LIMS for managing mass spectrometry proteomics data; it is developed with Dr. Klaas van Wijk’s proteomics laboratory. The web interface we designed makes it easier for users to integrate and compare data from multiple sources. We also take advantage of the graphic library that comes with Visual Studio 2005 for generating on-the-fly images in the 2-D gel data navigation tool.
Hong Guo, McGill University
One of the most important branches of nano-science and nanotechnology research is nano-scale electronics, or nanoelectronics. Nanoelectronic devices operate by principles of quantum mechanics; their properties are closely related to the atomic and molecular structure of the device. It has been a great challenge to predict nano-scale device characteristics, especially if one wishes to predict them without using any phenomenological parameter. To advance nanoelectronic device technology, an urgent goal is to develop computational tools which can make quantitative, accurate, and efficient calculations of nanoelectronic systems from quantum mechanical first principles.
In this presentation, I will briefly review the present status of nanoelectronic device theory, the existing theoretical, numerical and computational difficulties, and some important problems of nanoelectronics. I will then report particularly useful progress we have achieved toward quantitative predictions of non-equilibrium and non-linear charge/spin quantum transport in nanoelectronic devices from an atomic point of view. Quantitative comparisons to measured experimental data will be presented. Several examples will be given, including electric conduction in nano-wires and magnetic switching devices. Finally, I will briefly outline the existing challenges of computational nanoelectronics and the development of computational tools powerful enough for nanoelectronics design automation.
MotifSpace: Mining Patterns in Protein Structures
Wei Wang, University of North Carolina
One of the next great frontiers in molecular biology is to understand and predict protein function. Proteins are simple linear chains of polymerized amino acids (residues) whose biological functions are determined by the three-dimensional shapes that they fold into. Hence, understanding proteins requires a unique combination of chemical and geometric analysis. A popular approach to understanding proteins is to break them down into structural sub-components called motifs. Motifs are recurring structural and spatial units that are frequently correlated with specific protein functions. Traditionally, the discovery of motifs has been a laborious task of scientific exploration.
In this talk, I will present an eScience project MotifSpace, which includes recent data-mining algorithms that we have developed for automatically identifying potential spatial motifs. Our methods automatically find frequently occurring substructures within graph-based representations of proteins. We represent each protein’s structure as a graph, where vertices correspond to residues. Two types of edges connect residues: sequence edges connect pairs of adjacent residues in the primary sequence, and proximity edges represent physical distances, which are indicative of intra-molecular interactions. Such interactions are believed to be key indicators of the protein’s function.
This representation allows us to apply innovative graph mining techniques to explore protein databases and associated protein families. The complexity of protein structures and corresponding graphs poses significant computational challenges. The kernel of MotifSpace is an efficient subgraph-mining algorithm that detects all (maximal) frequent subgraphs from a graph database with a user-specified minimal frequency. Our algorithm uses the pattern growth paradigm with an efficient depth-first enumeration scheme, searching through the graph space for frequent subgraphs. Our most recent algorithms incorporate several improvements that take into account specific properties of protein structures.
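A deliberately simplified miner in the spirit described above: real pattern-growth algorithms enumerate subgraphs depth-first, but the depth-1 case, counting labeled edges that recur across a graph database, is enough to show the support-counting idea. The residue labels and toy graphs are invented:

```python
# Frequent labeled-edge mining over a small "protein graph" database.
from collections import defaultdict

# Each graph: residue labels plus edges between residue indices
# (sequence or proximity edges, undistinguished in this sketch).
graphs = [
    {"labels": ["ALA", "GLY", "CYS"], "edges": [(0, 1), (1, 2)]},
    {"labels": ["GLY", "CYS", "ALA"], "edges": [(0, 1), (2, 0)]},
    {"labels": ["ALA", "GLY"],        "edges": [(0, 1)]},
]

def frequent_edges(graphs, min_support):
    support = defaultdict(int)
    for g in graphs:
        seen = set()
        for u, v in g["edges"]:
            # Canonical label pair, counted once per graph (support).
            seen.add(tuple(sorted((g["labels"][u], g["labels"][v]))))
        for pattern in seen:
            support[pattern] += 1
    return {p: s for p, s in support.items() if s >= min_support}

print(frequent_edges(graphs, min_support=3))
```

A pattern-growth algorithm starts from exactly these frequent single edges and extends each one depth-first, pruning any extension whose support falls below the threshold.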
Chair: Tony Tyson
Physical Science, Computational Science and eScience: the Strategic Role of Interdisciplinary Computing
Tim Clark, Harvard University
Data in the physical and life sciences is being accumulated at an astonishing and ever-increasing rate. The so-called “data deluge” has already outpaced scientists’ ability to exploit the wealth of information at their disposal. In order to make progress, scientists, as they have in the past, need to ask specific, well-formulated questions of the data. Many of these questions now require an unprecedented amount and variety of computing to answer, and many of the computational challenges are shared between seemingly disparate scientific disciplines. Therefore, to achieve a leadership position in many sciences today requires a strong interdisciplinary collaboration between experts in the scientific and computational disciplines, supported by an advanced computing infrastructure and skilled personnel.
Harvard’s Initiative in Innovative Computing (IIC) was launched through the Provost’s Office in late 2005 to enable the rapid expansion of advanced interdisciplinary work in scientific computing at Harvard. It aims to establish a robust yet flexible frame for research at the creative intersection between the computing disciplines and the sciences. The IIC’s research agenda encompasses a diverse array of innovative projects designed to push the boundaries of both computing and science. These projects are proposed by, and carried out in close collaboration with, researchers throughout Harvard. To keep the IIC’s agenda current, projects have a limited duration, and new ones are periodically solicited, reviewed, and added. The IIC will continuously generate and exploit some of the most exciting, meaningful opportunities for new discoveries in contemporary science.
This talk will explore some of the strategic implications and challenges of developing a program like IIC, why it is mandatory for achieving leadership in many scientific disciplines, and share some lessons learned.
Cleaning Scientific Data Objects
Dongwon Lee, The Pennsylvania State University
Real scientific data are often dirty, either syntactically or semantically. Despite active research on integrity constraint enforcement and data cleaning, real data in real scientific applications are still dirty. Issues like heterogeneous formats of modern data, imperfect software to extract metadata, demand for large-scale scientific processing, and the lack of useful cleaning tools or system support make the problem only harder. When the base data are dirty, one cannot avoid the so-called “garbage-in, garbage-out” phenomenon. Therefore, improving the quality of the data objects has direct impacts and implications in many scientific applications.
In this talk, in the context of the Quagga project which I am leading, I will present various dirty (meta-) data problems drawn from real-world cases and their potential solutions. In particular, I’ll present my recent work on: (1) a scalable group linkage technique to identify duplicate data objects fast, (2) effective scientific data cleaning by Googling, (3) value imputation on microarray data sets, and (4) semantically-abnormal data detection (e.g., detecting fake conferences and journals).
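A minimal sketch of similarity-based duplicate grouping, the problem the group linkage work scales up: records whose token sets exceed a Jaccard-similarity threshold are linked, and linked records form one cluster. The citation strings and the 0.6 threshold are invented for illustration:

```python
# Jaccard-similarity duplicate grouping with a tiny union-find.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

records = [
    "J Smith data cleaning survey 2004",
    "J. Smith data cleaning survey 2004",
    "A Jones microarray imputation 2003",
]

def group_duplicates(records, threshold=0.6):
    parent = list(range(len(records)))  # union-find, no path compression
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)   # link the two groups
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

print(group_duplicates(records))
```

The all-pairs loop is quadratic; the scalability contribution of real group linkage lies precisely in avoiding that comparison of every record against every other.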
Quagga project: http://pike.psu.edu/quagga
Part of this research was supported by a Microsoft SciData award in 2005.
Some Classification Problems from Synoptic Sky Surveys
S. George Djorgovski et al, California Institute of Technology
Analysis of data from modern digital sky surveys (individual or federated within the VO) poses a number of interesting challenges.
This is especially true for the new generation of synoptic sky surveys, which repeatedly cover large areas of the sky, producing massive data streams and requiring a self-federation of moderately heterogeneous data sets. One problem is an optimal star-galaxy classification using data from multiple passes, and incorporating external, contextual, or a priori information. Some problems require a very demanding real-time analysis, e.g., an automated robust detection and classification of transient events, using relatively sparse and heterogeneous data (a few data points from the survey itself, plus information from other, multiwavelength data sets covering the same location on the sky); and a dynamical version of this process which iterates the classification as follow-up data are harvested and incorporated in the analysis.
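The iterative classification described above, revising an event's classification as each sparse follow-up measurement arrives, can be sketched as a sequential Bayesian update. The two classes and their likelihoods below are invented placeholders, not the survey's actual model:

```python
# Sequential Bayesian update of a transient-event classification.
def normalize(p):
    total = sum(p.values())
    return {k: v / total for k, v in p.items()}

# P(observation | class) for two toy observable features.
likelihood = {
    "supernova":     {"fading": 0.8, "blue": 0.6},
    "variable_star": {"fading": 0.3, "blue": 0.2},
}

belief = normalize({"supernova": 0.5, "variable_star": 0.5})  # prior
for obs in ["fading", "blue"]:   # follow-up data harvested over time
    belief = normalize({c: belief[c] * likelihood[c][obs]
                        for c in belief})
    print(obs, "->", belief)
```

Each new data point, whether from the survey itself or from a federated multiwavelength archive, multiplies in and renormalizes, so the classification sharpens dynamically as follow-up observations are incorporated.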
We will illustrate these challenges with examples from the ongoing Palomar-Quest survey, but they will become increasingly critical with the advent of even more ambitious projects such as PanSTARRS and LSST. We will also discuss some other, general issues posed by the scientific exploration of such rich data sets.
Chair: Tony Tyson
Infrastructure to Support New Forms of eScience, Publishing and Digital Libraries
Carl Lagoze, Cornell University
We are in the midst of radical changes in the way that scholars produce, share, and access the results of their work and that of their colleagues. High speed networking and computing combined with newly emerging collaborative tools will enable a new scholarly communication paradigm that is more immediate, distributed, data-centric, and dynamic. These new tools are essential for science as it confronts rapidly emerging problems such as global warming and pandemics.
In our research we are investigating infrastructure to support this new paradigm. This infrastructure allows the flexible composition of information units (such as text, data sets, and images) and distributed services for the formation of new types of scholarly results and new methods of collaborating. In this talk we will describe several components of this work:
- Fedora is open-source middleware supporting the representation, management, and dissemination of complex objects and their semantic relationships. These objects can combine distributed content, data, and web services. Fedora is the foundation for a number of international eScience initiatives including the Public Library of Science (PLOS), eSciDoc at the Max Planck Society, and the DART project in Australia.
- An Information Network Overlay is an abstraction for building innovative digital libraries that integrate selected networked resources and services, and provide the context for reuse, annotation, and refactoring of information within them. This architecture forms the basis of the NSF-funded National Science Digital Library (NSDL).
- The Repository Interoperability Framework (RIF), an outgrowth of the NSF-funded Pathways project, is developing standards to support the sharing of information units (such as data, images and content) among heterogeneous scholarly repositories. The core of this work is the articulation of a common data model to represent complex digital objects and service interfaces that allow sharing of information about these digital objects among repositories and clients.
The Scientific Paper of the Future
Timo Hannay, Nature Publishing Group
The emergence of online editions of scientific journals has produced huge benefits by making the literature searchable, interlinked, and available directly from scientists’ desktops. Yet this development only scratches the surface of the internet’s potential to revolutionize scientific communication. At Nature Publishing Group (NPG) we often think of these opportunities in terms of the following ‘5Ds’:
- Data Display: Figures no longer need to be static, but can become manipulable and interactive, and can provide readers with direct access to the underlying data.
- Dynamic Delivery: The same information does not need to be delivered to every person each time, but can instead be tailored to a user’s specific interests and immediate needs.
- Deep Data: Journals need to become better integrated with scientific databases (and in some ways ought to become more like databases too).
- Discussion & Dialogue: The web is a many-to-many network that enables direct discussion between readers, as well as modes of interaction that are more immediate and informal than the traditional publishing process allows.
- Digital Discovery: Scientific information in an online world needs to be made useful not only to readers but also to software and other websites. Only in this way will the information become optimally useful to humans.
This presentation will summarize current activities in this area, inside NPG and elsewhere, and will look at where future trends might take us.
The Connection Between Scientific Literature and Data in Astronomy
Michael J. Kurtz, Harvard-Smithsonian Center for Astrophysics
For more than a century, journal articles have been the primary vector carrying scientific knowledge into the future; during the same period, scientists have created and maintained complex systems of archives preserving the primary information of their disciplines.
Modern communications and information processing technologies are enabling a synergism between the (now fully digital) archives and journals which can have profound effects on the future of research.
During roughly the last 20 years, astronomers have simultaneously built new digital systems for data and for literature, and have been merging these systems into a coherent, distributed whole.
Currently the system consists of a network of journals, data centers, and indexing agencies, which interact through large-scale sharing of metadata among organizations. The system has been in active use for more than a decade; Peter Boyce named it Urania in 1997.
Astronomers are now on the verge of a major expansion of these capabilities. Besides ongoing improvements in the capabilities and interactions of existing organizations, this expansion will entail the creation of new archiving and indexing organizations, as well as a new international supervisory structure for the development of metadata standards. The nature of scientific communication is clearly being changed by these developments, and with these changes will come others, raising questions such as: How will information be accessed? How will the work of individual scientists be evaluated? How will the publishing process be funded?