eScience Workshop 2006


The Microsoft eScience Workshop at Johns Hopkins University

Microsoft Research hosted a three-day eScience workshop on October 13-15, 2006 in the Bloomberg Center of The Johns Hopkins University in Baltimore, Maryland.

The workshop provided a unique opportunity to share experiences, learn new techniques and influence the domain of scientific computing. It explored the evolution, challenges and potential of computing in scientific research, including how the latest tools, web services and database technologies are being applied to scientific computing. By providing a forum for scientists and researchers to share their experience and expertise with the wider academic and research communities, this workshop aimed to foster collaboration, facilitate the sharing of software components and techniques, and influence the development of Microsoft technologies for data-intensive scientific computing.

Specific areas of interest

  • Novel scientific applications using information technologies
  • Web service-based applications
  • Science data analysis, mining, and visualization
  • Smart clients and novel user interfaces for scientists
  • Healthcare Informatics
  • Scientific workflow management
  • eScience interdisciplinary curriculum development
  • Innovations in publishing scientific literature, results, and data

Keynote Speakers

Dr. Leroy Hood

Dr. Leroy Hood, President, Institute for Systems Biology (ISB)

In 2000, Dr. Hood co-founded the Institute for Systems Biology in Seattle, Washington, to pioneer systems approaches to biology and medicine. Most recently, Dr. Hood’s lifelong contributions to biotechnology earned him the prestigious 2004 Association for Molecular Pathology (AMP) Award for Excellence in Molecular Diagnostics. He has published more than 500 peer-reviewed papers, received 14 patents, and co-authored textbooks in biochemistry, immunology, molecular biology, and genetics. He is a member of the National Academy of Sciences, the American Philosophical Society, the American Academy of Arts and Sciences, and the Institute of Medicine. Dr. Hood has also played a role in founding numerous biotechnology companies, including Amgen, Applied Biosystems, Systemix, Darwin and Rosetta.

Dr. Jim Ostell

Dr. Jim Ostell, Chief of the Information Engineering Branch at the National Center for Biotechnology Information (NCBI) at the National Institutes of Health

Dr. Ostell was one of only twelve tenured NIH scientists to be appointed in 1996 to the Senior Biomedical Research Service. Under his direction, the NCBI Information Engineering Branch has produced a central computer infrastructure for biomedical information, covering the published literature, DNA and protein sequences, three-dimensional structures of biological molecules, assemblies of complete organism genomes, human genetics and phenotypes, and more. More than 2 million unique users a month use the NCBI online services, and the NCBI user community has grown from a base of molecular biology researchers to include physicians, educators, and the general public. Some of the best-known resources provided by NCBI include GenBank, Entrez, PubMed, BLAST, dbEST, UniGene, dbSNP, LocusLink, RefSeq, Human Genome Resources, and many others.

Dr. Alexander Szalay

Dr. Alexander Szalay, Alumni Centennial Professor, Department of Physics and Astronomy, Johns Hopkins University

Dr. Alexander Szalay spent over ten years working on the Sloan Digital Sky Survey (SDSS), the most ambitious astronomical survey ever undertaken. When completed, it will provide detailed optical images covering more than a quarter of the sky and a three-dimensional map of about a million galaxies and quasars. As the survey progresses, the data are released to the scientific community and the general public in annual increments. His interests are theoretical astrophysics and galaxy formation. His research includes the multicolor properties of galaxies, galaxy evolution, the large-scale power spectrum of fluctuations, gravitational lensing, and pattern recognition and classification problems.

Dr. Tony Hey

Dr. Tony Hey, Corporate Vice President for Technical Computing, Microsoft

Dr. Hey is one of the pre-eminent researchers in the field of parallel computing, most recently as director of the United Kingdom’s ambitious e-Science Initiative. He reports directly to Craig Mundie, Microsoft chief technical officer and senior vice president for Advanced Strategies and Policy, and works across the company to coordinate Microsoft’s efforts to collaborate with the scientific community worldwide. He is a fellow of the U.K.’s Royal Academy of Engineering and has been a member of the European Union’s Information Society Technology Advisory Group. He has also served on several national committees in the U.K., including committees of the U.K. Department of Trade and Industry and the Office of Science and Technology. In addition, Hey has advised countries such as China, France, Ireland and Switzerland to help them advance their scientific agendas and become more competitive in the global technology economy.

Abstracts 10/13

Water Science: Schafler Auditorium

Chair: Catharine van Ingen

Stream Scouts: A Tiered Smart Client System for Annotating the Land-Water Interface

Piotr Parasiewicz & Chris Pal, University of Massachusetts at Amherst

We have developed a smart client system that will facilitate ad-hoc classification and validation of hydraulic features on very recent, high-resolution aerial photography of streams and rivers. It is geared toward aspects relevant to the protection of both human uses (i.e., drinking water quality, hydropower, flood protection) and ecological status. Specifically, we extend our previous setup of small handheld Pocket PC devices used to take field measurements to form ad-hoc wireless networks and communicate with Tablet PC servers located nearby on a boat or on shore. A heavyweight desktop server in a distant location hosts the river-habitat database, runs complex aquatic habitat simulation models and management applications over the internet, and facilitates data exchange with the Tablet PC. This system supports semi real-time simulations of aquatic habitats and creation of a self-learning database, enabling the development of algorithms for classification of critical habitat features from aerial photographs.

This project is a natural extension of many issues encountered in the TerraServer project, extending methods from computational geography to the domain of computational hydrology and computational ecology, through enabling the annotations necessary for simulating the dynamics of a watershed. Such a system is in high demand among environmental scientists and resource managers.

River Basin Scale Water-Quality Modeling using the CUAHSI Hydrologic Information System

Jonathan Goodall & Song Qian, Duke University

The Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) is a partnership of over 100 universities in the United States. Informatics is a pillar of the CUAHSI vision, and for the past three years a team of hydrologists and computer scientists has been working together to prototype a Hydrologic Information System to support river basin scale hydrologic science and management.

The Hydrologic Information System consists of (1) standard signatures for hydrologic data delivery web services, (2) a standard database schema for hydrologic observations, and (3) ontologies for relating water quality parameters collected and maintained by different federal, state, and local agencies. The eventual goal of this system is to make possible hydrologic assessments and models previously too complex for individual scientists to undertake. The Hydrologic Information System is therefore a means to an end: the cyberinfrastructure necessary to advance scientific understanding. Accordingly, we will present a large-scale water quality model that utilizes various components of the Hydrologic Information System.

This water quality model has been applied to major river basins in the United States (Chesapeake Bay, Mississippi-Missouri, etc.), but was previously limited to modeling long-term averages in nutrient loadings.

With the Hydrologic Information System, it is now feasible to gather, integrate, and summarize the Nation’s water quality records to support a temporally dynamic version of the model. This new version of the model improves our understanding of how landscape changes impact water quality and, just as important, provides evidence of the ultimate impact of water resources management decisions on improving our Nation’s water quality.

Space-Time Series of MODIS Snow Cover Products for Hydrologic Science

Jeff Dozier & James E. Frew, University of California Santa Barbara

The Moderate-Resolution Imaging Spectroradiometer (MODIS) flies on two NASA/EOS satellites, each imaging most of the Earth every day: Terra in the morning, Aqua in the afternoon. MODIS has 36 spectral bands covering wavelengths from 0.4 to 14.4 µm: 2 at 250 m spatial resolution, 5 at 500 m, and 29 at 1 km. Using reflectance values from the 7 “land” bands with 250 or 500 m resolution, along with a 1 km cloud product, we estimate the fraction of each 500 m pixel that snow covers, along with the albedo (reflectance) of that snow. Such products are then used in hydrologic models in several mountainous basins. The daily products have glitches. Sometimes the sensor cannot view the surface because of cloud cover, and even in the absence of clouds, an off-nadir view in a vegetated area “sees” less ground area than a nadir view. Therefore, we must use the daily time series in an intelligent way to improve the estimate of the measured snow properties for a particular day. We consider two scenarios: one is the “forecast” mode, whereby we use the past, but not the future, to estimate the snow-covered area and albedo on that day; the other is the “retrospective” mode, whereby in the summer after the snow is gone we reconstruct the history of the snow properties for that water year.

This space-time interpolation presents both scientific and data management challenges. The scientific question is: how do we use our knowledge of viewing geometry, snow accumulation and ablation, along with available ground data, to devise a scheme that is better than generic multidimensional interpolation? The data management involves large three-dimensional objects, identification of erroneous data, and keeping track of the lineage of the way a set of pixel values has been interpreted.
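The two modes can be sketched as follows. This is a minimal illustration assuming daily snow-covered fractions with cloudy days stored as NaN; the smoothing weight and linear interpolation are stand-ins for the more informed scheme the abstract calls for, not the authors' actual method.

```python
import numpy as np

def forecast_fill(values, weight=0.5):
    """Forecast mode: estimate each day's snow fraction using only past
    observations (no look-ahead). NaN marks days where cloud cover hid
    the surface. `weight` is an assumed smoothing constant."""
    est = np.full(len(values), np.nan)
    last = np.nan
    for i, v in enumerate(values):
        if not np.isnan(v):
            # blend the new observation with the running estimate
            last = v if np.isnan(last) else weight * v + (1 - weight) * last
        est[i] = last
    return est

def retrospective_fill(values):
    """Retrospective mode: after the season ends, interpolate gaps
    linearly using observations on both sides."""
    vals = np.asarray(values, dtype=float)
    idx = np.arange(len(vals))
    good = ~np.isnan(vals)
    return np.interp(idx, idx[good], vals[good])

# daily snow-covered fraction, with cloudy days as NaN
series = [0.8, np.nan, np.nan, 0.5, 0.4, np.nan, 0.2]
print(forecast_fill(series))       # gaps carry the last estimate forward
print(retrospective_fill(series))  # gaps filled from both sides
```

The real scheme would additionally weight observations by viewing geometry and known accumulation/ablation behavior, which is exactly the scientific question the abstract poses.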

Web Services for Unified Access to National Hydrologic Data Repositories

I. Zaslavsky, D. Valentine, B. Jennings, UCSD; D. Maidment, University of Texas – Austin

The CUAHSI hydrologic information system (HIS) is designed to be a multi-tier network of grid nodes for publishing, accessing, querying, and visualizing distributed hydrologic observation data for any location or region in the United States. The core of the system is web services that provide uniform programmatic access to heterogeneous federal data repositories as well as researcher-contributed observation datasets.

The currently available second generation of services supports data and metadata discovery and retrieval from USGS NWIS (streamflow, groundwater, and water quality repositories), DAYMET daily observations, NASA MODIS, and Unidata NAM streams, with several additional web service wrappers being added (EPA STORET, NCDC ASOS, USGS NAWQA). Accessed from a single discovery interface developed as an ASP.NET application over ESRI’s ArcGIS Server, the web services support comprehensive hydrologic analysis at the catchment, watershed, and regional levels.

Different repositories of hydrologic data use different vocabularies, and support different types of query access. Resolving the semantic and structural heterogeneities and distilling a generic set of service signatures is one of the main scalability challenges in this project, and a requirement in our web service design. To accomplish the uniformity of the web services API, different data holdings are modeled following the CUAHSI Observation Data Model. The web service responses are document-based, and use an XML schema to express the semantics in a standard format. Access to station metadata is provided via web service methods, GetSites, GetSiteInfo and GetVariableInfo, while observation values are retrieved via a generic GetValues method. The methods may execute over locally-stored metadata (in SQL Server 2005) or request the information from remote repositories directly. The services are implemented in ASP.Net 2.0 (C#), and tested with both .Net and Java clients.
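As an illustration of the document-based response style, the sketch below parses a hypothetical GetValues-style response into time-value pairs. The XML shape here is invented for illustration; the real services return responses conforming to the CUAHSI Observation Data Model schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical response document; the actual services return XML whose
# schema is defined by the CUAHSI Observation Data Model.
SAMPLE_RESPONSE = """
<timeSeriesResponse>
  <site code="NWIS:01646500" name="EXAMPLE RIVER"/>
  <values variable="discharge" units="cfs">
    <value dateTime="2006-10-01">1200</value>
    <value dateTime="2006-10-02">1350</value>
  </values>
</timeSeriesResponse>
"""

def get_values(xml_text):
    """Client-side counterpart of a GetValues call: parse the XML
    response into (dateTime, float) observation pairs."""
    root = ET.fromstring(xml_text)
    return [(v.get("dateTime"), float(v.text)) for v in root.iter("value")]

print(get_values(SAMPLE_RESPONSE))
```

The point of the uniform API is that the same client-side parsing works whether the method executed over locally stored metadata or against a remote repository.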

The CUAHSI HIS project is funded by NSF through 2011. More information is available from the project web site.

Early Experience Prototyping a Scientific Data Server for Environmental Data

Catharine van Ingen, Microsoft

There is an increasing desire to do science at scales larger than a single site or watershed, and over times measured in years rather than seasons. This implies that the quantity and diversity of data handled by an individual scientist or small group of scientists is increasing even without the “data deluge” associated with inexpensive sensors.

Unfortunately, the quality and quantity of available data and metadata vary widely, and algorithms for deriving science measurements from observations are still evolving. Also, the existence of an Internet archive does not guarantee the quality of the data it contains, and the data can easily become lost or corrupted through subsequent handling. Local data recalibration, additional data derivation, and/or gap-filling are seldom tracked, leading to confusion when comparing results. Today the tasks of data collection, sharing and mining are often a significant barrier to cross-site and regional analyses.

We are developing a prototype scientific data server for data sharing and curation by small groups of collaborators. This data server forms the storage part of a laboratory information management system or LIMS. The server also performs simple data mining and visualization of diverse datasets with diverse metadata. Our goal is to enable researchers to collect and share data over the long time scales typically necessary for environmental research as well as to simply analyze the data as a whole, thus dramatically increasing the feasible spatial or temporal scale of such studies.

The scientific data server prototype is being developed using the Ameriflux data in cooperation with key scientists attempting continental-scale work on the global carbon cycle using long-term local measurements. The Ameriflux measurement network consists of 149 micro-meteorological towers across the Americas. The collaboration is communal: each principal investigator acts independently to prepare and publish data to the Oak Ridge repository. One of the near-term challenges for the Ameriflux and global FLUXNET communities is to enable analyses across sites with similar locations, ecosystems, climates, or other characteristics. A longer-term challenge is to link the flux data to other related data such as MODIS satellite imagery.

Healthcare Informatics: Mudd Hall

Chair: Chi Dang

Real-Time Transcription of Radiology Dictation: A Case Study for Multimedia Tablet PCs

Wuchun Feng, Virginia Tech

We present the design and implementation of an integrated multimodal interface that delivers instant turnaround on transcribing radiology dictation. This instant turnaround virtually eliminates hospital liability with respect to improper transcriptions of oral dictations and all but eliminates the need for transcribers. The multimodal interface seamlessly integrates three modes of input (speech, handwriting, and written gestures) to provide an easy-to-use system for the radiologist.

Although computers have quickly become an essential part of today’s society, their ubiquity has been stymied because many still find the computer “unnatural” (and even difficult) to use. While scientists and engineers take their computer skills for granted, a large number of potential users still have limited experience in using a computer. To make computers (or products with embedded computers, e.g., an automobile) easier and more natural to use, manufacturers have proposed the use of a speech recognition system. Even for computer-savvy users, speech can be used to boost productivity because nearly everyone can talk faster than they can type, typically more than 200 words per minute (wpm) versus 50 to 75 wpm. However, speech recognition is never perfect; recognition errors are made. In order to correct these errors, the end user currently uses a keyboard and mouse.

Instead, we propose a system that seamlessly integrates speech, handwriting, and written gestures and provides a natural multimodal interface to the computer. To ensure that the interface is easier to use than a keyboard-and-mouse interface, the speech recognizer must have a high recognition rate, e.g., 95%, and the handwriting and gesture recognizers should provide nearly error-free recognition of stylus-entered handwriting and gestures, respectively, to correct errors made by the speech recognizer. These corrections can then be applied to the speech recognizer itself to improve future recognition.

Facilitating Understanding and Retention of Health Information

Gondy Leroy, Claremont Graduate University

Billions of people read online health information without understanding it, which is unfortunate since it affects their healthcare decisions. Current research focuses almost exclusively on measuring readability and (re)writing texts so they require lower reading levels. However, rewriting all texts is infeasible and little research has been done to help consumers otherwise.

We focus on automated tools that facilitate understanding and retention of information. We found that consumers read at lower grade levels but also use a significantly different vocabulary than healthcare providers. We have developed a vocabulary-based naïve Bayes classifier that distinguishes with 96% accuracy between three levels of medical specificity in text. Applying this classifier to a sample of online texts showed that only 4% of texts by governments, pharmaceutical companies, and non-profits use consumer-level vocabulary. As a first step, we are developing a table of contents (ToC) algorithm that automatically imposes a semantic structure. The ToC shows important concepts in the text. Selecting these concepts highlights the key terms and bolds the surrounding text. This makes searching the text easier and, more importantly, may improve understanding and retention of information. The ToC visually chunks the information into easy-to-understand groups, which may facilitate transfer from working memory to long-term memory. This is especially important for the elderly, our focus group, who often have physical ailments, deteriorated eyesight, and decreased working memory.
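A minimal sketch of a vocabulary-based naive Bayes text classifier of this kind follows. The two-class training data and add-one smoothing are illustrative assumptions, not the authors' actual corpus, class structure, or setup.

```python
from collections import Counter, defaultdict
import math

class VocabularyNB:
    """Toy vocabulary-based naive Bayes classifier for levels of
    medical specificity. Everything here is a sketch: real training
    would use large labeled corpora and three specificity levels."""

    def fit(self, docs, labels):
        self.labels = set(labels)
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        def log_posterior(label):
            total = sum(self.word_counts[label].values())
            score = math.log(self.label_counts[label] / sum(self.label_counts.values()))
            for w in doc.lower().split():
                # add-one smoothing over the shared vocabulary
                p = (self.word_counts[label][w] + 1) / (total + len(self.vocab))
                score += math.log(p)
            return score
        return max(self.labels, key=log_posterior)

# invented two-document training set for illustration
docs = ["myocardial infarction prognosis", "heart attack recovery tips"]
labels = ["professional", "consumer"]
model = VocabularyNB().fit(docs, labels)
print(model.predict("myocardial infarction"))
```

The intuition matches the finding above: the same medical concept is expressed with different vocabularies by providers and consumers, so word choice alone carries a strong signal about specificity level.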

Results from our pilot study with a first prototype indicate that, when the original text was present, question answering worked as well with the ToC as without it. Remembering the correct answers and recalling extra information afterwards were better with a ToC. A complete user study comparing the elderly and other adults is ongoing.

Motion-Synchronized Intensity Modulated Arc Therapy

Shuang (Sean) Luan, University of New Mexico

Modern radiotherapy is a minimally invasive treatment technique that uses high-energy X-rays to destroy tumors. The quality of a radiotherapy plan is normally measured by its dose conformity and treatment time. The dose conformity specifies how well the high radiation dose region conforms to the target tumor while sparing the surrounding normal tissues, and the treatment time describes how long a treatment takes and how efficiently the treatment machines are utilized.

One of the biggest challenges in modern radiotherapy is to treat tumors in and near the thorax, because they are subject to substantial breathing-induced motions and their anatomies during the treatment may vary significantly from those used for treatment planning. To compensate for such target variations, image-guidance techniques such as 4-D CT and motion tracking have recently been employed in radiotherapy to provide real-time adjustment to the treatment. The most popular current image-guidance technique is called “gating”; the key idea is to treat the patient only at a certain phase of the breathing cycle. Since most of the treatment time is spent waiting for the patient to enter the correct breathing phase, gating can be very inefficient. Further, by only treating the patient at a chosen breathing phase, gating fails to take advantage of 4-D imaging technologies, which can record the patient’s anatomy changes with respect to time.
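Gating's inefficiency is easy to quantify: if the beam is on only during a window covering some fraction of the breathing cycle, wall-clock treatment time stretches by roughly the inverse of that duty cycle. The 25% window below is an assumed figure for illustration, not a number from this work.

```python
def gated_time(beam_on_minutes, duty_cycle):
    """Wall-clock treatment time under gating: the beam is on only
    during the chosen breathing phase, so delivery stretches by the
    inverse of the duty cycle. duty_cycle is the assumed fraction of
    the breathing cycle in which treatment is allowed."""
    return beam_on_minutes / duty_cycle

# 5 minutes of required beam-on time, treatable in 25% of each cycle
print(gated_time(5.0, 0.25))  # 20.0 minutes of wall-clock time
```

Treating at all breathing phases, as motion-synchronized IMAT does, avoids this stretch entirely, which is the efficiency argument the abstract makes.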

We have developed an image-guided radiotherapy technique for compensating breathing-induced motions called motion-synchronized intensity-modulated arc therapy. A prototype planning system running on Microsoft Windows has been implemented using Microsoft Visual C++. Unlike gating, our new scheme makes full use of 4-D CT and motion tracking and treats the patient at all breathing phases. Our preliminary study has shown the ability of motion-synchronized IMAT to produce treatment plans with both superior dose conformity and short treatment times.

Systems Biology and Proteomics in Drug and Biomarker Discovery

Mark Boguski, Novartis

Recent advances in the “omics” technologies, scientific computing, and mathematical modeling of biological processes have started to fundamentally impact the way we approach drug discovery. Recent years have witnessed the development of genome-scale functional screens, large collections of reagents such as RNAi libraries, protein microarrays, and databases and algorithms for text mining and data analysis.

Taken together, these tools enable the unprecedented descriptions of complex biological systems, which are testable by mathematical modeling and simulation. While the methods and tools are advancing, it is their iterative and integrated application that defines the systems biology approach.

The Cardiovascular Research Grid

Raimond L. Winslow, JHU

The Cardiovascular Research Grid (CVRG) project is a national collaborative effort involving investigators at Johns Hopkins University, Ohio State University, and the University of California at San Diego. The goals of this project are to leverage existing grid computing middleware developed as part of the Biomedical Informatics Research Network (BIRN, a brain image sharing and data analysis grid) and the Cancer Bioinformatics Grid (caBIG, for sharing of cancer data) to create a national resource for sharing and analysis of multi-scale cardiovascular data.

Data to be shared include gene and protein expression data, electrophysiological (time series) data, multimodal 3D and 4D image data and de-identified clinical data. Analysis tools to be developed and shared include machine learning methods for predicting risk of Sudden Cardiac Death based on these multi-scale data, computational anatomy tools for detecting abnormalities of heart shape and motion, and computational models of heart function in health and disease.

High Performance Computing: Remsen One

Chair: George Spix

It Takes Two (or More) to Place Data

Miron Livny, University of Wisconsin-Madison

Data-intensive e-Science is by no means immune to the classical data placement problem: caching input data close to where the computation takes place and caching output data on its way to the designated storage/archiving destination. As in other aspects of e-Science, the scale, heterogeneity, and dynamics of the infrastructure and the workload increase the complexity of providing a dependable managed data placement capability. When a “chunk” of bytes is cached, a source site and a destination site are actively engaged in the placement of the data. While the destination has to provide the storage space (we refer to it as a lot) to “park” the data, both sites need to co-allocate local resources like disk bandwidth and memory buffers in support of the transfer activity.

Operational cyber-infrastructure at the campus level (like the Grid Laboratory of Wisconsin, GLOW) and at the national level (like the Open Science Grid, OSG) exposes the limitations and deficiencies of existing storage and data handling tools and protocols. While most of them focus on network performance, they offer very little in local resource manageability and coordination. Recent work to enhance the capabilities of existing tools like GridFTP and protocols like the Storage Resource Manager (SRM), as well as specialized job managers like Stork, points to promising approaches to address the data placement problem for e-Science applications. Some of this work employs matchmaking techniques to coordinate the allocation of resources at the two end points. These techniques allow the parties to express their autonomous resource allocation policies and locally enforce them. By elevating data placement to the same level as computing, data caching tasks can be easily included in workflows so that all aspects of the workload can be uniformly treated.
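A toy sketch of two-party matchmaking follows. The attribute names (free_lot_gb, max_transfer_mbps) and the policies are invented for illustration; real systems such as Condor express these as ClassAd constraint expressions.

```python
def matches(request, resource):
    """A placement matches when the destination can provide the
    requested storage lot AND both parties' own policies accept the
    other side's advertisement. Attribute names are illustrative."""
    if resource["free_lot_gb"] < request["lot_gb"]:
        return False
    # each party locally enforces its autonomous allocation policy
    return request["policy"](resource) and resource["policy"](request)

def matchmake(request, resources):
    """Return the name of the first advertised resource that satisfies
    both sides, or None when no placement is possible."""
    for r in resources:
        if matches(request, r):
            return r["name"]
    return None

# a transfer needing a 50 GB lot and at least 100 Mbps at the endpoint
request = {"lot_gb": 50,
           "policy": lambda r: r["max_transfer_mbps"] >= 100}
resources = [
    {"name": "siteA", "free_lot_gb": 20, "max_transfer_mbps": 1000,
     "policy": lambda q: True},
    {"name": "siteB", "free_lot_gb": 200, "max_transfer_mbps": 400,
     "policy": lambda q: q["lot_gb"] <= 100},
]
print(matchmake(request, resources))  # siteA lacks space; siteB matches
```

The essential point is symmetry: both endpoints advertise resources and policies, and a match succeeds only when each side's constraints hold against the other's advertisement.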

COMPASS – Staying Found in a Material World

Gerd Heber & Anthony R. Ingraffea, Cornell Theory Center

The Computational Materials Portal and Adaptive Simulation System is an attempt to deliver certain Computational Materials services and resources over the World Wide Web to the desks of engineers, researchers, and students in academia, government, and industry. Currently, COMPASS resources and services are available to human and non-human end-users through a portal site or XML Web services. The services and resources offered include modeling tools, simulation capabilities, imagery, and other data contributed by domain experts. With COMPASS services, each authorized user can create new resources and further process them in a private workspace.

COMPASS is a multi-tiered system which brings to bear a set of technologies. Its web tier is implemented in Microsoft ASP.NET 2.0 and Atlas. In addition to traditional RDBMS use, the middle tier and back end leverage several of the capabilities introduced with Microsoft SQL Server 2005, e.g., the native XML type and the integrated CLR. Other technologies employed are RDF/XML for metadata management and OpenDX/JDX for local and remote visualization.

COMPASS is a work in progress: the presentation is a status report and will highlight some of the present challenges. Among them are ambient findability (find anything from anywhere, anytime) and data resource federation and replication. COMPASS grew out of and is currently supported by the DARPA SIPS (Structural Integrity and Prognosis System) effort, which aims at dramatically improving predictions of usable vehicle life based on field data, the best available materials science, and multi-scale simulation.

HPC Profile: Interoperable, Standards-based Batch Job Scheduling of Scientific/Technical Applications

Marty Humphrey, University of Virginia

eScientists often use high-end computing platforms such as computational clusters to perform complex simulations. Currently, per-machine idiosyncratic interfaces and behaviors make seamless access across these platforms nearly impossible, forcing often-fragile middleware (or, more likely, the end eScientist) to deal manually with underlying differences between these back-end resources. In collaboration with Microsoft and Platform Computing in the context of the Open Grid Forum, we have recently completed a standards-based “HPC Profile” based on Web services.

The core of the HPC Profile is the Job Submission Description Language (JSDL) and the Open Grid Services Architecture (OGSA) Basic Execution Services (BES). JSDL is a proposed standard that describes the requirements of computational jobs for submission to resources. BES is an emerging standard for specifying a service to which clients can send requests to initiate, monitor, and manage computational activities. The HPC Profile augments, clarifies and restricts JSDL and BES to create the minimal interoperable environment for realizing the vertical use case of batch job scheduling of scientific/technical applications. The HPC Profile is the cornerstone of the “Evolutionary approach to realizing the Grid vision” of Theimer, Parastatidis, Hey, Humphrey, and Fox.
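As a concrete illustration, a minimal JSDL document for a batch job might look roughly like the sketch below. The element names follow the published JSDL 1.0 schema, but the executable path, argument, and CPU count are invented for illustration, and the HPC Profile further restricts which JSDL elements are permitted.

```xml
<jsdl:JobDefinition
    xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
    xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <jsdl:JobDescription>
    <jsdl:Application>
      <jsdl-posix:POSIXApplication>
        <!-- hypothetical simulation binary and input file -->
        <jsdl-posix:Executable>/opt/apps/simulate</jsdl-posix:Executable>
        <jsdl-posix:Argument>input.dat</jsdl-posix:Argument>
      </jsdl-posix:POSIXApplication>
    </jsdl:Application>
    <jsdl:Resources>
      <jsdl:TotalCPUCount>
        <jsdl:Exact>8</jsdl:Exact>
      </jsdl:TotalCPUCount>
    </jsdl:Resources>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
```

A client would submit such a document through the BES CreateActivity operation and then monitor and manage the resulting activity through the same service, which is precisely the interoperable submission path the HPC Profile pins down.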

In this talk, I give an overview of the HPC Profile, emphasizing its impact on the end eScientists. I will describe the state of interoperable open-source implementations of the HPC Profile (as well as the development of compliance tests for future implementations). I will give a demo of a Web Part that can be utilized by Microsoft Office SharePoint Server 2007 as a key component for building a collaboration site for an eScience project. I will conclude with some thoughts on a potential Data Profile, which builds on the success of the HPC Profile to construct a corresponding standards-based approach for interoperable data federation and management.

Globally Distributed Computing and Networking for Particle Physics Event Analysis

Julian Bunn, Caltech

Excitement in anticipation of the first proton beams at CERN’s Large Hadron Collider (LHC) in 2007 is reaching new heights amongst physicists engaged in the Compact Muon Solenoid experiment, one of four detectors that will be used to capture and analyze the LHC data in search of the Higgs Boson and new physics.

The expected LHC data rates (200Mbytes/sec to 1.5 GBytes/sec) give rise to unusually large datasets which must be distributed, processed and analyzed by a worldwide community of scientists and engineers, according to the decentralized, Tiered model of computing developed at Caltech in 1997 and since adopted by these experiments. Over the last eight years at Caltech we have been actively planning and developing computing infrastructure to meet this data challenge. The effort has several thrusts: planning, testing, evaluating, and deploying in production high speed intercontinental networks to carry scientific data on behalf of the community (LHCNet), developing and deploying Grid-based physics software and tools, with a particular focus on event data analysis (Clarens), and creating a worldwide real-time monitoring and control infrastructure for systems, networks, and services based on an agent architecture (MonALISA).

In my presentation I will describe these activities and paint a picture of how we expect to extract and analyze the LHC data for evidence of new physics over the next decade and beyond.

Accelerating Statistical Biomedical Data Analysis Using a PC-Cluster Based Distributed Computing Technology

Yibin Dong, Virginia Tech

Small sample sizes in biomedical research on genomic datasets pose challenges of relevancy and scalability for research scientists. To obtain statistical significance in data analysis, the same computing task is usually repeated hundreds of times; on a single computer, these independent tasks queue up and become a bottleneck in statistical data analysis. One solution for accelerating statistical biomedical data analysis is cluster computing, which has typically been Linux based. However, many research scientists accustomed to the Microsoft Windows environment may not be keen to switch to an operating system that is new to them.

In May 2006, researchers at the Virginia Tech Advanced Research Institute built a multi-node parallel computer using the beta version of Microsoft's Windows Compute Cluster Server (CCS) 2003. It was built from 16 HP ProLiant DL145 Generation 2 servers over a period of two months, during which we successfully tested two in-house bioinformatics applications on CCS (robust biomarker selection and predictor performance estimation) using the MATLAB Distributed Computing Toolbox (DCT). On the 16-node compute cluster, run times for the two applications were reduced by 84.53% and 92.08%, respectively, compared to running the same applications on a single computer.
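
The workload pattern here is embarrassingly parallel: each repetition is independent, so the queue that serializes on one machine fans out naturally across cluster nodes. A minimal sketch of the idea (in Python rather than MATLAB DCT, with a toy statistic standing in for a biomarker-selection run):

```python
from concurrent.futures import ThreadPoolExecutor
import random

def biomarker_trial(seed):
    # One independent repetition of a (toy) statistical trial;
    # deterministic given its seed, like one queued DCT task.
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
    return sum(sample) / len(sample)

def run_trials(n_trials, workers=4):
    # Because repetitions share no state, the tasks that serialize
    # on a single computer can run concurrently; a compute cluster
    # performs the same fan-out across machines instead of threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(biomarker_trial, range(n_trials)))
```

With many independent repetitions and p workers, the ideal time reduction approaches (p-1)/p, which is consistent with the 84–92% reductions reported above for 16 nodes.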

The improved performance of the Windows CCS cluster demonstrates the feasibility of Windows-based high-performance computing for small to medium-sized biomedical research groups, providing significant benefits such as increased computational performance, easy deployment, ease of use, high scalability, and strong security.

eScience Communities: STI Auditorium

Chair: Steven Meacham

National Science Foundation and e-Science: Now and Next Steps

Maria Zemankova, National Science Foundation

The National Science Foundation's (NSF) mission, as stated in the NSF Act of 1950, is: “To promote the progress of science; to advance national health, prosperity, and welfare; to secure the national defense; and for other purposes.” Fifty-five years later, the National Science Board (NSB) that governs NSF articulated the “2020 Vision for the National Science Foundation” and also published a report on “Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century.” This year, NSF established a new Office of Cyberinfrastructure, which coordinates and supports the acquisition, development, and provision of state-of-the-art cyberinfrastructure resources, tools, and services essential to the conduct of 21st-century science and engineering research and education. The common theme is the need to promote the conduct of research in the new information-, computation-, and communications-based knowledge discovery and sharing paradigm, i.e., to develop an “e-Science” research infrastructure.

The Computer & Information Science & Engineering (CISE) directorate and NSF's science, engineering, educational, and infrastructure programs foster synergistic collaboration for the advancement of both CISE and domain areas. NSF supports innovative techniques for exploiting and enhancing existing information, computation, and communications technologies to support domain-specific research problems, large-scale transformative research projects, or collaborative research activities with other partners, including industry or international research communities. Supported and proposed research spans new methods for modeling new, complex data types; efficient techniques for collecting, storing, and accessing large volumes of dynamic data; development of effective knowledge discovery environments, including analysis, visualization, and simulation techniques; distributed collaboration and discovery process management (grids, scientific workflows); research creativity support tools; e-Science interdisciplinary curriculum development; long-term knowledge evolution and sharing; and innovations in publishing and archival of scientific literature, results, and data.

This presentation will provide information on existing research and infrastructure projects, current support opportunities, and outline future plans and wishes.

SETI@home and Public Participation Scientific Computing

Dan Werthimer, University of California, Berkeley

Werthimer will discuss the possibility of life in the universe and the search for radio and optical signals from other civilizations. SETI@home analyzes data from the world's largest radio telescope using the desktop computers of five million volunteers in 226 countries.

SETI@home participants have contributed two million years of computer time and have formed one of Earth's most powerful supercomputers. Users have the small but captivating possibility that their computer will detect the first signal from a civilization beyond Earth.

Werthimer will also discuss plans for future SETI experiments, petaop/sec FPGA based computing, and open source code for public participation distributed computing (BOINC — Berkeley Open Infrastructure for Network Computing).

Organizing, Analyzing and Visualizing Data on the TeraGrid

Kelly P. Gaither, TACC

eScience is a term used to describe computationally intensive science carried out in highly distributed network environments or using immense data sets requiring grid computing technologies. A classic example of a large scale eScience project is the TeraGrid, an open scientific discovery infrastructure combining leadership class resources at nine partner sites to create an integrated, persistent computational resource. The TeraGrid integrates high-performance computers, data and visualization resources, software tools, and high-end experimental facilities around the country. These integrated resources include more than 102 teraflops of computing capability and more than 15 petabytes of online and archival data storage with rapid access and retrieval over high-performance networks. Through the TeraGrid, researchers can access over 100 discipline-specific databases.

I currently serve as the Area Director for Data, Information Services, Visualization and Scheduling (DIVS) for the TeraGrid Grid Integration Group (GIG). In this role, I am keenly aware of the impending data and analysis issues facing our e-Science community. Data management and visualization have become priorities for the national user community and consequently for the TeraGrid. In this day of information proliferation, the need for rapid analysis and discovery is critical. Information and data are being generated at an astonishing rate through measurement, sensors, and simulation. We, as scientists and technologists, are beginning to better understand the management and manipulation of massive data stores, whether through storage, co-location, or rapid movement. Extracting information through visualization, however, remains challenging and is still in its infancy. We have explored the issues facing data analysis and visualization, and the relationships that exist between data type, size, and structure and the corresponding analysis techniques. I will present and discuss the issues we face on the TeraGrid with regard to the organization and analysis of large-scale data, and strategies going forward.

Sector – An E-Science Platform for Distributing Large Scientific Data Sets

Robert Grossman, University of Illinois at Chicago

In this talk, we show how a peer-to-peer system called Sector can be used to distribute large scientific data sets over wide-area high performance networks. We also describe how Sector has been used recently to distribute data from the Sloan Digital Sky Survey (SDSS).

Sector is designed to exploit the bandwidth available in wide-area, high-performance networks and to do so in a way that is fair to other high-volume flows and friendly to traditional TCP flows. Sector employs a high-performance data transport protocol called UDT to achieve this. Sector has been used to transport the SDSS BESTDR5 catalog data, which is over 1 TB in size, to locations in North America, Europe, and Asia. Sector is designed to provide simple access to remote and distributed data. No infrastructure is required other than a fast network and a small Sector client application; in contrast, installing and operating the infrastructure for a data grid can sometimes be challenging.

We also describe a distributed system for integrating and analyzing data, called Angle, that is built over Sector. Angle is designed to perform row and column operations on distributed data. In contrast to systems that rely exclusively on a database or data warehouse for data integration and require the full semantic integration of different data schemas, Angle also supports data integration using globally unique identifiers, called universal keys, that can be attached to distributed data attributes. For many applications, having one or more universal keys is often surprisingly useful. For example, geospatial applications can use universal keys specifying a particular latitude-longitude coordinate system for data integration operations; astronomical applications can use universal keys specifying a particular right ascension-declination coordinate system; and so on.
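
As an illustration of the universal-key idea (hypothetical record layouts and a made-up grid key; Python for brevity, not Angle's actual interface), two sites with different schemas can be integrated row-wise through a shared latitude-longitude cell identifier, with no schema merge:

```python
def universal_key(lat, lon, cell_deg=1.0):
    # A made-up universal key: the 1-degree grid cell containing a
    # point in one agreed-upon global coordinate system.
    return (int(lat // cell_deg), int(lon // cell_deg))

# Two distributed datasets whose schemas were never integrated.
site_a = [{"lat": 40.1, "lon": -88.2, "ozone": 31.0}]
site_b = [{"latitude": 40.6, "longitude": -88.9, "temp_c": 17.5}]

def integrate(a, b):
    # Row-level integration on the shared universal key alone.
    index = {universal_key(r["lat"], r["lon"]): r for r in a}
    merged = []
    for r in b:
        key = universal_key(r["latitude"], r["longitude"])
        if key in index:
            merged.append({**index[key], **r, "cell": key})
    return merged
```

Both records fall in the same grid cell, so they join on the key even though neither site knows the other's field names.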

Abstracts 10/14

Sensor Networks: Schafler Auditorium

Chair: Catharine van Ingen

Life Under Your Feet: A Wireless Soil Ecology Sensor Network

Katalin Szlavecz & Andreas Terzis, The Johns Hopkins University

Wireless sensor networks (WSNs) have the potential to revolutionize soil ecology by providing abundant data gathered at temporal and spatial granularities previously impossible. In this talk we will outline some of the open questions in soil ecology today and elaborate on the potential of WSNs to provide the data to answer these questions.

In the second part of the talk we will present an experimental network for soil monitoring that we developed and deployed for a period of one year in an urban forest in Baltimore. Each node in our network collects soil moisture and temperature measurements every minute and stores them in local memory. All collected measurements are retrieved by a sensor gateway and inserted into a database in both raw and calibrated versions. Stored measurements are subsequently made available to third-party applications through a web services interface.
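
A minimal sketch of that ingest path (Python; the sensor name and the linear calibration model are made up for illustration): the gateway stores each reading in both forms, so a revised calibration never destroys the original measurement:

```python
# Hypothetical per-sensor calibration: calibrated = gain * raw + offset.
CALIBRATION = {"node7.moisture": (0.02, -0.15)}

database = []  # stands in for the relational store behind the gateway

def ingest(sensor, raw):
    # Insert the raw reading and its calibrated version side by side,
    # as the gateway does before the web-services tier exposes them.
    gain, offset = CALIBRATION[sensor]
    row = {"sensor": sensor, "raw": raw,
           "calibrated": gain * raw + offset}
    database.append(row)
    return row
```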

At a high level this first deployment was a scientific success, exposing variations in the local soil micro-climate not previously observed. However, it also points to a number of challenging problems that must be addressed before sensor networks can fulfill their potential of being predictable and robust instruments empowering scientists to observe phenomena that were previously out of reach. We will close the talk by discussing how we plan to address these challenges in the second deployment of our network that we are currently designing.

The WaveScope Data Management System

Samuel Madden, Massachusetts Institute of Technology

WaveScope is a data management and continuous sensor data processing system that integrates relational database and signal processing operations into a single system. WaveScope is motivated by a large number of signal-oriented streaming sensor applications, such as preventive maintenance of industrial equipment; detection of fractures and ruptures in various structures; in situ animal behavior studies using acoustic sensing; network traffic analysis; and medical applications such as anomaly detection in EKGs. These target applications use a variety of embedded sensors, each sampling at fine resolution and producing data at high rates, ranging from hundreds to hundreds of thousands of samples per second. Though there has been some work in the sensor network community on applications that do this kind of signal processing (for example, shooter localization, industrial equipment monitoring, and urban infrastructure monitoring), these applications are typically custom-built and do not provide a reusable high-level programming framework suitable for easily building new signal processing applications with similar functionality. This talk will discuss how WaveScope supports these types of applications in a single, unified framework, providing both high run-time performance and easy application development, and will illustrate how several scientific applications are built in the WaveScope framework.

Challenges in Building a Portal for Sensors World-Wide

Feng Zhao & Suman Nath, Microsoft Research

SensorMap is a portal web site for real-time real-world sensor data. It allows data owners to easily make their data available on the map. The platform also transparently provides mechanisms to archive and index data, to process queries, to aggregate and present results on a geo-centric web interface based on Windows Live Local. In this talk, I will describe the architecture of SensorMap, key challenges in building such a portal, and current status and experience. I will also highlight how such a portal can help eScience research.

Transforming Ocean and Earth Sciences with Distributed Submarine Sensor Networks

John R. Delaney, University of Washington

Interactive, internet-linked sensor-robotic networks are the next-generation approach to enabling long-term 24/7/365 surveillance of major remote or dangerous processes that are central to the habitability of our planet. Continuous, real-time information from the environment, specifically from the ocean basins, will launch rapid growth in our understanding of the habitats and behavior of known and novel life forms, climate change, assessment and management of living and non-living marine resources, elements of homeland defense, erupting underwater volcanoes, major earthquake timing and intensity, and mitigation of natural disasters.

The NEPTUNE ocean observatory program will be a leader in this approach. The observatory’s 1400-mile network of heavily instrumented fiber-optic/power cable will convert a major sector of the Juan de Fuca tectonic plate and its overlying ocean off the coasts of Washington, Oregon, and British Columbia into an internationally accessible interactive, real-time natural laboratory reaching millions of users or viewers via the Internet.

Thousands of physical, chemical, and biological sensors distributed across the seafloor, throughout the ocean above, and within the seabed below, may be linked to partially or fully autonomous robotic platforms that are integrated into interactive networks connected via the Internet to land-based users. NEPTUNE is being designed to provide scientists, educators, policy makers, and the public with unprecedented forms of novel information about a broad host of natural and human-induced processes operating within the ocean basins. Data management and visualization challenges include handling large volumes of multidisciplinary data streams; assimilating real-time data into models; and providing data discovery and visualization tools that enable collaborative discovery by groups of researchers.

Smart Clients: Mudd Hall

Chair: Simon Mercer

The Chemical Informatics and Cyberinfrastructure Collaboration: Building a Web Service Infrastructure for Chemical Informatics

Marlon Pierce & David Wild, Indiana University

At the Indiana University School of Informatics we are developing a web service, workflow, and smart client infrastructure to allow the intelligent querying, mining, and use of drug discovery information. As the volume and diversity of sources of chemical, biological, and other information related to drug discovery have grown, it has become increasingly difficult for scientists to use this information effectively. In this presentation we will discuss our approach to harnessing the information available, including the use of literature, chemical databases, biological information, and information generated by computational tools such as docking. We will give examples of workflows that bring together tools and information in new ways, and discuss our efforts to develop innovative interaction tools and interfaces that let scientists map their information needs onto these workflows.

What’s Your Lab Doing in My Pocket? Supporting Mobile Field Studies with Xensor for Smartphone

Henri ter Hofte, Telematica Instituut, the Netherlands

Smartphones tend to travel along with people in everyday life, wherever they are and whatever they are doing. This literally puts these devices in an ideal position to capture several aspects of phenomena, such as a person's location and proximity to others. Xensor for Smartphone is an extensible toolkit that exploits the hardware sensors and software capabilities of Windows Mobile 5.0 smartphones to capture objective data about human behavior and its context (such as location, proximity, and communication activities), together with objective data about application usage and highly subjective data about user experience (such as needs, frustrations, and other feelings). The aim is to provide social science with a research instrument for gaining much more detailed and dynamic insight into social phenomena and their relations. In turn, these outcomes can inform the design of successful mobile context-aware applications.

In this talk, we present and demonstrate the support Xensor for Smartphone provides in various phases of a scientific study: configuration, deployment, data collection and analysis. We also highlight how we used various Microsoft technologies (including Windows Mobile, .NET Compact Framework, SQL Mobile and SQL Server) in an occasionally-connected smart client architecture to implement the Xensor for Smartphone system.

ProDA’s Smart Client for On-Line Scientific Data Analysis

Cyrus Shahabi, University of Southern California

In the past three years, we have designed and developed a system called ProDA (for Progressive Data Analysis), which deploys wavelet transformation and web-services technology for efficient and transparent analysis of large multidimensional data in Online Scientific Applications (OSA).

Two types of processing are needed by OSA. First, a set of data-intensive operations (e.g., sampling and aggregation) must be performed on terabytes of data to prepare a relevant subset for further analysis. Second, further visualization and deeper analysis of that subset occur in a more interactive mode of operation. For the first set of tasks, we moved the operations as close to the data as possible to avoid unnecessary and costly data transmission, and we enable fast queries by pre-aggregating data as wavelets. Hence, with ProDA, we used the .NET Framework to develop a set of customized web services that perform typical scientific data analysis tasks efficiently in the wavelet domain, close to the data. The second set of tasks is best performed using the user's favorite tools already provided by the client platform (such as a spreadsheet application). Therefore, ProDA's web-enabled smart client, implemented in C#, allows both transparent access to the second-tier web services and smooth invocation of client-side tools. This architecture also allows OSA mobile users to cache data and perform ad hoc data analysis tasks on the cached data while disconnected from their huge data repositories.
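
The payoff of pre-aggregating data as wavelets can be shown with a toy unnormalized Haar transform in plain Python (ProDA's wavelet-domain operators are of course far more general): once the coefficients are materialized, a whole-domain SUM is answered from a single coarse coefficient instead of the raw samples:

```python
def haar(data):
    # Unnormalized Haar transform: repeated pairwise averages and
    # differences; len(data) must be a power of two.
    out = list(data)
    n = len(out)
    while n > 1:
        avg = [(out[i] + out[i + 1]) / 2 for i in range(0, n, 2)]
        det = [(out[i] - out[i + 1]) / 2 for i in range(0, n, 2)]
        out[:n] = avg + det
        n //= 2
    return out  # out[0] is the overall average

def domain_sum(coeffs, n):
    # Aggregation in the wavelet domain: the coarsest coefficient
    # alone reconstructs SUM over the whole domain exactly.
    return coeffs[0] * n
```

For a range covering only part of the domain, progressively more detail coefficients refine the answer, which is what makes the analysis "progressive."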

We deployed ProDA in two different application domains: Earth Data Analysis (sponsored by JPL) and Oil Well Sensor Data Analysis (sponsored by Chevron). In this talk we will emphasize and demonstrate ProDA’s utility in the Chevron application.

Function Express Gold: A caBIG™ Grid-aware Microarray Analysis Application

Rakesh Nagarajan, Washington University

It is becoming increasingly apparent that a majority of human diseases, including tumorigenesis, are the product of multi-step processes, each involving the complex interplay of a multitude of genes acting at different levels of the genetic program. To study such complex diseases, many analyses on the genomic scale are possible in the post-human-genome-sequencing era. Foremost among these is the microarray experiment, in which an investigator has the ability to monitor the expression of all genes in a particular tissue. However, most end-user physician-scientists find the task of analyzing data generated from microarray experiments daunting, since considerable computing power and expertise are required. To directly address this growing need, the National Cancer Institute has recently started the cancer Biomedical Informatics Grid (caBIG™) initiative to create a “network or grid connecting individuals and institutions to enable the sharing of data and tools, creating a World Wide Web of cancer research.” Using caBIG™ data and analytical services, we propose to develop Function Express Gold (FE Gold), a caGrid-aware Microsoft Smart Client microarray analysis application. In our approach, all grid sources will be accessed using web service adapters. Namely, FE Gold will acquire microarray and gene annotation data using caGrid data services, and this data will then be filtered, normalized, and mined using caGrid analytical services. Using the acquired microarray data, analysis results, and gene annotation information, FE Gold will be able to function using local computing power for graphical display and analysis even when the client is not connected to the Internet. When network connectivity is available, FE Gold will check for annotation data updates at the server end in a seamless fashion. Finally, new releases as well as bug fixes will be distributed to all clients using the Background Intelligent Transfer Service.

Data Organization and Search: Remsen One

Chair: Jim French

Sorting in Space

Hanan Samet, University of Maryland

The representation of spatial data is an important issue in computer graphics, computer vision, geographic information systems, and robotics. A wide variety of representations is currently in use. Recently there has been renewed interest in hierarchical data structures such as quadtrees, octrees, and R-trees. The key advantage of these representations is that they provide a way to index into space; in fact, they are little more than multidimensional sorts. They are compact and, depending on the nature of the spatial data, they save space as well as time and also facilitate operations such as search. In this talk we give a brief overview of hierarchical spatial data structures and related research results. In addition, we demonstrate the SAND Browser and the VASCO Java applet, which illustrate these methods.
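
The "multidimensional sort" view can be made concrete with a Morton (Z-order) key, the linear order implicit in a region quadtree (illustrative Python, not code from the talk):

```python
def morton_key(x, y, bits=16):
    # Interleave the bits of (x, y): sorting by this key visits the
    # cells of a region quadtree in depth-first order, so spatially
    # nearby points tend to receive nearby keys.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bit -> even position
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bit -> odd position
    return key

points = [(5, 9), (2, 3), (5, 8), (2, 2)]
ordered = sorted(points, key=lambda p: morton_key(*p))
```

A one-dimensional index (such as a B-tree) over these keys then serves as a spatial index, which is exactly the sense in which such structures "index into space."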

Geospatial Infrastructure Goes to the Database

Tamás Budavári & Alex Szalay, The Johns Hopkins University

Jim Gray & Jose Blakeley, Microsoft

We present a novel approach to dealing with geospatial information inside the database. The design is based on our own lightweight spatial framework that was developed for representing complex shapes on the surface of the unit sphere independent of coordinate systems or projections. This C# library is not only capable of formally describing regions of interests very accurately, but features the full set of logical operations on the regions (such as union or intersection), as well as precise area calculation.

The internal mathematical representation is tuned for flexibility and fast point-in-region searches regardless of the area coverage. Leveraging the CLR capabilities of Microsoft SQL Server 2005, we surface most of the functionality of the spherical class library in SQL. The SQL routines use a custom serializer to store the shapes in binary blobs inside the database. For very fast searches, we materialize and index in SQL both bounding circles and an adaptive approximation indexed using the Hierarchical Triangular Mesh.
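
The bounding-circle stage rests on a simple geometric test (sketched below in Python; the actual library is C# inside SQL Server, and its API is not shown here): a point on the unit sphere falls within a spherical cap exactly when its dot product with the cap's center is at least the cosine of the cap's radius:

```python
import math

def unit_vector(ra_deg, dec_deg):
    # Cartesian point on the unit sphere from spherical coordinates.
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def in_cap(p, center, radius_deg):
    # Cheap prefilter: one dot product and one comparison per object,
    # regardless of how complicated the exact region shape is.
    dot = sum(a * b for a, b in zip(p, center))
    return dot >= math.cos(math.radians(radius_deg))
```

Because the comparison reduces to a threshold on a stored value, it is the kind of predicate a database index can evaluate quickly before any exact region test runs.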

Efficient Search Index for Spherically Distributed Spatial Data in a Relational Model

Gyorgy Fekete, The Johns Hopkins University

We discuss a project to develop a system for rapid data storage and retrieval using the Hierarchical Triangular Mesh (HTM) to perform fast indexing over a spherical spatial domain in order to accelerate storing and finding data over the Earth and sky. Spatial searches over the sky are the most frequent queries on astrophysics data, and as such are central to the National Virtual Observatory (NVO) effort and beyond.

The library has applications in astronomy and earth science. The goal is to speed up queries that involve an object (such as an observation or location) and a region of interest of arbitrary shape (such as a political boundary or satellite track). In a very large database, one wants to minimize the number of calculations needed to decide whether an object meets a spatial search criterion. We use an HTM index-based method to build, on the fly, a coarse representation of a cover map for the query region, which is then used to eliminate most of the objects that are clearly outside the region. False positives that pass the coarse test are removed with more precise, albeit more time-consuming, calculations.
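
A miniature flat-plane analogue of this two-stage scheme (illustrative Python; a uniform grid stands in for HTM trixels on the sphere) shows how the coarse cover map discards most objects before the expensive exact test runs:

```python
def cell_of(x, y, size=1.0):
    return (int(x // size), int(y // size))

def coarse_cover(x0, y0, x1, y1, size=1.0):
    # Cells touched by the query's bounding box: the cheap,
    # index-friendly stand-in for an HTM trixel cover map.
    return {(i, j)
            for i in range(int(x0 // size), int(x1 // size) + 1)
            for j in range(int(y0 // size), int(y1 // size) + 1)}

def query_circle(objects, cx, cy, r):
    # Stage 1: the coarse cover eliminates clearly-outside objects.
    cover = coarse_cover(cx - r, cy - r, cx + r, cy + r)
    candidates = [p for p in objects if cell_of(*p) in cover]
    # Stage 2: exact, more expensive test removes false positives.
    return [(x, y) for x, y in candidates
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r]
```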

A challenging problem, cross-matching, is to find data on the same object in separate archives. Simple boxing with rectilinear constraints is inadequate because such boxes are singular at the poles, unstable near them, and the actual shapes of areas of interest do not always fit neatly within a box. Furthermore, because of constraints imposed by instruments, engineering, and so on, scientists may need to define their own irregularly shaped query regions.

With the recent advances in the worldwide Virtual Observatory effort, we now have a standard Extensible Markup Language (XML) data model for space-time data. This data model also provides a new standard way to express spherical polygons as search criteria. Two outcomes of this project are (1) a layer that enables our search engine to run inside relational Structured Query Language (SQL) databases, whether Open Source or commercial (such as SQL Server), and (2) the ability to participate as a first-class access method in relational database queries. The toolkit is implemented in a highly portable framework in the C# programming language, which allows seamless integration with relational database engines and web services and, in particular, makes it possible to develop a full web-service implementation of the library that can be accessed through remote calls.

Using Databases to Store the Space-Time Histories of Turbulent Flows

Randal Burns, ShiYi Chen, Laurent Chevillard, Charles Meneveau, Eric Perlman, Alex Szalay, Ethan Vishniac & Zuoli Xiao, Johns Hopkins University

We describe a new environment for large-scale turbulence simulations that uses a cluster of database nodes to store the complete space-time history of fluid velocities. This allows rapid access to high-resolution data that were traditionally too large to store and too computationally expensive to produce on demand. The system performs the actual experimental analysis inside the database nodes, which allows data-intensive computations to be performed across a large number of nodes with relatively little network traffic. Currently, we have a limited-scale prototype system running actual turbulence simulations and are in the process of establishing a production cluster with high-resolution data. We will discuss our design choices, computing environment, and initial results with load balancing a data-intensive, migratory workload.
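
The network-traffic argument can be sketched in a few lines (toy Python; the actual system runs the analysis inside SQL database nodes): each node reduces its local slab of the velocity history to a small partial aggregate, and only those aggregates cross the network:

```python
# Toy velocity history partitioned across "database nodes" by time slab.
nodes = [
    {"u": [0.1 * t for t in range(0, 50)]},
    {"u": [0.1 * t for t in range(50, 100)]},
]

def local_second_moment(node):
    # Runs inside a node: reduce thousands of samples to two numbers
    # instead of shipping every velocity value to a client.
    return sum(v * v for v in node["u"]), len(node["u"])

def global_rms():
    # Only (sum, count) pairs travel over the network; merging them
    # reproduces the exact global statistic.
    total, count = 0.0, 0
    for node in nodes:
        s, n = local_second_moment(node)
        total, count = total + s, count + n
    return (total / count) ** 0.5
```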

Data Visualization: STI Auditorium

Chair: Ed Lazowska

Tools for Distributed Observatory Management

Mike Godin, Monterey Bay Aquarium Research Institute

A collection of browser-based tools for collaboratively managing an ocean observatory has been developed and used in the multi-institutional, interdisciplinary Adaptive Sampling and Prediction (ASAP) field experiment, which took place in a 100×100 km region around Monterey Bay in the summer of 2006. The ASAP goal was to optimize data collection and analysis by adapting a 20×40 km array of up to twelve underwater robots sampling to depths of 500 meters. In near real time, researchers assimilated robotic observations into three independent, simultaneously running four-dimensional ocean models, predicting ocean conditions for the robots over the next few days.

ASAP required the continuous participation of numerous researchers located throughout North America. Powerful exercises that guided development of the required collaboration tools were “virtual experiments,” wherein simulated robots sampled a simulated ocean, generating realistic data files that experimenters could visualize and modelers could assimilate. Over the course of these exercises, the Collaborative Ocean Observatory Portal (COOP) evolved, with tools that centralized, cataloged, and converted observations and predictions into common formats, generated automated comparison plots, supported querying of the data set, and created and organized scientific content for the portal. Centralizing data in common formats allowed researchers to manipulate data without relying on the data generators' expertise to read it, and to query data with the Metadata Oriented Query Assistant (MOQuA). Collaborators could produce specialized products and link to them through the collaborative portal, making the experimental process more interdisciplinary and interactive.

Collaboration and data-handling tools will be equally important for future observatories, which will require 24-hour-a-day, 7-day-a-week interactions over many years. As demonstrated in the successful field experiment, these tools allowed scientists to manage an observatory coherently, collaboratively, and remotely. Lessons learned from operating these tools before, during, and after the field experiment provide an important foundation for future collaborative ventures.

Scalable Techniques for Scientific Visualization

Claudio T. Silva, University of Utah

Computers are now extensively used throughout science, engineering, and medicine. Advances in computational geometric modeling, imaging, and simulation allow researchers to build models of increasingly complex phenomena and thus to generate unprecedented amounts of data. These advances require a substantial improvement in our ability to visualize large amounts of data and information arising from multiple sources. Effectively understanding and making use of the vast amounts of information being produced is one of the greatest scientific challenges of the 21st century. Our research at the Scientific Computing and Imaging (SCI) Institute at the University of Utah has focused on innovative, scalable techniques for large-scale 3D visualization. In this talk, I will review the state of the art in high-performance visualization technology, including out-of-core, streaming, and GPU-based techniques used to drive a range of display devices, including large-scale display walls. I will conclude with an outline of how large-scale visualization fits into an eScience research agenda.

Oceanographic Workbench

Keith Grochow, University of Washington

We are designing an oceanographic workbench that contains a suite of features for scientists at UW and MBARI involved with ocean observatories (the NEPTUNE and MARS projects, respectively). At its core is a fast, multi-resolution terrain engine that can incorporate a broad range of bathymetric data sets. Over this we can overlay multiple images, textures, color gradients, and measurement grids to help scientists visualize the observatory site environment. For site management, there is an intuitive drag-and-drop interface to add, position, and determine interactions of instruments, as well as cabling requirements over time. Site metrics such as cost, power needs, and bandwidth are automatically updated on the screen during these editing sessions. In addition to site management, the system provides a 3D data visualization environment based on the pivot-table model: the user can interactively move between different views of the selected data sets to analyze and visualize information about the site. The system runs in both Windows and Macintosh environments, leverages the advanced graphics capabilities of current hardware, and provides extensions to work with external analysis engines such as Matlab. Initial feedback on this tool has been very positive, and we expect to move to broader user trials this fall. We would be happy to give a presentation and demo of the system at the conference.

High-Performance Computing and Visual Interaction with Large Protein Datasets

Amitabh Varshney, University of Maryland

Proteins comprise a vast family of biological macromolecules whose structure and function make them vital to all cellular processes. Understanding the relationship between protein structure and function, and the ability to predict a protein’s role given its sequence or structure, is the central problem in proteomics and the greatest challenge for structural biologists in the postgenomic era. The computation and visualization of various protein properties is vital to this effort. We are addressing this challenge using a two-pronged strategy: (a) The emergence of multi-core CPUs and GPUs heralds the beginning of a new era in high-performance parallel computing. Multi-core CPUs and multi-core GPUs provide us with a set of complementary computational models—a traditional von Neumann model and a newer streaming computational model. We have characterized the kinds of applications that are well suited to each and have systematically explored the mapping of computation in one specific domain—proteomics—to each. Our work has focused on mapping various protein properties, such as solvent-accessible surfaces and electrostatics, to the heterogeneous MIMD/SPMD computation pathways on a CPU-GPU commodity cluster environment; (b) We are working on tightly coupling the computation and visualization of large-scale proteins to allow user-assisted computational steering on large-area, high-resolution tiled displays. Because visual comprehension is greatly aided by interactive visualization, abstraction, and lighting, we are also exploring techniques to enhance comprehensibility of large-scale datasets, including protein datasets. We are currently targeting protein ion channels in our research. Ion channels are a special class of proteins that are embedded in the lipid bilayer of cell membranes and are responsible for a wide variety of functions in humans.
Improper functioning of ion-channels is believed to be the cause behind several ailments including Alzheimer’s disease, stroke, and cystic fibrosis.

Bio-Data: Schafler Auditorium

Chair: Mark Wilkinson

Semantic Empowerment of Life Science Applications

Amit Sheth, University of Georgia

Life Science research today deals with highly heterogeneous as well as massive amounts of data. We can realize the full potential of this data only with more automated ways of integration and analysis leading to insight and discovery—to understand cellular components, molecular functions and biological processes, and, more importantly, the complex interactions and interdependencies between them.

This talk will demonstrate some of the efforts in:

  • building large life science ontologies (GlycO, an ontology for the structure and function of glycopeptides, and ProPreO, an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications
  • entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (such as mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data
  • semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive

Results presented here are from the NSF-funded Semantic Discovery project and the NIH-funded NCRR Integrated Technology Resource for Biomedical Glycomics, in collaboration with CCRC, UGA. Primary contributors include William S. York, Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas and Cory Henson.

Knowledge For the Masses, From the Masses

Mark Wilkinson, University of British Columbia, Canada

Knowledge Acquisition (KA) has historically been an expensive undertaking, particularly when applied to specific expert domains. Traditional KA methodology generally consists of a trained knowledge engineer working directly with one or more domain experts to encode their specific individual understanding of the domain into a formal logical framework. Since domain experts are expensive, generally have little time for such an exercise, and represent only one viewpoint, we suggest that a more representative knowledge model (ontology) can be constructed more cheaply using a mass-collaborative methodology.

A prototype methodology, the iCAPTURer, was deployed at a cardiovascular and pulmonary disease conference in 2005. The iCAPTURer, and its second generation follow-up, revealed that template-based “chatterbot”-like interfaces could rapidly accumulate and validate knowledge from a large volunteer expert community. A question remained, however, as to the utility and/or quality of the resulting ontology.

Examination of existing standards for ontology evaluation revealed a lack of objective, philosophically grounded, and automatable approaches. As such, it was necessary to design a metric appropriate for evaluating the mass-collaborative ontologies we were creating. In this presentation we will discuss the iCAPTURer mass-collaboration methodology and possible extensions to it. We will then discuss the various categories of ontology evaluation metrics, including a novel epistemologically grounded method developed in our laboratory, and examine the strengths and weaknesses of each. Finally we will show the results of our evaluation methodology as applied to the Gene Ontology, one of the most widely used ontologies in bioinformatics.

A Data Management Framework for Bioinformatics Applications

Dan Sullivan, Virginia Bioinformatics Institute

The Cyber-infrastructure (CI) group at the Virginia Bioinformatics Institute has established functional CI systems in the areas of bioinformatics and computational biology, with a focus on infectious diseases. Specifically, the CI projects include the Pathogen Portal project, the PathoSystems Resource Integration Center and the Proteomics Data Center. The bioinformatics resources developed by the CI group include tools for the curation of the genomes and PathoSystems, database systems for organizing the high-throughput data generated from the study of PathoSystems biology and software systems for analysis and visualization of the data. Integration across multiple domains is essential to enhance the functionality of CI systems. To this end, the group has formulated an integration framework based on four dimensions: data flows, schema structures, database models, and levels of system biology. This presentation focuses on the data flow dimension and describes mechanisms for coordinating the use of multiple sources of data; including database federation, Web services, and client level integration. It will also include a discussion of data provenance. Examples and use cases are drawn from projects underway at the Virginia Bioinformatics Institute.

Databases in eScience: Mudd Hall

Chair: Stuart Ozer

Building a Secure, Comprehensive Clinical Data Warehouse at the Veterans Health Administration

Jack Bates, Veterans Health Administration; Stuart Ozer, Microsoft Research

The U.S. Veterans Health Administration maintains one of the most advanced electronic health records systems in the world (VISTA), spanning a patient base of over 5 million active patients across a network of over 1,200 clinics and hospitals. This year an enterprise-wide Data Warehouse was launched at the VHA, with a charter to extract historical and daily data from VISTA and other sources and assemble it into a comprehensive database covering all aspects of patient care. The Warehouse is already populated with more than a billion historical vital-sign records, and pharmacy and outpatient-encounter clinical information is now being loaded. New subject areas are being integrated continually, and the Warehouse will eventually contain terabytes of data spanning areas as diverse as inpatient and outpatient care, administration, and finance.

The Data Warehouse is designed to support clinical research, generate national and regional metrics and improve the quality of care throughout the VHA. In our talk we discuss the database design—our use of multiple star schemas, partitioned fact tables and conformed dimensions—as well as the common principles we use to extract data from the VISTA system. We will describe the state-of-the-art hardware environment hosting both the large database and the extraction tools. Inevitably, data quality issues are discovered when actual historical data are extracted into the Warehouse for the first time, and we will present examples of how these problems have been resolved. We also review the research opportunities and some of the early results enabled by this environment and explain how the database design process has been able to accommodate the needs of both researchers and management. There are challenges inherent in bringing sensitive information together into a system that is accessible for research queries, but which also must protect patient confidentiality and adhere to HIPAA requirements; we describe how the VHA Data Warehouse has pursued this balance.

Analysis of Protein Folding Dynamics

David Beck & Catherine Kehl, University of Washington

The Protein Data Bank (PDB) is an important repository of experimentally derived, static protein structures that have stimulated many important scientific discoveries. While the utility of static physical representations of proteins is not in doubt, as these molecules are fluid in vivo, there is a larger universe of knowledge to be tapped regarding the dynamics of proteins. Thus, we are constructing a complementary database comprised of molecular dynamics (MD) simulation structures for representatives of all known protein topologies or folds. We are calling this effort Dynameomics. For each fold a representative protein is simulated in its native (i.e., biologically relevant) state and along its complete unfolding pathway. There are approximately 1130 known non-redundant folds, of which we have simulated the first 250 that represent about 75% of all known proteins. We are data-mining the resulting 15 terabytes of data (not including solvent) for patterns and general features of protein dynamics and folding across all folds in addition to identifying important phenomena related to individual proteins. The data are stored in Microsoft SQL Server’s OLAP (On-Line Analytical Processing) implementation, Analysis Services. OLAP’s design is appropriate for modeling MD simulations’ inherently highly multi-dimensional data in ways that traditional relational tables are not. In particular, OLAP databases are optimized for analyses rather than transactions. The multi-dimensional expressions (MDX) query language seems to be well suited for writing complex analytical queries. This application of Microsoft’s OLAP technology is a novel use of traditional financial data management tools in the science sector.
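As an illustration of the multidimensional, analysis-oriented access pattern that OLAP serves (not the authors' actual Analysis Services schema), here is a hypothetical Python sketch of a roll-up over per-frame simulation records; the fold names, residues, and measure values are invented.

```python
from collections import defaultdict

# Hypothetical per-frame records from MD trajectories: each row is
# (protein_fold, simulation_time_ns, residue, measure_of_interest).
frames = [
    ("beta-barrel", 0.1, "ALA10", 1.2),
    ("beta-barrel", 0.2, "ALA10", 1.4),
    ("tim-barrel",  0.1, "GLY5",  0.9),
]

def rollup(rows, dims):
    """OLAP-style roll-up: aggregate the measure (last column) over a
    chosen subset of dimensions, given as a tuple of column indices."""
    cube = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in dims)
        cube[key].append(row[-1])
    return {k: sum(v) / len(v) for k, v in cube.items()}

# Slice by fold only (dimension 0): mean measure per fold.
by_fold = rollup(frames, (0,))
```

An MDX query against an Analysis Services cube expresses the same operation declaratively, with the aggregation precomputed by the server, which is what makes analytical queries over terabytes of trajectory data tractable.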

Advanced Software Framework for Comparative Analysis of RNA Sequences, Structures and Phylogeny

Kishore Doshi, The University of Texas at Austin; Stuart Ozer, Microsoft Research

A basic principle in Molecular Biology is that the three-dimensional structure of macromolecules such as proteins and RNAs dictates their function. Thus, the ability to predict the structure of an RNA or protein from its sequence represents one of the grand challenges in Molecular Biology today. Comparing RNA sequences from diverse organisms spanning the tree of life has resulted in the extremely accurate determination of some RNA structures. For example, the Ribosomal RNA (rRNA) structures were predicted using fewer than 10,000 sequences. This analysis, while very successful, can be significantly enriched by expanding it to include the 500,000+ Ribosomal RNA sequences which have been identified in Genbank as of August 2006, as well as new sequences which are continually appearing. A significant impediment to analyzing large RNA sequence datasets such as the rRNA is the lack of software tools capable of efficiently manipulating large datasets.

We are developing a comprehensive information technology infrastructure for the comparative analysis of RNA sequences and structures. One of the biggest challenges in developing software for comparative analysis is handling the memory-intensive nature of alignment construction and analysis: in-memory footprints for large RNA sequence alignments can eclipse 50GB in some cases. Our solution is based on a simple concept: co-locate the computational analysis with the data. Using Microsoft SQL Server 2005, T-SQL and C#-based stored procedures, we have successfully prototyped the integration of RNA sequence alignment storage with the most common RNA comparative analysis algorithms in a relational database system. We intend to scale up this prototype into a fully featured public repository and eventually deliver web services for the comparative analysis of RNA sequences. In this talk, we will present a short background on RNA comparative analysis, and then focus on our framework architecture, ending with a brief demonstration of our functional prototype.
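The "co-locate the computation with the data" idea can be sketched outside SQL Server as well. The toy below uses SQLite's user-defined SQL functions in place of C#-based stored procedures; the table, columns, and the GC-content measure are hypothetical stand-ins for the real alignment analyses.

```python
import sqlite3

def gc_content(seq):
    """Fraction of G/C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

conn = sqlite3.connect(":memory:")
# Register the analysis as a SQL function so it runs inside the
# database engine, next to the stored sequences.
conn.create_function("gc_content", 1, gc_content)
conn.execute("CREATE TABLE rna (id TEXT, seq TEXT)")
conn.executemany("INSERT INTO rna VALUES (?, ?)",
                 [("r1", "GGCCAU"), ("r2", "AUAUAU")])
rows = conn.execute(
    "SELECT id, gc_content(seq) FROM rna ORDER BY id").fetchall()
```

The payoff of the real system is the same as in this toy: the analysis moves to the data, so a 50GB alignment never has to be shipped into an application process's memory.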

Data Organization: Remsen One

Chair: TBA

Indexing and Visualizing Large Multidimensional Databases

Istvan Csabai, Eötvös Loránd University, Hungary

Scientific endeavors such as large astronomical surveys generate databases on the terabyte scale. These databases, usually multidimensional, must be visualized and mined in order to find interesting objects or to extract meaningful and qualitatively new relationships. Many statistical algorithms required for these tasks run reasonably fast when operating on small sets of in-memory data, but take noticeable performance hits when operating on large databases that do not fit into memory. We utilize new software technologies to develop and evaluate fast multi-dimensional indexing schemes that inherently follow the underlying, highly non-uniform distribution of the data: one of them is hierarchical binary space partitioning; the other is sampled flat Voronoi partitioning of the data.

Our working database is the 5-dimensional magnitude space of the Sloan Digital Sky Survey with more than 250 million data points. We use this to show that these techniques can dramatically speed up data mining operations such as finding similar objects by example, classifying objects or comparing extensive simulation sets with observations. We are also developing tools to interact with the multi-dimensional database and visualize the data at multiple resolutions in an adaptive manner.
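The hierarchical binary space partitioning mentioned above can be illustrated with a toy k-d tree. The real SDSS indexes are far more elaborate and adapt to the data distribution, so treat this only as a sketch of the recursive-partitioning idea behind find-similar-objects-by-example queries.

```python
def build_kdtree(points, depth=0):
    """Recursively split point set by the median along cycling axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, depth=0, best=None):
    """Nearest neighbor: descend the near half-space, and visit the far
    one only if it could still contain a closer point."""
    if node is None:
        return best
    dist = sum((a - b) ** 2 for a, b in zip(node["point"], target))
    if best is None or dist < best[1]:
        best = (node["point"], dist)
    axis = depth % len(target)
    diff = target[axis] - node["point"][axis]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = nearest(node[near], target, depth + 1, best)
    if diff ** 2 < best[1]:
        best = nearest(node[far], target, depth + 1, best)
    return best
```

In a 5-dimensional magnitude space, "finding similar objects by example" is exactly this query: the example object is the target, and the partitioning prunes most of the 250 million points from consideration.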

Database Support For Unstructured Tetrahedral Meshes

Stratos Papadomanolakis, Carnegie Mellon University

Computer simulation is crucial for numerous scientific disciplines, such as fluid dynamics and earthquake modeling. Modern simulations consume large amounts of complex multidimensional data and produce an even larger output that typically describes the time evolution of a complex phenomenon. This output is then “queried” by visualization or other analysis tools. We need new data management techniques in order to scale such tools to the terabyte data volumes available through modern simulations.

We present our work on database support for unstructured tetrahedral meshes, a data organization typical for simulations. We develop efficient query execution algorithms for three important query types for simulation applications: point, range and feature queries. Point and range queries return one or more tetrahedra that contain a query point or intersect a query range respectively, while feature queries return arbitrarily shaped sets of tetrahedra (such as a mesh surface). We propose Directed Local Search (DLS), a query processing strategy based on mesh topology: we maintain connectivity information for each tetrahedron and use it to “walk” through connected mesh regions, progressively computing the query answer. DLS outperforms existing multidimensional indexing techniques that are based on geometric approximations (like minimum bounding rectangles), because the latter cannot effectively capture the geometric complexity in meshes. Furthermore, DLS can be easily and efficiently implemented within modern DBMS without requiring new exotic index structures and complex pre-processing.
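The topology-driven "walk" behind DLS can be sketched in one dimension, where cells are intervals that each know their neighbors. This is an illustrative reduction of the idea, not the authors' tetrahedral-mesh implementation, where the walk follows face-adjacency toward the query point.

```python
def make_cells(breaks):
    """Build a chain of interval cells [b[i], b[i+1]) with neighbor links,
    standing in for tetrahedra with face-adjacency information."""
    cells = [{"lo": lo, "hi": hi, "prev": None, "next": None}
             for lo, hi in zip(breaks, breaks[1:])]
    for a, b in zip(cells, cells[1:]):
        a["next"], b["prev"] = b, a
    return cells

def point_query(seed, x):
    """Walk from a seed cell toward the cell containing x by following
    neighbor links, instead of consulting a global geometric index."""
    cell, steps = seed, 0
    while not (cell["lo"] <= x < cell["hi"]):
        cell = cell["next"] if x >= cell["hi"] else cell["prev"]
        steps += 1
    return cell, steps
```

The cost of the query is the length of the walk, which in a mesh depends on how close the seed is to the answer rather than on the total number of cells; a good seed (e.g., the previous query's answer) makes successive point queries nearly constant-time.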

Building a Data Management Platform for the Scientific and Engineering Communities

José A. Blakeley, Brian Beckman, Microsoft; Tamás Budavári, The Johns Hopkins University; Gerd Heber, Cornell University

The convergence of database systems, file systems and programming language technologies is blurring the lines between records and files, directories and tables, and programs and query languages that deal with in-memory arrays as well as with persisted tables. Relational database systems have been extended to support XML, large binary objects directly as files, and are incorporating runtime systems (such as Java VM and .NET CLR) to enable scientific models, programs and libraries (such as LAPACK) to run close to the data. Scientific file formats such as HDF5 and NetCDF define their content using higher level semantic models (such as UML and Entity Relationship). Programming languages are incorporating native, declarative, set-oriented query capabilities (such as LINQ/XLINQ), which will enable support for cost-based query optimization techniques. Programming languages are also integrating transactions with exception handling to enable more reliable programming patterns. Practitioners have learned that neither file aggregates (HDF, NetCDF) nor RDBMS alone present a one-size-fits-all solution to the most common data management problems facing the scientific and engineering communities. However, the convergence of the technologies mentioned offers a unique opportunity to build a data management and data integration platform that will embrace their strengths, creating new paradigms that will revolutionize scientific programming and data modeling in the next decade. Based on our combined experience in building an industry-leading relational DBMS and use cases drawn from typical scientific and engineering applications in astronomy and computational materials science, we propose the architecture of a unified data management platform for the computational science and engineering communities.
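The declarative, set-oriented query capabilities attributed to LINQ above can be approximated with Python comprehensions: the query states what to compute over a collection, leaving iteration strategy to the runtime, which is precisely the property that opens the door to cost-based optimization. The observation records below are invented for illustration.

```python
# Hypothetical survey records, as a language-native collection.
observations = [
    {"obj": "A", "band": "r", "mag": 17.2},
    {"obj": "B", "band": "r", "mag": 21.9},
    {"obj": "C", "band": "g", "mag": 16.4},
]

# LINQ analogue:
#   from o in observations where o.band == "r" && o.mag < 20 select o.obj
bright_r = sorted(o["obj"] for o in observations
                  if o["band"] == "r" and o["mag"] < 20)
```

The point of the convergence argument is that the same declarative expression could run against an in-memory array, a relational table, or an HDF5 file without the scientist rewriting it.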

Scientific Publications and Archiving: STI Auditorium

Chair: Winston Tabb

Next-Generation Implications of Open Access

Paul Ginsparg, Cornell University

True open access to scientific publications not only gives readers the possibility to read articles without paying a subscription, but also makes the material available for automated ingestion and harvesting by 3rd parties. Once articles and associated data become universally treatable as such computable objects, openly available to 3rd party aggregators and value-added services, what new services can we expect, and how will they change the way that researchers interact with their scholarly communications infrastructure? I will discuss straightforward applications of existing ideas and services, including clustering, citation analysis, collaborative filtering, external database linkages, and other forms of automated markup, and then will speculate on as yet unrealized modes of harvesting and creating new knowledge.

Long Term Data Storage

Paul Ginsparg, Cornell University

In August 2006 NASA announced it had lost the original moon landing video transmissions, dramatizing the risks of long-term data storage. Perhaps less noticed than the conventional problem of misplacing some boxes was that “The only known equipment on which the original analogue tapes can be decoded is at a Goddard centre set to close in October, raising fears that even if they are found before they deteriorate, copying them may be impossible” (Sydney Morning Herald). Today the risk that a format will become obsolete, or that nobody will remember what the data format is, exceeds the risk that a box of stuff will be lost. The quantity of data spewing from electronic sensors, and the storage of this data in formats not intelligible by humans, make long-term preservation something to consider from the beginning. The tendency for data to be stored by projects with short-term funding rather than by institutions that accept long-term responsibility does not help. Against that, we have the great advantage that any digital copy is equivalent for future use. What should be done? Technical suggestions might include:

1) University libraries and archives taking a greater role in data storage;

2) Encouraging public data standards for complex data;

3) Expanding efforts like LOCKSS and encouraging their diversification into data storage as well as journal and book preservation;

4) Agreeing on a formal description of a query language so that websites representing the “dark web” can provide a machine-interpretable and standardized explanation of what kind of queries they accept.

Perhaps more important, however, are some non-technical issues, such as agreeing on a formal description for digital rights management controls and creating a conference and/or journal devoted to scientific triumphs found by analyzing old data, to raise the scholarly interest and prestige of data preservation.

Digital Data Preservation and Curation: A Collaboration Among Libraries, Publishers and the Virtual Observatory

Robert Hanisch, Space Telescope Science Institute

Astronomers are producing and analyzing data at ever more prodigious rates.

NASA’s Great Observatories, ground-based national observatories, and major survey projects have archive and data distribution systems in place to manage their standard data products, and these are now interlinked through the protocols and metadata standards agreed upon in the Virtual Observatory (VO).

However, the digital data associated with peer-reviewed publications is only rarely archived. Most often, astronomers publish graphical representations of their data but not the data themselves. Other astronomers cannot readily inspect the data to either confirm the interpretation presented in a paper or extend the analysis. Highly processed data sets reside on departmental servers and the personal computers of astronomers, and may or may not be available a few years hence.

We are investigating ways to preserve and curate the digital data associated with peer-reviewed journals in astronomy. The technology and standards of the VO provide one component of the necessary technology. A variety of underlying systems can be used to physically host a data repository, and indeed this repository need not be centralized. The repository, however, must be managed and data must be documented through high quality, curated metadata. Multiple access portals must be available: the original journal, the host data center, the Virtual Observatory, or any number of topically-oriented data services utilizing VO-standard access mechanisms.

Scientific Workflow: Schafler Auditorium

Chair: Shirley Cohen

Automation of Large-scale Network-Based Scientific Workflows

Mladen A. Vouk, North Carolina State University

Comprehensive, end-to-end, data and workflow management solutions are needed to handle the increasing complexity of processes and data volumes associated with modern distributed scientific problem solving, such as ultra-scale simulations and high-throughput experiments. The key to the solution is an integrated network-based framework that is functional, dependable, fault-tolerant, and supports data and process provenance.

Such a framework needs to make application workflows dramatically easier to develop and use so that scientists’ efforts can shift away from data management and application development to scientific research and discovery. An integrated view of these activities is provided by the notion of Scientific Workflows—a series of structured activities and computations that arise in scientific problem-solving. This presentation discusses long-term practical experiences of the U.S. Department of Energy Scientific Data Management Center with automation of large scientific workflows using modern workflow support frameworks. Several case studies in the domains of astrophysics, fusion and bioinformatics, that illustrate reusability, substitutability, extensibility, customizability and composability principles of scientific process automation, are discussed. Solution fault-tolerance, ease of use, data and process provenance, and framework interoperability are given special attention. Advantages and disadvantages of several existing frameworks are compared.

Using Flowcharts to Script Scientific Workflows

Furrukh Khan, The Ohio State University

We note that the flowchart is a fundamental artifact in scientific simulation code. Unfortunately even though the flowchart is initially used by scientists to model the simulation, it is not preserved as an integral part of the code. We argue that by mapping flowcharts to workflows and leveraging Microsoft Workflow Foundation (WF) the flowchart can be separated out of the implementation code as a “first class” citizen. This separation can have profound impact on the future maintainability and transparency of the code. Furthermore, WF provides the components required by scientists to build systems for dynamically visualizing, monitoring, tracing, and altering the simulations. We also note that projects for developing, running, and maintaining complex scientific simulations are often based on distributed teams. These collaborations not only involve human-to-human workflows but also scenarios where the low lying simulation flowcharts (separated out as first class citizens) take part in higher level human workflows. We argue that the current version of Microsoft SharePoint Server with integral support for WF serves as an ideal portal for these collaborations. It provides scientists services like security, role-based authentication, team membership, discussion lists and implementation of member-to-member workflows. Furthermore, by using the Microsoft technology, Windows Communication Foundation (WCF), systems can be built that securely connect the low lying simulation workflows (running as WCF Web Services) to high level human workflows so that simulations can be visualized within the context of SharePoint. We also show that Atlas, another Microsoft technology, can be used in synergy with WF to provide highly responsive platform agnostic (Windows, Linux, Mac) browser-based smart clients in the context of SharePoint. 
We give examples and preliminary results from the computational electromagnetics domain based on our recently started project in collaboration with the ElectroScience Laboratory at The Ohio State University.

Scientific Workflows: More e-Science Mileage from Cyberinfrastructure

Bertram Ludaescher, University of California, Davis

We view scientific workflows as the domain scientist’s way to harness cyberinfrastructure for e-Science. Through various collaborative projects over the last couple of years, we have gained first-hand experience in the challenges faced when trying to realize the vision of scientific workflows. Domain scientists are often interested in “end-to-end” frameworks which include data acquisition, transformation, analysis, visualization, and other steps. While there is no lack of technologies and standards to choose from, a simple, unified framework combining data and process-oriented modeling and design for scientific workflows has yet to emerge.

Using experiences from continuing projects as well as from a recently awarded collaboration with a leading group in ChIP-chip analysis workflows (Chromatin ImmunoPrecipitation followed by genomic DNA microarray analysis), we highlight the requirements and design challenges typical of many large-scale bioinformatics workflows: Raw and derived data products come in many forms and from different sources, including custom scripts and specialized packages, e.g. for statistical analysis or data mining. Not surprisingly, the process integration problems are not solved by “making everything a web service”, nor are the data integration problems solved by “making everything XML”. The real workflow challenges are more intricate and will not go away with the adoption of any easy, one-size-fits-all silver-bullet solution or standard. The problems are further compounded by the scientists’ need to compare results from multiple workflow runs, employing various alternative (and often brand-new) analysis methods, algorithms, and parameter settings.

We describe ongoing work to combine various concepts and techniques (models of computation and provenance, actor- and flow-oriented programming, higher-order components, adapters, and hybrid types) into a coherent overall framework for collection-oriented scientific workflow modeling and design. The initial focus of our work is not on optimizing machine performance (e.g., CPU cycles or memory resources), but on optimizing a more precious resource in scientific data management and analysis: human (i.e., scientists’) time.


Chair: Yan Xu

Building Lab Information Management Systems

Qi Sun, Cornell University

At the bioinformatics core facility for Cornell University, we are managing data for multiple genomics and proteomics laboratories. In the last few years, we have established a working model of using Microsoft SQL Server as the database system, ASP.NET as the user interface and Windows Compute Cluster as the data analysis platform. Here we will present two lab information management systems (LIMS), Pathogen Tracker and PPDB, representing two of the fastest-growing biological research fields: genetic diversity and proteomics. The Pathogen Tracker software is a collaboration with the Cornell Food Safety Laboratory. It includes a database and an ASP.NET web application written with Visual Basic 2005. It is being used as a tool for information exchange on bacterial subtypes and strains and for studies on bacterial biodiversity and strain diversity.

The system has a user management system, enables the research community to contribute their data to this database through the web, allows open data exchange, and facilitates large-scale analyses and studies on bacterial biodiversity. PPDB is a LIMS for managing mass spectrometry proteomics data; it is developed with Dr. Klaas van Wijk’s proteomics laboratory. The web interface we designed makes it easier for users to integrate and compare data from multiple sources. We also take advantage of the graphics library that comes with Visual Studio 2005 for generating on-the-fly images in the 2-D gel data navigation tool.

Computational Nanoelectronics

Hong Guo, McGill University

One of the most important branches of nano-science and nanotechnology research is nano-scale electronics, or nanoelectronics. Nanoelectronic devices operate by principles of quantum mechanics; their properties are closely related to the atomic and molecular structure of the device. It has been a great challenge to predict nano-scale device characteristics, especially if one wishes to predict them without using any phenomenological parameter. To advance nanoelectronic device technology, an urgent goal is to develop computational tools which can make quantitative, accurate, and efficient calculations of nanoelectronic systems from quantum-mechanical first principles.

In this presentation, I will briefly review the present status of nanoelectronic device theory, the existing theoretical, numerical and computational difficulties, and some important problems of nanoelectronics. I will then report a particularly useful advance we have achieved toward quantitative predictions of non-equilibrium and non-linear charge/spin quantum transport in nanoelectronic devices from an atomic point of view. Quantitative comparisons to measured experimental data will be presented. Several examples will be given, including electric conduction in nano-wires and magnetic switching devices. Finally, I will briefly outline the existing challenges of computational nanoelectronics and the development of computational tools powerful enough for nanoelectronics design automation.

MotifSpace: Mining Patterns in Protein Structures

Wei Wang, University of North Carolina

One of the next great frontiers in molecular biology is to understand and predict protein function. Proteins are simple linear chains of polymerized amino acids (residues) whose biological functions are determined by the three-dimensional shapes that they fold into. Hence, understanding proteins requires a unique combination of chemical and geometric analysis. A popular approach to understanding proteins is to break them down into structural sub-components called motifs. Motifs are recurring structural and spatial units that are frequently correlated with specific protein functions. Traditionally, the discovery of motifs has been a laborious task of scientific exploration.

In this talk, I will present an eScience project, MotifSpace, which includes recent data-mining algorithms that we have developed for automatically identifying potential spatial motifs. Our methods automatically find frequently occurring substructures within graph-based representations of proteins. We represent each protein’s structure as a graph, where vertices correspond to residues. Two types of edges connect residues: sequence edges connect pairs of adjacent residues in the primary sequence, and proximity edges represent physical distances, which are indicative of intra-molecular interactions. Such interactions are believed to be key indicators of the protein’s function.
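
As an illustration of this graph representation (a sketch only — the residue coordinates, the 8 Å contact threshold, and the function name below are assumptions, not taken from MotifSpace itself), such a graph might be built as:

```python
import math

def protein_graph(residues, contact_threshold=8.0):
    """Build a graph over residues, each given as (name, (x, y, z)).

    Returns the vertex labels plus two edge sets: 'sequence' edges
    between residues adjacent in the primary chain, and 'proximity'
    edges between residue pairs closer than contact_threshold (in
    angstroms, an assumed cutoff for intra-molecular contact).
    """
    n = len(residues)
    sequence_edges = [(i, i + 1) for i in range(n - 1)]
    proximity_edges = [
        (i, j)
        for i in range(n)
        for j in range(i + 2, n)  # skip pairs already linked by sequence
        if math.dist(residues[i][1], residues[j][1]) < contact_threshold
    ]
    return {"vertices": [r[0] for r in residues],
            "sequence": sequence_edges,
            "proximity": proximity_edges}

# A toy 4-residue chain folded back on itself, so the ends end up close
# in space even though they are far apart in the sequence.
chain = [("ALA", (0.0, 0.0, 0.0)),
         ("GLY", (3.8, 0.0, 0.0)),
         ("SER", (3.8, 3.8, 0.0)),
         ("LYS", (0.0, 3.8, 0.0))]
g = protein_graph(chain)
```

In this toy fold, the first and last residues gain a proximity edge despite being non-adjacent in sequence — exactly the kind of spatial contact the abstract describes as a functional indicator.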

This representation allows us to apply innovative graph mining techniques to explore protein databases and associated protein families. The complexity of protein structures and corresponding graphs poses significant computational challenges. The kernel of MotifSpace is an efficient subgraph-mining algorithm that detects all (maximal) frequent subgraphs from a graph database with a user-specified minimal frequency. Our algorithm uses the pattern growth paradigm with an efficient depth-first enumeration scheme, searching through the graph space for frequent subgraphs. Our most recent algorithms incorporate several improvements that take into account specific properties of protein structures.
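
To make the mining step concrete, the following sketch shows only the seed stage of a pattern-growth miner — enumerating one-edge patterns that meet a user-specified minimum frequency, which the depth-first enumeration described above would then extend into larger subgraphs. The data and function name are illustrative, not MotifSpace code:

```python
from collections import Counter

def frequent_edge_patterns(graphs, min_support):
    """Seed step of pattern-growth mining: find edge patterns (an
    unordered pair of vertex labels) occurring in at least
    min_support graphs.  Each graph is a list of labeled edges."""
    support = Counter()
    for g in graphs:
        # Count each pattern at most once per graph, regardless of
        # how many times it occurs inside that graph.
        seen = {tuple(sorted(edge)) for edge in g}
        support.update(seen)
    return {p for p, count in support.items() if count >= min_support}

# A toy database of three protein graphs, edges labeled by residue type.
db = [
    [("CYS", "HIS"), ("HIS", "SER")],
    [("CYS", "HIS"), ("ALA", "GLY")],
    [("HIS", "CYS")],  # same pattern as ("CYS", "HIS"), just reordered
]
pats = frequent_edge_patterns(db, min_support=2)
```

A real miner would now try to grow each surviving pattern by one edge at a time, pruning any extension that falls below the support threshold — the anti-monotonicity that makes depth-first enumeration tractable.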

Data Classification: Remsen One

Chair: Tony Tyson

Physical Science, Computational Science and eScience: the Strategic Role of Interdisciplinary Computing

Tim Clark, Harvard University

Data in the physical and life sciences is being accumulated at an astonishing and ever-increasing rate. The so-called “data deluge” has already outpaced scientists’ ability to exploit the wealth of information at their disposal. In order to make progress, scientists, as they have in the past, need to ask specific, well-formulated questions of the data. Many of these questions now require an unprecedented amount and variety of computing to answer, and many of the computational challenges are shared between seemingly disparate scientific disciplines. Therefore, to achieve a leadership position in many sciences today requires a strong interdisciplinary collaboration between experts in the scientific and computational disciplines, supported by an advanced computing infrastructure and skilled personnel.

Harvard’s Initiative in Innovative Computing (IIC) was launched through the Provost’s Office in late 2005 to enable the rapid expansion of advanced interdisciplinary work in scientific computing at Harvard. It aims to establish a robust yet flexible framework for research at the creative intersection between the computing disciplines and the sciences. The IIC’s research agenda encompasses a diverse array of innovative projects designed to push the boundaries of both computing and science. These projects are proposed by, and carried out in close collaboration with, researchers throughout Harvard. To keep the IIC’s agenda current, projects have a limited duration, and new ones are periodically solicited, reviewed, and added. The IIC will continuously generate and exploit some of the most exciting, meaningful opportunities for new discoveries in contemporary science.

This talk will explore some of the strategic implications and challenges of developing a program like IIC, why it is mandatory for achieving leadership in many scientific disciplines, and share some lessons learned.

Cleaning Scientific Data Objects

Dongwon Lee, The Pennsylvania State University

Real scientific data are often dirty, either syntactically or semantically. Despite active research on integrity constraint enforcement and data cleaning, real data in real scientific applications are still dirty. Issues such as the heterogeneous formats of modern data, imperfect software for extracting metadata, the demand for large-scale scientific processing, and the lack of useful cleaning tools or system support only make the problem harder. When the base data are dirty, one cannot avoid the so-called “garbage-in, garbage-out” phenomenon. Therefore, improving the quality of data objects has direct impacts and implications in many scientific applications.

In this talk, in the context of the Quagga project, which I am leading, I will present various dirty (meta-)data problems drawn from real-world cases and their potential solutions. In particular, I’ll present my recent work on: (1) a scalable group linkage technique to identify duplicate data objects quickly, (2) effective scientific data cleaning by Googling, (3) value imputation on microarray data sets, and (4) semantically-abnormal data detection (e.g., detecting fake conferences and journals).

Quagga project:

Part of this research was supported by a Microsoft SciData award in 2005.

Some Classification Problems from Synoptic Sky Surveys

S. George Djorgovski et al., California Institute of Technology

Analysis of data from modern digital sky surveys (individual or federated within the VO) poses a number of interesting challenges.

This is especially true for the new generation of synoptic sky surveys, which repeatedly cover large areas of the sky, producing massive data streams and requiring a self-federation of moderately heterogeneous data sets. One problem is optimal star-galaxy classification using data from multiple passes, incorporating external, contextual, or a priori information. Some problems require very demanding real-time analysis, e.g., automated, robust detection and classification of transient events using relatively sparse and heterogeneous data (a few data points from the survey itself, plus information from other, multiwavelength data sets covering the same location on the sky), and a dynamical version of this process which iterates the classification as follow-up data are harvested and incorporated in the analysis.

We will illustrate these challenges with examples from the ongoing Palomar-Quest survey, but they will become increasingly critical with the advent of even more ambitious projects such as PanSTARRS and LSST. We will also discuss some other, general issues posed by the scientific exploration of such rich data sets.

Scientific Publications: STI Auditorium

Chair: Tony Tyson

Infrastructure to Support New Forms of eScience, Publishing and Digital Libraries

Carl Lagoze, Cornell University

We are in the midst of radical changes in the way that scholars produce, share, and access the results of their work and that of their colleagues. High speed networking and computing combined with newly emerging collaborative tools will enable a new scholarly communication paradigm that is more immediate, distributed, data-centric, and dynamic. These new tools are essential for science as it confronts rapidly emerging problems such as global warming and pandemics.

In our research we are investigating infrastructure to support this new paradigm. This infrastructure allows the flexible composition of information units (such as text, data sets, and images) and distributed services for the formation of new types of scholarly results and new methods of collaborating. In this talk we will describe several components of this work:

  • Fedora is open-source middleware supporting the representation, management, and dissemination of complex objects and their semantic relationships. These objects can combine distributed content, data, and web services. Fedora is the foundation for a number of international eScience initiatives including the Public Library of Science (PLOS), eSciDoc at the Max Planck Society, and the DART project in Australia.
  • An Information Network Overlay is an abstraction for building innovative digital libraries that integrate selected networked resources and services, and provide the context for reuse, annotation, and refactoring of information within them. This architecture forms the basis of the NSF-funded National Science Digital Library (NSDL).

The Repository Interoperability Framework (RIF), an outgrowth of the NSF-funded Pathways project, is developing standards to support the sharing of information units (such as data, images and content) among heterogeneous scholarly repositories. The core of this work is the articulation of a common data model to represent complex digital objects and service interfaces that allow sharing of information about these digital objects among repositories and clients.

The Scientific Paper of the Future

Timo Hannay, Nature Publishing Group

The emergence of online editions of scientific journals has produced huge benefits by making the literature searchable, interlinked and available directly from scientists’ desktops. Yet this development only scratches the surface of the potential of the internet to revolutionize scientific communication. At Nature Publishing Group (NPG) we often think of these opportunities in terms of the following ‘5Ds’:

  • Data Display: Figures no longer need to be static, but can become manipulable and interactive, and can provide readers with direct access to the underlying data.
  • Dynamic Delivery: The same information does not need to be delivered to every person each time, but can instead be tailored to a user’s specific interests and their immediate needs.
  • Deep Data: Journals need to become better integrated with scientific databases (and in some ways ought to become more like databases too).
  • Discussion & Dialogue: The web is a many-to-many network that enables direct discussion between readers, as well as modes of interaction that are more immediate and informal than the traditional publishing process allows.
  • Digital Discovery: Scientific information in an online world needs to be made useful not only to readers but also to software and other websites. Only in this way will the information become optimally useful to humans.

This presentation will summarize current activities in this area, inside NPG and elsewhere, and will look at where future trends might take us.

The Connection Between Scientific Literature and Data in Astronomy

Michael J. Kurtz, Harvard-Smithsonian Center for Astrophysics

For more than a century, journal articles have been the primary vector transporting scientific knowledge into the future; also during this time scientists have created and maintained complex systems of archives, preserving the primary information for their disciplines.

Modern communications and information processing technologies are enabling a synergism between the (now fully digital) archives and journals which can have profound effects on the future of research.

During the last approximately 20 years astronomers have been simultaneously building out the new digital systems for data and for literature, and have been merging these systems into a coherent, distributed whole.

Currently the system consists of a network of journals, data centers, and indexing agencies, which interact via a massive sharing of metadata between organizations. The system has been in active use for more than a decade; Peter Boyce named it Urania in 1997.

Astronomers are now on the verge of making a major expansion of these capabilities. Besides the ongoing improvement in the capabilities and interactions of existing organizations this expansion will entail the creation of new archiving and indexing organizations, as well as a new international supervisory structure for the development of metadata standards. The nature of scientific communication is clearly being changed by these developments, and with these changes will come others, such as: How will information be accessed? How will the work of individual scientists be evaluated? How will the publishing process be funded?

Abstracts 10/15

Scientific Workflow: Schafler Auditorium

Chair: Roger Barga

Automatic Capture and Efficient Storage of eScience Experiment Provenance

Roger Barga, Microsoft Research

Workflow is playing an increasingly important role in conducting e-Science experiments, but current systems lack the necessary support for the collection and management of provenance data. We argue that eScience provenance data should be automatically generated by the workflow enactment engine and managed over time by an underlying storage service.

In this presentation, we introduce a layered model for workflow execution provenance, which allows navigation from an abstract model of the experiment to instance data collected during a specific experiment run. We outline modest extensions to a commercial workflow engine so it will automatically capture this provenance data at runtime. We then present an approach to store this provenance data in a relational database engine. Finally, we identify important properties of provenance data captured by our model that can significantly reduce the amount of storage required, and demonstrate we can reduce the size of provenance data captured from an actual experiment to 0.4% of the original size, with modest performance overhead.

Taverna, a Workflow System for the Life Scientist in the Trenches

Tom Oinn, Manchester University, UK

Taverna is a workflow workbench designed for Life Scientists. It enables researchers with limited computing background and few technical resources to access and make use of global scientific resources. Taverna can link together local and remote data and analytical resources, in both the private and public domains, to run so-called “in silico experiments”. Taverna provides a language and open-source software tools that allow users to discover and access available web resources, construct complex analysis workflows, run these workflows on their own data and others’, and visualize the results. The principle is one of lowering the barrier to engagement for both the user and the service provider, and providing a lightweight, flexible solution wherever possible. Taverna also aims to support the whole in silico experiment lifecycle, emphasizing the management and mining of provenance metadata and the sharing of workflows amongst colleagues. In addition to the workbench, Taverna provides methods for easing the incorporation of new applications and old favorite tools, and promotes the role of common workflow patterns, the building of a body of verified protocol know-how, and service collections for specific problem sets.

GPFlow: A Pilot Workflow System for Interactive Bioinformatics

James M. Hogan, Queensland University of Technology, Australia

Modern genome bioinformatics is increasingly characterized by the use of application pipelines, with analyses realized through a chain of well-established and robust tools for tasks such as gene finding, sequence alignment, homology detection and motif discovery. Unfortunately, in many cases the pipeline is constructed through a laborious and error-prone process of manual data reformatting and transfer, greatly limiting throughput and the ability of the scientist to undertake novel investigations.

The GPFlow project builds on Microsoft business workflow technology to provide a flexible and intuitive workflow environment for both routine and exploratory bioinformatics based research. The system provides a high level, interactive web based front-end to scientists, using a workflow model based on a spreadsheet metaphor.

This talk presents the results of an Australian Research Council/Microsoft funded pilot project, focused on the analysis of gene regulation in bacteria through the use of broad-scale comparative studies. While a substantial number of bacterial genomes (approximately 200) have now been sequenced, studies of regulation are hampered by a paucity of reliably annotated regulatory regions outside model organisms such as Escherichia coli and Bacillus subtilis. The problem is compounded by the advent of rapid sequencing technologies and the need to rapidly integrate newly sequenced genomes into the comparative data set. Our demonstration workflows therefore support rapid determination of gene and regulatory homology across organisms, confirmation and discovery of regulatory motifs, and identification of their underlying relationships. We shall present results of comparative studies among various species of Chlamydia and Bacillus.

Managing Exploratory Workflows

Juliana Freire, University of Utah

VisTrails is a new workflow management system which provides support for scientific data exploration and visualization. Whereas workflows have been traditionally used to automate repetitive tasks, for applications that are exploratory in nature, very little is repeated—change is the norm. As a scientist generates and evaluates hypotheses about data under study, a series of different, albeit related, workflows are created while a workflow is adjusted in an interactive process. VisTrails was designed to manage these rapidly-evolving workflows. A novel feature of VisTrails is a change-based mechanism which uniformly captures provenance information for data products and for the workflows used to generate these products. By capturing the history of the exploration process and explicitly maintaining the relationships among the workflows created, VisTrails not only allows results to be reproduced, but it also enables users to efficiently and effectively navigate through the space of workflows in an exploration task. In addition, this provenance information is used to simplify the creation, maintenance and re-use of workflows; to optimize their execution; and to provide scalable mechanisms for collaborative exploration of large parameter spaces in a distributed setting. As an important goal of our project is to produce tools that scientists can use, VisTrails provides intuitive, point-and-click interfaces that allow users to interact with and query the provenance information, including the ability to visually compare different workflows and their results. In this talk, we will give an overview of VisTrails through a live demo of the system. More information about VisTrails is available at
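
The change-based mechanism can be illustrated with a toy version tree in which each node stores only the action that produced it, and a full workflow is materialized by replaying actions from the root. This is a hypothetical miniature, not the actual VisTrails data model; the module names and action format are invented:

```python
class VersionTree:
    """Change-based provenance sketch: each version records only its
    parent and one change action, so the whole exploration history is
    kept, and any workflow version can be rebuilt by replay."""

    def __init__(self):
        self.actions = {0: (None, None)}  # version -> (parent, action)
        self.next_id = 1

    def commit(self, parent, action):
        """Record a change action applied to an existing version."""
        v = self.next_id
        self.actions[v] = (parent, action)
        self.next_id += 1
        return v

    def materialize(self, version):
        """Rebuild a workflow (here, just a set of module names) by
        replaying the chain of actions from the root to `version`."""
        chain = []
        while version is not None:
            parent, action = self.actions[version]
            if action is not None:
                chain.append(action)
            version = parent
        workflow = set()
        for op, module in reversed(chain):
            if op == "add":
                workflow.add(module)
            elif op == "delete":
                workflow.discard(module)
        return workflow

tree = VersionTree()
v1 = tree.commit(0, ("add", "read_data"))
v2 = tree.commit(v1, ("add", "isosurface"))
# Branching from v1 captures a sibling hypothesis without losing v2.
v3 = tree.commit(v1, ("add", "volume_render"))
```

Because versions share their common prefix of actions, the tree stores the entire space of related workflows compactly while keeping every variant reproducible.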

Cancer Informatics

Chair: Kristin Tolle

Organization and Infrastructure of the Cancer Biomedical Informatics Grid

Peter A. Covitz, Microsoft

The mission of the National Cancer Institute (NCI) is to relieve suffering and death due to cancer. NCI leadership has determined that the scale of its enterprise has reached a level that demands new, more highly coordinated approaches to informatics resource development and management. The Cancer Biomedical Informatics Grid (caBIG) program was launched to meet this challenge.

caBIG participants are organized into workspaces that tackle the various dimensions of the program. Two cross-cutting workspaces – one for Architecture, the other for Vocabularies and Common Data Elements – govern syntactic and semantic interoperability requirements. These cross-cutting workspaces provide best-practices guidance for technology developers and conduct reviews of system designs and data standards. Four domain workspaces build and test applications for Clinical Trials, Integrative Cancer Research, Imaging, and Tissue Banks and Pathology Tools, representing the highest-priority areas defined by the caBIG program members themselves. Strategic-level workspaces govern caBIG requirements for Training, Data Sharing and Intellectual Capital, and overall strategic planning.

In its first year caBIG defined high-level interoperability and compatibility requirements for information models, common data elements, vocabularies, and programming interfaces. These categories were grouped into different degrees of stringency, labeled as the caBIG Bronze, Silver and Gold levels of compatibility. The Silver level is quite stringent, and demands that systems adopt and implement standards for model-driven and service-oriented architecture, metadata registration, controlled terminology, and application programming interfaces. The Gold level architecture consists of a formal data and analysis grid dubbed “caGrid” that future caBIG systems will register with and plug into. caGrid is based upon the Globus Toolkit and a number of additional technologies such as caCORE from the NCI and Mobius from Ohio State University. More information is available at

CancerGrid: Model- and Metadata-Driven Clinical Trials Informatics

Steve Harris & Jim Davies, Oxford University, UK

The CancerGrid project is developing a software architecture for ‘tissue-plus-data’ clinical trials. The project is using a model- and metadata-driven approach that makes the semantics of clinical trials explicit, facilitating dataset discovery and reuse.

The architecture is based on open standards for the composition of appropriate services, such as: randomization, minimization, clinician identity, serious adverse events relay, remote data capture, drug allocation and warehousing, and form validation.

CancerGrid is funded by the UK MRC, with additional support from the EPSRC and Microsoft Research. It brings together expertise in software engineering and cancer clinical trials from five universities: Cambridge, Oxford, Birmingham, London (UCL), and Belfast.

The project has developed a CONSORT-compliant model for clinical trials, parameterized by clinical data elements hosted in metadata repositories. A model instance can be used to generate and configure services to run the trial it describes.

This talk will describe the model, the services, and the technology employed, from XML databases to Office add-ins. It will demonstrate the aspects of the technology that have been completed, and outline plans for future releases.

caBIG Smart Client Joining the Fight Against Cancer

Tom Macura

The cancer Biomedical Informatics Grid (caBIG), dubbed the WWW of Cancer Research, is a National Cancer Institute informatics project that is likely to become the authoritative source of knowledge related to cancer. caBIG is built on a host of open-source Java technologies.

caBIG dotNET is an open-source web service and client API we have developed to expose high-level caBIG Java APIs to the .NET developer community. We have used caBIG dotNET to build two GUI Smart Clients: the xl-caBIG Smart Client and the xl-caBIG Smart Client MOBILE.

The xl-caBIG Smart Client is a set of extensions to Microsoft Excel 2003 that gives scientists a graphical interface for accessing caBIG data-services. It provides users intuitive access to caBIG by leveraging their intimacy with the Windows environment and Excel’s statistical tools.

The xl-caBIG Smart Client MOBILE is an alternative interface to the xl-caBIG Smart Client that is better suited for the unique input interface and limited screen area of mobile computing devices. This translates into caBIG access on a wide range of computing devices including PDAs and mobile phones.

C-ME: a Smart Client to Enable Collaboration Using a Two- or Three-Dimensional Contextual Basis for Annotations

Anand Kolatkar, The Scripps Research Institute

Collaborative Modeling Environment (C-ME) is a smart client that allows researchers to organize, visualize and share information with other researchers, utilizing Microsoft Office SharePoint Server 2007 (MOSS) as a data store, Vista’s Windows Presentation Foundation for the graphics display, and Visual Studio 2005/C# as a development platform.

C-ME addresses two important aspects of collaboration: context and information management. C-ME allows a researcher to use a 3D atomic structure model or a 2D image (e.g. an image of a slide containing cancer cells) as a contextual basis on which to attach annotations to specific atoms or groups of atoms of a 3D atomic structure, or to user-definable areas on a 2D image. These annotations (Office documents, URLs, Notes, and Screen Captures) provide additional information about the atomic structure or cellular imagery, are stored on MOSS, and are accessible to other scientists using C-ME, provided they have appropriate permissions in Active Directory. Storing and managing the annotations on MOSS allows us to maintain a single copy of the information accessible to a collaborating group of researchers. Contributions to this single copy of information via additional annotations are again immediately available to the entire community.

Data organization is hierarchical with projects being at the top and containing one or more entities, which can be annotated as described above. We are currently enhancing the existing 3D and 2D annotation capabilities to better match researchers’ needs, increasing drag-and-drop/one-click functionality for efficiency and standardization of the GUI under Vista, and further leveraging the search and indexing capabilities of MOSS. We are also looking for outside users to install and evaluate C-ME.

The C-ME development team includes members of InterKnowlogy, Microsoft and the Kuhn-Stevens laboratories at The Scripps Research Institute. Support for C-ME is in part provided by the NIH NIGMS Protein Structure Initiative under grant U54-GM074961 and through NIH-NIAID Contract HHSN266200400058C.

Data Fusion: Remsen One

Chair: Jignesh Patel

Some Challenges in Integrating Information on Protein Interactions and a Partial Solution

H.V. Jagadish, University of Michigan

Independently constructed sources of (scientific) data frequently have overlapping, and sometimes contradictory, information content. Current methods of use fall into two categories: force the integration step onto the user, or merely collate the data, at most transforming it into a common format. The first method places an undue burden on the user to fit all of the jigsaw puzzle pieces together. The second leads to redundancy and possible inconsistency.

We propose a third: deep data integration. The idea is to provide a cohesive view of all information currently available for a protein, interaction, or other objects of scientific interest. Doing so requires that multiple pieces of data about the object, in different sources, first be identified as referring to the same object, if required through “third party” information; then that a single “record” be created comprising the union of the information in multiple matched records, keeping track of differences where these occur; and finally by tracking the provenance of every value in the dataset so scientists can judge what items to use and how to resolve differences.
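
The merge step described above — unioning matched records while tracking per-value provenance and flagging disagreements — might be sketched as follows. The record fields and source names are illustrative, not MiMI's actual schema:

```python
def deep_merge(records):
    """Union the fields of records judged to refer to the same object.

    `records` is a list of (source_name, field_dict) pairs.  Returns a
    merged record mapping each field to a list of (value, source)
    pairs — the provenance of every value — plus the set of fields on
    which the sources disagree.
    """
    merged = {}
    for source, rec in records:
        for field, value in rec.items():
            merged.setdefault(field, []).append((value, source))
    conflicts = {field for field, vals in merged.items()
                 if len({v for v, _ in vals}) > 1}
    return merged, conflicts

# Three hypothetical source records matched to the same protein.
hits = [("HPRD", {"name": "TP53", "length": 393}),
        ("DIP",  {"name": "TP53", "species": "human"}),
        ("BIND", {"length": 390})]
merged, conflicts = deep_merge(hits)
```

Keeping the per-value provenance rather than silently picking a winner is what lets the scientist, as the abstract puts it, judge which items to use and how to resolve differences.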

The results of this process, as applied to protein interactions and pathways, are found in the Michigan Molecular Interactions Database (MiMI). MiMI deeply integrates interaction data from HPRD, DIP, BIND, GRID, IntAct, the Center for Cancer Systems Biology dataset, and the Max Delbruck Center dataset. Additionally, auxiliary data is used from GO, OrganelleDB, OrthoMCL, PFam, ProtoNet, miBlast, InterPro and IPI. MiMI is publicly available at

In this talk, I will discuss the desiderata for a protein interaction integrated information resource obtained from our user community, and outline the architecture of the system we have developed to address these needs.

Theory in the Virtual Observatory

Gerard Lemson, Max-Planck-Institut fuer extraterrestrische Physik, Germany

I will discuss and demo efforts to introduce theory into the Virtual Observatory (VO). With the VO the astronomical community aims to create an e-science infrastructure to facilitate online access to astronomical data sets and applications.

The efforts of individual national VOs are organized in the International VO Alliance (IVOA), which seeks to define standard protocols to homogenize data access and enable interoperability of distributed services. VO efforts have generally concentrated on observational data, but recently interest has grown to include the results of large-scale computer simulations. The goal is the dissemination of simulation data per se, in particular finding ways of using such data for the planning, prediction and interpretation of observations.

Simulated data sets are in general very different from observational ones and need special treatment in the VO. As an example, I present a relational data structure for efficiently storing the tree structures that represent the formation history of objects in the universe. An implementation of this is used in a web service exposing a SQL Server database storing results of the largest cosmological simulation to date.
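
One way such a relational structure can work (assumed here as an illustration, not necessarily the exact scheme presented in the talk) is to number tree nodes in depth-first order and store, for each node, the largest id in its subtree; a whole formation history then becomes one contiguous id range, retrievable with a single BETWEEN query and no recursion:

```python
def assign_dfs_ids(tree, root):
    """Number nodes depth-first and record, for each node, the largest
    id in its subtree.  Because a depth-first traversal visits a
    subtree contiguously, the formation history of any node is exactly
    the rows with id BETWEEN ids[node] AND last[node] — a plain range
    scan in a relational database.
    """
    ids, last = {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        ids[node] = counter
        counter += 1
        for child in tree.get(node, []):
            visit(child)
        last[node] = counter - 1

    visit(root)
    return ids, last

# Toy merger tree: halo A formed from progenitors B and C; B from D.
tree = {"A": ["B", "C"], "B": ["D"]}
ids, last = assign_dfs_ids(tree, "A")

# "SELECT * FROM halos WHERE id BETWEEN ids['B'] AND last['B']"
subtree_of_B = [n for n, i in ids.items() if ids["B"] <= i <= last["B"]]
```

The design choice matters for a SQL back-end: the recursive structure is flattened at load time, so the common "give me the full formation history" query needs no recursive joins.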

To facilitate the use of specialized simulation data by observers, the community has invented the idea of the “virtual telescope”. These are services that mimic real telescope observations and produce results that can be directly compared to observations. I will show examples producing optical galaxy catalogues and X-Ray observations of galaxy clusters.

The implementation of virtual telescopes requires specialized expertise not generally available at a single location. Furthermore, to produce mock observations in sufficient detail for scientific purposes requires a high performance computational infrastructure.

The VO offers the appropriate framework for resolving both these issues and I will conclude with thoughts on the steps that are required to make this a reality.

Practical Experience with a Data Centric Model for Large-Scale, Heterogeneous Data Integration Challenges

Michael Gillam, Azyxxi

Heterogeneous data integration in the scientific and clinical domains is an often complex and costly process. The practical challenges have become more apparent with the ongoing technological struggles and public failures of high-profile government and corporate data integration efforts. We describe a data-centric model of heterogeneous data integration which combines data atomicity with metadata descriptors to create an architectural infrastructure that is highly flexible, scalable and adaptable.

The practical application of this approach has created the most diverse real-time clinical data repository available. The system is in live use across eight hospitals, with over 80,000 clinical queries per hospital per day. The database is over 60 terabytes in size, with 500 terabytes of installed capacity. Over 11,000 heterogeneous data elements are stored, effectively integrating textual, image and streaming video data seamlessly into a single architecture. Over 1,500 live data streams feed the system and are maintained using only 50% of the time of one full-time employee. Common clinical queries are designed to return results in an eighth of a second. All data within the system, currently spanning 10 years of live clinical use, are retrievable in real time. No data are offloaded or archived inaccessibly. The system has had 99.997% uptime over the last 10 years.

The success of the data-centric approach has dramatic implications for many large-scale, heterogeneous data integration projects where data types are numerous and diverse, highly scalable infrastructures are required, and data specifications are imprecise or evolving.

Managing Satellite Image Time Series and Sensor Data Streams for Agro-Environmental Monitoring

Claudia Bauzer Medeiros, UNICAMP, Brazil

The WebMaps project is a multidisciplinary effort involving computer scientists and experts on agricultural and environmental sciences, whose goal is to develop a platform based on Web services for agro-environmental planning in Brazil.

One of the main challenges concerns providing users with the means to summarize and analyze long time series of satellite images. This analysis must correlate these series, for arbitrary temporal intervals and regions, with data streams from distinct kinds of weather sensor networks (rainfall, humidity, temperature, etc.). Additional sources include data that characterize a region (e.g., relief or soil properties), crop physiological characteristics, and human occupation factors.

Data quality and provenance are important factors, directly influencing analysis results. Besides massive data volumes, several other factors complicate data management and analysis.

Heterogeneity is a major barrier – many kinds of data need to be considered, and weather sensors are varied and often faulty. Moreover, data correlations must consider a variety of time-warp factors. For instance, the effect of rainfall in a region, combined with temperature and soil parameters, takes months to be reflected in the vegetation growth detected by satellite imagery.

Other factors – e.g., the effects of human activity – cannot be directly measured and must therefore be derived from primary sources such as images. Furthermore, there is a wide range of user profiles, with many kinds of analysis and summarization needs.

The first prototype is available, concentrating on spatio-temporal analysis of image series, creating graphs that show vegetation evolution in arbitrary areas. Present data management efforts include analysis of time series co-evolution, visualization, annotation, and provenance management.

The presentation will concentrate on handling heterogeneity, time-series co-evolution, and implementation difficulties. The same problems are cropping up in other projects – in biodiversity and health care.

Web Services in eScience: STI Auditorium

Chair: Marty Humphrey

From Terabytes to Petabytes: Towards Truly Astronomical Data Sizes

Ani Thakar, The Johns Hopkins University

The Sloan Digital Sky Survey (SDSS) has been serving a multi-Terabyte catalog dataset to the astronomical community for a few years now. By the beginning of the next decade, the Large Synoptic Survey Telescope (LSST) will be acquiring data at the rate of one SDSS every 3-4 nights and serving a Petabyte-scale dataset to the community by 2015 or so.

I will discuss the lessons learned from SDSS and how they will guide the LSST data management design. In particular, I will highlight the developments at the Johns Hopkins University (JHU) in online database access, asynchronous query execution, data partitioning, and Web services as the pillars upon which petascale data access will be built. I will treat these topics in the larger context of the international Virtual Observatory (VO) effort that seeks to bring these technologies together so that all astronomical data is federated and accessible in an efficient manner.

JHU is a major participant in the three projects that I will discuss – SDSS, LSST and the VO – and is also building a 100-Terabyte data analysis facility to analyze data, one Terabyte at a time, from large-scale turbulence simulations.

A Web Service Approach to Knowledge Discovery in Astronomy

Andrew Connolly, University of Pittsburgh

Large scale astronomical surveys are providing a panchromatic view of the local and distant universe covering wavelengths from X-rays to radio. With the development of the National Virtual Observatory (NVO) that federates these disparate data sets, astronomers can now query the sky across many decades of the electromagnetic spectrum.

The challenges now faced by astronomy are not just how we organize and efficiently distribute these data, but how we make use of these resources to enable a new era of scientific discovery. I will talk here about steps to integrate efficient machine learning techniques into the NVO to facilitate the analysis and visualization of large data sets. I will focus on a Web service-based framework that enables users to upload raw imaging data and receive in return astrometrically and photometrically calibrated images and source catalogs, together with cross-matches of these sources against the full spectrum of catalogs available through the NVO. I will show how we integrate data mining tools into this Web service framework to automatically identify and classify unusual sources, either from the resulting catalogs (e.g., using mixture models for density estimation) or directly from the images (e.g., by subtracting images observed at an earlier epoch).
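The image-subtraction step mentioned above can be illustrated with a toy pixel-level sketch. Real difference-imaging pipelines first align the images and match their point-spread functions; that machinery is omitted here, and the function name and threshold are illustrative:

```python
def difference_image(new, old, threshold):
    """Subtract an earlier-epoch image from a new one (both given as 2-D
    lists of pixel brightnesses) and flag pixels that brightened by more
    than `threshold` - candidate transient sources such as supernovae."""
    detections = []
    for y, (row_new, row_old) in enumerate(zip(new, old)):
        for x, (a, b) in enumerate(zip(row_new, row_old)):
            delta = a - b
            if delta > threshold:
                detections.append((x, y, delta))
    return detections
```

In practice detections would then be grouped into sources and cross-matched against existing catalogs, so that known variable objects are not flagged as new.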

As I will show, these tools are accessible to professional and amateur astronomers alike and have already been used to detect supernovae within images and to identify very high redshift galaxies.

The Astronomical Cross Database Challenge

Maria A. Nieto-Santisteban, The Johns Hopkins University

Astronomy, like many other eSciences, has a strong need for efficient database cross-reference procedures. Finding neighboring sources, either within the same catalog or across different catalogs, is one of the most requested capabilities. Although many astronomical tools are capable of finding sources near other sources, they cannot handle the volume of objects that current and future astronomical surveys, such as the Sloan Digital Sky Survey or the Large Synoptic Survey Telescope, are generating. With terabytes of data and billions of records, using traditional file systems to store, access, and search the data is no longer an option.

Astronomy is finally moving into Database Management Systems (DBMSs). Even though DBMSs are suited for efficient data manipulation and fast access, managing such large volumes requires special parallelization, searching, and indexing algorithms. We have developed a zoning algorithm that not only speeds up all-to-all neighbor searches using only relational algebra but also partitions and distributes the workload efficiently across computers. Using this technique, we can bring the cross-identification of two catalogs of one billion objects each down to about an hour.
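The zoning idea, bucketing sources into declination stripes so that a neighbor search only compares sources in the same or adjacent zones, can be sketched outside a database as follows. The flat-sky distance and the function names are simplifications of the real algorithm, which is expressed in relational algebra over zone tables:

```python
def zone_of(dec, zone_height):
    """Zone index for a declination in degrees (-90..+90)."""
    return int((dec + 90.0) // zone_height)

def build_zones(catalog, zone_height):
    """Bucket (id, ra, dec) sources by declination zone."""
    zones = {}
    for src_id, ra, dec in catalog:
        zones.setdefault(zone_of(dec, zone_height), []).append((src_id, ra, dec))
    return zones

def cross_match(cat_a, cat_b, radius_deg):
    """Zone-based cross-match: for each source in cat_a, scan only the
    candidate zones of cat_b instead of the whole catalog. Uses a flat-sky
    distance approximation for clarity; real code uses spherical geometry."""
    zones_b = build_zones(cat_b, radius_deg)  # zone height >= match radius
    matches = []
    for src_id, ra, dec in cat_a:
        z = zone_of(dec, radius_deg)
        for candidate_zone in (z - 1, z, z + 1):
            for b_id, b_ra, b_dec in zones_b.get(candidate_zone, []):
                if (ra - b_ra) ** 2 + (dec - b_dec) ** 2 <= radius_deg ** 2:
                    matches.append((src_id, b_id))
    return matches
```

Because each probe touches at most three zones, the work scales with the local source density rather than the catalog size, and zones map naturally onto table partitions distributed across machines.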

The challenge remains, though, in the many-to-many cross-matching process, where ‘many’ means billions of records per catalog and the number of catalogs is in the tens. In this talk we will present our experience working with very large astronomical catalogs and describe a framework that would allow large-scale data access and cross-matching.

Proteus RTI: A Simple Framework for On-The-Fly Integration of Biomedical Web Services

Shahram Ghandeharizadeh, USC

On-the-fly integration refers to scenarios where a scientist wants to integrate a Web Service immediately after discovering it. The challenge is to significantly reduce the required information technology skills to empower the scientist to focus on the domain-specific problem. Proteus RTI is a first step towards addressing this challenge. It includes a simple interface to enable a scientist to register Web Services, compose them into plans, execute a plan to obtain results, and share plans with other members of their community.

This presentation provides an overview of Proteus RTI and its components. We present several animations showcasing Proteus RTI with a variety of scientific Web Services. In one example, we compose a plan that invokes different operations of the NCBI Web Service to retrieve information pertaining to a keyword, such as Asthma, from all NCBI databases. In a more complex example, we compose KEGG’s FIND operation with NCBI’s eSearch to retrieve all matching molecules with their definitions and corresponding source-specific IDs.
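A plan in the sense described, an ordered sequence of registered service operations where each step's output feeds the next, can be sketched as below. The two service stubs are hypothetical stand-ins, not real KEGG or NCBI calls, and the `Plan` API is illustrative rather than Proteus RTI's actual interface:

```python
class Plan:
    """An ordered pipeline of service operations: each step consumes the
    previous step's output, mirroring a compose-then-execute workflow."""

    def __init__(self):
        self.steps = []

    def then(self, operation):
        self.steps.append(operation)
        return self  # allow fluent chaining: Plan().then(a).then(b)

    def execute(self, payload):
        for operation in self.steps:
            payload = operation(payload)
        return payload

# Hypothetical stand-ins for a KEGG FIND-like and an NCBI eSearch-like call.
def find_molecules(keyword):
    """Pretend lookup: keyword -> list of molecule IDs (placeholder data)."""
    return ["MOL-1"] if keyword == "caffeine" else []

def fetch_definitions(ids):
    """Pretend lookup: molecule IDs -> records with definitions."""
    return [{"id": i, "definition": f"record for {i}"} for i in ids]

plan = Plan().then(find_molecules).then(fetch_definitions)
```

The point of such a design is that the scientist composes and shares plans without writing glue code; the framework handles invocation and passing results between services.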

Animations showing examples discussed in this short abstract (along with others), as well as the Proteus RTI software itself, are available for download from this URL.