Microsoft Research hosted a two-day e-science workshop on Thursday, October 6, 2005 and Friday, October 7, 2005 in Redmond, Washington. This workshop was a follow-on workshop to the successful SciData 2004 Workshop.
The eScience Workshop provided a unique opportunity to learn about and influence what is happening in the realm of data-intensive scientific computing within Microsoft. Attendees learned firsthand from early adopters using Microsoft Windows, Microsoft .NET, Microsoft SQL Server, and Web services in these problem spaces, and explored in depth how modern database technologies and techniques are being applied to scientific computing. By providing a forum for scientists and researchers to share their experiences and expertise with the wider academic and research communities, the workshop fostered collaboration, facilitated the sharing of software components and techniques, and established a vision for Microsoft Windows and .NET in data-intensive scientific computing.
Dan Fay and Jim Gray, Microsoft Research
Predicting Tornados with Data Driven Workflows: Building a Service Oriented Grid Architecture for Mesoscale Meteorology Research
Dennis Gannon, Indiana University
Each year the loss of life and property due to mesoscale storms such as tornados and hurricanes is substantial. The current state of the art in predicting severe storms like tornados is based on static, fixed-resolution forecast simulations and storm tracking with advanced radar. This is not good enough to predict these storms with the accuracy needed to save lives. What is needed is the ability to do on-the-fly data mining of instrument data and to use the acquired information to launch adaptive workflows that can dynamically marshal resources to run ensemble simulations on demand. These workflows need to be able to monitor the simulations and, where possible, retarget radars to gather more data to initialize higher-resolution models that can focus the predictions. This scenario is not possible now, but it is the goal of the NSF LEAD project. To address these problems we have built a service-oriented architecture that allows us to dynamically schedule remote data analysis and computational experiments. The Grid of resources used includes machines at Indiana, Alabama, NCSA, Oklahoma, UNC, and UCAR/Unidata, and soon TeraGrid. The user's gateway to the system is a Web portal and a set of desktop client tools. Five primary persistent services are used to manage the workflows: a metadata repository called MyLEAD that keeps track of each user's work, a WS-Eventing-based pub-sub notification system, a BPEL-based workflow engine, a Web service registry for soft-state management of services, and the portal server. An application factory service is used by the portal to create transient instances of the data mining and simulation applications that are orchestrated with the BPEL workflows. As the workflows execute they publish status metadata via the notification system to the user's MyLEAD space. The talk will present several open research challenges that are common to many e-Science efforts.
Exposing the National Water Information System to GIS Through Web Services
Jonathan Goodall, Duke University
The National Water Information System (NWIS) is a hydrology data repository with stream flow, water quality, and groundwater observations maintained by the United States Geological Survey (USGS). The database includes 1.5 million monitoring stations in the United States and Puerto Rico, some with nearly 100 years of data. A Web service was developed using Visual Basic .NET to better expose this national-scale data resource to client applications within the hydrologic community. One such client application is an extension to ArcMap, developed by the author, for plotting time series and performing basic water balance analysis within a mapping environment. The plotting extension was originally created to read from local databases, requiring the user to manually download time series and format them into a particular database structure. Now that the software has been extended to consume the NWIS Web service, it is possible to create “on-the-fly” plots of hydrologic observations for any station within the nation.
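The client-side pattern described above — consume a time-series Web service response, then plot or aggregate on the fly — can be sketched in a few lines. This is an illustrative stand-in, not the actual NWIS service or the VB.NET extension: the XML element names below are assumptions, not the real NWIS payload schema.

```python
# Illustrative sketch: parsing a hypothetical NWIS-style time-series response
# and aggregating it for an "on-the-fly" plot. Element/attribute names are
# made up for illustration; the real service schema differs.
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE_RESPONSE = """<timeSeries site="02087500">
  <value dateTime="2005-06-01">410.0</value>
  <value dateTime="2005-06-02">395.0</value>
  <value dateTime="2005-07-01">120.0</value>
</timeSeries>"""

def monthly_means(xml_text):
    """Group daily discharge observations by month and average them."""
    root = ET.fromstring(xml_text)
    buckets = defaultdict(list)
    for v in root.findall("value"):
        month = v.get("dateTime")[:7]          # e.g. "2005-06"
        buckets[month].append(float(v.text))
    return {m: sum(vals) / len(vals) for m, vals in buckets.items()}
```

In the real extension the aggregated series would feed the ArcMap plotting and water-balance tools rather than a dictionary.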
Grid Computing Using .NET
Marty Humphrey, University of Virginia
The broad goal of our WSRF.NET project at the University of Virginia is to facilitate Grid computing on the .NET platform. In this talk, we give an update on our progress in exploiting and extending the .NET/Windows platform for Grid Computing – including our recent support for GridFTP on .NET, an OGSA-based Authorization Service based on Windows, and our alternative software stack for OGSA-based grids (based on WS-Transfer, WS-Eventing, etc.). The talk culminates with a live *demo* of how we have integrated this support for Grids on .NET/Windows with the Globus toolkit to form the basis for the UVa Campus Grid (UVaCG).
Webcast: Grid Computing Using .NET
Web Services & Computations
Web Services for HPC - Making Seamless Computing a Reality
David Lifka, Cornell Theory Center
Seamless HPC has been a goal of computer and computational scientists for over a decade. Allowing researchers to focus on their research and not the quirks of complex HPC environments has been a dream waiting for a solution. Today the solution exists, but many still don’t know how to apply the tools (Web services and SQL databases) to the problems. This talk will discuss several applications of Web services and SQL databases to real-world HPC applications, including solutions for embarrassingly parallel tasks, wide-area distributed computing applications, and eScience/data-intensive computing. Real examples of applied solutions for each of these HPC problems will be presented. Examples of programming techniques to support the use of Web services installed on hundreds of distributed computing and data resources will also be presented.
Computational Data Grid for Scientific and Biomedical Applications
Marc Garbey and Victoria Hilford, University of Houston
The goal of this project is to develop a Microsoft Windows-based Computer Grid infrastructure that will support high performance scientific computing and integration of multi-source biometric applications. The University of Houston Microsoft Windows-based Computer Grid (WING) includes not only the Computer Science and the Technology Department networks, but also nodes in China, Germany, and several other countries. The total amount of available storage exceeds 4 Terabytes.
Four specific biomedical applications developed at University of Houston are the basis of this project:
- Computational tracking of Human Learning using Functional Brain Imaging
- Monitoring Human Physiology at a Distance by using Infrared Technology
- Multimodal Face Recognition and Facial Expression Analysis
- Relating Video, Thermal Imaging, and EEG Analysis — integrate and analyze simultaneously recorded brain activity, infrared images, and 3D video
This Biomedical Data Grid project meets the following technical requirements:
- Rapid application development (use of the Microsoft Visual Studio .NET technology)
- Visual modeling interfaces (forms driven Graphical User Interfaces)
- Database Connectivity (interface with Microsoft SQL Server 2005)
- Query support (clients can store, update, delete, retrieve database metadata)
- Context-sensitive, role-based access (Microsoft Windows Server 2003, ASP.NET)
- Robust security (HIPAA compliance through Microsoft’s Authentication and Authorization from IIS and ASP.NET)
- Connectivity to other biomedical resources (PACS, DICOM, XML)
The Biomedical Data Grid application is developed using Microsoft Windows Server 2003, Microsoft Virtual Server 2005, Microsoft Visual Studio .NET Beta 2, and Microsoft SQL Server 2005. A Web client will be able to securely upload biomedical files to a Web server, while metadata related to these files will be stored in the SQL Server 2005 database for the purpose of querying, data mining, etc. Post-processing and simulation steps on biomedical data will use a Master node Web service that automatically distributes a large set of parameter or sensitivity analysis tasks to Slave nodes on the Computing Grid. We will give an overview of our project and provide a few examples of our biomedical applications.
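The Master/Slave task-farming pattern described above can be sketched minimally. This is an assumption-laden illustration, not the project's .NET Web service implementation: a local thread pool plays the role of the Slave nodes, and `run_task` is a placeholder for a real post-processing or sensitivity-analysis job.

```python
# Minimal sketch of Master/Slave parameter-sweep distribution. In the real
# system the "workers" would be Web service endpoints on grid nodes; a local
# thread pool stands in here so the pattern is runnable on its own.
from concurrent.futures import ThreadPoolExecutor

def run_task(param):
    # Placeholder for a compute-heavy task on one parameter setting.
    return param * param

def master_distribute(params, max_workers=4):
    """The Master farms one task per parameter out to the pool of Slaves
    and collects the results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, params))
```

The design point is that the Master only schedules and gathers; all domain computation lives in the task function, which is what makes the pattern reusable across the four biomedical applications.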
SETI@home and Public Participation Distributed Computing
Dan Werthimer, University of California at Berkeley
Werthimer will discuss the possibility of life in the universe and the search for radio signals from other civilizations. SETI@home analyzes data from the world’s largest radio telescope using desktop computers from five million volunteers in 226 countries. SETI@home participants have contributed two million years of computer time and have formed Earth’s most powerful supercomputer. Users have the small but captivating possibility that their computer will detect the first signal from a civilization beyond Earth. Werthimer will also discuss plans for future SETI experiments, petaop signal processing, and open source code for public participation distributed computing (BOINC — Berkeley Open Infrastructure for Network Computing).
Making NEXRAD Precipitation Data Available to the Hydrology Community
Tomislav Urban, University of Texas at Austin
Next Generation Doppler Radar (NexRad) has enabled the collection of high-resolution precipitation data across the country that is of high value to hydrologists studying, amongst other things, flooding, evaporation, and drought. These data, however, are available only in non-standard binary formats and in file structures not conducive to the types of queries typically performed in the domain, and so have been difficult to integrate into the hydrologist’s research. For example, a hydrologist may seek to obtain data for a single variable over a small geographical entity for a fairly significant temporal extent, on the order of months or years; the files, on the other hand, are typically available for one-hour periods extending over a large region or even over the entire country. Since the level of IT support for these researchers can be low, this has presented an impediment to ready access to NexRad data. This project seeks to provide simple Web application- and Web service-based access to these data in whatever spatial and temporal extents are most convenient to the user. By storing the data in SQL Server, we are able to quickly generate output files for precisely the variables, geographies, times, and formats that are required. Additionally, as we are using the ArcHydro schema developed by our partners at the Center for Research in Water Resources (CRWR), also at the University of Texas, the data can be easily output as geo-referenced points or polygons, allowing the user to bring to bear an array of GIS-based analytic tools already generally available. Looking ahead, we see this collection growing into a major repository of hydrology-related data including stream and rain gauge point data and water quality.
Integration and Visualization in Bioinformatics
Mehmet Dalkilic, Indiana University
One of the greatest benefits of escience – the use of distributed computing and data resources for scientific discovery – is the opportunity for scientists to begin working with data sets that would have been too large to work with otherwise and, consequently, to ask questions that would not have been possible. There are many obvious challenges escience faces because of its distributed nature, but there are other challenges that, while not uniquely escientific, remain sufficiently domain-sensitive that solutions do not seem easily shareable. One particularly difficult problem is integration – how to coherently bring together disparate, massive data sets. Focus has generally been placed on the physical layer, borrowing from the three layers of data modeling, where details of implementation predominate. This problem will likely continue, though there is some hope of leveraging “smart” architectures like smart clients. Logical integration – how to meaningfully bring together massive, disparate data sets from the scientists’ perspective – is even more challenging. Another challenge of escience is creating meaningful, interactive visualizations of massive data sets. A direct benefit of this kind of visualization is allowing the scientist to freely explore in a setting that is more familiar and intuitive. In this presentation we will discuss three ongoing projects, CATPA (Curation and Alignment Tool for Protein Analysis), INGeNE (Integrated Gene Network Explorer), and SNPEx (SNP Explorer), that address the challenges of integration and visualization. CATPA is a smart client application that allows for the curation of protein families at the residue level, including deletions. Interaction is done visually. INGeNE is an application that allows for functional genomic discovery by building networks of relationships where an edge is determined by a combination of microarray data, protein-protein data, gene-gene interaction data, and phenotypic expression data.
SNPEx is an application that includes a novel algorithm to find the most informative set of tagging SNPs. Additionally, we decided to implement SNPEx in both Java/MySQL and C#/SQL Server 2000 to compare performance of the two systems, and found the latter to be superior in our suite of tests.
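The abstract does not describe SNPEx's novel algorithm, so the sketch below shows only the generic greedy heuristic commonly used for tagging-SNP selection: repeatedly pick the SNP that covers (tags) the most still-uncovered SNPs. Both the approach and the coverage data are illustrative assumptions, not the SNPEx method.

```python
# Generic greedy set-cover heuristic for tagging-SNP selection (an
# illustration of the problem, NOT SNPEx's novel algorithm).
def greedy_tag_snps(coverage):
    """coverage: dict mapping each candidate SNP to the set of SNPs it
    tags (including itself, e.g. via high linkage disequilibrium)."""
    uncovered = set().union(*coverage.values())
    tags = []
    while uncovered:
        # Pick the SNP covering the most SNPs not yet accounted for.
        best = max(coverage, key=lambda s: len(coverage[s] & uncovered))
        tags.append(best)
        uncovered -= coverage[best]
    return tags
```

A small hypothetical coverage map shows the idea: if `rs1` tags `rs2` and `rs3`, the greedy pass selects `rs1` first and then mops up whatever remains.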
The WiFi eTransit Village
Uma Shama, Bridgewater State College
As a foundational research project of the Federal Transit Administration, the GeoGraphics Laboratory at Bridgewater State College has developed a Web-based transit technology prototype focused on the needs of the consumer to access safe and secure transit service while also providing for enhanced personal productivity and travel assistance while on board the transit vehicle and at the bus stop. The project takes advantage of emerging community-wide outdoor Internet connectivity and very large scale data storage as the enabling technology for a full-featured e-transit village. The project uniquely uses the wireless local area network infrastructure (WLAN) and international standards (WiFi, or wireless fidelity, 802.11b) to demonstrate customer-oriented applications of transit technology for community transportation providers. The transit technology prototype uses the campus transit system for Bridgewater State College provided by the Brockton (MA) Area Transit Authority (BAT) and the surrounding New England village of Bridgewater, Massachusetts. Proof of concept milestones to date include GPS-based automatic vehicle location mapping transit vehicles with a one-second refresh rate using Microsoft’s Web service and transmission of video from the transit vehicle with GPS date/time and latitude/longitude at simultaneous one-second intervals for real-time Internet display and archiving on Microsoft’s custom-built 2-terabyte server. Research continues in developing an opportunistic approach to optimizing reacquisition of access points and reauthorization of wireless local area network security from a moving transit vehicle. A field operational test of the proof of concept is planned for 2006 with an opportunity for deployment by the local sponsoring transit authority in 2007.
Webcast: The WiFi eTransit Village
Streamlining Scientific Research via Electronic Laboratory Notebooks and Wireless Sensors
Patrick Anquetil, MIT BioInstrumentation Lab
This talk will discuss the use of computing to assist research in an academic laboratory environment. Two projects conducted at the MIT BioInstrumentation Laboratory within the framework of the MIT/Microsoft iCampus project will be discussed. These two projects are named iLabNotebook and iDat.
The iLabNoteBook is an experiment in which we attempt to replace traditional laboratory notebooks with Windows XP powered Tablet PCs. This new computing platform offers a multimedia environment for scientists and students to document their work and conduct scientific research. The virtual laboratory notebook empowers researchers not only to record experimental procedures digitally but also to add multiple data-format content to a lab notebook page. In addition, these electronic notebooks can be easily searched, backed up, transported, and shared amongst colleagues worldwide. Evaluation of this technology was conducted over a one-year period among fourteen scientists at MIT.
NeuroScholar: A Practical Solution Addressing Information Overload in Systems-Level Neuroscience
Gully Burns, University of Southern California
Systems-level neuroscience lacks a formal theoretical structure, relying on argumentation based on experimental findings expressed in the primary literature. Theoretical models may typically be represented as summary diagrams in a paper’s discussion. Within a subject as complex and multifaceted as neuroscience, this lack of formalization leads inevitably to problems of information overload for individual researchers, as it is a significant challenge to manage and manipulate large volumes of information from a distributed resource such as the literature and scientists’ own individual records. We present “NeuroScholar”, a knowledge base desktop application that specifically targets literature- and laboratory-based information, providing a structured knowledge engineering approach for neuroscience. It provides a general object-oriented data model to encapsulate complex data into entities, and a graph-theoretical approach that represents relations between entities as edges between nodes in a graph. The system has frameworks for unit testing, plugins (to embed external applications within NeuroScholar), proxies (to export NeuroScholar’s knowledge management capabilities to external applications), and knowledge acquisition based on questionnaires. Specialized plugins include an annotation mechanism for PDF files (built with Multivalent, a third-party library); an electronic laboratory notebook component; an annotation mechanism for vector graphics; and NeuARt, a neuroanatomical data viewer based on standard atlases that can also use the proxy framework to act as a standalone neuroanatomical data management tool. The knowledge acquisition subsystem provides an easy way to link free-form document annotation with structured knowledge representations for specific types of experiment. We are applying the system directly in two systems-level neuroscience laboratories, one focused on neuroanatomy, the other on neuroendocrinology.
It is anticipated that NeuroScholar may provide a platform for theoretical research in neuroscience by delivering knowledge engineering capabilities directly to experimental scientists to facilitate analysis and communication.
NASA World Wind
Patrick Hogan, NASA
NASA World Wind, a Smart Client application built almost effortlessly on the .NET platform, lets you zoom from satellite altitude into any place on Earth. Leveraging Landsat satellite imagery and Shuttle Radar Topography Mission data, World Wind lets you experience Earth terrain in visually rich 3D, just as if you were really there. Virtually visit any place in the world. Look across the Andes, into the Grand Canyon, over the Alps, or along the African Sahara.
NASA World Wind is a free and open-source application, providing an excellent opportunity to understand and work with Smart Client architecture and the .NET Framework, whether as an academic exercise in understanding the technology or as a way to better appreciate the development of scientific research tools. All data leveraged is in the public domain. The technology allows for implementing a variety of formats, including ESRI Shapefiles, and server protocols (for example, WMS).
Computationally-intensive biomedical research projects supported by the National Institutes of Health
Milton Corn, M.D., NIH
The need for computational partnerships in biomedical research has increased sharply in recent years as the Human Genome project and other high-throughput biomedical research has underscored important new requirements for data processing, information retrieval, database design, data mining, and quantitative biology. At the National Library of Medicine as well as a number of other Institutes at the National Institutes of Health campus, research funding opportunities increasingly require significant computational expertise, and specifically require applicants to include in the project collaborations between biologists and computational experts. This talk will provide a survey of current computationally-intensive opportunities at NIH, suggestions for computer scientists and engineers looking for biomedical partners, and some guidance about the NIH grant processes.
Creating the Personal Supercomputer
Kyril Faenov, Microsoft
As computing power has increased, so have the complexities of our computer simulations. We’re at a point now where many scientists, engineers, and researchers are hitting the upper limit of their high-end workstations, further driving the need for supercomputing resources. Microsoft’s goal in entering the high performance computing space is to enable what we call “personal supercomputing”, which sounds like an oxymoron. What we want to do is move supercomputing resources out of distant labs and bring them closer to the people that use those resources. In most cases it would be a workgroup-sized system with 32 or 64 nodes, but in the most extreme case, the personal supercomputing case, it would mean a small 4-8 node cluster sitting in a scientist’s office running off 15 amp wall power. Come hear why we think this is the direction of supercomputing and how we’ll make it a reality.
Webcast: Creating the Personal Supercomputer
Using .NET and Web Services to build an e-Science Application: Looking for White Dwarfs
The Web Services Grid Application Framework (WS-GAF) project (Jan 2004 – Jan 2005) aimed to demonstrate the value of using standard, widely-accepted, well-supported Web Services technologies for scientific and commercial Internet-scale (a.k.a. “Grid”) applications. The scientific application developed as part of this project is a tool aimed at astronomers who wish to combine and analyse information from the SuperCOSMOS (UK) and Sloan Digital Sky Survey (US) scientific archives. This presentation will discuss the WS-GAF approach to building Internet-scale applications, the steps followed in creating a tool for scientists, and the implementation challenges and solutions.
Data & Databases
Cyberinfrastructure for E-Science
Tony Hey, Microsoft
Webcast: Cyberinfrastructure for E-Science
The Gateway to Biological Pathways: A Platform to Enable Semantic Web-Based Biological Pathway Datasets
Keyuan Jiang, Purdue
Biological pathways represent our current understanding of biological processes. A large amount of biological pathway data has been accumulated either by curation of the scientific literature or by automatic machine inference from high-throughput laboratory experiments. There exist over 180 biological pathway databases, covering metabolic pathways, signal transductions, protein-protein interactions, and regulatory pathways. The data have been collected by diverse research organizations with particular interests, various techniques, incompatible schemas, and different access methods. Biologists utilize the pathway data to formulate hypotheses, verify experiment results, and share research outcomes. Due to the incompatibility, depth, and breadth of database coverage, it is not uncommon for biologists to query multiple datasets, a time-consuming and error-prone process, to address intriguing biological problems.
The Gateway to Biological Pathways project leverages the BioPAX standard in storing and providing pathway datasets consumable by Semantic Web applications, and offers a unified interface to query biological pathway data. The proposed BioPAX standard provides a common format for exchanging biological pathway datasets. The BioPAX ontology, written in the W3C-recommended Web Ontology Language (OWL), supports the vision of the Semantic Web. With BioPAX, a pathway is composed of a number of entities and relationships among the entities.
In the Gateway application, the pathway entities are the basic data unit, naturally stored in XML format in a native XML datatype column of a SQL Server 2005 database. The support for native XML eases the database design: the number of tables can be reduced while relationships among pathway entities are still maintained. Storing XML data in an XML datatype column provides an efficient way of accessing and processing the data, and the XQuery support facilitates diverse search functionality over the XML datasets. The Gateway application provides a Web service by which biological pathways can be queried, with the data returned in BioPAX format. In addition, the HTTP GET and POST methods are implemented for directly querying the pathway data. The pathway datasets of E. coli and Human from BioCyc are currently available at the Gateway, and more data are to be added. A client capable of consuming the BioPAX format data is being developed for visualizing and navigating biological pathways.
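The kind of query that an XML datatype column plus XQuery enables — select pathway entities by attribute without shredding the document into relational tables — can be illustrated in miniature. The fragment below is a simplified stand-in, not the real BioPAX/OWL ontology, and the function mimics what an XQuery predicate would do inside SQL Server.

```python
# Toy illustration of querying pathway entities kept as XML, analogous to
# an XQuery over SQL Server 2005's XML datatype column. Element names are
# simplified assumptions, not actual BioPAX classes.
import xml.etree.ElementTree as ET

PATHWAY_XML = """<pathway name="glycolysis">
  <entity type="protein" name="hexokinase"/>
  <entity type="smallMolecule" name="glucose"/>
  <entity type="protein" name="pyruvate kinase"/>
</pathway>"""

def entities_of_type(xml_text, kind):
    """Roughly the XPath/XQuery //entity[@type=$kind]/@name."""
    root = ET.fromstring(xml_text)
    return [e.get("name") for e in root.findall(f"entity[@type='{kind}']")]
```

Because the entity stays intact as XML, adding a new entity attribute changes no table schema, which is the design benefit the abstract points to.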
Environmental Science from Satellites
Jeff Dozier, University of California at Santa Barbara
Imagery from Earth-orbiting satellites provides a rich but voluminous source of raw data for scientific investigation of environmental processes and trends. Analyses of the data are, however, generally outside the traditional realm of “image processing.” Instead, we think of an image as a geospatial raster of radiometric values, and an image’s resolution includes spatial, spectral, radiometric, and temporal attributes. Translation of images into a suite of geophysical products requires technologies and procedures that support extensive computation and spatial operations on large objects, along with mechanisms to track the legacy of computations performed and allow revisiting as algorithms change.
DopplerSource: .NET Framework for Accessing Doppler Radar Data
Beth Plale, Indiana University
Doppler radar data, which has proven its value in meteorology research, has tremendous potential for use in many other research endeavors if only it weren’t so difficult to work with. In DopplerSource we are removing the hurdles that prevent broader use of the data through a service-based framework for storing, operating on, and serving the data. The 130 WSR-88D (Doppler) radars located throughout the United States generate Level II data continuously, 24×7. The data has been valuable in many aspects of meteorology research and education, for instance, for the real-time warning of hazardous spring and winter weather, for initializing numerical weather prediction models, and for verifying the occurrence of past events, such as the location of damaging hail. But it has broader potential. Level II data is used in bird and insect migration studies, bird strike avoidance, urban pollution transport, and the tracking of hazardous atmospheric releases. This larger goal of facilitating additional avenues of science cannot be fully realized without significant improvements in the accessibility and availability of the data over what exists today.
In this project, partially funded through Microsoft e-Science, we are constructing a .NET framework for storing, operating on, and serving NEXRAD Level II data and the knowledge products derived from the data. Our pilot project is aimed at the six radars nearest Bloomington, Indiana.
The project focus areas are in:
- Storing and indexing large volumes of streaming data using a SQL Server database
- Generating metadata on-the-fly to describe data and capture features of time-sequence in which the data arrived
- Simple retrieval of Doppler data through a spatial-temporal interface. The user selects a region of interest, and specifies a temporal range.
- Support services to query, process, clean, filter, and fuse data on the fly
- Authentication mechanisms to avoid denial of service abuse by over-taxing the computational resource
- Scalability: a level of performance that balances continuous input stream arrival, computationally intense user services, and rich query access over highly correlated temporal and spatial data
- Log analysis to characterize arrival and anticipate user workload. Logs from related meteorology services used to analyze patterns of use that allow us to better anticipate future usage patterns
The storage needs for the pilot radars alone are substantial. The 6 radars generate 27.5 TB per year of raw Level II data that can be compressed to 1/25th size, requiring about 1.1 TB/yr of storage. A useful transformation of the data is into the binary netCDF format; the converted data adds another 2.5 TB/year. The arriving data products are tagged with metadata to facilitate searching. The metadata needs for the pilot data products are estimated at 170 GB/yr. The knowledge products generated on demand by statistical analysis and data mining services are estimated at 0.5 TB/yr. This places the total storage need at roughly 4.5 TB/year. The tools used include a Web service framework (.NET), a database management system (SQL Server), an XML metadata schema (leveraging the LEAD Metadata Schema from the NSF LEAD project), and Integrated Radar Data Services (IRaDS) support for the Doppler streams. The hardware testbed includes 16 dual Opterons with 16GB RAM each, a 3.5 TB SAN storage array, a database server (a dual Opteron with 4GB RAM, a 2TB RAID 1 disk, and Windows 2003), and the Indiana University MDSS fault-tolerant mass store server with a collective 1 Petabyte of storage.
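The storage budget above can be checked with simple arithmetic; each component figure is taken from the abstract, and the sum is approximate because the abstract rounds its components.

```python
# Back-of-envelope check of the DopplerSource storage budget (figures from
# the abstract; the quoted ~4.5 TB/yr total is a rounded sum).
raw_tb_per_year = 27.5                 # raw Level II data from 6 radars
compressed = raw_tb_per_year / 25      # 25:1 compression -> ~1.1 TB/yr
netcdf_tb = 2.5                        # converted netCDF copies
metadata_tb = 0.170                    # metadata, quoted as 170 GB/yr
knowledge_tb = 0.5                     # on-demand knowledge products

total_tb = compressed + netcdf_tb + metadata_tb + knowledge_tb
# total_tb comes out near 4.3 TB/yr, consistent with the rounded figure.
```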
Data & Databases
CasJobs and MyDB for the Virtual Observatory: Towards Distributed Asynchronous Web Services for Data Intensive Science
Ani Thakar, Johns Hopkins University
The Sloan Digital Sky Survey (SDSS) Catalog Archive Server (CAS) provides online access to the multi-TB SQL Server-based SDSS Science Archive via the SkyServer Web portal. This synchronous, ASP-based Web access is fine for casual and quick queries that request moderately sized resultsets, but for the data intensive queries that are necessary for serious research with the SDSS archive, we have developed CasJobs, a C#.NET batch query workbench Web service that provides asynchronous queue-based access to the SDSS CAS and a personal SQL Server database for every user (MyDB) to save their query results. I will describe and briefly demo the batch query workbench, and discuss the future of CasJobs: a distributed CasJobs/MyDB for the Virtual Observatory (VO). CasJobs provides two modes of query execution: quick (synchronous) and batch (queued) execution. Quick queries are limited to 1 minute execution time and can be run even without login, while batch queries (which require login) are virtually unlimited. Results from batch queries are routed to the user’s MyDB by default. Users can then preview and download these results at their convenience and in their chosen format (ASCII/CSV, binary, or XML). They can also share their MyDB tables with other collaborators, and use them in other queries and stored procedures to perform complex data intensive tasks like neighbor searches and cross-matches. Distributed CasJobs will require distributed security and storage. The international VO community is converging on the VOStore standard, which will essentially combine Web services security with MyDB-like data stores accessible via asynchronous Web services. VOStores will also enable distributed Web services like Open SkyQuery to operate asynchronously so that large cross-matches between catalogs can be performed on demand.
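The batch half of the quick-vs-batch model reduces to a simple pattern: submissions go into a queue, and each job's resultset lands in the submitting user's MyDB rather than being streamed back synchronously. The sketch below is an illustrative stand-in, not the CasJobs C#.NET implementation; the class, its in-memory "MyDB", and the job representation are all assumptions.

```python
# Minimal sketch of a queued batch-query workbench with per-user result
# storage ("MyDB"). Illustrative only; CasJobs itself runs real SQL against
# SQL Server and persists results in a per-user database.
from collections import deque

class BatchWorkbench:
    def __init__(self):
        self.queue = deque()   # pending batch jobs
        self.mydb = {}         # user -> {table_name: rows}

    def submit_batch(self, user, table, query_fn):
        """Enqueue a job; the caller gets a job id back immediately."""
        self.queue.append((user, table, query_fn))
        return len(self.queue) - 1

    def run_pending(self):
        """Worker loop: execute jobs, route results to the user's MyDB."""
        while self.queue:
            user, table, query_fn = self.queue.popleft()
            self.mydb.setdefault(user, {})[table] = query_fn()
```

The asynchronous handoff is the point: a long-running query no longer ties up a Web request, and the saved MyDB table can later be previewed, downloaded, shared, or reused in further queries.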
A Platform for Computational Comparative Genomics on the Web
Sun Kim, Indiana University
We have been developing a Web-based system for comparing multiple genomes, PLATCOM, where users can choose genomes and perform analysis of the selected genomes with a suite of computational tools. PLATCOM is built on internal databases such as GenBank, COG, KEGG, and the Pairwise Comparison Database (PCDB), which contains all pairwise comparisons (97,034 entries) of protein sequence files (.faa) and whole genome sequence files (.fna) of 312 replicons. The pre-computed PCDB makes it possible to complete genome analysis very fast even on the Web, so that users can choose any combination of genomes and analyze them with data mining tools. Genome comparison requires combining many sequence analysis tools. However, combining multiple tools for sequence analysis requires a significant amount of programming work and knowledge of each tool, so it is very challenging to provide a service for comparing genomes on the Web by using standard sequence analysis tools. Thus, to enable genome comparison on the Web, well-defined data mining concepts and tools are very important, since they can make genome comparison much easier. It is also important that the data mining tools for genome comparison be scalable. We have been developing such scalable tools: a sequence clustering algorithm BAG, a metabolic pathway analysis tool MetaPath, a gene fusion event detection tool FuzFinder, a gene neighborhood navigation tool OperonViz, an algorithm for mining correlated gene sets MCGS, a genome sequence alignment tool GAME, a multiple genome sequence alignment algorithm by clustering local matches mgAlign, and a pairwise genome visualization tool COMPAM. The analysis results are summarized with visualization tools. We are currently working on integrating the data mining modules such that users can combine these in a very flexible way. In addition to sequence data, PLATCOM will include more data types such as gene expression data.
Querying Breast Cancer Image Databases
Hanan Samet, University of Maryland
Breast cancer remains a leading cause of cancer deaths among women in many parts of the world. In the United States alone, over forty thousand women die of the disease each year. Mammography is currently the most cost-effective method for early detection of breast cancer. Alternative medical imaging approaches such as ultrasound or MRI could be more effective than mammography at detecting cancers or evaluating malignancy in certain groups of women. A database with images from multiple technologies such as mammograms, MRI, PET, and ultrasound will enable research into the effectiveness and usefulness of each technique for cancer screening and the determination of malignancy. We created this database with Microsoft SQL Server and are developing a Web-based query tool that gives doctors access to the data via Web services. Doctors will also be able to find cases similar to a current patient, thereby improving the accuracy of the diagnosis. This database will be an invaluable tool for the improvement of computer-aided detection (CAD) techniques by providing quality data sets, the storage of feature sets for comparison, a tool for the complex combination of features through spatial relationships and across images, and built-in statistical analysis. We will develop a pictorial query specification system for this tool that will enable users to specify queries by identifying the desired features, shapes, or characteristics and specifying the spatial relationships between them using distance and direction. Additionally, the secure data storage and retrieval enables long-distance, electronic image transmission (telemammography/teleradiology) for clinical consultations. The database will match images from the same patient, improving the capability of comparing images through time; this will enable the detection of extremely early cancerous indicators and thus, we hope, improve the cancer survival rate.
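To make the pictorial query idea concrete, here is a minimal sketch of the kind of spatial predicate such a system might evaluate over extracted image features ("a calcification within 50 pixels of a mass, roughly above it"). The feature records, field names, and direction convention are hypothetical, not the actual system design:

```python
import math

# Invented feature records; a real system would extract these from images.
# Image coordinates: x grows rightward, y grows downward.
features = [
    {"id": 1, "type": "mass",          "x": 120, "y": 80},
    {"id": 2, "type": "calcification", "x": 130, "y": 40},
]

def distance(f1, f2):
    return math.hypot(f1["x"] - f2["x"], f1["y"] - f2["y"])

def bearing(f_from, f_to):
    """Compass-style direction from one feature to another, in degrees
    (0 = N/up, 90 = E/right), accounting for image y growing downward."""
    return math.degrees(math.atan2(f_to["x"] - f_from["x"],
                                   f_from["y"] - f_to["y"])) % 360

def matches(f1, f2, max_dist, direction, tolerance=45.0):
    """True when f2 lies within max_dist of f1 in the given direction."""
    target = {"N": 0, "E": 90, "S": 180, "W": 270}[direction]
    diff = abs((bearing(f1, f2) - target + 180) % 360 - 180)
    return distance(f1, f2) <= max_dist and diff <= tolerance

print(matches(features[0], features[1], max_dist=50, direction="N"))  # True
```

A pictorial front end would translate the user's drawn sketch into a conjunction of predicates like `matches` and evaluate it against the stored feature sets.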
SANGAM: A System for Integrating Web Services to Investigate Stress-Circuitry-Gene Coupling
Shahram Ghandeharizadeh, University of Southern California
In 1993, NIH launched the Human Brain Project (HBP) to develop and support neuroinformatics as a new science to make experimental data pertaining to the brain publicly available on the Internet. The success of HBP is demonstrated by the Society for Neuroscience maintaining a directory of 83 databases and 48 knowledge bases developed and maintained by different academic, government, and commercial institutions. A challenge is how to integrate data from these diverse sources to answer a scientific enquiry. SANGAM focuses on this challenge from the perspective of Stress-Circuitry-Gene coupling. It strives to address the following scientific question: does every type of stress stimulus recruit the same set of brain circuits and activate the same genes, or do such circuits and genes vary across different stressors? An answer to this question would help clinicians and drug manufacturers develop better treatments and drugs for stress disorders. Currently, a prototype of SANGAM is operational and in use by our neuroscientists. A key insight from developing SANGAM is a general framework for neuroscience information integration consisting of three components: Run-time Integration (RTI), Plan Composition (PLC), and Schema and Data Mapper (SDM). We present an overview of these components along with performance results from both centralized and distributed (using WSE 2.0) implementations of the RTI component.
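A minimal sketch of what run-time integration means, with the two remote data sources stubbed as local functions. The service names, data, and join logic below are invented for illustration; SANGAM's RTI component orchestrates real, autonomous Web services:

```python
def circuit_service(stressor):
    # Stand-in for a remote brain-circuitry database service.
    return [{"stressor": stressor, "region": "PVH"},
            {"stressor": stressor, "region": "amygdala"}]

def gene_service(region):
    # Stand-in for a remote gene-expression database service.
    expression = {"PVH": ["crh"], "amygdala": ["crh", "avp"]}
    return expression.get(region, [])

def integrate(stressor):
    """Join circuitry and gene data at run time: for each brain region
    recruited by the stressor, fetch the genes activated there."""
    result = {}
    for hit in circuit_service(stressor):
        result[hit["region"]] = gene_service(hit["region"])
    return result

print(integrate("restraint"))
# {'PVH': ['crh'], 'amygdala': ['crh', 'avp']}
```

The point of the framework is that the join happens at query time across independently maintained sources, rather than by copying everything into one warehouse.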
Bio-Workflow Using BizTalk
Paul Roe, Queensland University of Technology
Workflow is an important enabling technology for eScience. Research into workflow systems for eScience has yielded several specialized workflow engines. We have been investigating the nature of scientific workflow and experimenting with the BizTalk business integration server to support scientific workflows. In particular, we have built a simple Web portal for bioinformatics which uses BizTalk as the underlying workflow engine. The portal has a novel Web-based interface, making it accessible to a wide variety of users. In this presentation we will describe the overall system and demonstrate some simple workflows.
Developing GEMSTONE, a Next Generation Cyberinfrastructure
Karan Bhatia, San Diego Supercomputer Center
We are developing an integrated framework for accessing grid resources that supports scientific exploration, workflow capture and replay, and a dynamic services-oriented architecture. The framework, called GEMSTONE (grid enabled molecular science through online networked environments), provides researchers in the molecular sciences with a tool to discover remote grid application services and compose them as appropriate to the chemical and physical nature of the problem at hand. The initial set of application services includes codes for molecular quantum and classical chemistry (GAMESS, APBS, Polyrate), along with supporting services for visualization (QMView), databases, auxiliary chemistry services, and documentation and education materials.
This presentation will focus on the technologies used to build the GEMSTONE frontend: a rich-client application built on the Mozilla framework that provides access to remote registries for application discovery using RSS, dynamically loaded user interfaces using XUL, and visualization services (both local and remote) using SVG, Flash, and OpenGL. The GEMSTONE frontend supports the GSI-based secure Web services infrastructure being created by the National Biomedical Computation Resource (NBCR), an NIH-funded center at UCSD, and supports the Grid Account Management Architecture (GAMA) for credential management. The remote Web services support large-scale clusters for parallel and high-throughput jobs and provide science-oriented strong datatypes for semantic composition. Finally, GEMSTONE adds a workflow composition tool, based on the Informnet engine, that composes existing Web services into workflows that are themselves accessible as new Web services.
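As a rough illustration of RSS-based application discovery, the sketch below parses a registry feed and lists the advertised services. The feed structure, service entries, and URLs are invented; GEMSTONE's actual registry format may differ:

```python
import xml.etree.ElementTree as ET

# A toy registry feed: each <item> advertises one grid application service.
feed = """<rss version="2.0"><channel>
  <title>Application service registry (example)</title>
  <item><title>GAMESS</title><link>https://example.org/gamess</link></item>
  <item><title>APBS</title><link>https://example.org/apbs</link></item>
</channel></rss>"""

def discover_services(rss_text):
    """Return (name, endpoint) pairs for each service in the registry feed."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for name, url in discover_services(feed):
    print(name, url)
```

Using a syndication format for the registry means any RSS-aware client can watch for newly published services without a custom discovery protocol.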
A Web Interface to Large, High-Resolution X-Ray Computed Tomography Data Sets
Julian Humphries, University of Texas at Austin
High-resolution X-ray computed tomography (HRXCT) provides highly detailed three-dimensional data on the exterior form and interior structure of solid objects. The data produced by the UTCT Lab facility at the University of Texas at Austin are HRXCT scans of various biological organisms, ranging from dinosaurs to mice, and geological objects, including meteorites and deep sea cores. The Digital Library of Vertebrate Morphology (or Digimorph Project) has for the last eight years acquired and scanned some of the world’s most spectacular organisms. To date, these data have been released as highly compressed renderings in the form of movies and Web-sized versions of the data. To provide access to full-sized datasets and enhance research tools for viewing these data, we have developed the UTCT Web interface. These data sets are large (1-4 GB in size), and their display and dissemination is a challenge. We have built a SQL Server-based system which hosts metadata and raw imagery and which allows rapid and flexible access to volumetric data.
The visualization options on the Web site range from simple to complex. Users can currently choose between a “light-table” viewer and a Java slice viewer. They can also download all or parts of the data at multiple bit depths and in multiple file formats. Finally, users will soon have the option to volume render these data remotely using one of several tools being developed. One approach uses the Meshviewer/Vista combination on Maverick, a TeraGrid-funded visualization system, to render the volume remotely over a VNC client and a high-speed Internet connection. Other possibilities are also under development. The combination of these options gives users a rich set of tools for exploring the data.
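One way a database-backed design pays off for 1-4 GB volumes is that clients can request just a slice range instead of downloading the whole file. The sketch below illustrates the idea using Python’s built-in sqlite3 in place of SQL Server; the table and column names are invented, not the UTCT schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE ct_slice (
                  dataset_id TEXT, slice_num INTEGER,
                  bit_depth INTEGER, pixels BLOB)""")

# Store a tiny fake volume: 100 slices of placeholder pixel data.
rows = [("croc_skull", i, 16, b"\x00" * 64) for i in range(100)]
db.executemany("INSERT INTO ct_slice VALUES (?, ?, ?, ?)", rows)

def fetch_subvolume(dataset_id, first, last):
    """Return only the requested slice range, ordered by slice number."""
    cur = db.execute(
        "SELECT slice_num, pixels FROM ct_slice "
        "WHERE dataset_id = ? AND slice_num BETWEEN ? AND ? "
        "ORDER BY slice_num",
        (dataset_id, first, last))
    return cur.fetchall()

subvol = fetch_subvolume("croc_skull", 40, 59)
print(len(subvol))  # 20 slices instead of the full dataset
```

Storing per-slice rows also makes it cheap to serve the same data at multiple bit depths or to feed a remote volume renderer only the region of interest.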
The Zecosystem: Cyberinfrastructure Education and Discovery for the Next Generation
Krishna P.C. Madhavan, Purdue University
Learning experiences of the future will be multi-sensory, will engage technologies and significant computational power continuously and invisibly, and will be completely engaging. The Zecosystem will offer cyber-services that seamlessly incorporate science, technology, engineering, and mathematics concepts into students’ everyday experiences. Through this project, we expect to transform common day-to-day student activities such as gaming, eating at the cafeteria, or visiting the library into learning experiences. Our vision is to develop a Cyberinfrastructure Education Ecosystem where learning co-exists with students’ lifestyles, technology choices, and the emerging national cyberinfrastructure. To this end, we will leverage significant ongoing R&D in computational infrastructure, middleware, and science gateways funded by the National Science Foundation (NSF) and other industrial partners at Purdue University.
Our goal is to leverage the national cyberinfrastructure effort for day-to-day discovery and learning practices. Given the emergence of highly cross-disciplinary areas such as nanotechnology, bioinformatics, and computational science as critical for scientific progress, teaching and learning at colleges and universities can no longer be locked behind computational walls. Furthermore, several national reports have identified the dire need to train and develop the next generation of students to take up careers in science, technology, and engineering. We strongly believe that in order to reach the current generation of students (the aptly labeled Gen-Z), information technology needs to be at the heart of educational efforts and play more than an add-on role. Simply put, we need to rethink education from the ground up.
The goal of the first phase of the project is to develop a robust set of Cyberinfrastructure (CI) Education Services that will extend the capabilities of existing and emerging science gateways such as the nanoHUB to a mobile environment. We are developing a series of Web services that plug into middleware and infrastructure layers currently being developed and deployed at Purdue University. These services will be made available to the larger scientific and education community, while simultaneously being consumed to develop new cutting-edge CI services focused on and tailored to students. This project will allow students to deploy large computational jobs to the national cyberinfrastructure from their cell phones, PDAs, gaming environments, and other mobile devices.
In this talk, we will focus on the vision of the Zecosystem and provide concrete arguments derived from both scientific discovery and pedagogical viewpoints. We will also provide examples of various project elements that are already in progress. In many cases, the prototypes are expected to be ready by the end of the coming academic year. All of the projects highlighted are funded either by large NSF grants or by support from our industrial partners.
The BioInstrumentation Laboratory in the Department of Mechanical Engineering at MIT is dedicated to the development of novel modern medical instrumentation requiring the combination of many traditional disciplines, including biology, optics, mechanics, mathematics, electronics, and chemistry. It is uniquely placed to bring together these areas of research with its broad array of students and postdoctoral research scientists from Mechanical Engineering, Electrical Engineering, Physics, Chemistry, Materials Science, and Biology. In addition, we have extensive laboratory facilities, including mechanical, electrical, and optical workshops, a BL2 biology work area, a chemistry laboratory, and a clean room complete with an electron microscope. These facilities allow our researchers to move quickly from a medical device concept to a prototype and rapidly iterate their designs.
Dr. Bhatia is the group leader for the Grid Development group at the San Diego Supercomputer Center (SDSC). The group is funded by various grid projects to build grid middleware components for existing and emerging cyberinfrastructure projects. The group is developing a wide variety of software, including GEMSTONE, INCA, and GEON Systems components, among others. Prior to joining SDSC, Dr. Bhatia was a senior engineer at Entropia, an enterprise distributed computing infrastructure company. Dr. Bhatia received a PhD in Computer Science from UCSD, specializing in distributed computing and fault tolerance.
Gully studied physics as an undergraduate at Imperial College in London when, half-way through an early-morning Friday lecture on detectors for sub-atomic particles, he had an epiphany that he wanted to study how the brain works. He started a Ph.D. at Oxford, only to find that the theoretical foundations of neuroscience were not the stringent, mathematically defined coda he was used to. Since then, Gully has striven to formalize the theoretical basis of his subject by building practical systems that might be used in laboratories. Since completing his Ph.D., he has worked as a postdoc and then an assistant research professor in the neuroanatomy laboratory of Professor Larry Swanson at USC.
Cohen was a Senior Software Developer at the Texas Advanced Computing Center, but has just recently started graduate school at the University of Pennsylvania.
Milton Corn, M.D.
Dr. Milton Corn is Associate Director of the National Library of Medicine (NLM), and Director of the Library’s grant programs. He is a graduate of Yale College and Yale Medical School. Post-graduate training includes internal medicine at Harvard’s Peter Bent Brigham Hospital, and hematology at Johns Hopkins. Most of his academic career was spent at Georgetown University School of Medicine, where he held the appointment of Professor of Medicine. In 1984-85 he was Medical Director of Georgetown University Hospital. He served as Dean of Georgetown’s Medical School 1985-89. He joined N.I.H. in 1990, and administers a broad spectrum of grant programs in the domain of biomedical computing relevant for basic research, health care delivery and education. He is board-certified in internal medicine and hematology, and is a fellow of the American College of Physicians and of the American College of Medical Informatics.
He is a Professor of Computational Methods in the Computational Engineering Design Research Group (CED) within the School of Engineering Sciences (SES). The School was awarded grade 5* in the 2001 national assessment of research in UK universities. The primary research interests of the group are in the three broad themed areas of optimisation and design search, applied computational modelling, and computational methods. This spans design optimisation, solid material modelling, computational electromagnetics, repetitive structures, contact mechanics, structural dynamics, and computational methods. The CED aims to be a centre of excellence for multi-disciplinary engineering simulation and design, combining a range of analytical, computational, and experimental techniques. Our strength lies in this sophisticated mix of engineering methods coupled to industrial applications; a particular focus for our activities over the next few years will be the development of grid-based problem-solving services for use by academia and industry.
Jeff Dozier’s research and teaching interests are in the fields of snow hydrology, Earth system science, remote sensing, and information systems. He has pioneered interdisciplinary studies in two areas: one involves the hydrology, hydrochemistry, and remote sensing of mountainous drainage basins; the other is in the integration of environmental science and computer science and technology. In addition, he has played a role in development of the educational and scientific infrastructure. He founded UC Santa Barbara’s Donald Bren School of Environmental Science & Management and served as its first Dean for six years. He was the Senior Project Scientist for NASA’s Earth Observing System in its formative stages when the configuration for the system was established. After receiving his PhD from the University of Michigan in 1973, he has been a faculty member at UCSB since 1974. He is a Fellow of the American Geophysical Union, the American Association for the Advancement of Science, and the UK’s National Institute for Environmental eScience. He is also an Honorary Professor of the Chinese Academy of Sciences and a recipient of the NASA Public Service Medal.
Dennis Gannon is a professor in the Department of Computer Science at Indiana University, a department he chaired from 1997 to 2004. His previous positions include the Department of Computer Science at Purdue University. He was also a senior visiting research scientist at the Center for Supercomputer Research and Development, University of Illinois. He was a partner in the NSF Computational Cosmology Grand Challenge project. He is a founding member of the DOE Common Component Architecture software group and the NCSA Alliance. From 1998 to 2000 he worked on the NASA Information Power Grid. He is on the steering committee for the Global Grid Forum. Gannon is also the Science Director for the Indiana Pervasive Technologies Labs.
He founded a new lab in Modeling and Computational Science in the Department of Computer Science at the University of Houston in January 2002; its main goal is to develop interdisciplinary projects between applied mathematicians and computer scientists, with applications in biology, medicine, and environmental sciences.
Shahram Ghandeharizadeh received his Ph.D. degree in Computer Science from the University of Wisconsin, Madison, in 1990. Since then, he has been on the faculty at the University of Southern California. Shahram is a recipient of the National Science Foundation Young Investigator’s award for his research on physical design of parallel database systems. His primary research interest is the field of neuroinformatics, emphasizing the use of Web Services to facilitate publication, use, and integration of autonomous data sources.
Jonathan Goodall is an Assistant Professor of the Practice of Geospatial Analysis at Duke University in the Nicholas School of the Environment and Earth Sciences. His primary research and teaching interests are in geographic information systems applied to water resources science and engineering. He completed his Ph.D. in civil engineering from the University of Texas at Austin in 2005.
Dr. Victoria Hilford received her Master’s in Electrical Engineering and her Master’s and Ph.D. in Computer Science. She worked in industry for 10 years before she started teaching at the University of Houston. Currently, Dr. Hilford is working on the Biomedical Data Grid project, which provides database support to several projects in the biomedical field.
Marty A. Humphrey
Marty A. Humphrey received the Bachelor of Science degree in electrical and computer engineering in 1986 and Master of Science degree in electrical and computer engineering in 1988 from Clarkson University, Potsdam, NY, and a PhD in computer science from the University of Massachusetts in 1996. From 1998 to the present, he has been with the Department of Computer Science at the University of Virginia, Charlottesville, VA where he was first a Research Assistant Professor and is currently (2002-) an Assistant Professor. His areas of research include many aspects of Grid Computing, including security, programming models, performance, Grid testing, and Grid usability. He is active in the Global Grid Forum, where he recently completed a term on the GGF Steering Committee.
Humphries is a Research Scientist in the Geology Department at the University of Texas and Project Manager for the Digital Library of Vertebrate Morphology (or Digimorph Project), an NSF-funded digital library project. His background is in biology and biological informatics.
Dr. Keyuan Jiang received his Ph.D. in Biomedical Engineering from Vanderbilt University, Nashville, Tennessee, and is Assistant Professor of Computer Information Systems and Information Technology at Purdue University Calumet, Hammond, Indiana. Dr. Jiang has conducted a number of research projects in the area of computer applications in biomedicine, ranging from a knowledge-based system for synthetic gene design and a bedside graphical nursing charting system to a communication log system for clinical studies. His current interests are focused on the Semantic Web in life sciences and bioinformatics Web services. Dr. Jiang is a member of the IEEE Engineering in Medicine and Biology Society, and serves on the editorial board of IEEE Transactions on Information Technology in Biomedicine. As a faculty member at Purdue University Calumet, he has taught courses in software development and bioinformatics. Prior to his current position, Dr. Jiang was a Technical Advisor at two private companies, delivering e-business solutions using Microsoft technologies.
Sun Kim is currently Associate Director of the Bioinformatics Program, Assistant Professor in the School of Informatics, and Associate Faculty at the Center for Genomics and Bioinformatics at Indiana University – Bloomington. Prior to IU, he worked at DuPont Central Research as a Senior Computer Scientist from 1998 to 2001, and at the University of Illinois at Urbana-Champaign from 1997 to 1998 as Director of Bioinformatics and Postdoctoral Fellow at the Biotechnology Center and a Visiting Assistant Professor of Animal Sciences. Sun Kim received his B.S., M.S., and Ph.D. in Computer Science from Seoul National University, the Korea Advanced Institute of Science and Technology (KAIST), and the University of Iowa, respectively. Sun Kim is a recipient of an Outstanding Junior Faculty Award at Indiana University (2004-2005), an NSF CAREER Award (DBI-0237901, 2003 to 2008), and an Achievement Award at DuPont Central Research in 2000.
David Lifka is the Director of High Performance Systems and Innovative Computing for Computing and Information Sciences at Cornell University. His duties include management of the technical staff providing systems administration, consulting, and systems research and development for CIS, Computer Science, and the Cornell Theory Center. Lifka is an expert in Windows-based high performance computing and led CTC’s technical move from proprietary UNIX to Windows-based, industry-standard high performance computing, working with strategic partners including Microsoft, Intel, Dell, Unisys, Giganet, and ADIC. His areas of expertise include parallel job scheduling and resource management systems, UNIX-to-Windows migration, and HPC services. Lifka’s vision is that HPC must become pervasive and as easy to use out-of-the-box as a personal computer to make it a viable tool for more than those at academic institutions and research laboratories. Lifka is actively involved in eScience and data intensive computing efforts at Cornell. Understanding the manageability and maintainability of petabyte-size data repositories, as well as the use of SQL Server and Web services for developing seamless HPC interfaces, are of primary interest to Lifka. Lifka came to Cornell University from Argonne National Laboratory in 1995. Lifka has a Ph.D. in Computer Science from the Illinois Institute of Technology and serves on a number of corporate and IT advisory boards.
Krishna P.C. Madhavan
Dr. Krishna P.C. Madhavan is a Research Scientist for Teaching and Learning Applications with the Rosen Center for Advanced Computing and the NSF-funded Envision Center for Data Perceptualization at Information Technology at Purdue. Dr. Madhavan is also the Educational Technology Director for the NSF-funded Network for Computational Nanotechnology (NCN). He serves as the Curriculum Director for the Supercomputing 2005 Education Program and is also the Chair for the Supercomputing 2006 Education Program. Dr. Madhavan also spearheads the Zecosystem project at Purdue University.
Beth Plale is an Assistant Professor in the Department of Computer Science at Indiana University. Prior to joining Indiana University, Professor Plale held a postdoctoral position in the Center for Experimental Research in Computer Systems at Georgia Tech. Plale’s Ph.D. is in computer science from the State University of New York at Binghamton. She earned an M.S. in computer science from Temple University in 1991, an MBA from the University of La Verne in 1986, and a B.Sc. in computer science from the University of Southern Mississippi in 1984. Professor Plale’s interest in experimental systems was heavily influenced by time spent as a software engineer in the defense industry in the 1980s. Her research interests include data-driven applications, parallel and distributed computing, data management, and grid computing.
Paul Roe is an associate professor at QUT where he leads the programming language and system research group. His research and teaching interests lie in the areas of distributed computing, particularly grid computing and web services, and programming languages. Paul has published over 60 papers and has received numerous grants; much of his research is done in conjunction with industry. For the past five years he has been using .NET in both his teaching and research.
Dr. Uma Shama is a professor in the Department of Mathematics and Computer Science at Bridgewater State College. She is Co-Director (with Mr. Harman) of the GeoGraphics Laboratory at the Moakley Center for Technological Applications and the principal investigator of the 2005 National Transit Use of GIS Survey. Mr. Harman is the president of Harman Consulting LLC and a co-director of the GeoGraphics Laboratory, a public/private partnership with Bridgewater State College. He is principal investigator of a Federal Transit Administration-sponsored Small Business Innovative Research (SBIR) project using remote sensing and unmanned aerial vehicles for transit safety and security. Mr. Harman and Dr. Shama are co-principal investigators of the Federal Transit Administration’s WiFi e-transit village prototype research project.
Ani Thakar is a Research Scientist in the Center for Astrophysical Sciences at the Johns Hopkins University in Baltimore. His research centers on data-intensive science with large astronomical databases. He is primarily involved in the development of data mining tools and services for the Sloan Digital Sky Survey Science Archive (www.sdss.jhu.edu) and the US National Virtual Observatory (www.us-vo.org), and is also involved in the development and planning of the Large Synoptic Survey Telescope (www.lsst.org).
Urban is currently Manager of the Data and Information Systems group at the Texas Advanced Computing Center. He has worked at TACC for three years, focusing on database and scientific data collection issues. Prior to TACC, Urban worked for several years as an application and database architect in the private sector.
Dan Werthimer is director of the SERENDIP SETI program and chief scientist of SETI@home at the University of California, Berkeley. Werthimer was an associate professor in the engineering and physics departments of San Francisco State University and has been a visiting professor at Beijing Normal University, the University of St. Charles in Marseille, and Eotvos University in Budapest, and has taught at universities in Peru, Egypt, Ghana, Ethiopia, Zimbabwe, Uganda, and Kenya. Werthimer has published numerous papers in the fields of SETI, radio astronomy, instrumentation, and science education; he is co-author of ‘SETI 2020’ and editor of ‘Astronomical and Biochemical Origins and the Search for Life in the Universe’.