4-page Case Study - Posted 8/6/2008
Views: 546
Rate This Evidence:
Researchers Use Commercial Database Tools to Analyze Terabytes of Data More Easily
The Daggett Research Group within the University of Washington Department of Bioengineering is world-renowned in the study of protein stability, function, and folding. However, because the group had more than 64 terabytes of simulation data stored on more than 10 Linux-based file servers and 100 disks, mining the data was complicated and time-consuming. The Daggett Research Group used Microsoft® SQL Server® 2005 to address the challenge, taking advantage of the software’s relational and online analytical processing (OLAP) capabilities to build a 20-terabyte data warehouse that vastly simplifies the management and analysis of simulation data. By fundamentally changing how the group analyzes simulation results, the solution has accelerated time-to-insight for new discoveries and enabled researchers to investigate questions that were practically impossible to answer in the past.
Situation
The Daggett Research Group within the University of Washington Department of Bioengineering is world-renowned in the study of protein stability, function, and folding—some of the fundamental unsolved problems in molecular biology. Although much is known of native folded conformation of proteins, very little is known about the actual folding process, the understanding of which has important implications for research into all biological processes, including aging and human diseases. One of the group’s main areas of focus is its Dynameomics project, an effort to investigate and catalog the native states and unfolding pathways of all protein folds.
Given that experimental approaches provide only limited amounts of information, the Daggett Research Group uses computer simulations, which require massive amounts of computation. The bulk of that computing is done using donated time on high-performance computing (HPC) systems at the U.S. Department of Energy’s National Energy Research Scientific Computing Center (NERSC) in Berkeley, California. The group augments donated computing resources with local HPC clusters based on Linux and Windows® Compute Cluster Server 2003, which, since 2006, has been the group’s platform of choice for new HPC systems in its lab.
The Daggett Research Group identifies specific proteins of interest by how frequently a protein fold occurs in known structures, that protein fold’s involvement in disease, and researcher interest. Researchers run simulations using an in-house-developed modeling program called in lucem Molecular Mechanics (ilmm), which outputs compressed binary files that contain three-dimensional coordinates for every atom in the simulation over time. However, such simulation runs are only the beginning; researchers must then analyze the results of the simulations, which generate massive amounts of data. For the Dynameomics project alone, in which a typical simulation may generate more than 21,000 sets of coordinates for each of up to 60,000 atoms, the group already has produced more than 64 terabytes of data and is generating an additional 15 terabytes a year.
With simulation results stored on 10 Linux-based file servers, researchers found that mining the data from simulations was complicated and time-consuming. “The more than 2,300 simulations we’ve completed so far present a unique opportunity for cross-simulation studies to characterize the general dynamics of proteins, but they also pose some unique challenges in terms of data management and analysis,” says Andrew Simms, a graduate student in the Daggett Lab. “In the past, even the simplest queries involved writing intricate programs to locate and parse data of interest. Programs to answer questions that require data from multiple simulations were even more difficult to write, presenting a major obstacle to the discovery of new knowledge.”
To address the challenges of managing and mining such a large amount of data, the Daggett Research Group needed a reliable and manageable repository for its terabytes of simulation results. The repository would need to store data in a format that enabled portions of interest to be accessed quickly, easily, and in a manner that facilitated analysis—especially cross-simulation analysis. “One of our long-term goals is to provide a repository for simulation data, both from our lab and from other research groups, for the benefit of the broader community,” says David Beck, a postdoctoral fellow at Daggett Lab. “That goal mandates a data repository that is flexible and extensible enough to accommodate different data formats, with enough data captured about the different protocols under which simulations were run to allow meaningful comparisons to be made.”
Solution
The Daggett Research Group developed the Molecular Dynamics Data Warehouse, a scalable repository that vastly simplifies the management and analysis of simulation data. To construct the data warehouse, the Daggett Research Group took advantage of the relational and online analytical processing (OLAP) capabilities of Microsoft® SQL Server® 2005 data management software.
“Data-warehousing techniques have been applied widely in business and financial applications; however, application in scientific research has been less common,” says Simms. “We used SQL Server 2005 to design a novel, hybrid database that accommodates our large, varied datasets and is now the primary data access method for members of our lab.”
Developed and enhanced over a period of three years, the Molecular Dynamics Data Warehouse utilizes commodity server computers running the Windows Server® 2003 Enterprise x64 Edition operating system and Microsoft SQL Server 2005 Enterprise Edition (64-bit), including SQL Server 2005 Analysis Services for OLAP processing. The repository, which captures the decisions and workflows of the researchers producing the data along with simulation results, consists of four primary components:
- A relational “prep” database, which models a view of protein fold space, potential folds of interest, and simulations of those targets.
- A relational directory database, which maps simulation data to specific servers.
- Multiple relational simulation databases, which store the results of simulations.
- A multidimensional OLAP database, which removes the structural complexity of the relational store and presents a single, easy-to-use view of all coordinate and analysis data.
All databases reside on three Silicon Mechanics server computers. Two are configured with two Intel Xeon 5030 dual-core processors, while the third has two Intel Xeon E5405 quad-core processors. All have 16 gigabytes (GB) of RAM and an Intel PRO/1000 gigabit Ethernet adapter. The data warehouse was developed using the Microsoft Visual Studio® 2005 development system, which is the same tool that the Daggett Research Group used to adapt ilmm to run on Windows Compute Cluster Server 2003.
Molecular Dynamics Data Warehouse
The three servers upon which the Molecular Dynamics Data Warehouse is built collectively store more than 20 terabytes of simulation data, with approximately 7 terabytes on each of the dual-processor systems. The quad-processor system, which the group is in the process of loading, holds about 6 terabytes out of a projected 12-terabyte capacity. “We’re about 50 percent complete in loading all target data,” says Simms. “We expect to fill our third database server by July 2008 and are starting to prepare a fourth one, which we expect to fill by the end of the year.”
 |
Our use of SQL Server 2005 plays a pivotal role in supporting overall lab workflows and has enabled us to attack problems in new ways and at new magnitudes. |
 |
|
Valerie Daggett Professor, University of Washington |
|
|
Each server supports multiple simulation databases, and each database contains the results of about 100 simulations. The databases range from 100 GB to 1.8 terabytes in size, with individual table sizes of up to 43 GB, representing up to 496 million rows. “Although there’s no practical size limit for the databases and their tables, we’ve elected to limit their sizes to take advantage of parallel loading across multiple servers,” says Simms. “By loading simulations into separate tables, we can apply constraints and indexes early in the process, yet can still use heap-structured tables that are optimized for bulk loading.”
In the beginning, the Daggett Research Group used SQL Server 2005 Integration Services to load data into the warehouse. However, the group has transitioned to stored procedures and table functions that are written using C# and run under the Microsoft .NET Framework Common Language Runtime within SQL Server 2005. The loading process also takes advantage of the Client for Network File System (NFS) provided by Microsoft, which enables the group to mount its Windows-based systems directly to its Linux-based file servers without having to use SAMBA open source software, resulting in significantly better performance.
“It takes about an hour to load coordinate trajectories from one simulation into a table, including error checking,” says Simms. “That’s not bad, considering there are about 100 million rows of flat-file data, representing about 18 gigabytes. The Client for NFS has really changed how we load data, making the process much faster. We were ecstatic when we found out that the tool existed.”
During data loading, the import code builds Structured Query Language (SQL) views that contain all the data in each individual simulation—as well as the server and database name—in a single, easy-to-query structure. “Database-level views are aggregated across the data warehouse in the directory database, and the linked server feature in SQL Server 2005 is used to implement a federated structure that spans servers and databases,” says Simms. “Tables participating in database-level views are built with clustered primary keys, with constraints on key dimensional items, such as simulation ID, further optimizing the view structures for efficient access.”
Multidimensional Analysis
Now that simulation data is stored in SQL Server 2005–based relational databases, the Daggett Research Group is taking advantage of SQL Server 2005 Analysis Services to further expedite its analysis. By using Analysis Services to load the relational data into OLAP cubes, researchers avoid having to embed SQL calls for data access into some other program that can handle the analysis (such as one written in C++), or writing the full analysis in SQL, which would require a fairly in-depth understanding of the database schema and of relational calculus.
“Our project is unusual for its focus on the multidimensional and analytical capabilities of OLAP rather than its data-aggregation features,” says Catherine Kehl, who started the project and created the first OLAP prototypes. “Our datasets are uneven in structure, with different numbers of atoms and residues per simulation, different numbers of atoms per residue, and different time slices, resulting in a large number of properties that must be tracked and accounted for—and relationships that need to be modeled that are more complex than those easily described by relational calculus. The data-modeling capabilities of SQL Server Analysis Services have proved to be very flexible—well able to accommodate our needs.”
The Daggett Research Group has built three different OLAP cubes to perform various analyses and is now in the process of building a fourth to analyze amyloids, which are proteins that have been linked to some diseases like Mad Cow and Alzheimer’s. Relationships in the cube are built automatically according to primary-key and foreign-key relationships specified in the individual simulation databases. Additional relationships are added to accommodate partial foreign-key relationships between entities, and the partitioning capabilities of Analysis Services are used to add both coordinate and analysis data. The OLAP cube for the Dynameomics project, which is 32 GB in size, contains 10 measure groups and 12 dimensions, with up to 400 million members per dimension.
Researchers query the OLAP cubes using Multidimensional Expressions (MDX), a highly expressive query language that uses terms such as from, where, select, and with. “MDX allows for a great degree of flexibility both in defining complex analyses and in how those data are filtered and formatted for display,” says Simms. “The multidimensional nature of our data often means that we require a matrix output for our data, so the ability to use MDX to control the data returned from a query in terms of both rows and columns is a major advantage over structured query language.”
One example of this advantage can be seen in the effort to measure the distances between all residues in a protein at a given time. “Using MDX, it’s trivial to set the columns and rows to give an all-versus-all representation of the residues in a matrix,” says Simms. “However, to get a matrix representation using SQL, one must individually set each column to represent each residue.”
Double-Tasking the Windows-Based HPC Cluster
In addition to running SQL Server 2005 on the three servers that house its Molecular Dynamics Data Warehouse, the Daggett Research Group also runs SQL Server 2005 on its Windows-based HPC clusters, providing yet another means of analyzing the terabytes of information in its repository of simulation results. “By running SQL Server 2005 on top of Windows Compute Cluster Server 2003, we’re able to double-task our HPC cluster, using it for both simulation and analysis,” says Valerie Daggett, Professor of Bioengineering at the University of Washington. “This is something that we just can’t do with our Linux-based HPC systems. It provides a huge advantage, and it’s the reason why Windows Compute Cluster Server 2003 is our platform of choice for new local HPC resources.”
The Daggett Research Group runs SQL Server 2005 on each node on both of its Windows-based HPC clusters. The larger cluster, purchased in 2006, has 20 nodes, each of which is configured with two 2-gigahertz (GHz) Intel 5130 Xeon dual-core processors, 4 GB of RAM, and Intel PRO/1000 gigabit Ethernet adapters. The second such system, purchased in 2007, has 10 nodes, each configured with two 2-GHz Intel E5335 Xeon quad-core processors, 4 GB of RAM, and Intel PRO/1000 gigabit Ethernet adapters. Both run on Silicon Mechanics hardware.
Here’s how the process works: Researchers submit queries to the SQL Server 2005 databases running on each cluster node, which use the linked servers feature of SQL Server 2005 to execute commands against one or more of the databases running on the three servers in the Molecular Dynamics Data Warehouse. The HPC server nodes each do part of the work, generating intermediate results that are then combined into one master result set on the data warehouse. Similarly, the cluster nodes are used to store intermediate OLAP cubes.
“By parallelizing a query, we can have each of the servers in our 20-node cluster examine 5 percent of a target database, caching the temporary results on the cluster node,” says Beck. “Temporary results stored in tables on the cluster nodes are then synthesized into a final result set, which is fed back to the main data warehouse. It’s a lot like the gather-scatter paradigm in parallel computing.”
In the future, the Daggett Research Group may modify its ilmm simulation program running on the cluster nodes to write directly to the SQL Server 2005 database tables that make up the data warehouse, rather than to an intermediate, compressed binary format that must then be loaded into the warehouse. “First, we have to load the rest of our existing data into SQL Server,” says Simms. “We’ve already loaded 100 percent of our Dynameomics dataset, but only 50 to 60 percent of all target data.”
Other Uses of Relational Database Technology
The Daggett Research Group also uses SQL Server 2005 as a back end for its internally developed queuing service, which enables researchers to seamlessly dispatch jobs to either its Windows-based or Linux-based HPC clusters. The queuing service, which is written in Java and runs on Linux, uses the SQL Server 2005 Java Database Connectivity driver to access the Windows-based database servers. “In computational chemistry, there’s a strong Linux/UNIX mindset,” says Daggett. “So people in the lab were reluctant to use Windows at first. Today, now that we’ve made the process of dispatching jobs transparent, our Windows-based clusters are given more work and we’re very much a mixed lab.”
SQL Server 2005 also is used as a back end for the Dynameomics Web site (http://www.dynameomics.org/), through which the group is making a subset of the data in its Molecular Dynamics Data Warehouse available to other researchers throughout the world.
Benefits
By taking advantage of the relational and OLAP capabilities of SQL Server 2005, the Daggett Research Group has been able to fundamentally change how its terabytes of simulation data are managed and accessed, which in turn has contributed to an accelerated rate of scientific discovery. “Our use of SQL Server 2005 plays a pivotal role in supporting overall lab workflows and has enabled us to attack problems in new ways and at new magnitudes,” says Daggett. “By organizing simulation data into a scalable, manageable, and accessible repository, we can accelerate time-to-insight and begin to address substantial questions that move us closer to solving the protein-folding problem and other biomedical problems. With SQL Server 2005, we can investigate questions that were practically impossible to answer when our simulation results were stored in flat files.”
Streamlined Data Access
Moving simulation data into SQL Server 2005 has greatly reduced the complexity of managing that information, resulting in higher productivity for members of the Daggett Research Group. “Our data warehouse has greatly reduced the need to maintain hand-written lists and obscure copies of files, and we no longer need to write complex logic to navigate directory structures,” says Simms. “In addition, because data is stored in a relational format, we have access to that information through a greater variety of tools and interfaces.”
Today, researchers in the Daggett Research Group have two query language options—SQL and MDX—for accessing the data warehouse, both of which are relatively straightforward. “On the issue of query language, ease of use is of particular importance,” says Simms. “Ultimately, the database will be a resource for molecular biologists and biochemists as well, and it is likely that these users may not have extensive programming skills. Our experience with both SQL and MDX has shown that both are relatively easy to learn and use.”
Moving simulation results into SQL Server 2005 also has streamlined data access by helping to prevent problems with data integrity. “As the data are loaded into SQL Server 2005, we can easily check for issues such as gaps in time steps, using declarative integrity and check constraints to identify inconsistent data and prevent them from tainting the repository,” says Simms. “Such methods have also helped eliminate the problem of lost structures, which occurred infrequently but silently when a disk, file system, or program failure corrupted one or more compressed binary files.”
Today, the Daggett Research Group has a repository that not only can accommodate all its current data, but also can scale to support new simulation data as well. “We anticipate the current model will scale beyond 100 terabytes, based on a two-tiered view hierarchy, with up to 100 simulations per database view and up to 256 databases per top-level view,” says Simms.
Faster Questions and Answers
A relational data store also has reduced the amount of code needed to query simulation data, thereby enabling researchers to pose their questions faster. “As scientists, we need to be able to ask questions succinctly and rapidly,” says Beck. “We did a comparison of mining the data from the file system using Perl scripts and from the data warehouse using SQL statements. A representative script used to generate data for a Ramachandran plot—a visualization of dihedral angles—for 189 simulations consists of 246 lines of code and limited comments. In comparison, the SQL code to generate the same result is 26 lines long.”
By using both the relational and OLAP capabilities of SQL Server, the Daggett Research Group can take advantage of whichever technology is more suited to a particular analysis task. To compare SQL and MDX, the group developed a set of test queries that are typical of everyday queries run in its lab, coded each query in SQL and MDX, and then ran them on its dataset. “One particular test, in which all simulations were queried and the dihedral angles for all residues of a certain type were returned, ran more than 200 times as fast on OLAP as on SQL,” says Simms. “However, another query that calculates angle averages using circular statistics, in which a number of Visual Basic® for Applications trigonometry functions are called, completed within 16 seconds using SQL but did not finish within several hours when using OLAP.”
To capitalize on the differing strengths of the OLAP and relational capabilities of SQL Server 2005, the Daggett Research Group is taking a hybrid approach for its analysis work, making the most use of SQL’s better mathematical performance and OLAP’s speed and flexibility in data retrieval, especially across multiple simulations. Together, those technologies make it possible for new questions to be posed and, more importantly, answered. “We can examine 100 times more data because some tasks that used to take hours are now reduced to fractions of a second, enabling us to do things that we just couldn’t do before,” says Daggett. “It’s because of the ability to run SQL Server 2005 that any new systems we purchase are based on Windows.”
Sources
Supplemental information for this case study was drawn from The Dynameomics Data Warehouse: Design of a Computational Lab Workflow and Scientific Data Repository for Protein Simulations (Andrew M. Simms, Rudesh D. Toofanny, Catherine Kehl, Noah C. Benson, and Valerie Daggett, Protein Engineering Design & Selection 21 [2008]: 369–377) and Dynameomics: A Multi-dimensional Analysis-Optimized Database for Dynamic Protein Data (Catherine E. Kehl, Andrew M. Simms, Rudesh D. Toofanny, and Valerie Daggett, Protein Engineering Design & Selection 21 [2008]: 379–386).
Microsoft Server Product Portfolio
For more information about the Microsoft server product portfolio, go to:
www.microsoft.com/servers/default.mspx
For More Information
For more information about Microsoft products and services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (877) 568-2495. Customers who are deaf or hard-of-hearing can reach Microsoft text telephone (TTY/TDD) services at (800) 892-5234 in the United States or (905) 568-9641 in Canada. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information using the World Wide Web, go to:
www.microsoft.com
For more information about the Daggett Research Group, visit the Web site at:
depts.washington.edu/daglab