Virginia Tech is one of the country’s leading research institutions with a US$454 million portfolio of projects that includes DNA sequencing analysis. The Virginia Bioinformatics Institute and the Department of Computer Science at Virginia Tech began using a network of supercomputers to locate undetected genes in a massive genome database. This and related work by other institutions has the potential to lead to exciting medical breakthroughs, including new cancer therapies and antibiotics used to combat the emergence of drug-resistant bugs.
However, the challenge of analyzing genome databases has grown along with their size, and with the advent of next-generation sequencers that growth has become exponential. Producing 15 petabytes of genome data annually, Virginia Tech was generating information faster than it could analyze it. The bioinformatics and computer science team at Virginia Tech had already recognized the potential of high-performance cloud computing to address the resource challenge. Now it wanted to develop software that would make it even easier for scientists to take advantage of cloud resources and speed up analysis.
To reduce costs and improve access to DNA sequencing tools and analysis, the Virginia Tech team decided to create an on-demand cloud computing model based on Microsoft Azure HDInsight. The team was one of only 13 from across the country selected by Computing in the Cloud, a program run by the National Science Foundation and Microsoft. The team had looked at other cloud services, but found that they did not meet the technical and support requirements for its development efforts.
The Virginia Tech team created two applications: SeqInCloud, a genome analysis toolkit for analyzing next-generation sequencing data; and CloudFlow, a workflow management framework that facilitates interaction between local PCs and HDInsight. SeqInCloud runs seamlessly in the cloud and features a novel design strategy for data partitioning, data transfer, and storage optimization on Microsoft Azure. The result is more efficient use of cloud resources and better performance overall.
The CloudFlow framework delivers unique features that are not offered by existing MapReduce-based workflow managers, including the simultaneous use of local and cloud resources, automatic data-dependency handling between local and cloud resources, and the flexibility of implementing user-defined plugins for data transformations.
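To make the idea of automatic data-dependency handling between local and cloud steps more concrete, here is a minimal Python sketch. It is not CloudFlow's actual API, which this article does not describe. The names `Step`, `run_workflow`, and `stage_plugin` are hypothetical, and the steps only pass placeholder strings rather than real sequencing data. The sketch orders steps by their declared dependencies and applies a user-supplied transformation whenever data crosses between a local and a cloud step. For simplicity it runs steps one at a time, so it does not show simultaneous local and cloud execution.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """Hypothetical workflow step; not part of any real CloudFlow API."""
    name: str
    site: str                               # "local" or "cloud"
    run: Callable[[Dict[str, str]], str]    # maps named inputs to one output
    needs: List[str] = field(default_factory=list)


def run_workflow(
    steps: List[Step],
    transform: Optional[Callable[[str], str]] = None,
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Run steps in dependency order and return (outputs, execution log)."""
    by_name = {s.name: s for s in steps}
    graph = {s.name: set(s.needs) for s in steps}
    outputs: Dict[str, str] = {}
    log: List[Tuple[str, str]] = []
    # static_order() yields each step only after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        step = by_name[name]
        inputs: Dict[str, str] = {}
        for dep in step.needs:
            data = outputs[dep]
            # Apply the user-defined plugin when data moves between sites.
            if transform is not None and by_name[dep].site != step.site:
                data = transform(data)
            inputs[dep] = data
        outputs[name] = step.run(inputs)
        log.append((name, step.site))
    return outputs, log


def stage_plugin(data: str) -> str:
    """Hypothetical plugin standing in for a data transformation."""
    return f"staged({data})"


steps = [
    Step("align", "local", lambda inp: "reads.bam"),
    Step("call_variants", "cloud",
         lambda inp: f"variants[{inp['align']}]", needs=["align"]),
    Step("report", "local",
         lambda inp: f"report[{inp['call_variants']}]",
         needs=["call_variants"]),
]

outputs, log = run_workflow(steps, transform=stage_plugin)
for name, site in log:
    print(f"{name} ran on {site}: {outputs[name]}")
```

Declaring dependencies by name lets the scheduler, rather than the scientist, decide when each step's inputs are ready and when data has to move between the local machine and the cloud. A real framework would also need to handle transfers, failures, and parallel execution, none of which this sketch attempts.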
By moving to an on-demand cloud computing model, researchers now have easier, more cost-effective access to DNA sequencing tools and resources, which could lead to even faster, more exciting advancements in medical research. “Microsoft Azure is enabling us to keep up with the data deluge in the DNA sequencing space,” says Wu Feng, Professor of Computer Science at Virginia Tech. “We’re not only analyzing data faster, but analyzing it more intelligently.”
- Provides significant cost savings
- Enables collaborative analysis
- Supports research anytime, anywhere