DNA sequencing analysis is a form of life sciences research that has the potential to lead to a wide range of medical and pharmaceutical breakthroughs. However, this type of analysis requires supercomputing resources and Big Data storage that
many researchers lack. Working through a grant provided by the National Science Foundation in partnership with Microsoft, a team of computer scientists at Virginia Tech addressed this challenge by developing an on-demand, cloud-computing model using the Windows
Azure HDInsight Service. By moving to an on-demand cloud computing model, researchers will now have easier, more cost-effective access to DNA sequencing tools and resources, which could lead to even faster, more exciting advancements in medical research.
The Virginia Bioinformatics Institute and the Department of Computer Science at Virginia Tech began using a network of supercomputers to locate undetected genes in a massive genome database. This and related work by other institutions has the potential to
lead to exciting medical breakthroughs, including new cancer therapies and antibiotics used to combat the emergence of drug-resistant bugs.
However, as the size of genome databases grows, so has the challenge of analyzing them. And with the advent of next-generation sequencers (NGS), this growth has been exponential. “Of the estimated 2,000 DNA sequencers worldwide, they are generating 15 petabytes
of genome data every year,” explains Wu Feng, Professor of Computer Science at Virginia Tech. Many life sciences institutions simply do not have access to the computational and storage resources required to work with data sets of this size. In other words,
says Feng, “We’re generating data faster than we can analyze it.”
The team had already recognized the potential of high-performance cloud computing to address the resource challenge. But now they wanted to develop software that would make it even easier for scientists to take advantage of these cloud resources, which would
lead to faster genome analysis. And that’s how Feng was introduced to the potential of the Windows Azure HDInsight Service running on the Windows Azure platform.
|Although the cost to compute on a genome continues to
drop exponentially, the cost to sequence a genome has
dropped even more dramatically - from $95 million in
2001 to $5,700 in 2013. This stunning decline in cost is
being driven by the arrival of next-generation
sequences (NGS), which produce 15 petabytes of data for analysis each year.
Feng’s team was one of only 13 from across the country elected by a research program called Computing in the Cloud. Run by the National Science Foundation in partnership with Microsoft, the program was designed to accelerate access to cloud computing for research
discovery, data analysis, and multidisciplinary collaboration. Based on the potential of their proposal, Feng’s team was awarded both a grant that covered the cost of using the Windows Azure platform and its supporting technical resources.
Feng had looked at an alternative cloud service on which to do the work but found that it did not meet the requirements needed for the team’s development efforts. This included resource and support levels that simply weren’t as robust as the Microsoft offering.
For Feng and his team, Windows Azure provided an ideal combination of infrastructure and technical support “to conduct the research and development necessary to facilitate personalized genomics for the broader research community.”
||Windows Azure is enabling us to keep up with the data deluge in the DNA sequencing space. We’re not only analyzing data faster, but analyzing it more intelligently.
| Wu Feng
Professor of Computer Science
Since being awarded the grant, Feng and his team have developed two software artifacts: SeqInCloud, a popular genetic variant pipeline called the Genome Analysis Toolkit (GATK), and CloudFlow, a workflow management framework that uses both client and cloud
SeqInCloud (short for "sequencing in the cloud") is based on the Broad Institute's Genome Analysis Toolkit (GATK), a toolkit for analyzing next-generation sequencing data, with the main focus on variant discovery and genotyping.
SeqInCloud seamlessly generalizes the GATK pipeline, allowing it to run in the cloud using HDInsight and Windows Azure in order to maximize portability. The SeqInCloud application also features a novel design strategy for data partitioning, data transfer, and
storage optimization on Windows Azure. The result is more efficient use of Azure cloud resources and better performance overall.
CloudFlow is a workflow management framework that can be installed on a researcher’s PC to help interactions with the Windows Azure HDInsight Service. As Feng explains, “It allows us to compose flexible MapReduce pipelines that simultaneously utilize both client
and cloud resources for running the pipeline and automating data transfers. This is where the HDInsight resource has been particularly useful.” To run large tasks, researchers can automatically provision HDInsight clusters on demand.
The CloudFlow framework delivers unique features that are not offered by existing MapReduce-based workflow managers, including enabling the simultaneous use of client and cloud resources, automatic data-dependency handling between client and cloud resources,
and the flexibility of implementing user-defined plugins for data transformations.
By taking advantage of the Windows Azure platform, Feng and his team showed how well the Windows Azure HDInsight Service can be used seamlessly to deliver cloud applications with advanced capabilities.
By making the Windows Azure HDInsight Service more effective and accessible for DNA sequencing researchers, the project has produced several key benefits.
Provides Significant Cost Savings
The cloud computing solution developed by Feng’s team can address growing resource issues that come with analysis of genome sequencing data. As Feng notes, “Life scientists and their institutions no longer have to find millions of dollars to establish their
own supercomputing center. Rather than incur the cost of housing their own data center resources and create their own provisioning and scheduling policies, this is done for them through the Windows Azure ecosystem.”
Because the data persists in Azure blob stores independently of HDInsight, additional costs savings are realized by only paying for the compute power of the HDInsight clusters for the duration of their actual use, all without losing data.
|The rate at which genome sequences are being
generated is increasing faster than researchers'
ability to analyze and store them. (Source: Scott
D. Kahn, "On the Future of Genomic Data," Science
Supports Collaborative Analysis Anytime, Anywhere
Feng also notes the value of the Azure cloud platform as an effective collaborative tool. “The model enables the easy sharing of public data sets and helps to facilitate large-scale collaborative research.” And because the applications can be accessed
from virtually anywhere, including on mobile devices, Feng sees an opportunity not far in the future when researchers will be able to engage in genome analysis outside the laboratory, “say at a hospital, which could lead to faster, prescribed treatments.”
Even in these early stages of development, the benefits of the solution are quickly being recognized by other top research institutions across the country. For example, says Feng, “The solution has already generated interest from the University of California
at Berkeley and here at the Virginia Bioinformatics Institute.”
At the same time, Feng’s team continue to expand their research using the Windows Azure HDInsight Service. “Windows Azure is enabling us to keep up with the data deluge in the DNA sequencing space,” says Feng. “We’re not only analyzing data faster, but analyzing
it more intelligently.”
For more information about Microsoft products and services, call the Microsoft Sales Information Center at
In Canada, call the Microsoft Canada Information Centre at
Customers in the United States and Canada who are deaf or hard-of-hearing can reach Microsoft text telephone (TTY/TDD) services at
Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information using the World Wide Web, go to:
For more information about Virginia Tech, visit the website at:
To learn more about the DNA sequencing program described in this case study, please refer to: N. Mohamed, H. Lin, and W. Feng, “Accelerating Data- Intensive Genome Analysis in the Cloud,” in Proceedings of the 5th International Conference
on Bioinformatics and Computational Biology (Honolulu, Hawaii, March 2013), pp. 297–304
This case study is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.