Building a .NET Quality Control Tool for Next-Generation Sequencing Technologies
The challenge of DNA sequencing is central to all genomics research. Although the technology has existed since the 1970s, today’s massively parallel sequencing instruments can produce gigabytes of raw genomic data quickly and increasingly cheaply. Reconstructing a DNA sequence from this data (for example, through de novo assembly) is a compute-intensive task, and experimentation has shown that data quantity is no substitute for quality when it comes to accurate reconstruction. Unfortunately, sequencing technologies differ in reliability and accuracy, and experimental data will always contain some rate of error. A preliminary quality control (QC) step is therefore regularly employed to detect and counteract such sequencing errors.
The QC of sequencing results may range from simple manual filtering procedures to comprehensive automated solutions. To contribute to the development of QC tools, we present Sequence Quality Control Studio (SeQCoS), a Microsoft .NET software suite designed to perform an array of QC evaluations and post-QC manipulation of sequencing data. SeQCoS generates a series of standard plots, saved in JPEG format, that illustrate the quality of the input data. These plots cover commonly observed measurements, such as GC content (the proportion of guanine and cytosine nucleotide bases in a DNA sequence) and the distribution of quality scores at position-specific and sequence-specific levels. To filter out poorly performing sequences, SeQCoS also provides basic functions for trimming reads and discarding them from sequence files.
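To make these measurements concrete, here is a minimal sketch of GC content, per-position mean quality, 3' quality trimming, and read discarding. It is not SeQCoS code: SeQCoS is written in C#, and Python is used here only for brevity. The function names, the thresholds, and the Phred+33 quality encoding are all assumptions chosen for illustration.

```python
from statistics import mean

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger-style) quality encoding


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def phred_scores(quality_string, offset=PHRED_OFFSET):
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def per_position_mean_quality(quality_strings):
    """Mean score at each read position; reads may differ in length."""
    longest = max((len(q) for q in quality_strings), default=0)
    means = []
    for i in range(longest):
        # Only reads long enough to have a base at position i contribute.
        column = [ord(q[i]) - PHRED_OFFSET for q in quality_strings if i < len(q)]
        means.append(mean(column))
    return means


def trim_3prime(seq, quality_string, min_quality=20):
    """Drop trailing bases whose score is below min_quality (hypothetical rule)."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], quality_string[:end]


def keep_read(seq, quality_string, min_length=4, min_mean_quality=20):
    """Decide whether a read survives filtering (hypothetical thresholds)."""
    if not seq or len(seq) < min_length:
        return False
    return mean(phred_scores(quality_string)) >= min_mean_quality


if __name__ == "__main__":
    read, qual = "ACGTGGCCAT", "IIIIIIII##"
    print(gc_content(read))              # 0.6
    trimmed = trim_3prime(read, qual)
    print(trimmed)                       # ('ACGTGGCC', 'IIIIIIII')
    print(keep_read(*trimmed))           # True
```

In this example the last two bases carry a score of 2 (the `#` character under Phred+33), so trimming removes them before the read is checked against the length and mean-quality thresholds. Trimming before discarding is one reasonable ordering; a real pipeline may prefer another.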
At Microsoft Research, the Microsoft Biology Initiative team is collaborating with academic research groups in the sequencing of various organisms. To check whether a sequenced sample is contaminated by other strains or sequencing vectors, SeQCoS optionally integrates the Windows version of NCBI BLAST to search against a BLAST-formatted database. We provide a pre-formatted database of NCBI UniVec, a repository of vector sequences, adapters, linkers, and PCR (polymerase chain reaction) primers that are used in DNA sequencing; however, researchers can use a different database if they prefer.
About the Tools
SeQCoS was written in C#, using the .NET Bio (formerly the Microsoft Biology Foundation [MBF]) bioinformatics toolkit and Sho, a data analysis and visualization application. It is freely available as open-source code under the Apache 2.0 license. Further details and software downloads are available from Sequence Quality Control Studio.
.NET Bio is a library of common bioinformatics functions (file parsers, algorithms, and web service connectors) that simplifies the creation of bioinformatics applications on the .NET platform. It is an open-source project, freely available for academic and commercial use under the Apache 2.0 license. Although the project was initiated by Microsoft Research, it is owned by the Outercurve Foundation, a non-profit organization, and is governed by a growing community of users and contributors.
—Kevin Ha, Microsoft Research Intern