Microsoft Research Blog

Scientists use machine learning to predict DNA binding rates from sequence

November 8, 2017 | By Microsoft blog editor

By Microsoft Research Lab – Cambridge and Department of Bioengineering, Rice University

Rice University / Dave Zhang / Cindy Thaung

The binding of DNA strands by Watson-Crick base pairing is a fundamental process in biotechnology, used around the world for reading and writing DNA sequences and for assembling DNA nanostructures. Yet this process remains poorly understood, and until now there has been no way to accurately predict how quickly two DNA strands will bind. The binding rate of two complementary DNA strands can vary by a factor of 10,000 or more depending on their sequence, which can lead to unexpected behavior that is costly and time-consuming to correct. In a new paper published in the journal Nature Chemistry, scientists at Rice University and Microsoft Research describe a method that predicts the binding rate of DNA strands directly from their sequence to within a factor of three, with 91% accuracy.

The method is based on Weighted Neighbour Voting. It predicts the binding rate of a DNA sequence to its complement from its similarity to sequences whose rates have previously been measured. Each DNA sequence is mapped to a set of bioinformatic feature values, which can be considered a point in a high-dimensional feature space. Two sequences that are close in feature space are expected to have similar binding rates, so the binding rate of a new sequence is predicted as a weighted average of previously measured binding rates, with weights dropping exponentially for sequences that are farther away in feature space.

To train the algorithm, scientists led by Dave Zhang at Rice University precisely measured the binding rates of 100 different pairs of complementary DNA strands of fixed length, sampled from two genes in the human genome, over a range of temperatures. Working in collaboration with scientists led by Andrew Phillips at Microsoft Research, the team used machine learning to prune an initial set of 50 rationally designed features down to a combination of just six. Computational optimization was used to select the feature weights, and a leave-one-out strategy was used to validate the predictive power of the method. The approach is highly scalable and easily incorporates new experimental measurements to provide improved predictions, without requiring model retraining. With every new binding rate measured, the six-dimensional feature space becomes denser, so that on average a new sequence lies closer to a previously measured one.
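To make the idea concrete, here is a minimal Python sketch of distance-weighted neighbour prediction with leave-one-out validation. It is an illustration under stated assumptions, not the implementation from the paper: the function names, the `decay` parameter, the weighted Euclidean distance, and the choice to average log10 rates rather than raw rates are all assumptions made for this example. The step that maps a DNA sequence to its feature values is not shown, and the toy feature vectors at the end are invented purely for demonstration.

```python
import math


def predict_log_rate(query, measured, weights, decay=1.0):
    """Predict log10(binding rate) for `query` from previously measured points.

    query    -- feature vector of the new sequence (list of floats)
    measured -- list of (feature_vector, log10_rate) pairs
    weights  -- per-feature weights used in the distance (assumed form)
    decay    -- how quickly a neighbour's vote falls off with distance (assumed)
    """
    numerator = 0.0
    denominator = 0.0
    for features, log_rate in measured:
        # Weighted Euclidean distance in feature space (an assumption).
        distance = math.sqrt(
            sum(w * (q - f) ** 2 for w, q, f in zip(weights, query, features))
        )
        # A neighbour's vote drops exponentially with its distance.
        vote = math.exp(-decay * distance)
        numerator += vote * log_rate
        denominator += vote
    if denominator == 0.0:
        raise ValueError("all neighbours are too far away to contribute")
    return numerator / denominator


def leave_one_out_accuracy(measured, weights, factor=3.0, decay=1.0):
    """Fraction of points predicted to within `factor` of the measured rate
    when each point is held out and predicted from all the others."""
    tolerance = math.log10(factor)
    hits = 0
    for i, (features, log_rate) in enumerate(measured):
        others = measured[:i] + measured[i + 1:]
        predicted = predict_log_rate(features, others, weights, decay)
        if abs(predicted - log_rate) <= tolerance:
            hits += 1
    return hits / len(measured)


if __name__ == "__main__":
    # Toy data: two invented features per sequence, rates as log10 values.
    toy = [([0.0, 0.0], 6.0), ([1.0, 0.0], 6.5),
           ([0.0, 1.0], 5.0), ([1.0, 1.0], 5.5)]
    feature_weights = [1.0, 1.0]
    print(predict_log_rate([0.5, 0.5], toy, feature_weights))
    print(leave_one_out_accuracy(toy, feature_weights))
```

Note that adding a new measurement only means appending another pair to the `measured` list. Nothing is refit, which mirrors the point above about incorporating new data without retraining.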

The method has so far been used to design efficient probes for target enrichment from genomic DNA, an important technique for scientific and clinical studies. More generally, the ability to accurately predict DNA binding rates could lead to substantial cost savings and faster workflows. Given the ubiquity of DNA probe and primer strands in scientific research and molecular diagnostics today, this work will garner immediate and broad interest from the many scientists who currently take a trial-and-error approach to DNA strand design.

