Microsoft Research Blog

Scientists use machine learning to predict DNA binding rates from sequence

November 8, 2017 | By Microsoft blog editor

By Microsoft Research Lab – Cambridge and Department of Bioengineering, Rice University

Rice University / Dave Zhang / Cindy Thaung

The binding of DNA strands by Watson-Crick base pairing is a fundamental process in biotechnology, used around the world for reading and writing DNA sequences and for assembling DNA nanostructures. Yet the process remains poorly understood, and until now there has been no way to accurately predict how quickly two DNA strands will bind. The binding rate of two complementary DNA strands can vary by a factor of 10,000 or more depending on their sequence, which can lead to unexpected behavior that is costly and time-consuming to troubleshoot. In a new paper published in the journal Nature Chemistry, scientists at Rice University and Microsoft Research describe a method that predicts the binding rate of DNA strands directly from their sequence to within a factor of three, with 91% accuracy.

The method is based on Weighted Neighbour Voting: it predicts the binding rate of a DNA sequence to its complement from the sequence's similarity to sequences whose rates have already been measured. Each DNA sequence is mapped to a set of bioinformatic feature values, which can be treated as a point in a high-dimensional feature space. Two sequences that are close in feature space are expected to have similar binding rates, so the binding rate of a new sequence is predicted as a weighted average of previously measured binding rates, with weights that drop exponentially for sequences farther away in feature space.

To train the algorithm, scientists led by Dave Zhang at Rice University precisely measured the binding rates of 100 different pairs of complementary DNA strands of fixed length, sampled from two genes in the human genome, over a range of temperatures. Working in collaboration with scientists led by Andrew Phillips at Microsoft Research, the team used machine learning to prune an initial set of 50 rationally designed features down to a combination of just six. Computational optimization was used to select the feature weights, and a leave-one-out strategy was used to validate the predictive power of the method.

The approach is highly scalable and readily incorporates new experimental measurements to improve its predictions, without requiring the model to be retrained. With every new binding rate measured, the six-dimensional feature space becomes denser, so that on average a new sequence will lie closer to a previously measured one.
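The weighted-average prediction and the leave-one-out check described above can be sketched in a few lines of Python. This is a minimal illustration, not the published implementation. The weighted Euclidean distance, the `decay` parameter, the averaging of rates in log10 space, and the function names `wnv_predict` and `leave_one_out_accuracy` are all assumptions made for the example. The toy feature values and rates are invented and are not measurements from the paper.

```python
import math


def wnv_predict(query, train_features, train_log_rates, feature_weights, decay=1.0):
    """Predict a log10 binding rate by weighted neighbour voting.

    Assumptions (not from the paper): distance is a weighted Euclidean
    distance, and each neighbour's vote is exp(-decay * distance).
    """
    distances = [
        math.sqrt(sum(w * (q - f) ** 2 for w, q, f in zip(feature_weights, query, feats)))
        for feats in train_features
    ]
    # Subtracting the smallest distance leaves the weighted average unchanged
    # but keeps exp() from underflowing to zero for every neighbour at once.
    d_min = min(distances)
    votes = [math.exp(-decay * (d - d_min)) for d in distances]
    total = sum(votes)
    return sum(v * r for v, r in zip(votes, train_log_rates)) / total


def leave_one_out_accuracy(features, log_rates, feature_weights, decay=1.0, factor=3.0):
    """Fraction of held-out predictions within `factor` of the measured rate."""
    tolerance = math.log10(factor)  # "within a factor of 3" in log10 units
    hits = 0
    for i in range(len(features)):
        # Hold out sequence i and predict it from all the others.
        rest_features = features[:i] + features[i + 1:]
        rest_rates = log_rates[:i] + log_rates[i + 1:]
        predicted = wnv_predict(features[i], rest_features, rest_rates, feature_weights, decay)
        if abs(predicted - log_rates[i]) <= tolerance:
            hits += 1
    return hits / len(features)


if __name__ == "__main__":
    # Invented toy data: two features per sequence, log10 rates (illustrative only).
    toy_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    toy_log_rates = [6.0, 6.2, 5.9, 4.0]
    toy_weights = [1.0, 1.0]

    print(wnv_predict([0.5, 0.5], toy_features, toy_log_rates, toy_weights))
    print(leave_one_out_accuracy(toy_features, toy_log_rates, toy_weights))
```

In this sketch, adding a newly measured sequence only means appending one row to the feature and rate lists. Nothing is refitted, which matches the point that new measurements can be incorporated without retraining.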

The method has so far been used to design efficient probes for target enrichment from genomic DNA, an important technique for scientific and clinical studies. More generally, the ability to accurately predict DNA binding rates could lead to substantial cost savings and faster workflows. Given the ubiquity of DNA probe and primer strands in scientific research and molecular diagnostics today, this work is likely to attract immediate and broad interest from the many scientists who currently take a trial-and-error approach to DNA strand design.

