Scientists use machine learning to predict DNA binding rates from sequence
The binding of DNA strands by Watson-Crick base pairing is a fundamental process in biotechnology, used around the world for reading and writing DNA sequences and for assembling DNA nanostructures. Yet the process remains poorly understood, and until now there has been no way to accurately predict how quickly two DNA strands will bind. The binding rate of two complementary DNA strands can vary by a factor of 10,000 or more depending on their sequence, which can lead to unexpected behavior that is costly and time-consuming to correct. In a new paper published in the journal Nature Chemistry, scientists at Rice University and Microsoft Research describe a method that predicts the binding rate of DNA strands directly from their sequence to within a factor of three, with 91% accuracy.
The method is based on Weighted Neighbour Voting, and predicts the binding rate of a DNA sequence to its complement from its similarity to sequences whose rates have previously been measured. Each DNA sequence is mapped to a set of bioinformatic feature values, which can be considered a point in a high-dimensional feature space. Two sequences that are close in feature space are expected to have similar binding rates. The binding rate of a new sequence is therefore predicted as a weighted average of previously measured binding rates, with weights that drop exponentially for sequences farther away in feature space.
To train the algorithm, scientists led by Dave Zhang at Rice University precisely measured the binding rates of 100 different pairs of complementary DNA strands of fixed length, sampled from two genes in the human genome, over a range of temperatures. Working in collaboration with scientists led by Andrew Phillips at Microsoft Research, the team used machine learning to prune an initial set of 50 rationally designed features down to a combination of just six. Computational optimization was used to select the feature weights, and a leave-one-out strategy was used to validate the predictive power of the method. The approach is highly scalable and easily incorporates new experimental measurements to improve its predictions, without requiring the model to be retrained. With every new binding rate measured, the six-dimensional feature space becomes denser, so that on average a new sequence lies closer to a previously measured one.
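The prediction and validation steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: the weighted Euclidean distance, the decay parameter, averaging rates on a log10 scale, reading "accuracy" as the fraction of held-out predictions that fall within a factor of three, and the toy two-feature data. All function and variable names are hypothetical.

```python
import math

def distance(a, b, feature_weights):
    # Assumed form: weighted Euclidean distance between two feature vectors.
    return math.sqrt(sum(w * (x - y) ** 2
                         for w, x, y in zip(feature_weights, a, b)))

def predict_log_rate(query, measured, feature_weights, decay=1.0):
    # Weighted average of measured log10 rates; each vote decays
    # exponentially with distance from the query in feature space.
    numerator = 0.0
    denominator = 0.0
    for features, log_rate in measured:
        vote = math.exp(-decay * distance(query, features, feature_weights))
        numerator += vote * log_rate
        denominator += vote
    if denominator == 0.0:
        raise ValueError("all neighbours are too far away to vote")
    return numerator / denominator

def leave_one_out_accuracy(measured, feature_weights, decay=1.0, factor=3.0):
    # Hold out each sequence in turn, predict it from the rest, and count
    # predictions within the given fold-factor of the measured rate.
    tolerance = math.log10(factor)
    hits = 0
    for i, (features, log_rate) in enumerate(measured):
        others = measured[:i] + measured[i + 1:]
        predicted = predict_log_rate(features, others, feature_weights, decay)
        if abs(predicted - log_rate) <= tolerance:
            hits += 1
    return hits / len(measured)

# Toy data (invented for illustration): (feature vector, log10 binding rate).
toy_measurements = [
    ([0.0, 0.0], 6.0),
    ([1.0, 0.0], 6.2),
    ([0.0, 1.0], 5.9),
    ([5.0, 5.0], 4.0),
]
toy_weights = [1.0, 1.0]

print(predict_log_rate([0.5, 0.5], toy_measurements, toy_weights))
print(leave_one_out_accuracy(toy_measurements, toy_weights))
```

Because a prediction is just a weighted average over the stored measurements, adding a new measurement in this sketch means appending one entry to the list. This mirrors the point above that new data can be incorporated without retraining.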
The method has so far been used to design efficient probes for target enrichment from genomic DNA, an important technique for scientific and clinical studies. More generally, the ability to accurately predict DNA binding rates could lead to substantial cost savings and faster workflows. Given the ubiquity of DNA probe and primer strands in scientific research and molecular diagnostics today, the work is likely to be of immediate and broad interest to the many scientists who currently take a trial-and-error approach to DNA strand design.