Azimuth: Machine Learning-Based Predictive Modelling of CRISPR/Cas9 guide efficiency

Established: July 1, 2015

Project Summary

The CRISPR/Cas9 system provides state-of-the-art genome editing capabilities. However, several facets of this system are still under investigation for further characterization and optimization. One in particular is the choice of guide RNA that directs Cas9 to target DNA: for someone who wants to target the protein-coding region of a gene, hundreds of guides satisfy the constraints of the CRISPR/Cas9 Protospacer Adjacent Motif sequence, but only some of these guides efficiently target DNA to generate gene knockouts. One could laboriously and systematically test all possible guides for all possible genes and thereby derive a dictionary of efficient guides, but such a process would be costly, time-consuming, and ultimately infeasible in practice. Instead, one can (1) enumerate all possible guides over each of some smaller set of genes and test them experimentally by measuring the knockout capabilities of each guide, (2) thereby assemble a training data set from which predictive machine learning models can “learn” which guides tend to perform well and which do not, and (3) use the learned model to predict guide efficiency for genes not in the training data set. In particular, by deriving a large set of candidate predictive features consisting of both guide and gene characteristics, one can identify those characteristics that describe guide-gene pairs in an abstract manner. This enables generalization beyond the specific guides and genes tested, in particular to genes that we have never attempted to knock out and for which we therefore have no experimental evidence. Based on such a set of experiments, we present a state-of-the-art predictive approach to modeling which RNA guides will effectively perform a gene knockout by way of the CRISPR/Cas9 system.
We demonstrate which features are critical for prediction (e.g., nucleotide identity), which are helpful (e.g., thermodynamics), and which are redundant (e.g., microhomology). We then combine these insights into useful features with an exploration of different model classes, settling on the one that performs best (gradient-boosted regression trees). Finally, we elucidate which measures should be used for evaluating these models in such a context.
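
As a rough illustration of the “nucleotide identity” features mentioned above, the sketch below one-hot encodes each position of a 30mer. It is only an assumption about how such features might be built, not the published Azimuth feature set; the function name one_hot_30mer and the example sequence are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def one_hot_30mer(seq):
    """Return a flat list of 30 * 4 position-specific 0/1 indicators.

    Hypothetical helper for illustration; the real model uses a richer
    feature set than this.
    """
    seq = seq.upper()
    if len(seq) != 30:
        raise ValueError("expected a 30mer, got length %d" % len(seq))
    features = []
    for base in seq:
        if base not in NUCLEOTIDES:
            raise ValueError("unexpected nucleotide: %r" % base)
        # One indicator per possible nucleotide at this position.
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


# Made-up sequence of the right length (not a real guide).
example_30mer = "ACGT" * 7 + "GG"
example_features = one_hot_30mer(example_30mer)
```

Each position contributes exactly one non-zero entry, so a model trained on these vectors can learn position-specific nucleotide preferences.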

An introductory video on what CRISPR is and why it is important can be found here.


Please cite this paper if using our predictive model:

  • John G. Doench*, Nicolo Fusi*, Meagan Sullender*, Mudra Hegde*, Emma W.
    Vaimberg*, Katherine F. Donovan, Ian Smith, Zuzana Tothova, Craig Wilen, Robert
    Orchard, Herbert W. Virgin, Jennifer Listgarten*, David E. Root. Optimized sgRNA design to maximize activity and minimize off-target effects for genetic screens with CRISPR-Cas9.
    Nature Biotechnology Jan 2016, doi:10.1038/nbt.3437. (*equal contributions, corresponding author)

A few more supplementary details on some of the modelling can also be found in:

Associated Data

  • The combined FC and RES data, along with the predictions from our final, published model can be found here.

How to use our software

Use our cloud-based server (Azure ML) to predict the guide effectiveness for any sgRNA guide(s) of choice, and any target gene. Access the server in one of several ways:

From within Excel

Download this example Excel template from your browser and fill in your own guides; our prediction service will automatically populate the guide effectiveness column. Email to get the free API key that enables the template.

From a web page

Available from this server at the Broad Institute of MIT and Harvard.

From your programming language of choice (R/Python/Java/C, etc.)

Instructions here. Please email to get the free API key that enables this.

Source code and stand-alone Python scripts

A Python implementation of our model (and other competing methods) is available from GitHub. At a later date, Jupyter notebooks describing its usage will also be available.

Version history

  1. The Azure-based web service has been running version 1.0 (6f8728b) since inception. It will be changed to version 2.0 shortly, but this has not yet happened.
  2. On 11/1/2016, we released version 2.0 (73522ac). In this version, we fixed the random state passed to scikit-learn's GradientBoostingRegressor. In previous versions, stochastic differences could appear because this state was not set. We also discovered that sklearn 0.16 yielded a different model than 0.17, which could have caused differences for some users. Finally, we made our unit test much larger (~1,000 guides instead of 3) and fixed the continuous integration, whose failure had led to some of these issues not being tracked properly.
  3. On 10/28/2016, we found a bug which affects users who downloaded code between 07/15/2016 and 10/28/2016 and who did not use to retrain the model pickle files. This bug was first fixed on 10/28/2016 (d4dbca8).
  4. On 01/27/2016, we found a tiny bug and fixed it (980bf92). None of the Spearman correlations reported in the paper changed after the bug fix.
  5. On 12/18/2015, we released code corresponding to the Nature Biotech 2016 paper (6f8728b), version 1.0.
  6. To see which version has run or is running on the Broad web server, please go here.
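
The effect of fixing the random state (item 2 above) can be illustrated with scikit-learn's GradientBoostingRegressor. The sketch below is a minimal example on synthetic data, not the Azimuth training code; the data, the hyperparameters, and the helper fit_and_predict are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data (illustrative only, not guide data).
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)


def fit_and_predict(seed):
    """Fit a small model with an explicit random_state and predict on X."""
    model = GradientBoostingRegressor(
        n_estimators=50,
        subsample=0.8,      # row subsampling makes training stochastic
        random_state=seed,  # fixing the seed makes the fit repeatable
    )
    model.fit(X, y)
    return model.predict(X)


first = fit_and_predict(seed=1)
second = fit_and_predict(seed=1)

# With the same seed, data, and library version, the two fits agree exactly.
same_seed_identical = bool(np.array_equal(first, second))
```

A fixed seed only guarantees repeatability within one library version; as noted above, different sklearn releases can still produce different models.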


User’s guide

  • The model takes as features (i) the 30mer guide sequence as specified in our paper and (ii) the position of the guide in the gene in terms of both the “cut site” and the “percent peptide”. While the latter features (cut site and percent peptide) are helpful for prediction, if they are not convenient for you to obtain, you may simply use “-1” for each of them, which causes a separate model to be used in which these features are turned off.
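
One way to prepare inputs that follow this convention is sketched below. The record layout, the field names, and the helpers make_guide_record and uses_position_features are hypothetical and are not part of the Azimuth API; only the “-1” convention and the 30mer length come from the description above.

```python
MISSING = -1  # sentinel described above for unavailable position features


def make_guide_record(seq_30mer, cut_site=None, percent_peptide=None):
    """Build one input record; both position features fall back to -1 together."""
    if len(seq_30mer) != 30:
        raise ValueError("expected a 30mer, got length %d" % len(seq_30mer))
    have_position = cut_site is not None and percent_peptide is not None
    return {
        "sequence": seq_30mer.upper(),
        "cut_site": cut_site if have_position else MISSING,
        "percent_peptide": percent_peptide if have_position else MISSING,
    }


def uses_position_features(record):
    """True when the record carries real position features, not the sentinel."""
    return record["cut_site"] != MISSING and record["percent_peptide"] != MISSING


# Made-up 30mer of the right length (not a real guide).
seq = "ACGT" * 7 + "GG"
with_position = make_guide_record(seq, cut_site=150, percent_peptide=42.5)
without_position = make_guide_record(seq)
```

Setting both fields to the sentinel at once avoids a half-specified record, in which it would be unclear whether the position-aware model or the sequence-only model should apply.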

Associated people

For questions please contact