Machine learning approach improves CRISPR-Cas9 guide pairing

Some will say that finding just the right wine to pair with a meal can improve even the finest cuisine, transforming a pleasant gustatory experience into something approaching perfection. But with potentially hundreds of wines to choose from, picking the “right” one can be a chore for the casual wine-lover. That’s where the sommelier comes in, applying expertise to curate a list of only the best pairings to suit one’s needs.

A similar process is playing out in the very different world of gene editing, where researchers from the Broad Institute’s Genetic Perturbation Platform (GPP) and Microsoft Research have played the role of “sommelier” for researchers seeking to refine their CRISPR toolkit. They’ve developed a predictive model that reveals which single-guide RNA (sgRNA) sequences are best paired with the genome-engineering tool CRISPR-Cas9 to successfully interrogate the genome. The team reports on its method this week in Nature Biotechnology and will make its curated lists of preferred sgRNAs (the wine- and cheese-themed libraries "Brunello" and "Brie," for human and mouse genomes, respectively) available to the scientific community via Addgene.

Within the past few years, the bacterial immune system CRISPR-Cas9 has become the most effective means to perturb genes in the lab: to tweak or “knock out” gene function systematically across the genome to determine what genes of interest do within a biological system. For the CRISPR system to knock out genes in such screens, one of its natural enzymes (in this case Cas9, a cleaving enzyme that has been likened to “molecular scissors” that cut DNA) must be paired with the aforementioned sgRNA. An sgRNA is an RNA sequence roughly 20 nucleotides long that homes in on a matching sequence in the genetic code of the targeted DNA. When the guide finds its target, Cas9 cuts the DNA at the target site, knocking out the gene. The biological repercussions of turning off the gene reveal important clues about the gene's role within the organism.
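To make the idea of a roughly 20-nucleotide guide "homing in" on a matching sequence concrete, here is a minimal Python sketch that lists candidate target sites in a stretch of DNA. It relies on one fact the article does not mention: S. pyogenes Cas9 only cuts where the target is immediately followed by an "NGG" motif (the PAM). The function name and the example sequence are made up for illustration, and the sketch scans the forward strand only.

```python
import re

def find_candidate_guides(dna, guide_len=20):
    """Return (position, target) pairs for forward-strand sites followed by an NGG PAM.

    Assumption: the NGG requirement is specific to S. pyogenes Cas9 and is not
    described in the article itself.
    """
    dna = dna.upper()
    # The lookahead lets overlapping candidate sites be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})[ACGT]GG)" % guide_len)
    return [(match.start(), match.group(1)) for match in pattern.finditer(dna)]

# Made-up sequence, used only to exercise the function.
example = "TTGACCTGAAGCTAGCTAGGCTAACGTTAGCAGGATCCA"
for position, target in find_candidate_guides(example):
    print(position, target)
```

Even a short sequence can yield several overlapping candidates, which is why a single gene can have hundreds of possible guides to choose among.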

While the CRISPR-Cas9 system has enabled these types of screens with unprecedented ease, it isn’t perfect. There are hundreds of potential sgRNAs that can be used to target a given gene, and not all of them act on their target sites equally well. There is also a risk of so-called “off-target” activity, in which the Cas9/sgRNA complex binds to an unintended target, causing unwanted disruptions in gene activity that confound screen results. Such off-target effects are also a concern for those considering testing CRISPR-Cas9 in live cells.

“Obviously, people worry about these off-target effects. Whether you’re using CRISPR in a lab or whether you’re thinking of using it therapeutically, the prospect of cutting somewhere in the genome where you don’t intend to is a problem,” explains John Doench, associate director of GPP and the paper’s co-first author.

The researchers recognized that, to optimize the system’s on-target activity and minimize off-target effects, it would be useful to identify which of the many available sgRNAs most consistently find their target. Since it wouldn’t be feasible to test all of the hundreds of sgRNAs against the 20,000 genes in the human genome, they instead tested the sgRNAs against a small subset of genes (about 20) that had been well characterized in terms of their function and their performance in CRISPR screens. The idea was to look for features in the RNA guide sequences that predicted greater success in on-target performance.

The GPP team had looked at this before. For a paper published in 2014, they had used the approach to empirically define a list of rules that tended to yield more on-target activity in sgRNA (for instance, having a “G” nucleotide in a particular spot in the target sequence). The exercise resulted in the Avana and Asiago libraries, Brunello and Brie's predecessors.

While the guides in these libraries have been shown to outperform those from older libraries that selected sgRNAs by other methods, the researchers thought they could do better; by their own admission, the mathematical models they used to establish their rules were “pretty basic.” Then, a chance meeting with Jennifer Listgarten and Nicolo Fusi—researchers at Microsoft Research—brought a new approach to bear on the problem: machine learning.

“Machine learning has traditionally focused on prediction: how to predict what ads you are most likely to click on; which products or movies to recommend based on previous preferences. While talking with John, it became immediately clear that part of the project he was working on was essentially a prediction problem: how to predict which part of a gene to target in order to knock it down,” says Fusi, who is also one of the paper’s co-first authors.

The Microsoft researchers offered to lend their machine-learning expertise to the effort. They used a “supervised machine learning” approach that relied on expert input from the GPP team. That input took the form of a “training data set”—genetic perturbation screen results for a small subset of genes for which the on-target effectiveness rates were inferred by the GPP team.

“Using this ‘training data,’ the machine learning algorithms were able to infer general patterns in the data and encode them so as to generalize CRISPR effectiveness beyond just those genes used in the training data,” explains Listgarten, who was a co-senior author of the paper. The rules identified by the final, “trained” machine learning model, she notes, can be generalized to any of the 20,000 genes in the human genome, to deduce how best to knock any one of them down, even if the gene has never before been evaluated in a CRISPR screen.
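The supervised-learning idea described here (learn from labeled examples, then predict for sequences never seen in training) can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the article does not say which model the team used, so a plain logistic regression stands in; the sequences are 4 letters long rather than 20; and the labeling rule ("active if the third base is G") is invented, not one of the team's findings.

```python
import itertools
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a sequence as a flat 0/1 vector with one slot per (position, base)."""
    return [1.0 if seq[i] == base else 0.0
            for i in range(len(seq)) for base in BASES]

def predict_proba(weights, bias, features):
    """Logistic model: probability that a guide is 'active'."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=300, lr=0.5):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for features, label in zip(X, y):
            err = predict_proba(weights, bias, features) - label
            grad_b += err
            for j, x in enumerate(features):
                grad_w[j] += err * x
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def toy_label(seq):
    """Invented ground truth: a guide is 'active' if its third base is G."""
    return 1.0 if seq[2] == "G" else 0.0

# Train on sequences starting with A or C; hold out those starting with G or T.
all_seqs = ["".join(p) for p in itertools.product(BASES, repeat=4)]
train = [s for s in all_seqs if s[0] in "AC"]
held_out = [s for s in all_seqs if s[0] in "GT"]

weights, bias = train_logistic([one_hot(s) for s in train],
                               [toy_label(s) for s in train])

correct = sum((predict_proba(weights, bias, one_hot(s)) > 0.5) == (toy_label(s) == 1.0)
              for s in held_out)
held_out_accuracy = correct / len(held_out)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

The held-out sequences play the role of genes that were never screened: the model has not seen them, yet the pattern it learned from the training set still applies.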

The GPP team then performed a new set of experiments aimed at identifying which sgRNAs tended to cause off-target activity—something they hadn’t tested for the 2014 paper. Once they’d generated the new data, they looked for sgRNA features that predicted off-target activity. Their findings were then combined with those identified by the on-target machine learning model to further refine the list of desirable sgRNAs. The choicest sgRNAs—those that found their target most often while having the fewest off-target effects—were reserved for the refined Brunello and Brie libraries.
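One simple way to picture this final selection step is as "filter, then rank": discard guides whose predicted off-target risk is too high, then keep the guides with the best predicted on-target activity for each gene. The Python sketch below is only an illustration of that idea. The article does not describe how the two sets of findings were actually combined, and the field names, score scales, threshold, guides-per-gene count, and example values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Guide:
    gene: str
    sequence: str
    on_target: float   # predicted on-target activity (hypothetical 0..1 scale)
    off_target: float  # predicted off-target risk (hypothetical 0..1 scale)

def pick_guides(guides, per_gene=2, max_off_target=0.2):
    """Drop risky guides, then keep the most active ones for each gene."""
    by_gene = defaultdict(list)
    for guide in guides:
        if guide.off_target <= max_off_target:
            by_gene[guide.gene].append(guide)
    return {gene: sorted(kept, key=lambda g: g.on_target, reverse=True)[:per_gene]
            for gene, kept in by_gene.items()}

# Made-up candidates; names and scores are placeholders, not real data.
candidates = [
    Guide("GENE_A", "guide_1", 0.91, 0.05),
    Guide("GENE_A", "guide_2", 0.88, 0.40),  # active but risky, so filtered out
    Guide("GENE_A", "guide_3", 0.62, 0.10),
    Guide("GENE_A", "guide_4", 0.55, 0.02),
    Guide("GENE_B", "guide_5", 0.70, 0.15),
]
library = pick_guides(candidates)
for gene, chosen in library.items():
    print(gene, [g.sequence for g in chosen])
```

A hard threshold followed by ranking is the easiest scheme to read; a weighted combination of the two scores would be an equally plausible design.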

The benefits of using these curated libraries are clear: screens simply work better if a higher percentage of the reagents used are effective, and larger projects can be done at scale at a lower cost using fewer resources. For expansive projects like the Broad’s Project Achilles, which perturbs genes across hundreds of cancer cell lines to learn what kills cancer cells, those efficiencies can add up.

“Having these libraries is also a huge advantage if we’re doing CRISPR screens in more challenging model systems for which the quantities of cells available are limited—for instance, primary cells derived from patients or in vivo mouse models,” Doench adds.

David Root, senior director of GPP and co-senior author of the paper, summarizes the value these libraries bring to scientific research: “The big advantage is that researchers will have a much better batting average when doing CRISPR-Cas9 experiments,” he says. “They’ll effectively be screening a lot more of the genome because they’ll have more reliable, on-target data; they’ll have fewer false or misleading discoveries from off-target effects; and they’ll learn more biology as a result.”

The libraries currently house sgRNA recommendations for pairing with S. pyogenes Cas9, the CRISPR enzyme most commonly used in today’s CRISPR screens, but the team expects this machine learning model can help create libraries of optimized guides for other CRISPR genome engineering systems, including the Cas9 alternatives found over the past year.

Other researchers who worked on the project include Meagan Sullender, Mudra Hegde, Emma Vaimberg, Katherine Donovan, Ian Smith, and Zuzana Tothova of Broad, and Craig Wilen, Robert Orchard, and Herbert Virgin of Washington University School of Medicine.

For more on how machine learning was applied to the CRISPR optimization effort, read the related news story on the Microsoft Research website.

Paper cited:
Doench, J et al. Optimized sgRNA design to maximize activity and minimize off-target effects for genetic screens with CRISPR-Cas9. Nature Biotechnology. Online January 18, 2016. DOI:10.1038/nbt.3437