Machine learning guides researchers to new synthetic genetic switches

A new method allows precise activation or repression of genes in specific cells and tissues.

An illustration of four circular diagrams representing plasmids, each with colored segments indicating different pieces of regulatory DNA.
Credit: Natalie Velez, Broad Communications, from Gosai and Castro et al. 2024.

Researchers at The Jackson Laboratory (JAX), the Broad Institute of MIT and Harvard, and Yale University have used artificial intelligence to design thousands of new DNA switches that can precisely control the expression of a gene in different cell types. Their new approach opens the possibility of controlling when and where genes are expressed in the body, for the benefit of human health and medical research, in ways never before possible.

"What is special about these synthetically designed elements is that they show remarkable specificity to the target cell type they were designed for," said Ryan Tewhey, an associate professor at The Jackson Laboratory and co-senior author of the work with Steven Reilly of Yale, and Pardis Sabeti of the Broad. "This creates the opportunity for us to turn the expression of a gene up or down in just one tissue without affecting the rest of the body."

In recent years, genetic editing technologies and other gene therapy approaches have given scientists the ability to alter the genes inside living cells. However, affecting genes only in selected cell types or tissues, rather than across an entire organism, has been difficult. That is in part because of the ongoing challenge of understanding the DNA switches, called cis-regulatory elements (CREs), that control the expression and repression of genes. 

In a paper published in Nature, Tewhey, Reilly, Sabeti, and their collaborators not only designed new, never-before-seen synthetic CREs, but used the CREs to successfully activate genes in brain, liver or blood cells without turning on those genes in other cell types. 

"The more we learn about the genome, the more we see evidence of the deep influence elements like CREs have on biological function," said Sabeti, who is a core institute member at Broad and a professor at Harvard University and the Harvard T. H. Chan School of Public Health. "By applying machine learning and molecular biology to the logic of when and where CREs work, we can leverage that knowledge using generative AI to build tools for modulating gene expression in new ways experimentally and, perhaps one day, therapeutically."

Tissue- and time-specific instructions

Although every cell in an organism contains the same genes, not all the genes are needed in every cell, or at all times. CREs help ensure that genes needed in the brain are not used by skin cells, for instance, or that genes required during early development are not activated in adults. CREs themselves are not part of genes, but are separate, regulatory DNA sequences – often located near the genes they control. 

Scientists know that there are thousands of different CREs in the human genome, each with slightly different roles. But the grammar of CREs has been hard to figure out, "with no straightforward rules that control what each CRE does," explained Rodrigo Castro, a computational scientist in the Tewhey lab at JAX and co-first author of the new paper. "This limits our ability to design gene therapies that only affect certain cell types in the human body."

"This project essentially asks the question: 'Can we learn to read and write the code of these regulatory elements?'" said Reilly, who is an assistant professor of genetics at Yale and one of the senior authors of the study. "If we think about it in terms of language, the grammar and syntax of these elements is poorly understood. And so, we tried to build machine learning methods that could learn a more complex code than we could do on our own."

Credit: Natalie Velez, Broad Communications; and Sager Gosai

Using a form of artificial intelligence (AI) called deep learning, the group trained a model on hundreds of thousands of DNA sequences from the human genome, each measured in the laboratory for CRE activity in three types of cells: blood, liver and brain. The AI model allowed the researchers to predict the activity of any sequence from the almost infinite number of possible combinations. By analyzing these predictions, the researchers discovered new patterns in the DNA, learning how the grammar of CRE sequences affects how much RNA is made, a proxy for how much a gene is activated.
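To make the "almost infinite number of possible combinations" concrete, here is a minimal sketch of how a DNA sequence might be turned into numbers for a model, and how quickly the space of possible sequences grows. One-hot encoding is a common input format for sequence models in general; that it matches the study's preprocessing is an assumption, and the function names `one_hot` and `sequence_space_size` are hypothetical.

```python
# Illustrative sketch: one-hot encoding is a common input format for
# sequence models; the study's exact preprocessing is not described here.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as one row per base, with columns ordered A, C, G, T."""
    rows = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


def sequence_space_size(length):
    """Number of distinct DNA sequences of a given length (four choices per position)."""
    return len(BASES) ** length


print(one_hot("ACGT"))
print(sequence_space_size(10))  # 1048576 possible sequences for just 10 bases
```

Because each extra base multiplies the count by four, even short regulatory sequences have far more possible variants than could ever be measured in the laboratory, which is why a predictive model is useful.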

The team then developed a platform called CODA (Computational Optimization of DNA Activity), which used their AI model to efficiently design thousands of completely new CREs with requested characteristics, like activating a particular gene in human liver cells but not activating the same gene in human blood or brain cells. Through an iterative combination of ‘wet’ and ‘dry’ investigation, using experimental data to first build and then validate computational models, the researchers refined and improved the program’s ability to predict the biological impact of each CRE and enabled the design of specific CREs never before seen in nature. 
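The general idea of model-guided design can be sketched in a few lines of Python. This is a toy under stated assumptions, not CODA itself: the article does not describe CODA's algorithms, so the sketch swaps in a simple greedy search, and the stand-in "model" is a table of invented two-base features with made-up weights. The names `predict_activity`, `specificity`, and `design` are hypothetical.

```python
# Toy sketch only: features and weights are invented for illustration,
# not taken from the study or from the CODA platform.
CELL_TYPES = ("blood", "liver", "brain")
BASES = "ACGT"

# Hypothetical 2-base features: positive weights activate, negative weights repress.
TOY_WEIGHTS = {
    "GA": {"blood": 1.0, "liver": -0.5, "brain": -0.5},
    "AC": {"blood": -0.5, "liver": 1.0, "brain": -0.5},
    "TA": {"blood": -0.5, "liver": -0.5, "brain": 1.0},
}


def predict_activity(seq):
    """Stand-in for a trained model: sequence in, one score per cell type out."""
    scores = {cell: 0.0 for cell in CELL_TYPES}
    for i in range(len(seq) - 1):
        weights = TOY_WEIGHTS.get(seq[i:i + 2])
        if weights:
            for cell in CELL_TYPES:
                scores[cell] += weights[cell]
    return scores


def specificity(seq, target):
    """Predicted activity in the target minus the highest off-target activity."""
    scores = predict_activity(seq)
    off_target = max(v for cell, v in scores.items() if cell != target)
    return scores[target] - off_target


def design(seed, target, max_rounds=100):
    """Greedy search: keep the best single-base change until nothing improves."""
    best, best_score = seed, specificity(seed, target)
    for _ in range(max_rounds):
        candidates = [
            best[:i] + base + best[i + 1:]
            for i in range(len(best))
            for base in BASES
            if base != best[i]
        ]
        top = max(candidates, key=lambda s: specificity(s, target))
        top_score = specificity(top, target)
        if top_score <= best_score:
            break  # no single-base change improves the objective
        best, best_score = top, top_score
    return best, best_score


seed = "A" * 12
designed, score = design(seed, "liver")
print(designed, score, predict_activity(designed))
```

The objective rewards high predicted activity in the target cell type and penalizes activity anywhere else, so the search is pushed toward sequences that are specific rather than merely strong. In the real workflow, designed sequences would then be measured in the laboratory and the results fed back to refine the model.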

"Natural CREs, while plentiful, represent a tiny fraction of possible genetic elements and are constrained in their function by natural selection," said study co-first author Sager Gosai, a postdoctoral fellow in Sabeti's lab. "These AI tools have immense potential for designing genetic switches that precisely tune gene expression for novel applications, such as biomanufacturing and therapeutics, that lie outside the scope of evolutionary pressures."

Pick and choose your organ

Castro, Gosai, Reilly, Sabeti, Tewhey, and their team tested the new, AI-designed synthetic CREs by adding them into cells and measuring how well they activated genes in the desired cell type, as well as how well they avoided gene expression in other cells. The new CREs, they discovered, were even more cell-type-specific than naturally occurring CREs known to be associated with those cell types.

"The synthetic CREs semantically diverged so far from natural elements that predictions for their effectiveness seemed implausible," said Gosai. "We initially expected many of the sequences would misbehave inside living cells."

"It was a thrilling surprise to us just how good CODA was at designing these elements," said Castro. 

Tewhey and his collaborators studied why the synthetic CREs were able to outperform naturally occurring CREs and discovered that the cell-specific synthetic CREs contained combinations of sequences responsible for expressing genes in the target cell types, as well as sequences that repressed or turned off the gene in the other cell types.

Finally, the group tested several of the synthetic CRE sequences in zebrafish and mice, with good results. One CRE, for instance, was able to activate a fluorescent protein in developing zebrafish livers but not in any other areas of the fish. 

"This technology paves the way toward the writing of new regulatory elements with pre-defined functions," said Tewhey. "Such tools will be valuable for basic research but also could have significant biomedical implications where you could use these elements to control gene expression in very specific cell types for therapeutic purposes."

Adapted from a press release issued jointly with The Jackson Laboratory.

Funding

Support for this study came from the National Human Genome Research Institute. Pardis Sabeti is an Investigator with the Howard Hughes Medical Institute.

Paper cited

Gosai SJ, Castro RI, et al. Machine-guided design of cell type-specific cis-regulatory elements. Nature. Online October 23, 2024. DOI: 10.1038/s41586-024-08070-z.