An ‘oracle’ for predicting the evolution of gene regulation
Using a new mathematical framework, researchers examine signatures of natural selection in the genome, deciphering the evolutionary past and future of noncoding DNA
Despite the sheer number of genes that each human cell contains, these so-called "coding" DNA sequences comprise just 1 percent of our entire genome. The remaining 99 percent is made up of "noncoding" DNA — which, unlike coding DNA, does not carry the instructions to build proteins.
One vital function of this noncoding DNA, also called "regulatory" DNA, is to help turn genes on and off, controlling how much (if any) of a protein is made. Over time, as cells replicate their DNA to grow and divide, mutations often crop up in these noncoding regions — sometimes tweaking their function and changing the way they control gene expression. Many of these mutations are trivial, and some are even beneficial. Occasionally, though, they can be associated with increased risk of common diseases, such as type 2 diabetes, or more life-threatening ones, including cancer.
To better understand the repercussions of such mutations, researchers have been hard at work on mathematical maps that allow them to look at an organism’s genome, predict which genes will be expressed, and determine how that expression will affect the organism’s observable traits. These maps, called fitness landscapes, were conceptualized roughly a century ago to understand how genetic makeup influences one common measure of organismal fitness in particular: reproductive success. Early fitness landscapes were very simple, often focusing on a limited number of mutations. Much richer data sets are now available, but researchers still require additional tools to characterize and visualize such complex data. This ability would not only facilitate a better understanding of how individual genes have evolved over time, but would also help to predict what sequence and expression changes might occur in the future.
In a study published in Nature, a team of scientists has developed a framework for studying the fitness landscapes of regulatory DNA. They created a neural network model that, when trained on hundreds of millions of experimental measurements, was capable of predicting how changes to these noncoding sequences in yeast affected gene expression. They also devised a unique way of representing the landscapes in two dimensions, making it easy to understand the past and forecast the future evolution of noncoding sequences in organisms beyond yeast — and even design custom gene expression patterns for gene therapies and industrial applications.
"We now have an 'oracle' that can be queried to ask: What if we tried all possible mutations of this sequence? Or, what new sequence should we design to give us a desired expression?" said Aviv Regev, a core institute member of the Broad Institute of Harvard and MIT (on leave), a professor of biology at MIT (on leave), head of Genentech Research and Early Development, and the study’s senior author. "Scientists can now use the model for their own evolutionary question or scenario, and for other problems like making sequences that control gene expression in desired ways. I am also excited about the possibilities for machine learning researchers interested in interpretability; they can ask their questions in reverse, to better understand the underlying biology."
Prior to this study, many researchers had simply trained their models on known mutations (or slight variations thereof) that exist in nature. However, Regev’s team wanted to go a step further by creating their own unbiased models capable of predicting an organism’s fitness and gene expression based on any possible DNA sequence — even sequences they’d never seen before. This would also enable researchers to use such models to engineer cells for pharmaceutical purposes, including new treatments for cancer and autoimmune disorders.
To accomplish this goal, co-first authors Eeshit Dhaval Vaishnav, a graduate student at MIT and Broad's Program in Medical and Population Genetics, and Carl de Boer, now an assistant professor at the University of British Columbia, and their colleagues created a neural network model to predict gene expression. They trained it on a dataset generated by inserting millions of totally random noncoding DNA sequences into yeast, and observing how each random sequence affected gene expression. They focused on a particular subset of noncoding DNA sequences called promoters, which serve as binding sites for proteins that can switch nearby genes on or off.
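To make the idea of querying such a model concrete, here is a minimal Python sketch of asking "what if we tried all possible single mutations of this sequence?" It is an illustration under stated assumptions, not the study's method: `toy_oracle` is a hypothetical stand-in that scores a sequence by its G/C fraction, where the real work uses a neural network trained on measured expression, and the function names are invented for this example.

```python
from typing import Callable, Dict, List

BASES = "ACGT"


def single_mutants(seq: str) -> List[str]:
    """Return every sequence that differs from `seq` at exactly one position."""
    mutants = []
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                mutants.append(seq[:i] + alt + seq[i + 1:])
    return mutants


def toy_oracle(seq: str) -> float:
    """Hypothetical stand-in for a trained expression model.

    Assumption for illustration only: 'expression' is the G/C fraction.
    """
    return (seq.count("G") + seq.count("C")) / len(seq)


def mutation_effects(seq: str, oracle: Callable[[str], float]) -> Dict[str, float]:
    """Map each single mutant to its predicted change relative to `seq`."""
    baseline = oracle(seq)
    return {m: oracle(m) - baseline for m in single_mutants(seq)}


if __name__ == "__main__":
    effects = mutation_effects("AAAA", toy_oracle)
    # Show the five mutants with the largest predicted increase.
    for mutant, delta in sorted(effects.items(), key=lambda kv: -kv[1])[:5]:
        print(mutant, round(delta, 3))
```

A sequence of length n has 3n single mutants, so scanning them all is cheap once a predictive model exists; measuring the same effects at the bench would require one experiment per mutant.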
"This work highlights what possibilities open up when we design new kinds of experiments to generate the right data to train models," Regev said. "In the broader sense, I believe these kinds of approaches will be important for many problems — like understanding genetic variants in regulatory regions that confer disease risk in the human genome, but also for predicting the impact of combinations of mutations, or designing new molecules."
Regev, Vaishnav, de Boer, and their coauthors (including Moran Yassour [now at Hebrew University], Xian Adiconis, Joshua Levin, Dawn Thompson [now at LifeMine Therapeutics], and Lin Fan of Broad; and Jennifer Molinet and Francisco Cubillos of the University of Santiago) went on to test their model’s predictive abilities in a variety of ways, in order to show how it could help demystify the evolutionary past — and possible future — of certain promoters. "Creating an accurate model was certainly an accomplishment, but, to me, it was really just a starting point," Vaishnav explained.
First, to determine whether their model could help with synthetic biology applications like producing antibiotics, enzymes, and food, the researchers practiced using it to design promoters that could generate desired expression levels for any gene of interest. They then scoured other scientific papers to identify fundamental evolutionary questions, in order to see if their model could help answer them. The team even went so far as to feed their model a real-world population data set from one existing study, which contained genetic information from yeast strains around the world. In doing so, they were able to delineate thousands of years of past selection pressures that sculpted the genomes of today’s yeast.
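The promoter design task described above can be pictured as a search over sequences guided by a model's predictions. The sketch below is a simple greedy search written for illustration; it is not claimed to be the procedure the researchers used. `toy_oracle` (G/C fraction) is a hypothetical placeholder for a trained expression model, and `design_sequence` is a name invented here.

```python
from typing import Callable, List

BASES = "ACGT"


def single_mutants(seq: str) -> List[str]:
    """Return every sequence that differs from `seq` at exactly one position."""
    return [
        seq[:i] + alt + seq[i + 1:]
        for i, ref in enumerate(seq)
        for alt in BASES
        if alt != ref
    ]


def toy_oracle(seq: str) -> float:
    """Hypothetical stand-in for a trained model: the G/C fraction of `seq`."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def design_sequence(
    start: str,
    target: float,
    oracle: Callable[[str], float],
    max_steps: int = 100,
) -> str:
    """Greedily mutate `start` until its predicted value stops nearing `target`."""
    current = start
    for _ in range(max_steps):
        # Pick the single mutant whose prediction is closest to the target.
        best = min(single_mutants(current), key=lambda s: abs(oracle(s) - target))
        if abs(oracle(best) - target) >= abs(oracle(current) - target):
            break  # No single mutation improves on the current sequence.
        current = best
    return current


if __name__ == "__main__":
    designed = design_sequence("AAAAAAAA", target=0.5, oracle=toy_oracle)
    print(designed, toy_oracle(designed))
```

Greedy search can stall at a local optimum on a rugged landscape, which is one reason a real design effort would likely need a more thorough search strategy than this sketch.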
But, in order to create a powerful tool that could probe any genome, the researchers knew they’d need to find a way to forecast the evolution of noncoding sequences even without such a comprehensive population data set. To address this goal, Vaishnav and his colleagues devised a computational technique that allowed them to plot the predictions from their framework onto a two-dimensional graph. This helped them show, in a remarkably simple manner, how any noncoding DNA sequence would affect gene expression and fitness, without needing to conduct any time-consuming experiments at the lab bench.
"One of the unsolved problems in fitness landscapes was that we didn’t have an approach for visualizing them in a way that meaningfully captured the evolutionary properties of sequences," Vaishnav explained. "I really wanted to find a way to fill that gap, and contribute to the longstanding vision of creating a complete fitness landscape."
Even before the study was formally published, Vaishnav began receiving queries from other researchers hoping to use the model to devise noncoding DNA sequences for use in gene therapies.
"People have been studying regulatory evolution and fitness landscapes for decades now," Vaishnav said. "I think our framework will go a long way in answering fundamental, open questions about the evolution and evolvability of gene regulatory DNA — and even help us design biological sequences for exciting new applications."
Support for this study was provided by the Klarman Cell Observatory at Broad and the Howard Hughes Medical Institute.
Adapted from a press release issued by MIT Biology.