Using 76,000 genomes, researchers build new map of regions of the human genome under natural selection

Insights will help scientists study coding and non-coding regions that, when mutated, may cause disease


Every human’s genome has millions of genetic variants, but most have little to no effect, making it difficult for clinicians to make medical diagnoses based on genetic differences.

Using patterns of variation from tens of thousands of individuals with whole-genome sequence data, a team led by investigators at Massachusetts General Hospital (MGH) and the Broad Institute of MIT and Harvard recently identified regions of the genome that lack typical variation, indicating that they are important sequences conserved during evolution and natural selection.

The authors of the study, which is published in Nature, note that when a variant arises in one of these regions, it’s more likely to have an effect on an individual’s health.

“We sought to examine how natural selection shapes patterns of human genetic variation across the whole genome, especially in the non-coding genome, which has been much less characterized than protein-coding regions,” said senior author Konrad Karczewski, an assistant professor in the Analytic and Translational Genetics Unit in the Department of Medicine at MGH and an associate member of the Program in Medical and Population Genetics at Broad. “While our previous work evaluated the 2 percent of the genome that encodes genes, our new metrics extend to the entire genome, greatly expanding our knowledge about which functional genomic elements likely harbor variation with potential clinical significance.”

Karczewski and his colleagues (including co-first authors Siwei Chen and Laurent Francioli, and co-senior authors Benjamin Neale of Broad and MGH and Daniel MacArthur of the Garvan Institute of Medical Research in Australia) aggregated and processed information from 76,156 human genomes into the Genome Aggregation Database (gnomAD), a large international human genome reference resource that they have been expanding and releasing to the public continuously.

The variants in this database have been helping clinical labs worldwide diagnose rare diseases, and this release greatly expands their ability to do so in non-coding regions.

The team used the results to build a “genomic constraint map” for the whole genome (called Gnocchi, for Genomic NOn-Coding Constraint of HaploInsufficient variation). The map indicates which regions of the genome are “constrained,” meaning that variants arising in the region are often damaging enough that natural selection removes them from the population.
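
To make the idea of constraint concrete, here is a minimal sketch in Python that compares observed and expected variant counts per region. It is an illustration only: the window names, the counts, and the simple z-score formula are assumptions chosen for this example, not figures or methods taken from the study.

```python
import math


def depletion_z(observed: int, expected: float) -> float:
    """Return a z-like score that is positive when a region carries
    fewer variants than expected (an assumed, simplified formula)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return (expected - observed) / math.sqrt(expected)


# Hypothetical windows: (label, observed variants, expected variants).
windows = [
    ("window_a", 60, 100.0),   # far fewer variants than expected
    ("window_b", 98, 100.0),   # about as many as expected
    ("window_c", 130, 100.0),  # more than expected
]

scores = {label: depletion_z(obs, exp) for label, obs, exp in windows}

# Rank windows from most to least depleted of variation.
ranked = sorted(scores, key=scores.get, reverse=True)

for label in ranked:
    print(f"{label}: {scores[label]:+.2f}")
```

In this toy example, the window with the fewest variants relative to expectation ranks first, which is the sense in which a region would be called constrained.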

The team found that constrained regions are enriched for regulatory elements (which control gene expression) and variants implicated in complex human diseases and traits.

The scientists also found that more constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that studying non-coding constraint can aid in the identification of constrained genes.

“We anticipate that Gnocchi could be used to prioritize genetic variation discovered in non-coding regions of the genome in patients with rare diseases, which can potentially provide clues for genetic causes of diseases and starting points for targeted therapeutics,” Karczewski explained.

Next, it will be important to add genomic information from more individuals to this newly developed dataset.

“Future efforts towards a larger, more diverse human reference dataset would further improve rare disease diagnoses for all, and create better powered constraint metrics, giving us a better understanding of the distribution and effects of human genetic variation,” Karczewski said.

Adapted from a press release issued by Massachusetts General Hospital.


Support for this study was provided by the National Institute of Diabetes and Digestive and Kidney Diseases and the National Human Genome Research Institute.

Paper cited

Chen S, Francioli LC, et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature. Online December 6, 2023. DOI: 10.1038/s41586-023-06045-0.