International consortium launches catalog of overlooked parts of the genome

A Broad scientist discusses a new effort to identify oft-missed protein-coding sequences in the human genome. These sequences, called non-canonical open reading frames, could yield insights into disease and what makes humans unique.

John Prensner, Broad postdoctoral researcher and pediatric oncologist at Dana-Farber Cancer Institute and Boston Children's Hospital

When they sequenced the human genome for the first time, researchers from the Human Genome Project identified about 20,000 genes, many of which had not been known before. To find those genes, they used a set of rules. For example, they looked for DNA sequences predicted to generate proteins that are larger than a certain size and found in other species.

But some scientists believe these rules excluded sequences that could still play an important role in health and disease: non-canonical open reading frames (ORFs), which are found in parts of the genome traditionally labeled as non-coding. Several research groups have shown that some ORFs do encode proteins with biological roles in human cells. Last year, researchers at the Broad Institute of MIT and Harvard found that, of 553 ORFs they examined, about half showed signs of producing proteins; many had implications for the survival of cancer cells.
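As a toy illustration of how a size rule can hide short ORFs, the sketch below scans the forward strand of a DNA string for stretches that run from an ATG start codon to a stop codon, then applies a minimum-length cutoff. The function name `find_orfs`, the `min_codons` parameter, the example sequence, and the cutoff of 5 codons are all hypothetical choices made for this example. They are not the rules used by the Human Genome Project or by the consortium, and real annotation pipelines weigh far more evidence.

```python
# Toy ORF scan: forward strand only, ATG start, first in-frame stop.
# All names and thresholds here are illustrative assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=0):
    """Return sorted (start, end, n_codons) tuples for ATG...stop ORFs.

    start is the 0-based index of the ATG, end is one past the stop codon,
    and n_codons counts the codons before the stop (including the ATG).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a new reading frame
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None  # close the frame and keep scanning
    return sorted(orfs)


# A made-up sequence with one very short ORF and one longer ORF.
toy = "CC" + "ATGAAATAG" + "G" + "ATG" + "GCT" * 5 + "TGA"

all_orfs = find_orfs(toy)                  # no size rule
long_orfs = find_orfs(toy, min_codons=5)   # hypothetical size rule
print(all_orfs)    # [(2, 11, 2), (12, 33, 6)]
print(long_orfs)   # [(12, 33, 6)]
```

With no cutoff the scan reports both ORFs. With the cutoff, the two-codon ORF disappears from the output even though it is a valid start-to-stop stretch, which is the kind of omission the paragraph above describes.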

ORFs are difficult to catalog because they may escape detection by protein sequencing technologies such as mass spectrometry. But a technology developed in the last decade, ribosome profiling (“Ribo-seq”), could help pinpoint more. Ribo-seq allows researchers to sequence segments of RNA that are being translated into protein by ribosomes, the cell’s protein-making machinery. It can also detect smaller sequences than mass spectrometry can, and it can identify them even if scientists don’t know where in the genome to look.

Now Broad researchers are working with international collaborators to generate a catalog of ORFs for the global research community. In Nature Biotechnology, the group recently described this initiative and their plan to incorporate the ORFs in reference genome databases. 

To learn more about the motivation and challenges behind these efforts — and what information they might unearth — we talked to John Prensner, a postdoctoral researcher at Broad and a pediatric oncologist at Dana-Farber Cancer Institute and Boston Children's Hospital, who co-founded the consortium. The following interview has been edited for length and clarity.

 

Why is it important to annotate these regions of the genome?

The human genome, and specifically the database compiled by the GENCODE consortium, is used by basically every research institute on Earth, and by many major research initiatives, as the fundamental basis for research projects to identify genes. Using GENCODE with missing sequences is kind of like using a dictionary with missing words. If you're a native English speaker and you want to learn French, and the French word you want to look up isn't in the dictionary, it's hard to know what to do. We're trying to create a new capacity within this dictionary of the human genome that researchers across the globe will be able to use to study human disease and physiology.

What are the goals of the consortium?

The first goal is to get a global community of academic researchers in communication with various annotation databases such as GENCODE, HGNC, UniProt, PeptideAtlas, and the Human Proteome Project. Together we want to establish ground rules for how much confidence to place in these ORFs, how to interpret them, and how to present them to the global public through these databases.

Second, we’ll have to identify what constitutes evidence of a protein. Ribosome profiling is the basis for detecting most ORFs, but it does not directly detect a stable protein; it detects a translation event in the cell. That translation event might result in an unstable series of amino acids. We’ll need to think about what types of proteomic evidence we consider acceptable.

Third, we’re excited to start developing tools to understand sequence mutations in large databases, such as gnomAD. There are many tools to analyze variants in the DNA that encodes known proteins. These tools help us to define harmful mutations that lead to human disease, but they are not well suited to studying ORFs. Many of those tools rely upon evolutionary conservation, for example, which is frequently not present to the same degree in ORFs. We need to come up with a new way to understand what these variants may be doing and whether or not they may be causing disease.

How has the scientific community responded so far?

Our community is excited because these annotations are long overdue. We've had many groups reach out to us saying, “Hey, we want to be involved. What can we do?” We've had great conversations with mass spectrometry communities — these are communities that define the standards of what makes a stable protein. Bridging the gap between people who study the genome and those who study the proteome has been a major effort. 

What are the challenges in trying to centralize all this data?

We need to be careful in our communication about what the consortium is, how we package results, and how to interpret them. For example, we decided that, for now, we will not annotate these ORFs as proteins in official human genome annotations. If we annotated all of these ORFs as conventional proteins, that would dramatically disrupt thousands of labs and clinical operations across the world. It’s better to introduce them as a separate type, provide education around what they are, and maybe at an appropriate time start to merge them. For many researchers, this is going to be an entirely new kind of data.

How could these insights change how we think about genes?

Fundamental to this work is the philosophical question of what it means to be human. We tend to think of the genome as one of our defining traits. Our genome has features that are distinctly different from any other creature’s, even ones that we’re closely related to, like orangutans or chimpanzees. The human genome is part of our humanity because it has the genes that make us human.

If we're going to start identifying new features within the human genome, we’re also identifying new features that influence what it means to be human. Many of these ORFs are not found in other species, so they will start to tell us things that make us unique as humans. When we study the genomes of other species, we're going to find other ORFs that make those species uniquely them as well. This is already starting to happen. In the mouse world, there have been several elegant studies of ORFs that seem incredibly important in mice but aren’t present in humans. We're already starting to learn how these elements are going to inform what it means to be a certain type of creature.

 

This work was supported by the Wellcome Trust, the National Human Genome Research Institute of the National Institutes of Health, and the European Molecular Biology Laboratory.

Paper(s) cited

Mudge JM, Ruiz-Orera J, Prensner JR et al. Standardized annotation of translated open reading frames. Nature Biotechnology. Online July 13, 2022. DOI:10.1038/s41587-022-01369-0.

Prensner JR et al. Noncanonical open reading frames encode functional proteins essential for cancer cell survival. Nature Biotechnology. Online January 28, 2021. DOI:10.1038/s41587-020-00806-2.