#WhyIScience Q&A: A computational biologist helps build datasets for genetic disease diagnosis

Katherine Chao talks about her work managing the Genome Aggregation Database (gnomAD) and the importance of diverse datasets in rare disease diagnosis.

Photo: Katherine Chao, product manager for gnomAD. Credit: Allison Colorado, Broad Communications

It was a chance conversation with a friend that led Katherine Chao to a career in computational biology. After graduating with a bachelor’s degree in biological sciences and spending a year teaching English in South Korea, Chao wasn’t sure what her next step would be. Her friend mentioned that the National Institutes of Health (NIH) was looking for biologists who wanted to learn coding. Chao joined the NIH as a Post-Baccalaureate Intramural Research Training Award fellow, where she processed and analyzed genomic data in the context of rare disease.

After two years at the NIH, Chao landed at the Broad Institute of MIT and Harvard, where she first worked as a clinical genomic variant analyst in the Center for Mendelian Genomics. In that role, Chao continued to work in rare disease genomic analysis, identifying sections of duplicated or deleted genomic sequences called copy number variants. She is now the product manager for the Genome Aggregation Database (gnomAD), a public database of human genetic variation with over 200,000 genome and exome sequences that researchers use to study the genetic basis of human disease.

In this #WhyIScience Q&A, we spoke with Chao about her career path and what makes gnomAD unique.

What drew you to computational biology?
My introduction to computational biology was through the world of rare disease. I was intrigued by the unfamiliarity of everything I worked on. When I was at the NIH, it was always rewarding to find a diagnosis that would be taken back to patients. Helping someone reach their goal of getting a diagnosis was a way of making an impact. And I love the puzzle aspect of coding. When there’s a problem, it’s on you to figure out how to solve it. 

What do you do as product manager of gnomAD?

I’m responsible for maintaining the product vision, which is the goal behind building a product. Part of that is figuring out what value our product brings and deciding how to guide the product’s growth. 

I’m also on the steering committee for gnomAD. I ensure that everyone's opinions are heard and that we're balancing where we're heading with gnomAD as a whole. We cannot do too much of one thing and everyone has to be included. It requires effective communication to make sure the different teams working on the product are aligned.

What makes gnomAD unique?

Our team is all staff scientists. That's quite unique in our field. A lot of academic software is produced by trainees or graduate students during their training. So they develop something, it’s really impactful, and then it will never get supported again because they move on. But because we’re a team of staff scientists, gnomAD has continuity, and behind the scenes, my team members are doing immense amounts of quality control, which is an important but typically unrecognized part of the research process. 

My team in the Translational Genomics Group is really inspiring to work with. The science we do is awesome, but I really like working with such genuinely wonderful people. Everyone is kind, talented, and hardworking. Members of my team lead by example and I want to be more like that. I've learned a lot from working with them, and I feel really fortunate to have worked with other computational scientists like Julia Goodrich, Kristen Laricchia, and Mike Wilson for years.

What do you like about working with gnomAD?

gnomAD is exciting because we’re always working on the cutting edge. As the dataset grows, we continue to push the limits of what can be done computationally. The next gnomAD release, v4, will contain over 800,000 individuals, 25 percent of whom have been inferred to have non-European genetic ancestries. The scale of v4 will allow us to release aggregate allele frequencies (how often an allele is present in the general population) for a lot of previously undiscovered variants. We'll also be much better powered in our calculations of constraint against different types of variation.

Moving forward, I think the most exciting thing will be continuing to increase the diversity within the dataset. Principal investigators on our team push for diversity in gnomAD by emphasizing the need for global data-sharing and encouraging scientists to bring us their diverse datasets. Some of the researchers on the gnomAD steering committee are also working to include more diverse samples in their own sequencing projects.

Why is diversifying the gnomAD database so important?

More diversity means more genetic variation in the database. The more variation we have in a dataset like gnomAD, the more valuable it will be to the research community. When you’re looking for a genetic cause of a rare disease, one of the most important pieces of information you need is the aggregate allele frequency.

If you don't have that piece of information, it’s much harder to find a diagnosis for a patient with a rare disease. You could dig through the literature and maybe do some functional work. But if you had the aggregate allele frequency and were able to see that a certain genetic variant was very common in the population, you could immediately exclude that variant as being a driver of severe, early-onset genetic diseases. 

Having greater diversity in a database of human genetic variation gives us a better understanding of the landscape of human genetic variation. This includes identifying genes and regions that are intolerant to mutation, which means you’ll hopefully be able to quickly identify and prioritize disease-causing variants for more patients. That’s the core mission at gnomAD: to produce these aggregate allele frequencies so that people are better able to functionally or clinically interpret genome variation.
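
The frequency-based exclusion Chao describes can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not gnomAD's own code: the `Variant` class, the variant names, the counts, and the 1 percent cutoff are all hypothetical. It computes an allele frequency as the number of observed copies of an allele divided by the number of chromosomes genotyped at that site, then drops candidates that are too common to plausibly drive a severe, early-onset disease.

```python
from dataclasses import dataclass

# Hypothetical cutoff for illustration only; a real threshold depends on the
# disease's prevalence, inheritance pattern, and penetrance.
MAX_ALLELE_FREQUENCY = 0.01


@dataclass(frozen=True)
class Variant:
    """A candidate variant with aggregate counts from a reference dataset."""
    name: str
    allele_count: int   # observed copies of the variant allele
    allele_number: int  # chromosomes successfully genotyped at this site


def allele_frequency(variant):
    """Return allele_count / allele_number, or None if the site has no data."""
    if variant.allele_number == 0:
        return None
    return variant.allele_count / variant.allele_number


def prioritize(variants, max_af=MAX_ALLELE_FREQUENCY):
    """Keep rare or unobserved variants; drop those above the frequency cutoff."""
    kept = []
    for v in variants:
        af = allele_frequency(v)
        # A missing frequency is not evidence of rarity, so such variants
        # are kept for further review rather than excluded.
        if af is None or af <= max_af:
            kept.append(v)
    return kept


# Made-up example candidates from a hypothetical patient.
candidates = [
    Variant("variant_A", allele_count=45000, allele_number=150000),  # common
    Variant("variant_B", allele_count=2, allele_number=150000),      # very rare
    Variant("variant_C", allele_count=0, allele_number=0),           # no data
]

remaining = [v.name for v in prioritize(candidates)]
print(remaining)  # ['variant_B', 'variant_C']
```

In this sketch, the variant with no frequency data stays on the candidate list, which mirrors the point above: without an aggregate allele frequency, a variant cannot be quickly ruled out and needs literature review or functional work instead.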