#WhyIScience Q&A: A computational biologist studies how to better protect genetic data privacy

Hyunghoon Cho talks about his work using techniques from computer science to help improve biomedical data security.

Hyunghoon Cho is a Schmidt Fellow in the Data Sciences Platform who studies biomedical data privacy
Credit: Adrienne Mathiowetz Photography
As genetic sequencing gets cheaper, faster, and more available, the risk of genetic data leaks increases. DNA can reveal a lot about a person’s health, family history, and ethnicity, and keeping that information private is a pressing challenge.
Hyunghoon Cho is a Schmidt Fellow in the Data Sciences Platform, where he and his group use machine learning, cryptography, and other computational techniques to strengthen biomedical data privacy. They are focused on giving individuals control over their genetic privacy, while still allowing for the open sharing of biomedical information that drives scientific progress.
Cho grew up in South Korea, and his childhood was steeped in scientific curiosity. His father, a computer science professor, spurred his early interest in computer science, and Cho fondly remembers coding games as a child. He became fascinated with biology at a math and science high school, and moved to California at 16 to study biology at Stanford University. Although he switched his major to computer science, Cho maintained his interest in solving biological and medical problems. His current work on genetic privacy stemmed from a collaboration with a friend at MIT, where Cho completed his PhD in electrical engineering and computer science. Cho began working at the Broad in 2019.
We spoke with Cho in a #WhyIScience Q&A about the challenge of keeping genetic data private and his advice to young scientists who feel pulled by more than one discipline.
What projects are you working on at the Broad?
My group is trying to tackle a range of different problems in computational biology, but our main focus area is biomedical data privacy. We’re trying to develop computational techniques and theories that address the challenges people face when working with sensitive data. We’re aiming for solutions that protect people’s privacy but also aren’t so overly strict that they slow down scientific progress and limit data sharing. 
The interesting part of this is that we can tackle some of these problems by phrasing them as computational problems. There is a wide range of techniques from computer science, information theory, and related domains that have been designed in the past few decades to address these problems.
One of my current projects is to develop privacy-preserving algorithms for large-scale sensitive biomedical datasets that are held by multiple groups, such as institutions that aren’t able to share raw data due to existing privacy regulations. We also design new techniques to improve the scalability of existing privacy-preserving methods. We study unknown privacy risks in newly emerging services, data types, and public genetic analysis services. We also develop tools that provide data owners with more fine-grained control over privacy when sharing their data, such as hiding information about specific individuals or genotypes in the dataset and tracing the source of a potential data leak.
How did you start working in genetic privacy? 
It started out as a collaboration with my friend from college during my PhD. We entered and won one of the tracks at the iDASH secure genome analysis competition in 2016, where we combined my knowledge of computational biology with his knowledge of cryptography. Cryptography is a collection of theories and techniques for communicating and processing sensitive information. Most people think of cryptography as decoding messages, but it's actually a lot more than that.
I’ve been interested in using computational tools to solve privacy challenges ever since. For me, this area is meaningful and rewarding because this topic is so socially important and intellectually complex, and not enough people with my background are working on it.

What have been some of your biggest accomplishments so far? 
Towards the end of my PhD, I co-authored two papers that have since defined my trajectory. They both addressed the question of how to use cryptographic techniques to securely analyze datasets owned by different institutions at the same time, without sharing the raw data between the institutions. One paper addressed how to perform a genome-wide association study and the other addressed how to train a neural network model for predicting drug targets. We weren’t the first to work on bringing these cryptographic tools to biomedical data sharing, but I like to think that we contributed to making these ideas more mainstream in the field by demonstrating that it is possible to build practical protocols based on these techniques for key biomedical tasks.
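Editor's illustration: the interview does not describe the protocols in these papers, so the following is only a toy sketch of one widely used building block for this kind of joint analysis, additive secret sharing. It assumes three institutions that each hold a private count and want to learn only the pooled total. The function names, the modulus, and the example counts are all hypothetical and are not taken from Cho's work.

```python
import secrets

# Assumed modulus for the shares; any prime larger than the true total works.
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split value into n_parties random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_values):
    """Pool the values using only shares, never the raw inputs."""
    n = len(private_values)
    # Institution i splits its value; share j is sent to party j.
    all_shares = [share(v, n) for v in private_values]
    # Party j adds the shares it received. Each share alone looks random.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the partial sums are published and combined.
    return sum(partial_sums) % PRIME


# Hypothetical minor-allele counts held by three institutions.
counts = [120, 45, 310]
total = secure_sum(counts)
```

In this sketch each party sees only random-looking shares of the other parties' counts, yet the combined result equals the true total. Practical protocols for tasks such as genome-wide association studies need many more components than this, including secure multiplication and protections against misbehaving parties.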
What advice would you give young scientists?
I want to tell students who are interested in interdisciplinary research to not be afraid of feeling like you don’t belong to a specific community. Instead, think of the intersection of fields as your intellectual home. There is value in being able to interact with multiple communities, which unlocks unique opportunities that can be tackled only with an interdisciplinary approach.
Another general piece of advice I have is that when projects don’t go as expected, don’t be too discouraged, and continue pushing in different directions. Magic happens in research when one finds a solution to a seemingly difficult problem. If one knew something would work from the beginning, that wouldn’t be research!