#WhyIScience Q&A: A machine learning engineer builds algorithms to improve clinical research

Pulkit Singh talks about her work at the intersection of computer science and biology and her efforts to promote equitable science.

Pulkit Singh
Credit: Allison Dougherty, Broad Communications

As an undergraduate at Princeton University, Pulkit Singh loved thinking about intelligence and how humans experience the world. She dabbled in philosophy, visual arts, and computer science, each field granting her a new way to think about the mind.

During a study abroad program in Edinburgh, UK, Singh took a computational cognitive science class and knew she’d found her niche. She’d been fascinated by the brain but couldn’t see herself becoming a biologist in the lab. And although she loved computer algorithms, she hadn’t thought about how human and machine intelligence could benefit each other. Machine learning, she realized, offered the perfect tools for studying cognitive neuroscience.

Back at Princeton, Singh joined the Computational Cognitive Science Lab run by Tom Griffiths and spent the next year and a half developing machine learning models that could predict how humans categorize stimuli. In 2020, she graduated with a degree in computer science and minors in cognitive science and machine learning.

Singh now works at the Broad Institute of MIT and Harvard as a machine learning engineer on the Machine Learning For Health (ML4H) team, which is using machine learning to drive new studies on the genetic mechanisms of disease. She collaborates with researchers from the Mass General Brigham (MGB) hospital system to develop algorithms to predict risk and enable biological discovery in cardiovascular and brain health from a range of clinical data types. Singh is also a co-chair of Women@Broad, a group dedicated to building communities and advocating for trans and cis women and other marginalized genders at the Broad. 

We spoke with Singh about what it’s like working on an interdisciplinary team and the importance of diverse, equitable, and inclusive scientific communities in a #WhyIScience Q&A.

 

What is it like to work on the ML4H team?

Interdisciplinary collaboration is really fun. You need to have real respect for the expertise of the person that you're working with. I have had to google a lot of biology, and I've asked a lot of stupid questions, but that's where the real work happens.

You get to learn about something from an expert and see how your own expertise can contribute to something meaningful — I get a lot of energy from that. I get excited because I can have a conversation, go and code something up, and come back and say, “Hey, this is what I found.” It's a really good feedback loop. 

 

What kind of data do you work with? 

One of our primary data sources is a cohort of 520,000 people who have interacted with the MGB healthcare system over a long period of time. We have narrative notes from clinicians, structured measurements, billing codes, and rich multimodal data like ECGs, MRIs, CTs, and other diagnostic imaging. Machine learning can help us use these to build deep phenotypes, which are descriptions of a patient's biology that we might not have thought to look for yet, or that clinicians might not recognize by eye. We can use those representations to predict who's going to be at risk, what genetic pathways underlie disease, or what drugs might work best. This isn’t the final answer. We still need clinical trials, but we think these methods can make trials faster and less expensive.

 

What are some of the methods you use in your work?

Natural language processing is the discipline of using computational models to understand human language. The approach that's been most successful in the last few years has been a class of models called transformers. You can teach these models in an unsupervised way and they learn to pay attention to what's important. If they read enough text, they can get a sense of what a word means just by the context it occurs in. 
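To make that idea concrete, here is a minimal sketch using the open-source Hugging Face transformers library and a generic pre-trained BERT model; the model choice, the example sentence, and the library are illustrative assumptions, not a description of the ML4H team's actual tooling.

```python
# A minimal sketch: a pre-trained masked language model fills in a hidden word
# from its surrounding context, illustrating how transformers pick up word
# meaning from unlabeled text. Model and sentence are illustrative only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model was never given a definition of the masked word; it proposes
# plausible completions purely from patterns it saw during unsupervised
# pre-training on large amounts of text.
for prediction in fill_mask("The cardiologist reviewed the patient's [MASK] results."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```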

An application of these methods is extracting information from messy real-world data. In an ideal world, we'd be able to use measurements that someone read off an MRI to train the model. But these measurements are embedded in narrative text and every clinician writes them differently. We found we could leverage pre-trained natural language understanding from transformers to extract 21 different cardiac measurements with just 200 labeled examples that a clinician provided in about 15 hours when he was on some night shifts. That showed the potential of this kind of collaboration — in this cardiologist’s down time, he helped us build a powerful model, and now we have new measurements that we didn't have access to before. 
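The article doesn't detail the team's architecture, but the general pattern of leaning on pre-trained language understanding to pull a structured value out of free-text report language can be sketched roughly as follows. The off-the-shelf question-answering model, the question phrasing, and the synthetic report text are all assumptions for illustration; the actual clinical system was fine-tuned on the clinician's roughly 200 labeled examples rather than used off the shelf.

```python
# A rough sketch of extracting a structured measurement from narrative report
# text with a pre-trained transformer. The report below is synthetic, and the
# generic question-answering model stands in for the team's fine-tuned system.
from transformers import pipeline

extractor = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",  # generic pre-trained QA model
)

report = (
    "Cardiac MRI: The left ventricular ejection fraction is estimated at 58%. "
    "No regional wall motion abnormalities are identified."
)

result = extractor(
    question="What is the left ventricular ejection fraction?",
    context=report,
)
print(result["answer"], result["score"])  # e.g. "58%" plus a confidence score
```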

 

How do you hope your work will impact healthcare in the future? 

We want to make an impact on human health, and aid in scientific discovery. For me, a big piece of this is making computational models work equitably across a diverse set of populations. We've seen that they can encode or even amplify systemic oppression. The stakes are especially high when they apply to healthcare, so it's important that we're building equitable technology. We want to do our best to mitigate any health disparities with the deployment of these technologies. 

If we're building new deep phenotypes and reconsidering how we define a disease or what biological signals we use to make a diagnosis, there's a real opportunity to have that definition capture the range of biological diversity that exists across different identities, and to make sure that everyone gets the care they need. 

 

What are you most proud of in your career so far? 

I've been able to build research and intellectual relationships with people I respect, and I try very hard to be collaborative and open to feedback. I’m so lucky to work with such brilliant and thoughtful collaborators. I consider that a huge privilege. 

The other thing I care a lot about is being thorough. In some academic contexts, there can be pressure on young researchers to complete projects as quickly as possible. There have been times in the last two years when I've realized that something was off about the data, raised my hand, and said, “Hey, we need to fix this.” I think it’s rare that a group is set up not only to support that kind of attention to detail, but to encourage it. When you work with clinical data, commitment to rigor is really important. That's the ethos of our team and our collaborators: to be engaged deeply with data so that we can build good models. 

 

Have you had any mentors who made a big difference in your development? 

I didn't come into undergrad very prepared to be a computer science major. I was close to giving up because I felt really behind my peers. But then I took this linguistics class with a computational linguist named Christiane Fellbaum. She believed in me even though she had only known me for a semester and made me feel like I had interesting ideas. Tom Griffiths has also been an incredible research mentor and really encouraged my interest in cognitive science. 

At the Broad, my manager Puneet Batra has been an incredibly supportive mentor. All the PIs our group works with — Chris Anderson, Patrick Ellinor, Jen Ho, Steve Lubitz, and Anthony Philippakis — have also been exceptional scientific role models. I also really looked up to René Salazar, the head of the IDEA [Inclusion, Diversity, Equity, and Allyship] Office, who sadly passed away recently. Women@Broad worked a lot with him and we all miss him and respect him so much. I'm so glad that the IDEA Office is continuing to do all of the incredible work that they do. He was awesome. 

I’ve also realized how important it is to have community. I worked with the Princeton LGBT Center in undergrad, and work a lot with Women@Broad now. These communities have taught me that having a diverse and inclusive scientific community is closely tied to doing good, equitable science. That's impacted me on a personal level because I felt a lot of imposter syndrome and didn't even realize until six months ago that I could claim the identity of being a scientist. It's important for all of us to claim and add nuance to that identity, and to support folks from the most underrepresented groups to be in science. 

 

Do you have any advice for younger scientists who might be struggling with imposter syndrome? 

Everyone brings something different to the table — even if you haven't been coding since you could walk, having broad interests or a diverse set of lived experiences can be really meaningful. I think I was socialized to pursue perfection rather than curiosity. My academic and research journey has really tempered my perfectionism, in a good and productive way. Just being curious instead of trying to be perfect has really changed my worldview, because when you are not expected to be perfect, you can experiment and mess up and ask questions. And that's how you really learn.