Researchers are bringing together sophisticated algorithms and rich biological data to solve previously intractable problems.
Cardiologist Jennifer Ho sees many patients with heart failure at Massachusetts General Hospital. For about half of them, the best course of treatment is unclear. They have a subtype of the disease, called heart failure with preserved ejection fraction (HFpEF, pronounced “heff-peff”), that has a complicated mix of risk factors and no approved therapies.
To better understand the causes of HFpEF, Ho, also an associated scientist at the Broad Institute of MIT and Harvard, is collaborating with Broad data scientists and cardiologists to use machine learning algorithms to analyze large sets of clinical data from patients with HFpEF. Figuring out how risk factors such as blood pressure, body mass index, and age fit together is challenging with existing tools. By turning to machine learning, Ho, a faculty member with Mass General’s Cardiovascular Research Center and a member of Broad's Cardiovascular Disease Initiative, and her collaborators hope to uncover previously imperceptible patterns in all that patient data that could help them better understand how this kind of heart failure progresses. The work is part of a larger collaboration between the Broad Institute and Bayer Healthcare that’s aimed at finding new treatments for patients with HFpEF.
This project is one of many at Broad and beyond that are harnessing machine-learning algorithms and other computational methods to find patterns in ever-growing troves of biological data. These tools — designed to analyze large amounts of diverse data types, including microscopy, clinical and genomic data — are allowing clinicians, biologists and data scientists to work together to glean new insights into key cellular and molecular processes and how they go awry in disease. Such insights could pave the way to new and better ways of diagnosing and treating disease, and in some cases are already pointing to potential new drugs and drug targets. Machine learning is even beginning to change how some researchers do their experiments, allowing them to extract more findings from similar amounts of lab work.
“The last decade has seen an explosion in biomedical data, and the ability to apply the tools of machine learning to these large datasets to address fundamentally important problems in biology and medicine is one of the most exciting opportunities of our time. We can leverage machine learning to improve patient care and our understanding of disease,” said Anthony Philippakis, who trained as a cardiologist and is Broad’s chief data officer and co-director of the Eric and Wendy Schmidt Center (EWSC) at the Broad.
“One thing that excites me about the merging of biology and machine learning is that thinking about problems in biomedicine will require us in computer science to think deeper and to develop new theories, methods, and foundations that will advance the field of machine learning and, in turn, advance biomedical research,” said Caroline Uhler, EWSC co-director, an associate professor in electrical engineering and computer science and the Institute for Data Systems and Society at MIT, and an associate member of the Broad.
Machine learning also allows scientists to go after big problems that would have previously required impractical amounts of data or lab work. “Machine learning increases the scope of our ambitions,” said David Liu, a Broad core institute member and director of the Merkin Institute for Transformative Technologies in Healthcare, whose lab is using machine learning across multiple areas. “That’s been an important change.”
Feeding hungry algorithms
Biologists at Broad and elsewhere are increasingly using machine learning because it can pick up on subtle patterns and connections in data missed by conventional tools. Researchers create a machine learning model by training it on datasets, allowing it to learn statistical relationships within the data, and then using those relationships to make predictions when it crunches through new datasets.
These algorithms come in a variety of types. Supervised machine learning requires datasets that are annotated with information that primes the computer to discover a relationship of interest. By contrast, unsupervised machine learning algorithms aim to find patterns in unannotated data without being asked to look for anything in particular.
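The distinction between the two paradigms can be sketched in a few lines of code. In this toy example (all numbers, labels, and thresholds are invented for illustration), a "supervised" model learns a decision threshold from labeled measurements, while an "unsupervised" one — here, a bare-bones two-cluster k-means — finds structure in the same kind of data with no labels at all:

```python
# Supervised: labeled examples (value, label) prime the model to find
# a specific relationship -- here, a decision threshold between classes.
labeled = [(0.1, "healthy"), (0.2, "healthy"), (0.8, "diseased"), (0.9, "diseased")]
healthy = [v for v, lab in labeled if lab == "healthy"]
diseased = [v for v, lab in labeled if lab == "diseased"]
threshold = (max(healthy) + min(diseased)) / 2  # midpoint between classes

def predict(value):
    return "diseased" if value > threshold else "healthy"

# Unsupervised: no labels -- a tiny k-means (k=2) just looks for whatever
# grouping structure the numbers themselves contain.
unlabeled = [0.15, 0.05, 0.85, 0.95, 0.1, 0.9]
centers = [min(unlabeled), max(unlabeled)]
for _ in range(10):  # alternate assignment and update steps
    clusters = [[], []]
    for v in unlabeled:
        clusters[abs(v - centers[0]) > abs(v - centers[1])].append(v)
    centers = [sum(c) / len(c) for c in clusters]

print(predict(0.7), sorted(round(c, 2) for c in centers))
```

Real biomedical models are vastly larger, but the division of labor is the same: supervision tells the algorithm what relationship to look for; without it, the algorithm can only report the structure it finds.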
However, as powerful as these tools are, they don’t have the judgment to know when a pattern they detect is meaningful. A computer can pick up random noise or underlying biases within a dataset and mistake them for biology, so scientists play an important role in interpreting the results that machine learning algorithms produce.
Machine learning and biology are a worthy match. Both fields have advanced in recent years to a point where each is now able to enhance the other. Machine learning algorithms have grown more sophisticated, becoming attractive to researchers looking for better ways to analyze gigabytes, even petabytes, of biomedical data. Biologists have developed a rich array of tools to probe biological systems in the lab, allowing them to gather data down to individual cells and molecules and to collect a growing number of highly detailed measurements from patients. The challenge of making sense of such vast and complex data provides an opportunity for computational experts to build new and better tools and methods, and move the machine learning field forward in new ways.
Working in biology also lets data scientists help make a difference in human health, said Puneet Batra, director of machine learning in the Broad’s Data Sciences Platform and a member of Broad's Machine Learning for Health group. “People are figuring out how to analyze disease and how to help society. We have our small role in that, but it's really motivating.”
Broad researchers have been using machine learning tools to, among many projects, find new antibiotics, understand how engineered proteins behave, study cardiac structure, predict compound activity, and optimize the design of CRISPR-based diagnostics for a variety of viruses. Since 2015, the Models, Inference & Algorithms Initiative at the Broad has been bringing the biomedical, machine learning, mathematics and computer science communities together and supporting learning and collaboration through regular seminars, workshops, and other resources. And the Broad’s Data Sciences Platform developed the Genome Analysis Toolkit (GATK), a tool commonly used by geneticists to analyze whole genome sequences and find key genetic variants linked to human traits or disease. Rare variants can be important in disease, but they can be difficult to distinguish from technical errors generated by the sequencing process. Machine learning allows GATK to recognize and sift out the technical artifacts.
At first, researchers designed such tools for specific problems, but their classification prowess didn’t transfer well to similar but slightly different data. Now these tools are becoming increasingly generalizable, such as for image analysis, said Anne Carpenter, senior director of Broad’s Imaging Platform. For example, her team recently demonstrated a machine learning tool that can identify nuclei in images of different cell types, including ones taken from different tissues, species and even those captured by different types of microscopes.
More generalizable algorithms can increasingly identify relationships in images on their own, without being told by scientists what to look for. “Bio-image analysis is transitioning from just measuring the thing that the biologist is interested in measuring to using the computer to look beyond what the human eyeball can see,” said Carpenter. Her group has developed an algorithm that can accurately classify cells without the need to tediously stain them with antibody markers and manually identify key features. Her team recently used this strategy to screen the quality of donated red blood cells.
Linking disparate data
Much of the promise of machine learning comes from the way that it can help researchers link different data types and generate new hypotheses. Mass General cardiologist and Broad machine learning researcher James Pirruccello and his team are using machine learning to help tease out genetic risk factors for various conditions, including thoracic aortic aneurysms — bulges in the wall of the body’s largest artery that can rupture, sometimes fatally.
Using data from the UK Biobank, the researchers trained a machine learning algorithm to sift through 4 million magnetic resonance images (MRIs) of participants’ ascending aortas and identify large vessels that could be at risk of bursting. They then did a genome-wide association study of 40,000 individuals to look for genomic regions associated with aortic size. In a recent preprint, the team showed that up to 60 percent of the variation in aortic size is linked to genetic factors. They also identified about 115 locations in the genome linked to aortic size. Some of these hits were near the genomic region linked to Marfan syndrome, a genetic disorder with a known aneurysm risk.
The results point to potential targets for new therapies for these aneurysms, and are also a starting point for screening at-risk patients for additional follow-up and possible preventative measures such as surgery, said Pirruccello, who is part of Broad's Cardiovascular Disease Initiative. The team used just 116 annotated examples to train their model to screen millions of MRI images. “This work would have been impossible without machine learning,” Pirruccello said.
Sifting out the noise
Integrating a range of data types and interpreting their relationships are important in basic biology, too. Mehrtash Babadi, a machine-learning group leader at the Broad and steering committee member of Broad's Models, Inference & Algorithms Initiative, and his group use machine learning to analyze single-cell RNA sequencing data. A typical experiment using this technology generates an enormous amount of data — sequences of the RNA transcripts in each of tens of thousands of cells, which researchers must analyze individually. The data can also include noise from ambient RNA outside of cells and other sources.
To help sort through all this data, Babadi’s group has developed an open-source software package called CellBender that teases apart signal from noise. Using this tool, researchers can more confidently and easily move from studying individual cells to integrating and analyzing them together in functional units, such as an organ, tumor, or multicellular organism. “This is important both from the perspective of basic biology, to learn how complex tissues function, and also from the perspective of clinical and pharmaceutical applications, to identify new druggable molecular targets,” said Babadi.
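The core idea — separating each cell's true signal from the ambient background — can be illustrated with a deliberately simplified sketch. CellBender itself uses a far more sophisticated deep generative model; this toy version, with invented counts (rows are droplets, columns are genes), just estimates a background profile from near-empty droplets and subtracts it:

```python
# Invented toy counts: near-empty droplets capture mostly free-floating
# ("ambient") RNA, so their profile approximates the background noise.
droplets = {
    "cell_1":  [50, 3, 40, 2],
    "cell_2":  [45, 4, 38, 3],
    "empty_1": [2, 2, 1, 2],
    "empty_2": [2, 1, 2, 1],
}

# Estimate the ambient background as the mean profile of empty droplets.
empties = [v for k, v in droplets.items() if k.startswith("empty")]
background = [sum(col) / len(empties) for col in zip(*empties)]

# Denoise real cells by subtracting the background, clipping at zero.
denoised = {
    k: [max(0.0, c - b) for c, b in zip(counts, background)]
    for k, counts in droplets.items() if k.startswith("cell")
}
print(denoised["cell_1"])
```

In the real tool the background and the per-cell signal are inferred jointly and probabilistically, but the goal is the same: counts that reflect the cell, not the soup it was captured in.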
Machine learning is also helping researchers predict the results of genome editing, thanks to a collaboration between the groups of David Gifford at MIT, Richard Sherwood at Brigham and Women’s Hospital, and David Liu at the Broad. Such predictions are helping researchers design more precise, efficient, and versatile gene editors.
In a 2018 Nature paper, MIT graduate student Max Shen and postdoc Mandana Arbab in Liu’s lab developed a tool named inDelphi that uses machine learning to predict the exact edits that occur, and their frequency, when CRISPR/Cas9 makes a targeted cut in the DNA of mammalian cells.
The team went on to use a similar approach, in a 2020 Cell paper, to predict the editing outcomes of 11 different base editors on any target DNA site of interest. This machine learning tool, BE-Hive, helps researchers judge which editing strategy helps them best achieve their desired outcome and also provides new insights into how base editors work. “All of that led to a much deeper and more comprehensive understanding of the factors that govern base editing in mammalian cells,” Liu said.
Machine learning can do a lot more for biology than make new connections within a sea of data. Brian Cleary, a Broad Fellow and steering committee member of Broad’s Models, Inference & Algorithms Initiative, said that unlocking the full potential of machine learning means rethinking experimental design. “To take things to the next level, you take insights about how the algorithms work, and use that to rethink how you're gathering and generating data in the first place,” Cleary said. This, he added, will ultimately make experiments more efficient and maximize the knowledge gained.
For example, if researchers already know that expression of gene A is strongly correlated with the expression of gene B, they don’t need to do separate assays for each gene. Instead, algorithmic analysis suggests that when genes are strongly co-regulated, measuring just a small number of combinations of gene abundance can reveal the expression level of a larger number of individual genes.
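A toy version of that logic, with invented numbers, makes the arithmetic concrete. Suppose prior experiments show gene B is always expressed at twice the level of gene A; then a single composite reading per cell, rather than two separate gene assays, recovers both expression levels:

```python
# Composite measurement: y = x_a + x_b, with the known co-regulation
# constraint x_b = ratio * x_a (ratio invented for illustration).
def recover(y, ratio=2.0):
    # y = x_a + x_b and x_b = ratio * x_a  =>  x_a = y / (1 + ratio)
    x_a = y / (1.0 + ratio)
    return x_a, ratio * x_a

# One measurement per cell instead of two separate gene assays:
composite_readings = [9.0, 12.0, 3.0]
expression = [recover(y) for y in composite_readings]
print(expression)  # [(3.0, 6.0), (4.0, 8.0), (1.0, 2.0)]
```

Real co-regulation is statistical rather than exact, so methods built on this idea measure several composite signals and solve for the most plausible underlying expression levels — but the saving in measurements comes from the same source: redundancy among co-regulated genes.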
Cleary and his colleagues recently used this approach to image RNA in single cells and learn which genes are being expressed. To do this, researchers use a technique called single-molecule fluorescence in situ hybridization (smFISH), in which fluorescently labeled probes bind to complementary RNA sequences of interest, lighting up individual transcripts. This approach, however, is slow and painstaking and can generate overwhelming amounts of data when researchers try to scale it up.
To make smFISH experiments more efficient, Cleary and his collaborators used prior biological knowledge to choose which experiments to run and which data to collect.
In a recent preprint, Cleary’s team showed that with their tool, called composite in situ imaging, they could figure out the expression level of 37 genes of interest from nearly 500,000 single cells from 12 different mouse brain sections, using just 11 measurements. Cleary says that this increased efficiency expands by around 500-fold what scientists can study and learn from similar amounts of experimental work. “Now instead of looking at one small section of a tissue at a time, and doing that repeatedly, you can start to look at, say, an entire organ in one experiment,” Cleary said.
Building biology’s software
Many innovations in machine learning originated from physics, finance, and the technology industry, and are now spilling over into biomedicine. For example, about three years ago, Babadi and a few others at Broad started using a tool called Pyro, developed by machine learning researchers and software developers at Uber AI Labs. Last year, three of those scientists, Fritz Obermeyer, Martin Jankowiak and Eli Bingham moved to the Broad, attracted by the types of problems and data available in biology and the opportunity to work alongside biomedical scientists and further develop Pyro and other new tools.
Pyro is built on the Python programming language and allows researchers to incorporate more complex probabilistic modeling and levels of uncertainty into their machine learning models. It also helps researchers construct larger, modular models that combine simple, interpretable statistical components with higher-capacity deep neural networks.
Previously, researchers developing machine learning models had to choose between a simple Bayesian model that they could easily interpret and a powerful deep learning model whose answers they couldn’t trace back to the data. Pyro’s more flexible strategy makes machine learning more interpretable. Depending on how much researchers know about the biology of the system they’re trying to model, they can specify how an algorithm classifies some parts of the data, while leaving the computer to make other inferences on its own.
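The "simple, interpretable Bayesian component" half of that trade-off can be shown in a few lines. This stdlib-only sketch is not Pyro code — Pyro expresses such models with its own sampling primitives and can attach neural-network components — and the numbers are invented, but it shows what an interpretable probabilistic piece looks like: a prior belief about a rate, updated by observed counts, yielding an estimate with an explicit uncertainty:

```python
from math import sqrt

# Interpretable prior: Beta(alpha, beta) belief about an unknown rate
# (say, how often an editing attempt succeeds); both values invented.
alpha, beta = 2.0, 2.0          # weakly informative prior

# Observe 30 successes out of 40 attempts; conjugate Bayesian update.
successes, trials = 30, 40
alpha += successes
beta += trials - successes

mean = alpha / (alpha + beta)   # posterior mean rate
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
print(round(mean, 3), round(sqrt(var), 3))  # estimate with uncertainty
```

Every quantity here has a meaning a scientist can inspect — which is exactly what gets lost inside a pure deep network, and what a modular framework lets researchers keep for the parts of a model they understand.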
One of the goals of the Pyro trio is to develop tools that will be useful for many problems, rather than one-off solutions that a single computational biologist might develop for one lab’s problem. “There’s an opportunity to build larger, more integrated, reusable software components,” Jankowiak said.
Meanwhile, cardiologist Jennifer Ho at Mass General is continuing to dive deep into the causes of HFpEF. She and her team are collecting a wide array of detailed measurements from a small number of patients — what researchers call “deep phenotyping” — in addition to their work with Broad data scientists. "What I'm excited about is being able to do both this deep phenotyping approach and also learn from the broad big-data analysis,” Ho said. “Hopefully both together will help us better understand why people get HFpEF in the first place."
Indeed, researchers at the intersection of biomedicine and machine learning say that all these data-driven tools can ultimately help scientists decipher and connect the key programs that run an organism's biological machinery. Decoding these programs will allow researchers to figure out how to tweak them to treat disease. “That's the grand challenge in my mind: how to compress or distill vast amounts of data into the core operating principles of a eukaryotic cell,” Babadi said. “If you can answer that question, you can answer anything.”