Schmidt Center scientists develop a robust machine learning approach for virtual drug screening and other applications

Based on an infinitely large neural network, the simple, fast, and flexible framework is easily deployable for a wide variety of tasks.

Caroline Uhler, Adit Radhakrishnan

In the machine learning world, the “Netflix problem” is a well-known computational challenge: Knowing how a viewer has rated movies they’ve watched, machine learning algorithms predict what other films they might like.

To make these kinds of predictions, computational experts have used tools called artificial neural networks, which are modeled on the organization of cells in the brain. The neural network learns patterns in existing data (for example, a viewer’s preference for comedy or action films) and estimates missing values such as the ratings a viewer might give for unseen movies, so that it can recommend new films the viewer might enjoy. The approach, however, falls short in applications with more complex datasets, such as the effects of untested drug candidates on gene activity in a variety of cell types.

Scientists in the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard have built a new machine learning framework that is up to the task. They’ve developed a method that can be easily run on a standard laptop, avoiding the excessive computational costs of other massive neural network models. The researchers demonstrated the potential uses of their approach by conducting a virtual screen of drug candidates and filling in missing pixels in a digital image, and offered guidance on how it can be applied to other complex machine learning tasks or simpler ones like the Netflix problem. The results also shed light on how neural networks function to make predictions, which hasn’t been fully understood.

The new model, described in the Proceedings of the National Academy of Sciences, is designed to perform a common task in machine learning called matrix completion, which underlies systems that recommend things like movies. In these tasks, data are expressed as a matrix, or a grid with rows and columns, and some data points in the grid are missing. Neural networks fill in those blanks with their predictions to “complete” the matrix. The Broad team applied what is known as an “infinite width” neural network to the matrix completion problem and demonstrated that the method can quickly run complicated analyses for a range of tasks.
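
To make the task concrete, here is a toy sketch in Python (illustrative only, not the study's data or code): a viewer-by-movie grid in which np.nan marks the ratings a matrix completion method must predict.

```python
import numpy as np

# A toy "Netflix-style" matrix: rows are viewers, columns are movies.
# np.nan marks the missing entries a matrix completion method must predict.
ratings = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [np.nan, 1.0, 5.0, 4.0],
    [1.0, 1.0, 4.0, np.nan],
])

observed = ~np.isnan(ratings)  # the observed entries serve as training data
print(f"{observed.sum()} observed entries, {(~observed).sum()} to predict")
```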

“There are many exciting new opportunities to use this method,” said study first author Adit Radhakrishnan, one of 15 Schmidt Center fellows and a graduate student at MIT. “With the speed and pace of data generation in research today, we need a method that will scale with the data. Our framework provides a robust approach for analyzing larger and larger datasets that can be tailored to pursue a range of questions in biomedicine and beyond.”

The study is the first scientific publication from the Eric and Wendy Schmidt Center, which was launched at the Broad in 2021 to enable a new field of interdisciplinary research at the intersection of data science and life science, aimed at improving human health. The Schmidt Center is co-directed by the study’s senior author and Broad core institute member Caroline Uhler, the Henry L. and Grace Doherty Associate Professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society at MIT.

“One of our goals in the Eric and Wendy Schmidt Center is to not only bring machine learning to bear on medical and biological challenges, but to also have new problems in the biomedical sciences, such as virtual drug screening, motivate foundational developments in machine learning,” said Uhler.

To infinity

In recent years, machine learning experts have tried building larger and larger neural networks to improve their predictive abilities. They stretched the networks’ width by adding millions or even billions of mathematical functions, known as artificial neurons. The massive networks were proficient at tasks like recognizing speech or distinguishing images of cats from those of dogs, but their huge size made them unwieldy and very expensive to run. In addition, it was unclear exactly how they accomplished those tasks.

Looking to better understand how these models work, researchers in the machine learning field tested the limits of neural networks by stretching the width as far as possible, towards an infinite number of mathematical functions in each hidden layer. “In the neural analogy, infinite width neural networks are like brains with endless numbers of neurons, giving them huge computing capacity,” said Radhakrishnan.

Machine learning scientists were surprised to find that as the width of a neural network approaches infinity, it begins to behave like older and simpler machine learning models known as kernel machines, which are easier to use and don’t require advanced computing resources.
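
To illustrate the idea, the sketch below shows a minimal kernel machine, kernel ridge regression, in Python. It is a generic illustration rather than the authors' implementation, and it uses a standard Gaussian kernel as a stand-in for the neural tangent kernel, whose exact form depends on the network architecture. The key point is that training reduces to solving a single linear system over the observed data.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel; a neural tangent kernel would replace this."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit_predict(X_train, y_train, X_test, reg=1e-3):
    """Kernel ridge regression: solve (K + reg*I) alpha = y once,
    then predict using the kernel between test and training points."""
    K = rbf_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + reg * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train) @ alpha

# Toy usage: learn y = sin(x) from a handful of points.
X_train = np.linspace(0, 3, 20)[:, None]
y_train = np.sin(X_train).ravel()
X_test = np.array([[1.5], [2.5]])
print(kernel_ridge_fit_predict(X_train, y_train, X_test))  # ~ sin(1.5), sin(2.5)
```

Because the entire fit is one linear solve rather than an iterative training loop, models like this run in seconds on ordinary hardware, which helps explain why the kernel view of infinitely wide networks is so practical.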

Building on this finding, Uhler’s team realized that the corresponding kernels, known as neural tangent kernels, held promise for performing matrix completion more easily and flexibly in challenging applications such as virtual drug screening. Successful predictions of a small molecule’s impact on cell function could improve the efficiency of drug discovery by revealing the most promising drug candidates to study and the cell types they are likely to work in. Such a system could then be modified to answer a wide variety of thorny data-driven questions, in addition to easier ones like the Netflix problem.

“We first asked why the traditional methods weren't working on these complex biological problems,” said Radhakrishnan. “We then wanted to build a simpler, more general framework for addressing this, which could then be applied to lots of different systems.”

In virtual drug screening, scientists start with data on the effects of a set of small molecules or genetic perturbations on different cell types. Then they try to predict, for example, the effects of a new small molecule on those cell types. For these tasks, an entire column or row of data in the matrix is missing. Because of the way traditional neural networks learn patterns from the given data, they struggle to make realistic predictions and accurately complete the matrix when entire rows or columns are missing.

Models based on infinite width neural networks had never been applied to matrix completion problems, so in their study, Uhler, Radhakrishnan, and their colleagues devised a way to apply neural tangent kernels, as easily implemented proxies for infinite width neural networks, to these kinds of problems. They discovered that when they gave their framework additional information, or metadata, about the rows and columns (for example, the molecular structure of a small molecule), it took into account known similarities among rows or columns to make better predictions. In this way, the framework might predict that an untested drug candidate that is structurally similar to a tested one already in the dataset might impact cell function in the same way as the tested molecule.
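
The sketch below illustrates that idea in Python with made-up numbers. It is a simplified stand-in for the paper's method: kernel regression over hypothetical compound feature vectors, so that an untested compound's entirely missing row is predicted from its similarity to tested compounds.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

# Rows: tested compounds, each with a (hypothetical) feature vector such as
# a structural fingerprint. Columns: cell types. All rows below are observed.
drug_features = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]])
effects = np.array([[ 1.2, -0.3],     # effect of each drug on each cell type
                    [-0.8,  0.5],
                    [-0.7,  0.4]])

# A new, untested compound: its entire row of the matrix is missing.
new_drug = np.array([[0.95, 0.05]])

# Kernel regression over drug features: the predicted row blends the tested
# drugs' effect profiles according to kernel similarity between features, so
# a compound structurally close to drugs 2 and 3 inherits a similar row.
K = rbf_kernel(drug_features, drug_features)
alpha = np.linalg.solve(K + 1e-3 * np.eye(3), effects)
print(rbf_kernel(new_drug, drug_features) @ alpha)  # predicted missing row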

Kernels for drug screening

To try out their approach on the virtual drug screening task, the team turned to data from the Connectivity Map (CMap), a Broad-led effort that surveyed the effects of tens of thousands of compounds on dozens of cell types. Their framework performed as well as other computationally intensive neural networks at estimating missing entries in the CMap data, while using far less computing resources.

The team next used the framework to fill in missing areas of digital images and showed that it again performed as well as other methods.

To demonstrate the flexibility of their framework, the scientists devised various ways to incorporate other kinds of metadata into its analysis, which can guide future uses of the approach across a variety of machine learning tasks. They also shared the code necessary to run the analyses, which can be done on standard computing hardware like a laptop.

“Traditionally, efforts to develop new theory on the machine learning side have been motivated by things like online advertising and recommender systems,” said Uhler. “We’re excited to see how these new tools can be applied to more important problems for humanity, such as drug discovery, that come out of the biomedical sciences.”

This work is funded in part by the National Science Foundation, the Office of Naval Research, the Eric and Wendy Schmidt Center, and the Simons Foundation.

Paper(s) cited

Radhakrishnan A, et al. Simple, Fast, and Flexible Framework for Matrix Completion with Infinite Width Neural Networks. PNAS. April 11, 2022. DOI: 10.1073/pnas.2115064119.