Towards Meaningful Pretrained Models for Biology
Microsoft Research New England
Modern biological experiments and curation efforts have amassed tremendous amounts of data across domains. These large datasets now drive efforts to build large pretrained models, in which researchers expose deep learning models to large quantities of unlabeled data, aiming to give them a foundational knowledge of biology that transfers rapidly to useful analyses. While pretrained models promise to democratize the benefits of deep learning, current models are not guaranteed to provide any meaningful signal for analyses, and in some cases they worsen a biologist’s ability to resolve signal. This limits the usefulness of these models, especially in exploratory analysis and hypothesis discovery applications, where there may not be enough prior annotations to benchmark models empirically.
In this talk, I argue that to overcome these challenges, we must make pretrained models more meaningful. I demonstrate methods for training meaningful models, including methods that closely align pretraining tasks with the desired signal for downstream analyses and methods that define inductive biases and constraints consistent with biological prior knowledge. While these models can still digest large quantities of unlabeled data for pretraining, they also provide explainable principles of what is learned, enabling biologists to reason about whether the models are applicable to their analyses.