Multimodal protein language models for deciphering protein function
Owen Queen
Research Associate, Harvard Medical School
Yepeng Huang
PhD Student, Harvard Medical School
Marinka Zitnik
Assistant Professor of Biomedical Informatics, Harvard Medical School
Understanding the relationship between a protein's amino acid sequence and its structure or function is a long-standing challenge with far-reaching implications for therapeutic development, since drug effects are often mediated directly by proteins. Current protein language models (PLMs) capture evolutionary relationships among sequences but fall short of directly capturing protein function from multimodal molecular data, including protein sequences and structures, peptides, and domains. We develop a multimodal protein language model that integrates textual protein descriptions with a sequence-structure PLM to create a more comprehensive and functionally insightful model of proteins. This integration promises to bridge the current gap in PLMs, moving from an understanding of protein structure toward a functional view of the vast protein space. The model lets scientists express queries in natural language and interact with protein models in an open-ended manner. It supports text-based prediction of protein targets, multimodal protein captioning, and question answering for scientists with varying levels of expertise, among other tasks. Trained on a new dataset of protein-text instructions, the model generalizes to new phenotypes in a zero-shot manner, making it versatile for diverse tasks even when functional annotations are scarce. We conclude with an outlook on “AI scientists”: generative agents capable of skeptical learning and reasoning that empower biomedical research.
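As a rough illustration of the kind of multimodal integration described above, one common pattern is to project frozen protein embeddings from a PLM into the token-embedding space of a text language model, so that a protein can be "prepended" to a natural-language query. The sketch below is purely schematic: the dimensions, variable names, and random placeholder values are assumptions for illustration, not details of the model presented in this talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not the actual model's sizes).
d_protein = 1280   # e.g., the embedding size of a sequence-structure PLM
d_text = 4096      # hidden size of a text language model

# Placeholder for a real PLM embedding of one protein.
protein_emb = rng.standard_normal(d_protein)

# A learned linear projection maps the protein embedding into the text
# model's token-embedding space; here it is random for illustration.
W_proj = rng.standard_normal((d_text, d_protein)) / np.sqrt(d_protein)
protein_token = W_proj @ protein_emb

# Placeholder token embeddings for a 12-token natural-language query.
query_tokens = rng.standard_normal((12, d_text))

# The fused sequence the text model would consume: one protein "token"
# followed by the query tokens.
fused_input = np.vstack([protein_token[None, :], query_tokens])
print(fused_input.shape)  # (13, 4096)
```

In practice the projection would be trained jointly with (or adapted to) the text model on protein-text instruction data, while the protein encoder may stay frozen; this sketch only shows the shape-level wiring.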