Protein language models learn evolutionary statistics of interacting sequence motifs

Sergey Ovchinnikov

MIT

Protein language models (pLMs) have exhibited remarkable capabilities in protein structure prediction and design. However, the extent to which they comprehend the intrinsic biophysics of protein structures remains uncertain. We present a suite of analyses that dissect how the flagship pLM ESM-2 predicts structure. Motivated by a consistent error in which protein isoforms are predicted as structured fragments, we developed a completely unsupervised method to uniformly evaluate any pLM, allowing us to compare the coevolutionary statistics it captures against those of linear models. We further found that ESM-2 does not require the full sequence context to predict inter-residue contacts. Our study highlights the current limitations of pLMs and contributes to a deeper understanding of their underlying mechanisms, paving the way for more reliable protein structure predictions.
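To make the flavor of this kind of unsupervised probing concrete, the sketch below mutates each position of a query sequence through all 20 amino acids, records how a masked pLM's logits shift at every other position, and reduces the resulting Jacobian-like tensor to an L x L coupling map using the average-product correction standard in coevolution analysis. This is a minimal illustration under stated assumptions (a small HuggingFace ESM-2 checkpoint, an arbitrary example sequence, a Frobenius-norm readout), not the exact evaluation method presented in the talk.

```python
# Minimal sketch: probe a masked pLM for pairwise couplings by measuring
# how logits everywhere respond to substitutions at each position.
# Assumptions: HuggingFace `transformers` with the ESM-2 checkpoint below;
# the sequence is an arbitrary example.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

MODEL = "facebook/esm2_t12_35M_UR50D"  # small checkpoint, for illustration
tok = AutoTokenizer.from_pretrained(MODEL)
model = EsmForMaskedLM.from_pretrained(MODEL).eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example query sequence
ids = tok(seq, return_tensors="pt")["input_ids"]  # [1, L+2], CLS/EOS added
L = len(seq)
aa_ids = torch.tensor(tok.convert_tokens_to_ids(list("ACDEFGHIKLMNPQRSTVWY")))

with torch.no_grad():
    # Wild-type logits over the 20 standard amino acids: [L, 20]
    ref = model(ids).logits[0, 1:L + 1, :][:, aa_ids]
    J = torch.zeros(L, 20, L, 20)
    for i in range(L):
        batch = ids.repeat(20, 1)
        batch[:, i + 1] = aa_ids  # all 20 substitutions at position i
        out = model(batch).logits[:, 1:L + 1, :][:, :, aa_ids]  # [20, L, 20]
        J[i] = out - ref  # logit shift at every other position

# Frobenius norm over substitution/readout channels -> [L, L] coupling map
C = J.pow(2).sum(dim=(1, 3)).sqrt()
C = (C + C.T) / 2
# Average-product correction (APC), as used in coevolutionary analysis
C = C - C.mean(0, keepdim=True) * C.mean(1, keepdim=True) / C.mean()
# Large off-diagonal entries of C flag candidate inter-residue contacts
```

The resulting map can be compared directly against couplings inferred by linear (Potts-style) coevolutionary models on the same family, which is the spirit of the comparison described in the abstract.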
