Combining protein language and structure models to redesign the E. coli proteome with a reduced amino acid alphabet
Simon Kozlov
MIT
All known organisms need all 20 canonical amino acids to survive and reproduce, yet the use of the different amino acids must have evolved over time. Unfortunately, even the last universal common ancestor used all 20, so there is no way to observe what life might have looked like with fewer.
Here, we’re trying to synthesize such an organism by designing a strain of E. coli that uses only 19 canonical amino acids, redesigning essential genes one at a time with the goal of preserving fitness. Since there are many diverse genes to design, we’re repurposing and extending machine learning-driven protein design methods for this task. Each currently available method captures specific properties of the design landscape, determined by its approach and its training data. Protein language models like ESM have access to a vast number of sequences and can learn patterns favored by nature. Methods that use AlphaFold as a loss function, like AFDesign or MCMC hallucination, are aware of the final structure of the protein but are susceptible to adversarial examples. Finally, models that perform “inverse folding”, like ProteinMPNN, have both sequence and structure information and can output structure-aware designs. We’re developing optimization methods that combine these models in the design process to find sequences that score highly under models built on different approaches, and we’re examining how those scores translate to biological properties. Experimental results show that our methods can generate designs with fitness comparable to the wild type within a small number of attempts.
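To make the idea of combining scores from several models concrete, here is a minimal Python sketch of one way such an optimization could look. It is not the method used in this work: the abstract does not say which amino acid is excluded, how the scores are combined, or which optimizer is used. The weighted sum, the Metropolis-style search, and the names `redesign`, `combined_score`, and `excluded` are all assumptions made for illustration. The scorers are toy stand-ins; in practice each would wrap a real model such as a language model likelihood or a structure-based score.

```python
import math
import random

CANONICAL = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def reduced_alphabet(excluded):
    """Return the 19 letters left after dropping one amino acid."""
    return [aa for aa in CANONICAL if aa != excluded]


def combined_score(seq, scorers, weights):
    """Weighted sum of per-model scores (an assumed way to combine them)."""
    return sum(w * scorer(seq) for scorer, w in zip(scorers, weights))


def redesign(wild_type, excluded, scorers, weights,
             steps=2000, temperature=0.1, seed=0):
    """Metropolis-style search over sequences that avoid `excluded`.

    Hypothetical sketch: each scorer maps a sequence string to a float,
    higher meaning better.
    """
    rng = random.Random(seed)
    alphabet = reduced_alphabet(excluded)

    # Start from the wild type, replacing each excluded residue at random.
    current = [aa if aa != excluded else rng.choice(alphabet)
               for aa in wild_type]
    current_score = combined_score("".join(current), scorers, weights)
    best, best_score = list(current), current_score

    for _ in range(steps):
        pos = rng.randrange(len(current))
        new_aa = rng.choice(alphabet)
        if new_aa == current[pos]:
            continue
        proposal = list(current)
        proposal[pos] = new_aa
        score = combined_score("".join(proposal), scorers, weights)
        delta = score - current_score
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the score drop grows.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, score
            if score > best_score:
                best, best_score = list(current), score

    return "".join(best), best_score


if __name__ == "__main__":
    wt = "MKTLLILAVLAALSGC"  # made-up sequence, not a real gene product

    def identity_to_wt(seq):
        # Toy stand-in for a model score: fraction of wild-type positions kept.
        return sum(a == b for a, b in zip(seq, wt)) / len(wt)

    def small_residue_fraction(seq):
        # Second toy scorer, included only to show scores being combined.
        return sum(aa in "AGS" for aa in seq) / len(seq)

    design, score = redesign(wt, "L",
                             [identity_to_wt, small_residue_fraction],
                             [1.0, 0.5])
    print(design, round(score, 3))
```

Because candidates are drawn only from the reduced alphabet, every proposed sequence satisfies the 19-letter constraint by construction. The weights set how much each model's opinion counts, which is one simple way to trade off the different properties the models capture.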