Primer: Generative models from NLP for sequence data
Abstract: Generative models are powerful tools for capturing functional constraints within families of biological sequences. Autoregressive models, developed in natural language processing and related fields, provide a useful approach to modeling sequence data without imposing a rigid alignment structure on the data. In this primer, we will review the math and intuition behind these models, survey advancements in model parameterization, and compare strategies for sampling from the models to generate new sequences. Finally, we will discuss important considerations when applying these models to biological data.
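The core idea the primer reviews is that an autoregressive model factors the probability of a sequence as p(x) = ∏_t p(x_t | x_<t), so it can score and generate sequences of any length without an alignment. Below is a minimal, hedged sketch in Python: the alphabet, the uniform conditional, and the first-order ("previous symbol only") structure are all invented for illustration; a real model would parameterize the conditionals with a neural network.

```python
import math
import random

# Toy alphabet plus an end-of-sequence token. These are placeholders;
# a trained model would use the full residue vocabulary.
ALPHABET = ["A", "C", "G", "T", "$"]  # "$" marks end of sequence

def cond_prob(prev, nxt):
    """P(next symbol | previous symbol). Uniform here for illustration;
    an RNN or Transformer would compute this from the full prefix."""
    return 1.0 / len(ALPHABET)

def log_likelihood(seq):
    """log p(x) = sum over positions t of log p(x_t | x_{<t}).
    The end token is scored too, so sequence length is modeled."""
    total, prev = 0.0, "^"  # "^" is a start-of-sequence token
    for ch in list(seq) + ["$"]:
        total += math.log(cond_prob(prev, ch))
        prev = ch
    return total

def sample(temperature=1.0, max_len=20):
    """Ancestral sampling: draw each symbol left to right from the
    temperature-scaled conditional until the end token appears."""
    seq, prev = [], "^"
    for _ in range(max_len):
        weights = [cond_prob(prev, c) ** (1.0 / temperature) for c in ALPHABET]
        nxt = random.choices(ALPHABET, weights=weights)[0]
        if nxt == "$":
            break
        seq.append(nxt)
        prev = nxt
    return "".join(seq)
```

Raising the temperature flattens the conditionals (more diverse samples); lowering it sharpens them toward the model's most likely sequences.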
HMS Systems Biology
Alignment-free models for protein and antibody design
Abstract: I will describe a set of machine learning methods for protein design, with a focus on accelerating antibody discovery for specificity and affinity. I will also motivate broader applications of these methods in genomics.
Antibodies and nanobodies are highly valued molecular tools, used in research for isolating and imaging specific proteins, and in medical applications as therapeutics. However, for a large number of human and model-organism proteins, antibodies are either unavailable or unreliable. Emerging experimental techniques enable orders-of-magnitude improvements in the number of sequences assayed for target affinity, but the resulting binders are notoriously non-specific and not always well-folded. We have explored the use of generative probabilistic models for this design challenge. We found that the high heterogeneity in antibody sequence length poses a fundamental problem for existing methods, and instead exploited model architectures from natural language processing to develop "alignment-free" predictions. We also developed strategies for designing highly diversified libraries based on these models. Finally, we trained these models not just on assayed sequences but on standing evolutionary diversity, taking full advantage of the experiments already done by nature. Small pilot studies were successful, and we have now generated a library with hundreds of thousands of sequences, which is being evaluated experimentally by our collaborators.
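One generic way a generative sequence model can be used to propose a diversified library is to draw many candidates and keep only the unique ones. The sketch below is illustrative only: the uniform per-residue sampler is a stand-in for a trained alignment-free model (which would emit residues until an end token, naturally handling variable lengths), and the length bounds are invented.

```python
import random

# Standard 20-amino-acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_sequence(rng, min_len=8, max_len=16):
    """Stand-in for sampling from a trained autoregressive model.
    A real model would draw each residue from learned conditionals;
    here we draw residues and a length uniformly for illustration."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def build_library(n_target, seed=0):
    """Draw candidates until n_target unique sequences are collected.
    Deduplicating into a set keeps the library diversified."""
    rng = random.Random(seed)
    library = set()
    while len(library) < n_target:
        library.add(sample_sequence(rng))
    return sorted(library)

lib = build_library(100)
```

In practice the dedup-and-filter loop would also apply model likelihood or predicted-affinity thresholds before a candidate enters the library.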