I will describe a set of machine learning methods for protein design, with a focus on accelerating the discovery of antibodies with high specificity and affinity. I will also motivate the use of these methods for broader applications in genomics.
Antibodies and nanobodies are highly valued molecular tools, used in research for isolating and imaging specific proteins, and in medical applications as therapeutics. However, for a large number of human and model-organism proteins, antibodies are either unavailable or unreliable. Emerging experimental techniques increase the number of sequences that can be assayed for target affinity by orders of magnitude, but the candidate sequences are notoriously non-specific and not always well-folded. We have explored the use of generative probabilistic models for this design challenge. We found that the high heterogeneity in antibody sequence length poses a fundamental problem for existing alignment-based methods, so we instead adapted model architectures from natural language processing to make "alignment-free" predictions. We also developed strategies for designing highly diversified libraries based on these models. Finally, we trained these models not just on assayed sequences but also on standing evolutionary diversity, taking full advantage of the experiments already done by nature. Small pilot studies were successful, and we have now generated a library of hundreds of thousands of sequences, which our collaborators are evaluating experimentally.