Generative models of proteins and genomes; Polya trees

Eli Weinstein
Marks Lab, Harvard Medical School
Meeting: Building and evaluating generative models of biological sequences, from proteins to whole genomes

Across biology and biomedicine, scientists are interested in measuring sequences, predicting sequences, and testing their predictions experimentally by synthesizing or editing sequences. Generative probabilistic modeling offers a flexible and rigorous framework for learning from sequence data and forming predictions, but building, inferring and critiquing probabilistic models of biological sequences remains challenging. In this talk we outline the major practical and theoretical limitations of existing techniques and propose alternatives. We first describe a structured output distribution for protein data, the “MuE” distribution, that enables the creation of regression models, forecasting models, latent feature models and more; models built with the MuE do not require alignments for training and meet key theoretical conditions. Second, we describe a new generative model that can be scaled to whole genomes, the “BEAR” model, and use it to construct a nonparametric density estimator, robust parameter estimators, a goodness-of-fit test, and a two-sample test, each with consistency guarantees. We illustrate the applications of these methods on a range of biological problems including characterizing immune receptor repertoires, mapping disordered protein families, comparing metagenomic samples, exploring unaligned read data, and forecasting pathogen evolution.

Alan Amin
Marks Lab, Harvard Medical School
Primer: Estimation and testing with generative nonparametric Bayesian models

In this primer, we review key statistical ideas that are fundamental to the analysis of continuous low-dimensional data but have yet to be successfully extended to large-scale biological sequence data. In particular, we introduce and motivate nonparametric density estimation, goodness-of-fit testing, and two-sample testing; we then illustrate how each of these challenges may be addressed for continuous low-dimensional data using methods based on the Bayesian Polya tree model. Finally, we describe the theoretical guarantees available for each application, focusing on asymptotic consistency results. These ideas lay the foundation for the BEAR sequence model, introduced in the main talk, which we show can address the same challenges in the context of biological sequence data.
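For readers unfamiliar with Polya trees, the following is a minimal sketch of nonparametric density estimation on the unit interval, not the speakers' implementation. It assumes a dyadic partition of [0, 1), a symmetric Beta(alpha_j, alpha_j) prior on each left/right split with alpha_j = c * j^2 at level j, and truncation at a finite depth; the function name `polya_tree_density` and the parameters `depth` and `c` are hypothetical choices made for illustration.

```python
import numpy as np

def polya_tree_density(x, data, depth=8, c=1.0):
    """Posterior predictive density at x in [0, 1) under a truncated Polya tree.

    Assumptions (illustrative only): dyadic bins, Beta(alpha_j, alpha_j)
    priors on each split with alpha_j = c * j**2, uniform base measure.
    """
    data = np.asarray(data, dtype=float)
    density = 1.0
    for level in range(1, depth + 1):
        n_bins = 2 ** level
        alpha = c * level ** 2
        # Index of the level-`level` bin containing x, and of its parent bin.
        child = min(int(x * n_bins), n_bins - 1)
        parent = child // 2
        # Count observations falling in the child bin and in the parent bin.
        bins = np.minimum((data * n_bins).astype(int), n_bins - 1)
        n_child = np.sum(bins == child)
        n_parent = np.sum(bins // 2 == parent)
        # Posterior mean of the branch probability, times 2 because the
        # bin width halves at every level.
        density *= 2.0 * (alpha + n_child) / (2.0 * alpha + n_parent)
    return density

# Usage: estimate a density from samples concentrated near 0.2.
rng = np.random.default_rng(0)
samples = rng.beta(2, 8, size=200)
print(polya_tree_density(0.2, samples), polya_tree_density(0.9, samples))
```

With no data the estimate reduces to the uniform base density, and as observations accumulate the counts dominate the prior pseudo-counts in the bins where data fall. This behaviour is the intuition behind the consistency results mentioned above.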