Marks Lab, Harvard Medical School

In this primer, we review key statistical ideas that are fundamental to the analysis of continuous low-dimensional data but have yet to be successfully extended to large-scale biological sequence data. In particular, we introduce and motivate nonparametric density estimation, goodness-of-fit testing, and two-sample testing; we then illustrate how each of these challenges may be addressed for continuous low-dimensional data using methods based on the Bayesian Pólya tree model. Finally, we describe the theoretical guarantees available for each application, focusing on asymptotic consistency results. These ideas lay the foundation for the BEAR sequence model, introduced in the main talk, which we show can address the same challenges in the context of biological sequence data.
