A scalable Bayesian model for copy number variation / Bayesian PCA

Mehrtash Babadi
Data Sciences & Data Engineering, Broad
A scalable Bayesian framework for inferring copy number variation

Abstract:  Inferring copy number variation (CNV) from next-generation sequencing (NGS) data is a challenging problem. On the one hand, the complexity of NGS technology results in highly non-uniform sampling of the genome, shaped by unknown latent factors. On the other hand, devising and implementing modern machine learning algorithms for CNV inference in a scalable and robust fashion is an arduous task due to the sheer size of the data. In this talk, we briefly review existing approaches and touch on a number of their shortcomings, including difficulty with sex chromosomes, the lack of a data-driven model for determining the number of latent bias factors, neglect of sampling noise, reliance on heuristic filtering and outlier detection, and lack of self-consistency and scalability. Next, we introduce GATK gCNV, our principled and scalable Bayesian framework for germline CNV inference from whole-exome sequencing (WES) and whole-genome sequencing (WGS) data, which addresses these shortcomings. We benchmark GATK gCNV, XHMM, and CODEX on WES data against high-confidence Genome STRiP calls on matched WGS data as ground truth, and show that GATK gCNV yields up to 30 percent higher sensitivity and specificity than existing tools. We conclude with a brief discussion of our ongoing efforts to handle common and large CNV events and to generalize the framework to somatic CNV inference.
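To make the bias problem concrete, here is a minimal toy sketch of the kind of latent-factor read-count model alluded to above. It is a hypothetical illustration under our own assumptions (all names, dimensions, and parameter values are invented for the example), not the actual GATK gCNV model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model of targeted read counts with multiplicative bias
# from unknown latent factors; NOT the GATK gCNV model.
n_samples, n_targets, n_factors = 100, 500, 3

W = rng.normal(0.0, 0.1, size=(n_targets, n_factors))  # target-specific factor weights
z = rng.normal(0.0, 1.0, size=(n_samples, n_factors))  # per-sample factor activations
m = rng.normal(5.0, 0.5, size=n_targets)               # baseline log read depth per target

# Integer copy-number states (2 = diploid) with rare deletions/duplications.
c = rng.choice([1, 2, 3], p=[0.01, 0.98, 0.01], size=(n_samples, n_targets))

# Poisson sampling noise on top of multiplicative bias and copy-number signal.
log_rate = m[None, :] + z @ W.T + np.log(c / 2.0)
counts = rng.poisson(np.exp(log_rate))
print(counts.shape, counts.mean())
```

Inference then amounts to recovering the copy-number states c while jointly learning the nuisance quantities W, z, and m from the observed counts, rather than removing them with heuristic filters.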


Samuel Lee
Data Sciences & Data Engineering, Broad
Primer: Bayesian PCA

Abstract:  The model at the heart of GATK gCNV builds heavily on the probabilistic and Bayesian approaches to principal component analysis (PCA). In contrast with traditional PCA, the probabilistic approach provides a predictive model that can account for missing data, while the fully Bayesian approach further enables a principled way to learn the effective dimensionality of the principal subspace (i.e., the appropriate number of principal components to use). Inference in these two models can be performed using expectation-maximization and variational-Bayesian methods, respectively. We will give a pedagogical overview of these methods, drawing analogies between Bayesian PCA and the perhaps more familiar Gaussian mixture model.
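As a concrete reference point for the primer, here is a compact sketch of the closed-form EM iteration for probabilistic PCA (Tipping & Bishop, 1999). The function name and toy dimensions are ours, and this is an illustrative sketch rather than the talk's actual material:

```python
import numpy as np

def ppca_em(X, q, n_iter=200, seed=0):
    """EM for probabilistic PCA: X ~ N(Z @ W.T + mu, sigma2 * I)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.normal(size=(d, q))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent coordinates z_n.
        M = W.T @ W + sigma2 * np.eye(q)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                    # E[z_n], one row per sample
        Ezz = N * sigma2 * Minv + Ez.T @ Ez   # sum_n E[z_n z_n^T]
        # M-step: closed-form updates for the loadings and noise variance.
        W = Xc.T @ Ez @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc**2)
                  - 2.0 * np.sum((Xc @ W) * Ez)
                  + np.trace(Ezz @ W.T @ W)) / (N * d)
    return W, mu, sigma2

# Toy usage: recover a 2-dimensional principal subspace from noisy data.
rng = np.random.default_rng(1)
Z_true = rng.normal(size=(500, 2))
W_true = rng.normal(size=(10, 2))
X = Z_true @ W_true.T + 0.1 * rng.normal(size=(500, 10))
W_hat, mu_hat, sigma2_hat = ppca_em(X, q=2)
print(sigma2_hat)  # should approach the true noise variance, 0.01
```

In the fully Bayesian variant (Bishop, 1999), a per-column precision (ARD) prior on W lets variational inference drive unneeded columns toward zero, which is how the effective dimensionality of the principal subspace can be learned from the data.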