Computational Epigenomics

We develop algorithms and create robust pipelines and infrastructure that enable our large-scale data production efforts. Our computational work spans three key areas.

Source: Shoresh N, Wong B. Points of view: Data exploration. Nat Methods. 2012.

New computational tools

Noam Shoresh (Broad) and Martin Aryee (MGH) nucleate a core group of computational biologists working to develop analysis and visualization tools for ChIP-based assays, bisulfite-sequencing assays, and genome topology assays. Some examples include:

Epigenomic annotations for interpreting GWAS: Examining GWAS SNPs in the context of cell-type-specific epigenetic maps can identify the cell types relevant to certain genetic diseases. At present, this approach mostly resolves only broad classes of cell types or tissues, and in some cases the evidence is inconclusive even at that broader, system level. We are exploring new approaches to this problem, including ways to generate epigenomic annotations without peak calling, a step that is sometimes unstable.
Single locus lookup: Researchers often wish to understand what is known about a specific genomic locus. We are developing a tool that takes genomic coordinates as input, interrogates a large number of chromatin maps, and visualizes the data from those contexts that show activity. Similar efforts are underway as part of the T2D portal.
Genome topology analyses: This work includes calling differential loops.
Chromatin accessibility analysis: See chromVAR from Jason Buenrostro.
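The single locus lookup described above can be illustrated with a minimal sketch. It is not the tool under development. It assumes that each chromatin map has already been reduced to a list of intervals with a signal value, that coordinates are 0-based and half-open, and that "activity" means at least one overlapping interval above a threshold. The names Interval, lookup_locus, and min_signal, as well as the toy tracks, are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int     # 0-based, inclusive (assumed convention)
    end: int       # 0-based, exclusive (assumed convention)
    signal: float  # enrichment value for this interval


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def lookup_locus(
    tracks: Dict[str, List[Interval]],
    chrom: str,
    start: int,
    end: int,
    min_signal: float = 1.0,  # hypothetical activity threshold
) -> Dict[str, float]:
    """Return {track name: strongest overlapping signal} for active tracks."""
    active: Dict[str, float] = {}
    for name, intervals in tracks.items():
        hits = [
            iv.signal
            for iv in intervals
            if iv.chrom == chrom
            and overlaps(iv.start, iv.end, start, end)
            and iv.signal >= min_signal
        ]
        if hits:
            active[name] = max(hits)
    return active


# Toy data, invented for illustration only.
tracks = {
    "H3K27ac_liver": [Interval("chr1", 100, 200, 5.0), Interval("chr1", 500, 600, 2.0)],
    "H3K27ac_islet": [Interval("chr1", 150, 250, 0.5)],
    "H3K4me3_liver": [Interval("chr2", 100, 200, 9.0)],
}

result = lookup_locus(tracks, "chr1", 180, 220)
print(result)
```

A linear scan is used here only for readability. With a large number of chromatin maps, an indexed structure such as an interval tree or an indexed file format would likely be needed.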

Production pipelines, quality control and infrastructure

We work with the Data Sciences Platform (DSP) to develop and optimize automated analysis pipelines, database infrastructure, and quality control metrics for process quality assurance. This is essential for large-scale community projects in which we participate, such as ENCODE.

Our analysis pipelines include several in-house quality control (QC) methods, including:

Random forest classifier: We trained a random forest classifier on hundreds of manually validated ChIP-Seq datasets for various histone modifications to classify tracks by epitope. In addition to catching label swaps, the classifier provides a quality signal: low classification confidence is strongly associated with lower data quality.
Genotyping QC: We use the sequence information in ChIP-Seq data to test whether two samples were derived from the same donor. Because chromatin maps are very uneven, two datasets may share only a small region of overlapping enrichment. We leverage linkage disequilibrium structure to address this.
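To make the genotyping check concrete, the sketch below shows a deliberately naive version: direct genotype concordance at sites covered in both samples. It is not the in-house method and does not include the linkage disequilibrium step. It assumes genotypes have already been called and normalized to strings such as "AG". The function names, the thresholds min_sites and min_concordance, and the toy samples are hypothetical.

```python
from typing import Dict, Optional, Tuple

Site = Tuple[str, int]      # (chromosome, position)
Genotypes = Dict[Site, str]  # e.g. {("chr1", 100): "AG"}


def genotype_concordance(
    calls_a: Genotypes, calls_b: Genotypes
) -> Tuple[int, Optional[float]]:
    """Return (number of shared sites, fraction of shared sites that match)."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return 0, None
    matches = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return len(shared), matches / len(shared)


def same_donor(
    calls_a: Genotypes,
    calls_b: Genotypes,
    min_sites: int = 3,            # hypothetical minimum evidence
    min_concordance: float = 0.9,  # hypothetical match threshold
) -> Optional[bool]:
    """True/False when enough sites are shared, None when the call is undetermined."""
    n_shared, concordance = genotype_concordance(calls_a, calls_b)
    if n_shared < min_sites or concordance is None:
        return None  # too little overlapping coverage to decide
    return concordance >= min_concordance


# Toy genotype calls, invented for illustration only.
sample_a = {("chr1", 100): "AA", ("chr1", 200): "AG", ("chr2", 50): "GG", ("chr3", 10): "CT"}
sample_b = {("chr1", 100): "AA", ("chr1", 200): "AG", ("chr2", 50): "GG", ("chr5", 7): "TT"}
sample_c = {("chr1", 100): "AG", ("chr1", 200): "GG", ("chr2", 50): "GG"}
sample_d = {("chr3", 10): "CT"}

print(same_donor(sample_a, sample_b))
print(same_donor(sample_a, sample_c))
print(same_donor(sample_a, sample_d))
```

The undetermined outcome for sparse overlap is the weakness of this naive comparison. Two uneven chromatin maps can easily share too few covered sites, which is the situation the linkage disequilibrium structure is used to address.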