We develop algorithms and build robust pipelines and infrastructure that enable our large-scale data production efforts. Our computational efforts are spread across three key areas.
New computational tools
Noam Shoresh (Broad) and Martin Aryee (MGH) nucleate a core group of computational biologists working to develop analysis and visualization tools for ChIP-based assays, bisulfite-sequencing assays, and genome topology assays. Some examples include:
- Epigenomic annotations for interpreting GWAS: Examining GWAS SNPs in the context of cell-type-specific epigenetic maps can identify the cell types relevant to certain genetic diseases. So far, this approach mostly resolves broad classes of cell types or tissues, and in some cases the evidence is inconclusive even at that broader level. We are exploring new ways of approaching this problem, including ways to generate epigenomic annotations without the sometimes unstable step of peak calling.
- Single locus lookup: Researchers often wish to understand what is known about a specific genomic locus. We are developing a tool that, given genomic coordinates, interrogates a large number of chromatin maps and visualizes the data from those contexts that show activity. Similar efforts are underway as part of the T2D portal.
- Genome topology analyses: These include calling differential loops.
- Chromatin accessibility analysis: See chromVAR from Jason Buenrostro.
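The GWAS annotation idea in the first item above can be illustrated with a small sketch. It is not the group's method: it assumes annotations are already available as discrete intervals (the very step the text says can be unstable), and it uses a plain hypergeometric test as a stand-in for whatever statistic is actually used. The cell type names, coordinates, and function names are all hypothetical.

```python
from math import comb


def overlaps(position, intervals):
    """Return True if a position falls inside any (start, end) interval."""
    return any(start <= position <= end for start, end in intervals)


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n SNPs from N, of which K lie in annotated regions."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


def rank_cell_types(gwas_snps, background_snps, annotations):
    """Rank cell types by enrichment of GWAS SNPs in their annotated regions.

    gwas_snps, background_snps: lists of (chrom, position); the GWAS SNPs
    are assumed to be a subset of the background set.
    annotations: {cell_type: {chrom: [(start, end), ...]}} (toy format).
    """
    results = []
    for cell_type, regions in annotations.items():
        def in_annotation(snp):
            chrom, pos = snp
            return overlaps(pos, regions.get(chrom, []))

        K = sum(in_annotation(s) for s in background_snps)  # annotated background SNPs
        k = sum(in_annotation(s) for s in gwas_snps)        # annotated GWAS SNPs
        p = hypergeom_sf(k, len(background_snps), K, len(gwas_snps))
        results.append((cell_type, k, K, p))
    return sorted(results, key=lambda r: r[3])


# Toy data: ten background SNPs on chr1, three of them GWAS hits.
background = [("chr1", pos) for pos in range(100, 1001, 100)]
gwas = [("chr1", 100), ("chr1", 200), ("chr1", 300)]
annotations = {
    "hypothetical_islet": {"chr1": [(50, 350)]},
    "hypothetical_liver": {"chr1": [(650, 1050)]},
}
ranked = rank_cell_types(gwas, background, annotations)
```

In this toy case all three GWAS SNPs fall in the first cell type's regions and none in the second's, so the first ranks at the top. A real analysis would also need to account for linkage disequilibrium between SNPs and for differences in annotation size across cell types, which this sketch ignores.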
Production pipelines, quality control and infrastructure
We work with the Data Sciences Platform (DSP) to develop and optimize automated analysis pipelines, database infrastructure, and quality control metrics for process quality assurance. This work is essential for the large-scale community projects in which we participate, such as ENCODE.
Our analysis pipelines incorporate several in-house quality control (QC) methods, including:
- Random forest classifier: We have used hundreds of manually validated ChIP-Seq datasets for various histone modifications to train a random forest classifier that classifies tracks by epitope. In addition to catching label swaps, low classification confidence turns out to be strongly associated with lower data quality.
- Genotyping QC: We use the sequence information in ChIP-Seq data to test whether two samples were derived from the same donor. Because chromatin maps are very uneven, two datasets may share only a small area of overlapping enrichment; we leverage linkage disequilibrium structure to address this.
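The core of the genotyping check can be sketched as a concordance score over variant sites covered in both samples. This is a simplified illustration under stated assumptions, not the in-house method: it compares only directly shared sites and does not use linkage disequilibrium, and the thresholds, data layout, and function names are hypothetical.

```python
def call_genotype(ref_reads, alt_reads, min_depth=4):
    """Crude genotype call from allele read counts.

    Returns 0 (hom-ref), 1 (het), 2 (hom-alt), or None if depth is too low.
    The depth and allele-fraction cutoffs are arbitrary illustrative values.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return None
    alt_fraction = alt_reads / depth
    if alt_fraction < 0.2:
        return 0
    if alt_fraction > 0.8:
        return 2
    return 1


def genotype_concordance(sample_a, sample_b, min_shared_sites=3):
    """Fraction of sites callable in both samples that have the same genotype.

    Each sample maps (chrom, position) -> (ref_reads, alt_reads).
    Returns (concordance, n_shared); concordance is None when too few
    sites are callable in both samples to say anything.
    """
    matches = 0
    shared = 0
    for site in sample_a.keys() & sample_b.keys():
        genotype_a = call_genotype(*sample_a[site])
        genotype_b = call_genotype(*sample_b[site])
        if genotype_a is None or genotype_b is None:
            continue
        shared += 1
        matches += genotype_a == genotype_b
    if shared < min_shared_sites:
        return None, shared
    return matches / shared, shared


# Toy allele counts at a handful of hypothetical SNP sites.
sample_a = {("chr1", 100): (10, 0), ("chr1", 200): (5, 5),
            ("chr2", 50): (0, 8), ("chr2", 900): (1, 1)}
sample_b = {("chr1", 100): (7, 0), ("chr1", 200): (4, 6),
            ("chr2", 50): (0, 9), ("chr3", 10): (6, 0)}
sample_c = {("chr1", 100): (0, 9), ("chr1", 200): (8, 0),
            ("chr2", 50): (0, 6)}

same_donor_score = genotype_concordance(sample_a, sample_b)
different_donor_score = genotype_concordance(sample_a, sample_c)
```

The `min_shared_sites` guard reflects the problem described above: when two chromatin maps overlap in only a few enriched regions, direct comparison has too little data, which is the gap that linkage disequilibrium information is meant to fill.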
A community cloud portal for epigenomics
Together with the DSP, we are building a cloud-based workbench for sharing epigenomic data and pipelines and for facilitating epigenomic analysis.
Tsai SQ, et al. Open-source guideseq software for analysis of GUIDE-seq data. Nat Biotechnol. 2016.
Ziller MJ, et al. Coverage recommendations for methylation analysis by whole-genome bisulfite sequencing. Nat Methods. 2015.
Aryee MJ, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014.
Landt SG, Marinov GK, Kundaje A, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012.
Shoresh N, Wong B. Points of view: Data exploration. Nat Methods. 2012.