Large-scale metagenomic sequence clustering on map-reduce clusters.

J Bioinform Comput Biol

Authors	Xiao Yang, Jaroslaw Zola, Srinivas Aluru
Keywords	Animals; Humans; Base Sequence; Cluster Analysis; Algorithms; Computer Simulation; Molecular Sequence Data; Chromosome Mapping; Sequence Analysis, DNA; Metagenomics; Models, Genetic; Metagenome
Abstract	Taxonomic clustering of species from millions of DNA fragments sequenced from their genomes is an important and frequently arising problem in metagenomics. In this paper, we present a parallel algorithm for taxonomic clustering of large metagenomic samples with support for overlapping clusters. We develop sketching techniques, akin to those created for web document clustering, to deduce significant similarities between pairs of sequences without resorting to expensive all vs. all comparison. We formulate the metagenomic classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud ready implementation. We show that the resulting framework can produce high quality clustering of metagenomic samples consisting of millions of reads, in reasonable time limits, when executed on a modest size cluster.
Year of Publication	2013
Journal	J Bioinform Comput Biol
Volume	11
Issue	1
Pages	1340001
Date Published	2013 Feb
ISSN	1757-6334
URL	http://www.worldscinet.com/doi/abs/10.1142/S0219720013400015?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%3dpubmed
DOI	10.1142/S0219720013400015
PubMed ID	23427983
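
The abstract describes sketching sequences so that significant pairwise similarities can be found without an all vs. all comparison. The abstract does not specify the sketching scheme, so the following is only a minimal illustration using generic min-hash sketches over k-mers; the function names (`kmers`, `sketch`, `candidate_pairs`, `estimated_similarity`), the values of `k` and `num_hashes`, and the choice of hash function are all assumptions made for this sketch, not details taken from the paper. The quasi-clique enumeration and map-reduce steps are not shown.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash of a k-mer (illustrative choice of hash family)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(seq, k=4, num_hashes=16):
    """Min-hash sketch: the minimum hash value of the k-mer set per seed."""
    ks = kmers(seq, k)
    if not ks:
        return ()
    return tuple(min(_hash(x, seed) for x in ks) for seed in range(num_hashes))


def candidate_pairs(sketches):
    """Pairs of reads that agree in at least one sketch position.

    Reads are bucketed by (position, value), so only reads that share a
    bucket are paired; no all vs. all scan over the reads is performed.
    """
    buckets = defaultdict(set)
    for name, sk in sketches.items():
        for pos, value in enumerate(sk):
            buckets[(pos, value)].add(name)
    pairs = set()
    for names in buckets.values():
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


def estimated_similarity(sk_a, sk_b):
    """Fraction of matching sketch positions (estimates Jaccard similarity)."""
    if not sk_a or not sk_b:
        return 0.0
    matches = sum(1 for x, y in zip(sk_a, sk_b) if x == y)
    return matches / len(sk_a)


# Hypothetical toy reads, for illustration only.
reads = {
    "r1": "ACGTACGTTGCA",
    "r2": "ACGTACGTTGCA",
    "r3": "CCCCCCCCCCCC",
}
sketches = {name: sketch(seq) for name, seq in reads.items()}
pairs = candidate_pairs(sketches)
```

Candidate pairs whose estimated similarity exceeds a chosen threshold would then become edges of a similarity graph; repeating this with several thresholds gives one graph per level of the hierarchy mentioned in the abstract.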
