J Bioinform Comput Biol DOI:10.1142/S0219720013400015

Large-scale metagenomic sequence clustering on map-reduce clusters.

Publication Type: Journal Article
Year of Publication: 2013
Authors: Yang, X, Zola, J, Aluru, S
Journal: J Bioinform Comput Biol
Date Published: 2013 Feb
Keywords: Algorithms; Animals; Base Sequence; Chromosome Mapping; Cluster Analysis; Computer Simulation; Humans; Metagenome; Metagenomics; Models, Genetic; Molecular Sequence Data; Sequence Analysis, DNA

Taxonomic clustering of species from millions of DNA fragments sequenced from their genomes is an important and frequently arising problem in metagenomics. In this paper, we present a parallel algorithm for taxonomic clustering of large metagenomic samples with support for overlapping clusters. We develop sketching techniques, akin to those created for web document clustering, to deduce significant similarities between pairs of sequences without resorting to expensive all-vs-all comparison. We formulate the metagenomic classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud-ready implementation. We show that the resulting framework can produce high-quality clustering of metagenomic samples consisting of millions of reads, within reasonable time limits, when executed on a modest-size cluster.
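
To illustrate the sketching idea mentioned in the abstract, the following is a minimal Python sketch of a bottom-s min-hash over k-mers, in the spirit of web document clustering. The abstract does not specify the authors' sketching scheme, so everything here is an assumption made for illustration: the function names (`kmers`, `hash_kmer`, `sketch`, `candidate_pairs`, `jaccard`), the choice of hash, and the default values of `k` and `size` are all hypothetical. The grouping in `candidate_pairs` mimics a map step (emit hash value and read id) followed by a reduce step (pair the reads that share a value), run here in a single process rather than on a map-reduce cluster.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, size=8):
    """Bottom-`size` sketch: the `size` smallest k-mer hash values of seq."""
    return frozenset(sorted(hash_kmer(x) for x in kmers(seq, k))[:size])


def candidate_pairs(reads, k=5, size=8):
    """Pair up reads whose sketches share at least one hash value."""
    # "Map": emit (hash value, read id) for every value in each read's sketch.
    buckets = defaultdict(set)
    for read_id, seq in reads.items():
        for h in sketch(seq, k, size):
            buckets[h].add(read_id)
    # "Reduce": within each bucket, every pair of reads becomes a candidate.
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b, k=5):
    """Exact Jaccard similarity of the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


if __name__ == "__main__":
    reads = {
        "r1": "ACGTACGTGGCATTAC",
        "r2": "ACGTACGTGGCATTAC",
        "r3": "TTTTTTTTTTTTTTTT",
    }
    for a, b in sorted(candidate_pairs(reads)):
        print(a, b, jaccard(reads[a], reads[b]))
```

Only the candidate pairs need an exact similarity check, which is how sketching avoids comparing every read against every other read. Pairs that pass a chosen similarity threshold would form the edges of the similarity graph the abstract refers to; quasi-clique enumeration on that graph is not shown here.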


Alternate Journal: J Bioinform Comput Biol
PubMed ID: 23427983