Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning.

Nat Biotechnol

Authors	Brian Cleary; Ilana Brito; Katherine Huang; Dirk Gevers; Terrance Shea; Sarah Young; Eric Alm
Keywords	Algorithms; Species Specificity; Chromosome Mapping; Epigenesis, Genetic; Sequence Analysis, DNA; Bacteria; Genome, Bacterial; Metagenomics; Datasets as Topic; Databases, Genetic; Microbiota
Abstract	Analyses of metagenomic datasets that are sequenced to a depth of billions or trillions of bases can uncover hundreds of microbial genomes, but naive assembly of these data is computationally intensive, requiring hundreds of gigabytes to terabytes of RAM. We present latent strain analysis (LSA), a scalable, de novo pre-assembly method that separates reads into biologically informed partitions and thereby enables assembly of individual genomes. LSA is implemented with a streaming calculation of unobserved variables that we call eigengenomes. Eigengenomes reflect covariance in the abundance of short, fixed-length sequences, or k-mers. As the abundance of each genome in a sample is reflected in the abundance of each k-mer in that genome, eigengenome analysis can be used to partition reads from different genomes. This partitioning can be done in fixed memory using tens of gigabytes of RAM, which makes assembly and downstream analyses of terabytes of data feasible on commodity hardware. Using LSA, we assemble partial and near-complete genomes of bacterial taxa present at relative abundances as low as 0.00001%. We also show that LSA is sensitive enough to separate reads from several strains of the same species.
Year of Publication	2015
Journal	Nat Biotechnol
Volume	33
Issue	10
Pages	1053-60
Date Published	2015 Oct
ISSN	1546-1696
URL	http://dx.doi.org/10.1038/nbt.3329
DOI	10.1038/nbt.3329
PubMed ID	26368049
PubMed Central ID	PMC4720164
Grant list	T32 GM087237 / GM / NIGMS NIH HHS / United States; U54HG003067 / HG / NHGRI NIH HHS / United States
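
The abstract's central idea is that k-mers from the same genome rise and fall together across samples. The toy sketch below illustrates only that idea; it is not the paper's method. LSA uses a streaming calculation of eigengenomes, whereas this sketch swaps in a simple greedy grouping by cosine similarity of k-mer abundance profiles. All function names, the threshold value, and the two made-up "genomes" are assumptions introduced for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_profiles(samples, k):
    """Map each k-mer to its vector of counts across all samples."""
    per_sample = [kmer_counts(reads, k) for reads in samples]
    kmers = sorted(set().union(*per_sample))
    return {km: [c[km] for c in per_sample] for km in kmers}


def cosine(u, v):
    """Cosine similarity of two abundance vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def partition_kmers(profiles, threshold=0.95):
    """Greedy grouping (an assumption, not LSA): a k-mer joins the first
    partition whose seed profile it covaries with, else starts a new one."""
    partitions = []  # list of (seed_profile, member_kmers)
    for kmer, profile in profiles.items():
        for seed, members in partitions:
            if cosine(seed, profile) >= threshold:
                members.append(kmer)
                break
        else:
            partitions.append((profile, [kmer]))
    return [members for _, members in partitions]


# Two made-up genomes with no shared 4-mers; each "read" is a whole genome copy.
genome_a = "AAAAACCCCC"
genome_b = "GGGGGTTTTT"
samples = [
    [genome_a] * 3 + [genome_b] * 1,  # sample 1: A is abundant
    [genome_a] * 1 + [genome_b] * 4,  # sample 2: B is abundant
    [genome_a] * 2,                   # sample 3: B is absent
]

profiles = abundance_profiles(samples, k=4)
parts = partition_kmers(profiles)
for members in parts:
    print(members)
```

Because every k-mer of a genome scales with that genome's abundance in each sample, the k-mers of the two toy genomes fall into separate partitions. Reads could then be assigned to the partition that holds most of their k-mers, which is the step that allows per-partition assembly.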