Candida albicans

From ArachneWiki


Candida albicans is a diploid fungus that causes opportunistic oral and vaginal infections in humans. Its genome is smaller than 15 Mbp, and it was sequenced from different libraries to a Q20 coverage of about 10X.


Assembly issues

Initial assembly

The out-of-the-box run of Assemblez with default parameters yielded impressive assembly statistics, most notably a contig N50 size of more than 500 kbp. The BasicAssemblyStats output was:

Number of reads in total :          241207
Number of reads used  0x :           25504  ( 10.6% )
Number of reads used  1x :          215703  ( 89.4% )
Number of reads used >1x :               0  (  0.0% )
Number of bases in total :       176519466
Number of bases used  0x :        15163948  (  8.6% )
Number of bases used  1x :       161355518  ( 91.4% )
Number of bases used >1x :               0  (  0.0% )
Contig        N75 length :          237019  (      21 large ones )
Contig        N50 length :          537128  (      10 Large ones )
Contig        N25 length :          780827  (       4 LARGE ones )
Contig      total length :        14492284  (      99 IN TOTAL   )
Super         N75 length :         1041827  (       7 large ones )
Super         N50 length :         1401178  (       4 Large ones )
Super         N25 length :         1762731  (       2 LARGE ones )
Super       total length :        14503629  (      25 IN TOTAL   )
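For readers unfamiliar with the N25/N50/N75 figures above, the following is a minimal sketch of how such a statistic is commonly computed: sort the contig lengths in decreasing order and report the length at which the running total first reaches the given fraction of all assembled bases. The function name `n_stat` and the toy lengths are assumptions for illustration; BasicAssemblyStats may differ in details such as tie handling.

```python
def n_stat(lengths, fraction=0.5):
    """Return (length, count): the length L such that contigs of length >= L
    cover at least `fraction` of the total bases, and how many contigs that takes."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("empty length list")


# Made-up contig lengths, not taken from the C. albicans assembly.
toy_contigs = [500, 400, 300, 200, 100]
n25 = n_stat(toy_contigs, 0.25)  # (500, 1)
n50 = n_stat(toy_contigs, 0.50)  # (400, 2)
n75 = n_stat(toy_contigs, 0.75)  # (300, 3)
```

The second element of each result corresponds to the "large ones" count printed next to each N-length in the output above.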

Moreover, the assembly was in good agreement with another strain of C. albicans, suggesting that there were no major assembly errors present.

So, what went wrong with the assembly?

When we ran gene-calling software, it turned out that a number of false frameshifts and premature stop codons prevented genes from being identified correctly. PolymorphismEstimator estimated the heterozygosity at around 1.6% (one event in about 60 bases), mostly SNPs and tiny indels: not enough to disturb the assembly process, but enough for the consensus generator to pick bases from either haplotype on a more or less random basis.
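The arithmetic behind "one event in about 60 bases" can be sketched as follows. The helper names and the 1000-base span are arbitrary choices for illustration and do not come from PolymorphismEstimator.

```python
def mean_spacing(rate):
    """Average number of bases between polymorphic sites for a per-base rate."""
    return 1.0 / rate


def expected_sites(rate, span):
    """Expected number of polymorphic sites in a stretch of `span` bases."""
    return rate * span


heterozygosity = 0.016                           # 1.6%, as estimated above
spacing = mean_spacing(heterozygosity)           # about 62.5 bases
sites_per_kb = expected_sites(heterozygosity, 1000)  # about 16 sites (illustrative span)
```

At this density, any stretch of consensus as long as a typical gene spans many heterozygous sites, so a consensus that switches haplotypes at random is likely to mix alleles within a single gene.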


In order to preserve longer stretches from a single haplotype, we first identified all reads whose overlaps with nearby reads contained high-quality disagreements. We then removed all these reads together with their read partners, leaving most of the assembly with consensus but no read coverage. This can be done in a single run of IncompatiblePurge with the option USE_REPEATS=False.
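The purge step can be illustrated with a small sketch. This is not the IncompatiblePurge implementation. It assumes a hypothetical data model in which each read is an ungapped placement on the consensus (`start`, `bases`, `quals`), treats a disagreement as high quality when both bases reach an assumed threshold of 30, and flags both reads of a disagreeing pair.

```python
from itertools import combinations


def high_quality_disagreement(a, b, min_qual=30):
    """True if two placed reads disagree at a shared column where both
    base qualities are at least min_qual (threshold is an assumption)."""
    start = max(a["start"], b["start"])
    end = min(a["start"] + len(a["bases"]), b["start"] + len(b["bases"]))
    for pos in range(start, end):
        i, j = pos - a["start"], pos - b["start"]
        if (a["bases"][i] != b["bases"][j]
                and a["quals"][i] >= min_qual
                and b["quals"][j] >= min_qual):
            return True
    return False


def reads_to_purge(reads, partner, min_qual=30):
    """Return ids of reads involved in a high-quality disagreement,
    plus the ids of their read partners."""
    flagged = set()
    for (id_a, a), (id_b, b) in combinations(reads.items(), 2):
        if high_quality_disagreement(a, b, min_qual):
            flagged.update((id_a, id_b))
    return flagged | {partner[r] for r in flagged if r in partner}


# Toy data: r1 and r2 disagree at consensus position 6 (G vs T).
reads = {
    "r1": {"start": 0, "bases": "ACGTACGT", "quals": [40] * 8},
    "r2": {"start": 4, "bases": "ACTTAAAA", "quals": [40] * 8},
    "r3": {"start": 100, "bases": "GGGG", "quals": [40] * 4},
    "r4": {"start": 100, "bases": "GGGG", "quals": [40] * 4},
}
partner = {"r2": "r3", "r3": "r2"}  # r2 and r3 are read partners
purged = reads_to_purge(reads, partner)  # {"r1", "r2", "r3"}
```

In the toy data, r3 is removed only because its partner r2 was flagged, while r4 is kept. The all-pairs comparison is quadratic and only suitable for illustration; a real tool would restrict the comparison to reads that actually overlap.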

Next, we used a greedy algorithm to put reads back so that disagreements between the reads are minimized (CompatibleBinge and PlopDownReads with the option MAX_SNPS=0; the first places reads only in pairs, the second also places single reads). Most of the original contig sequence was preserved, and at a read usage of 65%, the polymorphism rate was now estimated at 0.004% (compared to 0.001% for C. lusitaniae, a true haploid genome).
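A greedy placement of this kind might look like the sketch below. It is an illustration under stated assumptions, not the CompatibleBinge or PlopDownReads code: reads are hypothetical ungapped placements (`start`, `bases`), they are tried in input order, and a read is accepted only if it differs from the bases already fixed by earlier reads in at most `max_snps` columns. The first pass accepts reads only as complete pairs; the second pass also tries the remaining reads one at a time.

```python
def mismatches(read, columns):
    """Count columns where the read disagrees with an already placed base."""
    return sum(
        1
        for offset, base in enumerate(read["bases"])
        if columns.get(read["start"] + offset, base) != base
    )


def try_place(group, reads, columns, max_snps):
    """Return updated columns if every read in the group fits, else None."""
    trial = dict(columns)
    for read_id in group:
        read = reads[read_id]
        if mismatches(read, trial) > max_snps:
            return None
        for offset, base in enumerate(read["bases"]):
            trial.setdefault(read["start"] + offset, base)
    return trial


def greedy_place(reads, pairs, max_snps=0):
    """Place pairs first, then leftover single reads; return placed ids in order."""
    columns, placed = {}, []
    singles = [(read_id,) for read_id in reads]
    for group in [tuple(p) for p in pairs] + singles:
        if any(read_id in placed for read_id in group):
            continue  # already placed in the pair pass
        trial = try_place(group, reads, columns, max_snps)
        if trial is not None:
            columns = trial
            placed.extend(group)
    return placed


# Toy data: a1/a2 and s1 come from one haplotype, b1/b2 from the other.
reads = {
    "a1": {"start": 0, "bases": "ACGTAC"},
    "a2": {"start": 20, "bases": "TTTT"},
    "b1": {"start": 2, "bases": "GAAC"},   # differs from a1 at position 3
    "b2": {"start": 22, "bases": "TG"},    # differs from a2 at position 23
    "s1": {"start": 4, "bases": "ACGG"},   # unpaired, agrees with a1
}
pairs = [("a1", "a2"), ("b1", "b2")]
placed = greedy_place(reads, pairs, max_snps=0)  # ["a1", "a2", "s1"]
```

Because the procedure is greedy, whichever haplotype is placed first in a region determines which later reads are accepted there; this keeps long stretches from a single haplotype at the cost of leaving the conflicting reads unused.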

A subsequent run of ParallelGenerateTilings and FixConsensus then repaired the consensus using the remaining reads, which better reflect a single haplotype. Finally, a new run of gene calling showed that the number of broken genes had dropped dramatically.
