Candida albicans is a diploid fungus that causes opportunistic oral and vaginal infections in humans. Its genome is less than 15 Mbp in size, and it was sequenced from different libraries to a Q20 coverage of about 10x.
Number of reads in total    : 241207
Number of reads used 0x     :  25504 ( 10.6% )
Number of reads used 1x     : 215703 ( 89.4% )
Number of reads used >1x    :      0 (  0.0% )

Number of bases in total    : 176519466
Number of bases used 0x     :  15163948 (  8.6% )
Number of bases used 1x     : 161355518 ( 91.4% )
Number of bases used >1x    :         0 (  0.0% )

Contig N75 length           :   237019 ( 21 large ones )
Contig N50 length           :   537128 ( 10 Large ones )
Contig N25 length           :   780827 (  4 LARGE ones )
Contig total length         : 14492284 ( 99 IN TOTAL )

Super N75 length            :  1041827 (  7 large ones )
Super N50 length            :  1401178 (  4 Large ones )
Super N25 length            :  1762731 (  2 LARGE ones )
Super total length          : 14503629 ( 25 IN TOTAL )
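For readers unfamiliar with the N25/N50/N75 figures in the report above, the following is a minimal Python sketch of how such values are commonly computed. It assumes the usual definition (the length of the contig at which the running total, taken from the longest contig downwards, first reaches the given fraction of the total length) and that the count in parentheses is the number of contigs needed to get there. The function name nx_length and the toy lengths are hypothetical and are not taken from the assembler.

```python
def nx_length(lengths, fraction):
    """Return (Nx length, number of contigs needed to reach the fraction).

    Assumes the common definition of N50-style statistics; this is an
    illustration, not the assembler's own reporting code.
    """
    total = sum(lengths)
    running = 0
    # Walk through the contigs from longest to shortest.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= fraction * total:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (hypothetical, total 1500 bases).
toy_lengths = [500, 400, 300, 200, 100]
n25 = nx_length(toy_lengths, 0.25)
n50 = nx_length(toy_lengths, 0.50)
n75 = nx_length(toy_lengths, 0.75)
```

Under this definition the counts grow from N25 to N75, which matches the pattern in the report (4, 10 and 21 contigs; 2, 4 and 7 supercontigs).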
Moreover, the assembly was in good agreement with another strain of C. albicans, suggesting that there were no major assembly errors present.
So, what went wrong with the assembly?
When we ran gene calling software on the assembly, it turned out that a number of false frameshifts and premature stop codons prevented genes from being identified correctly. PolymorphismEstimator estimated the heterozygosity at around 1.6% (one event in about 60 bases), mostly SNPs and tiny indels. This is not enough to disturb the assembly process, but it is enough for the consensus generator to pick bases from either haplotype on a more or less random basis.
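The arithmetic behind "one event in about 60 bases" can be checked with a short Python sketch. The helper names and the 1500-base gene length are assumptions made purely for illustration; only the 1.6% rate comes from the text.

```python
def mean_spacing(rate):
    """Average number of bases between polymorphic events."""
    return 1.0 / rate


def expected_events(rate, length):
    """Expected number of polymorphic events in a stretch of `length` bases."""
    return rate * length


HETEROZYGOSITY = 0.016          # 1.6%, as estimated in the text
ASSUMED_GENE_LENGTH = 1500      # hypothetical gene length, for illustration only

spacing = mean_spacing(HETEROZYGOSITY)                          # roughly 62.5 bases
per_gene = expected_events(HETEROZYGOSITY, ASSUMED_GENE_LENGTH)  # roughly 24 events
```

With a couple of dozen polymorphic sites in a gene of that assumed size, a consensus that switches haplotypes at random has many opportunities to combine indels from both haplotypes, which would explain the false frameshifts.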
In order to preserve longer stretches from one single haplotype, we first identified all reads whose overlaps with nearby reads contained high quality disagreements. We then removed all these reads together with their read partners, leaving most of the assembly with consensus but no read coverage (this can be done in a single run of IncompatiblePurge with the option USE_REPEATS=False).
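The purge step described above can be illustrated with a minimal Python sketch. This is not the code of IncompatiblePurge: the function purge_incompatible, the data layout (a list of read IDs, a partner lookup, and a list of disagreements with a quality score) and the threshold of 20 are all assumptions chosen for the example.

```python
def purge_incompatible(reads, partner, disagreements, min_quality=20):
    """Return the reads that survive the purge.

    reads         : list of read IDs
    partner       : dict mapping a read ID to its paired read ID
    disagreements : list of (read_a, read_b, quality) for overlapping reads
                    that disagree at some base (hypothetical layout)
    min_quality   : assumed cutoff for a "high quality" disagreement
    """
    flagged = set()
    for read_a, read_b, quality in disagreements:
        if quality >= min_quality:
            flagged.update((read_a, read_b))
    # Read partners are removed together with the flagged reads.
    flagged |= {partner[r] for r in flagged if r in partner}
    return [r for r in reads if r not in flagged]


reads = ["a1", "a2", "b1", "b2", "c1", "c2"]
partner = {"a1": "a2", "a2": "a1", "b1": "b2", "b2": "b1", "c1": "c2", "c2": "c1"}
disagreements = [("a1", "b1", 40), ("b2", "c1", 10)]  # second one is low quality
survivors = purge_incompatible(reads, partner, disagreements)
```

In this toy input only the pair c1/c2 survives: a1 and b1 disagree at high quality, their partners a2 and b2 go with them, and the low quality disagreement between b2 and c1 is ignored.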
Next, we used a greedy algorithm to put reads back so that disagreements between the reads are minimized (CompatibleBinge and PlopDownReads with the option MAX_SNPS=0; the first places reads only in pairs, the second also places single reads). Most of the original contig sequence could be preserved, and at a read usage of 65%, the polymorphism rate was now estimated to be 0.004% (compared to 0.001% for C. lusitaniae, a true haploid genome).
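The idea of greedy re-placement with no tolerated disagreements can be sketched in Python as follows. This is a toy model under stated assumptions, not the algorithm used by CompatibleBinge or PlopDownReads: reads are given as (id, start position, sequence), longer reads are tried first, pairing and base qualities are ignored, and the parameter max_snps only mimics the spirit of MAX_SNPS.

```python
def place_greedily(candidates, max_snps=0):
    """Place reads one by one, rejecting any read that disagrees with
    already placed reads at more than `max_snps` positions.

    candidates : list of (read_id, start, sequence) tuples (hypothetical layout)
    """
    placed_bases = {}   # position -> base fixed by reads placed so far
    placed_ids = []
    # Greedy order (an assumption): longest reads first.
    for read_id, start, seq in sorted(candidates, key=lambda r: -len(r[2])):
        mismatches = sum(
            1
            for offset, base in enumerate(seq)
            if placed_bases.get(start + offset, base) != base
        )
        if mismatches <= max_snps:
            for offset, base in enumerate(seq):
                # Keep the base of the earlier read where both cover a position.
                placed_bases.setdefault(start + offset, base)
            placed_ids.append(read_id)
    return placed_ids


candidates = [
    ("r1", 0, "ACGTAC"),
    ("r2", 3, "TACGG"),   # agrees with r1 in the overlap
    ("r3", 3, "TTCGG"),   # differs from r1 at position 4
    ("r4", 7, "GA"),
]
strict = place_greedily(candidates, max_snps=0)
lenient = place_greedily(candidates, max_snps=1)
```

With max_snps=0 the read r3 is left out because it conflicts with r1, so the placed reads are consistent with a single haplotype; allowing one mismatch lets r3 back in.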
A subsequent run of ParallelGenerateTilings and FixConsensus repaired the consensus using the remaining reads, which better reflect one single haplotype. Finally, in a new run of gene calling, the number of broken genes dropped dramatically.