The module PolymorphismEstimator estimates the rate at which polymorphic events occur in a polymorphic genome. It works from the read placements (more precisely, the differences among the placed reads) and the contig consensus.
First, repeats are masked in the assembly, because reads from different copies of a repeat are considerably more likely to introduce false events than reads from unique sequence. Since some assemblies are undercollapsed (i.e. the haplotypes are in separate scaffolds), the option MIN_COPY_NUMBER controls how frequent a repeat must be before it is masked; REPEAT_MASKED=False turns off masking entirely. The tool also reports the percentage of bases found in repeats, which it determines using 48-mers.
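The masking step can be illustrated with a minimal Python sketch. It is not the module's own code: the function name mask_repeats and its parameters are hypothetical, reverse complements are ignored, and the assumption that a k-mer is masked once it occurs at least min_copy_number times is only one plausible reading of MIN_COPY_NUMBER.

```python
from collections import Counter


def mask_repeats(seq, k=48, min_copy_number=2):
    """Mask bases covered by frequent k-mers (hypothetical helper).

    Returns the sequence with masked bases replaced by 'N' and the
    percentage of bases that were masked. A k-mer counts as a repeat
    when it occurs at least `min_copy_number` times (an assumption).
    """
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return seq, 0.0

    # Count every k-mer on the forward strand only.
    counts = Counter(seq[i:i + k] for i in range(n_kmers))

    # Flag every base that lies inside a frequent k-mer.
    masked = [False] * len(seq)
    for i in range(n_kmers):
        if counts[seq[i:i + k]] >= min_copy_number:
            for j in range(i, i + k):
                masked[j] = True

    out = "".join("N" if m else base for base, m in zip(seq, masked))
    return out, 100.0 * sum(masked) / len(seq)
```

Raising min_copy_number in this sketch leaves two-copy sequence unmasked, which is the behaviour one would want for an undercollapsed assembly where the two haplotypes sit in separate scaffolds.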
Regardless of read quality scores, this module looks for locations where (by default) three reads disagree with the consensus but agree with each other (these locations are pre-computed by HQDAnnotator). Polymorphic events include SNPs as well as indels of up to ~25 bases. If a stretch of disagreement is longer than one base, we examine the differences and count base mismatches:
    Total bases: 14689050 bp, unique: 13643266 bp (92.8805%)
    Repetitive before smoothing: 6.29223%
    Mismatches: 28522 or 131278 bp (0.962218%)
    Insertions: 2653 or 10242 bp (0.07507%)
    Deletions: 1893 or 7306 bp (0.0535502%)
    SNPs: 18435 bp (0.135122%)
    Total bases: 1.09084%
    *** Adjusted estimate: 0.398644% ***
    Alternate estimate: 0.423638%
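As a reading aid, the percentages in the sample report above can be recomputed from the base counts. The short Python check below suggests that each event percentage is taken relative to the unique (unmasked) bases, and that the second "Total bases" figure is the sum of the mismatch, insertion and deletion percentages. The "Adjusted" and "Alternate" estimates are not reproduced, because their formulas are not given here.

```python
# Figures copied from the sample report above.
total_bp = 14689050
unique_bp = 13643266
event_bp = {
    "Mismatches": 131278,
    "Insertions": 10242,
    "Deletions": 7306,
    "SNPs": 18435,
}


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Share of the assembly that survives repeat masking.
unique_pct = pct(unique_bp, total_bp)

# Event rates, each relative to the unique bases.
rates = {name: pct(bp, unique_bp) for name, bp in event_bp.items()}

# Sum that matches the second "Total bases" line of the report.
combined = rates["Mismatches"] + rates["Insertions"] + rates["Deletions"]

print(f"unique: {unique_pct:.4f}%")
for name, rate in rates.items():
    print(f"{name}: {rate:.6f}%")
print(f"mismatches + insertions + deletions: {combined:.5f}%")
```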
The most reliable number for the overall polymorphism rate is the "Alternate estimate", which includes both SNPs and indels and is close to a real divergence rate (in the sense of how 'identical' the haplotypes are).
SNP distribution histogram:
    Windows: 2870
    0 - 0%: -> 28.2927% of genome
    0 - 1%: -> 57.8049% of genome
    1 - 2%: -> 12.5784% of genome
    2 - 3%: -> 1.21951% of genome
    3 - 4%: -> 0% of genome
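A histogram of this shape can be produced by cutting the sequence into fixed-size windows and binning each window by its SNP rate. The Python sketch below is only an illustration: the function snp_histogram is hypothetical, the window size of 5000 bases is an assumption (the actual window size is not stated here), and the bin boundaries (no SNPs, then above (i-1)% up to and including i%) are one possible reading of the labels above.

```python
import math


def snp_histogram(snp_positions, genome_length, window=5000):
    """Percentage of windows per SNP-rate bin (hypothetical helper).

    Bin 0 holds windows without any SNP; bin i (i >= 1) holds windows
    whose SNP rate is above (i-1)% and at most i%. The window size is
    an assumed value, not taken from the tool.
    """
    n_windows = math.ceil(genome_length / window)
    counts = [0] * n_windows
    for pos in snp_positions:
        counts[pos // window] += 1

    bins = {}
    for w, c in enumerate(counts):
        # The last window may be shorter than the others.
        size = min(window, genome_length - w * window)
        # Integer ceiling of the rate in percent; 0 when c == 0.
        b = (100 * c + size - 1) // size
        bins[b] = bins.get(b, 0) + 1

    return {b: 100.0 * n / n_windows for b, n in sorted(bins.items())}
```

For example, snp_histogram([5, 110, 120, 130], 400, window=100) places two of the four windows in the SNP-free bin, one in the 0 - 1% bin and one in the 2 - 3% bin.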
Since HQDAnnotator requires perfect 24-mer matches on each side of a disagreement, regions with polymorphism rates higher than ~0.5% will be underrepresented (a 1% polymorphic assembly will appear as less than 0.8%). Because polymorphism is unevenly distributed, even genomes whose average rate is below that threshold can contain regions that are undercounted. CAUTION: for polymorphism rates approaching 1% or higher, this counting is no longer accurate.
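The direction and rough size of this undercounting can be illustrated with a deliberately simple model. This is a back-of-the-envelope sketch, not HQDAnnotator's actual logic. It assumes polymorphic sites are independent and uniformly distributed, and that an event is only counted when all 24 bases on each side are free of other events. The function name expected_observed_rate is hypothetical.

```python
def expected_observed_rate(true_rate, flank=24):
    """Observed event rate under a simple independence model.

    An event is assumed to be counted only if the `flank` bases on each
    side contain no other event, each base being polymorphic with
    probability `true_rate` independently of its neighbours.
    """
    p_clean_flanks = (1.0 - true_rate) ** (2 * flank)
    return true_rate * p_clean_flanks


for true_pct in (0.1, 0.5, 1.0, 2.0):
    observed_pct = 100.0 * expected_observed_rate(true_pct / 100.0)
    print(f"true {true_pct:.1f}% -> observed about {observed_pct:.3f}%")
```

Under these assumptions a true rate of 1% would be observed as roughly 0.62%, which is consistent with the "less than 0.8%" figure quoted above. Real data is clustered rather than uniform, so the model should be read as indicative only.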