PolymorphismEstimator

From ArachneWiki

Function: Text output
Phase: Analysis
Standard CLAs: PRE, DATA, RUN, SUBDIR, GDB, NO_HEADER
Special CLAs: HQD_FILE, MIN_READS, REPEAT_MASK, MIN_COPY_NUMBER, MIN_HOMOZYGEOUS_STRETCH, COVERAGE_STATS, HISTOGRAM, HISTO_WIN, haplo_count
Source location: ARACHNE_DIR/qcmarkup

The module PolymorphismEstimator estimates the rate of polymorphic events in a polymorphic genome. It works from the read locations, the differences between the reads, and the contig consensus.

PolymorphismEstimator requires as input the files mergedcontigs.tags (from TagRepeats) and hqd.qc.xml (from HQDAnnotator).


Repeat masking

First, repeats in the assembly are masked, because reads from different copies of a repeat are considerably more likely to introduce false events there than in unique sequence. Since some assemblies are undercollapsed (i.e. the haplotypes are in separate scaffolds), the option MIN_COPY_NUMBER controls the copy number at which sequence is masked as a repeat; REPEAT_MASK=False turns off masking altogether. The tool also reports the percentage of bases found in repeats, computed from 48-mers.
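
The idea of masking by k-mer copy number can be illustrated with a minimal Python sketch. This is not the Arachne implementation: the function names, the single-strand counting (no reverse complements), and the default of 2 copies are assumptions made for illustration only. For an undercollapsed assembly one would presumably raise the threshold so that the two haplotype copies of unique sequence are not masked.

```python
from collections import Counter


def mask_repeats(seq: str, k: int = 48, min_copy_number: int = 2) -> list[bool]:
    """Return a per-base mask that is True where the base lies inside a
    k-mer occurring at least min_copy_number times in seq.

    Hypothetical helper: forward strand only, exact k-mer matches only.
    """
    n = len(seq)
    if n < k:
        return [False] * n
    # Count every k-mer in the sequence.
    counts = Counter(seq[i:i + k] for i in range(n - k + 1))
    mask = [False] * n
    for i in range(n - k + 1):
        if counts[seq[i:i + k]] >= min_copy_number:
            # Mask all k bases covered by this repeated k-mer.
            for j in range(i, i + k):
                mask[j] = True
    return mask


def repeat_percentage(mask: list[bool]) -> float:
    """Percentage of bases flagged as repetitive."""
    return 100.0 * sum(mask) / len(mask) if mask else 0.0
```

With a toy sequence such as "GATTACACCCGATTACA" and k=7, the two copies of "GATTACA" are masked and the three bases between them are not; raising min_copy_number to 3 leaves everything unmasked.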

Polymorphism detection

Regardless of read quality scores, this module looks for positions where (by default) three reads disagree with the consensus but agree with each other (these locations are pre-computed by HQDAnnotator). Polymorphic events include SNPs as well as indels of up to ~25 bases. If a stretch of disagreement is longer than one base, we examine the differences and count the mismatched bases:

Total bases: 14689050 bp, unique: 13643266 bp (92.8805%)
Repetitive before smoothing: 6.29223%
Mismatches:  28522 or 131278 bp (0.962218%)
Insertions:  2653 or 10242 bp (0.07507%)
Deletions:   1893 or 7306 bp (0.0535502%)
SNPs:      18435 bp (0.135122%)
Total bases: 1.09084%
*** Adjusted  estimate: 0.398644%
*** Alternate estimate: 0.423638%
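
The percentages in this sample output can be checked by hand. The short Python sketch below assumes, from the arithmetic alone, that each event percentage is the event's base count divided by the unique (unmasked) base count, and that the second "Total bases" line is the sum of the mismatch, insertion and deletion percentages. The formulas behind the adjusted and alternate estimates are not given on this page, so they are not reproduced here.

```python
# Values copied from the sample output above.
total_bp = 14_689_050
unique_bp = 13_643_266
events_bp = {
    "mismatches": 131_278,
    "insertions": 10_242,
    "deletions": 7_306,
    "snps": 18_435,
}


def pct(bp: int, denom: int) -> float:
    """Express bp as a percentage of denom."""
    return 100.0 * bp / denom


# Share of the assembly that remains after repeat masking.
unique_pct = pct(unique_bp, total_bp)

# Assumption: event rates are relative to unique bases, not total bases.
rates = {name: pct(bp, unique_bp) for name, bp in events_bp.items()}

# Assumption: the second "Total bases" line sums these three categories.
combined = rates["mismatches"] + rates["insertions"] + rates["deletions"]

for name, rate in rates.items():
    print(f"{name}: {rate:.6f}%")
print(f"unique: {unique_pct:.4f}%  combined: {combined:.5f}%")
```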

The most reliable number for overall polymorphism rate is the "Alternate estimate", which includes both SNPs and indels and is close to a real divergence rate (in the sense of how 'identical' the haplotypes are).
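
The detection rule described at the start of this section (several reads that disagree with consensus but agree with each other) can be sketched for a single alignment column as follows. This is only an illustration: the real locations are pre-computed by HQDAnnotator, the function name is hypothetical, and the assumption that the default of three reads corresponds to the MIN_READS argument is not confirmed by this page.

```python
from collections import Counter
from typing import Optional


def supported_variant(consensus_base: str,
                      read_bases: list[str],
                      min_reads: int = 3) -> Optional[str]:
    """Return the non-consensus base backed by at least min_reads reads
    in this column, or None if no alternative has enough support.

    Quality scores are deliberately ignored, as in the description above.
    """
    # Tally only the reads that disagree with the consensus.
    counts = Counter(b for b in read_bases if b != consensus_base)
    if not counts:
        return None
    base, support = counts.most_common(1)[0]
    return base if support >= min_reads else None
```

For example, a column with consensus "A" and reads A, A, G, G, G, A yields "G", while reads A, A, G, G, T, C yield no event because no single alternative reaches three reads.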

Distribution

Since polymorphism is rarely, if ever, evenly distributed across the genome, we also collect stats for regions at different levels of polymorphism in non-overlapping windows (5 kbp by default):

SNP distribution histogram: 
Windows: 2870
0 - 0%: -> 28.2927% of genome
0 - 1%: -> 57.8049% of genome
1 - 2%: -> 12.5784% of genome
2 - 3%: -> 1.21951% of genome
3 - 4%: -> 0% of genome
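
A windowed histogram of this kind can be sketched in Python as below. The binning is an assumption inferred from the labels in the sample output: windows without any SNP go into "0 - 0%", and every other window goes into the one-percent-wide bin that contains its rate. The function name, the 0-based positions and the dropping of a trailing partial window are likewise illustrative choices, not the behaviour of the tool.

```python
from collections import Counter


def window_histogram(snp_positions: list[int],
                     genome_length: int,
                     window: int = 5000) -> dict[str, float]:
    """Bin non-overlapping windows by SNP rate and return, per bin,
    the percentage of windows that fall into it."""
    n_windows = genome_length // window  # trailing partial window is dropped
    if n_windows == 0:
        return {}
    # Number of SNPs in each window (positions are 0-based).
    per_window = Counter(p // window for p in snp_positions
                         if 0 <= p // window < n_windows)
    bins: Counter = Counter()
    for w in range(n_windows):
        rate = 100.0 * per_window[w] / window
        if per_window[w] == 0:
            label = "0 - 0%"
        else:
            lo = int(rate)  # lower edge of the 1%-wide bin
            label = f"{lo} - {lo + 1}%"
        bins[label] += 1
    return {label: 100.0 * n / n_windows for label, n in bins.items()}
```

With four 5 kbp windows containing 0, 10, 60 and 1 SNPs, this returns 25% of windows in "0 - 0%", 50% in "0 - 1%" and 25% in "1 - 2%".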


Accuracy

Since HQDAnnotator requires perfect 24-mer matches on each side of a disagreement, regions with polymorphism rates above ~0.5% will be underrepresented (a 1% polymorphic assembly will appear as less than 0.8%). Because polymorphism is unevenly distributed, even genomes with a lower average rate can contain regions that are undercounted. CAUTION: for polymorphism rates approaching 1% or higher, this counting is no longer accurate.
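
The size of this effect can be illustrated with a deliberately simple model, which is an assumption and not how the tool computes anything: treat polymorphic sites as independent single-base events at a uniform rate p, and count an event only if all 48 flanking bases (24 on each side) are free of other events. The function names are hypothetical.

```python
def detectable_fraction(p: float, flank: int = 24) -> float:
    """Probability that both flanks of an event are free of further
    events, assuming independent sites with per-base rate p."""
    return (1.0 - p) ** (2 * flank)


def apparent_rate(p: float, flank: int = 24) -> float:
    """Rate that would be observed if only events with clean flanks
    were counted."""
    return p * detectable_fraction(p, flank)


for true_rate in (0.001, 0.005, 0.01, 0.02):
    seen = apparent_rate(true_rate)
    print(f"true {100 * true_rate:.2f}% -> apparent {100 * seen:.3f}%")
```

Under these assumptions a true rate of 1% shows up as roughly 0.6%, which is in line with the "less than 0.8%" figure above, and the shortfall grows quickly as the true rate rises.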
