Tagged with #quality-scores
1 documentation article | 0 announcements | 2 forum discussions


Comments (0)

You may have noticed that a lot of the scores that are output by the GATK are in Phred scale. The Phred scale was originally used to represent base quality scores emitted by the Phred program in the early days of the Human Genome Project (see this Wikipedia article for more historical background). Now they are widely used to represent probabilities and confidence scores in other contexts of genome science.

Phred scale in context

In the context of sequencing, Phred-scaled quality scores are used to represent how confident we are in the assignment of each base call by the sequencer.

In the context of variant calling, Phred-scaled quality scores can be used to represent many types of probabilities. The most commonly used in GATK is the QUAL score, or variant quality score. It is used in much the same way as the base quality score: the variant quality score is a Phred-scaled estimate of how confident we are that the variant caller correctly identified that a given genome position displays variation in at least one sample.

Phred scale in practice

In today’s sequencing output, by convention, Phred-scaled base quality scores range from 2 to 63. However, Phred-scaled quality scores in general can range anywhere from 0 to infinity. A higher score indicates a higher probability that a particular decision is correct, while conversely, a lower score indicates a higher probability that the decision is incorrect.

The Phred quality score (Q) is logarithmically related to the error probability (E).

$$ Q = -10 \log E $$

So we can interpret this score as an estimate of error, where the error is e.g. the probability that the base is called incorrectly by the sequencer, but we can also interpret it as an estimate of accuracy, where the accuracy is e.g. the probability that the base was identified correctly by the sequencer. Depending on how we decide to express it, we can make the following calculations:

If we want the probability of error (E), we take:

$$ E = 10 ^{-\left(\frac{Q}{10}\right)} $$

And conversely, if we want to express this as the estimate of accuracy (A), we simply take

$$ \begin{eqnarray} A &=& 1 - E \nonumber \ &=& 1 - 10 ^{-\left(\frac{Q}{10}\right)} \nonumber \ \end{eqnarray} $$

Here is a table of how to interpret a range of Phred Quality Scores. It is largely adapted from the Wikipedia page for Phred Quality Score.

For many purposes, a Phred Score of 20 or above is acceptable, because this means that whatever it qualifies is 99% accurate, with a 1% chance of error.

Phred Quality Score Error Accuracy (1 - Error)
10 1/10 = 10% 90%
20 1/100 = 1% 99%
30 1/1000 = 0.1% 99.9%
40 1/10000 = 0.01% 99.99%
50 1/100000 = 0.001% 99.999%
60 1/1000000 = 0.0001% 99.9999%

And finally, here is a graphical representation of the Phred scores showing their relationship to accuracy and error probabilities.

The red line shows the error, and the blue line shows the accuracy. Of course, as error decreases, accuracy increases symmetrically.

Note: You can see that below Q20 (which is how we usually refer to a Phred score of 20), the curve is really steep, meaning that as the Phred score decreases, you lose confidence very rapidly. In contrast, above Q20, both of the graphs level out. This is why Q20 is a good cutoff score for many basic purposes.

No posts found with the requested search criteria.
Comments (2)

HaplotypeCaller has a --min_base_quality_score flag to specify the minimum "base quality required to consider a base for calling". It also has the -stand_call_conf and -stand_emit_conf that both of which use a "minimum phred-scaled confidence threshold".

What is the difference between the base quality flag and the phred-scaled confidence thresholds flags in deciding whether or not to call a variant?

Also, why are there separate flags for calling variants and emitting variants? To my way of thinking, calling and emitting variants are one-and-the-same.

Is there any way to specify a minimum minor-allele-frequency for calling variants in Haplotyper? Under the default settings, the program expects 1:1 ratio of each allele in a single diploid organism. However, stochastic variance in which RNA fragments are sequenced and retained could lead to departures from this ratio.

Finally, is there any way to specify the minimum level of coverage for a given variant for it to be considered for calling/genotyping?

Comments (15)

Dear all, I want to call SNPs from Solid data (SE=35) using GATK recent version (3.2.2), but I got the following errors which is regarding to RealignerTargetCreator function: when using argument -fixMisencodedQual ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool without -fixMisencodedQual ERROR MESSAGE: SAM/BAM file SAMFileReader{XXXXX} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 62; please see the GATK --help documentation for options related to this error when using argument -allowPotentiallyMisencodedQuals, it can run well.

All my command like following, bfast match -f ref.fa -r 1.fastq -A 1 -n 16 >1.aligned.bmf

bfast localalign -f ref.fa -m 1.aligned.bmf -A 1 -n 16 >2.aligned.baf

bfast postprocess -f ref.fa -i 2.aligned.baf -A 1 -Y 2 -n 16 -b 0 >2.sam

java -Xmx60g -jar /bin/picard-tools-1.118/AddOrReplaceReadGroups.jar INPUT=2.sam OUTPUT=2.bam SORT_ORDER=coordinate RGID=OS RGLB=OS RGPL=solid RGPU=SRR035385 RGSM=OS

java -Xmx60g -jar /bin/picard-tools-1.118/MarkDuplicates.jar INPUT=2.bam OUTPUT=2rdup.bam METRICS_FILE=2rdup REMOVE_DUPLICATES=true ASSUME_SORTED=true MAX_FILE_HANDLES=2000

java -jar /bin/GATK-3.2-2/GenomeAnalysisTK.jar -R ref.fa -T RealignerTargetCreator -I 2rdup.bam -o 2.realn.intervals -nt 8 -allowPotentiallyMisencodedQuals ###I got errors here

Can I get a correct VCF file when I using argument -allowPotentiallyMisencodedQuals in the following command! While, some wrong commands may lead to this problem, please point them out.

I hope someone can help me with my questions, thank you!