
Created 2015-07-29 13:55:56 | Updated 2016-04-23 00:12:36 | Tags: genotype likelihoods math phred pl

PL is a sample-level annotation calculated by GATK variant callers such as HaplotypeCaller and recorded in the FORMAT/sample columns of variant records in VCF files. This annotation represents the normalized Phred-scaled likelihoods of the genotypes considered in the variant record for each sample.

This article clarifies how the PL values are calculated and how this relates to the value of the GQ field.

- The basic math
- Example and interpretation
- Special case: non-reference confidence model (GVCF mode)

The basic formula for calculating PL is:

$$ PL = -10 * \log_{10}{P(Genotype \mid Data)} $$

where `P(Genotype | Data)` is the conditional probability of the Genotype given the sequence Data that we have observed. The process by which we determine the value of `P(Genotype | Data)` is described in the genotyping section of the HaplotypeCaller documentation.

Once we have that probability, we simply take its log (base 10) and multiply it by -10 to put it into Phred scale. Then we normalize the values across all genotypes so that the PL value of the most likely genotype is 0, which we do simply by subtracting the value of the lowest PL from all the values.

*The reason we like to work in Phred scale is because it makes it much easier to work with the very small numbers involved in these calculations. One thing to keep in mind of course is that Phred is a log scale, so whenever we need to do a division or multiplication operation (e.g. multiplying probabilities), in Phred scale this will be done as a subtraction or addition.*
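That log-space shortcut is easy to see in a small sketch (illustrative only; the helper name and probabilities are made up, not GATK code):

```python
import math

def to_phred(p):
    """Convert a probability to the Phred scale: -10 * log10(p)."""
    return -10 * math.log10(p)

# Multiplying two probabilities in linear space...
p1, p2 = 0.01, 0.001
product = p1 * p2

# ...is the same as adding their Phred-scaled values.
phred_sum = to_phred(p1) + to_phred(p2)

print(to_phred(product))  # 50.0
print(phred_sum)          # 50.0
```

Working with sums of moderate numbers like 20 and 30 avoids the underflow you would risk multiplying many probabilities like 0.000001 together directly.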

Here’s a worked-out example to illustrate this process. Suppose we have a site where the reference allele is A, we observed one read that has a non-reference allele T at the position of interest, and we have in hand the conditional probabilities calculated by HaplotypeCaller based on that one read (if we had more reads, their contributions would be multiplied -- or in log space, added).

*Please note that the values chosen for this example have been simplified and may not be reflective of actual probabilities calculated by HaplotypeCaller.*

```
# Alleles
Reference: A
Read: T
# Conditional probabilities calculated by HC
P(AA | Data) = 0.000001
P(AT | Data) = 0.000100
P(TT | Data) = 0.010000
```

We want to determine the PLs of the genotype being 0/0, 0/1, and 1/1, respectively. So we apply the formula given earlier, which yields the following values:

Genotype | A/A | A/T | T/T
---|---|---
Raw PL | -10 * log10(0.000001) = 60 | -10 * log10(0.000100) = 40 | -10 * log10(0.010000) = 20

Our first observation here is that the genotype with the highest conditional probability gets the lowest PL value. This is expected because, as described in the VCF FAQ, the PL is the Phred-scaled *likelihood* of the genotype, and the negative log transformation inverts the scale. So, low values mean a genotype is more likely, and high values mean it's less likely.

At this point we have one more small transformation to make before we emit the final PL values to the VCF: we are going to **normalize** the values so that the lowest PL value is zero, and the rest are scaled relative to that. Since we’re in log space, we do this simply by subtracting the lowest value, 20, from the others, yielding the following final PL values:

Genotype | A/A | A/T | T/T
---|---|---
Normalized PL | 60 - 20 = 40 | 40 - 20 = 20 | 20 - 20 = 0

We see that there is a direct relationship between the scaling of the PLs and the original probabilities: we had chosen probabilities that were each 100 times more or less likely than the next, and in the final PLs we see that the values are spaced out by a factor of 20, which is the Phred-scale equivalent of 100. This gives us a very convenient way to estimate how the numbers relate to each other -- and how reliable the genotype assignment is -- with just a glance at the PL field in the VCF record.

We actually formalize this assessment of genotype quality in the **GQ annotation**, as described also in the VCF FAQ. The value of GQ is simply the difference between the second lowest PL and the lowest PL (which is always 0). So, in our example GQ = 20 - 0 = 20. Note that the value of GQ is capped at 99 for practical reasons, so even if the calculated GQ is higher, the value emitted to the VCF will be 99.
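Putting the whole worked example together, here is a minimal sketch of the PL and GQ calculations (using the simplified probabilities from above, not HaplotypeCaller's actual implementation):

```python
import math

# Simplified conditional probabilities from the example above
probs = {"A/A": 0.000001, "A/T": 0.000100, "T/T": 0.010000}

# Raw Phred-scaled likelihoods: -10 * log10(P)
raw_pl = {gt: -10 * math.log10(p) for gt, p in probs.items()}

# Normalize so the most likely genotype has PL = 0
lowest = min(raw_pl.values())
pl = {gt: round(v - lowest) for gt, v in raw_pl.items()}
print(pl)  # {'A/A': 40, 'A/T': 20, 'T/T': 0}

# GQ is the gap between the two lowest PLs, capped at 99
sorted_pls = sorted(pl.values())
gq = min(sorted_pls[1] - sorted_pls[0], 99)
print(gq)  # 20
```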

When you run HaplotypeCaller with `-ERC GVCF` to produce a gVCF, there is an additional calculation to determine the genotype likelihoods associated with the symbolic `<NON-REF>` allele (which represents the possibilities that remain once you’ve eliminated the REF allele and any ALT alleles that are being evaluated explicitly).

The PL values for any possible genotype that includes the `<NON-REF>` allele have to be calculated a little differently than what is explained above, because HaplotypeCaller cannot directly determine the conditional probabilities of genotypes involving `<NON-REF>`. Instead, it uses base quality scores to model the genotype likelihoods.

You may have noticed that a lot of the scores that are output by the GATK are in Phred scale. The Phred scale was originally used to represent base quality scores emitted by the Phred program in the early days of the Human Genome Project (see this Wikipedia article for more historical background). Now they are widely used to represent probabilities and confidence scores in other contexts of genome science.

In the context of sequencing, Phred-scaled quality scores are used to represent how confident we are in the assignment of each base call by the sequencer.

In the context of variant calling, Phred-scaled quality scores can be used to represent many types of probabilities. The most commonly used in GATK is the QUAL score, or variant quality score. It is used in much the same way as the base quality score: the variant quality score is a Phred-scaled estimate of how confident we are that the variant caller correctly identified that a given genome position displays variation in at least one sample.

In today’s sequencing output, by convention, Phred-scaled base quality scores range from 2 to 63. However, Phred-scaled quality scores in general can range anywhere from 0 to infinity. A **higher score** indicates a higher probability that a particular decision is **correct**, while conversely, a **lower score** indicates a higher probability that the decision is **incorrect**.

The Phred quality score (Q) is logarithmically related to the error probability (E).

$$ Q = -10 \log_{10} E $$

So we can interpret this score as an estimate of **error**, where the error is *e.g.* the probability that the base is called **incorrectly** by the sequencer, but we can also interpret it as an estimate of **accuracy**, where the accuracy is *e.g.* the probability that the base was identified **correctly** by the sequencer. Depending on how we decide to express it, we can make the following calculations:

If we want the probability of error (E), we take:

$$ E = 10 ^{-\left(\frac{Q}{10}\right)} $$

And conversely, if we want to express this as the estimate of accuracy (A), we simply take

$$ \begin{aligned} A &= 1 - E \\ &= 1 - 10 ^{-\left(\frac{Q}{10}\right)} \end{aligned} $$
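The two conversions are straightforward to sketch in code (the function names here are just for illustration):

```python
def error_prob(q):
    """Probability that the call is wrong, given Phred score Q: 10^(-Q/10)."""
    return 10 ** (-q / 10)

def accuracy(q):
    """Probability that the call is right: 1 - error."""
    return 1 - error_prob(q)

# Reproduce a few rows of the Q/error/accuracy table
for q in (10, 20, 30):
    print(q, error_prob(q), accuracy(q))
```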

Here is a table of how to interpret a range of Phred Quality Scores. It is largely adapted from the Wikipedia page for Phred Quality Score.

For many purposes, a Phred Score of 20 or above is acceptable, because this means that whatever it qualifies is 99% accurate, with a 1% chance of error.

Phred Quality Score | Error | Accuracy (1 - Error)
---|---|---
10 | 1/10 = 10% | 90%
20 | 1/100 = 1% | 99%
30 | 1/1000 = 0.1% | 99.9%
40 | 1/10000 = 0.01% | 99.99%
50 | 1/100000 = 0.001% | 99.999%
60 | 1/1000000 = 0.0001% | 99.9999%

And finally, here is a graphical representation of the Phred scores showing their relationship to accuracy and error probabilities.

The red line shows the error, and the blue line shows the accuracy. Of course, as error decreases, accuracy increases symmetrically.

Note: You can see that below Q20 (which is how we usually refer to a Phred score of 20), the curve is really steep, meaning that as the Phred score decreases, you lose confidence very rapidly. In contrast, above Q20, both of the graphs level out. This is why Q20 is a good cutoff score for many basic purposes.


Created 2016-03-15 09:50:46 | Updated 2016-03-15 09:56:32 | Tags: haplotypecaller genotype likelihoods phred

Dear all,

I have a question regarding genotype calling and phasing by consecutive use of GATK, BEAGLE, and SHAPEIT. I have used the GATK HaplotypeCaller, which produces .vcf files with a ‘PL’ (normalised Phred-scaled likelihood) column and no ‘GL’ (genotype log10 likelihood) column. BEAGLE can handle PL as input, and the initial genotype calling works fine.

The next step would be to use SHAPEIT for phasing, as described here: [https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gcall]

There is a C++ script described on this page, called prepareGenFromBeagle4, which will convert the BEAGLE .vcf files to the correct SHAPEIT input. However, this script looks for a ‘GL’ column in the .vcf file, which is unavailable since GATK only outputs ‘PL’. The script therefore crashes and I cannot proceed to the phasing step.

I have considered and tested a simple conversion script from PL to GL, where GL = PL / -10. However, since the PL is normalised (the genotype with the highest likelihood is set to 0), there is some loss of information here.

For example, for a given genotype AA/AB/BB, if the original GLs (these are not in the GATK output, but say they were) were -0.1 / -1 / -10, the corresponding PLs would be 1 / 10 / 100, and the normalised PLs (the GATK output) would be 0 / 9 / 99, giving rise to these converted GLs after the simple conversion: 0 / -0.9 / -9.9.

The converted GLs are used to then calculate genotype probabilities required for the SHAPEIT input. The issue that I have, is that all three genotype probabilities in the SHAPEIT input need to add up to 1.00 in total. The GL values are somehow scaled to calculate the genotype probabilities. Therefore, the final genotype probabilities in the SHAPEIT input file would turn out differently if I used the 'original' GL values (which I do not have) in comparison to the converted GL values. I am afraid this will introduce bias into the genotype probabilities used by SHAPEIT.
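For what it's worth, if the genotype probabilities are obtained by normalizing 10^GL across the three genotypes (an assumption on my part about how prepareGenFromBeagle4 scales them), then the constant offset introduced by PL normalization multiplies all three linear values by the same factor and cancels out, so the final probabilities are identical. A quick sketch:

```python
original_gl = [-0.1, -1.0, -10.0]

# GATK's PLs: -10 * GL, then shifted so the lowest PL is 0
raw_pl = [-10 * gl for gl in original_gl]      # [1, 10, 100]
norm_pl = [p - min(raw_pl) for p in raw_pl]    # [0, 9, 99]

# Naive back-conversion: GL = PL / -10 (offset by a constant vs the originals)
converted_gl = [p / -10 for p in norm_pl]      # [0.0, -0.9, -9.9]

def to_probs(gls):
    """Normalize 10**GL so the three genotype probabilities sum to 1."""
    linear = [10 ** gl for gl in gls]
    total = sum(linear)
    return [x / total for x in linear]

# The constant shift cancels in the normalization:
# both GL sets yield the same genotype probabilities.
print(to_probs(original_gl))
print(to_probs(converted_gl))
```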

Apologies for the long post; I would be grateful to hear your thoughts on this issue. Have you used GATK and SHAPEIT consecutively before and run into this problem? Is there a reason to be wary of the potential bias here?

Many thanks in advance, Annique Claringbould

Created 2014-12-05 17:41:36 | Updated | Tags: depthofcoverage haplotypecaller phred quality-scores minor-allele-frequency

HaplotypeCaller has a --min_base_quality_score flag to specify the minimum "base quality required to consider a base for calling". It also has -stand_call_conf and -stand_emit_conf, both of which use a "minimum phred-scaled confidence threshold".

What is the difference between the base quality flag and the phred-scaled confidence thresholds flags in deciding whether or not to call a variant?

Also, why are there separate flags for calling variants and emitting variants? To my way of thinking, calling and emitting variants are one-and-the-same.

Is there any way to specify a minimum minor-allele frequency for calling variants in HaplotypeCaller? Under the default settings, the program expects a 1:1 ratio of each allele in a single diploid organism. However, stochastic variance in which RNA fragments are sequenced and retained could lead to departures from this ratio.

Finally, is there any way to specify the minimum level of coverage for a given variant for it to be considered for calling/genotyping?