3 documentation articles | 0 announcements | 5 forum discussions

Created 2015-07-29 13:55:56 | Updated 2016-04-23 00:12:36 | Tags: genotype likelihoods math phred pl

PL is a sample-level annotation calculated by GATK variant callers such as HaplotypeCaller, recorded in the FORMAT/sample columns of variant records in VCF files. This annotation represents the normalized Phred-scaled likelihoods of the genotypes considered in the variant record for each sample, as described here.

This article clarifies how the PL values are calculated and how this relates to the value of the GQ field.

- The basic math
- Example and interpretation
- Special case: non-reference confidence model (GVCF mode)

The basic formula for calculating PL is:

$$ PL = -10 * \log{P(Genotype | Data)} $$

where `P(Genotype | Data)`

is the conditional probability of the Genotype given the sequence Data that we have observed. The process by which we determine the value of `P(Genotype | Data)`

is described in the genotyping section of the Haplotype Caller documentation.

Once we have that probability, we simply take the log of it and multiply it by -10 to put it into Phred scale. Then we normalize the values across all genotypes so that the PL value of the most likely genotype is 0, which we do simply by subtracting the value of the lowest PL from all the values.

*The reason we like to work in Phred scale is because it makes it much easier to work with the very small numbers involved in these calculations. One thing to keep in mind of course is that Phred is a log scale, so whenever we need to do a division or multiplication operation (e.g. multiplying probabilities), in Phred scale this will be done as a subtraction or addition.*

Here’s a worked-out example to illustrate this process. Suppose we have a site where the reference allele is A, we observed one read that has a non-reference allele T at the position of interest, and we have in hand the conditional probabilities calculated by HaplotypeCaller based on that one read (if we had more reads, their contributions would be multiplied -- or in log space, added).

*Please note that the values chosen for this example have been simplified and may not be reflective of actual probabilities calculated by Haplotype Caller.*

```
# Alleles
Reference: A
Read: T
# Conditional probabilities calculated by HC
P(AA | Data) = 0.000001
P(AT | Data) = 0.000100
P(TT | Data) = 0.010000
```

We want to determine the PLs of the genotype being 0/0, 0/1, and 1/1, respectively. So we apply the formula given earlier, which yields the following values:

Genotype | A/A | A/T | T/T |
---|---|---|---|

Raw PL | -10 * log(0.000001) = 60 | -10 * log(0.000100) = 40 | -10 * log(0.010000) = 20 |

Our first observation here is that the genotype for which the conditional probability was the highest turns out to get the lowest PL value. This is expected because, as described in the VCF FAQ, the PL is the *likelihood* of the genotype, which means (rather unintuitively if you’re not a stats buff) it is the probability that the genotype is **not** correct. So, low values mean a genotype is more likely, and high values means it’s less likely.

At this point we have one more small transformation to make before we emit the final PL values to the VCF: we are going to **normalize** the values so that the lowest PL value is zero, and the rest are scaled relative to that. Since we’re in log space, we do this simply by subtracting the lowest value, 20, from the others, yielding the following final PL values:

Genotype | A/A | A/T | T/T |
---|---|---|---|

Normalized PL | 60 - 20 = 40 | 40 - 20 = 20 | 20 - 20 = 0 |

We see that there is a direct relationship between the scaling of the PLs and the original probabilities: we had chosen probabilities that were each 100 times more or less likely than the next, and in the final PLs we see that the values are spaced out by a factor of 20, which is the Phred-scale equivalent of 100. This gives us a very convenient way to estimate how the numbers relate to each other -- and how reliable the genotype assignment is -- with just a glance at the PL field in the VCF record.

We actually formalize this assessment of genotype quality in the **GQ annotation**, as described also in the VCF FAQ.The value of GQ is simply the difference between the second lowest PL and the lowest PL (which is always 0). So, in our example GQ = 20 - 0 = 20. Note that the value of GQ is capped at 99 for practical reasons, so even if the calculated GQ is higher, the value emitted to the VCF will be 99.

When you run HaplotypeCaller with `-ERC GVCF`

to produce a gVCF, there is an additional calculation to determine the genotype likelihoods associated with the symbolic `<NON-REF>`

allele (which represents the possibilities that remain once you’ve eliminated the REF allele and any ALT alleles that are being evaluated explicitly).

The PL values for any possible genotype that includes the `<NON-REF>`

allele have to be calculated a little differently than what is explained above because HaplotypeCaller cannot directly determine the conditional probabilities of genotypes involving `<NON-REF>`

. Instead, it uses base quality scores to model the genotype likelihoods.

Created 2014-07-23 17:38:17 | Updated 2016-04-23 00:08:57 | Tags: haplotypecaller genotype likelihoods genotyping

This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. See also the documentation on the QUAL score as well as PL and GQ.

Note that this describes the **regular mode** of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in `-ERC`

modes (`GVCF`

and `BP_RESOLUTION`

) please see the reference confidence model documentation.

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.

Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).

Both the HaplotypeCaller and GenotypeGVCFs (but not UnifiedGenotyper) assume that the organism of study is diploid by default, but desired ploidy can be set using the `-ploidy`

argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation. The generalized form of the genotyping algorithm that can handle ploidies other than 2 is available as of version 3.3-0. Note that using ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles that are considered (currently, the maximum "workable" ploidy is ~20 for a max number of alt alleles = 6). Future developments will aim to mitigate those limitations.

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

We apply different genotyping models when genotyping a single sample as opposed to multiple samples together (as done by HaplotypeCaller on multiple inputs or GenotypeGVCFs on multiple GVCFs). The multi-sample case is not currently documented for the public but is an extension of previous work by Heng Li and others.

We use the approach described in Li 2011 to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype **G** given the observed data **D**.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype **G**:

$$ P(G) $$

represents how probably we expect to see this genotype based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value) but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype **G** is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype **G** into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype **G**.

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A and a T at a site, the possible genotypes are AA, AT and TT, and we end up with 3 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.

Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record, including QUAL as well as PL and GQ -- see the linked docs for details. For more information on how the other variant context metrics are calculated, please see the corresponding variant annotations documentation.

Created 2014-07-23 17:36:49 | Updated 2015-09-14 13:43:25 | Tags: haplotypecaller likelihoods haplotype hc

This document describes the procedure used by HaplotypeCaller to evaluate the evidence for variant alleles based on candidate haplotypes determined in the previous step for a given ActiveRegion. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

The previous step produced a list of candidate haplotypes for each ActiveRegion, as well as a list of candidate variant sites borne by the non-reference haplotypes. Now, we need to evaluate how much evidence there is in the data to support each haplotype. This is done by aligning each sequence read to each haplotype using the PairHMM algorithm, which produces per-read likelihoods for each haplotype. From that, we'll be able to derive how much evidence there is in the data to support each variant allele at the candidate sites, and that produces the actual numbers that will finally be used to assign a genotype to the sample.

We originally obtained our list of haplotypes for the ActiveRegion by constructing an assembly graph and selecting the most likely paths in the graph by counting the number of supporting reads for each path. That was a fairly naive evaluation of the evidence, done over all reads in aggregate, and was only meant to serve as a preliminary filter to whittle down the number of possible combinations that we're going to look at in this next step.

Now we want to do a much more thorough evaluation of how much evidence we have for each haplotype. So we're going to take each individual read and align it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm (see Durbin *et al.*, 1998). If you're not familiar with PairHMM, it's a lot like the BLAST algorithm, in that it's a pairwise alignment method that uses a Hidden Markov Model (HMM) and produces a likelihood score. In this use of the PairHMM, the output score expresses the likelihood of observing the read given the haplotype by taking into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). **Note: If reads from a pair overlap at a site and they have the same base, the base quality is capped at Q20 for both reads (Q20 is half the expected PCR error rate). If they do not agree, we set both base qualities to Q0.**

This produces a big table of likelihoods where the columns are haplotypes and the rows are individual sequence reads. **(example figure TBD)**

The table essentially represents how much supporting evidence there is for each haplotype (including the reference), itemized by read.

Having per-read likelihoods for entire haplotypes is great, but ultimately we want to know how much evidence there is for individual alleles at the candidate sites that we identified in the previous step. To find out, we take the per-read likelihoods of the haplotypes and **marginalize them over alleles**, which produces per-read likelihoods for each allele at a given site. In practice, this means that for each candidate site, we're going to decide how much support each read contributes for each allele, based on the per-read haplotype likelihoods that were produced by the PairHMM.

This may sound complicated, but the procedure is actually very simple -- there is no real calculation involved, just cherry-picking appropriate values from the table of per-read likelihoods of haplotypes into a new table that will contain per-read likelihoods of alleles. This is how it happens. For a given site, we list all the alleles observed in the data (including the reference allele). Then, for each read, we look at the haplotypes that support each allele; we select the haplotype that has the highest likelihood for that read, and we write that likelihood in the new table. And that's it! For a given allele, the total likelihood will be the product of all the per-read likelihoods. **(example fig TBD)**

At the end of this step, sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant, and a genotype will be assigned to the sample in the next (final) step.

No articles to display.

Created 2016-03-15 09:50:46 | Updated 2016-03-15 09:56:32 | Tags: haplotypecaller genotype likelihoods phred

Dear all,

I have a question regarding the use of the SHAPEIT genotype calling by consecutive use of GATK, BEAGLE, and SHAPEIT. I have used the GATK haplotype caller, giving rise to an output of .vcf files with a ‘PL’ (normalised phred-scaled likelihood) column, and no ‘GL’ (genotype log10 likelihood) column. BEAGLE can handle PL as input, and the initial genotype calling works fine.

The next step would be to use SHAPEIT for phasing, as described here: [https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gcall]

There is a C++ script described on this page, called prepareGenFromBeagle4, which will convert the BEAGLE .vcf files to the correct SHAPEIT input. However, this script looks for a ‘GL’ column in the .vcf file, which is unavailable since GATK only outputs ‘PL’. The script therefore crashes and I cannot proceed to the phasing step.

I have considered and tested a simple conversion script from PL to GL, where GL = PL/-10 However, since the PL is normalised (genotype with highest likelihood is set to 0), there is some loss of information here.

For example for a given genotype AA/AB/BB, if the original GLs (these are not in the GATK output but say they were) were -0.1 / -1 / -10 The corresponding PLs should be 1 / 10 / 100 And the normalised PLs (GATK output) would be 0 / 10 / 100 Giving rise to these converted GLs after the simple conversion 0 / -1 / -10

The converted GLs are used to then calculate genotype probabilities required for the SHAPEIT input. The issue that I have, is that all three genotype probabilities in the SHAPEIT input need to add up to 1.00 in total. The GL values are somehow scaled to calculate the genotype probabilities. Therefore, the final genotype probabilities in the SHAPEIT input file would turn out differently if I used the 'original' GL values (which I do not have) in comparison to the converted GL values. I am afraid this will introduce bias into the genotype probabilities used by SHAPEIT.

Apologies for the long post, I would be grateful hearing your thoughts on this issue. Have you used GATK and SHAPEIT consecutively before and run into this problem? Is there a reason to be weary of the potential bias here?

Many thanks in advance, Annique Claringbould

I'm analyzing seven trio exomes right now with the latest GATK (version 2.7-4-g6f46d11), and was surprised to find a large number of mendelian violations reported by PhaseByTransmission, even after eliminating low/no coverage events. Tracking down the problem, it seems that CombineVariants occasionally propagates the PL field to the new vcf file incorrectly, sometimes in a way which causes GT not to correspond to the lowest PL.

Here's an example, showing just the GT, AD, and PL columns for a few positions in one trio. For each position, the first line contains the genotypes from the original vcf file, and the second shows the genotypes from the merged file.

#CHROM POS ID REF ALT 100403001-1 100403001-1A 100403001-1B 1 5933530 rs905469 A G 0/0:37,0:0,99,1192 0/0:35,0:0,90,1101 0/0:44,0:0,117,1412 1 5933530 rs905469 A G 0/0:37,0:189,15,1192 0/0:35,0:0,90,1101 0/0:44,0:0,117,1412 1 10412636 rs4846215 A T 0/0:119,0:0,358,4297 0/0:113,0:0,337,4060 0/0:102,0:0,304,3622 1 10412636 rs4846215 A T 0/0:119,0:110,9,0 0/0:113,0:0,337,4060 0/0:102,0:0,304,3622 1 11729035 rs79974326 G C 0/0:50,0:0,141,1709 0/0:53,0:0,150,1788 0/0:71,0:0,187,2246 1 11729035 rs79974326 G C 0/0:50,0:1930,0,3851 0/0:53,0:0,150,1788 0/0:71,0:0,187,2246 1 16735764 rs182873855 G A 0/0:54,0:0,138,1691 0/0:57,0:0,153,1841 0/0:47,0:0,120,1441 1 16735764 rs182873855 G A 0/0:54,0:174,0,1691 0/0:57,0:0,153,1841 0/0:47,0:0,120,1441 1 17316577 rs77880760 G T 0/0:42,0:0,123,1470 0/0:38,0:0,111,1317 0/0:53,0:0,153,1817 1 17316577 rs77880760 G T 0/0:42,0:233,17,1470 0/0:38,0:0,111,1317 0/0:225,25:0,153,181 1 28116000 rs2294229 A G 0/0:37,0:0,105,1291 0/0:37,0:0,111,1379 0/0:30,0:0,87,1066 1 28116000 rs2294229 A G 0/0:37,0:0,105,1291 0/0:37,0:0,111,1379 0/0:30,0:1844,159,0 1 31740706 rs3753373 A G 0/0:123,0:0,349,4173 0/0:110,0:0,319,3793 0/0:111,0:0,328,3885 1 31740706 rs3753373 A G 0/0:123,0:117,6,0 0/0:110,0:0,319,3793 0/0:111,0:0,328,3885

Most genotypes are propagated correctly, and in fact, which a propagated incorrectly changes from run to run.

In my case, I'm merging files from disjoint regions, so I can work around the problem, but it would be nice if this were fixed.

Thanks, Kevin

Created 2013-07-05 18:46:03 | Updated | Tags: unifiedgenotyper likelihoods

Greetings,

I am trying to incorporate genotype likelihoods into a downstream analysis. I have two questions:

1) Why is the most likely genotype scaled to a Phred score of zero?

2) Is there a way to undo the scaling? I have seen downstream tools undo the scaling, but I don't know how they do it. Is there an equation that will return an estimated genotype likelihood from the scaled genotype likelihoods?

Thank you for your time.

Zev Kronenberg

Created 2013-01-28 06:08:54 | Updated 2013-01-28 06:18:43 | Tags: producebeagleinput likelihoods beagle community

Dear GATK team and community members,

I used ProduceBeagleInput to create a genotype likelihoods file, and ran beagle.jar according to the example in http://gatkforums.broadinstitute.org/discussion/43/interface-with-beagle-software. Beagle gave a warning that it is better to use a reference panel for imputing genotypes and phasing. So I downloaded the recommended reference panel (http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes.phase1_release_v3/), but Beagle requires that the alleles be in the same order on both reference and sample files. The tool to do this is check_strands.py (http://faculty.washington.edu/sguy/beagle/strand_switching/README), but it requires both sample and reference files be in .bgl format. This is a little disappointing since not being able to use the reference panel means Beagle's calculations won't be as accurate, although I'm not sure by how much.

I understand that this might be out of the scope of responsibility for the GATK team, but I will greatly appreciate if someone can provide suggestions to allow GATK's input to Beagle be phased using a reference panel. Or hopefully, the GATK team will write a tool to produce .bgl files?

Regards, Jamie

Created 2012-10-15 20:48:48 | Updated 2012-10-16 02:23:04 | Tags: likelihoods

The printed values missed the PL value, for examples, the format is:

```
GT:AD:DP:GQ:PL
['0/2', '1,0,10', '11', '8.12']
['0/2', '211,39,0', '250', '99']
['0/1', '10,1', '11', '14.38']
['0/1', '4,2', '4', '24.38']
['0/0', '27,0', '27', '78.26']
['1/1', '164,2', '183', '99']
['0/1', '242,1', '249', '99']
['0/1', '225,0', '233', '99']
['0/0', '84,5', '82', '81.18']
```

For every case, the PL value is missing. It happens most often when there are more than one alternative alleles.

Thank you,

Jim