Tagged with #genotype
2 documentation articles | 1 announcement | 21 forum discussions

Created 2015-07-29 13:55:56 | Updated 2016-04-23 00:12:36 | Tags: genotype likelihoods math phred pl

Comments (0)

PL is a sample-level annotation calculated by GATK variant callers such as HaplotypeCaller, recorded in the FORMAT/sample columns of variant records in VCF files. This annotation represents the normalized Phred-scaled likelihoods of the genotypes considered in the variant record for each sample, as described here.

This article clarifies how the PL values are calculated and how this relates to the value of the GQ field.


  1. The basic math
  2. Example and interpretation
  3. Special case: non-reference confidence model (GVCF mode)

1. The basic math

The basic formula for calculating PL is:

$$ PL = -10 * \log{P(Genotype | Data)} $$

where P(Genotype | Data) is the conditional probability of the Genotype given the sequence Data that we have observed. The process by which we determine the value of P(Genotype | Data) is described in the genotyping section of the Haplotype Caller documentation.

Once we have that probability, we simply take the log of it and multiply it by -10 to put it into Phred scale. Then we normalize the values across all genotypes so that the PL value of the most likely genotype is 0, which we do simply by subtracting the value of the lowest PL from all the values.

The reason we like to work in Phred scale is because it makes it much easier to work with the very small numbers involved in these calculations. One thing to keep in mind of course is that Phred is a log scale, so whenever we need to do a division or multiplication operation (e.g. multiplying probabilities), in Phred scale this will be done as a subtraction or addition.

2. Example and interpretation

Here’s a worked-out example to illustrate this process. Suppose we have a site where the reference allele is A, we observed one read that has a non-reference allele T at the position of interest, and we have in hand the conditional probabilities calculated by HaplotypeCaller based on that one read (if we had more reads, their contributions would be multiplied -- or in log space, added).

Please note that the values chosen for this example have been simplified and may not be reflective of actual probabilities calculated by Haplotype Caller.

# Alleles
Reference: A
Read: T

# Conditional probabilities calculated by HC 
P(AA | Data) = 0.000001
P(AT | Data) = 0.000100
P(TT | Data) = 0.010000

Calculate the raw PL values

We want to determine the PLs of the genotype being 0/0, 0/1, and 1/1, respectively. So we apply the formula given earlier, which yields the following values:

Genotype A/A A/T T/T
Raw PL -10 * log(0.000001) = 60 -10 * log(0.000100) = 40 -10 * log(0.010000) = 20

Our first observation here is that the genotype for which the conditional probability was the highest turns out to get the lowest PL value. This is expected because, as described in the VCF FAQ, the PL is the likelihood of the genotype, which means (rather unintuitively if you’re not a stats buff) it is the probability that the genotype is not correct. So, low values mean a genotype is more likely, and high values means it’s less likely.


At this point we have one more small transformation to make before we emit the final PL values to the VCF: we are going to normalize the values so that the lowest PL value is zero, and the rest are scaled relative to that. Since we’re in log space, we do this simply by subtracting the lowest value, 20, from the others, yielding the following final PL values:

Genotype A/A A/T T/T
Normalized PL 60 - 20 = 40 40 - 20 = 20 20 - 20 = 0

We see that there is a direct relationship between the scaling of the PLs and the original probabilities: we had chosen probabilities that were each 100 times more or less likely than the next, and in the final PLs we see that the values are spaced out by a factor of 20, which is the Phred-scale equivalent of 100. This gives us a very convenient way to estimate how the numbers relate to each other -- and how reliable the genotype assignment is -- with just a glance at the PL field in the VCF record.

Genotype quality

We actually formalize this assessment of genotype quality in the GQ annotation, as described also in the VCF FAQ.The value of GQ is simply the difference between the second lowest PL and the lowest PL (which is always 0). So, in our example GQ = 20 - 0 = 20. Note that the value of GQ is capped at 99 for practical reasons, so even if the calculated GQ is higher, the value emitted to the VCF will be 99.

3. Special case: non-reference confidence model (GVCF mode)

When you run HaplotypeCaller with -ERC GVCF to produce a gVCF, there is an additional calculation to determine the genotype likelihoods associated with the symbolic <NON-REF> allele (which represents the possibilities that remain once you’ve eliminated the REF allele and any ALT alleles that are being evaluated explicitly).

The PL values for any possible genotype that includes the <NON-REF> allele have to be calculated a little differently than what is explained above because HaplotypeCaller cannot directly determine the conditional probabilities of genotypes involving <NON-REF>. Instead, it uses base quality scores to model the genotype likelihoods.

Created 2014-07-23 17:38:17 | Updated 2016-04-23 00:08:57 | Tags: haplotypecaller genotype likelihoods genotyping

Comments (0)

This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. See also the documentation on the QUAL score as well as PL and GQ.

Note that this describes the regular mode of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in -ERC modes (GVCF and BP_RESOLUTION) please see the reference confidence model documentation.


The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.

1. Preliminary assumptions / limitations


Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).


Both the HaplotypeCaller and GenotypeGVCFs (but not UnifiedGenotyper) assume that the organism of study is diploid by default, but desired ploidy can be set using the -ploidy argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation. The generalized form of the genotyping algorithm that can handle ploidies other than 2 is available as of version 3.3-0. Note that using ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles that are considered (currently, the maximum "workable" ploidy is ~20 for a max number of alt alleles = 6). Future developments will aim to mitigate those limitations.

Paired end reads

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

Single-sample vs multi-sample

We apply different genotyping models when genotyping a single sample as opposed to multiple samples together (as done by HaplotypeCaller on multiple inputs or GenotypeGVCFs on multiple GVCFs). The multi-sample case is not currently documented for the public but is an extension of previous work by Heng Li and others.

2. Calculating genotype likelihoods using Bayes' Theorem

We use the approach described in Li 2011 to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype G given the observed data D.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype G:

$$ P(G) $$

represents how probably we expect to see this genotype based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value) but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype G is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype G into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype G.

3. Selecting a genotype and emitting the call record

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A and a T at a site, the possible genotypes are AA, AT and TT, and we end up with 3 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.

Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record, including QUAL as well as PL and GQ -- see the linked docs for details. For more information on how the other variant context metrics are calculated, please see the corresponding variant annotations documentation.

Created 2014-10-22 16:33:50 | Updated 2014-10-22 16:39:11 | Tags: genotype genotype-refinement mendelianviolations denovo

Comments (0)

The core GATK Best Practices workflow has historically focused on variant discovery --that is, the existence of genomic variants in one or more samples in a cohorts-- and consistently delivers high quality results when applied appropriately. However, we know that the quality of the individual genotype calls coming out of the variant callers can vary widely based on the quality of the BAM data for each sample. To address the increasing need for more accurate genotypes, especially in the context of analyses performed on families, we have developed a new workflow that uses additional data such as pedigrees and genotype priors to improve the accuracy of genotype calls, to filter genotype calls that are not reliable enough for downstream analysis and to tag possible de novo mutations.

The Genotype Refinement workflow is meant to serve as an optional extension of the variant calling workflow, intended for researchers whose work requires high-quality identification of individual genotypes. It is documented in a new method article, with mathematical details available separately. See the corresponding tutorial for step-by-step instructions on how to actually run it in practice.

Note that although all tools involved in this workflow are already available in GATK 3.2, some functionalities will only be available in the latest development version (see nightly builds in the Downloads section) until the release of version 3.3 (which is imminent).

Let us know how the new workflow works for you; as usual, we crave feedback and are happy to answer any and all questions.

Created 2016-05-19 08:43:40 | Updated 2016-05-19 08:44:42 | Tags: haplotypecaller vcf genotype reference-confidence-model

Comments (3)

Hi, Just wondering what the possible reasons could be for Haplotype Caller (version: 3.5-0-g36282e4) to declare a reference genotype quality of 0 for positions where the read depth is relatively high, such as the following region:

SC1 1628 . T . . PASS DP=50;GC=33.33 GT:AD:DP:RGQ 0/0:50:50:99 SC1 1629 . T . . PASS DP=50;GC=38.1 GT:AD:DP:RGQ 0/0:50:50:99 SC1 1630 . T . . FAIL_RGQ DP=50;GC=38.1 GT:AD:DP:RGQ 0/0:45:50:5 SC1 1631 . C . . PASS DP=51;GC=33.33 GT:AD:DP:RGQ 0/0:51:51:99 SC1 1632 . A . . PASS DP=51;GC=33.33 GT:AD:DP:RGQ 0/0:51:51:96

In this instance the RGQ of the failed position above (at SC1:1630) was actually 5 (which I set as the threshold for filtering in this example), but I have plenty of instances where the read depth and resultant RGQ are like:

SC1 1630 . T . . FAIL_RGQ DP=50;GC=38.1 GT:AD:DP:RGQ ./.:45:50:5 SC1 1640 . T . . FAIL_RGQ DP=48;GC=38.1 GT:AD:DP:RGQ ./.:34:48:0 SC1 1805 . T . . FAIL_RGQ DP=36;GC=33.33 GT:AD:DP:RGQ ./.:32:36:0 SC1 2046 . A . . FAIL_RGQ DP=37;GC=19.05 GT:AD:DP:RGQ ./.:33:37:2 SC1 2345 . A . . FAIL_RGQ DP=105;GC=23.81 GT:AD:DP:RGQ ./.:90:105:0 SC1 2352 . A . . FAIL_RGQ DP=116;GC=19.05 GT:AD:DP:RGQ ./.:103:116:0 SC1 2356 . C . . FAIL_RGQ DP=112;GC=23.81 GT:AD:DP:RGQ ./.:100:112:0 SC1 2359 . G . . FAIL_RGQ DP=111;GC=28.57 GT:AD:DP:RGQ ./.:99:111:0

It feels like something funny is going on. Should it be possible for RGQ to be so low with such high depth? Also, I thought the AD format tag gave the count of unfiltered reads, whilst the format DP tag gave the filtered read depth (i.e. reads HC finds informative). Therefore shouldn't the AD count always be at least as high as the DP count?

Created 2016-05-19 01:27:00 | Updated 2016-05-19 01:32:05 | Tags: genotype read-depth gatk3-4 gatk3-5

Comments (1)

Hello, I'm using GATK and I found this - "zero depth with genotype"

As INFO column said, DP=54, however, in sample column DP:0 with GT:1/1. And reads are found by looking bam file with samtools tview.

Could you help me how to interpret this?


Kenneth Joohyun Han

ps. this is found both GATK v3.5 and v3.4 . GATK v3.5 chr3 150421527 . C CTCCTCCTCCTCCTCCACCTCT 408.76 PASS AC=2;AF=1.00;AN=2;DP=54;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL 1/1:0,0:0:22:446,22,0

GATK v3.4 chr3 150421527 . C CTCCTCCTCCTCCTCCACCTCT 417.79 PASS AC=2;AF=1.00;AN=2;DP=54;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;SOR=0.693;set=variant2 GT:AD:DP:GQ:PL 1/1:0,0:0:22:446,22,0

Created 2016-05-18 10:20:34 | Updated | Tags: selectvariants variantfiltration genotype no-calls

Comments (1)


I have been trying to find a way to mark specific individual genotypes as No Call.

I know that in VariantFiltration it is possible to add the option --setFilteredGTToNocall in order to mark filtered genotypes as no call. However, in my case, there is no available filter corresponding to my criteria. Let me explain. I have some diploid-haploid paired related samples, i.e. I expect the haploid individual to share one of the two alleles from its diploid related. Therefore, for positions where there are discordant genotypes, I would like to mark individual genotypes as as no call in order to not include potentials errors in my data. I couldn't find a way to apply this rational to the pipeline of VariantFiltration or SelectVariants...

Eventually, I manage to manipulate the .vcf file by myself to mark any discordant genotypes as . or ./. However, when later I want to select the positions for which there are maximum of 0.2 ratio of no call (using the option --maxNOCALLfraction of SelectVariants from the nightly build version), there was an error message "there aren't enough comumns for line...". I suppose that it's because I did manipulate the vcf file by myself and there could have been errors (although I'm quite confident with my script).

Therefore, I think it's better to not manipulate the vcf file and try to do things using the pipeline....but how?

Created 2016-04-18 16:19:46 | Updated | Tags: genotype pooled-sequencing-haplotypecaller

Comments (1)

Hi, I have a VCF file obtained from GATK with the HaplotypeCaller analysis type. I have some difficulties in the interpretation of the genotype of the variants called. How should I interpret the following genotypes:1/2,1/3,2/3,3/3,0/2 and so on? Thank you for your help. Best regards,


Created 2016-03-20 19:57:09 | Updated | Tags: unifiedgenotyper genotype

Comments (1)

I know HC call genotype, but does unifiedgenotyper also call genotype or just variants?

Created 2016-03-15 09:50:46 | Updated 2016-03-15 09:56:32 | Tags: haplotypecaller genotype likelihoods phred

Comments (5)

Dear all,

I have a question regarding the use of the SHAPEIT genotype calling by consecutive use of GATK, BEAGLE, and SHAPEIT. I have used the GATK haplotype caller, giving rise to an output of .vcf files with a ‘PL’ (normalised phred-scaled likelihood) column, and no ‘GL’ (genotype log10 likelihood) column. BEAGLE can handle PL as input, and the initial genotype calling works fine.

The next step would be to use SHAPEIT for phasing, as described here: [https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gcall]

There is a C++ script described on this page, called prepareGenFromBeagle4, which will convert the BEAGLE .vcf files to the correct SHAPEIT input. However, this script looks for a ‘GL’ column in the .vcf file, which is unavailable since GATK only outputs ‘PL’. The script therefore crashes and I cannot proceed to the phasing step.

I have considered and tested a simple conversion script from PL to GL, where GL = PL/-10 However, since the PL is normalised (genotype with highest likelihood is set to 0), there is some loss of information here.

For example for a given genotype AA/AB/BB, if the original GLs (these are not in the GATK output but say they were) were -0.1 / -1 / -10 The corresponding PLs should be 1 / 10 / 100 And the normalised PLs (GATK output) would be 0 / 10 / 100 Giving rise to these converted GLs after the simple conversion 0 / -1 / -10

The converted GLs are used to then calculate genotype probabilities required for the SHAPEIT input. The issue that I have, is that all three genotype probabilities in the SHAPEIT input need to add up to 1.00 in total. The GL values are somehow scaled to calculate the genotype probabilities. Therefore, the final genotype probabilities in the SHAPEIT input file would turn out differently if I used the 'original' GL values (which I do not have) in comparison to the converted GL values. I am afraid this will introduce bias into the genotype probabilities used by SHAPEIT.

Apologies for the long post, I would be grateful hearing your thoughts on this issue. Have you used GATK and SHAPEIT consecutively before and run into this problem? Is there a reason to be weary of the potential bias here?

Many thanks in advance, Annique Claringbould

Created 2016-01-05 20:37:22 | Updated | Tags: mutect genotype vcf-file

Comments (1)


I have noticed that when running MuTect, the variant calls in my output always have a genotype of 0/1. I have not seen any 1/1 genotypes, even when the variant allele frequency is 100%.

Are there any instances when MuTect gives a 1/1 genotype for a variant call?

Thank you, Jeremy

Created 2015-11-26 10:31:36 | Updated 2015-11-26 10:32:29 | Tags: haplotypecaller readbackedphasing genotype phasing

Comments (1)

Hi, I know that HaplotypeCaller outputs physical phasing information in PG and PGT fields, so as not to confuse physical phasing with pedigree based methods, however if I wanted to use the physical phasing information in downstream tools which rely on the | character in the GT tag and the PS identifier tag, is it sufficient to copy this information from the PG and PGT fields into the GT and PS tags respectively for all samples which were genotyped (I notice that the genotype in PGT and GT may not necessarily agree after you run genotypeGVCFs - it looks like genotypeGVCFs updates the called GT from HaplotypeCaller, but the physical phasing information from is unchanged)?

Created 2014-02-27 15:25:19 | Updated 2014-02-27 22:40:50 | Tags: unifiedgenotyper genotype vcf-format gt

Comments (10)

generated with gatk 2.8-1-g932cd3a

Although it is rare I see Genotype Fields that are inconsistent with the AD values (Read as table):

CHROM   POS ID  REF ALT FILTER  QUAL    ABHet   ABHom   AC  AF  AN  BaseCounts  BaseQRankSum    DP  Dels    FS  GC  HRun    HaplotypeScore  LowMQ   MLEAC   MLEAF   MQ  MQ0 MQRankSum   MeanDP  MinDP   OND PercentNBaseSolid   QD  ReadPosRankSum  Samples Somatic VariantType cosmic.ID   1.AB    1.AD    1.DP    1.F 1.GQ    1.GT    1.MQ0   1.PL    1.Z 2.AB    2.AD    2.DP    2.F 2.GQ    2.GT    2.MQ0   2.PL    2.Z 3.AB    3.AD    3.DP    3.F 3.GQ    3.GT    3.MQ0   3.PL    3.Z 4.AB    4.AD    4.DP    4.F 4.GQ    4.GT    4.MQ0   4.PL    4.Z 5.AB    5.AD    5.DP    5.F 5.GQ    5.GT    5.MQ0   5.PL    5.Z
11  92616485    0   A   C   PASS    63.71   0.333   0.698   1   0.1 10  89,54,0,0   -5.631  143 0   49.552  71.29   2   4.4154  0.0000,0.0000,143   1   0.1 50.27   0   -1.645  28.6    16  0.242   0   2.36    2.125   R5_A3_1 NA  SNP COSM467570  NA  24,9    33  0.2727272727    54  A/A 0   0,54,537    -1.3055824197   0.33    9,18    27  0.6666666667    96  A/C 0   96,0,178    0.8660254038    NA  21,11   32  0.34375 21  A/A 0   0,21,466    -0.8838834765   NA  12,4    16  0.25    27  A/A 0   0,27,272    -1  NA  23,12   35  0.3428571429    42  A/A 0   0,42,537    -0.9296696802

This shows that for example sample 5 has a AD value of '23,12' and a GT of 'A/A' aka homyzougous reference allele. I've included a screenshot wich shows low base quality and complete strand bias (Which I suspect to mis variants). So whats the prob? and how can i recalculate the GT's based on AD? because i cannot filter based on genotypes when they are buggy....

Created 2013-12-09 10:40:23 | Updated | Tags: unifiedgenotyper genotype conflict

Comments (4)

Dear Team, I was looking at a VCF file produced with UnifiedGenotyper (2.4.9). It is a multisample call and, for a limited number of calls, I have genotypes that are telling the exact opposite of AD field, as in this case

GT:AD:DP:GQ:PL  1/1:10,1:11:3:24,3,0


GT:AD:DP:GQ:PL  1/1:18,1:19:3:22,3,0

I have ten reads supporting the reference allele, 1 read supporting the alternate and the genotype is 1/1. This is happening in ~200 sites per sample in my dataset. I've checked the other way around and I found <100 sites in which the genotype is called 0/0 and the AD suggests 1/1 or (more frequently) 0/1. This seems to happen in sites in which the number of variant samples is low (no more than 3 samples in a set of ~50 samples) and it is puzzling me a lot. Can you give me a comment on why this is happening? Thanks


Created 2013-11-22 02:21:52 | Updated | Tags: combinevariants genotype likelihoods

Comments (7)

I'm analyzing seven trio exomes right now with the latest GATK (version 2.7-4-g6f46d11), and was surprised to find a large number of mendelian violations reported by PhaseByTransmission, even after eliminating low/no coverage events. Tracking down the problem, it seems that CombineVariants occasionally propagates the PL field to the new vcf file incorrectly, sometimes in a way which causes GT not to correspond to the lowest PL.

Here's an example, showing just the GT, AD, and PL columns for a few positions in one trio. For each position, the first line contains the genotypes from the original vcf file, and the second shows the genotypes from the merged file.

#CHROM  POS       ID           REF  ALT  100403001-1           100403001-1A           100403001-1B        

1       5933530   rs905469     A    G    0/0:37,0:0,99,1192    0/0:35,0:0,90,1101     0/0:44,0:0,117,1412 
1       5933530   rs905469     A    G    0/0:37,0:189,15,1192  0/0:35,0:0,90,1101     0/0:44,0:0,117,1412 

1       10412636  rs4846215    A    T    0/0:119,0:0,358,4297  0/0:113,0:0,337,4060   0/0:102,0:0,304,3622
1       10412636  rs4846215    A    T    0/0:119,0:110,9,0     0/0:113,0:0,337,4060   0/0:102,0:0,304,3622

1       11729035  rs79974326   G    C    0/0:50,0:0,141,1709   0/0:53,0:0,150,1788    0/0:71,0:0,187,2246 
1       11729035  rs79974326   G    C    0/0:50,0:1930,0,3851  0/0:53,0:0,150,1788    0/0:71,0:0,187,2246 

1       16735764  rs182873855  G    A    0/0:54,0:0,138,1691   0/0:57,0:0,153,1841    0/0:47,0:0,120,1441 
1       16735764  rs182873855  G    A    0/0:54,0:174,0,1691   0/0:57,0:0,153,1841    0/0:47,0:0,120,1441 

1       17316577  rs77880760   G    T    0/0:42,0:0,123,1470   0/0:38,0:0,111,1317    0/0:53,0:0,153,1817 
1       17316577  rs77880760   G    T    0/0:42,0:233,17,1470  0/0:38,0:0,111,1317    0/0:225,25:0,153,181

1       28116000  rs2294229    A    G    0/0:37,0:0,105,1291   0/0:37,0:0,111,1379    0/0:30,0:0,87,1066  
1       28116000  rs2294229    A    G    0/0:37,0:0,105,1291   0/0:37,0:0,111,1379    0/0:30,0:1844,159,0 

1       31740706  rs3753373    A    G    0/0:123,0:0,349,4173  0/0:110,0:0,319,3793   0/0:111,0:0,328,3885
1       31740706  rs3753373    A    G    0/0:123,0:117,6,0     0/0:110,0:0,319,3793   0/0:111,0:0,328,3885

Most genotypes are propagated correctly, and in fact, which a propagated incorrectly changes from run to run.

In my case, I'm merging files from disjoint regions, so I can work around the problem, but it would be nice if this were fixed.

Thanks, Kevin

Created 2013-07-23 19:44:06 | Updated | Tags: variantannotator vcf genotype annotation

Comments (9)


Could you tell me how to encourage GATK to annotate my genotype columns (i.e. add annotations to the FORMAT and PANC_R columns in the following file):

chrX 259221 . GA G 136.74 . AC=2;AF=1.00;AN=2;DP=15;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=8.82;MQ0=1;QD=3.04 GT:AD:GQ:PL 1/1:0,2:6:164,6,0

The file was generated with HaplotypeCaller. I used a command line similar to this one to no effect:

java -jar $GATKROOT/GenomeAnalysisTK.jarT VariantAnnotator -R hg19_random.fa -I chr7_recalibrated.bam -V chr7.vcf --dbsnpdbSNP135_chr.vcf -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -o chr7_annotated-again.vcf

Does anyone have any suggestions? Thanks in advance!

Created 2013-07-23 16:20:21 | Updated | Tags: phasebytransmission readbackedphasing trio genotype

Comments (4)

I am doing a WGS project on a family with seven siblings. We have data on the mother but the father passed many years ago. I tried splitting variant recalibrated vcf file and ped file into "trios" with just the mother and a sibling (seven times) then running PhaseByTransmission on the combined vcf. The job was successfully completed but nothing appears phased (all "/," and no "|") in the output vcf. I also tried the variant recalibrated vcf file separately with ReadBackedPhasing. The job was successfully completed as well but again nothing appears phased (all "/" and no "|" or assigned "PQ" scores). The ProduceBeagleInput walker (to use Beagle for genotype refinement) appears to only support unrelated individuals and my set involves related individuals. Do you have any other suggestions for phasing incomplete "trios?" Thanks in advance!

Created 2012-12-14 05:58:16 | Updated 2013-01-07 19:46:55 | Tags: genotype completegenomics

Comments (1)

Hi. I have converted the tsv file from Complete genomics(CGA tool) to vcf format. but, when I run programs, it says some error in my vcf format. I carefully examined the vcf file and found, in some SNVs the GT is 1|.(dot). Is it a valid vcf file or is there any problem in vcf file? The program showed the error in those lines in which GT is 1|.(dot) i.e half genotype information. The line looks like this:

1   55164   .   C   A   .   .   NS=1;AN=1;AC=1;CGA_XR=dbsnp.103|rs3091274;CGA_RPT=L2|L2|49.7;CGA_SDO=2  GT:PS:FT:HQ:EHQ:CGA_CEHQ:GL:CGA_CEGL:DP:AD:CGA_RDP  1/.:.:VQLOW:30,.:30,.:8,.:-30,0,0:-8,0,0:18:18,.:0

Created 2012-12-04 20:12:44 | Updated 2012-12-04 20:15:44 | Tags: varianteval jexl genotype

Comments (1)

While running VariantEval, I'm trying to stratify by a JexlExpression by setting using

-ST Sample -ST JexlExpression -select "GQ>20"

This fails with a "variable does not exist" error despite the GQ existing in all genotypes in the vcf. Looking at the code it seems that the pathway that loads the JexlExpression in the VariantEval class specifically does not provide the genotype as context (only the VariantContext) and thus, the context for the Jexl does not include GT and the error is produced.

My question is: Is this a feature or a bug? It seems possible to add the genotype (when the VC only has one, or loop over the genotypes and either OR or AND the results (perhaps another input similar to -isr?), but perhaps I'm missing something subtle?

Would you like this behavior or are you happy with the current operation of jexlExpression?


Created 2012-10-18 11:59:05 | Updated 2012-10-19 01:11:34 | Tags: best-practices genotype threshold

Comments (4)


I'm sorry if I'm being dense (I'm new to all this and it is making me feel very dense indeed!), but having read the section on 'Selecting an appropriate quality score threshold' on the 'Best Practice Variant Detection' page, I am still unclear as to whether you mean I should be looking for a QUAL score of at least 30 in a deep coverage data set and should filter out any suggested SNPs that don't meet this, or a GQ score of 30 in each individual sample genotyped at the SNP in question and I only need to filter out individual samples that don't meet this threshold.

Please can you clarify?

I have pasted the bit of text I read below, just to make it clear to which bit I am referring.

Many thanks!

A common question is the confidence score threshold to use for variant detection. We recommend:

Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.

Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.

Created 2012-10-15 21:16:43 | Updated 2012-10-18 00:42:30 | Tags: unifiedgenotyper genotype

Comments (11)

Hello the team,

For some genotypes, it seems are wrong, I know it's model based, and base q, map q, etc are considered in the model. I also read this link: http://gatkforums.broadinstitute.org/discussion/1235/why-didnt-the-unified-genotyper-call-my-snp-i-can-see-it-right-there-in-igv#latest But my case are special, the format is (ref allele count)/(alternative allele count) genotype call: 22/24 0/0 109/125 0/0 85/109 0/0 26/32 0/0 40/161 0/0 195/6 1/1 239/5 1/1 83/6 1/1 46/28 1/1

In one case, the two variants are adjacent to each other. In some case, they are one base indels.



Created 2012-10-11 18:54:36 | Updated 2012-10-18 01:30:25 | Tags: unifiedgenotyper haplotypecaller genotype all-sites

Comments (2)

I've tried to get genotype for all sites provided in interval file using haplotypeCaller. If using unifiedGenotyper, I can get the result by setting "output_mode EMIT_ALL_SITES". But haplotypeCaller doesn't report as expected by "output_mode EMIT_ALL_SITES". Even though I set "genotypeFullActiveRegion" or "fullHaplotype", haplotypeCaller doesn't seem to emit genotype at all sites. How to get desirable result using haplotypeCaller? Thanks!

Created 2012-10-11 18:53:13 | Updated 2012-10-18 01:33:10 | Tags: genotype heterzygosity homozygosity

Comments (1)

I have an inbred mouse strain that I am sequencing and there should be little to NO heterozygosity. Yet with the default settings of UGT -heterozygosity (which is 0.001) many homs are being called as hets. When 230/250 reads are alternate and 20/250 are reference, it calls a het, even though it should be homozygous alternate.

What do you recommendations for this setting for inbred animals?

thanks, GATK is great!


Created 2012-10-05 22:43:09 | Updated 2012-10-18 01:09:20 | Tags: unifiedgenotyper bam genotype

Comments (8)

I was running UnifiedGenotyper on a set of 26 bam files. There was one particular position where I was comparing calls to the actual pileup and I noticed a major discrepancy. There was a no-call ("./.") for that position for one of the bam files while most other samples had calls. That non-called sample, though, had a very convincing variant in the pileup, with lots of high quality coverage at that position.

I then tried running just that bam file alone through UnifiedGenotyper, or that bam file along with two others. In both cases, the 1/1 variant is called properly with the following genotype field:

GT:AD:DP:GQ:MQ0:PL 1/1:0,66:66:99:0:2337,187,0

This seems to me to be a serious bug. Is this anything that's been noted before?

I am running GATKLite version 2.1-3-ge1dbcc8


Created 2012-09-29 01:07:11 | Updated 2012-10-01 17:08:13 | Tags: genotype algorithm

Comments (6)


I am observing the following scenario at one particular SNP (C/G) using two different enrichment technologies: (I am using IGV syntax: ALLELE|number of reads w/ allele|%of total reads|+strand reads|- strand reads)

  • technology1:
    C: 15 47% 15+,0- G: 17 53% 17+,0-
  • technology2:
    C: 17 37% 13+,4- G: 29 63% 26+,3- As you can see both technologies have good coverage of the SNP and also good representation of each allele. SNP(C/G) does not get called in technology1.

My questions are: 1- Does the GATK algorithm have some sort of constraint on the proportion of reads coming from only one strand (as with technology1) in order to try to predict or discard duplicates? 2- I know that the base call of a particular base is bounded by the mapping quality of its read. If my --stand_call_conf is 30 and one of the bases at this SNP position has MQ<30 does this avoid this position getting called? Or is it more like the avg(MQ) has to be >30 (meaning more than one read at this position is taken into account)?

Thanks for any clarification, Gene