It used to be that the first rule of GATK was: don't talk about the QUAL score. No more! This document covers the key points in loving detail. Figures are hand-drawn and scanned for now; we'll try to redo them cleanly when we find a bit of time (don't hold your breath though).
It's the Phred-scaled posterior probability that all samples in your callset are homozygous reference.
Basically, we're trying to give you the probability that all the variant evidence you saw in your data is wrong.
If you have just a handful of low quality reads, your QUAL will be pretty low. Possibly too low to emit -- we typically use a threshold of 10 to emit and 30 to call in genotyping, either via HaplotypeCaller in "normal" (non-GVCF) mode or via GenotypeGVCFs (in the GVCF workflow, HaplotypeCaller sets both emit and call thresholds to 0 and emits everything to the GVCF).
However, if you have a lot of variant reads, your QUAL will be much higher. But because it only describes the probability that all of your data is erroneous, it has trouble distinguishing between a small number of reads with high quality mismatches and a large number of reads with low quality mismatches. That's why we recommend using QualByDepth (the QUAL normalized by depth of reads supporting the variant) as an annotation for VQSR: it yields higher annotation values for sites supported by high quality reads and lower values for big piles of weak evidence.
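To illustrate the distinction, here is a minimal sketch (with made-up numbers; the function name is ours, not GATK's) of how a QualByDepth-style normalization separates the two cases. The real GATK computation differs in details, e.g. exactly which reads count toward the depth:

```python
def qual_by_depth(qual, informative_depth):
    """QualByDepth-style normalization: QUAL divided by the depth of
    reads supporting the variant (simplified sketch)."""
    return qual / informative_depth

# A few high-quality variant reads and a big pile of weak evidence
# can produce similar raw QUAL values...
few_strong = qual_by_depth(qual=300, informative_depth=10)   # 30.0
many_weak = qual_by_depth(qual=300, informative_depth=150)   # 2.0
# ...but QD separates them: high for strong evidence, low for weak.
```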
Heng Li's 2011 paper, section 2.3.5 (there are other copies elsewhere) gives the equations for the biallelic case. It's a recursive relation, so we have to use a dynamic programming algorithm (as you may have seen in the chapter on pairwise alignments in the Durbin et al. "Biological Sequence Analysis" book).
This lovely diagram lays it all out:
S_1...S_N are your N samples, which we're going to evaluate cumulatively as we move across the columns of the matrix. Here we're being very general and allowing each sample to have a different ploidy, which we'll represent with p_i. Thus the total number of chromosomes is Sum{p_i}=P.
We're interested in the S_N column because that represents the AC calculations once we take into account all N samples. The S_N column still isn't our final joint likelihood because we added the samples in a particular order, but more on that later.
We calculate the joint likelihood across samples for all ACs from zero to the total number of chromosomes. We look at all ACs because we also use this calculation to determine the MLEAC that gets reported as part of the "genotyping" process. In the matrix above, we're indexing by i for sample and j for allele count (AC). g_i represents the genotype of the ith sample in terms of its number of alt alleles, i.e. for homRef g_i=0. Note that this uses a different approach to break things down than Heng Li's paper, but it's more intuitive with this indexing. And remember this is the biallelic case, so we can assume any non-reference alleles are the same single alt allele. L(g_i) is the likelihood of the genotype, which we can get from sample i's PLs (after we un-Phred scale them, that is).
The "matrix" is triangular because as AC increases, we have to allocate a certain number of samples as being homozygous variant, so those have g_i = p_i with probability 1. Here we show the calculation to fill in the z_ij cell in the matrix, which is the cell corresponding to seeing j alt alleles after taking into account i samples. If sample i is diploid, there are three cells we need to take into account (because sample i can have 3 genotypes -- 0/0, 0/1, and 1/1, corresponding to g_i = {0, 1, 2}), all of which come from the column where we looked at i-1 samples.
Thus z_ij is the sum of entries where i-1 samples had j alts (z_{i-1,j}, and sample i is homRef), where i-1 samples had j-1 alts (z_{i-1,j-1}, and sample i is het), and where i-1 samples had j-2 alts (z_{i-1,j-2}, and sample i is homVar), taking into account the binomial coefficient (binomial because we're biallelic here, so we're only interested in the ref set and the alt set) for the number of ways to arrange sample i's chromosomes.
By the time we get to column S_N, we've accumulated all the PL data for all the samples. We can then get the likelihood that AC=j in our callset by using the entry in the row according to AC=j and dividing it by the binomial coefficient for the total number of chromosomes (P) with j alt alleles to account for the fact that we could see those alt chromosomes in our samples in any order.
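As a concrete sketch, here is the recursion above in Python for the biallelic case. The function and variable names are ours for illustration (not GATK's), sample_likelihoods holds each sample's un-Phred-scaled genotype likelihoods [L(g=0), ..., L(g=ploidy)], and we assume uniform ploidy for simplicity (the text allows a different p_i per sample):

```python
import math

def joint_ac_likelihoods(sample_likelihoods, ploidy=2):
    """Exact biallelic AC-likelihood dynamic program (sketch).

    Returns [L(AC=0), L(AC=1), ..., L(AC=P)] for P total chromosomes.
    """
    P = ploidy * len(sample_likelihoods)
    z = [1.0] + [0.0] * P              # column for i=0 samples seen
    for L in sample_likelihoods:       # fill columns left to right
        new = [0.0] * (P + 1)
        for j in range(P + 1):
            # sample i contributes g alt alleles; C(ploidy, g) counts
            # the ways to arrange them across its chromosomes
            for g in range(min(ploidy, j) + 1):
                new[j] += z[j - g] * L[g] * math.comb(ploidy, g)
        z = new
    # divide by C(P, j): the j alt chromosomes could have appeared
    # across the samples in any order
    return [z[j] / math.comb(P, j) for j in range(P + 1)]
```

Sanity check: with a single diploid sample, the AC likelihoods reduce to that sample's genotype likelihoods, as they should.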
Yep! In short, the prior based on AC is Pr(AC = i) = θ/i for i > 0, making Pr(AC = 0) = 1 − Σ_{i=1}^{P} Pr(AC = i)
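Building that prior vector is a one-liner-per-term affair; in this sketch θ = 0.001 (a common default for human heterozygosity) and P = 20 are illustrative values:

```python
theta = 0.001          # population heterozygosity (illustrative value)
P = 20                 # total chromosomes in the cohort (illustrative)

prior = [0.0] * (P + 1)
for i in range(1, P + 1):
    prior[i] = theta / i               # Pr(AC = i) = theta / i
prior[0] = 1 - sum(prior[1:])          # all remaining mass goes to AC = 0
```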
The prior, which is uniform across all sites, comes from population genetics theory, specifically coalescent theory. Let's start by defining some of our population genetics terminology. In the GATK docs, we use θ as the population heterozygosity under the neutral mutation model. Heterozygosity is the probability that two alleles drawn at random from the population will be different by state. In modern sequencing terms, that just means that there will be a variant in one with respect to the other. Note that two alleles can be identical by state but different by origin, i.e. the same variant occurred independently. If we assume that all loci are equally likely to be variant (which we know in modern times to be false, but this assumption underlies some important population genetics theories that we use as approximations), then we can also describe θ as the rate at which variants occur in the genome, 1 per 1/θ basepairs.
From Gillespie, "a coalescent is the lineage of alleles in a sample [as in a cohort of individuals sampled from the population] traced backwards in time to their common ancestor allele." Forward in time, the splits in the tree can be thought of as mutation events that generate new alleles. Backwards in time, they are referred to as coalescent events because two branches of the tree coalesce, or come together.
Each split in the coalescent represents the occurrence of a variant (let's say that each left branch is unchanged and the right branch picks up the new variant). Allele A never saw any variants, but one occurred separating A from B/C/D at -t3. Then another occurred separating B/C from D at -t2, and a final one separating B from C at -t1. So allele A is still "wild type" with no variants. Allele B has only variant t3. Allele C has two variants: t3 and t1. Allele D has two variants: t3 and t2. So variant t3 has AC=3 (three alleles stemming from its right, non-reference branch), t2 has AC=1, and t1 has AC=1. Time here is given in generations of the population, so multiple generations can occur without there being a mutational event leading to a new allele.
The total time in the coalescent is measured as the sum of the lengths of all the branches of the tree describing the coalescent. For the figure, Tc = 4t1 + 3(t2-t1) + 2(t3-t2). If we define Ti as the time required to reduce a coalescent with i alleles to one with i-1 alleles, we can write Tc as 4T4 + 3T3 + 2T2. In the forward direction, Ti becomes the amount of time required to introduce a new mutation into a population of i-1 distinct alleles.
To derive an expected value for Ti, let's look at how each allele is derived from its ancestors in a sample of n alleles from a population of N individuals, under the assumption that a coalescence has not occurred, i.e. that each allele has a different ancestor in the previous generation because there were no coalescence events (or mutations in the forward time direction). The first (reference) allele (A in the diagram) is guaranteed to have an ancestor in the previous generation because there were no mutations. The second allele has to have a different ancestor than the first allele, or else they would be derived from the same source and thus be the same allele, because there were no mutations in this generation. The second allele has a different ancestor with probability 1-1/(2N) = (2N-1)/(2N) (where we're assuming ploidy=2, as we usually do for population genetics of humans). Note that there are 2N possible ancestor alleles and 2N-1 that are different from the first allele's. The probability that the third allele has a distinct ancestor, given that the first two do not share an ancestor, is (2N-2)/(2N), making the total probability of three alleles with three different ancestors:
$$ \dfrac{2N-1}{2N} \times \dfrac{2N-2}{2N} $$
We can continue this pattern for all n alleles to arrive at the probability that all n alleles have different ancestors, i.e. that no coalescent event (or variant event) has occurred:
$$ \left ( 1-\dfrac{1}{2N} \right )\times \left ( 1-\dfrac{2}{2N} \right )\times \cdots \times \left ( 1- \dfrac{n-1}{2N} \right ) $$
And if we multiply the terms out and drop the terms with N^2 in the denominator as small enough to be ignored, we arrive at:
$$ 1- \dfrac{1}{2N}-\dfrac{2}{2N}- \cdots - \dfrac{n-1}{2N} $$
The probability of a coalescence occurring is the complement of the probability that it does not occur, giving:
$$ \dfrac {1+2+\cdots+(n-1)}{2N} = \dfrac{n(n-1)}{4N} $$
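A quick numeric sanity check of that approximation, with illustrative values for N and n:

```python
# Exact complement of "no coalescence" vs. the n(n-1)/(4N) approximation
N, n = 10_000, 4                      # illustrative population and sample sizes

no_coalescence = 1.0
for k in range(1, n):                 # product of the (1 - k/2N) terms
    no_coalescence *= 1 - k / (2 * N)

p_exact = 1 - no_coalescence          # probability of a coalescence
p_approx = n * (n - 1) / (4 * N)      # with the 1/N^2 terms dropped

# the two agree to within the ignored second-order terms
```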
Which is the probability of a coalescence in any particular generation. The time to the first coalescence therefore follows a geometric distribution where the probability of success is n(n-1)/(4N), giving the expectation of the time to the first coalescence as:

$$ E[T_n] = \dfrac{4N}{n(n-1)} $$

We can generalize this to any coalescent event i:

$$ E[T_i] = \dfrac{4N}{i(i-1)} $$

The total time in the coalescent is then:

$$ T_c = \sum_{i=2}^{n}iT_i $$
Which is a generalization of the example worked out above based on the figure. The expectation of the time spent in the coalescent is then:

$$ E[T_c] = E \left[ \sum_{i=2}^{n}iT_i\right ] = \sum_{i=2}^{n}iE[T_i] = 4N \sum_{i=2}^{n}\dfrac{1}{i-1} $$
The expected number of variants (Sn, called "segregating sites" in the old-school pop gen vernacular) is the neutral mutation rate (u) times the expected amount of time in the coalescent. A mutation can occur on any branch of the coalescent, so we want to sum the time in all branches to allow for all possibilities -- we did this above.
So the expected number of variants can be expressed in terms of the heterozygosity, which, if we describe it as a rate per basepair as above, allows us to describe the probability of a variant occurring at a given locus, forming the prior for our QUAL score. If we assume a cohort of unrelated individuals, the occurrence of any variant with AC > 1 must be the result of that variant occurring multiple times independently at the same locus. If we now restrict the coalescent to the lineage of variants at a single position, we can reframe E[Sn] in terms of AC instead of number of alleles. Then we can convert the index of the sum to be AC (the number of mutations, but restricted to the same locus) using i = i' + 1 (because the set of n alleles originally includes the reference allele), so that the new sum runs over AC values i' up to N, the number of chromosomes in the cohort.
From there, we can show that Pr[AC=i] = θ/i:

$$ E[S_n] = uE[T_c]=\theta\sum_{i=2}^{n}\dfrac{1}{i-1} = \dfrac{\theta}{1}+\dfrac{\theta}{2}+\cdots+\dfrac{\theta}{n-1} $$

Each term θ/i in this expansion is the expected number of variants with AC = i, which (for small θ) we can read as the prior probability Pr[AC = i] = θ/i.
(The theory presented here comes from Chapter 2 of "Population Genetics: A Concise Guide" by John H. Gillespie)
The posterior is simply:

P(AC = i | D) = Lk(D | AC = i) Pr(AC = i) / P(D)

and QUAL is the Phred-scaled P(AC = 0 | D).
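Putting the pieces together, here is a minimal Python sketch of that Bayes step (the function name and the toy numbers are ours for illustration; the real implementation works in log space for numerical stability):

```python
import math

def qual_from_posterior(ac_likelihoods, ac_prior):
    """Combine AC likelihoods with the theta/i prior, then Phred-scale
    the posterior probability that AC = 0 (simplified sketch)."""
    unnormalized = [l * p for l, p in zip(ac_likelihoods, ac_prior)]
    p_ac0 = unnormalized[0] / sum(unnormalized)   # P(AC = 0 | D)
    return -10 * math.log10(p_ac0)                # QUAL

# Toy example: data strongly favoring a variant yields a high QUAL...
qual = qual_from_posterior([1e-4, 0.5, 0.5], [0.9, 0.05, 0.05])
# ...while data fully consistent with hom-ref yields QUAL = 0
no_qual = qual_from_posterior([1.0, 0.0, 0.0], [0.9, 0.05, 0.05])
```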
Well, the short answer is that it gets a lot more complicated. Where we had a 2-D matrix for the biallelic case, we'll have an N-dimensional volume for a site with N alleles (including the reference).
Another lovely illustration helps us wrap our puny human brains around this idea:
Where p is ploidy, s is number of samples, a is number of alleles -- that's it.
So we use some approximations in order to get you your results in a reasonable amount of time. Those have been working out pretty well so far, but there are a few cases where they don't do as well, so we're looking into improving our approximations so nobody loses any rare alleles. Stay tuned!
There has been a lot of confusion about the difference between QUAL and GQ, and we hope this FAQ will clarify the difference.
The basic difference is that QUAL refers to the variant site whereas GQ refers to a specific sample's GT.
QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.
QUAL (or more importantly, its normalized form, QD) is mostly useful in a multisample context. When you are recalibrating a cohort callset, you're going to be looking exclusively at site-level annotations like QD, because at that point what you're looking for is evidence of variation overall. That way you don't rely too much on individual sample calls, which are less robust.
In fact, many cohort studies don't even really care about individual genotype assignments, so they only use site annotations for their entire analysis.
Conversely, QUAL may seem redundant if you have only one sample. Especially if it has a good GQ (and more importantly, well separated PLs), then admittedly you don't really need to look at the QUAL -- you know what you have. If the GQ is not good, the PLs can typically still tell you that you probably do have a variant, just not whether it's het or hom-var. If hom-ref is also a plausible genotype, the call may be a potential false positive.
That said, it is more effective to filter on site-level annotations first, then refine and filter genotypes as appropriate. That's the workflow we recommend, based on years of experience doing this at fairly large scales...
Hello team GATK, I am trying to run Mutect2 on my matched TN paired BAM files. Though I am getting calls in VCF format, the QUAL column is just "." and no value is given. Also the GQ values in the sample genotype columns are missing. I have used the below command:
java -Xmx8g -jar Tools/GATK_v3.5/GenomeAnalysisTK.jar -T MuTect2 -R Reference/Human/hg38/hg38.fa -I:tumor ALL_BAMS/"$TUMOR".sorted.dedup.realigned.recal.bam -I:normal ALL_BAMS/"$NORMAL".sorted.dedup.realigned.recal.bam -stand_emit_conf 30 -stand_call_conf 30 --dbsnp Database/All_reorder.vcf -L Database/Illumina_target/nexterarepdcapture_expandedexome_Intervals_hg38.bed -o ALL_BAMS/"$OUT"_TN.raw.vcf
Can you please suggest what's going wrong here?
Thanks Sneh
Dear the GATK team,
I very much appreciate the GATK team. GATK has been a very useful analysis tool for my study of human re-seq data.

I've recently been trying GATK on non-human re-seq data, with no dbSNP or dbIndel resources. In that case, I found a strange variant calling result with low QUAL values, from 10 to 60. I picked some of the low-QUAL variants and looked them up in the BAM file, but they were not real variants; I think this is because there was no variant recalibration step. So I am trying to filter the variant calling results on some value, and here are my questions:
1) As I understand it, 'QUAL' is a phred-scaled quality score assigned by the variant caller, so I tried to use it to filter the variant calling results. Is there any specific QUAL filtering threshold for non-human re-seq data?
2) In one of the posts on the GATK forum, a GATK dev member said that they recommend looking at QD, not QUAL ( http://gatkforums.broadinstitute.org/discussion/6051/how-to-interpret-a-very-broad-distribution-of-qual ). In my case, can I use QD for filtering non-human re-seq data? And would you recommend any specific threshold for filtering non-human re-seq data?
Thank you in advance.
Yours respectfully, Hubert.
Hello Forum Members and GATK Team, I am trying to understand the calculations that go into QD in GATK.
In the following it's mentioned that the new versions calculate QD using AD: https://www.broadinstitute.org/gatk/blog?id=3862
In the joint VCF file I have the following entry: 1 5474238 . C A 163.45 PASS AC=2;AF=0.071;AN=28;DP=609;FS=0.000;MQ=60.00;MQ0=0;QD=27.05;Samples=Sample1;VQSLOD=18.21;VariantType=SNP;culprit=MQ GT:AD:GQ:PL 1/1:0,3:12:175,12,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0
In the individual GVCF files (all 14) I search for Chromosome 1 and Position 5474238, and there only one entry:
1 5474238 . C A,
Here the QUAL is 142.28 and AD is 3. How come in the joint VCF file, the QUAL becomes 163.45 and QD=27.05 with DP=609?
Above is simple example where only in one sample a GT was called that was different from other 0/0.
Let's take another example (a bit more complicated). Joint VCF file entry: 1 24089531 . A G 882.89 PASS AC=13;AF=0.464;AN=28;BaseQRankSum=0.358;DP=159;FS=4.751;MQ=60.00;MQ0=0;MQRankSum=0.358;QD=10.77;ReadPosRankSum=0.00;Samples=Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7,Sample8,Sample9,Sample10;VQSLOD=13.36;VariantType=SNP;culprit=DP GT:AB:AD:GQ:PL 1/1:.:0,3:7:48,7,0 0/1:0.600:6,4:50:50,0,117 0/1:0.670:4,2:31:31,0,87 0/1:0.500:3,3:40:62,0,40 0/1:0.200:1,4:12:68,0,12 0/0:.:6,0:18:0,18,155 0/1:0.790:11,3:38:38,0,230 1/1:.:0,15:44:437,44,0 0/0:.:4,1:8:0,8,91 0/0:.:0,0:0:0,0,0 0/0:.:0,0:0:0,0,0 0/1:0.670:4,2:35:35,0,91 1/1:.:0,1:3:11,3,0 0/1:.:3,2:38:38,0,49
Entries in the GVCF files:
Sample1
1 24089531 . A G,
Sample7
1 24089531 . A G,
Sample8
1 24089531 . A G,
Sample9
1 24089531 . A G,
Sample14
Sample13
Sample10
1 24089531 . A G,
Sample11
1 24089531 . A G,
Sample12
1 24089531 . A G,
Using AD only, I am still not able to reproduce the QD value or the QUAL values in the joint VCF files.
GATK Version=3.2-2-gec30cee
Thanks in Advance.
Hi
I was wondering how QUAL and PL are calculated. Is there a document that describes this for Haplotype Caller?
Thanks
Hello @Geraldine,
I'm getting some too-high QUAL scores in my VCF, the whole file is full of weird scores in the tens of thousands:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chrM 73 . G A 27120.46 . BaseQRankSum=-8.220e-01;ClippingRankSum=1.02;DP=1063;MLEAC=6;MLEAF=0.600;MQ=59.91;MQ0=0;MQRankSum=0.120;QD=32.10;ReadPosRankSum=-3.840e-01 GT:AD:DP:GQ:PL 1/1:1,332:333:99:10644,962,0 1/1:1,315:316:99:10159,913,0 0/0:118,0:118:99:0,120,1800 1/1:0,196:196:99:6366,586,0 0/0:95,0:95:99:0,120,1800
chrM 146 rs72619361 T C 16226.15 . DB;DP=1315;MLEAC=2;MLEAF=0.200;MQ=57.18;MQ0=0;QD=34.24 GT:AD:DP:GQ:PL 0/0:338,0:338:99:0,120,1800 0/0:324,0:324:99:0,120,1800 0/0:118,0:118:99:0,120,1800 0/0:175,0:175:99:0,120,1800 1/1:0,359:359:99:16268,1102,0
chrM 150 . T C 52405.15 . BaseQRankSum=0.310;ClippingRankSum=0.695;DP=1523;MLEAC=8;MLEAF=0.800;MQ=60.00;MQ0=0;MQRankSum=-4.940e-01;QD=30.63;ReadPosRankSum=-1.364e+00 GT:AD:DP:GQ:PL 1/1:0,421:421:99:14544,1264,0 1/1:0,415:415:99:14206,1244,0 0/0:118,0:118:99:0,120,1800 1/1:1,206:207:99:7027,586,0 1/1:0,356:356:99:16670,1132,0
chrM 152 rs117135796 T C 16299.15 . DB;DP=1495;MLEAC=2;MLEAF=0.200;MQ=57.18;MQ0=0;QD=29.09 GT:AD:DP:GQ:PL 0/0:409,0:409:99:0,120,1800 0/0:411,0:411:99:0,120,1800 0/0:118,0:118:99:0,120,1800 0/0:204,0:204:99:0,120,1800 1/1:0,352:352:99:16341,1102,0
chrM 194 . C T 12039.15 . BaseQRankSum=-1.325e+00;ClippingRankSum=-4.920e-01;DP=1754;MLEAC=2;MLEAF=0.200;MQ=60.00;MQ0=0;MQRankSum=-1.597e+00;QD=32.89;ReadPosRankSum=1.11 GT:AD:DP:GQ:PL 0/0:409,0:409:99:0,120,1800 0/0:411,0:411:99:0,120,1800 1/1:6,360:366:99:12081,883,0 0/0:204,0:204:99:0,120,1800 0/0:361,0:361:99:0,120,1800
It's just a 4x WGS file, nothing fancy.
Any idea of why this might be?
Thanks,
I used the EMIT_ALL_SITES option with UnifiedGenotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the Phred-scaled probability that a REF/ALT polymorphism exists at this site given the sequencing data. But for monomorphic sites (where we have a homozygote for the reference allele), what is the definition of the quality score, and how is it computed? Many thanks for your explanation.
Hi, I'm working with the UnifiedGenotyper walker and I have detected strange values for the QUAL field of some VCF entries in the output files.
Sometimes in the VCF output file, the QUAL value is repeated across different VCF entries; for example, the values 32729.73 or 2147483609.73 usually appear in the output -- and not only in my files, because when I searched the GATK forum, these values appear in VCF files posted by other users in unrelated questions.
I have tested it with several GATK versions, and in the latest versions these QUAL numbers are extremely high. I have also noticed that the value doesn't correspond with the relationship QUAL ~ QD*DP.
Another strange thing is that for other QUAL values in the VCF file, it is very common that the decimal part begins with a seven, e.g. 32729.73.
Have you seen this before? Is this some kind of bug?
I look forward to your response.
I copy some VCF entries with different GATK versions:
Last version (2.7-2):
chr13 32907535 . C CT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.483;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.88;MQ0=0;MQRankSum=13.620;QD=33.72;RPA=11,12;RU=T;ReadPosRankSum=-12.065;STR GT:AD:DP:GQ:PL 0/1:1386,1037:2453:99:9301,0,13130
chr13 32907589 . G GT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=6.991;DP=999;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.93;MQ0=0;MQRankSum=26.910;QD=28.67;RPA=7,8;RU=T;ReadPosRankSum=-2.595;STR GT:AD:DP:GQ:PL 0/1:1306,1142:2469:99:14116,0,16944
V 2.6-5:
chr13 32907535 . C CT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.106;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.81;MQ0=0;MQRankSum=14.984;QD=29.49;RPA=11,12;RU=T;ReadPosRankSum=-9.803;STR GT:AD:DP:GQ:PL 0/1:1261,1038:2453:99:7901,0,13152
chr13 32907589 . G GT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=6.976;DP=998;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.74;MQ0=0;MQRankSum=25.865;QD=31.47;RPA=7,8;RU=T;ReadPosRankSum=-0.572;STR GT:AD:DP:GQ:PL 0/1:1184,1142:2469:99:13365,0,16796
V 2.5-2:
chr13 32907535 . C CT 32729.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.023;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.75;MQ0=0;MQRankSum=-3.054;QD=32.73;RPA=11,12;RU=T;ReadPosRankSum=3.137;STR GT:AD:DP:GQ:PL 0/1:0,29:2453:99:7901,0,13152
chr13 32907589 . G GT 32729.73 . AC=1;AF=0.500;AN=2;DP=999;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.71;MQ0=0;QD=32.76;RPA=7,8;RU=T;STR GT:AD:DP:GQ:PL 0/1:0,0:2469:99:13365,0,16796
I've had the same DNA sample sequenced using two different library prep methods, NEB-Next and Illumina-Nextera, the latter of which generates twice the number of reads as the former. When genotypes for the two samples are called individually using the UnifiedGenotyper, the read depth (DP) for Illumina-Nextera is roughly twice that of NEB-Next, but the quality score (QUAL) remains similar. I was expecting the QUAL to reflect the read depth, but obviously this is not the case. Could you enlighten me?