There has been a lot of confusion about the difference between QUAL and GQ, and we hope this FAQ will clarify the difference.
The basic difference is that QUAL refers to the variant site whereas GQ refers to a specific sample's GT.
QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.
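To make the site-versus-sample distinction concrete, here is a minimal Python sketch that pulls QUAL (one value per record) and GQ (one value per sample) out of a VCF data line. The function name is made up for illustration, and the record is a trimmed two-sample version of the joint-VCF entry quoted later in this thread. Real files are better read with a proper VCF parser.

```python
def split_site_and_sample_quality(vcf_line):
    """Return (QUAL, [GQ per sample]) for one whitespace-delimited VCF data line."""
    fields = vcf_line.split()
    # Column 6 is the site-level QUAL; "." means the caller did not emit one.
    qual = None if fields[5] == "." else float(fields[5])
    # Column 9 is FORMAT; it names the per-sample sub-fields in order.
    format_keys = fields[8].split(":")
    if "GQ" not in format_keys:
        return qual, [None] * len(fields[9:])
    gq_index = format_keys.index("GQ")
    gqs = []
    for sample in fields[9:]:
        values = sample.split(":")
        raw = values[gq_index] if gq_index < len(values) else "."
        gqs.append(None if raw == "." else int(raw))
    return qual, gqs


# Trimmed, illustrative record: one hom-var sample and one uncovered sample.
record = ("1 5474238 . C A 163.45 PASS AC=2;AN=28 GT:AD:GQ:PL "
          "1/1:0,3:12:175,12,0 0/0:0,0:0:0,0,0")
site_qual, sample_gqs = split_site_and_sample_quality(record)
print(site_qual, sample_gqs)
```

There is exactly one QUAL for the record, while each sample column carries its own GQ.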
QUAL (or more importantly, its normalized form, QD) is mostly useful in a multisample context. When you are recalibrating a cohort callset, you look exclusively at site-level annotations like QD, because at that point what you want is evidence of variation overall. That way you don't rely too much on individual sample calls, which are less robust.
In fact, many cohort studies don't even really care about individual genotype assignments, so they only use site annotations for their entire analysis.
Conversely, QUAL may seem redundant if you have only one sample. Especially if that sample has a good GQ (and more importantly, well-separated PLs), then admittedly you don't really need to look at the QUAL: you know what you have. If the GQ is not good, you can typically still rely on the PLs to tell you whether you probably do have a variant even though the caller is not sure whether it is het or hom-var. If hom-ref is also a possibility, the call may be a false positive.
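As a rough illustration of how GQ relates to the PLs, the sketch below assumes the usual convention for a biallelic site: PLs are listed in the order hom-ref, het, hom-var, the most likely genotype has PL 0, and GQ is the gap to the second-best PL, capped at 99. The helper names and the `max_pl` cutoff of 20 are assumptions chosen for illustration, not GATK parameters. The PL triples are taken from records posted in this thread.

```python
def gq_from_pls(pls, cap=99):
    """GQ as the gap between the best and second-best PL, capped at `cap`."""
    ordered = sorted(pls)
    return min(ordered[1] - ordered[0], cap)


def plausible_genotypes(pls, max_pl=20):
    """List the genotypes whose PL lies within `max_pl` of the best one.

    `max_pl` is an arbitrary illustrative cutoff, not a recommended value.
    """
    labels = ["hom-ref", "het", "hom-var"]
    best = min(pls)
    return [label for label, pl in zip(labels, pls) if pl - best <= max_pl]


# Low GQ, but hom-ref is clearly excluded: a variant, het or hom-var unclear.
print(gq_from_pls([175, 12, 0]), plausible_genotypes([175, 12, 0]))
# Low GQ and hom-ref still plausible: a potential false positive.
print(gq_from_pls([11, 3, 0]), plausible_genotypes([11, 3, 0]))
# Well-separated PLs: GQ hits the cap.
print(gq_from_pls([0, 120, 1800]), plausible_genotypes([0, 120, 1800]))
```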
That said, it is more effective to filter on site-level annotations first, then refine and filter genotypes as appropriate. That's the workflow we recommend, based on years of experience doing this at fairly large scales...
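The order of operations described above (site-level filter first, genotype refinement second) can be sketched in a few lines of Python. This is only a toy model: the dictionary layout, the function name, and the `min_qd` and `min_gq` thresholds are placeholders invented for the example, not recommended settings or the behavior of any GATK tool.

```python
def filter_site_then_genotypes(site, min_qd=2.0, min_gq=20):
    """Apply a site-level filter, then mask low-confidence genotypes.

    `site` is a dict with a float 'QD' and a list 'samples' of dicts that
    each carry 'GT' and 'GQ'. Thresholds are illustrative placeholders.
    """
    # Step 1: site-level decision. A failing site is dropped outright,
    # so its genotypes are never examined.
    if site["QD"] < min_qd:
        return None
    # Step 2: genotype-level refinement on sites that survived step 1.
    refined = []
    for sample in site["samples"]:
        gt = sample["GT"] if sample["GQ"] >= min_gq else "./."
        refined.append({**sample, "GT": gt})
    return {**site, "samples": refined}


example_site = {
    "QD": 10.77,
    "samples": [{"GT": "1/1", "GQ": 7}, {"GT": "0/1", "GQ": 50}],
}
kept = filter_site_then_genotypes(example_site)
print([s["GT"] for s in kept["samples"]])
```

In this toy example the site passes, the GQ 7 genotype is set to no-call, and the GQ 50 genotype is kept.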
Hello team GATK, I am trying to run MuTect2 on my matched tumor-normal (TN) paired BAM files. Although I am getting calls in VCF format, the QUAL column is just "." with no value given. The GQ values in the sample genotype columns are also missing. I used the command below:
java -Xmx8g -jar Tools/GATK_v3.5/GenomeAnalysisTK.jar -T MuTect2 -R Reference/Human/hg38/hg38.fa -I:tumor ALL_BAMS/"$TUMOR".sorted.dedup.realigned.recal.bam -I:normal ALL_BAMS/"$NORMAL".sorted.dedup.realigned.recal.bam -stand_emit_conf 30 -stand_call_conf 30 --dbsnp Database/All_reorder.vcf -L Database/Illumina_target/nexterarepdcapture_expandedexome_Intervals_hg38.bed -o ALL_BAMS/"$OUT"_TN.raw.vcf
Can you please suggest what is going wrong here?
Dear GATK team,
I really appreciate your work. GATK has been a very useful analysis tool for my study on human re-seq data.
Recently I have been trying GATK on non-human re-seq data, for which no dbSNP or dbIndel resources are available. In that case, I found strange variant calls with low QUAL values, from 10 to 60. I picked some of the variants with low QUAL and looked them up in the BAM file, but they were not real variants. I think this is because there was no variant recalibration step. So I am trying to filter the variant calls on some value, and here are my questions.
1) As far as I know, 'QUAL' is a Phred-scaled quality score assigned by the variant caller, so I tried to use it to filter the variant calls. Is there a specific QUAL filtering threshold for non-human re-seq data?
2) In one of the posts on the GATK forum, a GATK dev team member said that they recommend looking at QD, not QUAL. ( http://gatkforums.broadinstitute.org/discussion/6051/how-to-interpret-a-very-broad-distribution-of-qual ) In my case, can I use QD for filtering non-human re-seq data? And would you recommend a specific annotation and threshold for filtering non-human re-seq data?
Thank you in advance.
Yours respectfully, Hubert.
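For readers unsure what "Phred-scaled" means for the QUAL range of 10 to 60 mentioned in this post, here is a short Python sketch of the standard conversion, Q = -10 * log10(p), where p is the probability that the call is wrong. The function names are made up for the example.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)


for qual in (10, 30, 60):
    print(qual, phred_to_error_prob(qual))
```

On this scale a QUAL of 10 corresponds to a 1 in 10 chance that the site is not variant, and a QUAL of 60 to 1 in a million, according to the caller's model.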
Hello Forum Members and GATK Team, I am trying to understand the calculations that go into QD in GATK.
The following blog post mentions that the new versions calculate QD using AD: https://www.broadinstitute.org/gatk/blog?id=3862
In the joint VCF file I have the following entry:
1 5474238 . C A 163.45 PASS AC=2;AF=0.071;AN=28;DP=609;FS=0.000;MQ=60.00;MQ0=0;QD=27.05;Samples=Sample1;VQSLOD=18.21;VariantType=SNP;culprit=MQ GT:AD:GQ:PL 1/1:0,3:12:175,12,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0
In the individual GVCF files (all 14) I searched for chromosome 1, position 5474238, and there is only one entry:
1 5474238 . C A,
Here the QUAL is 142.28 and AD is 3. How does the QUAL become 163.45, with QD=27.05 and DP=609, in the joint VCF file?
The above is a simple example in which only one sample was called with a GT other than 0/0.
Let's take another example (a bit more complicated). Joint VCF file entry:
1 24089531 . A G 882.89 PASS AC=13;AF=0.464;AN=28;BaseQRankSum=0.358;DP=159;FS=4.751;MQ=60.00;MQ0=0;MQRankSum=0.358;QD=10.77;ReadPosRankSum=0.00;Samples=Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7,Sample8,Sample9,Sample10;VQSLOD=13.36;VariantType=SNP;culprit=DP GT:AB:AD:GQ:PL 1/1:.:0,3:7:48,7,0 0/1:0.600:6,4:50:50,0,117 0/1:0.670:4,2:31:31,0,87 0/1:0.500:3,3:40:62,0,40 0/1:0.200:1,4:12:68,0,12 0/0:.:6,0:18:0,18,155 0/1:0.790:11,3:38:38,0,230 1/1:.:0,15:44:437,44,0 0/0:.:4,1:8:0,8,91 0/0:.:0,0:0:0,0,0 0/0:.:0,0:0:0,0,0 0/1:0.670:4,2:35:35,0,91 1/1:.:0,1:3:11,3,0 0/1:.:3,2:38:38,0,49
Entries in the GVCF files:
1 24089531 . A G,
1 24089531 . A G,
1 24089531 . A G,
1 24089531 . A G,
1 24089531 . A G,
1 24089531 . A G,
1 24089531 . A G,
Using AD only, I am still not able to work out the QD value or the QUAL values in the joint VCF file.
Thanks in Advance.
I'm getting some suspiciously high QUAL scores in my VCF; the whole file is full of weird scores in the tens of thousands:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chrM 73 . G A 27120.46 . BaseQRankSum=-8.220e-01;ClippingRankSum=1.02;DP=1063;MLEAC=6;MLEAF=0.600;MQ=59.91;MQ0=0;MQRankSum=0.120;QD=32.10;ReadPosRankSum=-3.840e-01 GT:AD:DP:GQ:PL 1/1:1,332:333:99:10644,962,0 1/1:1,315:316:99:10159,913,0 0/0:118,0:118:99:0,120,1800 1/1:0,196:196:99:6366,586,0 0/0:95,0:95:99:0,120,1800
chrM 146 rs72619361 T C 16226.15 . DB;DP=1315;MLEAC=2;MLEAF=0.200;MQ=57.18;MQ0=0;QD=34.24 GT:AD:DP:GQ:PL 0/0:338,0:338:99:0,120,1800 0/0:324,0:324:99:0,120,1800 0/0:118,0:118:99:0,120,1800 0/0:175,0:175:99:0,120,1800 1/1:0,359:359:99:16268,1102,0
chrM 150 . T C 52405.15 . BaseQRankSum=0.310;ClippingRankSum=0.695;DP=1523;MLEAC=8;MLEAF=0.800;MQ=60.00;MQ0=0;MQRankSum=-4.940e-01;QD=30.63;ReadPosRankSum=-1.364e+00 GT:AD:DP:GQ:PL 1/1:0,421:421:99:14544,1264,0 1/1:0,415:415:99:14206,1244,0 0/0:118,0:118:99:0,120,1800 1/1:1,206:207:99:7027,586,0 1/1:0,356:356:99:16670,1132,0
chrM 152 rs117135796 T C 16299.15 . DB;DP=1495;MLEAC=2;MLEAF=0.200;MQ=57.18;MQ0=0;QD=29.09 GT:AD:DP:GQ:PL 0/0:409,0:409:99:0,120,1800 0/0:411,0:411:99:0,120,1800 0/0:118,0:118:99:0,120,1800 0/0:204,0:204:99:0,120,1800 1/1:0,352:352:99:16341,1102,0
chrM 194 . C T 12039.15 . BaseQRankSum=-1.325e+00;ClippingRankSum=-4.920e-01;DP=1754;MLEAC=2;MLEAF=0.200;MQ=60.00;MQ0=0;MQRankSum=-1.597e+00;QD=32.89;ReadPosRankSum=1.11 GT:AD:DP:GQ:PL 0/0:409,0:409:99:0,120,1800 0/0:411,0:411:99:0,120,1800 1/1:6,360:366:99:12081,883,0 0/0:204,0:204:99:0,120,1800 0/0:361,0:361:99:0,120,1800
It's just a 4x WGS file, nothing fancy.
Any idea of why this might be?
I used the EMIT_ALL_SITES option with UnifiedGenotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the Phred-scaled probability that a REF/ALT polymorphism exists at this site given the sequencing data. But for monomorphic sites (where the call is homozygous for the reference allele), what is the definition of the quality score, and how is it computed? Many thanks for your explanation.
Hi, I'm working with the UnifiedGenotyper walker and I have detected strange values for the QUAL field of some VCF entries in the output files.
Sometimes the same QUAL value is repeated across different entries in the VCF output file. For example, the QUAL values 32729.73 and 2147483609.73 appear frequently, and not only in my files: when I searched the GATK forum, I found these values in VCF files that other users had posted for unrelated questions.
I have tested this with several GATK versions. In the latest versions these QUAL numbers are extremely high, and I have also noticed that the value doesn't agree with the relationship QUAL~QD*DP.
Another strange thing is that for other QUAL values in the VCF file, it is very common for the decimal part to begin with a seven, e.g. 32729.73.
Have you noticed this? Is it some kind of bug?
I look forward to your response.
Here are some VCF entries produced with different GATK versions:
Latest version (2.7-2):
chr13 32907535 . C CT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.483;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.88;MQ0=0;MQRankSum=13.620;QD=33.72;RPA=11,12;RU=T;ReadPosRankSum=-12.065;STR GT:AD:DP:GQ:PL 0/1:1386,1037:2453:99:9301,0,13130 chr13 32907589 . G GT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=6.991;DP=999;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.93;MQ0=0;MQRankSum=26.910;QD=28.67;RPA=7,8;RU=T;ReadPosRankSum=-2.595;STR GT:AD:DP:GQ:PL 0/1:1306,1142:2469:99:14116,0,16944
chr13 32907535 . C CT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.106;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.81;MQ0=0;MQRankSum=14.984;QD=29.49;RPA=11,12;RU=T;ReadPosRankSum=-9.803;STR GT:AD:DP:GQ:PL 0/1:1261,1038:2453:99:7901,0,13152 chr13 32907589 . G GT 2147483609.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=6.976;DP=998;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.74;MQ0=0;MQRankSum=25.865;QD=31.47;RPA=7,8;RU=T;ReadPosRankSum=-0.572;STR GT:AD:DP:GQ:PL 0/1:1184,1142:2469:99:13365,0,16796
chr13 32907535 . C CT 32729.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.023;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.75;MQ0=0;MQRankSum=-3.054;QD=32.73;RPA=11,12;RU=T;ReadPosRankSum=3.137;STR GT:AD:DP:GQ:PL 0/1:0,29:2453:99:7901,0,13152 chr13 32907589 . G GT 32729.73 . AC=1;AF=0.500;AN=2;DP=999;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.71;MQ0=0;QD=32.76;RPA=7,8;RU=T;STR GT:AD:DP:GQ:PL 0/1:0,0:2469:99:13365,0,16796
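The QUAL~QD*DP relationship mentioned in this post can be checked directly against the posted numbers. The Python sketch below uses a made-up helper name and an arbitrary tolerance, and it takes (QUAL, INFO DP, QD) triples from three of the records above. It also compares the repeated value 2147483609.73 with the largest signed 32-bit integer. That proximity is only an observation, and the thread does not confirm any cause.

```python
def qd_matches_qual(qual, dp, qd, tolerance=0.01):
    """Return True if QUAL / DP agrees with the reported QD within `tolerance`."""
    return abs(qual / dp - qd) <= tolerance


# (QUAL, INFO DP, QD) copied from records posted above.
records = [
    (2147483609.73, 1000, 33.72),  # chr13:32907535, first listing
    (32729.73, 1000, 32.73),       # chr13:32907535, last listing
    (32729.73, 999, 32.76),        # chr13:32907589, last listing
]
for qual, dp, qd in records:
    print(qual, dp, qd, qd_matches_qual(qual, dp, qd))

# Observation only: the repeated huge QUAL sits just below 2**31 - 1.
INT32_MAX = 2**31 - 1
print(INT32_MAX - 2147483609.73)
```

The two 32729.73 records are consistent with QD = QUAL / DP, whereas the 2147483609.73 record is not, which supports the poster's point that the huge value does not follow the usual relationship.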
I've had the same DNA sample sequenced using two different library prep methods, NEB-Next and Illumina-Nextera, the latter of which generates twice as many reads as the former. When genotypes for the two samples are called individually using the UnifiedGenotyper, the read depth (DP) for the Illumina-Nextera sample is roughly twice that of the NEB-Next sample, but the quality score (QUAL) remains similar. I was expecting the QUAL to reflect the read depth, but evidently this is not the case. Could you enlighten me?