Tagged with #qual
1 documentation article | 0 announcements | 7 forum discussions

Created 2014-11-24 19:15:38 | Updated 2014-11-25 19:18:02 | Tags: qual genotype-quality variant-quality
Comments (0)

There has been a lot of confusion about the difference between QUAL and GQ, and we hope this FAQ will clarify the difference.

The basic difference is that QUAL refers to the variant site whereas GQ refers to a specific sample's GT.

  • QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.

  • GQ tells you how confident we are that the genotype we assigned to a particular sample is correct. In practice it is simply the second-lowest PL, because GQ is defined as the difference between the second-lowest PL and the lowest PL, and the lowest PL is always 0.
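The GQ rule above can be sketched in a few lines. This is a minimal illustration, not GATK's code: the function name `gq_from_pl` is hypothetical, and the cap at 99 is an assumption that matches the GQ values of 99 reported for high-depth samples in VCF records elsewhere on this page.

```python
def gq_from_pl(pls, cap=99):
    """Return GQ as (second-lowest PL - lowest PL), capped at `cap`.

    `cap=99` is an assumption for illustration; it matches the GQ of 99
    reported for genotypes with very large PLs.
    """
    if len(pls) < 2:
        raise ValueError("need at least two PL values")
    ordered = sorted(pls)
    # PLs are normalized so that the lowest is 0, which makes the
    # difference equal to the second-lowest PL itself.
    return min(ordered[1] - ordered[0], cap)


print(gq_from_pl([175, 12, 0]))      # PLs of a 1/1 call with GQ 12
print(gq_from_pl([62, 0, 40]))       # PLs of a 0/1 call with GQ 40
print(gq_from_pl([9301, 0, 13130]))  # PLs of a 0/1 call with GQ 99
```

The three PL triplets are taken from genotype fields quoted in the forum posts below, so the results can be compared against the GQ values reported there.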

QUAL (or more importantly, its normalized form, QD) is mostly useful in a multisample context. When you are recalibrating a cohort callset, you're going to be looking exclusively at site-level annotations like QD, because at that point what you're looking for is evidence of variation overall. That way you don't rely too much on individual sample calls, which are less robust.

In fact, many cohort studies don't even really care about individual genotype assignments, so they only use site annotations for their entire analysis.

Conversely, QUAL may seem redundant if you have only one sample. If that sample has a good GQ (and, more importantly, well-separated PLs), then admittedly you don't really need to look at the QUAL: you know what you have. If the GQ is not good, you can typically rely on the PLs to tell you whether you probably do have a variant and we are just not sure whether it is het or hom-var. If hom-ref is also a possibility, the call may be a false positive.
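One way to read the PLs as described above is to convert them back to relative genotype likelihoods. The sketch below assumes a biallelic site with PLs ordered hom-ref, het, hom-var, and simply normalizes 10^(-PL/10) so the three values sum to 1 (no prior is applied). The function names are hypothetical.

```python
def pl_to_relative_likelihoods(pls):
    """Convert phred-scaled PLs to likelihoods normalized to sum to 1."""
    raw = [10 ** (-pl / 10) for pl in pls]
    total = sum(raw)
    return [r / total for r in raw]


def describe(pls):
    """Print the normalized weight of each genotype for one sample."""
    hom_ref, het, hom_var = pl_to_relative_likelihoods(pls)
    print(f"PL={pls}: hom-ref={hom_ref:.4f} het={het:.4f} hom-var={hom_var:.4f}")


# PLs from two low-depth 1/1 calls quoted in a forum post further down.
describe([48, 7, 0])  # GQ 7: hom-ref is negligible, het vs hom-var is unclear
describe([11, 3, 0])  # GQ 3: hom-ref still carries visible weight
```

In the first case almost all of the weight sits on the two variant genotypes, so a variant is probably present even though the genotype is uncertain. In the second, hom-ref keeps a non-trivial share, which is the situation described above as a potential false positive.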

That said, it is more effective to filter on site-level annotations first, then refine and filter genotypes as appropriate. That's the workflow we recommend, based on years of experience doing this at fairly large scales...
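The order of operations recommended above (site-level filter first, genotype-level refinement second) can be sketched as follows. The function name and both thresholds (`min_qd=2.0`, `min_gq=20`) are illustrative assumptions, not values stated in this article; the QD and GQ values are copied from the 14-sample joint record at position 24089531 quoted in a forum post below.

```python
def filter_site_then_genotypes(qd, gqs, min_qd=2.0, min_gq=20):
    """Apply a site-level filter, then mask low-confidence genotypes.

    Returns None if the site fails the QD filter. Otherwise returns the
    GQ list with values below `min_gq` replaced by None (a no-call).
    Both thresholds are hypothetical defaults for illustration only.
    """
    if qd < min_qd:
        return None  # the whole site is filtered; genotypes are not examined
    return [gq if gq >= min_gq else None for gq in gqs]


# QD and per-sample GQ values from the joint VCF record at 1:24089531.
site_qd = 10.77
sample_gqs = [7, 50, 31, 40, 12, 18, 38, 44, 8, 0, 0, 35, 3, 38]

kept = filter_site_then_genotypes(site_qd, sample_gqs)
print(kept)
print(sum(gq is not None for gq in kept), "of", len(kept), "genotypes retained")
```

With these assumed thresholds the site itself passes, while half of the individual genotypes would be set to no-call, which illustrates why site-level evidence is more robust than any single sample's call.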


Created 2015-09-04 05:51:00 | Updated | Tags: qual non-human-re-seq
Comments (1)

Dear GATK team,

I really appreciate the GATK team. GATK has been a very useful analysis tool for my study on human re-seq data.

Recently I have been trying GATK on non-human re-seq data, for which there is no dbSNP or dbIndel. In that case, I found strange variant calls with low QUAL values, from 10 to 60. I picked some of the low-QUAL variants and looked them up in the BAM file, but they were not real variants. I think this is because there was no variant recalibration step. So I am trying to filter the variant calls on some value, and here are my questions.

1) As I understand it, 'QUAL' is a phred-scaled quality score assigned by the variant caller, so I tried to use it to filter the variant calls. Is there a specific QUAL filtering threshold for non-human re-seq data?

2) In one of the posts on the GATK forum, a GATK developer said that they recommend looking at QD, not QUAL ( http://gatkforums.broadinstitute.org/discussion/6051/how-to-interpret-a-very-broad-distribution-of-qual ). In my case, can I use QD for filtering non-human re-seq data? And would you recommend a specific annotation and threshold for filtering non-human re-seq data?
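For reference, a phred-scaled score Q corresponds to an error probability of 10^(-Q/10). The sketch below applies that conversion to the QUAL range mentioned in this post (10 to 60); the function names are hypothetical, and the conversion is the general phred definition rather than anything specific to GATK.

```python
import math


def phred_to_error_prob(q):
    """Probability that the call is wrong, given a phred-scaled score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-scaled score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for qual in (10, 30, 60):
    print(f"QUAL {qual}: error probability {phred_to_error_prob(qual):.0e}")
```

Taken at face value, QUAL 10 corresponds to a 1 in 10 chance that the site is not variant, and QUAL 60 to 1 in a million. The score only reflects the caller's model, so systematic artifacts of the kind seen in the BAM file are not captured by it.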

Thank you in advance.

Yours respectfully, Hubert.

Created 2015-02-02 14:33:08 | Updated | Tags: dp qual qd
Comments (10)

Hello Forum Members and GATK Team, I am trying to understand the calculations that go into QD in GATK.

In the following post it is mentioned that the new versions calculate QD using AD: https://www.broadinstitute.org/gatk/blog?id=3862

In joint VCF file I have following entry: 1 5474238 . C A 163.45 PASS AC=2;AF=0.071;AN=28;DP=609;FS=0.000;MQ=60.00;MQ0=0;QD=27.05;Samples=Sample1;VQSLOD=18.21;VariantType=SNP;culprit=MQ GT:AD:GQ:PL 1/1:0,3:12:175,12,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,00/0:0,0:0:0,0,0

In the individual GVCF files (all 14) I searched for chromosome 1, position 5474238, and there is only one entry:
1 5474238 . C A, 142.28 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:12:175,12,0,175,12,175:0,0,2,1

Here the QUAL is 142.28 and the AD is 3. How does the QUAL become 163.45, with QD=27.05 and DP=609, in the joint VCF file?

The above is a simple example, where a GT other than 0/0 was called in only one sample.

Let's take another example (a bit more complicated). Joint VCF file entry:
1 24089531 . A G 882.89 PASS AC=13;AF=0.464;AN=28;BaseQRankSum=0.358;DP=159;FS=4.751;MQ=60.00;MQ0=0;MQRankSum=0.358;QD=10.77;ReadPosRankSum=0.00;Samples=Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7,Sample8,Sample9,Sample10;VQSLOD=13.36;VariantType=SNP;culprit=DP GT:AB:AD:GQ:PL 1/1:.:0,3:7:48,7,0 0/1:0.600:6,4:50:50,0,117 0/1:0.670:4,2:31:31,0,87 0/1:0.500:3,3:40:62,0,40 0/1:0.200:1,4:12:68,0,12 0/0:.:6,0:18:0,18,155 0/1:0.790:11,3:38:38,0,230 1/1:.:0,15:44:437,44,0 0/0:.:4,1:8:0,8,91 0/0:.:0,0:0:0,0,0 0/0:.:0,0:0:0,0,0 0/1:0.670:4,2:35:35,0,91 1/1:.:0,1:3:11,3,0 0/1:.:3,2:38:38,0,49

Entries in the GVCF files:

Sample1 1 24089531 . A G, 16.33 . DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:7:48,7,0,48,7,48:0,0,2,1

Sample2 1 24089531 . A G, 21.80 . BaseQRankSum=0.365;DP=12;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.741;ReadPosRankSum=-0.365 GT:AB:AD:GQ:PL:SB 0/1:0.600:6,4,0:50:50,0,117,68,128,196:6,0,3,1

Sample3 1 24089531 . A G, 38.03 . BaseQRankSum=-0.731;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AB:AD:GQ:PL:SB 0/1:0.200:1,4,0:12:68,0,12,71,24,95:1,0,3,1

Sample4 1 24089531 . A G, 0 . DP=7;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 0/0:6,0,0:18:0,18,155,18,155,155:6,0,0,0

Sample5 1 24089531 . A G, 4.61 . BaseQRankSum=-0.720;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.720;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:31:31,0,87,43,93,136:4,0,2,0

Sample6 1 24089531 . A G, 32.77 . BaseQRankSum=0.406;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.988;ReadPosRankSum=-0.406 GT:AB:AD:GQ:PL:SB 0/1:0.500:3,3,0:40:62,0,40,70,49,120:2,1,3,0

Sample7 1 24089531 . A G, 10.20 . BaseQRankSum=-0.156;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.778;ReadPosRankSum=-0.467 GT:AB:AD:GQ:PL:SB 0/1:0.790:11,3,0:38:38,0,230,70,239,309:11,0,3,0

Sample8 1 24089531 . A G, 10.20 . BaseQRankSum=-0.156;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.778;ReadPosRankSum=-0.467 GT:AB:AD:GQ:PL:SB 0/1:0.790:11,3,0:38:38,0,230,70,239,309:11,0,3,0

Sample9 1 24089531 . A G, 0 . BaseQRankSum=-0.731;DP=12;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AD:GQ:PL:SB 0/0:4,1,0:8:0,8,91,12,93,97:4,0,0,0



Sample10 1 24089531 . A G, 7.60 . BaseQRankSum=-1.380;DP=9;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.380;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:35:35,0,91,47,97,144:4,0,2,0

Sample11 1 24089531 . A G, 0.05 . DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,1,0:3:11,3,0,11,3,11:0,0,0,0

Sample12 1 24089531 . A G, 9.31 . BaseQRankSum=0.358;DP=7;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.358;ReadPosRankSum=-0.358 GT:AD:GQ:PL:SB 0/1:3,2,0:38:38,0,49,47,55,102:0,0,0,0

Using AD only, I am still not able to reproduce the QD value or the QUAL values in the joint VCF files.
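The calculation attempted here can be sketched as below. It is only an illustration of the naive reading "QD = QUAL divided by the summed AD of the samples that carry the variant"; it is not a statement of how GATK computes QD, and the function names are hypothetical. The genotypes and AD values are copied from the second joint VCF record above.

```python
# (GT, AD) pairs from the joint VCF record at 1:24089531, in sample order.
JOINT_GENOTYPES = [
    ("1/1", (0, 3)), ("0/1", (6, 4)), ("0/1", (4, 2)), ("0/1", (3, 3)),
    ("0/1", (1, 4)), ("0/0", (6, 0)), ("0/1", (11, 3)), ("1/1", (0, 15)),
    ("0/0", (4, 1)), ("0/0", (0, 0)), ("0/0", (0, 0)), ("0/1", (4, 2)),
    ("1/1", (0, 1)), ("0/1", (3, 2)),
]


def ad_depth_in_variant_samples(genotypes):
    """Sum AD over samples whose genotype is not hom-ref or missing."""
    return sum(sum(ad) for gt, ad in genotypes if gt not in ("0/0", "./."))


def naive_qd(qual, genotypes):
    """QUAL divided by the AD-based depth of the variant samples."""
    depth = ad_depth_in_variant_samples(genotypes)
    if depth == 0:
        raise ValueError("no reads in variant samples")
    return qual / depth


def implied_depth(qual, qd):
    """Depth that would reproduce the reported QD from the reported QUAL."""
    return qual / qd


print(ad_depth_in_variant_samples(JOINT_GENOTYPES))   # AD-based depth
print(round(naive_qd(882.89, JOINT_GENOTYPES), 2))    # naive QD
print(round(implied_depth(882.89, 10.77), 2))         # depth implied by QD=10.77
print(round(implied_depth(163.45, 27.05), 2))         # first example
```

The naive AD-based value does not match the reported QD in either example, and the depth implied by the reported QUAL and QD is larger than the summed AD of the variant samples. That is consistent with the discrepancy described in this post, although it does not explain its cause.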

GATK Version=3.2-2-gec30cee

Thanks in Advance.

Created 2014-11-19 18:20:14 | Updated 2014-11-19 18:20:41 | Tags: qual genotype-likelihood pl
Comments (5)


I was wondering how QUAL and PL are calculated. Is there a document that describes this for Haplotype Caller?


Created 2014-08-20 21:10:21 | Updated 2014-08-20 21:12:55 | Tags: qual
Comments (6)

Hello @Geraldine,

I'm getting some too-high QUAL scores in my VCF, the whole file is full of weird scores in the tens of thousands:

chrM 73 . G A 27120.46 . BaseQRankSum=-8.220e-01;ClippingRankSum=1.02;DP=1063;MLEAC=6;MLEAF=0.600;MQ=59.91;MQ0=0;MQRankSum=0.120;QD=32.10;ReadPosRankSum=-3.840e-01 GT:AD:DP:GQ:PL 1/1:1,332:333:99:10644,962,0 1/1:1,315:316:99:10159,913,0 0/0:118,0:118:99:0,120,1800 1/1:0,196:196:99:6366,586,0 0/0:95,0:95:99:0,120,1800
chrM 146 rs72619361 T C 16226.15 . DB;DP=1315;MLEAC=2;MLEAF=0.200;MQ=57.18;MQ0=0;QD=34.24 GT:AD:DP:GQ:PL 0/0:338,0:338:99:0,120,1800 0/0:324,0:324:99:0,120,1800 0/0:118,0:118:99:0,120,1800 0/0:175,0:175:99:0,120,1800 1/1:0,359:359:99:16268,1102,0
chrM 150 . T C 52405.15 . BaseQRankSum=0.310;ClippingRankSum=0.695;DP=1523;MLEAC=8;MLEAF=0.800;MQ=60.00;MQ0=0;MQRankSum=-4.940e-01;QD=30.63;ReadPosRankSum=-1.364e+00 GT:AD:DP:GQ:PL 1/1:0,421:421:99:14544,1264,0 1/1:0,415:415:99:14206,1244,0 0/0:118,0:118:99:0,120,1800 1/1:1,206:207:99:7027,586,0 1/1:0,356:356:99:16670,1132,0
chrM 152 rs117135796 T C 16299.15 . DB;DP=1495;MLEAC=2;MLEAF=0.200;MQ=57.18;MQ0=0;QD=29.09 GT:AD:DP:GQ:PL 0/0:409,0:409:99:0,120,1800 0/0:411,0:411:99:0,120,1800 0/0:118,0:118:99:0,120,1800 0/0:204,0:204:99:0,120,1800 1/1:0,352:352:99:16341,1102,0
chrM 194 . C T 12039.15 . BaseQRankSum=-1.325e+00;ClippingRankSum=-4.920e-01;DP=1754;MLEAC=2;MLEAF=0.200;MQ=60.00;MQ0=0;MQRankSum=-1.597e+00;QD=32.89;ReadPosRankSum=1.11 GT:AD:DP:GQ:PL 0/0:409,0:409:99:0,120,1800 0/0:411,0:411:99:0,120,1800 1/1:6,360:366:99:12081,883,0 0/0:204,0:204:99:0,120,1800 0/0:361,0:361:99:0,120,1800

It's just a 4x WGS file, nothing fancy.

Any idea of why this might be?


Created 2014-08-11 09:50:56 | Updated | Tags: unifiedgenotyper vcf qual
Comments (1)

I used the EMIT_ALL_SITES option with UnifiedGenotyper. For polymorphic sites, the quality score (QUAL field) corresponds to the phred-scaled probability that a REF/ALT polymorphism exists at this site given the sequencing data. But for monomorphic sites (where the call is homozygous for the reference allele), what is the definition of the quality score, and how is it computed? Many thanks for your explanation.

Created 2013-09-02 09:23:42 | Updated 2013-09-04 20:59:28 | Tags: unifiedgenotyper qual
Comments (3)

Hi, I'm working with the UnifiedGenotyper walker and I have detected strange values for the QUAL field of some VCF entries in the output files.

Sometimes the same QUAL value is repeated across different entries in the VCF output file. For example, the QUAL values 32729.73 and 2147483609.73 appear frequently in the output, and not only in my files: when I searched the GATK forum, these values also appeared in VCF files that other users had posted for unrelated questions.

I have tested this with several GATK versions. In the latest versions these QUAL numbers are extremely high, and I have also noticed that the value doesn't match the relationship QUAL~QD*DP.

Another strange thing is that, for other QUAL values in the VCF file, it is very common for the decimal part to begin with a seven, e.g. 32729.73.

Have you seen this? Is it some kind of bug?

I look forward to your response.

I copy some VCF entries with different GATK versions:

Last version (2.7-2):

chr13    32907535    .    C    CT    2147483609.73    .    AC=1;AF=0.500;AN=2;BaseQRankSum=-1.483;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.88;MQ0=0;MQRankSum=13.620;QD=33.72;RPA=11,12;RU=T;ReadPosRankSum=-12.065;STR    GT:AD:DP:GQ:PL    0/1:1386,1037:2453:99:9301,0,13130
chr13    32907589    .    G    GT    2147483609.73    .    AC=1;AF=0.500;AN=2;BaseQRankSum=6.991;DP=999;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.93;MQ0=0;MQRankSum=26.910;QD=28.67;RPA=7,8;RU=T;ReadPosRankSum=-2.595;STR    GT:AD:DP:GQ:PL    0/1:1306,1142:2469:99:14116,0,16944

V 2.6-5:

chr13    32907535    .    C    CT    2147483609.73    .    AC=1;AF=0.500;AN=2;BaseQRankSum=-2.106;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.81;MQ0=0;MQRankSum=14.984;QD=29.49;RPA=11,12;RU=T;ReadPosRankSum=-9.803;STR    GT:AD:DP:GQ:PL    0/1:1261,1038:2453:99:7901,0,13152
chr13    32907589    .    G    GT    2147483609.73    .    AC=1;AF=0.500;AN=2;BaseQRankSum=6.976;DP=998;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.74;MQ0=0;MQRankSum=25.865;QD=31.47;RPA=7,8;RU=T;ReadPosRankSum=-0.572;STR    GT:AD:DP:GQ:PL    0/1:1184,1142:2469:99:13365,0,16796

V 2.5-2:

chr13    32907535    .    C    CT    32729.73    .    AC=1;AF=0.500;AN=2;BaseQRankSum=0.023;DP=1000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.75;MQ0=0;MQRankSum=-3.054;QD=32.73;RPA=11,12;RU=T;ReadPosRankSum=3.137;STR    GT:AD:DP:GQ:PL    0/1:0,29:2453:99:7901,0,13152
chr13    32907589    .    G    GT    32729.73    .    AC=1;AF=0.500;AN=2;DP=999;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.71;MQ0=0;QD=32.76;RPA=7,8;RU=T;STR    GT:AD:DP:GQ:PL    0/1:0,0:2469:99:13365,0,16796
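The QUAL~QD*DP relationship mentioned in this post can be checked against the six records above with a short sketch. The function name and the tolerance of 0.01 are assumptions chosen for illustration; the QUAL, DP, and QD values are copied from the records as posted.

```python
# (GATK version, QUAL, INFO DP, QD) for the six records listed above.
RECORDS = [
    ("2.7-2", 2147483609.73, 1000, 33.72),
    ("2.7-2", 2147483609.73, 999, 28.67),
    ("2.6-5", 2147483609.73, 1000, 29.49),
    ("2.6-5", 2147483609.73, 998, 31.47),
    ("2.5-2", 32729.73, 1000, 32.73),
    ("2.5-2", 32729.73, 999, 32.76),
]


def qual_matches_qd(qual, dp, qd, tol=0.01):
    """True if QUAL / DP reproduces the reported QD within `tol`."""
    return abs(qual / dp - qd) < tol


for version, qual, dp, qd in RECORDS:
    status = "consistent" if qual_matches_qd(qual, dp, qd) else "inconsistent"
    print(f"v{version}: QUAL/DP = {qual / dp:.2f}, QD = {qd} -> {status}")

# Purely arithmetic observation: both repeated QUAL values sit the same
# distance below a signed-integer maximum (16-bit and 32-bit respectively).
print(round(32767 - 32729.73, 2))
print(round(2147483647 - 2147483609.73, 2))
```

Only the two v2.5-2 records satisfy QUAL~QD*DP. The last two lines show that 32729.73 and 2147483609.73 are each 37.27 below 32767 and 2147483647; the thread does not say why, so this is offered as an observation rather than an explanation.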

Created 2013-07-24 10:25:20 | Updated | Tags: unifiedgenotyper snp qual
Comments (1)

I've had the same DNA sample sequenced using two different library prep methods, NEB-Next and Illumina-Nextera, the latter of which generates twice as many reads as the former. When genotypes for the two samples are called individually using the UnifiedGenotyper, the read depth (DP) for the Illumina-Nextera library is roughly twice that of the NEB-Next library, but the quality score (QUAL) remains similar. I was expecting the QUAL to reflect the read depth, but obviously this is not the case. Could you enlighten me?