Tagged with #dp
0 documentation articles | 0 announcements | 14 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.
Comments (7)


Sometimes when a sample has no reads, it still can have a genotype. I used GenotypeGVCFs (without CombineGVCFs) to generate a vcf with several samples. If I extract data for one sample, I sometimes see missing genotype with no read :

GT:AD:DP:GQ:PGT:PID:PL  ./.:0,0:0:.:.:.:.

which seem ok to me. Note that DP = 0.

but more rarely, I see a genotype with no read :

GT:AD:DP:GQ:PGT:PID:PL  0/0:0,0:.:3:.:.:0,3,45
GT:AD:DP:GQ:PGT:PID:PL  1/1:0,0:.:3:1|1:32486973_A_AG:45,3,0

and this is more questionnable. Note that DP = "."

Can you explain how genotype can be called without reads ? (I suspect haplotype to be involved, but without read I do not understand) any idea about the DP value ? why has it a value with missing genotype ?

thank you.

Comments (10)

Hello GATK team,

  1. I've noticed a strange variant in the gvcf output of HC:

Raw vcf: 6 32552140 rs17882663 T A, 108.18 . DB;DP=0;MLEAC=2,0;MLEAF=1.00,0.00;MQ=0.00;MQ0=0 GT:GQ:PGT:PID:PL:SB 1/1:9:0|1:32552059_G_T:135,9,0,135,9,135:0,0,0,0

After GenortpeGVCFs: 6 32552140 rs17882663 T A 107.28 . AC=2;AF=1.00;AN=2;DB;DP=0;FS=0.000;GQ_MEAN=9.00;MLEAC=2;MLEAF=1.00;MQ=0.00;MQ0=0;NCC=0;SOR=0.693 GT:AD:GQ:PGT:PID:PL 1/1:0,0:9:1|1:32552059_G_T:135,9,0

The DP and AD are 0, but there is a variant - A. What do you think? Why does it happened?

  1. What is the difference between the DP in the format part and in the info part? I looked for an answer in the documentation, but couldn't find one. Bellow is an example of a big difference between the values of this two. 3 195511214 . G GACCTGTGGATGCTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGTGTC 673.77 . AC=1;AF=0.500;AN=2;DP=169;FS=0.000;GQ_MEAN=38.00;MLEAC=1;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:0:38:702,0,38


Comments (15)


we are working with GATK 3.2.2 and focus on an optimization of the SNP and indel calling in the context of our data. Thereby, we also determine the frequency of a certain variant. However, we came across some strange results in this case. In the vcf files there is of course the DP in the info field, representing the unfiltered depth, and the DP in the format field, representing the filtered depth. Yet, we also calculated the depth by ourselves in the bam file and looked it up in the IGV. One especially strange case is the following: For a sample GATK calls a deletion. The unfiltered DP (info field) is DP=644. The filtered DP (format field) is DP=505. It consists of 438 reads with the original genotype and 67 reads containing the variant. However, in our raw bam file and in the IGV things are different. There we see 499 reads in total. 425 containing a C (reference), one with an N and 73 with a deletion. We checked and the results are reproduceable. Furthermore, we see similar differences as well. There are many cases in which the number of reads containing the variant or the reference allele as we count it and as it is observable in the IGV is smaller than the reported number of filtered reads. How can this happen? I'm really looking forward to your answers.


Comments (10)

Hello Forum Members and GATK Team, I am trying to understand the calculations that goes into QD in GATK.

In the following its mentioned that the new versions calculate QD using AD: https://www.broadinstitute.org/gatk/blog?id=3862

In joint VCF file I have following entry: 1 5474238 . C A 163.45 PASS AC=2;AF=0.071;AN=28;DP=609;FS=0.000;MQ=60.00;MQ0=0;QD=27.05;Samples=Sample1;VQSLOD=18.21;VariantType=SNP;culprit=MQ GT:AD:GQ:PL 1/1:0,3:12:175,12,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,00/0:0,0:0:0,0,0

In the individual GVCF files (all 14) I search for Chromosome 1 and Position 5474238, and there only one entry: 1 5474238 . C A, 142.28 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:12:175,12,0,175,12,175:0,0,2,1

Here the QUAL is 142.28 and AD is 3. How come in the joint VCF file, the QUAL becomes 163.45 and QD=27.05 with DP=609?

Above is simple example where only in one sample a GT was called that was different from other 0/0.

Lets take another example (bit complicated): Joint VCF file entry: 1 24089531 . A G 882.89 PASS AC=13;AF=0.464;AN=28;BaseQRankSum=0.358;DP=159;FS=4.751;MQ=60.00;MQ0=0;MQRankSum=0.358;QD=10.77;ReadPosRankSum=0.00;Samples=Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7,Sample8,Sample9,Sample10;VQSLOD=13.36;VariantType=SNP;culprit=DP GT:AB:AD:GQ:PL 1/1:.:0,3:7:48,7,0 0/1:0.600:6,4:50:50,0,117 0/1:0.670:4,2:31:31,0,87 0/1:0.500:3,3:40:62,0,40 0/1:0.200:1,4:12:68,0,12 0/0:.:6,0:18:0,18,155 0/1:0.790:11,3:38:38,0,230 1/1:.:0,15:44:437,44,0 0/0:.:4,1:8:0,8,91 0/0:.:0,0:0:0,0,0 0/0:.:0,0:0:0,0,0 0/1:0.670:4,2:35:35,0,91 1/1:.:0,1:3:11,3,0 0/1:.:3,2:38:38,0,49

Entires in the GVCF files: Sample1 1 24089531 . A G, 16.33 . DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:7:48,7,0,48,7,48:0,0,2,1 Sample2 1 24089531 . A G, 21.80 . BaseQRankSum=0.365;DP=12;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.741;ReadPosRankSum=-0.365 GT:AB:AD:GQ:PL:SB 0/1:0.600:6,4,0:50:50,0,117,68,128,196:6,0,3,1 Sample3 1 24089531 . A G, 38.03 . BaseQRankSum=-0.731;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AB:AD:GQ:PL:SB 0/1:0.200:1,4,0:12:68,0,12,71,24,95:1,0,3,1 Sample4 1 24089531 . A G, 0 . DP=7;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 0/0:6,0,0:18:0,18,155,18,155,155:6,0,0,0 Sample5 1 24089531 . A G, 4.61 . BaseQRankSum=-0.720;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.720;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:31:31,0,87,43,93,136:4,0,2,0 Sample6 1 24089531 . A G, 32.77 . BaseQRankSum=0.406;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.988;ReadPosRankSum=-0.406 GT:AB:AD:GQ:PL:SB 0/1:0.500:3,3,0:40:62,0,40,70,49,120:2,1,3,0

Sample7 1 24089531 . A G, 10.20 . BaseQRankSum=-0.156;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.778;ReadPosRankSum=-0.467 GT:AB:AD:GQ:PL:SB 0/1:0.790:11,3,0:38:38,0,230,70,239,309:11,0,3,0

Sample8 1 24089531 . A G, 10.20 . BaseQRankSum=-0.156;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.778;ReadPosRankSum=-0.467 GT:AB:AD:GQ:PL:SB 0/1:0.790:11,3,0:38:38,0,230,70,239,309:11,0,3,0

Sample9 1 24089531 . A G, 0 . BaseQRankSum=-0.731;DP=12;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AD:GQ:PL:SB 0/0:4,1,0:8:0,8,91,12,93,97:4,0,0,0



Sample10 1 24089531 . A G, 7.60 . BaseQRankSum=-1.380;DP=9;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-1.380;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:35:35,0,91,47,97,144:4,0,2,0

Sample11 1 24089531 . A G, 0.05 . DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,1,0:3:11,3,0,11,3,11:0,0,0,0

Sample12 1 24089531 . A G, 9.31 . BaseQRankSum=0.358;DP=7;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.358;ReadPosRankSum=-0.358 GT:AD:GQ:PL:SB 0/1:3,2,0:38:38,0,49,47,55,102:0,0,0,0

Using AD only, I am still not able to the QD value or the QUAL values in the Joint VCF files.

GATK Version=3.2-2-gec30cee

Thanks in Advance.

Comments (7)


I am running the GATK 3.2-2 pipeline and I am getting some odd results from the genotype VCF stage. It appears that incorrect AD results are being output.

E.g. on the single sample gvcf I have:

GC G, 522.73 . BaseQRankSum=-0.716;ClippingRankSum=0.219;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=66.41;MQ0=0;MQRankSum=1.681;ReadPosRankSum=-1.155 GT:AD:DP:GQ:PL:SB 0/1:20,18,0:38:99:560,0,633,620,687,1307:10,10,10,8

This agrees with what I see in IGV for this site (see attached image).

However, when I joint call this site along with other gvcfs, the output for that sample looks like:


i.e. It appears to not output the expected 20,18 - but it still seems to be making the correct genotype call so I am assuming it is just an output bug and not something that is affecting calling, but it would be good to know for sure. Either way it is problematic as it is affecting our downstream depth filtering. This isn't the only example of this happening in our data. I am trying to identify more by looking for occasions where the AD fields sum to less than DP, am I right in thinking this should never be the case as AD is unfiltered compared to the filtered DP?

Thanks a lot


Comments (6)

I have run haplotype caller using gatk 3.2. I used Haplotype Caller on 4 samples, then GVCF on the 4 samples combined (commands below). I used select variants to randomly sample a portion of the SNPs for one of those 4 samples. When I look in IGV I notice a lot of strange results.

For example a clearly tri allelic site is called biallelic in the vcf (see attached) chr01 102445496 . A G 115.24 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.28;ClippingRankSum=0.124;DP=19;FS=0.000;GQ_MEAN=54.50;GQ_STDDEV=52.80;MQ=56.03;MQ0=0;MQRankSum=0.406;NCC=0;QD=4.61;ReadPosRankSum=0.406 GT:AD:DP:GQ:PL 0/1:8,7:19:99:220,0,248

Another error, which is common, is that the DP field is wrong - the counts it lists are incorrect (see attached) e.g. The depth here is listed as 5,23 but IGV shows that it is really 5,1 chr07 28268592 . C A 842.02 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.179e+00;ClippingRankSum=0.240;DP=28;FS=49.925;GQ_MEAN=44.50;GQ_STDDEV=62.93;MQ=60.00;MQ0=0;MQRankSum=0.300;NCC=2;QD=30.07;ReadPosRankSum=1.02 GT:AD:DP:GQ:PL 0/1:5,23:28:89:867,0,89

The base qualities of all the bases that are not N, are >30. Duplicate reads have been removed. I know other users had issues with DP when coverage is high, but these samples are ~15-25x.

SNP calling commands

java -Xmx8G -jar ${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} \ -T HaplotypeCaller \ -I ${in_dir}/"36E_"${REGION}""${SAMPLE_ABB}"_lowqNrecalibrated_3.bam" \ --min_base_quality_score 30 \ --output_mode EMIT_ALL_SITES \ --intervals $sites_dir/$file \ -o ${out_dir}/"37F"${REGION}"_"${SAMPLE_ABB}"_HapCal.vcf" \

java -Xmx4G -jar ${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} -T GenotypeGVCFs\ --variant ${out_dir}/"37F_"${REGION}""${SAMPLE1}"_HapCal_gVCF.vcf" \ --variant ${out_dir}/"37F"${REGION}""${SAMPLE2}"_HapCal_gVCF.vcf" \ --variant ${out_dir}/"37F"${REGION}""${SAMPLE3}"_HapCal_gVCF.vcf" \ --variant ${out_dir}/"37F"${REGION}""${SAMPLE4}"_HapCal_gVCF.vcf" \ -allSites \ --intervals $sites_dir/$file \ -o ${out_dir}/"38F"${REGION}"_"${FOUR}"_HapCal_includeNonVariantSites.vcf" \

Comments (9)

Hi team,

I've been having some issues with DP following CombineVariants:

I have two vcf files called by different callers - one by GATK UnifiedGenotyper and the other by samtools mpileup.

When I merge the files using CombineVariants, I notice that the DP per each variant is actually equal to the sum of DP of each of the vcf files. For example: If for a shared variant in both vcf files the DP=8, then the DP in the union file will be DP=16. Neverhteless, if a given variant is not shared by both files, then the DP in the union file will be equal to the input file.

Is there a way resolve this issue?




Comments (1)


I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant.

GT:AD:DP:GQ:PL 0/1:52,37:88:99:1078,0,1486

However, after select variants to exclude non variant and variants not passing Filter, the AD changed and eliminated the alternative reads though the DP remained unchanged.

GT:AD:DP:GQ:PL 0/1:51,0:88:99:0,99,1283

I think I recall another post having a similar issue due to multithreaded use of select variants


APologies for not commenting on this post instead as I had already posted this prior to seeing the other post!



Comments (1)

I am using GATK 2.5, and am calling HaplotypeCaller as

java -Xmx4g -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T HaplotypeCaller -I sample.bam -o hap.vcf -L subset.bed -dt NONE -D dbsnp_137.b37_addChr.vcf -filterNoBases -A Coverage -A DepthPerAlleleBySample

In the output vcf file, in the FORMAT field, I have GT:AD:GQ:PL. DP is lost. Even though it is specified in the header. UnifiedGenotyper is working fine, and producing DP information. Any ideas on what is wrong?


Comments (7)


I would like to filter variants based on max DP. I understand that in order to define the max DP cutoof there is a need to calculate the Sigma, as stated in GATK's best practices:

"The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts."

Hence, how can one calculate the sigma, and what is it exactly?



Comments (27)

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Many thanks in advance,


Comments (23)

Hi Team,

I have a multi-sample VCF file produced by UnifiedGenotyper. I now want to filter this file marking those variants with a low depth. However the DP entry in the info field is across all samples, and even if it were possible to assess the individual's DPs, I would then have to resolve the issue of a variant having low depth in one sample, and high in another. Any suggestions are appreciated.

Thanks for your time

Comments (3)


I am trying Unified Genotyper on a 454 data set combined with some artificial data. The bam file for this data has 12 aligned reads spanning the locus: Chr17:7578406. These 12 reads are obtained from real 454 run. On executing Unified Genotyper with the following options:

java -Xmx4g -jar /usr/local/lib/GenomeAnalysisTK-2.2-15-g9214b2f/GenomeAnalysisTK.jar -R ~/CGP/ReferenceSeq/fromGATK/bundle1.5/b37/human_g1k_v37.fasta -T UnifiedGenotyper -I NG.CL316.ordered.sorted.bam -o NG.1x.CL316.vcf --dbsnp ~/CGP/ReferenceSeq/fromGATK/bundle1.5/b37/dbsnp_135.b37.vcf -stand_call_conf 30 -stand_emit_conf 0.0 -dcov 5000 -L "17:7,578,400-7,578,410"

I get the following result:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       T       83.77   .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.529;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=2.7884;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;MQRankSum=0.529;QD=6.98;ReadPosRankSum=2.367     GT:AD:DP:GQ:PL  0/1:7,5:12:99:112,0,195

Then I added a copy of same 12 reads (after modifying the names of the added reads to give each added read a unique name) to the above bam file. I have ensured that the names are unique in the bam file. Then again I run UnifiedGenotyper and get the following result:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       T       165.77  .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.966;DB;DP=24;Dels=0.00;FS=1.885;HaplotypeScore=5.5768;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;MQRankSum=0.205;QD=6.91;ReadPosRankSum=3.367     GT:AD:DP:GQ:PL  0/1:14,10:22:99:194,0,363

I want to point out that in the first run, I see that INFO and FORMAT column's have DP=12 and AD of 7,5 add up indicating that no read for filtered for making the variant call. However in the second run, even though the 12 extra reads are exactly the same as the earlier 12 reads, we see that while the INFO column DP=24, the FORMAT column DP is 22. Which indicates that 2 reads have been filtered while making the variant call.

Also please note that I tried adding the 12 read incrementally. And observed that till adding the 7th read, the DPs were all matching up. But then when I added the 8th read, the result showed that 2 reads were filtered (from FORMAT column DP). So if only the 8th read was being filtered out, I would think it might be a problem with that read but why is an extra read which was not being filtered earlier is being filtered on addition of this read. Or is the FORMAT column DP incorrect?

The 12 aligned reads that were added to the sam file are attached. Please let me know if you would need any other info.

Thanks for any insight in this issue.

Comments (4)

Does the relationship between AD and DP stil hold in VCF produced from ReduceRead BAMs? That is the sum of AD is <= DP Or can other scenarios now occur?

Also is AD summarized to 1,0 or 0,1 for homozygous REF and ALT? Thanks.