# Tagged with #dp 2 documentation articles | 0 announcements | 18 forum discussions

Created 2015-08-17 21:27:25 | Updated | Tags: math ad dp

### The problem:

You're trying to evaluate the support for a particular call, but the numbers in the DP (total depth) and AD (allele depth) fields aren't making any sense. For example, the sum of all the ADs doesn't match up to the DP, or even more baffling, the AD for an allele that was called is zero!

Many users have reported being confused by variant calls where there is apparently no evidence for the called allele. For example, sometimes a VCF may contain a variant call that looks like this:

2 151214 . G A 673.77 . AN=2;DP=10;FS=0.000;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:10:38:702,0,38

You can see in the Format field the AD values are 0 for both of the alleles. However, in the Info and FORMAT fields, the DP is 10. Because the DP in the INFO field is unfiltered and the DP in the FORMAT field is filtered, you know none of the reads were filtered out by the engine's built-in read filters. And if you look at the "bamout", you see 10 reads covering the position! So why is the VCF reporting an AD value of 0?

This is not actually a bug -- the program is doing what we expect; this is an interpretation problem. The answer lies in uninformative reads.

We call a read “uninformative” when it passes the quality filters, but the likelihood of the most likely allele given the read is not significantly larger than the likelihood of the second most likely allele given the read. Specifically, the difference between the Phred scaled likelihoods must be greater than 0.2 to be considered significant. In other words, that means the most likely allele must be 60% more likely than the second most likely allele.

Let’s walk through an example to make this clearer. Let’s say we have 2 reads and 2 possible alleles at a site. All of the reads have passed HaplotypeCaller’s quality filters, and the likelihoods of the alleles given the reads are in the table below.

Reads Likelihood of A Likelihood of T
1 3.8708e-7 3.6711e-7
2 4.9992e-7 2.8425e-7

Note: Keep in mind that HaplotypeCaller marginalizes the likelihoods of the haplotypes given the reads to get the likelihoods of the alleles given the reads. The table above shows the likelihoods of the alleles given the reads. For additional details, please see the HaplotypeCaller method documentation.

Now, let’s convert the likelihoods into Phred-scaled likelihoods. To do this, we simply take the log of the likelihoods.

Reads Phred-scaled likelihood of A Phred-scaled likelihood of T
1 -6.4122 -6.4352
2 -6.3011 -6.5463

Now, we want to determine if read 1 is informative. To do this, we simply look at the Phred scaled likelihoods of the most likely allele and the second most likely allele. The Phred scaled likelihood of the most likely allele (A) is -6.4122.The Phred-scaled likelihood of the second most likely allele (T) is -6.4352. Taking the difference between the two likelihoods gives us 0.023. Because 0.023 is Less than 0.2, read 1 is considered uninformative.

To determine if read 2 is informative, we take -6.3011-(-6.5463). This gives us 0.2452, which is greater than 0.2. Read 2 is considered informative.

How does a difference of 0.2 mean the most likely allele is ~60% more likely than the second most likely allele? Well, because the likelihoods are Phred-scaled, 0.2 = 10^0.2 = 1.585 which is approximately 60% greater.

### Conclusion

So, now that we know the math behind determining which reads are informative, let’s look at how this affects the record output to the VCF. If a read is considered informative, it gets counted toward the AD and DP of the variant allele in the output record. If a read is considered uninformative, it is counted towards the DP, but not the AD. That way, the AD value reflects how many reads actually contributed support for a given allele at the site. We would not want to include uninformative reads in the AD value because we don’t have confidence in them.

Please note, however, that although an uninformative read is not reported in the AD, it is still used in calculations for genotyping. In future we may add an annotation to indicate counts of reads that were considered informative vs. uninformative. Let us know in the comments if you think that would be helpful.

In most cases, you will have enough coverage at a site to disregard small numbers of uninformative reads. Unfortunately, sometimes uninformative reads are the only reads you have at a site. In this case, we report the potential variant allele, but keep the AD values 0. The uncertainty at the site will be reflected in the QG and PL values.

Created 2014-10-17 19:19:39 | Updated 2015-07-06 13:44:00 | Tags: coveragebysample depthofcoverage depthperallelebysample ad dp

### Overview

This document describes the proper use of metrics associated with depth of coverage for the purpose of evaluating variants.

The metrics involved are the following:

• DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
• Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

For an overview of the tools and concepts involved in performing sequence coverage analysis, where the purpose is to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?", please see this document.

### Coverage annotations: DP and AD

The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:

• DepthPerAlleleBySample (AD): outputs the depth of coverage of each allele per sample.
• Coverage (DP): outputs the filtered depth of coverage for each sample and the unfiltered depth of coverage across all samples.

At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.

The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is a an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.

Note that because the AD includes reads and bases that were filtered by the caller (and in case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the phred-scaled genotype likelihoods (PLs) are what determines the genotype calls.

TO BE CONTINUED...

No posts found with the requested search criteria.

Created 2015-11-11 15:45:08 | Updated | Tags: haplotypecaller downsampling ad dp read-counts

What I have learned so far from other discussions about HaplotypeCaller:

• read counts for positions with very high coverage are downsampled
• this does not affect variant calling
• this does affect DP and AD fields in the output (g)vcf file
• don't use -nct parameter with HC
• downsampling is hard-coded and can't be influenced by parameters

Nonetheless two problems remain: The HC doc says "This tool applies the following downsampling settings by default. To coverage: 500" Why is it possible to observe much higher coverage (DP, AD) values in the output vcf file?

I observe SNPs where the recalibrated bam file in IGV has a depth of 1385 for the reference and 1233 for alternate allele but 839 (reference) and 246 (alt) in the HaplotypeCaller vcf file. Maybe this happens by chance, as reads for downsampling are chosen at random or it is related to this bug [gatkforums.broadinstitute.org/discussion/5882/uncorrect-strand-bias-due-to-downsampling-haplotypecaller](http://gatkforums.broadinstitute.org/discussion/5882/uncorrect-strand-bias-due-to-downsampling-haplotypecaller

Both observations lead to the conclusion that DP and AD values from HC output are of little use for samples with high (where does high start? 500?) coverage.

Created 2015-08-07 14:25:49 | Updated | Tags: depthofcoverage haplotypecaller dp solid igv

Hello Everyone!

I'm using the whole GATK workflow to analyze Target Resequencing data coming from SOLID platforms. I followed the Best Practices for analysis and used the proper SOLID flags when using BaseRecalibrator (--solid_recal_mode SET_Q_ZERO_BASE_N --solid_nocall_strategy PURGE_READ), however, when looking at the VCF files after Haplotype Caller something does not add up.

I checked some of the variants inside some of my samples and i found that the DP field does not report the same per base coverage value than the one that are reported by the bam (using the --bamOutput to produce a bam for Haplotype Caller) when looking at them using the IGV. As far as I understand, for each position there's a downsampling, but I'm see a lower DP value compared to the ones that are stored in the BAM I'm attaching an IGV screenshots of one of the variants in which i'm encountering this problem. I deactivated all filtering alignment options in IGV, as well as downsampling. Here's the line Reported in the VCF for this variant:

As you can see from the screenshot, not only the covers differ, but a lot of reads that maps according to the reference are missing- Does somebody has an idea of what happened to the coverage inside the VCF?

Thanks a lot for your time!

Daniele

Created 2015-07-29 13:43:11 | Updated | Tags: dp eval

In the following small table you can see the relationship of SNP-calls with coverage_depth (DP) for 34 samples. SNP calls extracted from Haplotype_caller module of GATK. As it can bee seen by increasing DP the number of SNP-calls to be decreased. However, my expectation was in opposite. Is there any explanation? Thanks

Call-SNP DP 39263 3 34950 4 21228 5 17878 6 13249 7 10913 8 8891 9 7484 10 6126 11 5165 12 4336 13 3807 14 3240 15 2975 16 2636 17 2347 18 2222 19 2046 20

Created 2015-05-22 08:00:24 | Updated 2015-05-22 08:01:11 | Tags: haplotypecaller dp

Hi,

currently I'm running GATK version 3.3.0 and run in the problem, that quite often I don't get a value for DP after the HaplotypeCaller. There actually is nothing (just a dot) in the VCF file.

chr11 57569573 . A ACC 1640.8 VQSRTrancheINDEL99.90to100.00 . GT:AD:DP:GQ:PGT:PID:PL 0/1:57,16:.:99:0|1:57569569_CAGG_C:480,0,7262 chr11 57569575 . A AT 1686.05 VQSRTrancheINDEL99.90to100.00 . GT:AD:DP:GQ:PGT:PID:PL 0/1:60,16:.:99:0|1:57569569_CAGG_C:492,0,7092

What woks is to annotate afterwards using the VariantAnnotator. But in the documentation it states, that the VariantAnnotator defines the depth by the total amount of reads at that specific position. In contrast the DP after HaplotypeCaller should be the number of informative reads for that position. I can find that difference when I check for the position in GEMINI prior and after Annotation with Variant annotator.

Patient 1 (after reannotation using VariantAnnotator) Position depth variant 57563990 75 C/C 57569568 81 CAGG/C 57569572 78 A/ACC 57569574 81 A/AT

Patient 2 (after reannotation using VariantAnnotator) Postition depth variant 57563990 73 C/T 57569568 104 CAGG/CAGG 57569572 106 A/A 57569574 106 A/A

After HaplotypeCaller Postion patient1 patient2 ...(other patients) 57563990 48 -1 -1 -1 -1 -1 C/C C/T C/T C/T C/T C/T 57569568 -1 66 52 -1 56 60 CAGG/C CAGG/CAGG CAGG/CAGG CAGG/C CAGG/CAGG CAGG/CAGG 57569572 -1 101 88 -1 101 77 A/ACC A/A A/A A/ACC A/A A/A 57569574 -1 97 87 -1 94 78 A/AT A/A A/A A/AT A/A A/A 57576040 -1 -1 -1 -1 -1 8 ./. ./. ./. ./. ./. C/C 57578937 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. TTG/T 57578941 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. C/CTG 57578942 -1 -1 -1 -1 -1 -1 ./. ./. ./. ./. ./. C/G

E.g. patient 2 has in the position 57569568 66 informative reads, but 104 reads in total. To check how valid a Variant is, I think it is way more interesting to know the number of informative reads.

In this specific case we checked for those mutation (57569572 and 57569574) via sanger and didn't find anything. If we would know that only 2 or 3 reads would be informative we might have gone to another variant instead.

Am I doing something wrong, or is there a bug in the software?

One example how I call the HaplotypeCaller.

java -Xmx10g -jar /data/ngs/bin/GATK/3.3-0/GenomeAnalysisTK.jar -l INFO -et NO_ET \ -T HaplotypeCaller \ -R /data/ngs/programs/bundle_2.8/ucsc.hg19.fasta \ -D /data/ngs/programs/bundle_2.8/dbsnp_138.hg19.vcf \ -L /data/ngs/exome_targets/SureSelect_V4_S03723314_Regions.bed \ -ERC GVCF \ -variant_index_type LINEAR \ -variant_index_parameter 128000 \ -stand_call_conf 30 \ -stand_emit_conf 30 \ -A Coverage -A AlleleBalance \ -A QualByDepth -A HaplotypeScore \ -nct 4 \ -I out/06-BQSR/21_383_1.bam \ -o out/08-variant-calling/21_383_1/21_383_1-raw.vcf \ -log out/08-variant-calling/21_383_1/21_383_1-raw_calling.log

Anselm

Created 2015-03-09 16:48:43 | Updated | Tags: dp gt genotypegvcfs

Hi,

Sometimes when a sample has no reads, it still can have a genotype. I used GenotypeGVCFs (without CombineGVCFs) to generate a vcf with several samples. If I extract data for one sample, I sometimes see missing genotype with no read :

GT:AD:DP:GQ:PGT:PID:PL  ./.:0,0:0:.:.:.:.


which seem ok to me. Note that DP = 0.

but more rarely, I see a genotype with no read :

GT:AD:DP:GQ:PGT:PID:PL  0/0:0,0:.:3:.:.:0,3,45


and this is more questionnable. Note that DP = "."

Can you explain how genotype can be called without reads ? (I suspect haplotype to be involved, but without read I do not understand) any idea about the DP value ? why has it a value with missing genotype ?

thank you.

Created 2015-03-08 09:52:48 | Updated | Tags: haplotypecaller dp

Hello GATK team,

1. I've noticed a strange variant in the gvcf output of HC:

Raw vcf: 6 32552140 rs17882663 T A, 108.18 . DB;DP=0;MLEAC=2,0;MLEAF=1.00,0.00;MQ=0.00;MQ0=0 GT:GQ:PGT:PID:PL:SB 1/1:9:0|1:32552059_G_T:135,9,0,135,9,135:0,0,0,0

After GenortpeGVCFs: 6 32552140 rs17882663 T A 107.28 . AC=2;AF=1.00;AN=2;DB;DP=0;FS=0.000;GQ_MEAN=9.00;MLEAC=2;MLEAF=1.00;MQ=0.00;MQ0=0;NCC=0;SOR=0.693 GT:AD:GQ:PGT:PID:PL 1/1:0,0:9:1|1:32552059_G_T:135,9,0

The DP and AD are 0, but there is a variant - A. What do you think? Why does it happened?

1. What is the difference between the DP in the format part and in the info part? I looked for an answer in the documentation, but couldn't find one. Bellow is an example of a big difference between the values of this two. 3 195511214 . G GACCTGTGGATGCTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGTGTC 673.77 . AC=1;AF=0.500;AN=2;DP=169;FS=0.000;GQ_MEAN=38.00;MLEAC=1;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:0:38:702,0,38

Maya

Created 2015-02-12 13:25:05 | Updated | Tags: snp dp indel

Hey,

we are working with GATK 3.2.2 and focus on an optimization of the SNP and indel calling in the context of our data. Thereby, we also determine the frequency of a certain variant. However, we came across some strange results in this case. In the vcf files there is of course the DP in the info field, representing the unfiltered depth, and the DP in the format field, representing the filtered depth. Yet, we also calculated the depth by ourselves in the bam file and looked it up in the IGV. One especially strange case is the following: For a sample GATK calls a deletion. The unfiltered DP (info field) is DP=644. The filtered DP (format field) is DP=505. It consists of 438 reads with the original genotype and 67 reads containing the variant. However, in our raw bam file and in the IGV things are different. There we see 499 reads in total. 425 containing a C (reference), one with an N and 73 with a deletion. We checked and the results are reproduceable. Furthermore, we see similar differences as well. There are many cases in which the number of reads containing the variant or the reference allele as we count it and as it is observable in the IGV is smaller than the reported number of filtered reads. How can this happen? I'm really looking forward to your answers.

Sarah

Created 2015-02-02 14:33:08 | Updated | Tags: dp qual qd

Hello Forum Members and GATK Team, I am trying to understand the calculations that goes into QD in GATK.

In the following its mentioned that the new versions calculate QD using AD: https://www.broadinstitute.org/gatk/blog?id=3862

In joint VCF file I have following entry: 1 5474238 . C A 163.45 PASS AC=2;AF=0.071;AN=28;DP=609;FS=0.000;MQ=60.00;MQ0=0;QD=27.05;Samples=Sample1;VQSLOD=18.21;VariantType=SNP;culprit=MQ GT:AD:GQ:PL 1/1:0,3:12:175,12,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,0 0/0:0,0:0:0,0,00/0:0,0:0:0,0,0

In the individual GVCF files (all 14) I search for Chromosome 1 and Position 5474238, and there only one entry: 1 5474238 . C A, 142.28 . DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:12:175,12,0,175,12,175:0,0,2,1

Here the QUAL is 142.28 and AD is 3. How come in the joint VCF file, the QUAL becomes 163.45 and QD=27.05 with DP=609?

Above is simple example where only in one sample a GT was called that was different from other 0/0.

Lets take another example (bit complicated): Joint VCF file entry: 1 24089531 . A G 882.89 PASS AC=13;AF=0.464;AN=28;BaseQRankSum=0.358;DP=159;FS=4.751;MQ=60.00;MQ0=0;MQRankSum=0.358;QD=10.77;ReadPosRankSum=0.00;Samples=Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7,Sample8,Sample9,Sample10;VQSLOD=13.36;VariantType=SNP;culprit=DP GT:AB:AD:GQ:PL 1/1:.:0,3:7:48,7,0 0/1:0.600:6,4:50:50,0,117 0/1:0.670:4,2:31:31,0,87 0/1:0.500:3,3:40:62,0,40 0/1:0.200:1,4:12:68,0,12 0/0:.:6,0:18:0,18,155 0/1:0.790:11,3:38:38,0,230 1/1:.:0,15:44:437,44,0 0/0:.:4,1:8:0,8,91 0/0:.:0,0:0:0,0,0 0/0:.:0,0:0:0,0,0 0/1:0.670:4,2:35:35,0,91 1/1:.:0,1:3:11,3,0 0/1:.:3,2:38:38,0,49

Entires in the GVCF files: Sample1 1 24089531 . A G, 16.33 . DP=6;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,3,0:7:48,7,0,48,7,48:0,0,2,1 Sample2 1 24089531 . A G, 21.80 . BaseQRankSum=0.365;DP=12;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.741;ReadPosRankSum=-0.365 GT:AB:AD:GQ:PL:SB 0/1:0.600:6,4,0:50:50,0,117,68,128,196:6,0,3,1 Sample3 1 24089531 . A G, 38.03 . BaseQRankSum=-0.731;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.731;ReadPosRankSum=0.731 GT:AB:AD:GQ:PL:SB 0/1:0.200:1,4,0:12:68,0,12,71,24,95:1,0,3,1 Sample4 1 24089531 . A G, 0 . DP=7;MLEAC=0,0;MLEAF=0.00,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 0/0:6,0,0:18:0,18,155,18,155,155:6,0,0,0 Sample5 1 24089531 . A G, 4.61 . BaseQRankSum=-0.720;DP=6;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=-0.720;ReadPosRankSum=0.000 GT:AB:AD:GQ:PL:SB 0/1:0.670:4,2,0:31:31,0,87,43,93,136:4,0,2,0 Sample6 1 24089531 . A G, 32.77 . BaseQRankSum=0.406;DP=8;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0;MQRankSum=0.988;ReadPosRankSum=-0.406 GT:AB:AD:GQ:PL:SB 0/1:0.500:3,3,0:40:62,0,40,70,49,120:2,1,3,0

Sample14

Sample13

Sample11 1 24089531 . A G, 0.05 . DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQ0=0 GT:AD:GQ:PL:SB 1/1:0,1,0:3:11,3,0,11,3,11:0,0,0,0

Using AD only, I am still not able to the QD value or the QUAL values in the Joint VCF files.

GATK Version=3.2-2-gec30cee

Created 2014-12-12 12:53:50 | Updated | Tags: ad dp genotypegvcfs

Hi,

I am running the GATK 3.2-2 pipeline and I am getting some odd results from the genotype VCF stage. It appears that incorrect AD results are being output.

E.g. on the single sample gvcf I have:

This agrees with what I see in IGV for this site (see attached image).

However, when I joint call this site along with other gvcfs, the output for that sample looks like:

0/1:20,0:38:99:560,0,633

i.e. It appears to not output the expected 20,18 - but it still seems to be making the correct genotype call so I am assuming it is just an output bug and not something that is affecting calling, but it would be good to know for sure. Either way it is problematic as it is affecting our downstream depth filtering. This isn't the only example of this happening in our data. I am trying to identify more by looking for occasions where the AD fields sum to less than DP, am I right in thinking this should never be the case as AD is unfiltered compared to the filtered DP?

Thanks a lot

Dan

Created 2014-07-22 21:35:03 | Updated | Tags: dp haplotype-caller

I have run haplotype caller using gatk 3.2. I used Haplotype Caller on 4 samples, then GVCF on the 4 samples combined (commands below). I used select variants to randomly sample a portion of the SNPs for one of those 4 samples. When I look in IGV I notice a lot of strange results.

For example a clearly tri allelic site is called biallelic in the vcf (see attached) chr01 102445496 . A G 115.24 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.28;ClippingRankSum=0.124;DP=19;FS=0.000;GQ_MEAN=54.50;GQ_STDDEV=52.80;MQ=56.03;MQ0=0;MQRankSum=0.406;NCC=0;QD=4.61;ReadPosRankSum=0.406 GT:AD:DP:GQ:PL 0/1:8,7:19:99:220,0,248

Another error, which is common, is that the DP field is wrong - the counts it lists are incorrect (see attached) e.g. The depth here is listed as 5,23 but IGV shows that it is really 5,1 chr07 28268592 . C A 842.02 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.179e+00;ClippingRankSum=0.240;DP=28;FS=49.925;GQ_MEAN=44.50;GQ_STDDEV=62.93;MQ=60.00;MQ0=0;MQRankSum=0.300;NCC=2;QD=30.07;ReadPosRankSum=1.02 GT:AD:DP:GQ:PL 0/1:5,23:28:89:867,0,89

The base qualities of all the bases that are not N, are >30. Duplicate reads have been removed. I know other users had issues with DP when coverage is high, but these samples are ~15-25x.

SNP calling commands

java -Xmx8G -jar ${gatk_dir}/GenomeAnalysisTK.jar -R${ref_dir}/${genome} \ -T HaplotypeCaller \ -I${indir}/"36E"${REGION}"_"${SAMPLE_ABB}"_lowqNrecalibrated_3.bam" \ --min_base_quality_score 30 \ --output_mode EMIT_ALL_SITES \ --intervals $sites_dir/$file \ -o ${outdir}/"37F"${REGION}"_"${SAMPLE_ABB}"_HapCal.vcf" \ java -Xmx4G -jar${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} -T GenotypeGVCFs\ --variant ${outdir}/"37F"${REGION}"_"${SAMPLE1}"_HapCal_gVCF.vcf" \ --variant${outdir}/"37F"${REGION}"_"${SAMPLE2}"_HapCal_gVCF.vcf" \ --variant ${outdir}/"37F"${REGION}"_"${SAMPLE3}"_HapCal_gVCF.vcf" \ --variant${outdir}/"37F"${REGION}"_"${SAMPLE4}"_HapCal_gVCF.vcf" \ -allSites \ --intervals $sites_dir/$file \ -o ${outdir}/"38F"${REGION}"_"{FOUR}"_HapCal_includeNonVariantSites.vcf" \ Created 2014-02-02 18:34:10 | Updated | Tags: combinevariants dp Hi team, I've been having some issues with DP following CombineVariants: I have two vcf files called by different callers - one by GATK UnifiedGenotyper and the other by samtools mpileup. When I merge the files using CombineVariants, I notice that the DP per each variant is actually equal to the sum of DP of each of the vcf files. For example: If for a shared variant in both vcf files the DP=8, then the DP in the union file will be DP=16. Neverhteless, if a given variant is not shared by both files, then the DP in the union file will be equal to the input file. Is there a way resolve this issue? Thanks! Best, Sagi Created 2014-01-10 15:39:54 | Updated 2014-01-10 16:16:06 | Tags: unifiedgenotyper ad dp allele-depth variant-calling Hi, I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant. GT:AD:DP:GQ:PL 0/1:52,37:88:99:1078,0,1486 However, after select variants to exclude non variant and variants not passing Filter, the AD changed and eliminated the alternative reads though the DP remained unchanged. GT:AD:DP:GQ:PL 0/1:51,0:88:99:0,99,1283 I think I recall another post having a similar issue due to multithreaded use of select variants http://gatkforums.broadinstitute.org/discussion/1943/weird-behaviour-of-selectvatiants APologies for not commenting on this post instead as I had already posted this prior to seeing the other post! Thanks, MC Created 2013-11-15 21:38:02 | Updated | Tags: haplotypecaller dp I am using GATK 2.5, and am calling HaplotypeCaller as java -Xmx4g -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T HaplotypeCaller -I sample.bam -o hap.vcf -L subset.bed -dt NONE -D dbsnp_137.b37_addChr.vcf -filterNoBases -A Coverage -A DepthPerAlleleBySample In the output vcf file, in the FORMAT field, I have GT:AD:GQ:PL. DP is lost. Even though it is specified in the header. UnifiedGenotyper is working fine, and producing DP information. Any ideas on what is wrong? Thanks Created 2013-10-23 12:12:10 | Updated | Tags: dp sigma Hi, I would like to filter variants based on max DP. I understand that in order to define the max DP cutoof there is a need to calculate the Sigma, as stated in GATK's best practices: "The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts." Hence, how can one calculate the sigma, and what is it exactly? Thanks! Sagi Created 2013-10-16 12:50:25 | Updated | Tags: unifiedgenotyper haplotypecaller dp Dear GATK Team, I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper. I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x. My pipeline is: Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper. Here are the minimum commands that reproduce the discrepancy: java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ --dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \ -R /gatk_bundle/ucsc.hg19.fasta \ -I sample1.rg.bam \ -o sample1.HC.vcf \ -L ROI.bed \ -dt NONE \ -nct 8 Example variant from sample1.HC.vcf: chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0 ... In comparison to using UnifiedGenotyper with exactly the same alignment file: java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ --dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \ -R /gatk_bundle/ucsc.hg19.fasta \ -I sample1.rg.bam \ -o sample1.UG.vcf \ -L ROI.bed \ -nct 4 \ -dt NONE \ -glm BOTH Example variant from sample1.UG.vcf: chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0 I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good: awk '{if (3=="chr17" && $4 > (41245466-100) &&$4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[\$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Sam

Created 2013-01-07 13:05:07 | Updated | Tags: variantfiltration dp

Hi Team,

I have a multi-sample VCF file produced by UnifiedGenotyper. I now want to filter this file marking those variants with a low depth. However the DP entry in the info field is across all samples, and even if it were possible to assess the individual's DPs, I would then have to resolve the issue of a variant having low depth in one sample, and high in another. Any suggestions are appreciated.

Created 2012-12-21 23:24:24 | Updated 2013-01-07 19:15:12 | Tags: dp

Hello,

I am trying Unified Genotyper on a 454 data set combined with some artificial data. The bam file for this data has 12 aligned reads spanning the locus: Chr17:7578406. These 12 reads are obtained from real 454 run. On executing Unified Genotyper with the following options:

java -Xmx4g -jar /usr/local/lib/GenomeAnalysisTK-2.2-15-g9214b2f/GenomeAnalysisTK.jar -R ~/CGP/ReferenceSeq/fromGATK/bundle1.5/b37/human_g1k_v37.fasta -T UnifiedGenotyper -I NG.CL316.ordered.sorted.bam -o NG.1x.CL316.vcf --dbsnp ~/CGP/ReferenceSeq/fromGATK/bundle1.5/b37/dbsnp_135.b37.vcf -stand_call_conf 30 -stand_emit_conf 0.0 -dcov 5000 -L "17:7,578,400-7,578,410"

I get the following result:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       T       83.77   .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.529;DB;DP=12;Dels=0.00;FS=0.000;HaplotypeScore=2.7884;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;MQRankSum=0.529;QD=6.98;ReadPosRankSum=2.367     GT:AD:DP:GQ:PL  0/1:7,5:12:99:112,0,195

Then I added a copy of same 12 reads (after modifying the names of the added reads to give each added read a unique name) to the above bam file. I have ensured that the names are unique in the bam file. Then again I run UnifiedGenotyper and get the following result:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  z
17      7578406 rs28934578      C       T       165.77  .       AC=1;AF=0.500;AN=2;BaseQRankSum=-0.966;DB;DP=24;Dels=0.00;FS=1.885;HaplotypeScore=5.5768;MLEAC=1;MLEAF=0.500;MQ=38.06;MQ0=0;MQRankSum=0.205;QD=6.91;ReadPosRankSum=3.367     GT:AD:DP:GQ:PL  0/1:14,10:22:99:194,0,363

I want to point out that in the first run, I see that INFO and FORMAT column's have DP=12 and AD of 7,5 add up indicating that no read for filtered for making the variant call. However in the second run, even though the 12 extra reads are exactly the same as the earlier 12 reads, we see that while the INFO column DP=24, the FORMAT column DP is 22. Which indicates that 2 reads have been filtered while making the variant call.