You're trying to evaluate the support for a particular call, but the numbers in the DP (total depth) and AD (allele depth) fields aren't making any sense. For example, the sum of all the ADs doesn't match up to the DP, or even more baffling, the AD for an allele that was called is zero!
Many users have reported being confused by variant calls where there is apparently no evidence for the called allele. For example, sometimes a VCF may contain a variant call that looks like this:
2 151214 . G A 673.77 . AN=2;DP=10;FS=0.000;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:10:38:702,0,38
You can see in the FORMAT field that the AD values are 0 for both alleles. However, in both the INFO and FORMAT fields, the DP is 10. Because the DP in the INFO field is unfiltered and the DP in the FORMAT field is filtered, you know that none of the reads were filtered out by the engine's built-in read filters. And if you look at the "bamout", you see 10 reads covering the position! So why is the VCF reporting an AD value of 0?
This is not actually a bug -- the program is doing what we expect; this is an interpretation problem. The answer lies in uninformative reads.
We call a read “uninformative” when it passes the quality filters, but the likelihood of the most likely allele given the read is not significantly larger than the likelihood of the second most likely allele given the read. Specifically, the difference between the Phred-scaled likelihoods must be greater than 0.2 for a read to be considered informative. In other words, the most likely allele must be approximately 60% more likely than the second most likely allele.
Let’s walk through an example to make this clearer. Let’s say we have 2 reads and 2 possible alleles at a site. All of the reads have passed HaplotypeCaller’s quality filters, and the likelihoods of the alleles given the reads are in the table below.
|Reads|Likelihood of A|Likelihood of T|
|Read 1|3.8708e-07|3.6711e-07|
|Read 2|4.9992e-07|2.8425e-07|
Note: Keep in mind that HaplotypeCaller marginalizes the likelihoods of the haplotypes given the reads to get the likelihoods of the alleles given the reads. The table above shows the likelihoods of the alleles given the reads. For additional details, please see the HaplotypeCaller method documentation.
Now, let’s convert the likelihoods into Phred-scaled likelihoods. To do this, we simply take the log (base 10) of the likelihoods.
|Reads|Phred-scaled likelihood of A|Phred-scaled likelihood of T|
|Read 1|-6.4122|-6.4352|
|Read 2|-6.3011|-6.5463|
Now, we want to determine whether read 1 is informative. To do this, we simply look at the Phred-scaled likelihoods of the most likely allele and the second most likely allele. The Phred-scaled likelihood of the most likely allele (A) is -6.4122. The Phred-scaled likelihood of the second most likely allele (T) is -6.4352. Taking the difference between the two likelihoods gives us 0.023. Because 0.023 is less than 0.2, read 1 is considered uninformative.
To determine whether read 2 is informative, we take -6.3011 - (-6.5463). This gives us 0.2452, which is greater than 0.2, so read 2 is considered informative.
How does a difference of 0.2 mean the most likely allele is ~60% more likely than the second most likely allele? Because the likelihoods are Phred-scaled, a difference of 0.2 corresponds to a likelihood ratio of 10^0.2 ≈ 1.585, which is approximately 60% greater.
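The informativeness test above can be sketched as a few lines of Python. This is a toy illustration only, not GATK's actual implementation; the likelihood values are the ones from the worked example tables.

```python
import math

# Likelihoods of each allele given each read (from the worked example above;
# in a real caller these come from the read-likelihood model).
likelihoods = {
    "read1": {"A": 3.8708e-07, "T": 3.6711e-07},
    "read2": {"A": 4.9992e-07, "T": 2.8425e-07},
}

THRESHOLD = 0.2  # minimum difference in log10 likelihood to count as informative

def is_informative(allele_likelihoods, threshold=THRESHOLD):
    """True if the best allele beats the runner-up by more than `threshold`
    on the log10 scale, i.e. is roughly 60% more likely."""
    logs = sorted((math.log10(lk) for lk in allele_likelihoods.values()),
                  reverse=True)
    return (logs[0] - logs[1]) > threshold

for read, lks in likelihoods.items():
    print(read, "informative" if is_informative(lks) else "uninformative")
# read1 is uninformative (difference 0.023), read2 is informative (0.2452)
```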
So, now that we know the math behind determining which reads are informative, let’s look at how this affects the record output to the VCF. If a read is considered informative, it gets counted toward the AD and DP of the variant allele in the output record. If a read is considered uninformative, it is counted towards the DP, but not the AD. That way, the AD value reflects how many reads actually contributed support for a given allele at the site. We would not want to include uninformative reads in the AD value because we don’t have confidence in them.
Please note, however, that although an uninformative read is not reported in the AD, it is still used in calculations for genotyping. In the future, we may add an annotation to indicate counts of reads that were considered informative vs. uninformative. Let us know in the comments if you think that would be helpful.
In most cases, you will have enough coverage at a site to disregard small numbers of uninformative reads. Unfortunately, sometimes uninformative reads are the only reads you have at a site. In this case, we report the potential variant allele but keep the AD values at 0. The uncertainty at the site will be reflected in the GQ and PL values.
This document describes the proper use of metrics associated with depth of coverage for the purpose of evaluating variants.
The metrics involved are the following:
For an overview of the tools and concepts involved in performing sequence coverage analysis, where the purpose is to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?", please see this document.
The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:
At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.
The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.
Note that because the AD includes reads and bases that were filtered by the caller (and, in the case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the Phred-scaled genotype likelihoods (PLs) are what determine the genotype calls.
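To make the mechanics of reading these fields concrete, here is a minimal Python sketch using plain string splitting (a real pipeline would use a VCF library such as pysam). The record is the example from the top of this document, with the INFO column shortened to just DP for readability.

```python
# Parse one VCF data line by hand to pull out a sample's AD and DP.
record = ("2\t151214\t.\tG\tA\t673.77\t.\tDP=10\t"
          "GT:AD:DP:GQ:PL\t0/1:0,0:10:38:702,0,38")

fields = record.split("\t")
fmt_keys = fields[8].split(":")                      # FORMAT column keys
sample = dict(zip(fmt_keys, fields[9].split(":")))   # first sample's values

ad = [int(x) for x in sample["AD"].split(",")]
dp = int(sample["DP"])

print("AD =", ad, "sum(AD) =", sum(ad), "DP =", dp)
# Here sum(AD) is 0 while DP is 10: all ten reads were uninformative, so they
# count toward DP but not AD, exactly the situation described at the top of
# this document.
```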
TO BE CONTINUED...
This is not exactly new (it was fixed in GATK 3.0), but it has come to our attention that many people are unaware of this bug, so we want to spread the word, since it might have important impacts on people's results.
Affected versions: 2.x versions up to 2.8 (not sure when it started)
Affected tool: SelectVariants
Trigger conditions: Extracting a subset of samples with SelectVariants while using multi-threading
Effects: Genotype-level fields (such as AD) swapped among samples
This bug no longer affects any tools in versions 3.0 and above, but callsets generated with earlier versions may need to be checked for consistency of genotype-level annotations. Our sincere apologies if you have been affected by this bug, and our thanks to the users who reported experiencing this issue.
What I have learned so far from other discussions about HaplotypeCaller:
Nonetheless, two problems remain. The HC documentation says: "This tool applies the following downsampling settings by default. To coverage: 500." Why is it possible to observe much higher coverage (DP, AD) values in the output VCF file?
I observe SNPs where the recalibrated BAM file in IGV has a depth of 1385 for the reference and 1233 for the alternate allele, but 839 (reference) and 246 (alt) in the HaplotypeCaller VCF file. Maybe this happens by chance, as reads for downsampling are chosen at random, or it is related to this bug: [gatkforums.broadinstitute.org/discussion/5882/uncorrect-strand-bias-due-to-downsampling-haplotypecaller](http://gatkforums.broadinstitute.org/discussion/5882/uncorrect-strand-bias-due-to-downsampling-haplotypecaller)
Both observations lead to the conclusion that DP and AD values from HC output are of little use for samples with high coverage (where does "high" start? 500?).
I have found several sites in my VCF files where GenotypeGVCFs appears to output the AD records in the wrong order, since the AD values do not match either the call or the order of values in the PL field. The vast majority of sites are correct, but sometimes they look like this (INFO column removed for readability):
POS ID REF ALT QUAL FILTER FORMAT Sample1 Sample2
349 . G A 210669.47 . GT:AD:DP:GQ:PL 1:3,0,238:241:99:9052,0 1:0,0,59:59:99:2699,0
910 . T A,G 177554.11 . GT:AD:DP:GQ:PL 2:0,0,0,76:76:99:3600,3600,0 1:0,90,0,0:90:99:4321,0,4321
1948 . T G,A,C 209510.27 . GT:AD:DP:GQ:PL 1:0,234,0,0,0:234:99:10934,0,10934,10934 2:0,0,0,57,0:57:99:2745,2745,0,2745
Some of the affected sites have more AD entries than possible haplotypes or PL entries (including the extreme case of having 5 AD values at site 1948 above). Where an extra AD value gets added, the order of the other AD fields is not wrong, so the extra entry must be at the end. The input GVCFs were generated with --emitRefConfidence BP_RESOLUTION, and the order/number of AD fields is fine there.
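One way to screen a callset for malformed records like these is to compare each sample's AD and PL entry counts against what the VCF spec implies for the site's allele count and the sample's ploidy. A hedged Python sketch (not a GATK tool; the function names are made up for illustration):

```python
from math import comb

def expected_counts(n_alleles, ploidy):
    """Per the VCF spec, AD should have one entry per allele (REF + ALTs),
    and PL one entry per possible genotype: C(n_alleles + ploidy - 1, ploidy)."""
    return n_alleles, comb(n_alleles + ploidy - 1, ploidy)

def check_sample(alts, gt, ad, pl):
    """Return a list of problems with a sample's AD/PL entry counts."""
    n_alleles = 1 + len(alts)                       # REF plus ALT alleles
    ploidy = len(gt.replace("|", "/").split("/"))   # e.g. "0/1" -> 2, "1" -> 1
    exp_ad, exp_pl = expected_counts(n_alleles, ploidy)
    problems = []
    if len(ad) != exp_ad:
        problems.append(f"AD has {len(ad)} entries, expected {exp_ad}")
    if len(pl) != exp_pl:
        problems.append(f"PL has {len(pl)} entries, expected {exp_pl}")
    return problems

# Site 1948 from the example above: haploid call, 3 ALT alleles, 5 AD values.
print(check_sample(alts=["G", "A", "C"], gt="1",
                   ad=[0, 234, 0, 0, 0], pl=[10934, 0, 10934, 10934]))
# -> ['AD has 5 entries, expected 4']
```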
I haven't been able to find anything special about the affected sites, so it's not at all clear what is happening. The command line arguments used were:
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -V Sample1_raw.g.vcf -V Sample2_raw.g.vcf -R TanzaniaDog_RV2772_KF155002.1.fasta -o AllSamples_raw.gvcf --annotateNDA --includeNonVariantSites -stand_emit_conf 10 -stand_call_conf 30
I did try specifying ploidy in the above command (--sample_ploidy 1), but this does not solve the problem.
I am using GATK v3.2.2 following the recommended practices (...HC -> CombineGVCFs -> GenotypeGVCFs ...) and while looking through suspicious variants I came across a few hets with AD=X,0. Tracing them back, I found two inconsistencies (bugs?):
1) Reordering of genotypes when combining GVCFs while the AD values are kept intact, which leads to an erroneous AD for a heterozygous call. Also, I find it hard to understand why the 1 bp insertion is emitted in the GVCF - there are no reads supporting it:
single sample gvcf
1 26707944 . A AG,G,<NON_REF> 903.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/2:66,0,36,0:102:99:1057,1039,4115,0,2052,1856,941,3051,1925,2847:51,15,27,9
1 26707944 . A G,AG,<NON_REF> . . [INFO] GT:AD:DP:MIN_DP:PL:SB [other_samples] ./.:66,0,36,0:102:.:1057,0,1856,1039,2052,4115,941,1925,3051,2847:51,15,27,9 [other_samples]
1 26707944 . A G 3169.63 . [INFO] [other_samples] 0/1:66,0:102:99:1057,0,1856 [other_samples]
2) An incorrect AD is taken while genotyping GVCF files:
1 1247185 rs142783360 AG A,<NON_REF> 577.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/1:13,20,0:33:99:615,0,361,654,421,1075:7,6,17,3
1 1247185 rs142783360 AG A,<NON_REF> . . [INFO] [other_samples] ./.:13,20,0:33:.:615,0,361,654,421,1075:7,6,17,3 [other_samples]
1 1247185 . AG A 569.95 . [INFO] [other_samples] 0/1:13,0:33:99:615,0,361 [other_samples]
I have found multiple such cases here, with no errors or warnings in the logs. I also checked against calls that I had done before on these samples, but in a smaller batch. There the AD values were correct, but there were plenty of other hets with AD=X,0... I haven't looked closer into those.
Are these bugs that have been fixed in 3.3? Or maybe my brain is not working properly today and I'm missing something obvious?
Best regards, Paweł
I am running the GATK 3.2-2 pipeline and I am getting some odd results from the genotype VCF stage. It appears that incorrect AD results are being output.
E.g. on the single sample gvcf I have:
This agrees with what I see in IGV for this site (see attached image).
However, when I joint call this site along with other gvcfs, the output for that sample looks like:
That is, it appears not to output the expected 20,18 - but it still seems to be making the correct genotype call, so I am assuming it is just an output bug and not something that is affecting calling; it would be good to know for sure. Either way, it is problematic, as it is affecting our downstream depth filtering. This isn't the only example of this happening in our data. I am trying to identify more by looking for occasions where the AD fields sum to less than DP - am I right in thinking this should never be the case, since AD is unfiltered compared to the filtered DP?
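A quick way to enumerate sites like this for inspection is to scan each sample's genotype string for sum(AD) < DP. A minimal sketch using plain string parsing (the sample genotype shown is one of the records quoted elsewhere on this page):

```python
def ad_sum_below_dp(fmt, sample):
    """True if a sample's AD entries sum to less than its DP."""
    d = dict(zip(fmt.split(":"), sample.split(":")))
    if d.get("AD", ".") == "." or d.get("DP", ".") == ".":
        return False  # field missing or uncalled; nothing to compare
    return sum(int(x) for x in d["AD"].split(",")) < int(d["DP"])

# One of the suspicious genotypes quoted elsewhere on this page:
print(ad_sum_below_dp("GT:AD:DP:GQ:PL", "0/1:13,0:33:99:615,0,361"))  # True
```

Note that hits from such a scan are candidates for inspection, not proof of a bug: as described earlier in this document, uninformative reads legitimately count toward DP but not AD.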
Thanks a lot
I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant.
However, after using SelectVariants to exclude non-variant sites and variants not passing the filter, the AD changed and eliminated the alternate-allele reads, though the DP remained unchanged.
I think I recall another post having a similar issue due to multithreaded use of SelectVariants.
Apologies for not commenting on that post instead - I had already posted this prior to seeing it!
I've seen related issues discussed here, but not exactly this one. I'm following closely the current recommendations for an exome pipeline, and the GATK version, downloaded from git, was v2.5-2-gf57256b, compiled 2013/06/06 17:28:57.
For example, I have two samples with the heterozygous variant 12:81503433C>G. The AD values for the samples in the raw VCF file and the SNVs-only file were 15,14 and 20,15 for the two samples, and these agree with what I see in IGV. There was nothing in the indels-only file at that position. The AD values were the same in the recalibrated SNVs-only file. But after combining the recalibrated SNVs and indels with CombineVariants, the AD values inexplicably became 21,0 and 0,24 respectively. This seems to be happening to many variants.
I used the UnifiedGenotyper (GATK 1.6) on a multi-sample set to call variants, and for some of the positions I get multiple mutated alleles. The genotype entries in the combined VCF file look like (GT:AD:DP:GQ:PL):
so it's three AD values per entry. Running SelectVariants yields the following line for the second example from above:
Although it changed the genotype from 0/2 to 0/1, it did not update the AD field. I checked the forums, but I could not really find anything discussing specifically the update of AD, except for the GATK 2.2 release notes where it says SelectVariants: "Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file."
I was wondering whether you could confirm if cases like the one above would benefit from the bugfix, or if the bug description applies to something else.
Thanks a lot for all your hard work, Markus
Does the relationship between AD and DP still hold in VCFs produced from ReduceReads BAMs? That is, is the sum of AD <= DP, or can other scenarios now occur?
Also, is AD summarized to 1,0 or 0,1 for homozygous REF and ALT calls? Thanks.