This document describes the proper use of metrics associated with depth of coverage for the purpose of evaluating variants.
The metrics involved are the following:
For an overview of the tools and concepts involved in performing sequence coverage analysis, where the purpose is to answer the common question: "(Where) Do I have enough sequence data to be empowered to discover variants with reasonable confidence?", please see this document.
The variant callers generate two main coverage annotation metrics: the allele depth per sample (AD) and overall depth of coverage (DP, available both per sample and across all samples, with important differences), controlled by the following annotator modules:
At the sample level, these annotations are highly complementary metrics that provide two important ways of thinking about the depth of the data available for a given sample at a given site. The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.
The sample-level DP is in some sense reflective of the power I have to determine the genotype of the sample at this site, while the AD tells me how many times I saw each of the REF and ALT alleles in the reads, free of any bias potentially introduced by filtering the reads. If, for example, I believe there really is a an A/T polymorphism at a site, then I would like to know the counts of A and T bases in this sample, even for reads with poor mapping quality that would normally be excluded from the statistical calculations going into GQ and QUAL.
Note that because the AD includes reads and bases that were filtered by the caller (and in case of indels, is based on a statistical computation), it should not be used to make assumptions about the genotype that it is associated with. Ultimately, the phred-scaled genotype likelihoods (PLs) are what determines the genotype calls.
TO BE CONTINUED...
This is not exactly new (it was fixed in GATK 3.0) but it's come to our attention that many people are unaware of this bug, so we want to spread the word since it might have some important impacts on people's results.
Affected versions: 2.x versions up to 2.8 (not sure when it started)
Affected tool: SelectVariants
Trigger conditions: Extracting a subset of samples with SelectVariants while using multi-threading (
Effects: Genotype-level fields (such as AD) swapped among samples
This bug no longer affects any tools in versions 3.0 and above, but callsets generated with earlier versions may need to be checked for consistency of genotype-level annotations. Our sincere apologies if you have been affected by this bug, and our thanks to the users who reported experiencing this issue.
I am using GATK v3.2.2 following the recommended practices (...HC -> CombineGVCFs -> GenotypeGVCFs ...) and while looking through suspicious variants I came across a few hetz with AD=X,0. Tracing them back I found two inconsistencies (bugs?);
1) Reordering of genotypes when combining gvcfs while the AD values are kept intact, which leads to an erronous AD for a heterozygous call. Also, I find it hard to understand why the 1bp insertion is emitted in the gvcf - there is no reads supporting it:
single sample gvcf
1 26707944 . A AG,G,<NON_REF> 903.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/2:66,0,36,0:102:99:1057,1039,4115,0,2052,1856,941,3051,1925,2847:51,15,27,9
1 26707944 . A G,AG,<NON_REF> . . [INFO] GT:AD:DP:MIN_DP:PL:SB [other_samples] ./.:66,0,36,0:102:.:1057,0,1856,1039,2052,4115,941,1925,3051,2847:51,15,27,9 [other_samples]
1 26707944 . A G 3169.63 . [INFO] [other_samples] 0/1:66,0:102:99:1057,0,1856 [other_samples]
2) Incorrect AD is taken while genotyping gvcf files:
1 1247185 rs142783360 AG A,<NON_REF> 577.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/1:13,20,0:33:99:615,0,361,654,421,1075:7,6,17,3
1 1247185 rs142783360 AG A,<NON_REF> . . [INFO] [other_samples] ./.:13,20,0:33:.:615,0,361,654,421,1075:7,6,17,3 [other_samples]
1 1247185 . AG A 569.95 . [INFO] [other_samples] 0/1:13,0:33:99:615,0,361 [other_samples]
I have found multiple such cases here, and no errors nor warnings in the logs. I checked also with calls that I had done before on these samples, but in a smaller batch. There the AD values were correct, but there were plenty of other hetz with AD=X,0... I haven't looked closer into those.
Are these bugs that have been fixed in 3.3? Or maybe my brain is not working properly today and I miss sth obvious?
Best regards, Paweł
I am running the GATK 3.2-2 pipeline and I am getting some odd results from the genotype VCF stage. It appears that incorrect AD results are being output.
E.g. on the single sample gvcf I have:
This agrees with what I see in IGV for this site (see attached image).
However, when I joint call this site along with other gvcfs, the output for that sample looks like:
i.e. It appears to not output the expected 20,18 - but it still seems to be making the correct genotype call so I am assuming it is just an output bug and not something that is affecting calling, but it would be good to know for sure. Either way it is problematic as it is affecting our downstream depth filtering. This isn't the only example of this happening in our data. I am trying to identify more by looking for occasions where the AD fields sum to less than DP, am I right in thinking this should never be the case as AD is unfiltered compared to the filtered DP?
Thanks a lot
I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant.
However, after select variants to exclude non variant and variants not passing Filter, the AD changed and eliminated the alternative reads though the DP remained unchanged.
I think I recall another post having a similar issue due to multithreaded use of select variants
APologies for not commenting on this post instead as I had already posted this prior to seeing the other post!
I've seen related issues discussed here but not exactly this one. I'm following closely the current recommendations for an exome pipeline, and the GATK version,downloaded from git, was v2.5-2-gf57256b, Compiled 2013/06/06 17:28:57.
For example, I have two samples with heterozygous variants 12:81503433C>G. The AD values for the the samples in the raw vcf file, and the SNVs-only file were 15,14 and 20,15 for the two samples and these agree with what I see in IGV. There was nothing in the indels-only file at that position. The AD values were the same in the recalibrated SNVs-only file. But after combining the recalibrated SNVs and indels with CombineVariants the AD values inexplicably became 21,0 and 0,24 respectively. This seems to be happening to many variants.
I used the UnifiedGenotyper (GATK 1.6) on a multi-sample set to call variants, and for some of the positions I get multiple mutated alleles. The genotype entries in the combined VCF file look like (GT:AD:DP:GQ:PL):
so it's three AD values per entry. Running SelectVariants yields the following line for the second example from above:
Although it changed the genotype from 0/2 to 0/1, it did not update the AD field. I checked the forums, but I could not really find anything discussing specifically the update of AD, except for the GATK 2.2 release notes where it says SelectVariants: "Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file."
I was wondering whether you could confirm if cases like the one above would benefit from the bugfix, or if the bug description applies to something else.
Thanks a lot for all your hard work, Markus
Does the relationship between AD and DP stil hold in VCF produced from ReduceRead BAMs? That is the sum of AD is <= DP Or can other scenarios now occur?
Also is AD summarized to 1,0 or 0,1 for homozygous REF and ALT? Thanks.