This is not exactly new (it was fixed in GATK 3.0) but it's come to our attention that many people are unaware of this bug, so we want to spread the word since it might have some important impacts on people's results.
Affected versions: 2.x versions up to 2.8 (not sure when it started)
Affected tool: SelectVariants
Trigger conditions: Extracting a subset of samples with SelectVariants while using multi-threading (
Effects: Genotype-level fields (such as AD) swapped among samples
This bug no longer affects any tools in versions 3.0 and above, but callsets generated with earlier versions may need to be checked for consistency of genotype-level annotations. Our sincere apologies if you have been affected by this bug, and our thanks to the users who reported experiencing this issue.
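One way to check a callset for this kind of inconsistency is to compare genotype-level fields for each sample between the original VCF and the subset produced by SelectVariants. The following is only a minimal sketch in plain Python, not a GATK tool: the function names, sample names (S1, S2, S3) and record values are hypothetical, and it assumes the two records describe the same site with unchanged alleles.

```python
def parse_genotypes(header_line, record_line):
    """Return {sample: {format_key: value}} for one VCF record."""
    samples = header_line.rstrip("\n").split("\t")[9:]
    cols = record_line.rstrip("\n").split("\t")
    keys = cols[8].split(":")  # FORMAT column, e.g. GT:AD:DP
    return {
        sample: dict(zip(keys, cell.split(":")))
        for sample, cell in zip(samples, cols[9:])
    }


def find_mismatches(original, subset, fields=("AD",)):
    """List (sample, field, original_value, subset_value) where values differ."""
    mismatches = []
    for sample, sub_fields in subset.items():
        orig_fields = original.get(sample, {})
        for field in fields:
            if orig_fields.get(field) != sub_fields.get(field):
                mismatches.append(
                    (sample, field, orig_fields.get(field), sub_fields.get(field))
                )
    return mismatches


# Hypothetical records: the subset kept S1 and S3, but their AD values are swapped.
header_all = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
record_all = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0/1:10,9:19\t0/0:20,0:20\t0/1:7,12:19"
header_sub = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS3"
record_sub = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0/1:7,12:19\t0/1:10,9:19"

mismatches = find_mismatches(
    parse_genotypes(header_all, record_all),
    parse_genotypes(header_sub, record_sub),
)
for mismatch in mismatches:
    print(mismatch)
```

Matching samples by name rather than by column position matters here, because the subset file has fewer sample columns than the original.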
I've used the Unified Genotyper for variant calling with GATK version 2.5.2. This was the info for a private variant.
However, after running SelectVariants to exclude non-variant sites and variants not passing filters, the AD changed and the alternate-allele reads were eliminated, though the DP remained unchanged.
I think I recall another post describing a similar issue caused by multithreaded use of SelectVariants.
Apologies for not commenting on that post instead; I had already posted this before seeing it!
I've seen related issues discussed here but not exactly this one. I'm closely following the current recommendations for an exome pipeline, and the GATK version, downloaded from git, was v2.5-2-gf57256b, compiled 2013/06/06 17:28:57.
For example, I have two samples that are heterozygous for the variant 12:81503433C>G. The AD values for the samples in the raw VCF file and in the SNVs-only file were 15,14 and 20,15 respectively, and these agree with what I see in IGV. There was nothing in the indels-only file at that position. The AD values were the same in the recalibrated SNVs-only file. But after combining the recalibrated SNVs and indels with CombineVariants, the AD values inexplicably became 21,0 and 0,24 respectively. This seems to be happening to many variants.
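A quick heuristic for spotting records like these is to flag genotypes where a called allele has no supporting reads in AD. The sketch below is an illustration in plain Python, not part of GATK; the function name is hypothetical, and a flagged genotype is only a hint worth checking by eye, since a genuine call can occasionally have zero AD reads for one allele.

```python
def ad_supports_genotype(gt, ad):
    """Return False when any allele called in GT has zero reads in AD."""
    alleles = gt.replace("|", "/").split("/")
    depth_strings = ad.split(",")
    if "." in alleles or "." in depth_strings:
        return True  # missing data: nothing to check
    depths = [int(d) for d in depth_strings]
    # Every called allele index must exist in AD and have at least one read.
    return all(int(a) < len(depths) and depths[int(a)] > 0 for a in alleles)


# AD values quoted in the post above, all for heterozygous (0/1) calls.
for ad in ("15,14", "20,15", "21,0", "0,24"):
    print(ad, ad_supports_genotype("0/1", ad))
```

With the values from this post, the two original AD fields pass the check and the two post-CombineVariants fields do not.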
I used the UnifiedGenotyper (GATK 1.6) on a multi-sample set to call variants, and for some of the positions I get multiple mutated alleles. The genotype entries in the combined VCF file look like (GT:AD:DP:GQ:PL):
so it's three AD values per entry. Running SelectVariants yields the following line for the second example from above:
Although it changed the genotype from 0/2 to 0/1, it did not update the AD field. I checked the forums, but I could not really find anything discussing specifically the update of AD, except for the GATK 2.2 release notes where it says SelectVariants: "Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file."
I was wondering whether you could confirm if cases like the one above would benefit from the bugfix, or if the bug description applies to something else.
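To make the mismatch concrete: when an unused alternate allele is dropped and the genotype is renumbered, the AD entries would need to be subset with the same allele indices to stay in sync. The sketch below shows that bookkeeping in plain Python. It is not how GATK handles it (the 2.2 release notes quoted above say AD is stripped when alleles change); the function names and the example values are hypothetical.

```python
def subset_ad(ad, kept_allele_indices):
    """Keep only the AD entries for the alleles that remain, in order."""
    depths = ad.split(",")
    return ",".join(depths[i] for i in kept_allele_indices)


def remap_gt(gt, kept_allele_indices):
    """Renumber GT allele indices after alleles have been dropped."""
    sep = "|" if "|" in gt else "/"
    new_index = {str(old): str(new) for new, old in enumerate(kept_allele_indices)}
    # Alleles that were dropped (or already missing) become ".".
    return sep.join(new_index.get(a, ".") for a in gt.split(sep))


# Hypothetical genotype at a site with REF plus two ALT alleles (indices 0, 1, 2).
gt, ad = "0/2", "12,0,9"
kept = [0, 2]  # the first ALT allele is unused in the selected samples

print(remap_gt(gt, kept), subset_ad(ad, kept))  # 0/1 12,9
```

If GT is renumbered but AD keeps all three entries, the second AD value no longer refers to the allele that GT index 1 now points to, which is the situation described in this post.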
Thanks a lot for all your hard work, Markus