I believe that I may have found an issue with the CombineVariants tool of GATK that manifests itself when there is a repeated ID in a given VCF. For us, the reason to have repeated IDs in a VCF file is to detect inconsistencies in our sample by calling variants on 2 different DNA samples and then checking the concordance. Our current process is:
1) Generate a VCF containing unique IDs (using GATK CallVariants)
2) Replace the VCF header with potentially non-unique IDs (using tabix -r)
3) Run that single renamed VCF through a merge to uniquify the IDs (using GATK CombineVariants)
It seems that the genotypes in the merged VCF are off by one column. I've attached 3 files that demonstrate the issue: "combined" which is the result of step 1, "combined.renamed", which is the output of step 2, and "combined.renamed.merged", which is the output of step 3.
The relevant lines are as follows:
combined (sample header, then two records):
HG00421@123910725 HG00422 HG00422@123910706 HG00423@123910701 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127 0/0:127

combined.renamed:
HG00421 HG00422 HG00422 HG00423 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127 0/0:127

combined.renamed.merged:
HG00421 HG00422 HG00423 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127
Looking at the depth value in each genotype field (the number after the colon), we can see that in the merged dataset NA12801 has depths 290 and 293, whereas in the original and renamed datasets its depths were 127 and 127. The depths 290 and 293 belong to HG00423, which is the column immediately before NA12801 in the original and renamed files.
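In case it helps with reproducing this, here is a rough Python sketch of the check I am describing. It is not part of GATK. The function names (depths_by_sample, shifted_samples) are made up for illustration, the data is copied by hand from the lines above, and it assumes the value after the colon is the depth. A merged sample is flagged when its depths do not match any column that carried the same name before the merge.

```python
def depths_by_sample(samples, rows):
    """Map each sample name to the depth tuples of every column with that name."""
    out = {}
    for col, name in enumerate(samples):
        # Each genotype field looks like "0/0:300"; take the number after the colon.
        depths = tuple(int(row[col].split(":")[1]) for row in rows)
        out.setdefault(name, []).append(depths)
    return out


def shifted_samples(before_samples, before_rows, after_samples, after_rows):
    """Return merged samples whose depths match no same-named input column."""
    before = depths_by_sample(before_samples, before_rows)
    flagged = []
    for col, name in enumerate(after_samples):
        depths = tuple(int(row[col].split(":")[1]) for row in after_rows)
        if depths not in before.get(name, []):
            flagged.append(name)
    return flagged


# Hand-copied from combined.renamed (output of step 2).
renamed_samples = ["HG00421", "HG00422", "HG00422", "HG00423", "NA12801", "NA12802"]
renamed_rows = [
    ["0/0:300", "0/0:127", "0/0:292", "0/0:290", "0/0:127", "0/0:127"],
    ["0/0:299", "0/0:127", "0/0:299", "0/0:293", "0/0:127", "0/0:127"],
]

# Hand-copied from combined.renamed.merged (output of step 3).
merged_samples = ["HG00421", "HG00422", "HG00423", "NA12801", "NA12802"]
merged_rows = [
    ["0/0:300", "0/0:127", "0/0:292", "0/0:290", "0/0:127"],
    ["0/0:299", "0/0:127", "0/0:299", "0/0:293", "0/0:127"],
]

flagged = shifted_samples(renamed_samples, renamed_rows, merged_samples, merged_rows)
print(flagged)
```

On the two records above this flags HG00423 and NA12801, the two samples after the duplicated HG00422 column whose values appear to have moved. NA12802 is not flagged only because its depths happen to equal those of the column before it, so the check can miss a shift when neighbouring columns carry identical values.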
I have confirmed this behavior in both GATK 2.7-4 and 2.8-1. If there's any more information that you need, please let me know, and I would be happy to provide it. Also, if you might know where this issue arises, I would be happy to try to provide a patch.