This patch release fixes several issues that we felt were annoying enough to warrant an interim version release; see detailed list below.
Oh, and in case you're curious about the patch number, the *-46 numeral refers to the number of code commits made since the 3.4-0 release. Out of this, about a dozen are related to the bug fixes listed above; another dozen relate to new work that is not yet publicly available (gotta wait 'til 3.5...) and the rest are automated commits produced when merging completed work into the master codebase.
In version 3.4, we introduced some functionality to handle cases where a site that is variant in one sample is covered by a deletion in another sample (see release notes for details). Some of that functionality involved using a new symbolic
<*:DEL> notation to represent this in multi-sample VCFs. Unfortunately we later realized that this was not the best way to represent this, so we have now switched to a more spec-compliant notation, which is simply
*. We've also added some refinements to correctly handle cases where alleles need to be remapped, and/or where there are multiple overlapping spanning deletions in the population (which we encountered twice in a population of ~10K samples, to give you an idea of how rarely that happens in humans). The new code is fully backward-compatible, so if you have any files that contain the
<*:DEL> notation, these will be understood correctly by GATK. Note however that they will not be converted to the new notation -- they will just be passed through unchanged.
Some annoying gcc compiler version issues caused many of you to be unable to utilize Intel's accelerated PairHMM library to speed up HaplotypeCaller. We've sorted those out so now everyone should be able to utilize it provided their computing platform is running reasonably recent (i.e. not Jurassic-era) software.
There was a minor bug in the ReadBackedPhasing tool that was causing some variants to be merged into MNPs when they shouldn't be; that's now fixed.
A user reported a dramatic slowdown in the DepthOfCoverage and shortly thereafter, provided a code patch that resolved to problem. Self-fixing bug reports are our favorite kind, thank you @mnw21cam!
OK, technically not a bug fix, so this shouldn't be in a patch, according to my esteemed software engineer colleagues, but I couldn't wait to share it with y'all. This is something people often ask how to do and until now it was beyond painful. But hark! You can now tell SelectVariants to filter out variants where a given number or proportion of samples have filtered genotypes. See the usage example in the tool doc (toward the end of the list). Now tell me you're not happy that we included it in the patch. C'mon, I dare you.
Again with a not-a-bug-fix thing, sorry -- there's a new read filter to get rid of reads that have too much soft-clipping going on to be honest. This is a sign that there's something seriously wrong with them, as we recently found out while analyzing a bunch of cheek swab samples that turned out to be heavily contaminated with bacterial DNA. It's always fun to see some seriously messed-up samples go by... and it's nice that now we can do something to rescue them.
We made some minor changes to documentation (VQSLOD, QD) and error handling (clarified error message for Contigs Out of Order error, made GVCF indexing parameters not required for gzipped output).
This is not exactly new (it was fixed in GATK 3.0) but it's come to our attention that many people are unaware of this bug, so we want to spread the word since it might have some important impacts on people's results.
Affected versions: 2.x versions up to 2.8 (not sure when it started)
Affected tool: SelectVariants
Trigger conditions: Extracting a subset of samples with SelectVariants while using multi-threading (
Effects: Genotype-level fields (such as AD) swapped among samples
This bug no longer affects any tools in versions 3.0 and above, but callsets generated with earlier versions may need to be checked for consistency of genotype-level annotations. Our sincere apologies if you have been affected by this bug, and our thanks to the users who reported experiencing this issue.
As reported here, a bug was found in VariantsToBinaryPED. Briefly, VariantsToBinaryPed expected the fam file to describe the samples in the same order as the input VCF file: if they were not in the same order, it did not correctly map sample IDs with the genotypes in the output binary PED.
We expect that in most use cases, the order would be the same (because PLINK uses lexicographic order, as does the GATK), so the bug would not impact results. However, as the user report demonstrates, in cases where order was different, the bug would seriously impact results.
We therefore recommend that anyone who has used VariantsToBinaryPED check their results for any inconsistencies in the kinship coefficients. Our apologies for the inconvenience to anyone who is affected by this bug, and big thanks again to user TimHughes for reporting the bug.
Finally, we have fixed the bug in GATK and released the fixed version under version number 2.7-4.
If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem.
Thank you for your patience while we work to fix this issue.
We have received reports of a bug occurring with HaplotypeCaller in v2.4, with the error message "Reads are too small for use in assembly." We are working to fix it.
In the meantime, if you encounter it too, please don't post a new discussion about it, but do post a comment on this announcement so that we can count how many people are affected.
Thank you for your patience and our apologies for the inconvenience!
I used the UnifiedGenotyper (GATK 1.6) on a multi-sample set to call variants, and for some of the positions I get multiple mutated alleles. The genotype entries in the combined VCF file look like (GT:AD:DP:GQ:PL):
so it's three AD values per entry. Running SelectVariants yields the following line for the second example from above:
Although it changed the genotype from 0/2 to 0/1, it did not update the AD field. I checked the forums, but I could not really find anything discussing specifically the update of AD, except for the GATK 2.2 release notes where it says SelectVariants: "Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file."
I was wondering whether you could confirm if cases like the one above would benefit from the bugfix, or if the bug description applies to something else.
Thanks a lot for all your hard work, Markus