Yay, August is over! Goodbye steamy hot days, hello mild temperatures and beautiful leaf-peeping season. We hope you all had a great summer (in the Northern hemisphere at least) and caught a bit of a vacation. For our part, we've been chained to our desks the whole time!
Well, not really, but we've got a feature-rich release for you nonetheless. Lots of new things; not all of them fully mature, so heed the caveats on the experimental features! We've also made some key improvements to VQSR that we're very excited about, some bug fixes to various tools of course, and a new way to boost calling performance. Full list in the release notes as usual, and highlights below.
When UnifiedGenotyper and HaplotypeCaller emit variant calls, they tell you how confident you can be that the variants are real. But how do you know how confident to be that the rest are reference, i.e. non-variant? It's actually a pretty hard problem… and this is our answer:
For HaplotypeCaller, we’ve developed a full-on reference model that produces reference confidence scores. To use it, you need to enable the
--emitRefConfidence mode. This mode is a little bit complicated so be sure you read the method article before you try to use it.
For UnifiedGenotyper, we don’t have a completely fleshed-out model, but we’ve added the
-allSitePLs argument which, in combination with the
EMIT_ALL_SITES output mode, will enable calculation of PLs for all sites, including reference. This will give a measure of reference confidence and a measure of which alt alleles are more plausible (if any). Note that this only works with the SNP calling model. Again, this is not as good or as complete as the reference model in HaplotypeCaller, so we urge you to use HaplotypeCaller for this unless you really need to use UnifiedGenotyper.
These are two highly experimental features; they work in our tests, but your mileage may vary, so please examine your results carefully. We welcome your feedback!
A common problem in calling indels is that you get false positives that are associated with PCR slippage around short tandem repeats (especially homopolymers). Until we can all switch to PCR-free amplification, we're stuck with this issue. So we thought it would be nice to be able to model this type of error and mitigate its impact on our indel calls. The new
--pcr_indel_model argument allows the HaplotypeCaller to use a new feature called the PCR indel model to weed out false positive indels more or less aggressively depending on how much you care about sensitivity vs. specificity.
This feature too is highly experimental, so play with it at your own risk. And stay tuned, because we've already got some ideas on how to improve it further.
Variant recalibration is one of the most challenging parts of the Best Practices workflow, and not just for users! We've been wrestling with some of its internal machinery to produce better, more consistent modeling results, especially with call sets that are on the lower end of the size scale.
One of the breakthroughs we made was separating the parameters for the positive and negative training models. You know (or should know) that the VariantRecalibrator builds two separate models: one to model what "good" variants (i.e. true positives) look like (the positive model), and one to model what "bad" variants (i.e. false positives) look like (the negative model). Until now, we applied parameters the same way to both, but we've now realized that it makes more sense to treat them differently.
Because of how relative amounts of good and bad variants tend to scale differently with call set size, we also realized it was a bad idea to have the selection of bad variants be based on a percentage (as it has been until now) and instead switched it to a hard number. You can change this setting with the
--numBadVariants argument, which replaces the now-deprecated
Finally, we also found that the order of annotations matters. Now, instead of applying the annotation dimensions to the training model in the order that they were specified at the command line, VariantRecalibrator first reorders them based on their standard deviation. This stabilizes the training model and produces much more consistent results.
Some of you have been clamoring for more flexibility in handling individual BAM files and samples without losing the convenience of processing them in batches. In response, we've added the following:
For general GATK use, the
-sample_rename_mapping_file engine argument allows you to rename samples on-the-fly at runtime. It takes a file that maps bam files to sample names. Note that this does require that your BAM files contain single samples only, although multiple read groups are allowed.
For variant calling, the
-onlyEmitSamples argument allows you to tell the UnifiedGenotyper to only emit calls for specific samples among a cohort that you're calling in multisample mode, without emitting the calls for the rest of the cohort. Keep in mind however that the calculations will still be made on the entire cohort, and the annotation values emitted for those calls will reflect that.
For VQSR, the
--excludeFiltered flag tells the ApplyRecalibration tool not to emit sites that are filtered out by recalibration (i.e. do not write them to file).
And some of you went ahead and added the features you wanted yourselves!
Yossi Farjoun contributed a patch to enable allele-biased downsampling with different per-sample values for the HaplotypeCaller, emulating the equivalent functionality that was already available in the UnifiedGenotyper.
Louis Bergelson contributed a new read filter, LibraryReadFilter, which allows you to use only reads from a specific library in your analysis. This is the opposite (and somewhat more specific) functionality compared to the existing engine argument, --read_group_black_list , which allows you to exclude read groups based on specific tags (including but not limited to LB).
We have a new diagnostic tool, QualifyMissingIntervals, that allows you to collect metrics such as GC content, mapping quality etc. for a list of intervals of interest. This is something you'd typically want to use if you found (through other tools) that you're missing calls in certain intervals, and you want to find out what's going wrong in those regions.
Finally, those of you who have access to more sophisticated computing platforms, heads up! Version 2.7 comes with a version of the PairHMM algorithm (aka the bit that takes forever to run in HaplotypeCaller) that is optimized for running on FPGA chips. Credit goes to the fine folks at Convey Computer and Green Mountain Computing Systems who teamed up to develop this optimized version of the PairHMM, with a little help from our very own Tech Dev team. We're told further optimizations may be in store; in the meantime, they're seeing up to 300-fold speedups of HaplotypeCaller runs on Convey's platform. Not bad!