Folks, I’m all out of banter for this one, so let’s go straight to the facts. GATK 3.4 contains a shedload of improvements and bug fixes, including some new functionality that we hope you’ll find useful. The full list is available in the detailed release notes.
None of the recent changes involves any disruption to the Best Practice workflow (I hear some sighs of relief) but you’ll definitely want to check out the tweaks we made to the joint discovery tools (HaplotypeCaller, CombineGVCFs and GenotypeGVCFs), which are rapidly maturing as they log more flight time at Broad and in the wild.
Let’s start at the very beginning with HaplotypeCaller (a very good place to start). On the usability front, we’ve finally given in to the nigh-universal complaint about the required variant indexing arguments (
--variant_index_type LINEAR --variant_index_parameter 128000) being obnoxious and a waste of characters. So, tadaa, they are no longer required, as long as you name your output file with the extension
.g.vcf so that the engine knows what level of compression to use to write the gVCF (which leads to better performance in downstream tools). We think this naming convention makes a lot of sense anyway, as it’s a great way to distinguish gVCFs from regular VCFs on sight, so we hope most of you will adopt it. That said, we stopped short of making this convention mandatory (for now…) so you don’t have to change all your scripts and conventions if you don’t want to. All that will happen (assuming you still specify the variant index parameters as previously) is that you’ll get a warning in the log telling you that you could use the new convention.
Where we’ve been a bit more dictatorial is that we’ve completely disabled the use of
-dcov with HaplotypeCaller because it was causing very buggy behavior due to an unforeseen complication in how different levels of downsampling are applied in HaplotypeCaller. We know that the default setting does the right thing, and there’s almost no legitimate reason to change it, so we’re disabling this for the greater good pending a fix (which may be a long time coming due to the complexity of the code involved).
Next up, CombineGVCFs gets a new option to break up reference blocks at every N sites. The new argument
--breakBandsAtMultiplesOf Nwill ensure that no reference blocks in the combined gVCF span genomic positions that are multiples of N. This is meant to enable scatter-gather parallelization of joint genotyping on whole-genome data, as a workaround to some annoying limitations of the GATK engine that make it unsafe to use
-L intervals that might start within the span of a block record. For exome data, joint genotyping can easily be parallelized by scatter-gathering across exome capture target intervals, because we know that there won’t be any hom-ref block records spanning the target interval boundaries. In contrast, in whole-genome data, there is no equivalent predictable termination of block records, so it’s not possible to know up front where it would be safe to set scatter-gather interval start and end points -- until now!
And finally, GenotypeGVCFs gets an important bug fix, and a very useful new annotation.
The bug is something that has arisen mostly (though not exclusively) from large cohort studies. What happened is that, when a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, GenotypeGVCFs would emit a homozygous reference genotype for sample B at that position -- which is obviously incorrect. The fix is that now, sample B will be genotyped as having a symbolic
<DEL> allele representing the deletion.
The new annotation is called
RGQ for Reference Genotype Quality. It is a new sample-level annotation that will be added by GenotypeGVCFs to monomorphic sites if you use the
-allSites argument to emit non-variant sites to the output VCF. This is obviously super useful for evaluating the level of confidence of those sites called homozygous-reference.
This new coverage analysis tool is designed to count read depth in a way that is appropriate for allele-specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. The default output format produced by this tool is a structured text file intended to be consumed by the statistical analysis toolkit MAMBA. A paper by Stephane Castel and colleagues describing the complete ASE analysis workflow is available as a preprint on bioarxiv.
We’ve added two new documentation resources to the Guide.
One is a new category of documentation articles called Common Problems, to cover topics that are a specialized subset of FAQs: problems that many users encounter, which are typically due to misunderstandings about input requirements or about the expected behavior of the tools, or complications that arise from certain experimental designs. This category is being actively worked on and we welcome suggestions of additional topics that it should cover.
The second is an Issue Tracker that lists issues that have been reported as well as features or enhancements that have been requested. If you encounter a problem that you think might be a bug (or you have a feature request in mind), you can check this page to see if it’s something we already know about. If you have submitted a bug report, you can use the issue tracker to check whether your issue is in the backlog, in the queue, or is being actively worked on. In future we’ll add some functionality to enable voting on what issues or features should be prioritized, so stay tuned for an announcement on that!
Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.
-ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.
Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.
As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.
This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the
-ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the
-ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the
You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don’t, go read this now. Go on, we’ll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That’s great, but until now, those variants were unphased, meaning the HC didn’t give you any information about whether any two variants’ alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, you’d want to know whether you had this:
But HC wouldn’t tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).
In a nutshell, phased records will look like this:
1 1372243 . T <NON_REF> . . END=1372267 <snip> <snip> 1 1372268 . G A,<NON_REF> . . <snip> GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,40,0:70:99:0|1:1372268_G_A:<snip> 1 1372269 . G T,<NON_REF> . . <snip> GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,41,0:71:99:0|1:1372268_G_A:<snip> 1 1372270 . C <NON_REF> . . END=1372299 <snip> <snip>
You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).
The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.
Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).
We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the
--recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.
Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.
Better late than never (right?), here are the version highlights for GATK 3.2. Overall, this release is essentially a collection of bug fixes and incremental improvements that we wanted to push out to not keep folks waiting while we're working on the next big features. Most of the bug fixes are related to the HaplotypeCaller and its "reference confidence model" mode (which you may know as
-ERC GVCF). But there are also a few noteworthy improvements/changes in other tools which I'll go over below.
The "reference confidence model" workflow, which I hope you have heard of by now, is that awesome new workflow we released in March 2014, which was the core feature of the GATK 3.0 version. It solves the N+1 problem and allows you to perform joint variant analysis on ridiculously large cohorts without having to enslave the entire human race and turning people into batteries to power a planet-sized computing cluster. More on that later (omg we're writing a paper on it, finally!).
You can read the full list of improvements we've made to the tools involved in the workflow (mainly HaplotypeCaller and Genotype GVCFs) in Eric's (unusually detailed) Release Notes for this version. The ones you are most likely to care about are that the "missing PLs" bug is fixed, GenotypeGVCFs now accepts arguments that allow it to emulate the HC's genotyping capabilities more closely (such as
--includeNonVariantSites), the AB annotation is fully functional, reference DPs are no longer dropped, and CatVariants now accepts lists of VCFs as input. OK, so that last one is not really specific to the reference model pipeline, but that's where it really comes in handy (imagine generating a command line with thousands of VCF filenames -- it's not pretty).
The coverage metrics (DP and AD) reported by HaplotypeCaller are now those calculated after the HC's reassembly step, based on the reads having been realigned to the most likely haplotypes. So the metrics you see in the variant record should match what you see if you use the
-bamout option and visualize the reassembled ActiveRegion in a genome browser such as IGV. Note that if any of this is not making sense to you, say so in the comments and we'll point you to the new HaplotypeCaller documentation! Or, you know, look for it in the Guide.
We updated the plotting scripts used by BQSR and VQSR to use the latest version of ggplot2, to get rid of some deprecated function issues. If your Rscripts are suddenly failing, you'll need to update your R libraries.
We're sorry for making you jump through all these hoops recently. As if the switch to Maven wasn't enough, we have now completed a massive reorganization/renaming of the codebase that will probably cause you some headaches when you port your tools to the newest version. But we promise this is the last big wave, and ultimately this will make your life easier once we get the GATK core framework to be a proper maven artifact.
In a nutshell, the base name of the codebase has changed from
gatk (which hopefully makes more sense), and the most common effect is that
sting.gatk classpath segments are now
gatk.tools. This, by the way, is why we had a bunch of broken documentation links; most of these have been fixed (yay symlinks) but there may be a few broken URLs remaining. If you see something, say something, and we'll fix it.
This may seem crazy considering we released the big 3.0 version not two weeks ago, but yes, we have a new version for you already! It's a bit of a special case because this release is all about the hardware-based optimizations we had previously announced. What we hadn't announced yet was that this is the fruit of a new collaboration with a team at Intel (which you can read more about here), so we were waiting for everyone to be ready for the big reveal.
So basically, the story is that we've started collaborating with the Intel Bio Team to enable key parts of the GATK to run more efficiently on certain hardware configurations. For our first project together, we tackled the PairHMM algorithm, which is responsible for a large proportion of the runtime of HaplotypeCaller analyses. The resulting optimizations, which are the main feature in version 3.1, produce significant speedups for HaplotypeCaller runs on a wide range of hardware.
We will continue working with Intel to further improve the performance of GATK tools that have historically been afflicted with performance issues and long runtimes (hello BQSR). As always, we hope these new features will make your life easier, and we welcome your feedback in the forum!
Note that these optimizations currently work on Linux systems only, and will not work on Mac or Windows operating systems. In the near future we will add support for Mac OS. We have no plans to add support for Windows since the GATK itself does not run on Windows.
Please note also that to take advantage of these optimizations, you need to opt-in by adding the following flag to your GATK command:
Here is a handy little table of the speedups you can expect depending on the hardware and operating system you are using. The configurations given here are the minimum requirements for benefiting from the expected speedup ranges shown in the third column. Keep in mind that these numbers are based on tests in controlled conditions; in the wild, your mileage may vary.
|Linux kernel version||Architecture / Processor||Expected speedup||Instruction set|
|Any 64-bit Linux||Any x86 64-bit||1-1.5x||Non-vector|
|Linux 2.6 or newer||Penryn (Core 2 or newer)||1.3-1.8x||SSE 4.1|
|Linux 2.6.30 or newer||SandyBridge (i3, i5, i7, Xeon E3, E5, E7 or newer)||2-2.5x||AVX|
To find out exactly which processor is in your machine, you can run this command in the terminal:
$ cat /proc/cpuinfo | grep "model name" model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
In this example, the machine has 4 cores (8-threads), so you see the answer 8 times. With the model name (here i7-2600) you can look up your hardware's relevant capabilities in the Wikipedia page on vector extensions.
Alternatively, Intel has provided us with some links to lists of processors categorized by architecture, in which you can look up your hardware:
Finally, a few notes to clarify some concepts regarding Linux kernels vs. distributions and processors vs. architectures:
SandyBridge and Penryn are microarchitectures; essentially, these are sets of instructions built into the CPU. Core 2, core i3, i4, i7, Xeon e3, e5, e7 are the processors that will implement a specific architecture to make use of the relevant improvements (see table above).
The Linux kernel has no connection with Linux distribution (e.g. Ubuntu, RedHat etc). Any distribution can use any kernel they want. There are "default kernels" shipped with each distribution, but that's beyond the scope of this article to cover (there are at least 300 Linux distributions out there). But you can always install whatever kernel version you want.
The kernel version 2.6.30 was released in 2009, so we expect every sane person or IT out there to be using something better than this.
Better late than never, here is the now-traditional "Highlights" document for GATK version 3.0, which was released two weeks ago. It will be a very short one since we've already gone over the new features in detail in separate articles --but it's worth having a recap of everything in one place. So here goes.
We are delighted to present our new Best Practices workflow for variant calling in which multisample calling is replaced by a winning combination of single-sample calling in gVCF mode and joint genotyping analysis. This allows us to both bypass performance issues and solve the so-called "N+1 problem" in one fell swoop. For full details of why and how this works, please see this document. In the near future, we will update our Best Practices page to make it clear that the new workflow is now the recommended way to go for calling variants on cohorts of samples. We've already received some pretty glowing feedback from early adopters, so be sure to try it out for yourself!
All the cool kids were doing it, so we had to join the party. It took a few months of experimentation, a couple of new tools and some tweaks to the HaplotypeCaller, but you can now call variants on RNAseq with GATK! This document details our Best Practices recommendations for doing so, along with a non-trivial number of caveats that you should keep in mind as you go.
Nice try, but no. This tool is obsolete now that we have the gVCF/reference model pipeline (see above). Note that this means that GATK 3.0 will not support BAM files that were processed using ReduceReads!
We've switched the build system from Ant to Maven, which should make it much easier to use GATK as a library against which you can develop your own tools. And on a related note, we're also making significant changes to the internal structure of the GATK codebase. Hopefully this will not have too much impact on external projects, but there will be a doc very shortly describing how the new build system works and how the codebase is structured.
For reasons that will be made clear in the near future, we decided to hold the previously announced hardware optimizations until version 3.1, which will be released very soon. Stay tuned!
Better late than never, here are the highlights of the most recent version release, GATK 2.8. This should be short and sweet because as releases go, 2.8 is light on new features, and is best described as a collection of bug fixes, which are all* dutifully listed in the corresponding release notes document. That said, two of the changes we've made deserve some additional explanation.
* Up to now (this release included) we have not listed updates/patches to Queue in the release notes, but will start doing so from the next version onward.
In the last release (2.7, for those of you keeping score at home) we trumpeted that the old
-percentBad argument of VariantRecalibrator had been replaced by the shiny new
-numBad argument, and that this was going to be awesome for all sorts of good reasons, improve stability and whatnot. Weeeeeeell it turned out that wasn't quite the case. It worked really well on the subset of analyses that we tested it on initially, but once we expanded to different datasets (and the complaints started rolling in on the forum) we realized that it actually made things worse in some cases because the default value was less appropriate than what
-percentBad would have produced. This left people guessing as to what value would work for their particular dataset, with a great big range to choose from and very little useful information to assist in the choice.
So, long story short, we (and by "we" I mean Ryan) built in a new function that allows the VariantRecalibrator to determine for itself the amount of variants that is appropriate to use for the "bad" model depending on the data. So the short-lived
-numBad argument is gone too, replaced by... nothing. No new argument to specify; just let the VariantRecalibrator do its thing.
Of course if you really want to, you can override the default behavior and tweak the internal thresholds. See the tool doc here; and remember that a good rule of thumb is that if you can't figure out which arguments are involved based on that doc, you probably shouldn't be messing with this advanced functionality.
This is still a rather experimental feature, so we're still making changes as we go. The two big changes worth mentioning here are that you can now run this on reduced reads, and that we've changed the indexing routine to optimize the compression level. The latter shouldn't have any immediate impact on normal users, but it was necessary for a new feature project we've been working on behind the scenes (the single-sample-to-joint-discovery pipeline we have been alluding to in recent forum discussions). The reason we're mentioning it now is that if you use
-ERC GVCF output, you'll need to specify a couple of new arguments as well (
-variant_index_type LINEAR and
-variant_index_parameter 128000, with those exact values). This useful little fact didn't quite make it into the documentation before we released, and not specifying them leads to an error message, so... there you go. No error message for you!
That's all for tool changes. In addition to those, we have made a number of corrections in the tool documentation pages, updated the Best Practices (mostly layout, tiny bit of content update related to the VQSR -numBad deprecation) and made some minor changes to the website, e.g. updated the list of publications that cite the GATK and improved the Guide index somewhat (but that's still a work in progress).
Yay, August is over! Goodbye steamy hot days, hello mild temperatures and beautiful leaf-peeping season. We hope you all had a great summer (in the Northern hemisphere at least) and caught a bit of a vacation. For our part, we've been chained to our desks the whole time!
Well, not really, but we've got a feature-rich release for you nonetheless. Lots of new things; not all of them fully mature, so heed the caveats on the experimental features! We've also made some key improvements to VQSR that we're very excited about, some bug fixes to various tools of course, and a new way to boost calling performance. Full list in the release notes as usual, and highlights below.
When UnifiedGenotyper and HaplotypeCaller emit variant calls, they tell you how confident you can be that the variants are real. But how do you know how confident to be that the rest are reference, i.e. non-variant? It's actually a pretty hard problem… and this is our answer:
For HaplotypeCaller, we’ve developed a full-on reference model that produces reference confidence scores. To use it, you need to enable the
--emitRefConfidence mode. This mode is a little bit complicated so be sure you read the method article before you try to use it.
For UnifiedGenotyper, we don’t have a completely fleshed-out model, but we’ve added the
-allSitePLs argument which, in combination with the
EMIT_ALL_SITES output mode, will enable calculation of PLs for all sites, including reference. This will give a measure of reference confidence and a measure of which alt alleles are more plausible (if any). Note that this only works with the SNP calling model. Again, this is not as good or as complete as the reference model in HaplotypeCaller, so we urge you to use HaplotypeCaller for this unless you really need to use UnifiedGenotyper.
These are two highly experimental features; they work in our tests, but your mileage may vary, so please examine your results carefully. We welcome your feedback!
A common problem in calling indels is that you get false positives that are associated with PCR slippage around short tandem repeats (especially homopolymers). Until we can all switch to PCR-free amplification, we're stuck with this issue. So we thought it would be nice to be able to model this type of error and mitigate its impact on our indel calls. The new
--pcr_indel_model argument allows the HaplotypeCaller to use a new feature called the PCR indel model to weed out false positive indels more or less aggressively depending on how much you care about sensitivity vs. specificity.
This feature too is highly experimental, so play with it at your own risk. And stay tuned, because we've already got some ideas on how to improve it further.
Variant recalibration is one of the most challenging parts of the Best Practices workflow, and not just for users! We've been wrestling with some of its internal machinery to produce better, more consistent modeling results, especially with call sets that are on the lower end of the size scale.
One of the breakthroughs we made was separating the parameters for the positive and negative training models. You know (or should know) that the VariantRecalibrator builds two separate models: one to model what "good" variants (i.e. true positives) look like (the positive model), and one to model what "bad" variants (i.e. false positives) look like (the negative model). Until now, we applied parameters the same way to both, but we've now realized that it makes more sense to treat them differently.
Because of how relative amounts of good and bad variants tend to scale differently with call set size, we also realized it was a bad idea to have the selection of bad variants be based on a percentage (as it has been until now) and instead switched it to a hard number. You can change this setting with the
--numBadVariants argument, which replaces the now-deprecated
Finally, we also found that the order of annotations matters. Now, instead of applying the annotation dimensions to the training model in the order that they were specified at the command line, VariantRecalibrator first reorders them based on their standard deviation. This stabilizes the training model and produces much more consistent results.
Some of you have been clamoring for more flexibility in handling individual BAM files and samples without losing the convenience of processing them in batches. In response, we've added the following:
For general GATK use, the
-sample_rename_mapping_file engine argument allows you to rename samples on-the-fly at runtime. It takes a file that maps bam files to sample names. Note that this does require that your BAM files contain single samples only, although multiple read groups are allowed.
For variant calling, the
-onlyEmitSamples argument allows you to tell the UnifiedGenotyper to only emit calls for specific samples among a cohort that you're calling in multisample mode, without emitting the calls for the rest of the cohort. Keep in mind however that the calculations will still be made on the entire cohort, and the annotation values emitted for those calls will reflect that.
For VQSR, the
--excludeFiltered flag tells the ApplyRecalibration tool not to emit sites that are filtered out by recalibration (i.e. do not write them to file).
And some of you went ahead and added the features you wanted yourselves!
Yossi Farjoun contributed a patch to enable allele-biased downsampling with different per-sample values for the HaplotypeCaller, emulating the equivalent functionality that was already available in the UnifiedGenotyper.
Louis Bergelson contributed a new read filter, LibraryReadFilter, which allows you to use only reads from a specific library in your analysis. This is the opposite (and somewhat more specific) functionality compared to the existing engine argument, --read_group_black_list , which allows you to exclude read groups based on specific tags (including but not limited to LB).
We have a new diagnostic tool, QualifyMissingIntervals, that allows you to collect metrics such as GC content, mapping quality etc. for a list of intervals of interest. This is something you'd typically want to use if you found (through other tools) that you're missing calls in certain intervals, and you want to find out what's going wrong in those regions.
Finally, those of you who have access to more sophisticated computing platforms, heads up! Version 2.7 comes with a version of the PairHMM algorithm (aka the bit that takes forever to run in HaplotypeCaller) that is optimized for running on FPGA chips. Credit goes to the fine folks at Convey Computer and Green Mountain Computing Systems who teamed up to develop this optimized version of the PairHMM, with a little help from our very own Tech Dev team. We're told further optimizations may be in store; in the meantime, they're seeing up to 300-fold speedups of HaplotypeCaller runs on Convey's platform. Not bad!
It's finally summer here in New England -- time for cave-dwelling developers to hit the beach and do the lobster dance (those of us who don't tan well anyway). We leave you with a new version of the GATK that includes a new(ish) plotting tool, some more performance improvements to the callers, a lot of feature tweaks and quite a few bug fixes. Be sure to check out the full list in the 2.6 Release Notes.
Highlights are below as usual, enjoy. There's one thing that we need to point out with particular emphasis: we have moved to Java 7, so you may need to update your system's Java version. Full explanation at the end of this document because it's a little long, but be sure to read it.
GATK old-timers may remember a tool called AnalyzeCovariates, which was part of the BQSR process in 1.x versions, many moons ago. Well, we've resurrected it to take over the plotting functionality of the BaseRecalibrator, to make it easier and faster to plot and compare the results of base recalibration. This also prevents issues with plot generation in scatter-gather mode. We'll update our docs on the BQSR workflow in the next few days, but in the meantime you can find full details of how to use this tool here.
We know you don't want to miss a single true variant, so for this release, we've put a lot of effort into making the HaplotypeCaller more sensitive. And it's paying off: in our tests, the HaplotypeCaller is now more sensitive than the UnifiedGenotyper for calling both SNPs and indels when run over whole genome datasets.
[graph to illustrate, coming soon]
You might think all our focus is on improving the HaplotypeCaller these days; you would be wrong. The UnifiedGenotyper is still essential for calling large numbers of samples together, for dealing with exotic ploidies, and for calling pooled samples. So we've given it a turbo boost that makes it go twice as fast for calling indels on multiple samples.
The key change here is the updated Hidden Markov Model used by the UG. You can see on the graph that as the number of exomes being called jointly increases, the new HMM keeps runtimes down significantly compared to the old HMM.
Don’t you hate it when you go back to a VCF you generated some months ago, and you have no idea which version of GATK you used at the time? (And yes, versions matter. Sometimes a lot.) We sure do, so we added a function to add the GATK version number in the header of the VCFs generated by GATK.
Speaking of software versions... As you probably know, the GATK runs on Java -- specifically, until now, version 6 of the Runtime Environment (which translates to version 1.6 if you ask
java -version at the command prompt). But the Java language has been evolving under our feet; version 7 has been out and stable for some time now, and version 8 is on the horizon. We were happy as clams with Java 6… but now, newer computers with recent OS versions ship with Java 7, and on MacOS X once you update the system it is difficult to go back to using Java 6. And since Java 7 is not fully backwards compatible, people have been running into version problems.
So, we have made the difficult but necessary decision to follow the tide, and migrate the GATK to Java 7. Starting with this release, GATK will now require Java 7 to run. If you try to run with Java 6, you will probably get an error like this:
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/sting/gatk/CommandLineGATK : Unsupported major.minor version 51.0
If you're not sure what version of Java you are currently using, you can find out very easily by typing the following command:
which should return something like this:
java version "1.7.0_17" Java(TM) SE Runtime Environment (build 1.7.0_17-b02) Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
If not, you'll need to update your java version. If you have any difficulty doing this, please don’t ask us in the forum -- you’ll get much better, faster help if you ask your local IT department.
This is going to be a short one, folks. The 2.5 release is pretty much all about bug fixes, with a couple of exceptions that we'll cover below.
Remember how we said that version 2.4 was going to be the least buggy ever? Well, that might have been a bit optimistic. We had a couple of stumpers in there -- and a flurry of little ones that were probably not novel (i.e. not specific to version 2.5) but finally bubbled up to the surface. We're not going to go over the bug fixes in detail, since the release notes include a comprehensive list. Basically, those are all fixed.
Well, not exactly new features, but noteworthy improvements to existing tools.
In addition to countless bug fixes, we've made drastic improvements to ReduceReads' compression algorithm, so you can now achieve much better compression rates without compromising on the retention of informative data. Keep in mind of course that as always, you'll see much bigger gains on certain types of data sets -- the higher the coverage in your original BAM files, the bigger the savings in file size and performance of the downstream tools.
We say this every time, and every time it's true: we've made some more improvements to the HaplotypeCaller that make it faster and more accurate. Well, it's still slower than the UnifiedGenotyper, in case you were going to ask (of course you were). But on the accuracy front, we say this without reservation or caveat: HC is now just as accurate as the UG for calling SNPs, and it is in a league of its own for calling indels. If you are even remotely interested in indels you should absolutely take it out for a spin. Go. Now.
Say goodbye to the mood swings and the pimples; it looks like this tool's awkward teenager phase is finally over. We've entirely reworked how DiagnoseTargets functions so it now uses a plugin system, which we think is much more convenient. This plugin system will be explained in detail in a forthcoming documentation article.
You may be aware that we had imposed a freeze of sorts on the annotation database version that could be used with the snpEff annotation. Well, we're happy to report that the author of the snpEff software package has made some significant upgrades, including a feature called GATK compatibility mode. As a result there is no longer any version constraint. We'll be updating our documentation on using snpEff with GATK soon (-ish), but in the meantime, feel free to go forth and annotate away. Just make sure to consult the snpEff manual for relevant information on using it with GATK.
Even as the dev team giveth, the dev team taketh away.
A few annotations were removed from the VariantAnnotator stables (as listed in the release notes), mainly because they didn't work properly. With all the caveats about how GATK is research software, we're still committed to providing quality tools that do something close to what they're advertised to do, at the bare minimum. If something doesn't fulfill that requirement, it's out.
We've also disabled the auto-generation of fai/dict files for fasta references. I can hear some of you groaning all the way from here. Yes, it was convenient -- but far too buggy. Come on people, it's a one-liner using Picard. Oh, and we're no longer allowing the use of compressed (.gz) references either -- also too buggy. The space savings were simply not worth the headaches.
Release version 2.3 is the last before the winter holidays, so we've done our best not to put in anything that will break easily. Which is not to say there's nothing important - this release contains a truckload of feature tweaks and bug fixes (see the release notes in the next tab for full list). And we do have one major new feature for you: a brand-spanking-new downsampler to replace the old one.
It has recently come to our attention that some datasets are not encoded in the standard format (Q0 == ASCII 33 according to the SAM specification, whereas Illumina encoding starts at ASCII 64). This is a problem because the GATK assumes that it can use the quality scores as they are. If they are in fact encoded using a different scale, our tools will make an incorrect estimation of the quality of your data, and your analysis results will be off. To prevent this from happening, we've added a sanity check of the quality score encodings that will abort the program run if they are not standard. If this happens to you, you'll need to run again with the flag
-fixMisencodedQuals). What will happen is that the engine will simply subtract 31 from every quality score as it is read in, and proceed with the corrected values. Output files will include the correct scores where applicable.
Good news on the performance front: we eliminated a bottleneck in the GATK engine that increased the runtime of many tools by as much as 10x, depending on the exact details of the data being fed into the GATK. The problem was caused by the internal timing code invoking expensive system timing resources far too often. Imagine you looked at your watch every two seconds -- it would take you ages to get anything done, right? Anyway, if you see your tools running unusually quickly, don't panic! This may be the reason, and it's a good thing.
You can now co-reduce separate BAM files by passing them in with multiple
-I or as an input list. The motivation for this is that samples that you plan to analyze together (e. g. tumor-normal pairs or related cohorts) should be reduced together, so that if a disagreement is triggered at a locus for one sample, that locus will remain unreduced in all samples. You will therefore conserve the full depth of information for later analysis of that locus.
The downsampler is the component of the GATK engine that handles downsampling, i. e. the process of removing a subset of reads from a pileup. The goal of this process is to speed up execution of the desired analysis, particularly in genome regions that are covered by excessive read depth.
In this release, we have replaced the old downsampler with a brand new one that extends some options and performs much better overall.
The GATK offers two different options for downsampling:
-dcov) enables you to set the maximum amount of coverage to keep at any position
-dfrac) enables you to remove a proportional amount of the reads at any position (e. g. take out half of all the reads)
Until now, it was not possible to use the
-dcov) option with read walkers; you were limited to using
-dfrac). In the new release, you will be able to downsample to coverage for read walkers.
However, please note that the process is a little different. The normal way of downsampling to coverage (e. g. for locus walkers) involves downsampling over the entire pileup of reads in one take. Due to technical reasons, it is still not possible to do that exact process for read walkers; instead the read-walker-compatible way of doing it involves downsampling within subsets of reads that are all aligned at the same starting position. This different mode of operation means you shouldn't use the same range of values; where you would use
-dcov 100 for a locus walker, you may need to use
-dcov 10 for a read walker. And these are general estimates - your mileage may vary depending on your dataset, so we recommend testing before applying on a large scale.
One important property of the downsampling process is that it should be as random as possible to avoid introducing biases into the selection of reads that will be kept for analysis. Unfortunately our old downsampler - specifically, the part of the downsampler that performed the downsampling to coverage - suffered from some biases. The most egregious problem was that as it walked through the data, it tended to privilege more recently encountered reads and displaced "older" reads. The new downsampler no longer suffers from these biases.
The old downsampler was embedded in the engine code in a way that made it hard to test in a systematic way. So when we implemented the new downsampler, we reorganized the code to make it a standalone engine component - the equivalent of promoting it from the cubicle farm to its own corner office. This has allowed us to cover it much better with systematic tests, so we have better assessment of whether it's working properly.
The new downsampler is enabled by default and we are confident that it works much better than the old one. BUT as with all brand-spanking-new features, early adopters may run into unexpected rough patches. So we're providing a way to disable it and use the old one, which is still in the box for now: just add
-use_legacy_downsampler to your command line. Obviously if you use this AND
-dcov with a read walker, you'll get an error, since the old downsampler can't downsample to coverage for read walkers.
We're very excited to present release version 2.2 to the public. As those of you who have been with us for a while know, it's been a much longer time than usual since the last minor release (v 2.1). Ah, but don't let the "minor" name fool you - this release is chock-full of major improvements that are going to make a big difference to pretty much everyone's use of the GATK. That's why it took longer to put together; we hope you'll agree it was worth the wait!
The biggest changes in this release fall in two categories: enhanced performance and improved accuracy. This is rounded out by a gaggle of bug fixes and updates to the resource bundle.
We know y'all have variants to call and papers to publish, so we've pulled out all the stops to make the GATK run faster without costing 90% of your grant in computing hardware. First, we're introducing a new multi-threading feature called Nanoscheduler that we've added to the GATK engine to expand your options for parallel processing. Thanks to the Nanoscheduler, we're finally able to bring multi-threading back to the BaseRecalibrator. We've also made some seriously hard-core algorithm optimizations to ReduceReads and the two variant callers, UnifiedGenotyper and HaplotypeCaller, that will cut your runtimes down so much you won't know what to do with all the free time. Or, you'll actually be able to get those big multisample analyses done in a reasonable amount of time…
This new multi-threading feature of the GATK engine allows you to take advantage of having multiple cores per machine, whether in your desktop computer or on your server farm. Basically, the Nanoscheduler creates clones of the GATK, assigns a subset of the job to each and runs it on a different core of the machine. Usage is similar to the
-nt mode you may already be familiar with, except you call this one with the new
-nct argument. Note that the Nanoscheduler currently reserves one thread for itself, which acts like a manager (it bosses the other threads around but doesn't get much work done itself) so to see any real performance gain you'll need to use at least
-nct 3, which yields two "worker" threads. This is a limitation of the current implementation which we hope to resolve soon. See the updated document on Parallelism with the GATK (v2) (link coming soon) for more details of how the Nanoscheduler works, as well as recommendations on how to optimize parallelization for each of the main GATK tools.
Many of you have complained that the rebooted BaseRecalibrator in GATK2 takes forever to run. Rightly so, because until now, you couldn't effectively run it in multi-threaded mode. The reason for that is fairly technical, but in essence, whenever a thread started working on a chunk of data it locked down access to the rest of the dataset, so any other threads would have to wait for it to finish working before they could begin. That's not really multi-threading, is it? No, we didn't think so either. So we rewrote the BaseRecalibrator to not do that anymore, and we gave it a much saner and effective way of handling thread safety: each thread locks down just the chunk of data it's assigned to process, not the whole dataset. The graph below shows the performance gains of the new system over the old one. Note that in practice, this is operated by the Nanoscheduler (see above); so remember, if you want to parallelize BaseRecalibrator, use
-nt, and be sure to assign three or more threads.
Without going into the gory technical details, we optimized the underlying compression algorithm that powers ReduceReads, and we're seeing some very significant improvements in runtime. For a "best-case scenario" BAM file, i.e. a well-formatted BAM with no funny business, the average is about a three-fold decrease in runtime. Yes, it's three times faster! And if that doesn't impress you, you may be interested to know that for "worst-case scenario" BAM files (which are closer to what we see in the wild, so to speak, than in our climate-controlled test facility) we see orders of magnitude of difference in runtimes. That's tens to hundreds of times faster. To many of you, that will make the difference between being able to reduce reads or not. Considering how reduced BAMs can help bring down storage needs and runtimes in downstream operations as well -- it's a pretty big deal.
Ah, another algorithm optimization that makes things go faster. This one affects the EXACT model that underlies how the UG calls variants. We've modified it to use a new approach to multiallelic discovery, which greatly improves scalability of joint calling for multi-sample projects. Previously, the relationship between the number of possible alternate alleles and the difficulty of the calculation (which directly impacts runtime) was exponential. So you had to place strict limits on the number of alternate alleles allowed (like 3, tops) if you wanted the UG run to finish during your lifetime. With the updated model, the relationship is linear, allowing the UG to comfortably handle around 6 to 10 alternate alleles without requiring some really serious hardware to run on. This will mostly affect projects with very diverse samples (as opposed to more monomorphic ones).
The last algorithm optimization for this release, but certainly not the least (there is no least, and no parent ever has a favorite child), this one affects the likelihood model used by the HaplotypeCaller. Previously, the HaplotypeCaller's HMM required calculations to be made in logarithmic space in order to maintain precision. These log-space calculations were very costly in terms of performance, and took up to 90% of the runtime of the HaplotypeCaller. Everyone and their little sister has been complaining that it operates on a geological time scale, so we modified it to use a new approach that gets rid of the log-space calculations without sacrificing precision. Words cannot express how well that worked, so here's a graph.
This graph shows runtimes for HaplotypeCaller and UnifiedGenotyper before (left side) and after (right side) the improvements described above. Note that the version numbers refer to development versions and do not map directly to the release versions.
Alright, going faster is great, I hear you say, but are the results any good? We're a little insulted that you asked, but we get it -- you have responsibilities, you have to make sure you get the best results humanly possible (and then some). So yes, the results are just as good with the faster tools -- and we've actually added a couple of features to make them even better than before. Specifically, the BaseRecalibrator gets a makeover that improves indel scores, and the UnifiedGenotyper gets equipped with a nifty little trick to minimize the impact of low-grade sample contamination.
When we brought multi-threading back to the BaseRecalibrator, we also revamped how the tool evaluates each read. Previously, the BaseRecalibrator accepted the read alignment/position issued by the aligner, and made all its calculations based on that alignment. But aligners make mistakes, so we've rewritten it to also consider other possible alignments and use a probabilistic approach to make its calculations. This delocalized approach leads to improved accuracy for indel quality scores.
In an ideal world, your samples would never get contaminated by other DNA. This is not an ideal world. Sample contamination happens more often than you'd think; usually at a low-grade level, but still enough to skew your results. To counteract this problem, we've added a contamination filter to the UnifiedGenotyper. Given an estimated level of contamination, the genotyper will downsample reads by that fraction for each allele group. By default, this number is set at 5% for high-pass data. So in other words, for each allele it detects, the genotyper throws out 5% of reads that have that allele.
We realize this may raise a few eyebrows, but trust us, it works, and it's safe. This method respects allelic proportions, so if the actual contamination is lower, your results will be unaffected, and if a significant amount of contamination is indeed present, its effect on your results will be minimized. If you see differences between results called with and without this feature, you have a contamination problem.
Note that this feature is turned ON by default. However it only kicks in above a certain amount of coverage, so it doesn't affect low-pass datasets.
We've added a lot of systematic tests to the new tools and features that were introduced in GATK 2.0 and 2.1 (Full versions), such as ReduceReads and the HaplotypeCaller. This has enabled us to flush out a lot of the "growing pains" bugs, in addition to those that people have reported on the forum, so all that is fixed now. We realize many of you have been waiting a long time for some of these bug fixes, so we thank you for your patience and understanding. We've also fixed the few bugs that popped up in the mature tools; these are all fixed in both Full and Lite versions of course.
Details will be available in the new Change log shortly.
Finally, we've updated the resource bundle with a variant callset that can be used as a standard for setting up your variant calling pipelines. Briefly, we generated this callset from the raw BAMs of our favorite trio (CEU Trio) according to our Best Practices (using the UnifiedGenotyper on unreduced BAMs). We additionally phased the calls using PhaseByTransmission. We've also updated the HapMap VCF.
Note that from now on, we plan to generate a new callset with each major and minor release, and the numbering of the bundle versions will follow the GATK version numbers to avoid any confusion.