There now is a "gsa-announce" mailing list that you can subscribe to if you want to be notified by email whenever a new release of the GATK becomes available.
To subscribe, send an email to:
email@example.com with the subject line subscribe.
To unsubscribe, send an email to:
firstname.lastname@example.org with the subject line unsubscribe.
The list is for announcements only -- you cannot post to it. Support requests should go here to our discussion forum.
This is going to be a short one, folks. The 2.5 release is pretty much all about bug fixes, with a couple of exceptions that we'll cover below.
Remember how we said that version 2.4 was going to be the least buggy ever? Well, that might have been a bit optimistic. We had a couple of stumpers in there -- and a flurry of little ones that were probably not novel (i.e. not specific to version 2.5) but finally bubbled up to the surface. We're not going to go over the bug fixes in detail, since the release notes include a comprehensive list. Basically, those are all fixed.
Well, not exactly new features, but noteworthy improvements to existing tools.
In addition to countless bug fixes, we've made drastic improvements to ReduceReads' compression algorithm, so you can now achieve much better compression rates without compromising on the retention of informative data. Keep in mind of course that as always, you'll see much bigger gains on certain types of data sets -- the higher the coverage in your original BAM files, the bigger the savings in file size and performance of the downstream tools.
We say this every time, and every time it's true: we've made some more improvements to the HaplotypeCaller that make it faster and more accurate. Well, it's still slower than the UnifiedGenotyper, in case you were going to ask (of course you were). But on the accuracy front, we say this without reservation or caveat: HC is now just as accurate as the UG for calling SNPs, and it is in a league of its own for calling indels. If you are even remotely interested in indels you should absolutely take it out for a spin. Go. Now.
Say goodbye to the mood swings and the pimples; it looks like this tool's awkward teenager phase is finally over. We've entirely reworked how DiagnoseTargets functions so it now uses a plugin system, which we think is much more convenient. This plugin system will be explained in detail in a forthcoming documentation article.
You may be aware that we had imposed a freeze of sorts on the annotation database version that could be used with the snpEff annotation. Well, we're happy to report that the author of the snpEff software package has made some significant upgrades, including a feature called GATK compatibility mode. As a result there is no longer any version constraint. We'll be updating our documentation on using snpEff with GATK soon (-ish), but in the meantime, feel free to go forth and annotate away. Just make sure to consult the snpEff manual for relevant information on using it with GATK.
Even as the dev team giveth, the dev team taketh away.
A few annotations were removed from the VariantAnnotator stables (as listed in the release notes), mainly because they didn't work properly. With all the caveats about how GATK is research software, we're still committed to providing quality tools that do something close to what they're advertised to do, at the bare minimum. If something doesn't fulfill that requirement, it's out.
We've also disabled the auto-generation of fai/dict files for fasta references. I can hear some of you groaning all the way from here. Yes, it was convenient -- but far too buggy. Come on people, it's a one-liner using Picard. Oh, and we're longer allowing the use of compressed (.gz) references either -- also too buggy. The space savings were simply not worth the headaches.
We're very excited to present release version 2.2 to the public. As those of you who have been with us for a while know, it's been a much longer time than usual since the last minor release (v 2.1). Ah, but don't let the "minor" name fool you - this release is chock-full of major improvements that are going to make a big difference to pretty much everyone's use of the GATK. That's why it took longer to put together; we hope you'll agree it was worth the wait!
The biggest changes in this release fall in two categories: enhanced performance and improved accuracy. This is rounded out by a gaggle of bug fixes and updates to the resource bundle.
We know y'all have variants to call and papers to publish, so we've pulled out all the stops to make the GATK run faster without costing 90% of your grant in computing hardware. First, we're introducing a new multi-threading feature called Nanoscheduler that we've added to the GATK engine to expand your options for parallel processing. Thanks to the Nanoscheduler, we're finally able to bring multi-threading back to the BaseRecalibrator. We've also made some seriously hard-core algorithm optimizations to ReduceReads and the two variant callers, UnifiedGenotyper and HaplotypeCaller, that will cut your runtimes down so much you won't know what to do with all the free time. Or, you'll actually be able to get those big multisample analyses done in a reasonable amount of time…
This new multi-threading feature of the GATK engine allows you to take advantage of having multiple cores per machine, whether in your desktop computer or on your server farm. Basically, the Nanoscheduler creates clones of the GATK, assigns a subset of the job to each and runs it on a different core of the machine. Usage is similar to the
-nt mode you may already be familiar with, except you call this one with the new
-nct argument. Note that the Nanoscheduler currently reserves one thread for itself, which acts like a manager (it bosses the other threads around but doesn't get much work done itself) so to see any real performance gain you'll need to use at least
-nct 3, which yields two "worker" threads. This is a limitation of the current implementation which we hope to resolve soon. See the updated document on Parallelism with the GATK (v2) (link coming soon) for more details of how the Nanoscheduler works, as well as recommendations on how to optimize parallelization for each of the main GATK tools.
Many of you have complained that the rebooted BaseRecalibrator in GATK2 takes forever to run. Rightly so, because until now, you couldn't effectively run it in multi-threaded mode. The reason for that is fairly technical, but in essence, whenever a thread started working on a chunk of data it locked down access to the rest of the dataset, so any other threads would have to wait for it to finish working before they could begin. That's not really multi-threading, is it? No, we didn't think so either. So we rewrote the BaseRecalibrator to not do that anymore, and we gave it a much saner and effective way of handling thread safety: each thread locks down just the chunk of data it's assigned to process, not the whole dataset. The graph below shows the performance gains of the new system over the old one. Note that in practice, this is operated by the Nanoscheduler (see above); so remember, if you want to parallelize BaseRecalibrator, use
-nt, and be sure to assign three or more threads.
Without going into the gory technical details, we optimized the underlying compression algorithm that powers ReduceReads, and we're seeing some very significant improvements in runtime. For a "best-case scenario" BAM file, i.e. a well-formatted BAM with no funny business, the average is about a three-fold decrease in runtime. Yes, it's three times faster! And if that doesn't impress you, you may be interested to know that for "worst-case scenario" BAM files (which are closer to what we see in the wild, so to speak, than in our climate-controlled test facility) we see orders of magnitude of difference in runtimes. That's tens to hundreds of times faster. To many of you, that will make the difference between being able to reduce reads or not. Considering how reduced BAMs can help bring down storage needs and runtimes in downstream operations as well -- it's a pretty big deal.
Ah, another algorithm optimization that makes things go faster. This one affects the EXACT model that underlies how the UG calls variants. We've modified it to use a new approach to multiallelic discovery, which greatly improves scalability of joint calling for multi-sample projects. Previously, the relationship between the number of possible alternate alleles and the difficulty of the calculation (which directly impacts runtime) was exponential. So you had to place strict limits on the number of alternate alleles allowed (like 3, tops) if you wanted the UG run to finish during your lifetime. With the updated model, the relationship is linear, allowing the UG to comfortably handle around 6 to 10 alternate alleles without requiring some really serious hardware to run on. This will mostly affect projects with very diverse samples (as opposed to more monomorphic ones).
The last algorithm optimization for this release, but certainly not the least (there is no least, and no parent ever has a favorite child), this one affects the likelihood model used by the HaplotypeCaller. Previously, the HaplotypeCaller's HMM required calculations to be made in logarithmic space in order to maintain precision. These log-space calculations were very costly in terms of performance, and took up to 90% of the runtime of the HaplotypeCaller. Everyone and their little sister has been complaining that it operates on a geological time scale, so we modified it to use a new approach that gets rid of the log-space calculations without sacrificing precision. Words cannot express how well that worked, so here's a graph.
This graph shows runtimes for HaplotypeCaller and UnifiedGenotyper before (left side) and after (right side) the improvements described above. Note that the version numbers refer to development versions and do not map directly to the release versions.
Alright, going faster is great, I hear you say, but are the results any good? We're a little insulted that you asked, but we get it -- you have responsibilities, you have to make sure you get the best results humanly possible (and then some). So yes, the results are just as good with the faster tools -- and we've actually added a couple of features to make them even better than before. Specifically, the BaseRecalibrator gets a makeover that improves indel scores, and the UnifiedGenotyper gets equipped with a nifty little trick to minimize the impact of low-grade sample contamination.
When we brought multi-threading back to the BaseRecalibrator, we also revamped how the tool evaluates each read. Previously, the BaseRecalibrator accepted the read alignment/position issued by the aligner, and made all its calculations based on that alignment. But aligners make mistakes, so we've rewritten it to also consider other possible alignments and use a probabilistic approach to make its calculations. This delocalized approach leads to improved accuracy for indel quality scores.
In an ideal world, your samples would never get contaminated by other DNA. This is not an ideal world. Sample contamination happens more often than you'd think; usually at a low-grade level, but still enough to skew your results. To counteract this problem, we've added a contamination filter to the UnifiedGenotyper. Given an estimated level of contamination, the genotyper will downsample reads by that fraction for each allele group. By default, this number is set at 5% for high-pass data. So in other words, for each allele it detects, the genotyper throws out 5% of reads that have that allele.
We realize this may raise a few eyebrows, but trust us, it works, and it's safe. This method respects allelic proportions, so if the actual contamination is lower, your results will be unaffected, and if a significant amount of contamination is indeed present, its effect on your results will be minimized. If you see differences between results called with and without this feature, you have a contamination problem.
Note that this feature is turned ON by default. However it only kicks in above a certain amount of coverage, so it doesn't affect low-pass datasets.
We've added a lot of systematic tests to the new tools and features that were introduced in GATK 2.0 and 2.1 (Full versions), such as ReduceReads and the HaplotypeCaller. This has enabled us to flush out a lot of the "growing pains" bugs, in addition to those that people have reported on the forum, so all that is fixed now. We realize many of you have been waiting a long time for some of these bug fixes, so we thank you for your patience and understanding. We've also fixed the few bugs that popped up in the mature tools; these are all fixed in both Full and Lite versions of course.
Details will be available in the new Change log shortly.
Finally, we've updated the resource bundle with a variant callset that can be used as a standard for setting up your variant calling pipelines. Briefly, we generated this callset from the raw BAMs of our favorite trio (CEU Trio) according to our Best Practices (using the UnifiedGenotyper on unreduced BAMs). We additionally phased the calls using PhaseByTransmission. We've also updated the HapMap VCF.
Note that from now on, we plan to generate a new callset with each major and minor release, and the numbering of the bundle versions will follow the GATK version numbers to avoid any confusion.
Release version 2.3 is the last before the winter holidays, so we've done our best not to put in anything that will break easily. Which is not to say there's nothing important - this release contains a truckload of feature tweaks and bug fixes (see the release notes in the next tab for full list). And we do have one major new feature for you: a brand-spanking-new downsampler to replace the old one.
It has recently come to our attention that some datasets are not encoded in the standard format (Q0 == ASCII 33 according to the SAM specification, whereas Illumina encoding starts at ASCII 64). This is a problem because the GATK assumes that it can use the quality scores as they are. If they are in fact encoded using a different scale, our tools will make an incorrect estimation of the quality of your data, and your analysis results will be off. To prevent this from happening, we've added a sanity check of the quality score encodings that will abort the program run if they are not standard. If this happens to you, you'll need to run again with the flag
-fixMisencodedQuals). What will happen is that the engine will simply subtract 31 from every quality score as it is read in, and proceed with the corrected values. Output files will include the correct scores where applicable.
Good news on the performance front: we eliminated a bottleneck in the GATK engine that increased the runtime of many tools by as much as 10x, depending on the exact details of the data being fed into the GATK. The problem was caused by the internal timing code invoking expensive system timing resources far too often. Imagine you looked at your watch every two seconds -- it would take you ages to get anything done, right? Anyway, if you see your tools running unusually quickly, don't panic! This may be the reason, and it's a good thing.
You can now co-reduce separate BAM files by passing them in with multiple
-I or as an input list. The motivation for this is that samples that you plan to analyze together (e. g. tumor-normal pairs or related cohorts) should be reduced together, so that if a disagreement is triggered at a locus for one sample, that locus will remain unreduced in all samples. You will therefore conserve the full depth of information for later analysis of that locus.
The downsampler is the component of the GATK engine that handles downsampling, i. e. the process of removing a subset of reads from a pileup. The goal of this process is to speed up execution of the desired analysis, particularly in genome regions that are covered by excessive read depth.
In this release, we have replaced the old downsampler with a brand new one that extends some options and performs much better overall.
The GATK offers two different options for downsampling:
-dcov) enables you to set the maximum amount of coverage to keep at any position
-dfrac) enables you to remove a proportional amount of the reads at any position (e. g. take out half of all the reads)
Until now, it was not possible to use the
-dcov) option with read walkers; you were limited to using
-dfrac). In the new release, you will be able to downsample to coverage for read walkers.
However, please note that the process is a little different. The normal way of downsampling to coverage (e. g. for locus walkers) involves downsampling over the entire pileup of reads in one take. Due to technical reasons, it is still not possible to do that exact process for read walkers; instead the read-walker-compatible way of doing it involves downsampling within subsets of reads that are all aligned at the same starting position. This different mode of operation means you shouldn't use the same range of values; where you would use
-dcov 100 for a locus walker, you may need to use
-dcov 10 for a read walker. And these are general estimates - your mileage may vary depending on your dataset, so we recommend testing before applying on a large scale.
One important property of the downsampling process is that it should be as random as possible to avoid introducing biases into the selection of reads that will be kept for analysis. Unfortunately our old downsampler - specifically, the part of the downsampler that performed the downsampling to coverage - suffered from some biases. The most egregious problem was that as it walked through the data, it tended to privilege more recently encountered reads and displaced "older" reads. The new downsampler no longer suffers from these biases.
The old downsampler was embedded in the engine code in a way that made it hard to test in a systematic way. So when we implemented the new downsampler, we reorganized the code to make it a standalone engine component - the equivalent of promoting it from the cubicle farm to its own corner office. This has allowed us to cover it much better with systematic tests, so we have better assessment of whether it's working properly.
The new downsampler is enabled by default and we are confident that it works much better than the old one. BUT as with all brand-spanking-new features, early adopters may run into unexpected rough patches. So we're providing a way to disable it and use the old one, which is still in the box for now: just add
-use_legacy_downsampler to your command line. Obviously if you use this AND
-dcov with a read walker, you'll get an error, since the old downsampler can't downsample to coverage for read walkers.
We are very proud (and more than a little relieved) to finally present version 2.4 of the GATK! It's been a long time coming, but we're certain you'll find it well worth the wait. This release is bursting at the seams with new features and improvements, as you'll read below. It is also very probably going to be our least-buggy initial release yet, thanks to the phenomenal effort that went into adding extensive automated tests to the codebase.
Important note: Keep in mind that this new release comes with a brand new license, as we announced a few weeks ago here. Be sure to at least check out the figure that explains the different packages we (and our commercial partner Appistry) offer, and get the one that is appropriate for your use of the GATK.
With that disclaimer out of the way, here are the feature highlights of version 2.4!
Let's start with what everyone wants to hear about: improvements in speed and accuracy. There are in fact far more improvements in accuracy than are described here, again because of the extensive test coverage we've added to the codebase. But here are the ones that we believe will have the most impact on your work.
We realized that even though BaseRecalibrator was doing a fabulous job in general, the calculation for the empirical quality of a bin (e.g. all bases at the 33rd cycle of a read) was not always accurate. Specifically, we would draw the same conclusions from bins with many or few observations -- but in the latter case that was not necessarily correct (we were seeing some Q6s get recalibrated up to Q30s, for example). We changed this behavior so that the BaseRecalibrator now calculates a proper Bayesian estimate of the empirical quality. As a result, for bins with very little data, the likelihood is dwarfed by a prior probability that tends towards the original quality; there is no effect on large bins, which were already fine. This brings noticeable improvements in the genotype likelihoods being produced from the genotypes, in particular for the heterozygous state (as expected).
You may remember that in the highlights for version 2.2, we were excited to announce that the HaplotypeCaller was no longer operating on geological time scales. Well, now the HC has made another big leap forward in terms of speed -- and it is now almost as fast as the UnifiedGenotyper. If you were reluctant to move from the UG to the HC based on runtime, that shouldn't be an issue anymore! Or, if you were unconvinced by the merits of the new calling algorithm, you'll be interested to know that our internal tests show that the HaplotypeCaller is now more accurate in calling variants (SNPs as well as Indels) than the UnifiedGenotyper.
How did we make this happen? There are too many changes to list here, but one of the key modifications that makes the HaplotypeCaller much faster (without sacrificing any accuracy!) is that we've greatly optimized how local Smith-Waterman re-assembly is applied. Previously, when the HC encountered a region where reassembly was needed, it performed SW re-assembly on the entire region, which was computationally very demanding. In the new implementation, the HC generates a "bubble" (yes, that's the actual technical term) around each individual haplotype, and applies the SW re-assembly only within that bubble. This brings down the computational challenge by orders of magnitude.
We're not just fluffing up the existing tools -- we're also adding new tools to extend the capabilities of our toolkit.
A new Read Filter, ReassignOneMappingQualityFilter, allows you to -- well, it's in the name -- reassign one mapping quality. This is useful for example to process data output by programs like TopHat which use MAPQ = 255 to convey meaningful information. The GATK would normally ignore any reads with that mapping quality. With the new filter, you can selectively reassign that quality to something else so that those reads will get utilized, without affecting the rest of your dataset.
In addition, the recently introduced contamination filter gets upgraded with the option to apply decontamination individually per sample.
Version 2.4 includes several new tools that grew out of existing tool options. The rationale for making them standalone tools is that they represent particularly useful capabilities that merit expansion, and expanding them within their "mother tool" was simply too cumbersome.
GenotypeConcordance graduates from being a module of VariantEval, to being its own fully-fledged tool. This comes with many bug fixes and an overhaul of how the concordance results are tabulated, which we hope will cause less confusion than it has in the past!
RegenotypeVariants takes over -- and improves upon -- the functionality previously provided by the
--regenotype option of SelectVariants. This tool allows you to refresh the genotype information in a VCF file after samples have been added or removed.
And we're also adding CatVariants, a tool to quickly combine multiple VCF files whose records are non-overlapping (e.g. as produced during scatter-gather using Queue). This should be a useful alternative to CombineVariants, which is primarily meant for more complex combination operations.
Going forward, we have decided to provide nightly automated builds from our development tree. This means that you can get the very latest development version -- no need to wait weeks for bug fixes or new features anymore! However, this comes with a gigantic caveat emptor: these are bleeding-edge versions that are likely to contain bugs, and features that have never been tested in the wild. And they're automatically generated at night, so we can't even guarantee that they'll run. All we can say of any of them is that the code was able to compile -- beyond that, we're off the hook. We won't answer support questions about the new stuff. So in short: you want to try the nightlies, you do so at your own risk.
If any of the above scares or confuses you, no problem -- just stay well clear of the owl and you won't get bitten.
But hey, if you're feeling particularly brave or lucky, have fun :)
The release of version 2.4 also coincides with some upgrades to the documentation that are significant enough to merit a brief mention.
From here on, every release (including minor releases, such as 2.3-9) will be accompanied by the generation of a PDF Guide Book that contains the online documentation articles as they are at that time. It will not only allow you to peruse the documentation offline, but it will also serve as versioned documentation. This way, if in the future you need to go back and examine results you obtained with an older version of the GATK, you can find easily find the documentation that was valid at that time. Note that the Technical Documentation (which contains the exhaustive lists of arguments for each tool) is not included in the Guide Book since it can be generated directly from the source code.
Speaking of the Technical Documentation, we are happy to announce that we've enriched those pages with additional information, including available parallelization options and default read filters for each tool, where applicable. We've also reorganized the main categories in the Technical Documentation index to make it easier to browse tools and find what you need.
Finally, a few words for developers who have previous experience with the GATK codebase. The VariantContext and related classes have been moved out of the GATK codebase and into the Picard public repository. The GATK now uses the resulting Variant.jar as an external library (currently version 1.85.1357). We've also updated the Picard and Tribble jars to version 1.84.1337.