This may seem crazy considering we released the big 3.0 version not two weeks ago, but yes, we have a new version for you already! It's a bit of a special case because this release is all about the hardware-based optimizations we had previously announced. What we hadn't announced yet was that this is the fruit of a new collaboration with a team at Intel (which you can read more about here), so we were waiting for everyone to be ready for the big reveal.
So basically, the story is that we've started collaborating with the Intel Bio Team to enable key parts of the GATK to run more efficiently on certain hardware configurations. For our first project together, we tackled the PairHMM algorithm, which is responsible for a large proportion of the runtime of HaplotypeCaller analyses. The resulting optimizations, which are the main feature in version 3.1, produce significant speedups for HaplotypeCaller runs on a wide range of hardware.
We will continue working with Intel to further improve the performance of GATK tools that have historically been afflicted with performance issues and long runtimes (hello BQSR). As always, we hope these new features will make your life easier, and we welcome your feedback in the forum!
Note that these optimizations currently work on Linux systems only, and will not work on Mac or Windows operating systems. In the near future we will add support for Mac OS. We have no plans to add support for Windows since the GATK itself does not run on Windows.
Please note also that to take advantage of these optimizations, you need to opt-in by adding the following flag to your GATK command:
Here is a handy little table of the speedups you can expect depending on the hardware and operating system you are using. The configurations given here are the minimum requirements for benefiting from the expected speedup ranges shown in the third column. Keep in mind that these numbers are based on tests in controlled conditions; in the wild, your mileage may vary.
|Linux kernel version||Architecture / Processor||Expected speedup||Instruction set|
|Any 64-bit Linux||Any x86 64-bit||1-1.5x||Non-vector|
|Linux 2.6 or newer||Penryn (Core 2 or newer)||1.3-1.8x||SSE 4.1|
|Linux 2.6.30 or newer||SandyBridge (i3, i5, i7, Xeon E3, E5, E7 or newer)||2-2.5x||AVX|
To find out exactly which processor is in your machine, you can run this command in the terminal:
$ cat /proc/cpuinfo | grep "model name" model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
In this example, the machine has 4 cores (8-threads), so you see the answer 8 times. With the model name (here i7-2600) you can look up your hardware's relevant capabilities in the Wikipedia page on vector extensions.
Alternatively, Intel has provided us with some links to lists of processors categorized by architecture, in which you can look up your hardware:
Finally, a few notes to clarify some concepts regarding Linux kernels vs. distributions and processors vs. architectures:
SandyBridge and Penryn are microarchitectures; essentially, these are sets of instructions built into the CPU. Core 2, core i3, i4, i7, Xeon e3, e5, e7 are the processors that will implement a specific architecture to make use of the relevant improvements (see table above).
The Linux kernel has no connection with Linux distribution (e.g. Ubuntu, RedHat etc). Any distribution can use any kernel they want. There are "default kernels" shipped with each distribution, but that's beyond the scope of this article to cover (there are at least 300 Linux distributions out there). But you can always install whatever kernel version you want.
The kernel version 2.6.30 was released in 2009, so we expect every sane person or IT out there to be using something better than this.
Previously, we covered the spirit of GATK 3.0 (what our intentions are for this new release, and what we’re hoping to achieve). Let’s now have a look at the top three features you can look forward to in 3.0, in no particular order:
At this point everyone knows that the HaplotypeCaller is fabulous (you know this, right?) but beyond a certain number of samples that you’re trying to call jointly, it just grinds to a crawl, and any further movement is on the scale of continental drift. Obviously this is a major obstacle if you’re trying to do any kind of work at scale beyond a handful of samples, and that’s why it hasn’t been used in recent large-cohort projects despite showing best-in-class performance in terms of discovery power.
The major culprit in this case is the PairHMM algorithm, which takes up the lion’s share of HC runtime. With the help of external collaborators (to be credited in a follow-up post) we rewrote the code of the PairHMM to make it orders of magnitude faster, especially on specialized hardware like GPU and FPGA chips (but you’ll still see a speedup on “regular” hardware).
We plan to follow up on this by doing similar optimizations on the other “slowpoke” algorithms that are responsible for long runtimes in GATK tools.
Some problems in variant calling can’t be solved by Daft Punk hardware upgrades (better faster stronger) alone. Beyond the question of speed, a major issue with multi-sample variant discovery is that you have to wait until all the samples are available to call variants on them. Then, if later you want to add some more samples to your cohort, you have to re-call all of them together, old and new. This, also known as the “N+1 problem”, is a huge pain in the anatomy.
The underlying idea of the “single-sample pipeline for joint variant discovery” is to decouple the two steps in the variant calling process: identifying evidence of variation, and interpreting the evidence. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and a heck of a lot faster) on one sample at a time.
The new pipeline allows us to process each sample as it comes off the sequencing machine, up to the first step of variant calling. Cumulatively, this will produce a database of per-sample, per-site allele frequencies. Then it’s just a matter of running a joint analysis on the database, which can be done incrementally each time a new sample is added or at certain intervals or timepoints, depending on the research needs, because this step runs quickly and cheaply.
We’ll go into the details of exactly how this works in a follow-up post. For now, the take-home message is that it’s a “single-sample pipeline” because you do the heavy-lifting per-sample (and just once, ever), but you are empowered to perform “joint discovery” because you interpret the evidence from each sample in light of what you see in all the other samples, and you can do this at any point in the project timeline.
Our Best Practices recommendations for calling variants on DNA sequence data have proved to be wildly popular with the scientific community, presumably because it takes a lot of the guesswork out of running GATK, and provides a large degree of reproducibility.
Now, we’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences the early data processing steps, as well as in the calling step. We do not yet have RNAseq-specific recommendations for variant filtering/recalibration, but will be developing those in the coming weeks.
We’ll go into the details of the RNAseq Best Practices in a follow-up post, but in a nutshell, these are the key differences: use STAR for alignment, add an exon splitting and cleanup step, and tell the variant caller to take the splits into account. The latter involves some new code added to the variant callers; it is available to both HaplotypeCaller and UnifiedGenotyper, but UG is currently missing a whole lot of indels, so we do recommend using only HC in the immediate future.
Keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing. We will be improving these recommendations progressively as we go, and we hope that the researcher community will help us by providing feedback of their experiences applying our recommendations to their data.