Version highlights for GATK version 3.1
Posted in Announcements | Last updated on 2014-03-20 14:10:47



This may seem crazy considering we released the big 3.0 version not two weeks ago, but yes, we have a new version for you already! It's a bit of a special case because this release is all about the hardware-based optimizations we had previously announced. What we hadn't announced yet was that this is the fruit of a new collaboration with a team at Intel (which you can read more about here), so we were waiting for everyone to be ready for the big reveal.


Intel inside GATK

We've started collaborating with the Intel Bio Team to enable key parts of the GATK to run more efficiently on certain hardware configurations. For our first project together, we tackled the PairHMM algorithm, which is responsible for a large proportion of the runtime of HaplotypeCaller analyses. The resulting optimizations, which are the main feature in version 3.1, produce significant speedups for HaplotypeCaller runs on a wide range of hardware (see the table of expected speedups below).

We will continue working with Intel to further improve the performance of GATK tools that have historically been afflicted with performance issues and long runtimes (hello BQSR). As always, we hope these new features will make your life easier, and we welcome your feedback in the forum!

In practice

Note that these optimizations currently work on Linux systems only, and will not work on Mac or Windows operating systems. In the near future we will add support for Mac OS. We have no plans to add support for Windows since the GATK itself does not run on Windows.

Please note also that to take advantage of these optimizations, you need to opt in by adding the following flag to your GATK command: -pairHMM VECTOR_LOGLESS_CACHING.
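
For illustration, here is a rough sketch of what a HaplotypeCaller run with the flag could look like. The jar location and the reference, BAM and output file names are placeholders we made up for this example; only the -pairHMM flag comes from this release. The script just prints the command by default so you can check it before running anything.

```shell
# Placeholder file names (assumptions for this example); substitute your own.
GATK_JAR="GenomeAnalysisTK.jar"
REF="reference.fasta"
BAM="sample.bam"
OUT="sample.vcf"

# Build the HaplotypeCaller argument list, opting in to the optimized PairHMM.
set -- java -jar "$GATK_JAR" \
    -T HaplotypeCaller \
    -R "$REF" \
    -I "$BAM" \
    -o "$OUT" \
    -pairHMM VECTOR_LOGLESS_CACHING

# Dry run by default: print the command. Set RUN=1 to actually execute it.
GATK_CMD="$*"
echo "$GATK_CMD"
if [ "${RUN:-0}" = "1" ]; then
    "$@"
fi
```

Leaving the flag off simply runs HaplotypeCaller without the new optimizations, so it is easy to compare runtimes with and without it on your own data.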

Here is a handy little table of the speedups you can expect depending on the hardware and operating system you are using. The configurations given here are the minimum requirements for benefiting from the expected speedup ranges shown in the third column. Keep in mind that these numbers are based on tests in controlled conditions; in the wild, your mileage may vary.

Linux kernel version | Architecture / Processor | Expected speedup | Instruction set
Any 64-bit Linux | Any x86 64-bit | 1-1.5x | Non-vector
Linux 2.6 or newer | Penryn (Core 2 or newer) | 1.3-1.8x | SSE 4.1
Linux 2.6.30 or newer | Sandy Bridge (i3, i5, i7, Xeon E3, E5, E7 or newer) | 2-2.5x | AVX

To find out exactly which processor is in your machine, you can run this command in the terminal:

$ grep "model name" /proc/cpuinfo
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

In this example, the machine has 4 cores (8 threads), so the model name is printed 8 times, once per logical CPU. With the model name (here i7-2600) you can look up your hardware's relevant capabilities in the Wikipedia page on vector extensions.

Alternatively, Intel has provided us with some links to lists of processors categorized by architecture, in which you can look up your hardware:

Penryn processors

  • http://ark.intel.com/products/codename/26543/Penryn
  • http://ark.intel.com/products/codename/24736/Wolfdale
  • http://ark.intel.com/products/codename/26555/Harpertown
  • http://ark.intel.com/products/codename/25006/Dunnington

Sandy Bridge processors

  • http://ark.intel.com/products/codename/29900/Sandy-Bridge?wapkw=sandy+bridge+processors
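
If you would rather not look anything up, you can also ask the kernel directly which vector instruction sets your CPU reports. The sketch below is our own illustration, not part of the GATK: the helper name pairhmm_tier is made up, and it assumes the flag names "sse4_1" and "avx" that the Linux kernel prints on the "flags" line of /proc/cpuinfo on x86 machines. It only checks the CPU, not the kernel version listed in the table.

```shell
# Map a space-separated list of CPU flags to the best matching row of the table.
# pairhmm_tier is a hypothetical helper; the speedup ranges are copied from the table.
pairhmm_tier() {
    case " $1 " in
        *" avx "*)    echo "AVX (expected speedup 2-2.5x)" ;;
        *" sse4_1 "*) echo "SSE 4.1 (expected speedup 1.3-1.8x)" ;;
        *)            echo "Non-vector (expected speedup 1-1.5x)" ;;
    esac
}

# Read the flags of the first logical CPU, if /proc/cpuinfo is available.
if [ -r /proc/cpuinfo ]; then
    cpu_flags=$(grep -m 1 '^flags' /proc/cpuinfo | cut -d: -f2)
    pairhmm_tier "$cpu_flags"
fi
```

The function takes the flag list as an argument so you can also paste in the flags line from another machine, for example a cluster node you cannot log in to directly.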

Finally, a few notes to clarify some concepts regarding Linux kernels vs. distributions and processors vs. architectures:

  • Sandy Bridge and Penryn are microarchitectures, that is, CPU designs; each one determines which instruction sets (such as SSE 4.1 or AVX) are built into the CPU. Core 2, Core i3, i5, i7 and Xeon E3, E5, E7 are the processor families that implement a specific microarchitecture and can therefore make use of the relevant improvements (see table above).

  • The Linux kernel version is not tied to the Linux distribution (e.g. Ubuntu, RedHat etc.). Each distribution ships with a default kernel, but covering those is beyond the scope of this article (there are at least 300 Linux distributions out there). In any case, you can always install whatever kernel version you want.

  • The kernel version 2.6.30 was released in 2009, so we expect every sane person or IT department out there to be using something newer than this.
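
If you are not sure which kernel you are running, a quick check along these lines should tell you. This is a minimal sketch of our own: version_at_least is a made-up helper, and it assumes your sort command supports the -V (version sort) option, as the one in GNU coreutils does.

```shell
# Succeeds if version $1 is at least version $2.
# Relies on "sort -V" (version sort) to order the two version strings.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Strip any distribution suffix such as "-generic" from the running kernel version.
kernel=$(uname -r | cut -d- -f1)
if version_at_least "$kernel" "2.6.30"; then
    echo "Kernel $kernel meets the 2.6.30 minimum for the AVX row"
else
    echo "Kernel $kernel is older than 2.6.30"
fi
```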

