When you're setting up a variant discovery pipeline, you face two problems: deciding what tools to run (with what options), and how to run them efficiently so that it doesn't take forever. Between our documentation and our support forum, we can get you most if not all the way to solving the first problem, unless you're working with something really unusual.
However, the second problem is not something we've been able to help much with. We only benchmark computational requirements/performance for the purposes of our in-house pipelines, which are very specific to our particular infrastructure, and we don't have the resources to test different configurations. As a result it's been hard for us to give satisfying answers to questions like "How long should this take?" or "How much RAM do I need?" -- and we're aware this is a big point of pain.
So I'm really pleased to announce that a team of engineers at Intel have been developing a system to profile pipelines that implement our Best Practices workflows on a range of hardware configurations. This is a project we've been supporting by providing test data and implementation advice, and it's really gratifying to see it bear fruit: the team recently published their first round of profiling, done on the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF) on a trio of whole genomes.
The white paper is available from the GATK-specific page of Intel's Health-IT initiative website and contains some very useful insights regarding key bottlenecks in the pipeline. It also details the applicability of parallelizing options for each tool, as well as the effect of using different numbers of threads on performance, when run on a single 36-core machine. Spoiler alert: more isn't always better!
Read on for a couple of highlights of what I thought was especially interesting in the Intel team's white paper.
First, this figure showing how parallelization affects turnaround time for each tool in the pipeline does a great job of identifying where parallelization makes a the biggest difference (hellooooo BWA), as well as where the addition of more parallel threads shows quickly diminishing returns.
Note that here the term "thread" refers both to parallelized execution that is achieved through multithreading approaches (in GATK, these are invoked using
-nct depending on the tool) as well as through local scatter-gather, which consists of running the tools over slices of data generated using Queue, the companion program to GATK. In terms of computational resources the result is similar: a certain number of cores on the machine are dedicated to running the given component. Full details are given in the script that accompanies the white paper; in a nutshell, multithreading was used for BWA-mem and RealignerTargetCreator, whereas scatter-gather was used to parallelize IndelRealigner, BaseRecalibrator, PrintReads and HaplotypeCaller.
You can see that for the genome mapping step, done with BWA-mem, going from 1 to just 4 threads leads to a 4-fold decrease in runtime, which is yuuuge since it's the most time-consuming step in the pipeline. And beyond that you can still bring runtime down further by throwing more threads at the problem, until the relative gains bottom out somewhere between 8 and 36 -- at which point you're spending very little time on BWA mapping. Your mileage may vary, but for reference, in Broad's production pipeline we give BWA-mem 12 threads.
On the GATK end of the pipeline, HaplotypeCaller also shows an about 4-fold speedup when going from a single core to 4 scatter-gather jobs run on different cores, but beyond that the gains from additional parallelization tend to be progressively more modest. Multithreading with
-nct is not used at all because it has proved fairly unstable in HaplotypeCaller, leading to occasional unpredictable crashes.
The remaining steps see less dramatic improvement, comparatively speaking, though BaseRecalibrator and PrintReads with
-BQSR do show a decent 2-fold speedup when run on four cores instead of one. But this shows fairly clearly that there's little point to blindly throwing more parallelization at these tools.
In this second figure we're looking at CPU utilization throughout the pipeline, i.e. how much computing the machine is doing at any given time -- as opposed to doing boring things like reading and writing data to and from files (I/O), which is like driving from Omaha to Denver (a long flat drive where nothing exciting happens, but things get fun once you get there).
Note that this figure corresponds specifically to an "optimized run", i.e. a specific configuration of the pipeline where each tool was parallelized optimally based on earlier results.
You can see that BWA-mem is a busy beaver, spending pretty much all of its time furiously calculating mapping scores and writing out results to the output SAM file as it goes. In contrast, if we look at the tools that write out BAM files, we see a flurry of activity up front, then the line goes flat during a long period spent just writing out results to the output BAM file.
In the case of BaseRecalibrator, the tool only outputs a recalibration table, which doesn't take very long at all. Then looking at PrintReads (run with
-BQSR) you see a similar activity profile to BaseRecalibrator's, which corresponds to the on-the-fly recalibration done by the engine before the recalibrated data is written to the final pre-processed BAM file that will be fed to HaplotypeCaller.
Finally you see that HaplotypeCaller itself is the most compute-intensive tool in the GATK end of the pipeline; although this is not shown in the figure, I can tell you that much of its time is spent on graph assembly and pairHMM alignment of haplotypes. Note that here HaplotypeCaller is writing out a GVCF file; if you were to run HaplotypeCaller in regular mode (not using
ERC GVCF) you would see a shorter period of I/O flatline because the variants-only VCF output amounts to a much smaller file.
If the above made you want to know more, head on over to Intel's Health-IT website and get the full white paper.
As I mentioned this is only the first pass in an ongoing project. The next step is going to involve implementing the joint genotyping and filtering with VQSR for the WGS trio, as well as profiling the equivalent exome pipeline on a cohort of ~30 exomes.
You may have noticed that this first pass was done one a single (albeit multi-core) machine; this was done on purpose to provide a baseline for end-to-end execution with the simplest configuration. Our friends at Intel will be looking at multi-machine setup in a future iteration, and for our part we'll have some new developments for you on the pipelining front soon -- so stay tuned to this blog or follow @gatk_dev on Twitter!
We all know how HaplotypeCaller analyses can take a long time. IBM is now providing a native implementation of the PairHMM algorithm that leverages the new hardware available in their POWER8 systems. This optimization currently work on the following systems: Ubuntu14 and RHEL7 with POWER8.
To take advantage of this optimization, you need to do the following:
Here is an example for running on a P8 system with Ubuntu:
java -Xmx32g -Djava.library.path=/path/to/PairHMM_P8_Ubuntu -jar $GATK_PATH/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R $REFERENCE -I $INPUT_BAM --dbsnp $SNP_VCF \ -stand_emit_conf 10 -stand_call_conf 50 \ --pair_hmm_implementation VECTOR_LOGLESS_CACHING \ -o $OUTPUT_VCF
The latest version of the library accelerates HaplotypeCaller using the same floating precision as Java on POWER8. Also, exploiting Simultaneous Multithreading (SMT) along with the vector instructions, it can accelerate HaplotypeCaller more than the previous version, in particular, in the single-thread mode (no
-nct option is specified). For example, the PairHMM computation, which the library accelerates, consumes about a half of the HaplotypeCaller runtime in the single-thread mode, 1.88x speed-up will be expected. The library uses 37% of the available processors by default. The number of threads can be tuned by setting the environment variable
PHMM_N_THREADS. The source code is available here.
If you have any questions or issues (aside from downloading the file), please contact Yinhue Cheng at IBM (firstname.lastname@example.org) or Takeshi Ogasawara at IBM Japan (TAKESHI@jp.ibm.com).
Disclaimer: Please note that these libraries are not an official IBM product. You use it entirely at your own risk, and neither IBM nor the author assumes any liability whatsoever, nor do they assume responsibility for maintenance. Please report comments and corrections to email@example.com.
I'm running BaseRecalibrator on a full lane of HiSeq. I set the -nct switch to 8, but I see that the average load is <1. Why is this so? Doesn't the BaseRecalibrator use threads properly?
I've seen this thread http://gatkforums.broadinstitute.org/discussion/1263/baserecalibrator-trade-off-between-runtime-and-accuracy on how to speed up the process by downsampling. Is this a better option considering I have >200M reads?
BTW I'm using version 2.2-8 - do newer versions implement -num_threads?
I would really love to reduce the runtime from 36hrs.