When you're setting up a variant discovery pipeline, you face two problems: deciding what tools to run (with what options), and how to run them efficiently so that it doesn't take forever. Between our documentation and our support forum, we can get you most if not all the way to solving the first problem, unless you're working with something really unusual.
However, the second problem is not something we've been able to help much with. We only benchmark computational requirements/performance for the purposes of our in-house pipelines, which are very specific to our particular infrastructure, and we don't have the resources to test different configurations. As a result it's been hard for us to give satisfying answers to questions like "How long should this take?" or "How much RAM do I need?" -- and we're aware this is a big point of pain.
So I'm really pleased to announce that a team of engineers at Intel have been developing a system to profile pipelines that implement our Best Practices workflows on a range of hardware configurations. This is a project we've been supporting by providing test data and implementation advice, and it's really gratifying to see it bear fruit: the team recently published their first round of profiling, done on the per-sample segment of the germline variation pipeline (from BWA to HaplotypeCaller; FASTQ to GVCF) on a trio of whole genomes.
The white paper is available from the GATK-specific page of Intel's Health-IT initiative website and contains some very useful insights regarding key bottlenecks in the pipeline. It also details the applicability of parallelizing options for each tool, as well as the effect of using different numbers of threads on performance, when run on a single 36-core machine. Spoiler alert: more isn't always better!
Read on for a couple of highlights of what I thought was especially interesting in the Intel team's white paper.
First, this figure showing how parallelization affects turnaround time for each tool in the pipeline does a great job of identifying where parallelization makes a the biggest difference (hellooooo BWA), as well as where the addition of more parallel threads shows quickly diminishing returns.
Note that here the term "thread" refers both to parallelized execution that is achieved through multithreading approaches (in GATK, these are invoked using
-nct depending on the tool) as well as through local scatter-gather, which consists of running the tools over slices of data generated using Queue, the companion program to GATK. In terms of computational resources the result is similar: a certain number of cores on the machine are dedicated to running the given component. Full details are given in the script that accompanies the white paper; in a nutshell, multithreading was used for BWA-mem and RealignerTargetCreator, whereas scatter-gather was used to parallelize IndelRealigner, BaseRecalibrator, PrintReads and HaplotypeCaller.
You can see that for the genome mapping step, done with BWA-mem, going from 1 to just 4 threads leads to a 4-fold decrease in runtime, which is yuuuge since it's the most time-consuming step in the pipeline. And beyond that you can still bring runtime down further by throwing more threads at the problem, until the relative gains bottom out somewhere between 8 and 36 -- at which point you're spending very little time on BWA mapping. Your mileage may vary, but for reference, in Broad's production pipeline we give BWA-mem 12 threads.
On the GATK end of the pipeline, HaplotypeCaller also shows an about 4-fold speedup when going from a single core to 4 scatter-gather jobs run on different cores, but beyond that the gains from additional parallelization tend to be progressively more modest. Multithreading with
-nct is not used at all because it has proved fairly unstable in HaplotypeCaller, leading to occasional unpredictable crashes.
The remaining steps see less dramatic improvement, comparatively speaking, though BaseRecalibrator and PrintReads with
-BQSR do show a decent 2-fold speedup when run on four cores instead of one. But this shows fairly clearly that there's little point to blindly throwing more parallelization at these tools.
In this second figure we're looking at CPU utilization throughout the pipeline, i.e. how much computing the machine is doing at any given time -- as opposed to doing boring things like reading and writing data to and from files (I/O), which is like driving from Omaha to Denver (a long flat drive where nothing exciting happens, but things get fun once you get there).
Note that this figure corresponds specifically to an "optimized run", i.e. a specific configuration of the pipeline where each tool was parallelized optimally based on earlier results.
You can see that BWA-mem is a busy beaver, spending pretty much all of its time furiously calculating mapping scores and writing out results to the output SAM file as it goes. In contrast, if we look at the tools that write out BAM files, we see a flurry of activity up front, then the line goes flat during a long period spent just writing out results to the output BAM file.
In the case of BaseRecalibrator, the tool only outputs a recalibration table, which doesn't take very long at all. Then looking at PrintReads (run with
-BQSR) you see a similar activity profile to BaseRecalibrator's, which corresponds to the on-the-fly recalibration done by the engine before the recalibrated data is written to the final pre-processed BAM file that will be fed to HaplotypeCaller.
Finally you see that HaplotypeCaller itself is the most compute-intensive tool in the GATK end of the pipeline; although this is not shown in the figure, I can tell you that much of its time is spent on graph assembly and pairHMM alignment of haplotypes. Note that here HaplotypeCaller is writing out a GVCF file; if you were to run HaplotypeCaller in regular mode (not using
ERC GVCF) you would see a shorter period of I/O flatline because the variants-only VCF output amounts to a much smaller file.
If the above made you want to know more, head on over to Intel's Health-IT website and get the full white paper.
As I mentioned this is only the first pass in an ongoing project. The next step is going to involve implementing the joint genotyping and filtering with VQSR for the WGS trio, as well as profiling the equivalent exome pipeline on a cohort of ~30 exomes.
You may have noticed that this first pass was done one a single (albeit multi-core) machine; this was done on purpose to provide a baseline for end-to-end execution with the simplest configuration. Our friends at Intel will be looking at multi-machine setup in a future iteration, and for our part we'll have some new developments for you on the pipelining front soon -- so stay tuned to this blog or follow @gatk_dev on Twitter!
MuTect2 took a few days to process one pair of whole exome samples while muTect 1.17 only needs a few hours. Is this expected? Any suggestion as how to speed things up besides paralleling by chromosomes? Thanks!
Trying to decide if this is a job submission/network problem, or a job script problem!
Running VariantRecalibrator in SNP mode on 95 joint-called genomes. The ProgressMeter runs nicely until it hits 100% and then appears to keep looping through at the same position without writing any outputs. Here are the last few of lines of the .out file ...
The job eventually gets cancelled due to wall time limits (4 hours in this case). So, the question is, is this looping at 100% completion "correct", therefore my problem is related to just giving it more computational time? Or is there some error happening and it shouldn't be cycling through like this to begin with?
I have noticed when using the GenotypeGVCFs walker (version 3.2), that the remaining runtime estimate is very poor. The estimate from CombineGVCFs and other walkers on the other hand is very accurate. This is not a critical bug. It is rather a feature enhancement. I just noticed that the "completed" percentage is also incorrect. It does not start at 0%. In fact it has stayed constant after walking over 2 million base pairs of 2000 samples in 2 hours. I am not using multi threading. Not that important to me, but I thought I would let you know.
I am using GATK 2.7 HC with Berkeley Lab Checkpoint/Restart (BLCR). After I checkpoint and restart my job on the cluster I get the stdout and stderr shown below. I think I got it, because the restart happened after midnight. Usually I just get incorrect estimates on total remaining runtime. Would it be possible to make GATK fully compatible with the use of BLCR? Thank you.
stdout: INFO 01:15:01,258 ProgressMeter - 1:28001706 0.00e+00 -4690921.0 s -9223372036.0 s 80.0% -5862401.0 s -1171480.0 s
java.lang.IllegalArgumentException: runtime must be >= 0 but got -4690951902384272 at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.exceedsRuntimeLimit(GenomeAnalysisEngine.java:1200) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$ActiveRegionIterator.hasNext(TraverseActiveRegions.java:329) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
I ran GATK's UnifiedGenotyper of version 2.7-4 (SNP calling) with multiple data sets that where growing in size (10 sets ranging from 8.5 GB to 88GB) with 36 cpu threads and 8 GB heap assigned. I noticed that the UnifiedGenotyper performed much better than linear with a runtime increase of 4.6 times for the largest data set, although this was 10 times larger than the smallest. Now I wonder how this is achieved by GATK? What part of the processing makes it scale so well to be even better than linear here? Is it the SNP calling that is so performant, or how the data is preprocessed/filtered?
I have run GATK's UnifiedGenotyper tool with different thread sizes (9, 18, 36 on a 40-core machine) in order to see how well it scales for an increasing thread size. I first tried out UnifiedGenotyper of version 1.7 and set -nt to 9, 18, and 36 with 8GB heap assigned for a 88GB large BAM file, respectively. When looking at the results, I noticed that runtime performance increased when using more threads, so I updated GATK to version 2.7-4 in order to check if this behavior has vanished with the newer version. I also noticed then that there exists another thread-parameter for the number of cpu threads.
So, I ran GATK's UnifiedGenotyper version 2.7-4 for a 88GB large file on a 40-core linux machine with -nt 1, -nct 9/18/36, and assigned 8GB heap. The behavior in runtime performance remained the same. What am I missing here?
I try to run the realigner target creator with the data divided on chromosomes. It seems to run smooth but the runtime for chromosome 1 seems to never end. I have checked my bam files and as expected the one with chr 1 is a little bit bigger than chr 2 but nothing proportional to the differences in predicted runtime. The version i use is GATK 2.3.0 and this is how i run it:
java -Xmx12g -jar programs/GenomeAnalysisTK-2.3-0/GenomeAnalysisTK.jar -l INFO -T VariantRecalibrator -R Homo_sapiens.GRCh37.57_dna_concat.fa -recalFile allchr_varrecal_BOTH_comb_ref.intervals -rscriptFile 15_allchr_varrecal_BOTH_comb_ref.intervals.plots.R -tranchesFile allchr_varrecal_BOTH_comb_ref.intervals.tranches -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=8.0 dbsnp_135.b37.vcf -resource:mills,VCF,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ --mode BOTH -nt 8 -input allchr_real_recal_resrt_raw_BOTH_comb_ref.vcf -L Agilent_SureSelect.V4.GRCh37.70_targets_nochr.bed.pad100.interval_list --pedigreeValidationType SILENT --pedigree my_fam.fam
predicted runtime looks like:
INFO 15:11:33,778 ProgressMeter - 1:33587200 3.36e+07 2.3 h 4.1 m 1.1% 8.8 d 8.7 d INFO 15:11:33,778 ProgressMeter - 11:73495886 1.82e+09 2.3 h 4.5 s 61.0% 3.7 h 87.4 m INFO 15:11:33,778 ProgressMeter - 10:6623707 1.68e+09 2.3 h 4.9 s 54.5% 4.2 h 114.3 m INFO 15:11:33,778 ProgressMeter - 11:572298 1.82e+09 2.3 h 4.5 s 58.7% 3.9 h 96.4 m INFO 15:11:33,779 ProgressMeter - 11:122078003 1.82e+09 2.3 h 4.5 s 62.6% 3.6 h 81.7 m INFO 15:11:53,780 ProgressMeter - 11:10589211 1.82e+09 2.3 h 4.5 s 59.0% 3.9 h 95.3 m INFO 15:11:53,780 ProgressMeter - 11:105031869 1.82e+09 2.3 h 4.5 s 62.1% 3.7 h 83.9 m INFO 15:11:53,781 ProgressMeter - 11:46994477 1.82e+09 2.3 h 4.5 s 60.2% 3.8 h 90.8 m INFO 15:12:03,779 ProgressMeter - 11:8677862 1.82e+09 2.3 h 4.5 s 58.9% 3.9 h 95.7 m INFO 15:12:03,779 ProgressMeter - 11:130690257 1.82e+09 2.3 h 4.5 s 62.9% 3.6 h 81.1 m INFO 15:12:13,779 ProgressMeter - 11:84816026 1.82e+09 2.3 h 4.5 s 61.4% 3.7 h 86.4 m INFO 15:12:13,779 ProgressMeter - 10:17039143 1.68e+09 2.3 h 4.9 s 54.8% 4.2 h 113.3 m INFO 15:12:23,780 ProgressMeter - 11:18556991 1.82e+09 2.3 h 4.5 s 59.3% 3.9 h 94.6 m INFO 15:12:23,780 ProgressMeter - 11:113365440 1.82e+09 2.3 h 4.5 s 62.3% 3.7 h 83.2 m INFO 15:12:23,782 ProgressMeter - 11:55284794 1.82e+09 2.3 h 4.5 s 60.4% 3.8 h 90.1 m INFO 15:12:33,780 ProgressMeter - 1:33783808 3.38e+07 2.3 h 4.1 m 1.1% 8.8 d 8.7 d INFO 15:12:33,780 ProgressMeter - 12:3311608 1.95e+09 2.3 h 4.2 s 63.1% 3.6 h 80.5 m INFO 15:12:43,779 ProgressMeter - 11:93292307 1.82e+09 2.3 h 4.6 s 61.7% 3.7 h 85.8 m INFO 15:12:43,780 ProgressMeter - 10:24841443 1.68e+09 2.3 h 4.9 s 55.1% 4.2 h 112.5 m INFO 15:12:43,780 ProgressMeter - 11:19408689 1.82e+09 2.3 h 4.6 s 59.3% 3.9 h 94.8 m INFO 15:12:53,965 ProgressMeter - 11:26512308 1.82e+09 2.3 h 4.6 s 59.5% 3.9 h 94.0 m INFO 15:12:53,965 ProgressMeter - 11:121738354 1.82e+09 2.3 h 4.6 s 62.6% 3.7 h 82.6 m INFO 15:12:53,965 ProgressMeter - 11:63468191 1.82e+09 2.3 h 4.6 s 60.7% 3.8 h 89.4 m INFO 15:13:03,781 ProgressMeter - 12:11960820 1.95e+09 2.3 h 4.3 s 63.4% 3.6 h 79.8 m INFO 15:13:13,780 ProgressMeter - 11:101551879 1.82e+09 2.3 h 4.6 s 61.9% 3.7 h 85.1 m INFO 15:13:13,780 ProgressMeter - 10:32238974 1.68e+09 2.3 h 4.9 s 55.3% 4.2 h 111.8 m INFO 15:13:13,780 ProgressMeter - 11:27204534 1.82e+09 2.3 h 4.6 s 59.5% 3.9 h 94.1 m INFO 15:13:23,966 ProgressMeter - 11:71529440 1.82e+09 2.3 h 4.6 s 61.0% 3.8 h 88.8 m INFO 15:13:23,968 ProgressMeter - 11:34170083 1.82e+09 2.3 h 4.6 s 59.8% 3.9 h 93.4 m INFO 15:13:23,977 ProgressMeter - 11:129889715 1.82e+09 2.3 h 4.6 s 62.9% 3.7 h 81.9 m INFO 15:13:33,781 ProgressMeter - 1:33980416 3.40e+07 2.3 h 4.1 m 1.1% 8.8 d 8.7 d
INFO 17:40:38,765 ProgressMeter - chrX:147886444 3.21e+07 5.5 h 10.3 m 97.9% 5.6 h 7.1 m INFO 17:41:08,775 ProgressMeter - chrX:151653849 3.21e+07 5.5 h 10.3 m 98.0% 5.6 h 6.7 m INFO 17:41:38,785 ProgressMeter - chrX:153812787 3.22e+07 5.5 h 10.3 m 98.1% 5.6 h 6.5 m
Caused by: java.lang.ClassNotFoundException: net.sf.samtools.util.CloserUtil
at java.security.AccessController.doPrivileged(Native Method)
... 29 more
Caused by: java.util.zip.ZipException: error reading zip file
at java.util.zip.ZipFile.read(Native Method)
Any Idea what is the solution??
I've noticed a change in the log files that now shows a period of "starting" before actually genotyping. This can be hours...or minutes... I looked in previous logs and saw that there aren't such line in the log, but the clock jumps a similar amount of time. I was wondering if this is a known phenomenon and what it the GATK doing during that time. Is there anything we can do to reduce that waiting period?
Here are examples:
old (calling with 4500 samples, glm BOTH):
INFO 12:21:37,515 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 1157.62 INFO 12:26:52,780 RMDTrackBuilder - Loading Tribble index from disk for file /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.dbsnp.vcf WARN 12:26:54,098 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUNDED but standard is A INFO 12:26:54,457 GenomeAnalysisEngine - Processing 43138 bp from intervals INFO 12:26:54,473 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 12:26:54,473 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 21:37:08,951 ProgressMeter - 17:26944207 0.00e+00 9.2 h 54587.4 w 0.0% Infinity w Infinity w INFO 21:40:40,002 ProgressMeter - 17:26945957 1.82e+02 9.2 h 301.8 w 0.9% 6.2 w 6.1 w INFO 21:41:53,697 ProgressMeter - 17:26946820 6.49e+02 9.2 h 84.8 w 1.5% 3.6 w 3.5 w
and here newer (calling 1500 sample, glm SNP):
INFO 18:42:42,025 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 24.55 INFO 18:42:47,648 RMDTrackBuilder - Loading Tribble index from disk for file /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.dbsnp.vcf WARN 18:42:47,972 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUNDED but standard is A INFO 18:42:48,192 GenomeAnalysisEngine - Processing 43138 bp from intervals INFO 18:42:48,206 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 18:42:48,206 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 18:43:18,301 ProgressMeter - starting 0.00e+00 30.1 s 49.8 w 100.0% 30.1 s 0.0 s INFO 18:43:48,310 ProgressMeter - starting 0.00e+00 60.1 s 99.4 w 100.0% 60.1 s 0.0 s INFO 18:44:18,344 ProgressMeter - starting 0.00e+00 90.1 s 149.0 w 100.0% 90.1 s 0.0 s INFO 18:44:48,353 ProgressMeter - starting 0.00e+00 2.0 m 198.7 w 100.0% 2.0 m 0.0 s INFO 18:45:18,363 ProgressMeter - starting 0.00e+00 2.5 m 248.3 w 100.0% 2.5 m 0.0 s INFO 18:45:48,373 ProgressMeter - starting 0.00e+00 3.0 m 297.9 w 100.0% 3.0 m 0.0 s INFO 18:46:18,382 ProgressMeter - starting 0.00e+00 3.5 m 347.5 w 100.0% 3.5 m 0.0 s INFO 18:46:48,391 ProgressMeter - starting 0.00e+00 4.0 m 397.1 w 100.0% 4.0 m 0.0 s INFO 18:47:18,401 ProgressMeter - starting 0.00e+00 4.5 m 446.7 w 100.0% 4.5 m 0.0 s INFO 18:47:48,411 ProgressMeter - starting 0.00e+00 5.0 m 496.4 w 100.0% 5.0 m 0.0 s INFO 18:48:18,426 ProgressMeter - starting 0.00e+00 5.5 m 546.0 w 100.0% 5.5 m 0.0 s INFO 18:48:48,436 ProgressMeter - starting 0.00e+00 6.0 m 595.6 w 100.0% 6.0 m 0.0 s INFO 18:49:18,446 ProgressMeter - starting 0.00e+00 6.5 m 645.2 w 100.0% 6.5 m 0.0 s INFO 18:49:48,455 ProgressMeter - starting 0.00e+00 7.0 m 694.9 w 100.0% 7.0 m 0.0 s INFO 18:50:18,465 ProgressMeter - starting 0.00e+00 7.5 m 744.5 w 100.0% 7.5 m 0.0 s INFO 18:50:48,475 ProgressMeter - starting 0.00e+00 8.0 m 794.1 w 100.0% 8.0 m 0.0 s INFO 18:51:18,484 ProgressMeter - starting 0.00e+00 8.5 m 843.7 w 100.0% 8.5 m 0.0 s INFO 18:51:48,510 ProgressMeter - starting 0.00e+00 9.0 m 893.4 w 100.0% 9.0 m 0.0 s INFO 18:52:18,520 ProgressMeter - starting 0.00e+00 9.5 m 943.0 w 100.0% 9.5 m 0.0 s INFO 18:52:48,530 ProgressMeter - starting 0.00e+00 10.0 m 992.6 w 100.0% 10.0 m 0.0 s INFO 18:53:18,541 ProgressMeter - starting 0.00e+00 10.5 m 1042.2 w 100.0% 10.5 m 0.0 s INFO 18:53:48,552 ProgressMeter - starting 0.00e+00 11.0 m 1091.8 w 100.0% 11.0 m 0.0 s INFO 18:54:18,561 ProgressMeter - starting 0.00e+00 11.5 m 1141.5 w 100.0% 11.5 m 0.0 s INFO 18:54:48,570 ProgressMeter - starting 0.00e+00 12.0 m 1191.1 w 100.0% 12.0 m 0.0 s INFO 18:55:18,579 ProgressMeter - starting 0.00e+00 12.5 m 1240.7 w 100.0% 12.5 m 0.0 s INFO 18:55:48,590 ProgressMeter - 18:12725582 1.00e+01 13.0 m 129.0 w 0.5% 45.4 h 45.2 h
I've also noticed that output seems to be quite wrong about how much work it has left to do...perhaps these two things are related?
I would be grateful for any information.