Tagged with #multithreading 3 documentation articles | 1 announcement | 14 forum discussions

Created 2013-06-26 19:01:08 | Updated 2013-06-26 19:05:10 | Tags: developer parallelism multithreading nct nt

This document provides an overview of the steps required to make a walker multi-threadable using the -nct and -nt arguments, which make use of the NanoSchedulable and TreeReducible interfaces, respectively.

NanoSchedulable / -nct

Providing -nct support requires that you certify that your walker's map() method is thread-safe -- e.g., if any data structures are shared across map() calls, access to them must be properly synchronized. Once your map() method is thread-safe, you can implement the NanoSchedulable interface, an empty marker interface with no methods that declares your walker's map() method safe to parallelize:

/**
 * Root parallelism interface.  Walkers that implement this
 * declare that their map function is thread-safe and so multiple
 * map calls can be run in parallel in the same JVM instance.
 */
public interface NanoSchedulable {
}
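To illustrate what "thread-safe map()" means in practice, here is a minimal sketch (not actual GATK code; CountingWalker and its simplified map() signature are hypothetical) of a walker whose only shared state is an AtomicLong, so concurrent map() calls cannot corrupt it:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

// Empty marker interface, as described above.
interface NanoSchedulable { }

// Hypothetical walker: thread-safe because its only shared state is atomic.
class CountingWalker implements NanoSchedulable {
    // AtomicLong instead of a plain long: concurrent map() calls never lose updates.
    private final AtomicLong sitesSeen = new AtomicLong();

    public long map(Object site) {          // signature simplified for illustration
        return sitesSeen.incrementAndGet();
    }

    public long total() {
        return sitesSeen.get();
    }
}

public class Demo {
    public static void main(String[] args) {
        CountingWalker walker = new CountingWalker();
        // Simulate many parallel map() calls, as -nct would trigger.
        IntStream.range(0, 10_000).parallel().forEach(i -> walker.map(null));
        System.out.println(walker.total());  // prints 10000
    }
}
```

Had sitesSeen been a plain long, the parallel increments could interleave and the final count would be unpredictable; that is exactly the kind of bug you certify against before marking a walker NanoSchedulable.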

TreeReducible / -nt

Providing -nt support requires that both map() and reduce() be thread-safe, and you also need to implement the TreeReducible interface. Implementing TreeReducible requires you to write a treeReduce() method that tells the engine how to combine the results of multiple reduce() calls:

public interface TreeReducible<ReduceType> {
    /**
     * A composite, 'reduce of reduces' function.
     *
     * @param lhs 'left-most' portion of data in the composite reduce.
     * @param rhs 'right-most' portion of data in the composite reduce.
     * @return The composite reduce type.
     */
    ReduceType treeReduce(ReduceType lhs, ReduceType rhs);
}

This method differs from reduce() in that while reduce() adds the result of a single map() call onto a running total, treeReduce() takes the aggregated results from multiple map/reduce tasks that have been run in parallel and combines them. So, lhs and rhs might each represent the final result from several hundred map/reduce calls.

Example treeReduce() implementation from the UnifiedGenotyper:

public UGStatistics treeReduce(UGStatistics lhs, UGStatistics rhs) {
    lhs.nBasesCallable += rhs.nBasesCallable;
    lhs.nBasesCalledConfidently += rhs.nBasesCalledConfidently;
    lhs.nBasesVisited += rhs.nBasesVisited;
    return lhs;
}
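To see how the engine uses this method, here is a conceptual sketch of how per-thread reduce() results might be folded together with treeReduce(). MyWalker and Stats are illustrative stand-ins, not actual GATK classes; only the TreeReducible shape comes from the interface above:

```java
import java.util.Arrays;
import java.util.List;

// The interface shown above, reproduced so this sketch is self-contained.
interface TreeReducible<ReduceType> {
    ReduceType treeReduce(ReduceType lhs, ReduceType rhs);
}

// Hypothetical per-thread statistics, in the spirit of UGStatistics.
class Stats {
    long nBasesCallable;
    Stats(long n) { nBasesCallable = n; }
}

// Hypothetical walker: merging two partial totals is just addition.
class MyWalker implements TreeReducible<Stats> {
    public Stats treeReduce(Stats lhs, Stats rhs) {
        lhs.nBasesCallable += rhs.nBasesCallable;
        return lhs;
    }
}

public class TreeReduceDemo {
    public static void main(String[] args) {
        MyWalker walker = new MyWalker();
        // Each Stats stands for the aggregated result of one parallel data thread,
        // i.e. the outcome of many map/reduce calls.
        List<Stats> partials = Arrays.asList(new Stats(100), new Stats(250), new Stats(50));
        Stats total = new Stats(0);
        for (Stats p : partials)
            total = walker.treeReduce(total, p);
        System.out.println(total.nBasesCallable);  // prints 400
    }
}
```

The key property is that treeReduce() must be associative: the engine may combine the partial results in any grouping, so (a + b) + c must equal a + (b + c).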

Created 2012-12-18 21:35:34 | Updated 2013-01-26 05:10:36 | Tags: intro queue parallelism performance scatter-gather multithreading nct nt

This document explains the concepts involved and how they are applied within the GATK (and Queue where applicable). For specific configuration recommendations, see the companion document on parallelizing GATK tools.

1. Introducing the concept of parallelism

Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (i.e. waiting for each operation to finish before starting the next one).

Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.

This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.

The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see doc for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.

Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.

Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.

Parallel computing in practice (sort of)

OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?

Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.

Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:

• open the file, count the number of lines in the file, tell us the number, close the file

Note that tell us the number can mean writing it to the console, or storing it somewhere for use later on.

Now let's say we want to know the number of words on each line. The set of instructions would be:

• open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the number

And so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.

So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:

• open the file, index the lines

• read the first line, count the number of words, tell us the number
• read the second line, count the number of words, tell us the number
• read the third line, count the number of words, tell us the number
• [repeat for all lines]

• collect final results and close the file

Here, the read the Nth line steps can be performed in parallel, because they are all independent operations.

You'll notice that we added a step, index the lines. That's a little bit of preliminary work that allows us to perform the read the Nth line steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.

Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.
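The word-counting example above can be sketched in a few lines of Java. The per-line counts are independent, so they can run in parallel; collecting them into an array is the final "gather" step (the file and its lines are stand-ins here, so the preliminary open/index step is skipped):

```java
import java.util.Arrays;
import java.util.List;

public class WordCount {
    public static void main(String[] args) {
        // Stands in for the indexed file: each element is one line.
        List<String> lines = Arrays.asList(
            "the quick brown fox",
            "jumps over",
            "the lazy dog");

        // The "middle" steps: count words per line. Each count is independent,
        // so a parallel stream may run them on different cores at once.
        int[] counts = lines.parallelStream()
                            .mapToInt(line -> line.split("\\s+").length)
                            .toArray();

        // The final step: collect the results (order is preserved).
        System.out.println(Arrays.toString(counts));  // prints [4, 2, 3]
    }
}
```

Note that the output order matches the input order even though the counting happened in parallel; that bookkeeping is part of the "overhead" cost of parallelism mentioned earlier.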

2. Parallelizing the GATK

There are three different modes of parallelism offered by the GATK, and to really understand the difference you first need to understand the different levels of computing that are involved.

A quick word about levels of computing

By levels of computing, we mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster.

• Core: the level below the machine. On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we follow the Latin terminology) but the CPUs on some professional-grade machines can have tens or hundreds of cores.

• Machine: the middle of the scale. For most of us, the machine is the laptop or desktop computer. Really we should refer to the CPU specifically, since that's the relevant part that does the processing, but the most common usage is to say machine. Except if the machine is part of a cluster, in which case it's called a node.

• Cluster: the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called nodes) networked together. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a server farm or a load-sharing facility.

Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of multi-threading at the core and machine levels, and scatter-gather at the cluster level.

In computing, a thread of execution is a set of instructions that the program issues to the processor to get work done. In single-threading mode, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In multi-threading mode, the program may send several threads to the processor at the same time.

Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results.

If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. He can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that.

Hey, if you have a better example, let us know in the forum and we'll use that instead.

Alright, now that you understand the idea of multithreading, let's get practical: how do we get the GATK to use multi-threading?

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively. They can be combined, since they act at different levels of computing:

• -nt / --num_threads controls the number of data threads sent to the processor (acting at the machine level)

• -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread (acting at the core level).

Not all GATK tools can use these options due to the nature of the analyses that they perform and how they traverse the data. Even in the case of tools that are used sequentially to perform a multi-step process, the individual tools may not support the same options. For example, at time of writing (Dec. 2012), of the tools involved in local realignment around indels, RealignerTargetCreator supports -nt but not -nct, while IndelRealigner does not support either of these options.

In addition, there are some important technical details that affect how these options can be used with optimal results. Those are explained along with specific recommendations for the main GATK tools in a companion document on parallelizing the GATK.

Scatter-gather

If you Google it, you'll find that the term scatter-gather can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.

So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?

As you know by now, multithreading specifically refers to what happens internally when the program (in our case, the GATK) sends several sets of instructions to the processor to achieve the instructions that you originally gave it in a single command-line. In contrast, the scatter-gather strategy as used by the GATK involves a separate program, called Queue, which generates separate GATK jobs (each with its own command-line) to achieve the instructions given in a so-called Qscript (i.e. a script written for Queue in a programming language called Scala).

At the simplest level, the Qscript can involve a single GATK tool*. In that case Queue will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, Queue will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).

Note that Queue has additional capabilities, such as managing the use of multiple GATK tools in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. To learn more about pipelining the GATK with Queue, please see the Queue documentation.

Compare and combine

So you see, scatter-gather is a very different process from multi-threading because the parallelization happens outside of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But Queue can dispatch scattered GATK jobs to different machines in a computing cluster by interfacing with your cluster's job management software.

That being said, multithreading has the great advantage that all the cores within a machine have access to shared memory with very high bandwidth. In contrast, the multiple machines on a network used for scatter-gather are fundamentally limited by network transfer costs.

The good news is that you can combine scatter-gather and multithreading: use Queue to scatter GATK jobs to different nodes on your cluster, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.

Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.

Created 2012-12-14 21:59:43 | Updated 2016-03-31 18:41:31 | Tags: queue parallelism performance scatter-gather multithreading nct nt

This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.

Overview

As explained in the primer on parallelism for the GATK, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using Queue).

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:

• -nt / --num_threads controls the number of data threads sent to the processor
• -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread

Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory and you use -nt 4, the multithreaded run will use 8 Gb of memory. In contrast, CPU threads will share the memory allocated to their “mother” data thread, so you don’t need to worry about allocating memory based on the number of CPU threads you use.
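The memory rule above reduces to simple arithmetic, sketched here for clarity (the method name totalMemoryGb is illustrative, not a GATK API):

```java
public class MemoryEstimate {
    // Total memory = per-run memory × number of data threads (-nt).
    // nct is accepted but unused: CPU threads share their data thread's memory.
    static int totalMemoryGb(int perRunGb, int nt, int nct) {
        return perRunGb * nt;
    }

    public static void main(String[] args) {
        // The example from the text: a 2 Gb tool run with -nt 4 -nct 8.
        System.out.println(totalMemoryGb(2, 4, 8));  // prints 8
    }
}
```

So raising -nct is essentially free in terms of memory, while raising -nt multiplies the footprint.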

Additional consideration when using -nct with versions 2.2 and 2.3

Because of the way the -nct option was originally implemented, in versions 2.2 and 2.3, there is one CPU thread that is reserved by the system to “manage” the rest. So if you use -nct, you’ll only really start seeing a speedup with -nct 3 (which yields two effective "working" threads) and above. This limitation has been resolved in the implementation that will be available in versions 2.4 and up.

Scatter-gather

For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

Applicability of parallelism to the major GATK tools

Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.

Tool  Full name               Type of traversal  NT  NCT  SG
RTC   RealignerTargetCreator  RodWalker          +   -    -
IR    IndelRealigner          ReadWalker         -   -    +
BR    BaseRecalibrator        LocusWalker        -   +    +
UG    UnifiedGenotyper        LocusWalker        +   +    +

Recommended configurations

The table below summarizes configurations that we typically use for our own projects (one per tool, except we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and that we cannot give you any guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you.

Tool                RTC  IR  BR      PR   RR  UG
Available modes     NT   SG  NCT,SG  NCT  SG  NT,NCT,SG
Cluster nodes       1    4   4       1    4   4 / 4 / 4
CPU threads (-nct)  1    1   8       4-8  1   3 / 6 / 24
Data threads (-nt)  24   1   1       1    1   8 / 4 / 1
Memory (Gb)         48   4   4       4    4   32 / 16 / 4

Where NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue. For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

Created 2014-04-11 05:00:50 | Updated | Tags: selectvariants bug multithreading ad bug-fixed nt

This is not exactly new (it was fixed in GATK 3.0) but it's come to our attention that many people are unaware of this bug, so we want to spread the word since it might have some important impacts on people's results.

Affected versions: 2.x versions up to 2.8 (not sure when it started)

Affected tool: SelectVariants

Trigger conditions: Extracting a subset of samples with SelectVariants while using multi-threading (-nt)

Effects: Genotype-level fields (such as AD) swapped among samples

This bug no longer affects any tools in versions 3.0 and above, but callsets generated with earlier versions may need to be checked for consistency of genotype-level annotations. Our sincere apologies if you have been affected by this bug, and our thanks to the users who reported experiencing this issue.

Created 2016-04-21 13:06:39 | Updated 2016-04-21 13:17:48 | Tags: multithreading genotypegvcfs

Dear GATK Devs,

While working on a new version of our human exome pipeline I ran into the problem that the output VCF files were not the same, even though the commands and samples run through it were exactly the same. Some of the quality scores like MQRankSum or QD are slightly different, but this influences the end result quite a bit. Initially I thought this might be our older version of GATK (3.3), but I noticed the same effect when using 3.5.

I couldn't find anything about this on the forum or in the documentation. Is this a known bug by chance?

Below is the result of a diff command on the two resulting VCF files. Paths and sample IDs are censored, but are identical for both.

14c14
---
145,148c145,148
chr1  888639  rs3748596       T       C       5306.35 .       AC=8;AF=1.00;AN=8;DB;DP=150;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=50.20;MQ0=0;QD=29.09;SOR=1.061      GT:AD:DP:GQ:PL  1/1:0,41:41:99:1498,123,0       1/1:0,49:49:99:1714,147,0       1/1:0,26:26:78:930,78,0 1/1:0,34:34:99:1190,102,0
chr1  888659  rs3748597       T       C       4601.35 .       AC=8;AF=1.00;AN=8;DB;DP=132;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=52.14;MQ0=0;QD=32.93;SOR=1.420      GT:AD:DP:GQ:PL  1/1:0,32:32:96:1173,96,0        1/1:0,43:43:99:1522,129,0       1/1:0,28:28:84:896,84,0 1/1:0,28:28:84:1036,84,0
chr1  889158  rs56262069      G       C       4026.35 .       AC=8;AF=1.00;AN=8;DB;DP=89;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=57.77;MQ0=0;QD=30.83;SOR=1.408       GT:AD:DP:GQ:PGT:PID:PL  1/1:0,22:22:69:1|1:889158_G_C:1024,69,0 1/1:0,26:26:78:1|1:889158_G_C:1170,78,0 1/1:0,18:18:57:1|1:889158_G_C:823,57,0  1/1:0,23:23:69:1|1:889158_G_C:1035,69,0
chr1  889159  rs13302945      A       C       4026.35 .       AC=8;AF=1.00;AN=8;DB;DP=91;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=57.77;MQ0=0;QD=29.67;SOR=1.389       GT:AD:DP:GQ:PGT:PID:PL  1/1:0,23:23:69:1|1:889158_G_C:1024,69,0 1/1:0,26:26:78:1|1:889158_G_C:1170,78,0 1/1:0,19:19:57:1|1:889158_G_C:823,57,0  1/1:0,23:23:69:1|1:889158_G_C:1035,69,0
---
chr1  888639  rs3748596       T       C       5306.35 .       AC=8;AF=1.00;AN=8;DB;DP=150;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=50.20;MQ0=0;QD=30.83;SOR=1.061      GT:AD:DP:GQ:PL  1/1:0,41:41:99:1498,123,0       1/1:0,49:49:99:1714,147,0       1/1:0,26:26:78:930,78,0 1/1:0,34:34:99:1190,102,0
chr1  888659  rs3748597       T       C       4601.35 .       AC=8;AF=1.00;AN=8;DB;DP=132;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=52.14;MQ0=0;QD=29.67;SOR=1.420      GT:AD:DP:GQ:PL  1/1:0,32:32:96:1173,96,0        1/1:0,43:43:99:1522,129,0       1/1:0,28:28:84:896,84,0 1/1:0,28:28:84:1036,84,0
chr1  889158  rs56262069      G       C       4026.35 .       AC=8;AF=1.00;AN=8;DB;DP=89;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=57.77;MQ0=0;QD=29.09;SOR=1.408       GT:AD:DP:GQ:PGT:PID:PL  1/1:0,22:22:69:1|1:889158_G_C:1024,69,0 1/1:0,26:26:78:1|1:889158_G_C:1170,78,0 1/1:0,18:18:57:1|1:889158_G_C:823,57,0  1/1:0,23:23:69:1|1:889158_G_C:1035,69,0
chr1  889159  rs13302945      A       C       4026.35 .       AC=8;AF=1.00;AN=8;DB;DP=91;ExcessHet=3.0103;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=57.77;MQ0=0;QD=32.93;SOR=1.389       GT:AD:DP:GQ:PGT:PID:PL  1/1:0,23:23:69:1|1:889158_G_C:1024,69,0 1/1:0,26:26:78:1|1:889158_G_C:1170,78,0 1/1:0,19:19:57:1|1:889158_G_C:823,57,0  1/1:0,23:23:69:1|1:889158_G_C:1035,69,0

Thanks in advance!

Created 2016-03-14 19:39:51 | Updated | Tags: multithreading

Hi,

Maybe this is a silly question, but I still don't understand what a 'data thread' means (the -nt parameter). I understand that -nct is the number of threads that can run in parallel for a given data thread. But in which context can we have several data threads?

I made some tests, and I naively supposed that each CPU thread would be a child of some data thread, but when I execute UnifiedGenotyper, all the threads are at the same level (I checked with htop).

Diego

Created 2016-02-26 00:17:39 | Updated | Tags: multithreading genotypegvcf

Hi,

Does threading (controlled by the -nt argument) work in GATK's GenotypeGVCFs?

Thanks a lot

Created 2015-03-17 14:33:19 | Updated | Tags: multithreading joint-calling scaffolds

Hi Team,

this is a followup of /discussion/5304/haplotypecaller-treatment-of-scaffolds

I am using Joint Genotyping on a row of ~80 individuals (scaffoldwise). I am using -nt 16

I have following problems:

A| I'm getting a lot of these warnings (I didn't have so many when I did Haplotype Caller) - Do I need to worry?

ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at scaffold_3:4434354 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

B| The job ends with following

INFO 04:11:46,918 ProgressMeter - scaffold_3:6037801 5037839.0 2.3 h 27.1 m 10.4% 21.8 h 19.5 h
INFO 04:11:54,984 ProgressMeter - done 6037839.0 2.3 h 22.6 m 10.4% 21.8 h 19.5 h
INFO 04:11:54,984 ProgressMeter - Total runtime 8198.11 secs, 136.64 min, 2.28 hours

So: time elapsed 2.3 h and done, but 19.5 h to go. Is there a problem with multithreading?

Thanks! Alexander

Created 2014-11-11 23:43:50 | Updated | Tags: multithreading

Greetings GATK,

Sorry if this has been asked or covered elsewhere. Is there a version of GATK that supports the Intel Phi or MIC or is there a future release that will? Thank you

Created 2014-05-19 18:51:10 | Updated 2014-05-19 18:51:36 | Tags: multithreading

I've been using the GATK - in particular, the DiagnoseTargets and VariantsToTable tools - and have been running into trouble attempting to parallelise these tasks.

I've tried both the -n and -nct flags and it turns out that neither are supported by the above tools. Unfortunately there doesn't appear to be anything on the documentation that indicates this, so I only ever find out the hard way when trying to run them. As such, I have a couple of questions:

1. Does the documentation list which of the engine-wide parameters are unsupported by certain tools? If not, could it?

2. Even if the tools aren't automagically parallelisable, I still want to run them -- it's a little frustrating to kick off a long-running process over the weekend and get back on Monday to find it failed a few hours in! Is there an option to fall back to single-threaded execution if one of the multithreading flags isn't supported? If not, could there be?

Thanks!

Created 2014-03-28 13:58:31 | Updated | Tags: unifiedgenotyper multithreading multi-sample gatk capacity

Hi,

How many samples, and of what file/genome size, can the UnifiedGenotyper process simultaneously? I will have about 250 samples, genome size 180Mb, BAM sizes 3-5GB, reduced BAM sizes ~1GB.

Also, how long would this take, and can it be multi-threaded? Cheers,

Blue

Created 2013-10-29 14:38:24 | Updated | Tags: unifiedgenotyper performance multithreading runtime

Hello there,

I have run GATK's UnifiedGenotyper tool with different thread counts (9, 18, 36 on a 40-core machine) in order to see how well it scales with an increasing number of threads. I first tried out UnifiedGenotyper version 1.7 and set -nt to 9, 18, and 36 with 8GB heap assigned, for an 88GB BAM file. When looking at the results, I noticed that the runtime actually increased when using more threads, so I updated GATK to version 2.7-4 in order to check whether this behavior had vanished with the newer version. I also noticed then that there exists another thread parameter, for the number of CPU threads.

So, I ran GATK's UnifiedGenotyper version 2.7-4 on an 88GB file on a 40-core Linux machine with -nt 1, -nct 9/18/36, and 8GB heap assigned. The runtime behavior remained the same. What am I missing here?

Best,

Cindy

Created 2013-09-26 00:12:14 | Updated 2013-09-26 00:22:41 | Tags: multithreading

Hi,

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:

-nt / --num_threads controls the number of data threads sent to the processor

Setup:
RHEL5, 144 GB memory, 12 cores (Intel 2.8 GHz)

~/src/jre1.7.0_40/bin/java -Xmx64g -Xms32g -d64 -jar /apps/gau/GATK_versions/GATKLite-2.1/GenomeAnalysisTKLite.jar -nt 8 -nct 6 -L chr8:90000001-120000000 -rbs 10000000 -T UnifiedGenotyper -rf BadCigar -R /dev/shm/CEUref.hg19.fasta -glm BOTH -D /dev/shm/dbsnp_135.hg19.reordered.vcf -metrics test.metrics.txt -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 1000 --max_alternate_alleles 10 -A AlleleBalance -A AlleleBalanceBySample -A BaseCounts -A BaseQualityRankSumTest -A DepthOfCoverage -A DepthPerAlleleBySample -A FisherStrand -A HaplotypeScore -A HardyWeinberg -A IndelType -A LowMQ -A MappingQualityRankSumTest -A MappingQualityZero -A MappingQualityZeroBySample -A MappingQualityZeroFraction -A QualByDepth -A ReadPosRankSumTest -A RMSMappingQuality -A SampleList -o chunk55.vcf -I ./AC2181ACXX_DS-124072_GAGTGG_L006_001.markdup.fixed.left.recal.rehead.bam -I ./AC2181ACXX_DS-124113_GTCCGC_L001_001.markdup.fixed.left.recal.rehead.bam -I ./AC2181ACXX_DS-124080_ATTCCT_L008_001.markdup.fixed.left.recal.rehead.bam -I ./AD23GUACXX_DS-124122_AGTCAA_L005_001.markdup.fixed.left.recal.rehead.bam...

(with 116 bam files)

/usr/sbin/lsof -p <java process ID> | wc -l
returns 728 open files (this turns out to be the 116 BAM files, each opened 8 times).

When run with the following settings, I see some strange messages:

-nt 12 -nct 1
INFO 16:57:41,908 SAMDataSource - Running in asynchronous I/O mode; number of threads = 11

-nt 12 -nct 2
INFO 16:58:46,246 SAMDataSource - Running in asynchronous I/O mode; number of threads = 10
INFO 16:58:46,763 MicroScheduler - Running the GATK in parallel mode with 2 concurrent threads

and so on, until:

-nt 12 -nct 11
INFO 17:00:07,673 SAMDataSource - Running in asynchronous I/O mode; number of threads = 1
INFO 17:00:08,288 MicroScheduler - Running the GATK in parallel mode with 11 concurrent threads

-nt 12 -nct 13
ERROR MESSAGE: Invalid thread allocation. User requested 12 threads in total, but the count of cpu threads (13) is higher than the total threads

If -nt is the 'number of data threads', then why does SAMDataSource report <nt> - <nct> as the number of 'threads'?

If -nct is the 'number of CPU threads per data thread', then the total number of CPU threads running should really be <nt> * <nct>. Instead, it seems to be <nt> - <nct>, which makes no sense according to the definitions.

Memory considerations for multi-threading Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory to run, if you use -nt 4, the multithreaded run will use 8 Gb of memory

In any case, all other software I'm familiar with has no notion of a 'data thread', and it seems unnecessary and wasteful -- one simply specifies the inputs, and chooses a number of CPU threads, and the program handles the rest, without reading the same input multiple times.

Thanks,

Henry

Hello,

I am trying to test the -nct parameter in PrintReads (BQSR). My machine has 32 cores (AWS cc2.8xlarge) and the input is a single BAM. So I tried "-nct 32" first, but it was almost as slow as "-nct 1". Actually any -nct > 8 made it slow, so I am wondering if it's my fault or a known/expected result. Thanks!

Created 2013-06-12 06:04:53 | Updated | Tags: baserecalibrator multithreading speed

Hi,

I'm running BaseRecalibrator on a full lane of HiSeq. I set the -nct switch to 8, but I see that the average load is <1. Why is this so? Doesn't the BaseRecalibrator use threads properly?

I've seen this thread http://gatkforums.broadinstitute.org/discussion/1263/baserecalibrator-trade-off-between-runtime-and-accuracy on how to speed up the process by downsampling. Is this a better option considering I have >200M reads?

I would really love to reduce the runtime from 36hrs.

Many thanks.

Created 2013-05-01 21:11:40 | Updated 2013-05-01 21:32:17 | Tags: unifiedgenotyper multithreading

Sometimes, while running the UnifiedGenotyper with -nct 8, my job might die for any number of reasons unrelated to your fine software. Can I start my next attempt where the last one left off, simply by using the -L flag to specify the last genotyped location from the VCF output (and the rest of the genome)? This of course works, but the deeper question is this: since I used the -nct flag, might some of the threads have been scanning ahead in the genome, so that the last genotyped location in that VCF is too far ahead and I might miss some intermediate variants? My sincere apologies if this has already been asked; I did look around first.

Created 2013-03-11 20:50:56 | Updated 2013-03-11 20:55:41 | Tags: multithreading java

While I have been able to work around this issue, I am still curious as to what is causing it in the first place.

I am currently engaged in variant calling on a rather large set of exome sequenced samples (prepared according to GATK best practices 4), but I have run into some issues with UnifiedGenotyper.

Put simply, I am able to run any of our input intervals through UnifiedGenotyper with a single data thread, and either 1 or 24 CPU threads, with no trouble. Upon looking at the actual memory usage of the process while running, I decided to try increasing the number of data threads, as it appeared that I could easily run 4 threads in the heap I have available. However, regardless of the thread or memory settings, whenever I raise the number of data threads beyond 1, I immediately get a Java core dump once the initialization phase has completed. The GATK invocation I use is as follows:

java -Djava.io.tmpdir=$TMP -Xmx48G -jar $GATK -T UnifiedGenotyper \
-R $REF \
--dbsnp $ROD \
${GATKArg} \
-o ${TMP}/${BASENAME}/${BASENAME}.${INTERVAL}.raw.variants.vcf \
-L ${INTERVAL} \
-nt 4 \
-nct 8 \
-glm BOTH \
-A DepthOfCoverage \
-A AlleleBalance \
-A HaplotypeScore \
-A HomopolymerRun \
-A MappingQualityZero \
-A QualByDepth \
-A RMSMappingQuality \
-A SpanningDeletions \
-A MappingQualityRankSumTest \
-A FisherStrand \
-A InbreedingCoeff

For reference, this is GATK version 2.3-9, JRE 1.6.

The error I receive is shown below. Any insight into this issue would be greatly appreciated.

Sincerely, Jason Kost

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (exceptions.cpp:364), pid=14830, tid=1113614656
#  Error: ExceptionMark destructor expects no pending exceptions
#
# JRE version: 6.0_16-b01
# Java VM: Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode linux-amd64 )
# Can not save log file, dump to screen..
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (exceptions.cpp:364), pid=14830, tid=1113614656
#  Error: ExceptionMark destructor expects no pending exceptions
#
# JRE version: 6.0_16-b01
# Java VM: Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode linux-amd64 )
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Stack: [0x0000000042506000,0x0000000042607000],  sp=0x0000000042602bd0,  free space=1010k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x6bd1ef]
V  [libjvm.so+0x2be556]
V  [libjvm.so+0x3082dd]
V  [libjvm.so+0x2592d0]
V  [libjvm.so+0x258922]
V  [libjvm.so+0x2589a6]
V  [libjvm.so+0x25a37e]
V  [libjvm.so+0x650eab]
V  [libjvm.so+0x64f222]
V  [libjvm.so+0x64e101]
V  [libjvm.so+0x64dd63]
V  [libjvm.so+0x42c8e9]
V  [libjvm.so+0x43310e]
V  [libjvm.so+0x411d68]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  sun.misc.Launcher$AppClassLoader.loadClass(Ljava/lang/String;Z)Ljava/lang/Class;+41
j  java.lang.ClassLoader.loadClass(Ljava/lang/String;)Ljava/lang/Class;+3
j  java.lang.ClassLoader.loadClassInternal(Ljava/lang/String;)Ljava/lang/Class;+2
v  ~StubRoutines::call_stub
j  org.broadinstitute.sting.gatk.phonehome.GATKRunReport$S3PutRunnable.run()V+6
v  ~StubRoutines::call_stub

---------------  P R O C E S S  ---------------

VM state:not at safepoint (normal execution)

VM Mutex/Monitor currently owned by a thread: None

Heap
PSYoungGen      total 63360K, used 3359K [0x00002ab2b3440000, 0x00002ab2ba990000, 0x00002ab6b3440000)
eden space 16384K, 20% used [0x00002ab2b3440000,0x00002ab2b3787fb8,0x00002ab2b4440000)
from space 46976K, 0% used [0x00002ab2b7550000,0x00002ab2b7550000,0x00002ab2ba330000)
to   space 50240K, 0% used [0x00002ab2b4440000,0x00002ab2b4440000,0x00002ab2b7550000)
PSOldGen        total 353536K, used 271234K [0x00002aaab3440000, 0x00002aaac8d80000, 0x00002ab2b3440000)
object space 353536K, 76% used [0x00002aaab3440000,0x00002aaac3d20bd8,0x00002aaac8d80000)
PSPermGen       total 21504K, used 16865K [0x00002aaaae040000, 0x00002aaaaf540000, 0x00002aaab3440000)
object space 21504K, 78% used [0x00002aaaae040000,0x00002aaaaf0b8538,0x00002aaaaf540000)

Dynamic libraries:
Can not get library information for pid = 14864

VM Arguments:
jvm_args: -Djava.io.tmpdir=/home/kostj/scratch/temp/ -Xmx48G
java_command: /home/kostj/nearline/gatk2/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /home/kostj/nearline/hg19/hg19.fasta --dbsnp /home/kostj/nearline/hg19/gatk2/dbsnp_137.hg19.vcf -I /home/kostj/scratch/temp//SPr0059/SPr0059.reduced.bam -I /home/kostj/scratch/temp//SP3608/SP3608.reduced.bam -I /home/kostj/scratch/temp//SP3528/SP3528.reduced.bam -I /home/kostj/scratch/temp//SP3508/SP3508.reduced.bam -I /home/kostj/scratch/temp//SP3474/SP3474.reduced.bam -I /home/kostj/scratch/temp//SP3277/SP3277.reduced.bam -I /home/kostj/scratch/temp//SP3255/SP3255.reduced.bam -I /home/kostj/scratch/temp//SP3148/SP3148.reduced.bam -I /home/kostj/scratch/temp//sp3127/sp3127.reduced.bam -I /home/kostj/scratch/temp//SP3070/SP3070.reduced.bam -I /home/kostj/scratch/temp//sp3035/sp3035.reduced.bam -I /home/kostj/scratch/temp//SP0618/SP0618.reduced.bam -I /home/kostj/scratch/temp//SNc0063/SNc0063.reduced.bam -I /home/kostj/scratch/temp//SNc0036/SNc0036.reduced.bam -I /home/kostj/scratch/temp//SMa0078/SMa0078.reduced.bam -I /home/kostj/scratch/temp//SMa0020/SMa0020.reduced.bam -I /home/kostj/scratch/temp//SLA969/SLA969.reduced.bam -I /home/kostj/scratch/temp//SLA966/SLA966.reduced.bam -I /home/kostj/scratch/temp//SLA956/SLA956.reduced.bam -I /home/kostj/scratch/temp//SLA929/SLA929.reduced.bam -I /home/kostj/scratch/temp//SLA922/SLA922.reduced.bam -I /home/kostj/scratch/temp//SLA885/SLA885.reduced.bam -I /home/kostj/scratch/temp//SLA877/SLA877.reduced.bam -I /home/kostj/scratch/temp//SLA863/SLA863.reduced.bam -I /home/kostj/scratch/temp//SLA861/SLA861.reduced.bam -I /home/kostj/scratch/temp//SLA833/SLA833.reduced.bam -I /home/kostj/scratch/temp//SLA809/SLA809.reduced.bam -I /home/kostj/scratch/temp//SLA808/SLA808.reduced.bam -I /home/kostj/scratch/temp//SLA807/SLA807.reduced.bam -I /home/kostj/scratch/temp//SLA798/SLA798.reduced.bam -I /home/kostj/scratch/temp//SLA791/SLA791.reduced.bam -I /home/kostj/scratch/temp//SLA783/SLA783.reduced.bam -I /home/kostj/scratch/temp//SLA767/SLA767.redu
Launcher Type: SUN_STANDARD

Environment Variables:
PATH=/share/bin/java/bin:/home/kostj/nearline/R/bin/:/tmp/8474173.1.chassis12.q:/share/bin/gcc/bin:/share/bin:/opt/SUNWhpc/HPC8.1/gnu/bin/:/share/bin/intel/bin/intel64:/share/bin/gmp/include:/share/bin/mpfr/include:/sge/bin/lx24-amd64:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin
SHELL=/bin/bash

Signal Handlers:

---------------  S Y S T E M  ---------------

OS:Linux
uname:Linux 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64
libc:glibc 2.5 NPTL 2.5
rlimit: STACK infinity, CORE infinity, NPROC 399360, NOFILE 1024, AS infinity

CPU:total 24 (16 cores per cpu, 2 threads per core) family 6 model 44 stepping 2, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, ht

Memory: 4k page, physical 49451764k(23657140k free), swap 16779884k(16697572k free)

vm_info: Java HotSpot(TM) 64-Bit Server VM (14.2-b01) for linux-amd64 JRE (1.6.0_16-b01), built on Jul 31 2009 05:52:33 by "java_re" with gcc 3.2.2 (SuSE Linux)

time: Mon Mar 11 16:00:00 2013
elapsed time: 923 seconds

#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
/sge/default/spool/pem610-040/job_scripts/8474173: line 83: 14830 Aborted                 (core dumped) java -Djava.io.tmpdir=$TMP -Xmx48G -jar $GATK -T UnifiedGenotyper -R $REF --dbsnp $ROD ${GATKArg} -o ${TMP}/${BASENAME}/${BASENAME}.${INTERVAL}.raw.variants.vcf -L ${INTERVAL} -nt 2 -nct 12 -glm BOTH -A DepthOfCoverage -A AlleleBalance -A HaplotypeScore -A HomopolymerRun -A MappingQualityZero -A QualByDepth -A RMSMappingQuality -A SpanningDeletions -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff

Created 2012-10-17 05:00:34 | Updated 2012-10-18 00:45:03 | Tags: unifiedgenotyper multithreading

Hi --

I'm running the UG on a subset of 1000G data (100 samples subset to chr20), and I would love to throw some more threads at the problem. However, using the -nt flag throws the following code exception:

java.lang.InternalError
    at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:815)
    at sun.misc.URLClassPath.getResource(URLClassPath.java:195)
    at sun.misc.URLClassPath.getResource(URLClassPath.java:247)
    at java.lang.ClassLoader.getBootstrapResource(ClassLoader.java:1298)
    at java.lang.ClassLoader.getResource(ClassLoader.java:1135)
    at java.lang.ClassLoader.getResource(ClassLoader.java:1133)
    at java.lang.ClassLoader.getSystemResource(ClassLoader.java:1260)
    at java.lang.ClassLoader.getSystemResourceAsStream(ClassLoader.java:1363)
    at java.lang.Class.getResourceAsStream(Class.java:2045)
    at javax.xml.stream.SecuritySupport$4.run(SecuritySupport.java:92)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.xml.stream.SecuritySupport.getResourceAsStream(SecuritySupport.java:87)
    at javax.xml.stream.FactoryFinder.findJarServiceProvider(FactoryFinder.java:285)
    at javax.xml.stream.FactoryFinder.find(FactoryFinder.java:253)
    at javax.xml.stream.FactoryFinder.find(FactoryFinder.java:177)
    at javax.xml.stream.XMLInputFactory.newInstance(XMLInputFactory.java:153)
    at org.simpleframework.xml.stream.NodeBuilder.<init>(NodeBuilder.java:48)
    at org.simpleframework.xml.core.Persister.write(Persister.java:1000)
    at org.simpleframework.xml.core.Persister.write(Persister.java:982)
    at org.simpleframework.xml.core.Persister.write(Persister.java:963)
    at org.broadinstitute.sting.gatk.phonehome.GATKRunReport.postReportToStream(GATKRunReport.java:236)
    at org.broadinstitute.sting.gatk.phonehome.GATKRunReport.postReportToAWSS3(GATKRunReport.java:333)
    at org.broadinstitute.sting.gatk.phonehome.GATKRunReport.postReport(GATKRunReport.java:215)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.generateGATKRunReport(CommandLineExecutable.java:157)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:116)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
Caused by: java.io.FileNotFoundException: /usr/java/jdk1.7.0_03/jre/lib/resources.jar (Too many open files)
    at java.util.zip.ZipFile.open(Native Method)
    at java.util.zip.ZipFile.<init>(ZipFile.java:214)
    at java.util.zip.ZipFile.<init>(ZipFile.java:144)
    at java.util.jar.JarFile.<init>(JarFile.java:152)
    at java.util.jar.JarFile.<init>(JarFile.java:89)
    at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:706)
    at sun.misc.URLClassPath$JarLoader.access$600(URLClassPath.java:587)
    at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:667)
    at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:660)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:659)
    at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:813)
    ... 27 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.1-13-g1706365):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

The program is invoked as follows. Without the -nt flag it executes normally.

java -jar /userhome/vasya/apps/GenomeAnalysisTK-2.1-13/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
$(echo $bamFiles) \
-R ${fasta} \
-nt 2 \
-o ${ug_vcf_root}

Are there any known issues with the threading options in the UG for GATK2?

Any suggestions for where to look next would be much appreciated! It's hard to know where to start with this.