RealignerTargetCreator appears to take more time when multithreaded using the -nt flag
Posted in Ask the GATK team | Last updated on 2012-08-08 15:44:43


Hi all,

We're analysing a fairly large data set and run time is an issue, so I did some scaling tests on a subset of the data before starting. The results were unexpected.

When I run GATK RealignerTargetCreator with -nt 8 and give it 8 cores to work with, it actually takes about 2.5 times LONGER than if I just run it single-threaded. I don't mean that the user or CPU time goes up; the real (wall-clock) time goes up. In the -nt 8 case, all 8 cores were on a single node of our cluster, with shared memory.

I tested on two different kinds of subsets of the data, and both performed worse when multithreaded. I first restricted the input data by genomic region, i.e. analysing only chr22. When multithreading didn't behave as expected in this test, I thought that maybe GATK was trying to parallelise over genomic regions, so I instead tested on a single lane of input data (a 9.6 GB BAM file with reads spread over the whole genome). This also ran more slowly when multithreaded.
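In case it helps anyone reproduce the comparison, here is a minimal sketch of how this kind of timing test could be scripted. It is not the exact script used for the numbers above. The helper names (time_command, compare_thread_counts, speedup) are made up for illustration, and the only GATK option the script adds itself is -nt; the rest of the command line is whatever you pass in. Child CPU time is read from os.times(), which is only meaningful on Unix-like systems.

```python
import os
import subprocess
import time


def time_command(cmd):
    """Run cmd once and return (wall_seconds, child_cpu_seconds)."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises if the command fails
    wall = time.perf_counter() - start
    after = os.times()
    # CPU time (user + system) accumulated by child processes during the run.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu


def compare_thread_counts(base_cmd, thread_counts=(1, 8)):
    """Time base_cmd once per thread count, appending '-nt N' each time.

    base_cmd is the full command line WITHOUT -nt (hypothetical example:
    the java invocation of RealignerTargetCreator with your own inputs).
    """
    results = {}
    for nt in thread_counts:
        results[nt] = time_command(list(base_cmd) + ["-nt", str(nt)])
    return results


def speedup(single_wall, multi_wall):
    """Wall-clock speedup of the multithreaded run; below 1.0 means slower."""
    return single_wall / multi_wall
```

With this, a multithreaded run that takes 2.5 times longer in wall-clock time gives speedup(t, 2.5 * t) = 0.4. Comparing the wall and CPU figures for each run also shows whether the extra threads are doing work (CPU time well above wall time) or mostly waiting.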

So my question is: should I use -nt 8 in my real analysis even though it was a bad option in testing? Is it possible that multithreading will be bad for small amounts of data, but good in the large-data case? Or, does this indicate that I'm doing something wrong when trying to run RealignerTargetCreator multithreaded?

I would really like to use the fastest option for the real data, as it will be very large. Any help is much appreciated.

Thanks, Clare
