According to http://www.broadinstitute.org/gatk/guide/article?id=1975:
There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:
-nt / --num_threads controls the number of data threads sent to the processor
-nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread
My system: RHEL5, 144 GB memory, 12 cores (Intel 2.8 GHz). The command I ran:
~/src/jre1.7.0_40/bin/java -Xmx64g -Xms32g -d64 -jar /apps/gau/GATK_versions/GATKLite-2.1/GenomeAnalysisTKLite.jar -nt 8 -nct 6 -L chr8:90000001-120000000 -rbs 10000000 -T UnifiedGenotyper -rf BadCigar -R /dev/shm/CEUref.hg19.fasta -glm BOTH -D /dev/shm/dbsnp_135.hg19.reordered.vcf -metrics test.metrics.txt -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 1000 --max_alternate_alleles 10 -A AlleleBalance -A AlleleBalanceBySample -A BaseCounts -A BaseQualityRankSumTest -A DepthOfCoverage -A DepthPerAlleleBySample -A FisherStrand -A HaplotypeScore -A HardyWeinberg -A IndelType -A LowMQ -A MappingQualityRankSumTest -A MappingQualityZero -A MappingQualityZeroBySample -A MappingQualityZeroFraction -A QualByDepth -A ReadPosRankSumTest -A RMSMappingQuality -A SampleList -o chunk55.vcf -I ./AC2181ACXX_DS-124072_GAGTGG_L006_001.markdup.fixed.left.recal.rehead.bam -I ./AC2181ACXX_DS-124113_GTCCGC_L001_001.markdup.fixed.left.recal.rehead.bam -I ./AC2181ACXX_DS-124080_ATTCCT_L008_001.markdup.fixed.left.recal.rehead.bam -I ./AD23GUACXX_DS-124122_AGTCAA_L005_001.markdup.fixed.left.recal.rehead.bam...
(with 116 BAM files in total)
/usr/sbin/lsof -p <java process ID> | wc -l
returns 728 open files. (This turns out to be the 116 BAM files, each opened 8 times.)
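To check how often each BAM is opened, rather than only the total, the lsof output can be tallied per file. This is a minimal sketch: the function name is my own, and it assumes the usual nine-column lsof layout with the path in the last (NAME) column.

```python
from collections import Counter


def count_open_bams(lsof_output: str) -> Counter:
    """Count how many times each .bam path appears in `lsof -p <pid>` output.

    Assumes the default lsof layout, where NAME is the ninth column.
    """
    counts = Counter()
    for line in lsof_output.splitlines():
        if line.startswith("COMMAND"):
            continue  # skip the header row
        # Split into at most 9 fields so a path containing spaces stays whole.
        fields = line.split(None, 8)
        if len(fields) == 9 and fields[8].endswith(".bam"):
            counts[fields[8]] += 1
    return counts


# Possible usage (needs lsof and a running process, so it is left commented out):
# import subprocess
# out = subprocess.run(["/usr/sbin/lsof", "-p", "12345"],
#                      capture_output=True, text=True).stdout
# for path, n in count_open_bams(out).most_common():
#     print(n, path)
```

Index files (.bai) and other handles are ignored, so the per-file counts can be compared directly against the -nt value.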
When run with the following settings, I see some strange messages:
-nt 12 -nct 1
INFO 16:57:41,908 SAMDataSource - Running in asynchronous I/O mode; number of threads = 11
-nt 12 -nct 2
INFO 16:58:46,246 SAMDataSource - Running in asynchronous I/O mode; number of threads = 10
INFO 16:58:46,763 MicroScheduler - Running the GATK in parallel mode with 2 concurrent threads
and so on, until:
-nt 12 -nct 11
INFO 17:00:07,673 SAMDataSource - Running in asynchronous I/O mode; number of threads = 1
INFO 17:00:08,288 MicroScheduler - Running the GATK in parallel mode with 11 concurrent threads
-nt 12 -nct 13
ERROR MESSAGE: Invalid thread allocation. User requested 12 threads in total, but the count of cpu threads (13) is higher than the total threads
If -nt is the 'number of data threads', then why does SAMDataSource report <nt> - <nct> as the number of 'threads'?
If -nct is the 'number of CPU threads per data thread', then the total number of CPU threads running should be <nt> * <nct>. Instead, the number reported is <nt> - <nct>, and -nt is treated as the total number of threads (see the error message above), which makes no sense according to the definitions.
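To make the mismatch concrete, here is a small sketch that reproduces the numbers in the log lines above next to what the documented definitions would imply. It is only a model inferred from those messages, not GATK's actual code, and the function and key names are my own.

```python
def observed_allocation(nt: int, nct: int) -> dict:
    """Thread counts as they appear in the log messages quoted above.

    Inferred from the messages only; this is not taken from the GATK source.
    """
    if nct > nt:
        # Mirrors the "Invalid thread allocation" error seen with -nt 12 -nct 13.
        raise ValueError(
            f"requested {nt} threads in total, but cpu threads ({nct}) is higher"
        )
    return {
        # SAMDataSource: "asynchronous I/O mode; number of threads = ..."
        # (the nct == nt case is not shown in the logs; 0 here is an extrapolation)
        "io_threads_reported": nt - nct,
        # MicroScheduler: "parallel mode with ... concurrent threads"
        "concurrent_threads_reported": nct,
        # What "nct CPU threads per data thread" would imply.
        "total_implied_by_docs": nt * nct,
    }


for nct in (1, 2, 11):
    print(nct, observed_allocation(12, nct))
```

For -nt 12 -nct 2 this gives 10 reported I/O threads and 2 concurrent threads, against the 24 CPU threads the definitions suggest.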
Memory considerations for multi-threading: Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory to run, if you use -nt 4, the multithreaded run will use 8 Gb of memory.
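The memory rule quoted above is simple multiplication. A minimal sketch, with function names of my own choosing, applies it in both directions: the total heap for a given -nt, and the share each data thread gets from a fixed heap such as the -Xmx64g in my command.

```python
def total_heap_gb(single_run_gb: float, nt: int) -> float:
    """Heap needed if every data thread gets a full single-run allowance."""
    return single_run_gb * nt


def per_data_thread_gb(heap_gb: float, nt: int) -> float:
    """Share of a fixed heap available to each data thread under the same rule."""
    return heap_gb / nt


print(total_heap_gb(2, 4))        # the example from the quoted text
print(per_data_thread_gb(64, 8))  # -Xmx64g with -nt 8
```

By this rule, my -Xmx64g run with -nt 8 leaves 8 GB per data thread.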
In any case, no other software I'm familiar with has a notion of a 'data thread', and the concept seems unnecessary and wasteful: one simply specifies the inputs and chooses a number of CPU threads, and the program handles the rest without reading the same input multiple times.