I've been trying to get ReduceReads working in a pipeline I've made that incorporates GATK tools to call variants in RNA-seq data. After performing indel realignment and base recalibration I'm trying to use ReduceReads prior to calling variants using Unified Genotyper.
I've been using GATK version 2.3.9. When I try to use ReduceReads on a 1.7Gb .bam file, I need to set aside 100Gb memory to perform the operation for the process to complete (otherwise I'll get an error saying I didn't provide enough memory to run the program and to adjust the maximum heap size using the -Xmx option etc).
The problem isn't that ReduceReads doesn't work - it does, however of the 100Gb I set aside, it uses 80-90Gb of it. This means I can't run more than one job at a time due to the constraints of the machine I'm using etc.
I've been looking through the GATK forum and understand it may be a GATK version issue, though I've tried using GATK 2.5.2 ReduceReads for this step and it still requires 70-80Gb memory.
Can anyone provide any clues as to what I may be doing wrong? or whether I can do something to make it use less memory so I can run multiple jobs simultaneously?
The command I'm using is:
java -Xmx100g -Djava.io.tmpdir=/RAW/javatmp/ -jar /NMC/LCR/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T ReduceReads -R /SCRATCH/LCR/BWAIndex_hg19/genome.fa -I out.bam.sorted.bam.readGroups.bam.rmdup.bam.realigned.bam.recalibrated.bam -o out.bam.sorted.bam.readGroups.bam.rmdup.bam.realigned.bam.recalibrated.bam.reducedReads.bam
Thanks in advance,