Tagged with #compression



Objective

Compress the read data in order to minimize file sizes, which makes it practical to process very large numbers of samples together.

Prerequisites

  • TBD

Steps

  1. Compress your sequence data

1. Compress your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R reference.fa \
    -I recal_reads.bam \
    -L 20 \
    -o reduced_reads.bam

Expected Result

This creates a file called reduced_reads.bam containing only the sequence information that is essential for calling variants. (The -L 20 argument restricts processing to chromosome 20 for the purpose of this example; adjust or omit it as appropriate for your data.)
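As a quick check of how much space was saved (using the file names from the command above), you can simply compare the input and output sizes:

ls -lh recal_reads.bam reduced_reads.bam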

Note that ReduceReads is not meant to be run on multiple samples at once. If you plan on merging your per-sample BAM files, run ReduceReads on each sample individually before merging (see the sketch below).
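For example, a per-sample workflow might look like the following sketch. The sample names are placeholders, and samtools merge is shown as just one way to do the merge (Picard MergeSamFiles would work equally well):

# Reduce each sample's recalibrated BAM separately
for SAMPLE in sampleA sampleB sampleC; do
    java -jar GenomeAnalysisTK.jar \
        -T ReduceReads \
        -R reference.fa \
        -I ${SAMPLE}.recal.bam \
        -o ${SAMPLE}.reduced.bam
done

# Only after reduction, merge the per-sample BAMs
samtools merge merged.reduced.bam sampleA.reduced.bam sampleB.reduced.bam sampleC.reduced.bam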


Trying to run

java -jar $GATKJAR -R $REF -T UnifiedGenotyper -I file1.bam -I file2.bam -I file3.bam -glm BOTH -o output.vcf.gz

gives an error like:

 ##### ERROR ------------------------------------------------------------------------------------------
 ##### ERROR A USER ERROR has occurred (version 2.4-9-g532efad): 
 ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
 ##### ERROR Please do not post this error to the GATK forum
 ##### ERROR
 ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
 ##### ERROR Visit our website and forum for extensive documentation and answers to 
 ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
 ##### ERROR
 ##### ERROR MESSAGE: There was a failure because temporary file /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub1033673347640679118.tmp could not be found while running the GATK with more than one thread.  Possible causes for this problem include: your system's open file handle limit is too small, your output or temp directories do not have sufficient space, or just an isolated file system blip
 ##### ERROR ------------------------------------------------------------------------------------------

The file is actually there, and it is gzip-compressed and VCF-formatted.

However, if I specify -o output.vcf instead of -o output.vcf.gz, then everything works. I suspect the problem is with the autodetection of the codec. In VariantContextWriterStorage, LocalParallelizationProblem is thrown not only if the tmp file cannot be found, but whenever a FeatureDescriptor cannot be found for the file.

So... it seems that compressed output cannot be used with multithreaded processing in UnifiedGenotyper. Is my assessment correct?

  1. A better error message would be helpful to prevent others from trying the same thing I did.
  2. It would be nice to be able to write compressed output from a threaded UnifiedGenotyper run. Perhaps (a) the temp file could be written uncompressed even though the final file will be compressed, or (b) the codec detection could be taught to recognize gzip-compressed files? (A manual workaround along the lines of (a) is sketched below.)
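In the meantime, the manual equivalent of (a) seems workable: run with an uncompressed output (which, as noted above, works fine with threading), then compress and index the result afterwards with bgzip and tabix:

# same command as before, but writing an uncompressed VCF
java -jar $GATKJAR -R $REF -T UnifiedGenotyper -I file1.bam -I file2.bam -I file3.bam -glm BOTH -o output.vcf

# compress and index afterwards
bgzip output.vcf
tabix -p vcf output.vcf.gz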