Tagged with #compression
0 documentation articles | 0 announcements | 3 forum discussions




Created 2015-04-30 19:31:03 | Updated 2015-04-30 19:31:51 | Tags: bcf2codec compression

Comments (3)

GATK Team,

I have recently started looking into using bgzipped BCF files as our primary input/output format for GATK, in order to save the time spent parsing VCF files. Unfortunately, due to space limitations, uncompressed BCF files are not an option, as they appear to be ~8x the size of a bgzipped VCF.

When I ran a simple "round trip" to convert vcf.gz -> bcf.gz -> vcf.gz (using SelectVariants) just to test the potential processing gains, I got the following error on the bcf.gz->vcf.gz leg:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version nightly-2015-04-30-gdd4ddcb): 
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: Tabix indexed files only work with ASCII codecs, but received non-Ascii codec BCF2Codec, for input source: myFile.bcf.gz
##### ERROR ------------------------------------------------------------------------------------------

This issue persists with the nightly build as well.

Is the native reading of bcf.gz files something that is on the horizon for the GATK team, or is it still a long way off? It looks like this code is pretty deep in the htsjdk library, and fixing it may require a change to the class hierarchy.

Thanks,

John Wallace
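
The error message above distinguishes between the compression container (BGZF, as used by tabix) and the payload inside it (ASCII VCF versus binary BCF2). The following is a minimal Python sketch of how the two can be told apart by inspecting magic bytes. It is not GATK or htsjdk code: the function names `is_bgzf` and `payload_kind` are hypothetical, and the sketch assumes the BGZF "BC" extra subfield and the "BCF" magic string as described in the SAM and VCF/BCF specifications.

```python
import gzip
import struct

GZIP_MAGIC = b"\x1f\x8b"
BCF_MAGIC = b"BCF"                # BCF2 starts with "BCF" plus version bytes
VCF_MAGIC = b"##fileformat=VCF"   # first header line of a text VCF


def is_bgzf(path):
    """Return True if the first gzip member carries the BGZF 'BC' extra subfield."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12 or header[:2] != GZIP_MAGIC:
            return False
        if not header[3] & 0x04:  # FLG.FEXTRA must be set for BGZF
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = fh.read(xlen)
    # Walk the extra subfields: SI1, SI2, SLEN (2 bytes), then SLEN bytes of data.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


def payload_kind(path):
    """Classify the (possibly gzip-compressed) payload as 'bcf', 'vcf' or 'unknown'."""
    with open(path, "rb") as fh:
        compressed = fh.read(2) == GZIP_MAGIC
    opener = gzip.open if compressed else open
    with opener(path, "rb") as fh:
        head = fh.read(len(VCF_MAGIC))
    if head.startswith(BCF_MAGIC):
        return "bcf"
    if head.startswith(VCF_MAGIC):
        return "vcf"
    return "unknown"
```

Under these assumptions, a file for which `is_bgzf` returns `True` and `payload_kind` returns `"bcf"` corresponds to the case rejected in the error message: a block-compressed container holding a non-ASCII payload.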


Created 2014-12-26 21:09:34 | Updated | Tags: indelrealigner compression

Comments (1)

Hi GATK team, my jobs are currently running and I'm a little too lazy to try this later: I noticed that the .interval files produced by RealignerTargetCreator can be quite large. Can I use a ".interval.gz" extension on the RealignerTargetCreator command line? Can I then use this *.gz file with IndelRealigner?


Created 2013-03-19 20:08:23 | Updated | Tags: parallel blip compression exception

Comments (6)

Trying to run

java -jar $GATKJAR -R $REF -T UnifiedGenotyper -I file1.bam -I file2.bam -I file3.bam -glm BOTH -o output.vcf.gz

gives an error like:

 ##### ERROR ------------------------------------------------------------------------------------------
 ##### ERROR A USER ERROR has occurred (version 2.4-9-g532efad): 
 ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
 ##### ERROR Please do not post this error to the GATK forum
 ##### ERROR
 ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
 ##### ERROR Visit our website and forum for extensive documentation and answers to 
 ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
 ##### ERROR
 ##### ERROR MESSAGE: There was a failure because temporary file /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub1033673347640679118.tmp could not be found while running the GATK with more than one thread.  Possible causes for this problem include: your system's open file handle limit is too small, your output or temp directories do not have sufficient space, or just an isolated file system blip
 ##### ERROR ------------------------------------------------------------------------------------------

The file is actually there; it is gzip-compressed and VCF-formatted.

However, if I specify -o output.vcf instead of -o output.vcf.gz, then everything works. I suspect the problem is with the autodetection of the codec. In VariantContextWriterStorage, LocalParallelizationProblem is thrown not only if the tmp file cannot be found, but whenever a FeatureDescriptor cannot be found for the file.

So it seems that compressed output cannot be written when UnifiedGenotyper runs with more than one thread. Is my assessment correct?

  1. A better error message would be helpful to prevent others from trying the same thing I did.
  2. It would be nice to be able to write compressed output from a threaded UnifiedGenotyper. Perhaps (a) the temp file could be written uncompressed even though the final file will be compressed, or (b) the codec detection could recognize gzip-compressed files?
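
To make the two suggestions concrete, here is a rough Python sketch of both ideas. It is not how GATK implements its temp-file merging: the helper names `open_maybe_gzip` and `merge_shards` are hypothetical, the shards stand in for the per-thread temporary files, and the output uses plain gzip rather than the block-gzip (BGZF) format that a tabix-indexable file would need.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path):
    """Idea (b): detect gzip by magic bytes instead of trusting the file name."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")


def merge_shards(shard_paths, out_path):
    """Idea (a): shards may be uncompressed; compress only the final output.

    The header (lines starting with '#') is taken from the first shard only.
    """
    if out_path.endswith(".gz"):
        out = gzip.open(out_path, "wt")
    else:
        out = open(out_path, "wt")
    with out:
        for index, shard in enumerate(shard_paths):
            with open_maybe_gzip(shard) as fh:
                for line in fh:
                    if index > 0 and line.startswith("#"):
                        continue  # skip repeated headers from later shards
                    out.write(line)
```

For example, `merge_shards(["shard0.vcf", "shard1.vcf"], "output.vcf.gz")` would read two uncompressed shards (file names here are placeholders) and produce a single gzip-compressed result, so the compression choice for the final file no longer affects how the intermediate files are read.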