Tagged with #compression
0 documentation articles | 0 announcements | 3 forum discussions


No posts found with the requested search criteria.

Created 2015-04-30 19:31:03 | Updated 2015-04-30 19:31:51 | Tags: bcf2codec compression
Comments (3)

GATK Team,

I have recently started to look into using bgzipped BCF files as our primary means of input/output to GATK in order to save time parsing the VCF files. Unfortunately, due to space limitations, unzipped BCF files are not an option, as it looks like they're ~8x the size of a bgzipped VCF.

When I ran a simple "round trip" to convert vcf.gz -> bcf.gz -> vcf.gz (using SelectVariants) just to test the potential processing gains, I got the following error on the bcf.gz->vcf.gz leg:

```

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version nightly-2015-04-30-gdd4ddcb):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Tabix indexed files only work with ASCII codecs, but received non-Ascii codec BCF2Codec, for input source: myFile.bcf.gz
ERROR ------------------------------------------------------------------------------------------

```

This issue persists with the nightly build as well.

Is the native reading of bcf.gz files something that is on the horizon for the GATK team, or is it still a long way off? It looks like this code is pretty deep in the htsjdk library, and fixing it may require a change to the class hierarchy.
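The error above comes down to a binary payload (BCF) sitting inside a block-gzipped container that the tabix path expects to hold text. As a rough illustration only, and not how htsjdk or GATK actually select a codec, the Python sketch below inspects a file by its leading bytes instead of its extension. The function name `sniff_variant_format` is hypothetical. It assumes that BGZF output is readable as ordinary gzip, that BCF2 files begin with the bytes `BCF` followed by the major version byte 2, and that VCF files begin with `##fileformat=VCF`.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"         # first two bytes of any gzip (and BGZF) stream
BCF_MAGIC = b"BCF\x02"           # assumed BCF2 magic: "BCF" plus major version 2
VCF_MAGIC = b"##fileformat=VCF"  # first header line of a VCF text file


def sniff_variant_format(path):
    """Return (is_compressed, kind), where kind is 'bcf', 'vcf' or 'unknown'.

    Hypothetical helper for illustration; it looks only at leading bytes.
    """
    # Step 1: decide whether the container is gzip/BGZF from the raw bytes.
    with open(path, "rb") as raw:
        is_compressed = raw.read(2) == GZIP_MAGIC

    # Step 2: read the start of the (decompressed) payload.
    opener = gzip.open if is_compressed else open
    with opener(path, "rb") as stream:
        head = stream.read(len(VCF_MAGIC))

    # Step 3: classify the payload as binary BCF or text VCF.
    if head.startswith(BCF_MAGIC):
        return is_compressed, "bcf"
    if head.startswith(VCF_MAGIC):
        return is_compressed, "vcf"
    return is_compressed, "unknown"
```

A result of `(True, "bcf")` for `myFile.bcf.gz` would confirm that the file is compressed binary BCF, which is the combination the error message says the tabix-indexed reader does not accept.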

Thanks,

John Wallace


Created 2014-12-26 21:09:34 | Updated | Tags: indelrealigner compression
Comments (1)

Hi GATK team, my jobs are currently running, so I would rather ask now than test this later. I saw that the .interval files produced by RealignerTargetCreator can be quite large. Can I use a ".interval.gz" extension for the output on the RealignerTargetCreator command line? And can I then use this *.gz file with IndelRealigner?


Created 2013-03-19 20:08:23 | Updated | Tags: parallel blip compression exception
Comments (6)

Trying to run

java -jar $GATKJAR -R $REF -T UnifiedGenotyper -I file1.bam -I file2.bam -I file3.bam -glm BOTH -o output.vcf.gz

gives an error like:

 ##### ERROR ------------------------------------------------------------------------------------------
 ##### ERROR A USER ERROR has occurred (version 2.4-9-g532efad): 
 ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
 ##### ERROR Please do not post this error to the GATK forum
 ##### ERROR
 ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
 ##### ERROR Visit our website and forum for extensive documentation and answers to 
 ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
 ##### ERROR
 ##### ERROR MESSAGE: There was a failure because temporary file /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub1033673347640679118.tmp could not be found while running the GATK with more than one thread.  Possible causes for this problem include: your system's open file handle limit is too small, your output or temp directories do not have sufficient space, or just an isolated file system blip
 ##### ERROR ------------------------------------------------------------------------------------------

The file is actually there, and it is gzip-compressed and VCF-formatted.

However, if I specify -o output.vcf instead of -o output.vcf.gz, then everything works. I suspect the problem is with the autodetection of the codec. In VariantContextWriterStorage, LocalParallelizationProblem is thrown not only if the tmp file cannot be found, but whenever a FeatureDescriptor cannot be found for the file.

So it seems that compressed output cannot be written when UnifiedGenotyper runs with more than one thread. Is my assessment correct?

  1. A better error message would be helpful to prevent others from trying the same thing I did.
  2. It would be nice to be able to write compressed output from a threaded UnifiedGenotyper. Perhaps (a) the temp file could be written uncompressed even though the final file will be compressed, or (b) the codec detection could recognize gzip-compressed files?
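To make suggestions 1 and 2(a) concrete, here is a minimal Python sketch, not GATK code: per-thread temporary files are kept as plain-text VCF, only the merged result is compressed, and a missing file is reported separately from an unrecognized one. The names `merge_shards_to_gzip`, `ShardMissingError` and `ShardFormatError` are hypothetical. The sketch assumes every shard repeats the same header and that the shards are already in the desired order. It writes ordinary gzip rather than BGZF, so the result would not be tabix-indexable.

```python
import gzip
import os


class ShardMissingError(Exception):
    """A per-thread temporary file does not exist on disk."""


class ShardFormatError(Exception):
    """A temporary file exists but does not look like plain-text VCF."""


def merge_shards_to_gzip(shard_paths, out_path):
    """Concatenate uncompressed VCF shards into one gzip-compressed file.

    Header lines (those starting with '#') are kept from the first shard only.
    """
    with gzip.open(out_path, "wb") as out:
        for index, shard in enumerate(shard_paths):
            # Report a missing file as its own failure mode.
            if not os.path.exists(shard):
                raise ShardMissingError(f"temporary file not found: {shard}")
            with open(shard, "rb") as src:
                first = src.readline()
                # Report an unrecognized format separately from a missing file.
                if not first.startswith(b"##fileformat=VCF"):
                    raise ShardFormatError(f"not an uncompressed VCF: {shard}")
                if index == 0:
                    out.write(first)
                for line in src:
                    # Skip repeated header lines from later shards.
                    if index > 0 and line.startswith(b"#"):
                        continue
                    out.write(line)
```

Keeping the two exceptions distinct is the point of suggestion 1: a caller can tell "the file vanished" apart from "the file is there but was not recognized", which the single error message quoted above does not allow.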