# Tagged with #reducereads

0 documentation articles | 6 announcements | 39 forum discussions


GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

One important change for those who prefer to build from source is that we now use Maven instead of Ant. See the relevant documentation for building the GATK with our new build system.

## SplitNCigarReads

• This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

## Haplotype Caller

• Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
• Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
• Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
• Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
• Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.
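The gVCF output mentioned above records non-variant positions as reference blocks that GenotypeGVCFs can later genotype. As an illustrative model only (not the GATK implementation; the GQ band edges below are hypothetical), banding adjacent reference positions into blocks can be sketched as:

```python
import bisect

# Illustrative sketch: collapse per-position reference GQs into gVCF-style
# reference blocks. Adjacent positions whose GQs fall into the same band
# (band edges here are hypothetical) are merged into a single block record.
def gvcf_reference_blocks(position_gqs, band_edges=(0, 20, 60, 100)):
    def band(gq):
        return bisect.bisect_right(band_edges, gq) - 1
    blocks = []  # each block is [start, end, band_index]
    for pos, gq in position_gqs:
        if blocks and pos == blocks[-1][1] + 1 and band(gq) == blocks[-1][2]:
            blocks[-1][1] = pos  # extend the current block
        else:
            blocks.append([pos, pos, band(gq)])
    return [(start, end) for start, end, _ in blocks]
```

A block ends whenever the GQ moves to a different band or a position is skipped, which is why single-sample gVCFs stay compact while remaining genotypable.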

## CombineGVCFs

• This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw at all at once to GenotypeGVCFs.
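The hierarchical merge described above can be scripted by batching the per-sample gVCFs and composing one CombineGVCFs command per batch; the merged outputs are then genotyped together. A minimal sketch, with hypothetical file names and the conventional GATK 3 `-T CombineGVCFs` / `--variant` / `-o` argument shape:

```python
# Sketch: split N sample gVCFs into batches and emit one CombineGVCFs
# command per batch; the merged batch gVCFs can then be fed together to
# GenotypeGVCFs. File names here are hypothetical.
def hierarchical_merge_commands(gvcfs, batch_size, reference="ref.fasta"):
    commands, merged = [], []
    for i in range(0, len(gvcfs), batch_size):
        batch = gvcfs[i:i + batch_size]
        out = "batch_%d.g.vcf" % (i // batch_size)
        variants = " ".join("--variant %s" % g for g in batch)
        commands.append(
            "java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R %s %s -o %s"
            % (reference, variants, out))
        merged.append(out)
    return commands, merged
```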

## GenotypeGVCFs

• This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.

## SimulateReadsForVariants

• This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

## Unified Genotyper

• Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

## Variant Recalibrator

• Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).
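The duplicate rule above can be checked up front; a small illustrative helper (not GATK code) showing the validation described:

```python
# Sketch: read the contents of a ".list" file of VCF paths and reject
# duplicate entries, mirroring the rule described above.
def validate_vcf_list(lines):
    seen, vcfs = set(), []
    for line in lines:
        path = line.strip()
        if not path:
            continue  # ignore blank lines
        if path in seen:
            raise ValueError("duplicate entry in .list file: " + path)
        seen.add(path)
        vcfs.append(path)
    return vcfs
```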

## Reduce Reads

• Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

## Variant Annotator

• Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
• Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
• In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).
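For reference, FS is the phred-scaled p-value of a Fisher exact test on the 2x2 table of ref/alt read counts by forward/reverse strand, and QD divides the variant QUAL by the informative depth. A self-contained sketch of both (illustrative only; it is not the GATK implementation and omits the large-table normalization described above):

```python
from math import comb, log10

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    def p_table(x):  # hypergeometric probability of top-left cell = x
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = p_table(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # sum the probabilities of all tables as or more extreme than observed
    return sum(p for p in (p_table(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

def fisher_strand(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """FS: phred-scaled strand-bias p-value."""
    p = fisher_exact_2x2(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return -10.0 * log10(max(p, 1e-320))

def qual_by_depth(qual, sample_ads):
    """QD sketch: QUAL over the depth summed across samples whose AD
    shows variant depth > 1, per the note above."""
    depth = sum(sum(ad) for ad in sample_ads if sum(ad[1:]) > 1)
    return qual / depth if depth else 0.0
```

A balanced strand table gives FS near 0, while complete strand separation of ref and alt reads gives a large FS, which is what the downstream filters key on.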

## Combine Variants

• Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

## Select Variants

• Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
• Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.

## CalculateGenotypePosteriors

• Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
• It no longer strips out the AD annotation from genotypes.
• AC/AF/AN counts are updated after fixing genotypes.
• Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).
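Recomputing AC/AF/AN after genotypes change, as described above, amounts to re-counting alleles across the genotypes. A minimal sketch (diploid genotypes as lists of allele indices, "." for a missing allele):

```python
# Sketch: re-derive AN, AC and AF from (possibly fixed) genotypes so the
# site-level counts stay consistent with the genotype-level data.
def recompute_allele_counts(genotypes, n_alt_alleles):
    an, ac = 0, [0] * n_alt_alleles
    for gt in genotypes:
        for allele in gt:
            if allele == ".":
                continue  # missing alleles do not count toward AN
            an += 1
            if allele > 0:
                ac[allele - 1] += 1
    af = [count / an if an else 0.0 for count in ac]
    return an, ac, af
```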

## Indel Realigner

• Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

## Read Backed Phasing

• Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.

## Miscellaneous

• Thanks to Phillip Dexheimer for several Queue related fixes and patches.
• Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
• Providing an empty BAM list now results in a user error.
• Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
• Logger output now goes to STDERR instead of STDOUT.
• Picard, Tribble, and Variant jars updated to version 1.107.1683.

Created 2014-03-04 03:13:15 | Updated 2014-03-04 03:45:24 | Tags: reducereads

#### This is an important heads-up regarding the GATK 3.0 release.

The purpose of the ReduceReads algorithm was to enable joint analysis of large cohorts by the UnifiedGenotyper. The new workflow for joint discovery, which involves doing a single-sample pass with the HaplotypeCaller in gVCF mode followed by a joint analysis on multiple sample gVCFs, renders the compression step obsolete.

In addition, based on our most recent analyses, we have come to the conclusion that the quality of variant calls made on BAMs compressed with ReduceReads is inferior to the standard targeted by GATK tools. In comparison, the results obtained with the new workflow are far superior.

For these reasons, we have made the difficult decision to remove the ReduceReads tool from version 3.0 of the toolkit. To be clear, reduced BAMs will NOT be supported in GATK 3.0.

We realize that this may cause some disruption to your existing workflows, and for that we apologize. Please understand that we are driven to provide tools that produce the best possible results. Now that all the data is in, we have found that the best results cannot be achieved with reduced BAMs, so we feel that the best thing to do is to remove this inferior tool from the toolkit, and promote the new tools.

As always we welcome your comments, and we look forward to showing you how the new calling workflow will yield superior results.

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is smaller than previous ones. We are working hard on some new tools and frameworks that we hope to make available to everyone in our next release.

## Unified Genotyper

• Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

## Haplotype Caller

• Improved the indexing scheme for gVCF outputs using the reference calculation model.
• The reference calculation model now works with reduced reads.
• Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
• Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

## Variant Recalibrator

• Disabled tranche plots in INDEL mode.
• Various VQSR optimizations in both runtime and accuracy, including: for very large whole-genome datasets with over 2M variants overlapping the training data, the training set used to build the model is randomly downsampled; annotations are ordered by the difference in means between known and novel variants instead of by their standard deviation; the training-set quality score threshold was removed; the negative model now uses 2 Gaussians by default; the numBad argument has been removed, and the cutoffs are now chosen by the model itself by looking at the LOD scores.

## Reduce Reads

• Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

## Diagnose Targets

• Added calculation for GC content.
• Added an option to filter the bases based on their quality scores.

## Combine Variants

• Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

## Select Variants

• Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.
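Fixing rather than stripping the PLs when alternate alleles are dropped relies on the standard VCF genotype ordering, in which the PL for diploid genotype j/k (j <= k) sits at index k*(k+1)/2 + j. An illustrative sketch of the subsetting (not the GATK code):

```python
# Sketch: subset a diploid PL array to the alleles that survive selection
# (indices into the original allele list, 0 = REF), using the VCF ordering
# PL[k*(k+1)/2 + j] for genotype j/k with j <= k, then renormalize so the
# best genotype has PL 0.
def subset_pls(pls, kept_alleles):
    new = []
    for new_k, k in enumerate(kept_alleles):
        for j in kept_alleles[:new_k + 1]:
            new.append(pls[k * (k + 1) // 2 + j])
    best = min(new)
    return [p - best for p in new]
```

For example, for alleles 0 (REF), 1 and 2, the six input PLs are ordered 0/0, 0/1, 1/1, 0/2, 1/2, 2/2; keeping alleles 0 and 2 selects the 0/0, 0/2 and 2/2 entries.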

## Miscellaneous

• SplitSamFile now produces an index with the BAM.
• Length metric updates to QualifyMissingIntervals.
• Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
• Picard jar updated to version 1.104.1628.
• Tribble jar updated to version 1.104.1628.
• Variant jar updated to version 1.104.1628.

Created 2013-08-21 21:15:21 | Updated 2014-02-08 20:09:15 | Tags: indelrealigner unifiedgenotyper variantrecalibrator haplotypecaller reducereads release-notes

GATK 2.7 was released on August 21, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

## Reduce Reads

• Changed the underlying convention of having unstranded reduced reads; instead there are now at least 2 compressed reads at every position, one for each strand (forward and reverse). This allows us to maintain strand information that is useful for downstream filtering.
• Fixed bug where representative depths were arbitrarily being capped at 127 (instead of the expected 255).
• Fixed bug where insertions downstream of a variant region weren't triggering a stop to the compression.
• Fixed bug when using --cancer_mode where alignments were being emitted out of order (and causing the tool to fail).

## Unified Genotyper

• Added --onlyEmitSamples argument that, when provided, instructs the caller to emit only the selected samples into the VCF (even though the calling is performed over all samples present in the provided bam files).
• FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
• Added a (very) experimental argument (allSitePLs) that will have the caller emit PLs for all sites (including reference sites). Note that this does not give a fully accurate reference model because it models only SNPs. For a proper handling of the reference model, please use the Haplotype Caller.

## Haplotype Caller

• Added a still somewhat experimental PCR indel error model to the Haplotype Caller. By default this modeling is turned on and is very useful for removing false positive indel calls associated with PCR slippage around short tandem repeats (esp. homopolymers). Users have the option (with the --pcr_indel_model argument) of turning it off or making it even more aggressive (at the expense of losing some true positives too).
• Added the ability to emit accurate likelihoods for non-variant positions (i.e. what we call a "reference model" that incorporates indels as well as SNP confidences at every position). The output format can be either a record for every position or use the gVCF style recording of blocks. See the --emitRefConfidence argument for more details; note that this replaces the use of "--output_mode EMIT_ALL_SITES" in the HaplotypeCaller.
• Improvements to the internal likelihoods that are generated by the Haplotype Caller. Specifically, this tool now uses a tri-state correction like the Unified Genotyper, corrects for overlapping read pairs (from the same underlying fragment), and does not run contamination removal (allele-biased downsampling) by default.
• Several small runtime performance improvements were added (although we are still hard at work on larger improvements that will allow calling to scale to many samples; we're just not there yet).
• Fixed bug in how adapter clipping was performed (we now clip only after reverting soft-clipped bases).
• FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
• Improved the "dangling tail" recovery in the assembly algorithm, which allows for higher sensitivity in calling variants at the edges of coverage (e.g. near the ends of targets in an exome).
• Added the ability to run allele-biased downsampling with different per-sample values like the Unified Genotyper (contributed by Yossi Farjoun).

## Variant Annotator

• Fixed bug where only the last -comp was being annotated at a site.

## Indel Realigner

• Fixed bug that arises because of secondary alignments and that was causing the tool not to update the alignment start of the mate when a read was realigned.

## Phase By Transmission

• Fixed bug where multi-allelic records were being completely dropped by this tool. Now they are emitted unphased.

## Variant Recalibrator

• General improvements to the Gaussian modeling, mostly centered around separating the parameters for the positive and negative training models.
• Added mode to not emit (at all) variant records that are filtered out.
• This tool now automatically orders the annotation dimensions by their standard deviation instead of the order they were specified on the command-line in order to stabilize the training and have it produce optimal results.
• Fixed bug where the tool occasionally produced bad log10 values internally.

## Miscellaneous

• General performance improvements to the VCF reading code contributed by Michael McCowan.
• Error messages are much less verbose and "scary."
• Fixed the ReadBackedPileup class to represent mapping qualities as ints, not (signed) bytes.
• Added the engine-wide ability to do on-the-fly BAM file sample renaming at runtime (see the documentation for the --sample_rename_mapping_file argument for more details).
• Fixed bug in how the GATK counts filtered reads in the traversal output.
• Added a new tool called Qualify Intervals.
• Fixed major bug in the BCF encoding (the previous version was producing problematic files that were failing when trying to be read back into the GATK).
• Picard/sam/tribble/variant jars updated to version 1.96.1534.

GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

## Base Quality Score Recalibration

• Improved the algorithm around homopolymer runs to use a "delocalized context".
• Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
• Fixed bug where the tool failed for reads that begin with insertions.
• Fixed bug in the scatter-gather functionality.
• Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).

## Unified Genotyper

• Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
• The genotyper no longer emits the Strand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
• Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
• Fixed annotations (AD, FS, DP) that were miscalculated when run on a Reduce Reads processed bam.
• Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
• Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
• Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
• Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
• The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
• Fixed annotations (particularly AD) for indel calls; previous versions didn't accurately bin reads into the reference or alternate sets correctly.
• Generalized ploidy model now handles reference calls correctly.
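The contamination filter mentioned above (--contamination_fraction_to_filter, default 0.05) removes a fraction of the alt-supporting reads at each site so low-grade contamination cannot masquerade as a real variant. A toy sketch of the idea, under the assumption that reads are dropped at random (read objects here are just placeholders; this mirrors the concept, not the exact GATK algorithm):

```python
import math
import random

# Toy sketch of allele-biased downsampling: drop up to fraction * depth
# of the alt-supporting reads at a site. Returns the reads to keep.
def filter_contamination(alt_reads, total_depth, fraction=0.05, seed=0):
    rng = random.Random(seed)
    n_remove = min(len(alt_reads), int(math.ceil(total_depth * fraction)))
    kept = list(alt_reads)
    for _ in range(n_remove):
        kept.pop(rng.randrange(len(kept)))
    return kept
```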

## Haplotype Caller

• Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
• Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
• Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
• Now requires at least 10 samples to merge variants into complex events.

## Variant Annotator

• Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.

## Reduce Reads

• Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
• Fixed bugs where in rare cases N bases were chosen as consensus over legitimate A, C, G, or T bases.
• Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.

## Variant Filtration

• Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.

## Variant Eval

• AlleleCount stratification now supports records with ploidy other than 2.

## Combine Variants

• Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
• Now outputs the first non-missing QUAL, not the maximum.

## Select Variants

• Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
• Removed the -number argument because it gave biased results.

## Validate Variants

• Added option to selectively choose particular strict validation options.
• Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
• Improved the error message around unused ALT alleles.

## Somatic Indel Detector

• Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).

## Miscellaneous

• Fixed raw HapMap file conversion bug in VariantsToVCF.
• Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
• Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
• Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
• Fixed bug in BCF2 writer for case where all genotypes are missing.
• Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
• Fixed bug in Phase By Transmission when there are no likelihoods present.
• Fixed bug in fasta .fai generation.
• Picard jar remains at version 1.67.1197.
• Tribble jar remains at version 110.

Created 2012-09-24 20:29:35 | Updated 2012-09-24 20:29:35 | Tags: reducereads bug

We have identified a major bug in ReduceReads -- GATK versions 2.0 and 2.1. The effect of the bug is that variant regions with more than 100 reads and fewer than 250 reads get downsampled to 0 reads.

This has now been fixed in the most recent release.

To check if you are using a buggy version, run the following:

    samtools view -H BAM

This will produce output including the following line:

    @PG ID:GATK ReduceReads VN:XXX

If XXX is 2.0 or 2.1, any results obtained with your current version are suspect, and you will need to upgrade to the most recent version and rerun your processing. Our most sincere apologies for the inconvenience.

Created 2015-09-24 01:40:58 | Updated | Tags: reducereads

Dear GATK team: recently, using GATK (version 3.4-0-g7e26428) to run ReduceReads, I got this outcome:

    ##### ERROR MESSAGE: Walker ReduceReads is no longer available in the GATK; it has been deprecated since version 3.0 (use recommended best practices pipeline with the HaplotypeCaller)

From that I know that the ReduceReads walker is no longer available. But after doing BQSR, the output BAM file is nearly two times bigger than before BQSR, so I want to apply the ReduceReads command to the BQSR output BAM file. I did some research to download an earlier version of GATK from "http://gatkforums.broadinstitute.org/", but I couldn't find one. Could you give me an earlier version of GATK (before version 3.0; maybe 2.8 or 2.6 is OK)? Thank you!

Created 2015-08-24 13:13:41 | Updated | Tags: reducereads bwa-mem

Dear all, I know this is not a good place to ask my first question, but I am anxious to know the answer; if someone knows it, that would be greatly appreciated. I forgot to use the -M option in bwa mem on all my data sets. It will take another week if I rerun all steps from bwa mem to variant calling for each sample. In topics from earlier years, some people said that the number of downstream variant calls will decrease without the -M option in bwa mem. But more recent discussions said that Picard can recognize all those flags, including the split-alignment flag, in SAM files. And I got no error message in the MarkDuplicates step in Picard. So my questions are:

1. How different will the variant calling be with or without the -M option?
2. If your answer is that the -M option has to be added, I will have to run all my analyses from the beginning. Is it good to use ReduceReads to save some time for variant calling?

Thank you very much, Lisa

Created 2014-02-18 01:42:18 | Updated | Tags: reducereads

I've been following the best practices guide and I've gotten some odd-looking output from ReduceReads. Here's a sample:

    C100 16 chrM 4934 60 1M * 0 0 T 7 BD:Z:E RG:Z:JC01_L1 BI:Z:L RR:B:c,1 RS:A:1

The odd part is the CIGAR string. Is "1M" a reasonable CIGAR string? Furthermore, prior to ReduceReads, Picard's ValidateSamFile finished with no errors, and the validation for the ReduceReads output is like so:

    WARNING: Record 1, Read name 1, NM tag (nucleotide differences) is missing

That occurs for records 1-100 and then ValidateSamFile does not report any more. Here is the command line I used for ReduceReads (pwd is surrounded by backticks):

    java -Xmx2g -Djava.io.tmpdir=`pwd`/tmp -jar GATK -T ReduceReads -R $genomes/hg19.fa -I $alignments/$lane.dedup.realn.recal.bam -o $alignments/$lane.dedup.realn.recal.reduced.bam

Any advice?

Created 2013-11-29 11:01:47 | Updated | Tags: haplotypecaller reducereads

Hi. When I reduce my BAM files with ReduceReads and then run HaplotypeCaller I get this warning:

    Ignoring SAM validation error: ERROR: Read name 1060, No real operator (M|I|D|N) in CIGAR

Why is this? Is it OK to ignore it?

Created 2013-11-26 11:46:44 | Updated | Tags: reducereads

I am having trouble making sense of the behaviour of ReduceReads in an example where I know that the call is real (Sanger-validated). In the non-reduced pileup the variant is pretty solid:

    11 47364249 G 12 aA..A.A,.A.^~.

The compressed pileup looks like:

    11 47364249 G 2 .A

which I guess is OK, but the depth associated with the A call is 1 instead of the expected 5 (as evidenced when I run samtools view). Accordingly, after running the UG, I get 7 reads supporting the reference and 1 supporting the alternative:

    0/1:7,1:8:15:15,0,203

But really it should be 7/5. I am losing 4 precious reads that turn this variant into a missing call. I really can't make sense of it. I tweaked all the options I could find to keep as many reads as possible when reducing:

    -minqual 0 -minmap 0 --dont_hardclip_low_qual_tails -noclip_ad

I am using GenomeAnalysisTK-2.7-4, and I can upload a slice of the BAM file to illustrate the problem. I also attach my debugging code (as a txt file) if someone wants to see whether I am missing a key option. Is this a bug? Or am I missing something obvious?

Created 2013-11-24 19:16:47 | Updated | Tags: reducereads

Hello, I am new to the field, so please forgive me if this has a simple answer. I am attempting to call variants on 6 whole-exome sequences. The best practices documentation suggested using 30 or more samples, so I downloaded 24 BAM files from the 1000 Genomes database to use with mine. However, whenever I attempt to use ReduceReads on the files, I receive the following error right near 100% completion:

    ##### ERROR MESSAGE: BUG: requested unknown contig=NC_007605 index=-1

I am using the latest b37 reference file from the bundle, and I have tried re-indexing it and re-forming the .dict file. Here is the stack trace from the above error. What is causing this, and how do I fix it?

    ##### ERROR stack trace
    org.broadinstitute.sting.utils.exceptions.ReviewedStingException: BUG: requested unknown contig=NC_007605 index=-1
        at org.broadinstitute.sting.utils.MRUCachingSAMSequenceDictionary.updateCache(MRUCachingSAMSequenceDictionary.java:178)
        at org.broadinstitute.sting.utils.MRUCachingSAMSequenceDictionary.getSequence(MRUCachingSAMSequenceDictionary.java:109)
        at org.broadinstitute.sting.utils.GenomeLocParser.validateGenomeLoc(GenomeLocParser.java:284)
        at org.broadinstitute.sting.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:239)
        at org.broadinstitute.sting.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:449)
        at org.broadinstitute.sting.gatk.datasources.providers.ReadReferenceView.getReferenceContext(ReadReferenceView.java:98)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$2.next(TraverseReadsNano.java:139)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$2.next(TraverseReadsNano.java:128)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.aggregateMapData(TraverseReadsNano.java:119)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:101)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
        at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

Created 2013-11-20 08:06:15 | Updated | Tags: reducereads

Hello! I am using ReduceReads but it always gets blocked at the same amount of reads reduced. For instance, I tried several times on one individual and it blocks after 19.8%, and on another after 72%. I tried on the cluster and also locally on my computer, and I get the same result. However, there is no error and it seems that it is still running, just not advancing anymore. I validated my BAM file with ValidateSamFile before performing this step and there is no error. Any idea what's going on? I am using version 2.3.6 because I read that the latest version installed on the cluster I use (2.6.5) has a bug. I can't install new software, in particular the new version; however, all the old versions are still available and I read that 2.3 didn't have the bug. Thanks for your help. Muriel

Created 2013-11-12 22:13:18 | Updated | Tags: reducereads validatesamfile

I have a quick question about the changes ReduceReads makes to a file. This is the output from a ValidateSamFile run before and after using ReduceReads on the latest nightly build. Given that the best practices state that the SAM/BAM file must pass validation, this is strange, because it seems that ReduceReads is creating errors. Any thoughts?

After:

    java -jar /usr/local/picard-tools/ValidateSamFile.jar I=NA12761.mapped.ILLUMINA.bwa.CEU.exome.20121211.reduced.bam IGNORE=MISSING_TAG_NM
    [Tue Nov 12 15:01:28 MST 2013] net.sf.picard.sam.ValidateSamFile INPUT=NA12761.mapped.ILLUMINA.bwa.CEU.exome.20121211.reduced.bam IGNORE=[MISSING_TAG_NM] MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Tue Nov 12 15:01:28 MST 2013] Executing as srynearson@newrepublic.genetics.utah.edu on Linux 2.6.32-358.18.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_06_26_07_26-b00; Picard version: 1.99(1563)
    ERROR: Record 20952, Read name 20524, Mate alignment does not match alignment start of mate
    ERROR: Record 490879, Read name 482367, Mate alignment does not match alignment start of mate
    ERROR: Record 514101, Read name 505228, Mate alignment does not match alignment start of mate
    ERROR: Record 863536, Read name 849794, Mate alignment does not match alignment start of mate
    ERROR: Record 1014607, Read name 998725, Mate alignment does not match alignment start of mate
    ERROR: Record 1025644, Read name 1009614, Mate alignment does not match alignment start of mate
    ERROR: Record 1121141, Read name 1103717, Mate alignment does not match alignment start of mate
    ERROR: Record 1248134, Read name 1228854, Mate alignment does not match alignment start of mate
    ERROR: Record 1278077, Read name 1258325, Mate alignment does not match alignment start of mate
    ERROR: Record 1301946, Read name 1281828, Mate alignment does not match alignment start of mate

Before:

    java -jar /usr/local/picard-tools/ValidateSamFile.jar I=NA12761.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam IGNORE=MISSING_TAG_NM
    [Tue Nov 12 15:02:01 MST 2013] net.sf.picard.sam.ValidateSamFile INPUT=NA12761.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam IGNORE=[MISSING_TAG_NM] MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Tue Nov 12 15:02:01 MST 2013] Executing as srynearson@newrepublic.genetics.utah.edu on Linux 2.6.32-358.18.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_06_26_07_26-b00; Picard version: 1.99(1563)
    INFO 2013-11-12 15:03:45 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:01:43s. Time for last 10,000,000: 103s. Last read position: 2:74,687,627

Created 2013-11-08 10:45:37 | Updated | Tags: reducereads error picard

Hi guys, I've seen this error reported other times, for different reasons. The thing is, the BAM file I'm using to reduce the reads has been processed through the GATK pipeline without problems, realignment and recalibration included. Therefore, I assumed the BAM file generated after BQSR would be GATK-compliant. I was running with Queue, so I ran the exact command passed to the job in interactive mode, to see what happens. Below are the full command and error message (apologies for the lengthy output); there is no stack trace after the error.

    [fles@login07 reduced] 'java'  '-Xmx12288m'  '-Djava.io.tmpdir=/scratch/scratch/fles/project_analysis/reduced/tmp'  '-cp' '/home/fles/applications/Queue-2.7-4-g6f46d11/Queue.jar'  'org.broadinstitute.sting.gatk.CommandLineGATK'  '-T' 'ReduceReads'  '-I' '/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam'  '-R' '/home/fles/Scratch/gatkbundle_2.5/human_g1k_v37.fasta'  '-o' '/scratch/scratch/fles/project_analysis/reduced/projectTrios.U1_PJ5208467.recal.reduced.bam'
INFO  09:27:21,728 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:27:21,730 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:29:52
INFO  09:27:21,731 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  09:27:21,735 HelpFormatter - Program Args: -T ReduceReads -I /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam -R /home/fles/Scratch/gatkbundle_2.5/human_g1k_v37.fasta -o /scratch/scratch/fles/project_analysis/reduced/projectTrios.U1_PJ5208467.recal.reduced.bam
INFO  09:27:21,735 HelpFormatter - Date/Time: 2013/11/08 09:27:21
INFO  09:27:21,735 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:27:21,735 HelpFormatter - --------------------------------------------------------------------------------
INFO  09:27:34,156 GenomeAnalysisEngine - Strictness is SILENT
INFO  09:27:34,491 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 40
INFO  09:27:34,503 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  09:27:34,627 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.12
INFO  09:27:35,039 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  09:27:35,045 GenomeAnalysisEngine - Done preparing for traversal
INFO  09:27:35,045 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  09:27:35,080 ReadShardBalancer$1 - Loading BAM index data
INFO  09:27:35,081 ReadShardBalancer$1 - Done loading BAM index data
INFO  09:28:05,059 ProgressMeter -      1:18958138        1.00e+06   30.0 s       30.0 s      0.6%        81.8 m    81.3 m
INFO  09:28:35,069 ProgressMeter -      1:46733396        2.30e+06   60.0 s       26.0 s      1.5%        66.4 m    65.4 m
INFO  09:29:05,079 ProgressMeter -      1:92187730        3.50e+06   90.0 s       25.0 s      3.0%        50.5 m    49.0 m
INFO  09:29:35,088 ProgressMeter -     1:145281942        4.90e+06  120.0 s       24.0 s      4.7%        42.7 m    40.7 m
INFO  09:30:05,098 ProgressMeter -     1:152323864        6.40e+06    2.5 m       23.0 s      4.9%        50.9 m    48.4 m
INFO  09:30:35,893 ProgressMeter -     1:181206886        7.70e+06    3.0 m       23.0 s      5.8%        51.4 m    48.4 m
INFO  09:31:05,902 ProgressMeter -     1:217604563        8.90e+06    3.5 m       23.0 s      7.0%        49.9 m    46.4 m
INFO  09:31:35,913 ProgressMeter -      2:14782401        1.02e+07    4.0 m       23.0 s      8.5%        47.0 m    43.0 m
INFO  09:32:05,922 ProgressMeter -      2:62429207        1.15e+07    4.5 m       23.0 s     10.0%        44.8 m    40.3 m
INFO  09:32:35,931 ProgressMeter -      2:97877374        1.28e+07    5.0 m       23.0 s     11.2%        44.7 m    39.7 m
INFO  09:33:06,218 ProgressMeter -     2:135574018        1.42e+07    5.5 m       23.0 s     12.4%        44.5 m    38.9 m
INFO  09:33:36,227 ProgressMeter -     2:179431307        1.56e+07    6.0 m       23.0 s     13.8%        43.5 m    37.5 m
INFO  09:34:06,237 ProgressMeter -     2:216279690        1.69e+07    6.5 m       23.0 s     15.0%        43.4 m    36.9 m
INFO  09:34:36,248 ProgressMeter -      3:14974731        1.81e+07    7.0 m       23.0 s     16.4%        42.9 m    35.9 m
INFO  09:35:07,073 ProgressMeter -      3:52443620        1.94e+07    7.5 m       23.0 s     17.6%        42.9 m    35.4 m
INFO  09:35:37,084 ProgressMeter -     3:111366536        2.05e+07    8.0 m       23.0 s     19.5%        41.3 m    33.2 m
INFO  09:36:07,094 ProgressMeter -     3:155571144        2.18e+07    8.5 m       23.0 s     20.9%        40.8 m    32.3 m
INFO  09:36:37,103 ProgressMeter -       4:3495327        2.31e+07    9.0 m       23.0 s     22.4%        40.4 m    31.3 m
INFO  09:37:07,114 ProgressMeter -      4:48178306        2.43e+07    9.5 m       23.0 s     23.8%        40.0 m    30.5 m
INFO  09:37:37,270 ProgressMeter -     4:106747046        2.56e+07   10.0 m       23.0 s     25.7%        39.0 m    29.0 m
INFO  09:38:07,483 ProgressMeter -     4:181303657        2.69e+07   10.5 m       23.0 s     28.1%        37.5 m    26.9 m
INFO  09:38:37,493 ProgressMeter -      5:41149454        2.81e+07   11.0 m       23.0 s     29.7%        37.1 m    26.1 m
INFO  09:38:51,094 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: SAM/BAM file /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam is malformed: Read error; BinaryCodec in readmode; file: /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam
##### ERROR ------------------------------------------------------------------------------------------

Following your usual advice, I validated the bam file produced by BQSR with Picard, and I get the exact same error, but with no specific error indication:

    [fles@login07 reduced]$ java -jar ~/applications/picard-tools-1.102/ValidateSamFile.jar \
        INPUT=/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam \
        IGNORE_WARNINGS=TRUE

[Fri Nov 08 09:59:42 GMT 2013] net.sf.picard.sam.ValidateSamFile INPUT=/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam IGNORE_WARNINGS=true MAX_OUTPUT=100 VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Fri Nov 08 09:59:42 GMT 2013] Executing as fles@login07 on Linux 2.6.18-194.11.4.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version: 1.102(1591)
INFO 2013-11-08 10:01:01 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:01:18s. Time for last 10,000,000: 78s. Last read position: 1:204,966,172
INFO 2013-11-08 10:02:19 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:02:36s. Time for last 10,000,000: 78s. Last read position: 2:232,121,396
INFO 2013-11-08 10:03:36 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:03:54s. Time for last 10,000,000: 77s. Last read position: 4:123,140,629
[Fri Nov 08 10:04:00 GMT 2013] net.sf.picard.sam.ValidateSamFile done. Elapsed time: 4.30 minutes.
Runtime.totalMemory()=300941312
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" net.sf.samtools.util.RuntimeIOException: Read error; BinaryCodec in readmode; file: /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam
	at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:397)
	at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:371)
	at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:357)
	at net.sf.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:200)
	at net.sf.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:558)
	at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:532)
	at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522)
	at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481)
	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:687)
	at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:665)
	at net.sf.picard.sam.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:241)
	at net.sf.picard.sam.SamFileValidator.validateSamFile(SamFileValidator.java:177)
	at net.sf.picard.sam.SamFileValidator.validateSamFileSummary(SamFileValidator.java:104)
	at net.sf.picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:164)
	at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
	at net.sf.picard.sam.ValidateSamFile.main(ValidateSamFile.java:100)
Caused by: java.io.IOException: Unexpected compressed block length: 1
	at net.sf.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:358)
	at net.sf.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:113)
	at net.sf.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:238)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:395)

Any suggestions on what I might be doing wrong?

Created 2013-11-07 17:34:46 | Updated | Tags: printreads reducereads

I used BaseRecalibrator and generated a bam file using PrintReads, along with the .bai index, which was generated automatically. However, when I try to reduce the bam file using ReduceReads, I get the error below:

##### ERROR MESSAGE: SAM/BAM file /bam_base_recalib/dedup_51769_R1_R2_RG_realigned_recal_reads.bam is malformed: Premature EOF; BinaryCodec in readmode; file: /bam_base_recalib/dedup_51769_R1_R2_RG_realigned_recal_reads.bam

Created 2013-10-22 15:46:35 | Updated | Tags: reducereads

Can I use samtools index to index reduced bam files? I'm using GATK version 2.7-2-g6bda569 and HC asks for the index files. Thanks,

Created 2013-10-15 16:52:31 | Updated 2013-10-15 17:12:00 | Tags: reducereads gatk

I'm using GATK 2.7-4 on my server with Java 1.7 on Mac. I'm running through many .bam files that were produced by the upstream PrintReads, using a for-loop in a shell script to send all the .bam files through ReduceReads. Some of these bam files were not compressed by GATK ReduceReads, and it gives 0B .bam output files. The command line I use is the following:

java -Xmx10g -Djava.awt.headless=true -jar $CLASSPATH/GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R ./GATK_ref/hg19.fasta \
    -S LENIENT \
    -log ../GATK/BQSR/log/file1.compress.log \
    -I ../GATK/BQSR/file1.recal.bam \
    -o ../GATK/BQSR/file1.compressed.bam

I'm copying the tail part of the log for one of the failed .bam files. From this log, I don't see any error. Maybe the ReduceReads walker just terminated itself early?

INFO 10:28:13,654 ProgressMeter - chr4:48710213 2.47e+07 19.2 m 46.0 s 23.6% 81.4 m 62.2 m
INFO 10:28:43,657 ProgressMeter - chr4:83322785 2.54e+07 19.7 m 46.0 s 24.7% 79.8 m 60.1 m
INFO 10:29:13,659 ProgressMeter - chr4:112743605 2.60e+07 20.2 m 46.0 s 25.6% 78.8 m 58.6 m
INFO 10:29:43,662 ProgressMeter - chr4:151294343 2.67e+07 20.7 m 46.0 s 26.8% 77.1 m 56.4 m
INFO 10:30:13,664 ProgressMeter - chr4:184999795 2.72e+07 21.2 m 46.0 s 27.9% 75.9 m 54.7 m
INFO 10:30:43,667 ProgressMeter - chr5:14341390 2.79e+07 21.7 m 46.0 s 28.6% 75.9 m 54.2 m
INFO 10:31:13,796 ProgressMeter - chr5:37088471 2.85e+07 22.2 m 46.0 s 29.3% 75.7 m 53.6 m
INFO 10:31:43,798 ProgressMeter - chr5:67907077 2.92e+07 22.7 m 46.0 s 30.3% 74.9 m 52.3 m
INFO 10:32:13,914 ProgressMeter - chr5:90107126 2.97e+07 23.2 m 46.0 s 31.0% 74.8 m 51.7 m
INFO 10:32:44,969 ProgressMeter - chr5:127450350 3.03e+07 23.7 m 46.0 s 32.2% 73.7 m 50.0 m
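Since logs like the one above can end without any error even though the output is empty, it can help to sanity-check each output before using it downstream. A minimal illustrative sketch (not part of GATK; it relies on the 28-byte BGZF end-of-file marker that the SAM/BAM specification requires at the end of a complete BAM file) that flags empty or truncated outputs:

```python
import os

# The 28-byte BGZF EOF marker that a complete BAM file ends with,
# as defined in the SAM/BAM specification.
BGZF_EOF = bytes([
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00,
    0x00, 0xff, 0x06, 0x00, 0x42, 0x43, 0x02, 0x00,
    0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00,
])

def bam_looks_complete(path):
    """Return True if the file is non-empty and ends with the BGZF EOF block."""
    size = os.path.getsize(path)
    if size < len(BGZF_EOF):
        return False  # catches the 0B outputs described above
    with open(path, "rb") as fh:
        fh.seek(-len(BGZF_EOF), os.SEEK_END)
        return fh.read(len(BGZF_EOF)) == BGZF_EOF
```

Running a check like this over the loop's outputs would catch both the 0B files and truncated files of the kind that later trigger "Unexpected compressed block length" or "Premature EOF" errors.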

Created 2013-10-13 06:06:13 | Updated | Tags: reducereads gatk

I recently upgraded from GATK 2.5 to the latest 2.7-4 stable release, but ReduceReads throws the following error when I try to run it with a bam file that was produced by PrintReads (3 samples merged in one bam file).

MESSAGE: Bad input: Reduce Reads is not meant to be run for more than 1 sample at a time except for the specific case of tumor/normal >pairs in cancer analysis

##### ERROR stack trace

java.lang.NullPointerException
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.closeVariantRegions(SingleSampleCompressor.java:83)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.closeVariantRegionsInAllSamples(MultiSampleCompressor.java:94)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:76)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:67)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:387)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:87)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:226)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:215)
	at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:254)
	at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
	at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
	at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
	at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
	at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
	at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
	at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

##### ERROR ------------------------------------------------------------------------------------------

Is there something wrong with running RR with multiple samples?

d
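For anyone hitting the single-sample restriction above, it is easy to check up front whether a merged BAM actually carries more than one sample. A minimal illustrative sketch (assuming the header text has been dumped first, e.g. with `samtools view -H in.bam`) that collects the distinct SM tags from the @RG lines:

```python
def samples_in_header(header_text):
    """Collect the distinct SM (sample) names from the @RG lines of a SAM header."""
    samples = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        for field in line.split("\t"):
            if field.startswith("SM:"):
                samples.add(field[3:])
    return sorted(samples)
```

If this returns more than one name, the merged BAM would need to be split back into per-sample BAMs before running ReduceReads, which is what the error message is asking for.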

Created 2013-01-10 16:27:19 | Updated | Tags: unifiedgenotyper reducereads

Hello,

I'm trying to call variants using UnifiedGenotyper on ca. 450 reduced bams in 100,000 bp chunks. It works fine for some of the chunks, but for others I get the following error message:

##### ERROR MESSAGE: SAM/BAM nameOfBam.bam is malformed: Unexpected compressed block length: 1

Can anyone explain to me why there is a problem with a specific bam file when I call on for example chunk chr20:25400000-25500000 but not when I call on chunk chr20:10000000-10100000?

Thank you, Tota
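For chunked calling setups like the one described above, generating the -L interval strings programmatically keeps chunk boundaries consistent across runs. A small illustrative helper (the function name and default chunk size are just placeholders, not part of GATK):

```python
def make_chunks(contig, start, end, step=100_000):
    """Split [start, end] (1-based, inclusive) into GATK-style -L interval strings."""
    chunks = []
    pos = start
    while pos <= end:
        chunk_end = min(pos + step - 1, end)  # clamp the last chunk to the region end
        chunks.append(f"{contig}:{pos}-{chunk_end}")
        pos += step
    return chunks
```

For example, `make_chunks("chr20", 25400001, 25700000)` yields three adjacent, non-overlapping 100 kb intervals suitable for passing to -L one at a time.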

Created 2013-01-09 17:40:00 | Updated | Tags: reducereads

I have read on your recent slides for "Data Compression with Reduce Reads" that "Tumor and Normal samples (or any set of samples) get co-reduced, meaning that every variable region triggered by one sample will be forced in every sample."

I have data from 4 variant strains of an organism, my samples in RG info, and 4 individuals for each strain, my libraries in RG info. Currently I have a bam file for each of the 16 different libraries.

I want to run ReduceReads since I have quite high coverage, but I also want to preserve information across all of my samples wherever a site is non-consensus in even one of them, as there is no SNP information available for this organism and I don't want to lose any important data. Should I merge all bam files for all samples before proceeding with ReduceReads with downsampling turned off? Or just leave out ReduceReads?

Thanks Anna

Created 2012-12-18 22:13:32 | Updated | Tags: reducereads reviewedstingexception

New gatk version... trying out ReduceReads again.

6 of 8 exomes I tried were processed by ReduceReads just fine, but two throw the exception Removed too many insertions, header is now negative! (at different genomic locations).

I did not find any mention of this error in the GATK forums, is this a known problem?

Command line: java -Xmx6g -jar GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T ReduceReads -o test.rr.bam -I rr-too-many-insertions.bam

java -v: java version "1.6.0_27" Java(TM) SE Runtime Environment (build 1.6.0_27-b07) Java HotSpot(TM) 64-Bit Server VM (build 20.2-b06, mixed mode)

Run log:

INFO  16:03:26,898 HelpFormatter - --------------------------------------------------------------------------------
INFO  16:03:27,382 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-0-g9593e74, Compiled 2012/12/17 16:58:19
INFO  16:03:27,383 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  16:03:27,388 HelpFormatter - Program Args: -R human_g1k_v37.fasta -T ReduceReads -o test.rr.bam -I rr-too-many-insertions.bam
INFO  16:03:27,388 HelpFormatter - Date/Time: 2012/12/18 16:03:26
INFO  16:03:27,388 HelpFormatter - --------------------------------------------------------------------------------
INFO  16:03:27,388 HelpFormatter - --------------------------------------------------------------------------------
INFO  16:03:27,471 GenomeAnalysisEngine - Strictness is SILENT
INFO  16:03:27,577 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  16:03:27,585 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  16:03:27,620 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO  16:03:27,656 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  16:03:27,714 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO  16:03:27,717 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO  16:03:27,739 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO  16:03:28,739 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Removed too many insertions, header is now negative!
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.HeaderElement.removeInsertionToTheRight(HeaderElement.java:151)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.updateHeaderCounts(SlidingWindow.java:881)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.removeFromHeader(SlidingWindow.java:816)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.compressVariantRegion(SlidingWindow.java:604)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegion(SlidingWindow.java:623)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegions(SlidingWindow.java:643)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.closeVariantRegions(SingleSampleCompressor.java:83)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.closeVariantRegionsInAllSamples(MultiSampleCompressor.java:94)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:76)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:67)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:387)
	at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:87)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:226)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:215)
	at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:254)
	at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
	at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
	at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
	at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
	at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
	at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
	at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
	at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:94)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-0-g9593e74):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Removed too many insertions, header is now negative!
##### ERROR ------------------------------------------------------------------------------------------

(There is no progress listed here because this log is from after I bisected to find a narrow region where the problem is occurring.)
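The bisection the poster mentions can be automated with a simple interval search: re-run the failing command on each half of the interval with -L and keep whichever half still crashes. A generic illustrative sketch, where `fails(lo, hi)` is a hypothetical wrapper (not shown) that runs the tool on that interval and reports whether it reproduced the error:

```python
def bisect_failing_region(start, end, fails, min_width=1_000):
    """Narrow [start, end] to a small sub-interval on which `fails` is still True.

    `fails(lo, hi)` must return True when the tool crashes on that interval.
    Assumes the failure is reproducible and localized.
    """
    assert fails(start, end), "the full interval must reproduce the failure"
    while end - start > min_width:
        mid = (start + end) // 2
        if fails(start, mid):
            end = mid          # failure reproduces in the left half
        elif fails(mid + 1, end):
            start = mid + 1    # failure reproduces in the right half
        else:
            break              # failure needs both halves (e.g. spans the midpoint)
    return start, end
```

Each iteration costs one or two re-runs of the tool, so narrowing a whole contig to a kilobase-scale region takes on the order of twenty runs.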
Created 2012-11-06 19:19:57 | Updated 2013-01-07 20:08:54 | Tags: reducereads crest

I've tried using the output of the reduced bams as an input to Crest (after some preprocessing) but it hangs on chr7. Has anyone else used the reduced bam in other programs? Is this output meant to only be used in GATK? Thanks!

Created 2012-11-04 10:16:50 | Updated 2012-11-05 02:28:44 | Tags: haplotypecaller reducereads dbsnp error reviewedstingexception

Hello dear GATK Team, when trying to run HaplotypeCaller on my exome files prepared with ReduceReads I get the error stated below. As you can see, the newest GATK version is used. Also, UnifiedGenotyper does not produce any errors on the exact same data (90 SOLiD exomes created according to Best Practice v4).

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Too many deletions?
	at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:447)
	at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:396)
	at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:392)
	at org.broadinstitute.sting.gatk.walkers.annotator.DepthOfCoverage.annotate(DepthOfCoverage.java:56)
	at org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation.annotate(InfoFieldAnnotation.java:24)
	at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:223)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:429)
	at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:104)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegion(TraverseActiveRegions.java:249)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.callWalkerMapOnActiveRegions(TraverseActiveRegions.java:204)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegions(TraverseActiveRegions.java:179)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:136)
	at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:29)
	at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:74)
	at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
	at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
	at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
	at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
	at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-3-gde33222):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
##### ERROR ------------------------------------------------------------------------------------------

The command line used (abbreviated):

java -Xmx30g -jar /home/common/GenomeAnalysisTK-2.2-3/GenomeAnalysisTK.jar \
    -R /home/common/hg19/ucschg19/ucsc.hg19.fasta \
    -T HaplotypeCaller \
    -I ReduceReads/XXXXX.ontarget.MarkDups.nRG.reor.Real.Recal.reduced.bam [x90] \
    --dbsnp /home/common/hg19/dbsnp_135.hg19.vcf \
    -o 93Ind_ped_reduced_HC_snps.raw.vcf \
    -ped familys.ped \
    --pedigreeValidationType SILENT \
    -stand_call_conf 20.0 \
    -stand_emit_conf 10.0

Created 2012-10-19 18:14:30 | Updated 2012-10-19 18:24:47 | Tags: reducereads ad dp

Does the relationship between AD and DP still hold in VCFs produced from ReduceReads BAMs? That is, is the sum of AD <= DP, or can other scenarios now occur? Also, is AD summarized to 1,0 or 0,1 for homozygous REF and ALT? Thanks.

Created 2012-10-15 18:59:35 | Updated 2012-10-18 00:50:04 | Tags: reducereads performance

Hi Team, I have been running GATK2 ReduceReads on a large (100Gb) bam file, and even though at the very beginning it runs very smoothly and predicts a week for finishing the task, after a few hours it gets totally stuck. We first thought that it could be a garbage collection (or java memory allocation) issue, but the logs show that the garbage collection works well.
The command is (similar behavior for smaller Xms and Xmx values):

java -Xmx30g -Xms30g -XX:+PrintGCTimeStamps -XX:+UseParallelOldGC -XX:+PrintGCDetails -Xloggc:gc.log -verbose:gc -jar $path $ref -T ReduceReads -I input.bam -o output.bam

The first few lines of the log file are:

INFO 01:12:21,541 TraversalEngine - chr1:1094599 5.89e+05 9.9 m 16.8 m 0.0% 19.4 d 19.4 d
INFO 01:13:21,628 TraversalEngine - chr1:2112411 9.44e+05 10.9 m 11.6 m 0.1% 11.2 d 11.2 d
INFO 01:14:22,065 TraversalEngine - chr1:3051535 1.29e+06 11.9 m 9.3 m 0.1% 8.5 d 8.5 d
INFO 01:15:22,297 TraversalEngine - chr1:4084547 1.59e+06 12.9 m 8.1 m 0.1% 6.9 d 6.9 d
INFO 01:16:24,130 TraversalEngine - chr1:4719991 1.82e+06 13.9 m 7.7 m 0.2% 6.4 d 6.4 d

but after a short while it gets totally stuck, and even at location 121485073 of chromosome 1 there is almost no progress at all, and the estimated finish time goes over 11 weeks and is still increasing. Any idea what the reason for this could be, and how we can solve the problem? The same command runs successfully on small (less than 5 GB) bam files, though. Thanks in advance. --Sina

Created 2012-09-25 10:56:19 | Updated 2012-09-25 10:56:19 | Tags: reducereads

Hi, I'm running GATK version 2.1-8 with reads mapped to mm10. ReduceReads fails somewhere on chr2 with the above message. From previous posts I understood that this bug has appeared already? Could you please help me to fix it? Thank you, Ania

Created 2012-09-13 15:20:22 | Updated 2012-09-13 16:19:18 | Tags: unifiedgenotyper reducereads

Hi, I'm trying to use GATK release 2.0 with my nine exome-seq samples. Following the steps in the best practices, I generated per-sample, ready-to-process .bam files and then used -T ReduceReads to generate .reduced.bam files for the next step (-T UnifiedGenotyper). When using these .reduced.bam files as UG input I receive this error message: "##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?" If I take my original .bam files as input, things work smoothly. Do you have any idea what causes the problem? Thanks a lot, Samira

Here are the command lines I use:

java -Xmx4g -jar $GATKv4 \
    -R $GATK_BUNDLE/ucsc.hg19.fasta \
    -T ReduceReads \
    -L $capture_library.bed \
    -I $i.recal_s.bam \
    -o $i.reduced.bam

java -jar $GATKv4 \
    -T HaplotypeCaller \
    -R $GATK_BUNDLE/ucsc.hg19.fasta \
    -I InputReducedBams.list \

##### ERROR ------------------------------------------------------------------------------------------

I run it with the standard arguments:

java -jar GenomAnalysisTK.jar \
    --logging_level ERROR \
    -R hg19.fa \
    -T ReduceReads \
    -I in.bam \
    -o reduced.in.bam

Any suggestions?

Thanks, Thomas

Created 2012-08-01 23:26:12 | Updated 2012-08-01 23:26:12 | Tags: reducereads parallel

I'm working with ReduceReads and would like to use it in some kind of parallel mode. The presentation mentions that a 50-way run may drastically reduce run time, but I'm not sure how to invoke this. I tried -nt and it complained. Should I be giving it multiple intervals and merging? If so, how does it deal with edge variants?

Thanks.
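For reference, the usual way to parallelize a walker that rejects -nt is scatter-gather: run the tool once per interval with -L, then merge the outputs (this is the pattern Queue automates for GATK walkers). A sketch that just assembles the per-interval command lines; the jar path, reference, and output naming scheme are placeholders, not prescribed names:

```python
def scatter_commands(jar, ref, in_bam, out_prefix, intervals):
    """Build one ReduceReads command line per -L interval for a scatter-gather run."""
    cmds = []
    for i, interval in enumerate(intervals):
        out_bam = f"{out_prefix}.scatter{i}.bam"  # one shard per interval
        cmds.append(
            f"java -jar {jar} -T ReduceReads -R {ref} "
            f"-I {in_bam} -L {interval} -o {out_bam}"
        )
    return cmds
```

The per-interval shards would then be merged back into one BAM (e.g. with Picard's MergeSamFiles). Scattering by whole contigs, or padding the intervals, is the usual way to avoid losing context for variants near chunk edges.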