Tagged with #reducereads
3 documentation articles | 6 announcements | 36 forum discussions


Comments (2)

Objective

Compress the read data in order to minimize file sizes, which facilitates massively multisample processing.

Prerequisites

  • TBD

Steps

  1. Compress your sequence data

1. Compress your sequence data

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T ReduceReads \ 
    -R reference.fa \ 
    -I recal_reads.bam \ 
    -L 20 \ 
    -o reduced_reads.bam 

Expected Result

This creates a file called reduced_reads.bam containing only the sequence information that is essential for calling variants.

Note that ReduceReads is not meant to be run on multiple samples at once. If you plan on merging your sample bam files, you should run ReduceReads on individual samples before doing so.

Comments (8)

What is a synthetic read?

When running ReduceReads, the algorithm finds regions of low variation in the genome and compresses them together. To represent a compressed region, we use a synthetic read that carries all the information necessary for downstream tools to perform likelihood calculations over the reduced data.

They are called synthetic because they are not read by a sequencer; these reads are generated automatically by the GATK and can be extremely long. In a synthetic read, each base represents the consensus base for that genomic location, and each base's consensus quality score is stored at the equivalent offset in the quality score string.

Consensus Bases

ReduceReads has several filtering parameters for consensus regions. Consensus is created based on base qualities, mapping qualities and other parameters adjustable from the command line. All filters are described in the ReduceReads technical documentation.

Consensus Quality Scores

The consensus quality score of a consensus base is essentially the mean of the quality scores of all bases that passed all the filters and represent an observation of that base. It is stored in the quality score field of the SAM format:

    q_{consensus} = \frac{1}{n} \sum_{i=1}^{n} q_i

where n is the number of bases that contributed to the consensus base and q_i is the corresponding quality score of each base.

Insertion quality scores and Deletion quality scores (generated by BQSR) will undergo the same process and will be represented the same way.

Mapping Quality

The mapping quality of a synthetic read summarizes the mapping qualities of all the reads that contributed to it: it is the root mean square (RMS) of those mapping qualities. It is stored in the mapping quality field of the SAM format:

    MQ = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}

where n is the number of reads and x_i is the mapping quality of each read.
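
To make the two calculations above concrete, here is a minimal Java sketch (not the GATK source; the class and method names are hypothetical) of both scores:

    public final class ConsensusScores {

        // Consensus base quality: the mean of the quality scores of all
        // bases that passed the filters at this position.
        static int consensusQuality(final int[] baseQualities) {
            long sum = 0;
            for (final int q : baseQualities) sum += q;
            return (int) (sum / baseQualities.length);
        }

        // Synthetic-read mapping quality: the root mean square of the
        // mapping qualities of all reads that contributed to it.
        static int syntheticMappingQuality(final int[] mappingQualities) {
            double sumOfSquares = 0;
            for (final int x : mappingQualities) sumOfSquares += (double) x * x;
            return (int) Math.round(Math.sqrt(sumOfSquares / mappingQualities.length));
        }

        public static void main(String[] args) {
            System.out.println(consensusQuality(new int[]{30, 32, 34}));        // prints 32
            System.out.println(syntheticMappingQuality(new int[]{60, 60, 20})); // prints 50
        }
    }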

Original Alignments

A synthetic read may carry up to two extra tags representing its original alignment information. Because of the many filters in ReduceReads, reads are hard-clipped to the area of interest. These hard clips are always represented in the CIGAR string with the H operator and the length of the clipping in genomic coordinates. Sometimes hard clipping makes it impossible to recover the original alignment start or end of a read. In those cases, the read carries extra tags with integer values representing its original alignment start or end.

Here are the two integer tags:

  • OP -- original alignment start
  • OE -- original alignment end

For all other reads, where this information can still be obtained through the CIGAR string (i.e. using getAlignmentStart() or getUnclippedStart()), these tags are not created.

The RR Tag

The RR tag holds the observed depth (after filters) of every base that contributed to a reduced read, that is, all bases that passed the mapping and base quality filters and carried the same observation as the one in the reduced read.

The RR tag carries an array of bytes and, for increased compression, works like this: the first number represents the depth of the first base in the reduced read, and every subsequent number represents that base's depth offset from the first. Therefore, to calculate the depth of base i using the RR array, one must use:

RR[0] + RR[i]

but make sure i > 0. Here is the code we use to return the depth of the i-th base:

return (i==0) ? firstCount : (byte) Math.min(firstCount + offsetCount, Byte.MAX_VALUE);
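
For illustration, here is a self-contained sketch (hypothetical names, not the GATK source) showing how an RR-style offset array can be decoded:

    public final class RRDecoder {

        // rr[0] holds the depth of the first base; every later entry holds
        // the offset from that first depth, capped at Byte.MAX_VALUE (127).
        static byte depthOfBase(final byte[] rr, final int i) {
            return (i == 0) ? rr[0] : (byte) Math.min(rr[0] + rr[i], Byte.MAX_VALUE);
        }

        public static void main(String[] args) {
            final byte[] rr = {20, 0, 1, -2};   // decoded depths: 20, 20, 21, 18
            for (int i = 0; i < rr.length; i++)
                System.out.println("base " + i + " depth = " + depthOfBase(rr, i));
        }
    }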


Using Synthetic Reads with GATK tools

The GATK is 100% compatible with synthetic reads. You can use reduced BAM files in combination with non-reduced BAM files in any GATK analysis tool and it will work seamlessly.

Programming in the GATK

If you are programming with the GATK framework, the GATKSAMRecord class carries all the functionality necessary to use synthetic reads transparently, with methods like:

  • public final byte getReducedCount(final int i)
  • public int getOriginalAlignmentStart()
  • public int getOriginalAlignmentEnd()
  • public boolean isReducedRead()
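
For example, a walker that weights per-base evidence can treat reduced and non-reduced reads uniformly with a small helper like the following (an illustrative sketch that assumes the GATK 2.x package layout; the helper class itself is hypothetical):

    import org.broadinstitute.sting.utils.sam.GATKSAMRecord;

    public final class ReducedReadSupport {

        // Number of original observations backing base i of a read: a normal
        // read is a single observation, while each base of a synthetic read
        // stands in for getReducedCount(i) filtered observations.
        public static int observationCount(final GATKSAMRecord read, final int i) {
            return read.isReducedRead() ? read.getReducedCount(i) : 1;
        }
    }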
Comments (0)

A new tool has been released!

Check out the documentation at ReduceReads.

Comments (2)

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

One important change for those who prefer to build from source is that we now use Maven instead of Ant. See the relevant documentation for building the GATK with our new build system.


SplitNCigarReads

  • This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

Haplotype Caller

  • Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
  • Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
  • Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
  • Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
  • Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.

CombineGVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to throw at all at once to GenotypeGVCFs.

GenotypeGVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and to genotype them in order to produce a VCF with raw SNP and indel calls.

SimulateReadsForVariants

  • This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

Unified Genotyper

  • Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

Variant Recalibrator

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Reduce Reads

  • Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

Variant Annotator

  • Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
  • Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1. Note that this only works in the gVCF pipeline for now.
  • In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

Combine Variants

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Select Variants

  • Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
  • Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.

CalculateGenotypePosteriors

  • Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
  • It no longer strips out the AD annotation from genotypes.
  • AC/AF/AN counts are updated after fixing genotypes.
  • Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

Indel Realigner

  • Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

Read Backed Phasing

  • Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.

Miscellaneous

  • Thanks to Phillip Dexheimer for several Queue related fixes and patches.
  • Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
  • Providing an empty BAM list now results in a user error.
  • Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
  • Logger output now goes to STDERR instead of STDOUT.
  • Picard, Tribble, and Variant jars updated to version 1.107.1683.
Comments (5)

This is an important heads-up regarding the GATK 3.0 release.

The purpose of the ReduceReads algorithm was to enable joint analysis of large cohorts by the UnifiedGenotyper. The new workflow for joint discovery, which involves doing a single-sample pass with the HaplotypeCaller in gVCF mode followed by a joint analysis on multiple sample gVCFs, renders the compression step obsolete.

In addition, based on our most recent analyses, we have come to the conclusion that the quality of variant calls made on BAMs compressed with ReduceReads is inferior to the standard targeted by GATK tools. In comparison, the results obtained with the new workflow are far superior.

For these reasons, we have made the difficult decision to remove the ReduceReads tool from version 3.0 of the toolkit. To be clear, reduced BAMs will NOT be supported in GATK 3.0.

We realize that this may cause some disruption to your existing workflows, and for that we apologize. Please understand that we are driven to provide tools that produce the best possible results. Now that all the data is in, we have found that the best results cannot be achieved with reduced BAMs, so we feel that the best thing to do is to remove this inferior tool from the toolkit, and promote the new tools.

As always we welcome your comments, and we look forward to showing you how the new calling workflow will yield superior results.

Comments (2)

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.


Unified Genotyper

  • Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

Haplotype Caller

  • Improved the indexing scheme for gVCF outputs using the reference calculation model.
  • The reference calculation model now works with reduced reads.
  • Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
  • Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

Variant Recalibrator

  • Disable tranche plots in INDEL mode.
  • Various VQSR optimizations in both runtime and accuracy. In particular: for very large whole-genome datasets with over 2M variants overlapping the training data, the training set used to build the model is randomly downsampled; annotations are ordered by the difference in means between known and novel variants instead of by their standard deviation; the training set quality score threshold was removed; the negative model now uses 2 Gaussians by default; and the numBad argument has been removed, with the cutoffs now chosen by the model itself by looking at the LOD scores.

Reduce Reads

  • Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

Diagnose Targets

  • Added calculation for GC content.
  • Added an option to filter the bases based on their quality scores.

Combine Variants

  • Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

Select Variants

  • Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.

Miscellaneous

  • SplitSamFile now produces an index with the BAM.
  • Length metric updates to QualifyMissingIntervals.
  • Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
  • Picard jar updated to version 1.104.1628.
  • Tribble jar updated to version 1.104.1628.
  • Variant jar updated to version 1.104.1628.
Comments (2)

GATK 2.7 was released on August 21, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history


Reduce Reads

  • Changed the underlying convention of having unstranded reduced reads; instead there are now at least 2 compressed reads at every position, one for each strand (forward and reverse). This allows us to maintain strand information that is useful for downstream filtering.
  • Fixed bug where representative depths were arbitrarily being capped at 127 (instead of the expected 255).
  • Fixed bug where insertions downstream of a variant region weren't triggering a stop to the compression.
  • Fixed bug when using --cancer_mode where alignments were being emitted out of order (and causing the tool to fail).

Unified Genotyper

  • Added --onlyEmitSamples argument that, when provided, instructs the caller to emit only the selected samples into the VCF (even though the calling is performed over all samples present in the provided bam files).
  • FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
  • Added a (very) experimental argument (allSitePLs) that will have the caller emit PLs for all sites (including reference sites). Note that this does not give a fully accurate reference model because it models only SNPs. For proper handling of the reference model, please use the Haplotype Caller.

Haplotype Caller

  • Added a still somewhat experimental PCR indel error model to the Haplotype Caller. By default this modeling is turned on and is very useful for removing false positive indel calls associated with PCR slippage around short tandem repeats (esp. homopolymers). Users have the option (with the --pcr_indel_model argument) of turning it off or making it even more aggressive (at the expense of losing some true positives too).
  • Added the ability to emit accurate likelihoods for non-variant positions (i.e. what we call a "reference model" that incorporates indels as well as SNP confidences at every position). The output format can be either a record for every position or use the gVCF style recording of blocks. See the --emitRefConfidence argument for more details; note that this replaces the use of "--output_mode EMIT_ALL_SITES" in the HaplotypeCaller.
  • Improvements to the internal likelihoods that are generated by the Haplotype Caller. Specifically, this tool now uses a tri-state correction like the Unified Genotyper, corrects for overlapping read pairs (from the same underlying fragment), and does not run contamination removal (allele-biased downsampling) by default.
  • Several small runtime performance improvements were added (although we are still hard at work on larger improvements that will allow calling to scale to many samples; we're just not there yet).
  • Fixed bug in how adapter clipping was performed (we now clip only after reverting soft-clipped bases).
  • FPGA support was added to the underlying HMM that is automatically used when the appropriate hardware is available on the machine.
  • Improved the "dangling tail" recovery in the assembly algorithm, which allows for higher sensitivity in calling variants at the edges of coverage (e.g. near the ends of targets in an exome).
  • Added the ability to run allele-biased downsampling with different per-sample values like the Unified Genotyper (contributed by Yossi Farjoun).

Variant Annotator

  • Fixed bug where only the last -comp was being annotated at a site.

Indel Realigner

  • Fixed a bug, triggered by secondary alignments, that was causing the tool not to update the alignment start of the mate when a read was realigned.

Phase By Transmission

  • Fixed bug where multi-allelic records were being completely dropped by this tool. Now they are emitted unphased.

Variant Recalibrator

  • General improvements to the Gaussian modeling, mostly centered around separating the parameters for the positive and negative training models.
  • The percentBadVariants argument has been replaced with the numBad argument.
  • Added mode to not emit (at all) variant records that are filtered out.
  • This tool now automatically orders the annotation dimensions by their standard deviation instead of the order they were specified on the command-line in order to stabilize the training and have it produce optimal results.
  • Fixed bug where the tool occasionally produced bad log10 values internally.

Miscellaneous

  • General performance improvements to the VCF reading code contributed by Michael McCowan.
  • Error messages are much less verbose and "scary."
  • Added a LibraryReadFilter contributed by Louis Bergelson.
  • Fixed the ReadBackedPileup class to represent mapping qualities as ints, not (signed) bytes.
  • Added the engine-wide ability to do on-the-fly BAM file sample renaming at runtime (see the documentation for the --sample_rename_mapping_file argument for more details).
  • Fixed bug in how the GATK counts filtered reads in the traversal output.
  • Added a new tool called Qualify Intervals.
  • Fixed major bug in the BCF encoding (the previous version was producing problematic files that were failing when trying to be read back into the GATK).
  • Picard/sam/tribble/variant jars updated to version 1.96.1534.
Comments (2)

GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Base Quality Score Recalibration

  • Improved the algorithm around homopolymer runs to use a "delocalized context".
  • Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
  • Fixed bug where the tool failed for reads that begin with insertions.
  • Fixed bug in the scatter-gather functionality.
  • Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).

Unified Genotyper

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • The genotyper no longer emits the Strand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
  • Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
  • Fixed annotations (AD, FS, DP) that were miscalculated when run on a Reduce Reads processed bam.
  • Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
  • Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
  • Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
  • Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
  • The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
  • Fixed annotations (particularly AD) for indel calls; previous versions didn't bin reads into the reference or alternate sets correctly.
  • Generalized ploidy model now handles reference calls correctly.

Haplotype Caller

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
  • Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
  • Now requires at least 10 samples to merge variants into complex events.

Variant Annotator

  • Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.

Reduce Reads

  • Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
  • Fixed bugs where in rare cases N bases were chosen as consensus over legitimate A,C,G, or T bases.
  • Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.

Variant Filtration

  • Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.

Variant Eval

  • AlleleCount stratification now supports records with ploidy other than 2.

Combine Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Now outputs the first non-missing QUAL, not the maximum.

Select Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Removed the -number argument because it gave biased results.

Validate Variants

  • Added option to selectively choose particular strict validation options.
  • Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
  • Improved the error message around unused ALT alleles.

Somatic Indel Detector

  • Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).

Miscellaneous

  • New CPU "nano" parallelization option (-nct) added GATK-wide (see docs for more details about this cool new feature that allows parallelization even for Read Walkers).
  • Fixed raw HapMap file conversion bug in VariantsToVCF.
  • Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
  • Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
  • Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
  • Fixed bug in BCF2 writer for case where all genotypes are missing.
  • Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
  • Fixed bug in Phase By Transmission when there are no likelihoods present.
  • Fixed bug in fasta .fai generation.
  • Updated and improved version of the BadCigar read filter.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar remains at version 110.
Comments (0)

We have identified a major bug in ReduceReads in GATK versions 2.0 and 2.1. The effect of the bug is that variant regions with more than 100 reads and fewer than 250 reads get downsampled to 0 reads.

This has now been fixed in the most recent release.

To check if you are using a buggy version, run the following:

    samtools view -H $BAM

This prints the BAM header; look for the ReduceReads program record:

    @PG ID:GATK ReduceReads VN:XXX

If XXX is 2.0 or 2.1, any results obtained with your current version are suspect, and you will need to upgrade to the most recent version and then rerun your processing.

Our most sincere apologies for the inconvenience.

Comments (2)

I've been following the best practices guide and I've gotten some odd looking output from ReduceReads. Here's a sample:

C100 16 chrM 4934 60 1M * 0 0 T 7 BD:Z:E RG:Z:JC01_L1 BI:Z:L RR:B:c,1 RS:A:1

The odd part is the CIGAR string. Is "1M" a reasonable CIGAR string? Furthermore, prior to ReduceReads, Picard's ValidateSamFile finished with no errors, while validation of the ReduceReads output looks like this:

WARNING: Record 1, Read name 1, NM tag (nucleotide differences) is missing

That occurs for records 1 - 100 and then ValidateSamFile does not report any more.

Here is the command line I used for ReduceReads:

java -Xmx2g -Djava.io.tmpdir=`pwd`/tmp -jar $GATK \
    -T ReduceReads \
    -R $genomes/hg19.fa \
    -I $alignments/$lane.dedup.realn.recal.bam \
    -o $alignments/$lane.dedup.realn.recal.reduced.bam

Any advice?

Comments (4)

Hi. When I reduce my BAM files with ReduceReads and then run HaplotypeCaller I get this warning:

Ignoring SAM validation error: ERROR: Read name 1060, No real operator (M|I|D|N) in CIGAR

Why is this? Is it OK to ignore it?

Comments (13)

I am having issues making sense of the behaviour of ReduceReads in an example where I know that the call is real (Sanger validated...). In the non-reduced pileup the variant is pretty solid:

11 47364249 G 12 aA..A.A,.A.^~.

The compressed pileup looks like:

11 47364249 G 2 .A

which I guess is OK, but the depth associated with the A call is 1 instead of the expected 5 (as evidenced when I run samtools view). Accordingly, after running the UG, I get 7 reads supporting the reference and 1 supporting the alternative (0/1:7,1:8:15:15,0,203), but really it should be 7/5. I am losing 4 precious reads that turn this variant into a missing call.

I really can't make sense of it. I tweaked all the options I could find to keep as many reads as possible when reducing: -minqual 0 -minmap 0 --dont_hardclip_low_qual_tails -noclip_ad

I am using GenomeAnalysisTK-2.7-4, and I can upload a slice of the BAM file to illustrate the problem. I also attach my debugging code (as a txt file) in case someone wants to check whether I am missing a key option.

Is that a bug? Or am I missing something obvious?

Comments (7)

Hello,

I am new to the field, so please forgive me if this has a simple answer. I am attempting to call variants on 6 whole exome sequences. The best practices documentation suggested using 30 or more samples, so I downloaded 24 bam files from the 1000genomes database to use with mine. However, whenever I attempted to use ReduceReads on the files, I received the following error right near 100% completion:

ERROR MESSAGE: BUG: requested unknown contig=NC_007605 index=-1

I am using the latest b37 reference file from the bundle, and I have tried re-indexing it and re-forming the .dict file. Here is the stack trace from the above error. What is causing this, and how do I fix it?

ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: BUG: requested unknown contig=NC_007605 index=-1
    at org.broadinstitute.sting.utils.MRUCachingSAMSequenceDictionary.updateCache(MRUCachingSAMSequenceDictionary.java:178)
    at org.broadinstitute.sting.utils.MRUCachingSAMSequenceDictionary.getSequence(MRUCachingSAMSequenceDictionary.java:109)
    at org.broadinstitute.sting.utils.GenomeLocParser.validateGenomeLoc(GenomeLocParser.java:284)
    at org.broadinstitute.sting.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:239)
    at org.broadinstitute.sting.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:449)
    at org.broadinstitute.sting.gatk.datasources.providers.ReadReferenceView.getReferenceContext(ReadReferenceView.java:98)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$2.next(TraverseReadsNano.java:139)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$2.next(TraverseReadsNano.java:128)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.aggregateMapData(TraverseReadsNano.java:119)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:101)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

Comments (3)

Hello! I am using ReduceReads but it always gets blocked at the same amount of reads reduced. For instance, I tried several times on one individual and it is blocked after 19.8%, and on another one after 72%. I tried on the cluster and also locally on my computer, with the same result. There is no error and it seems that it is still running, but it is not advancing anymore. I validated my BAM file with ValidateSamFile before performing this step and there is no error. Any idea what's going on? I am using version 2.3.6 because I read that the latest version installed on the cluster I use (2.6.5) has a bug; I can't install new software, and in particular not the new version, but all the old versions are still available and I read that 2.3 didn't have the bug. Thanks for your help. Muriel

Comments (1)

I have a quick question about the changes ReduceReads makes to a file:

This is the output from a ValidateSamFile run before and after using ReduceReads on the latest nightly build. Given that the best practices state that the SAM/BAM file must pass validation, this is strange, because it seems that ReduceReads is creating errors.

Any thoughts?

After:

    java -jar /usr/local/picard-tools/ValidateSamFile.jar I=NA12761.mapped.ILLUMINA.bwa.CEU.exome.20121211.reduced.bam IGNORE=MISSING_TAG_NM
    [Tue Nov 12 15:01:28 MST 2013] net.sf.picard.sam.ValidateSamFile INPUT=NA12761.mapped.ILLUMINA.bwa.CEU.exome.20121211.reduced.bam IGNORE=[MISSING_TAG_NM] MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Tue Nov 12 15:01:28 MST 2013] Executing as srynearson@newrepublic.genetics.utah.edu on Linux 2.6.32-358.18.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_06_26_07_26-b00; Picard version: 1.99(1563)
    ERROR: Record 20952, Read name 20524, Mate alignment does not match alignment start of mate
    ERROR: Record 490879, Read name 482367, Mate alignment does not match alignment start of mate
    ERROR: Record 514101, Read name 505228, Mate alignment does not match alignment start of mate
    ERROR: Record 863536, Read name 849794, Mate alignment does not match alignment start of mate
    ERROR: Record 1014607, Read name 998725, Mate alignment does not match alignment start of mate
    ERROR: Record 1025644, Read name 1009614, Mate alignment does not match alignment start of mate
    ERROR: Record 1121141, Read name 1103717, Mate alignment does not match alignment start of mate
    ERROR: Record 1248134, Read name 1228854, Mate alignment does not match alignment start of mate
    ERROR: Record 1278077, Read name 1258325, Mate alignment does not match alignment start of mate
    ERROR: Record 1301946, Read name 1281828, Mate alignment does not match alignment start of mate

Before:

    java -jar /usr/local/picard-tools/ValidateSamFile.jar I=NA12761.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam IGNORE=MISSING_TAG_NM
    [Tue Nov 12 15:02:01 MST 2013] net.sf.picard.sam.ValidateSamFile INPUT=NA12761.mapped.ILLUMINA.bwa.CEU.exome.20121211.bam IGNORE=[MISSING_TAG_NM] MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Tue Nov 12 15:02:01 MST 2013] Executing as srynearson@newrepublic.genetics.utah.edu on Linux 2.6.32-358.18.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_06_26_07_26-b00; Picard version: 1.99(1563)
    INFO 2013-11-12 15:03:45 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:01:43s. Time for last 10,000,000: 103s. Last read position: 2:74,687,627

Comments (8)

Hi guys, I've seen this error reported other times, for different reasons. The thing is that the bam file I'm using to reduce the reads has been processed through the GATK pipeline without problems, realignment and recalibration included. Therefore, I assumed the bam file generated after BQSR would be GATK-compliant. I was running with Queue, so here I just ran the exact command passed to the job in interactive mode, to see what happens.

Below are the full command and error message (apologies for the lengthy output); note that there is no stack trace after the error.

        [fles@login07 reduced]$ 'java'  '-Xmx12288m'  '-Djava.io.tmpdir=/scratch/scratch/fles/project_analysis/reduced/tmp'  '-cp' '/home/fles/applications/Queue-2.7-4-g6f46d11/Queue.jar'  'org.broadinstitute.sting.gatk.CommandLineGATK'  '-T' 'ReduceReads'  '-I' '/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam'  '-R' '/home/fles/Scratch/gatkbundle_2.5/human_g1k_v37.fasta'  '-o' '/scratch/scratch/fles/project_analysis/reduced/projectTrios.U1_PJ5208467.recal.reduced.bam'
        INFO  09:27:21,728 HelpFormatter - -------------------------------------------------------------------------------- 
        INFO  09:27:21,730 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:29:52 
        INFO  09:27:21,731 HelpFormatter - Copyright (c) 2010 The Broad Institute 
        INFO  09:27:21,731 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
        INFO  09:27:21,735 HelpFormatter - Program Args: -T ReduceReads -I /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam -R /home/fles/Scratch/gatkbundle_2.5/human_g1k_v37.fasta -o /scratch/scratch/fles/project_analysis/reduced/projectTrios.U1_PJ5208467.recal.reduced.bam 
        INFO  09:27:21,735 HelpFormatter - Date/Time: 2013/11/08 09:27:21 
        INFO  09:27:21,735 HelpFormatter - -------------------------------------------------------------------------------- 
        INFO  09:27:21,735 HelpFormatter - -------------------------------------------------------------------------------- 
        INFO  09:27:34,156 GenomeAnalysisEngine - Strictness is SILENT 
        INFO  09:27:34,491 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 40 
        INFO  09:27:34,503 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
        INFO  09:27:34,627 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.12 
        INFO  09:27:35,039 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
        INFO  09:27:35,045 GenomeAnalysisEngine - Done preparing for traversal 
        INFO  09:27:35,045 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
        INFO  09:27:35,046 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining 
        INFO  09:27:35,080 ReadShardBalancer$1 - Loading BAM index data 
        INFO  09:27:35,081 ReadShardBalancer$1 - Done loading BAM index data 
        INFO  09:28:05,059 ProgressMeter -      1:18958138        1.00e+06   30.0 s       30.0 s      0.6%        81.8 m    81.3 m 
        INFO  09:28:35,069 ProgressMeter -      1:46733396        2.30e+06   60.0 s       26.0 s      1.5%        66.4 m    65.4 m 
        INFO  09:29:05,079 ProgressMeter -      1:92187730        3.50e+06   90.0 s       25.0 s      3.0%        50.5 m    49.0 m 
        INFO  09:29:35,088 ProgressMeter -     1:145281942        4.90e+06  120.0 s       24.0 s      4.7%        42.7 m    40.7 m 
        INFO  09:30:05,098 ProgressMeter -     1:152323864        6.40e+06    2.5 m       23.0 s      4.9%        50.9 m    48.4 m 
        INFO  09:30:35,893 ProgressMeter -     1:181206886        7.70e+06    3.0 m       23.0 s      5.8%        51.4 m    48.4 m 
        INFO  09:31:05,902 ProgressMeter -     1:217604563        8.90e+06    3.5 m       23.0 s      7.0%        49.9 m    46.4 m 
        INFO  09:31:35,913 ProgressMeter -      2:14782401        1.02e+07    4.0 m       23.0 s      8.5%        47.0 m    43.0 m 
        INFO  09:32:05,922 ProgressMeter -      2:62429207        1.15e+07    4.5 m       23.0 s     10.0%        44.8 m    40.3 m 
        INFO  09:32:35,931 ProgressMeter -      2:97877374        1.28e+07    5.0 m       23.0 s     11.2%        44.7 m    39.7 m 
        INFO  09:33:06,218 ProgressMeter -     2:135574018        1.42e+07    5.5 m       23.0 s     12.4%        44.5 m    38.9 m 
        INFO  09:33:36,227 ProgressMeter -     2:179431307        1.56e+07    6.0 m       23.0 s     13.8%        43.5 m    37.5 m 
        INFO  09:34:06,237 ProgressMeter -     2:216279690        1.69e+07    6.5 m       23.0 s     15.0%        43.4 m    36.9 m 
        INFO  09:34:36,248 ProgressMeter -      3:14974731        1.81e+07    7.0 m       23.0 s     16.4%        42.9 m    35.9 m 
        INFO  09:35:07,073 ProgressMeter -      3:52443620        1.94e+07    7.5 m       23.0 s     17.6%        42.9 m    35.4 m 
        INFO  09:35:37,084 ProgressMeter -     3:111366536        2.05e+07    8.0 m       23.0 s     19.5%        41.3 m    33.2 m 
        INFO  09:36:07,094 ProgressMeter -     3:155571144        2.18e+07    8.5 m       23.0 s     20.9%        40.8 m    32.3 m 
        INFO  09:36:37,103 ProgressMeter -       4:3495327        2.31e+07    9.0 m       23.0 s     22.4%        40.4 m    31.3 m 
        INFO  09:37:07,114 ProgressMeter -      4:48178306        2.43e+07    9.5 m       23.0 s     23.8%        40.0 m    30.5 m 
        INFO  09:37:37,270 ProgressMeter -     4:106747046        2.56e+07   10.0 m       23.0 s     25.7%        39.0 m    29.0 m 
        INFO  09:38:07,483 ProgressMeter -     4:181303657        2.69e+07   10.5 m       23.0 s     28.1%        37.5 m    26.9 m 
        INFO  09:38:37,493 ProgressMeter -      5:41149454        2.81e+07   11.0 m       23.0 s     29.7%        37.1 m    26.1 m 
        INFO  09:38:51,094 GATKRunReport - Uploaded run statistics report to AWS S3 
        ##### ERROR ------------------------------------------------------------------------------------------
        ##### ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11): 
        ##### ERROR
        ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
        ##### ERROR The error message below tells you what is the problem.
        ##### ERROR
        ##### ERROR If the problem is an invalid argument, please check the online documentation guide
        ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
        ##### ERROR
        ##### ERROR Visit our website and forum for extensive documentation and answers to 
        ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
        ##### ERROR
        ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
        ##### ERROR
        ##### ERROR MESSAGE: SAM/BAM file /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam is malformed: Read error; BinaryCodec in readmode; file: /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam
        ##### ERROR ------------------------------------------------------------------------------------------

Following your usual advice, I validated the bam file produced by BQSR with Picard, and I get the exact same error, but no specific error indication:

        [fles@login07 reduced]$ java -jar ~/applications/picard-tools-1.102/ValidateSamFile.jar \
        > INPUT=/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam \
        > IGNORE_WARNINGS=TRUE
        [Fri Nov 08 09:59:42 GMT 2013] net.sf.picard.sam.ValidateSamFile INPUT=/home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam IGNORE_WARNINGS=true    MAX_OUTPUT=100 VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
        [Fri Nov 08 09:59:42 GMT 2013] Executing as fles@login07 on Linux 2.6.18-194.11.4.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version: 1.102(1591)
        INFO    2013-11-08 10:01:01 SamFileValidator    Validated Read    10,000,000 records.  Elapsed time: 00:01:18s.  Time for last 10,000,000:   78s.  Last read position: 1:204,966,172
        INFO    2013-11-08 10:02:19 SamFileValidator    Validated Read    20,000,000 records.  Elapsed time: 00:02:36s.  Time for last 10,000,000:   78s.  Last read position: 2:232,121,396
        INFO    2013-11-08 10:03:36 SamFileValidator    Validated Read    30,000,000 records.  Elapsed time: 00:03:54s.  Time for last 10,000,000:   77s.  Last read position: 4:123,140,629
        [Fri Nov 08 10:04:00 GMT 2013] net.sf.picard.sam.ValidateSamFile done. Elapsed time: 4.30 minutes.
        Runtime.totalMemory()=300941312
        To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
        Exception in thread "main" net.sf.samtools.util.RuntimeIOException: Read error; BinaryCodec in readmode; file: /home/fles/Scratch/project_analysis/recalibrated/projectTrios.U1_PJ5208467.clean.dedup.recal.bam
            at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:397)
            at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:371)
            at net.sf.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:357)
            at net.sf.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:200)
            at net.sf.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:558)
            at net.sf.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:532)
            at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:522)
            at net.sf.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:481)
            at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:687)
            at net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:665)
            at net.sf.picard.sam.SamFileValidator.validateSamRecordsAndQualityFormat(SamFileValidator.java:241)
            at net.sf.picard.sam.SamFileValidator.validateSamFile(SamFileValidator.java:177)
            at net.sf.picard.sam.SamFileValidator.validateSamFileSummary(SamFileValidator.java:104)
            at net.sf.picard.sam.ValidateSamFile.doWork(ValidateSamFile.java:164)
            at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
            at net.sf.picard.sam.ValidateSamFile.main(ValidateSamFile.java:100)
        Caused by: java.io.IOException: Unexpected compressed block length: 1
            at net.sf.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:358)
            at net.sf.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:113)
            at net.sf.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:238)
            at java.io.DataInputStream.read(DataInputStream.java:149)
            at net.sf.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:395)

Any suggestions on what I might be doing wrong?

Comments (10)

I used BaseRecalibration and generated a bam file with PrintReads correctly, along with the .bai, which was generated automatically. However, when I try to reduce the bam file using ReduceReads, I get the error below:

ERROR MESSAGE: SAM/BAM file /bam_base_recalib/dedup_51769_R1_R2_RG_realigned_recal_reads.bam is malformed: Premature EOF; BinaryCodec in readmode; file: /bam_base_recalib/dedup_51769_R1_R2_RG_realigned_recal_reads.bam

Comments (5)

Can I use samtools index to index reduced bam files? I'm using GATK version 2.7-2-g6bda569 and the HaplotypeCaller asks for the index files.

Thanks,

Comments (10)

I'm using GATK 2.7-4 on my server with Java 1.7 on a Mac. I'm running many .bam files produced by the upstream PrintReads through ReduceReads, using a for-loop in a shell script. Some of these bam files were not compressed by ReduceReads, which gives 0B .bam output files.

The command line I use is the following:

java -Xmx10g -Djava.awt.headless=true -jar $CLASSPATH/GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R ./GATK_ref/hg19.fasta \
    -S LENIENT \
    -log ../GATK/BQSR/log/file1.compress.log \
    -I ../GATK/BQSR/file1.recal.bam \
    -o ../GATK/BQSR/file1.compressed.bam

I'm copying the tail part of the log for one of the failed .bam files; from this log, I don't see any error. Maybe the ReduceReads walker just terminated early?

INFO 10:28:13,654 ProgressMeter - chr4:48710213 2.47e+07 19.2 m 46.0 s 23.6% 81.4 m 62.2 m
INFO 10:28:43,657 ProgressMeter - chr4:83322785 2.54e+07 19.7 m 46.0 s 24.7% 79.8 m 60.1 m
INFO 10:29:13,659 ProgressMeter - chr4:112743605 2.60e+07 20.2 m 46.0 s 25.6% 78.8 m 58.6 m
INFO 10:29:43,662 ProgressMeter - chr4:151294343 2.67e+07 20.7 m 46.0 s 26.8% 77.1 m 56.4 m
INFO 10:30:13,664 ProgressMeter - chr4:184999795 2.72e+07 21.2 m 46.0 s 27.9% 75.9 m 54.7 m
INFO 10:30:43,667 ProgressMeter - chr5:14341390 2.79e+07 21.7 m 46.0 s 28.6% 75.9 m 54.2 m
INFO 10:31:13,796 ProgressMeter - chr5:37088471 2.85e+07 22.2 m 46.0 s 29.3% 75.7 m 53.6 m
INFO 10:31:43,798 ProgressMeter - chr5:67907077 2.92e+07 22.7 m 46.0 s 30.3% 74.9 m 52.3 m
INFO 10:32:13,914 ProgressMeter - chr5:90107126 2.97e+07 23.2 m 46.0 s 31.0% 74.8 m 51.7 m
INFO 10:32:44,969 ProgressMeter - chr5:127450350 3.03e+07 23.7 m 46.0 s 32.2% 73.7 m 50.0 m

Comments (4)

I recently upgraded from GATK 2.5 to the latest 2.7-4 stable release, but ReduceReads throws the following error when I try to run it with a BAM file that was produced by PrintReads (3 samples merged in one BAM file):

MESSAGE: Bad input: Reduce Reads is not meant to be run for more than 1 sample at a time except for the specific case of tumor/normal pairs in cancer analysis

java -Xmx6g -Djava.awt.headless=true -jar $CLASSPATH/GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R ../GATK_ref/hg19.fasta \
    -I ../GATK/BQSR/all3Samples2.recal.bam \
    -o ../GATK/BQSR/all3Samples.recal.compressed.bam

It used to work with the old version of GATK, but it does not work now. What could be wrong?

Comments (3)

What is the expected effect of using GATK 2.7 HaplotypeCaller+VQSR on a 30x WGS BAM versus the same BAM processed through ReduceReads beforehand? Does one expect exactly the same variants to be called on both files, or a small difference between them? Do HaplotypeCaller or VQSR treat the input differently depending on whether it comes from a full WGS BAM or a reduced WGS BAM?

Comments (12)

I am putting together a class on NGS data analysis and am working with one of Illumina's "Platinum" data sets (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERR262996). This is a result set for a long range mate-pair experiment with about 33x coverage. When I run ReduceReads on the mapped, aligned, and indel re-aligned BAM file using default parameters, I get only about a 20% reduction in file size. Closer inspection shows that none of the reads were actually removed. I tried with both the complete results and a chromosome 20 subset.

What am I missing?

Comments (21)

Hello. While trying to run ReduceReads on my BAM outputs of PrintReads, using the latest build v2.6-2-ge03a5e9, I get the following error; kindly comment on this stack trace:

Command: java -Xmx8g -jar /path/to/my/server/ashish/tools/GenomeAnalysisTK-2.6-2-ge03a5e9/GenomeAnalysisTK.jar -T ReduceReads -R /path/to/my/server/pipeline/lib/hs37d5.fa -I /path/to/my/server/output/bamext.base.recal/41453_TTAGGC_L002_001.bam -o /path/to/my/server/output/bamext.reduced/41453_TTAGGC_L002_001.bam

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Removed too many insertions, header is now negative at position 151039408
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.HeaderElement.removeInsertionToTheRight(HeaderElement.java:210)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.actuallyUpdateHeaderForRead(SlidingWindow.java:1246)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.updateHeaderCounts(SlidingWindow.java:1162)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.removeFromHeader(SlidingWindow.java:1139)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.compressVariantRegion(SlidingWindow.java:742)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegion(SlidingWindow.java:835)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegions(SlidingWindow.java:884)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.close(SlidingWindow.java:971)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.addAlignment(SingleSampleCompressor.java:113)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:123)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:116)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:477)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:113)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:251)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:240)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:279)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:102)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:311)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.6-2-ge03a5e9):
ERROR
ERROR Please check the documentation guide to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Removed too many insertions, header is now negative at position 151039408
ERROR ------------------------------------------------------------------------------------------
Comments (2)

Hi,

I've been trying to get ReduceReads working in a pipeline I've made that incorporates GATK tools to call variants in RNA-seq data. After performing indel realignment and base recalibration I'm trying to use ReduceReads prior to calling variants using Unified Genotyper.

I've been using GATK version 2.3.9. When I try to use ReduceReads on a 1.7Gb .bam file, I need to set aside 100Gb of memory for the process to complete (otherwise I get an error saying I didn't provide enough memory to run the program and should adjust the maximum heap size using the -Xmx option, etc.).

The problem isn't that ReduceReads doesn't work - it does, however of the 100Gb I set aside, it uses 80-90Gb of it. This means I can't run more than one job at a time due to the constraints of the machine I'm using etc.

I've been looking through the GATK forum and understand it may be a GATK version issue, though I've tried using GATK 2.5.2 ReduceReads for this step and it still requires 70-80Gb memory.

Can anyone provide any clues as to what I may be doing wrong? or whether I can do something to make it use less memory so I can run multiple jobs simultaneously?

The command I'm using is:

java -Xmx100g -Djava.io.tmpdir=/RAW/javatmp/ -jar /NMC/LCR/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R /SCRATCH/LCR/BWAIndex_hg19/genome.fa \
    -I out.bam.sorted.bam.readGroups.bam.rmdup.bam.realigned.bam.recalibrated.bam \
    -o out.bam.sorted.bam.readGroups.bam.rmdup.bam.realigned.bam.recalibrated.bam.reducedReads.bam

Thanks in advance,

Alex

Comments (0)

I am a complete newb. Even with help and support from my lab mates, I need to read your materials. I was sent by the GATK Guide Book (page 10; section 4) to the Dropbox location https://www.dropbox.com/sh/e31kvbg5v63s51t/ajQmlTL6YH where I picked up ReduceReads.pdf. On page 11 of that document there are ten graphs. The resolution of the .pdf file is so low that I cannot read the legends on the left side and bottom of these ten graphs. Could you point me to a high-resolution version of this .pdf?

Thanks

Comments (1)

I am using UnifiedGenotyper to call SNPs in certain regions from custom capture data. I previously had the pipeline working, but now I am trying with files that have been reduced using ReduceReads, and I have also changed to a newer version. I have many bam files, but I also get the error when I try with just two. See below for my script and the error message.

Many thanks.

java -Xmx20g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -R human_g1k_v37.fasta \
    -B:dbsnp,vcf dbsnp_132.b37.vcf \
    -L baitgroupfile.picard \
    -I file1.reduced.bam \
    -I file2.reduced.bam \
    -o out.vcf \
    -stand_call_conf 50.0 \
    -stand_emit_conf 10.0 \
    -G Standard \
    -metrics out.metrics

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

net.sf.samtools.SAMFormatException: Unrecognized tag type: B
    at net.sf.samtools.BinaryTagCodec.readValue(BinaryTagCodec.java:270)
    at net.sf.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:220)
    at net.sf.samtools.BAMRecord.decodeAttributes(BAMRecord.java:302)
    at net.sf.samtools.BAMRecord.getAttribute(BAMRecord.java:282)
    at net.sf.samtools.SAMRecord.getAttribute(SAMRecord.java:830)
    at net.sf.picard.sam.MergingSamRecordIterator.next(MergingSamRecordIterator.java:132)
    at net.sf.picard.sam.MergingSamRecordIterator.next(MergingSamRecordIterator.java:39)
    at org.broadinstitute.sting.gatk.iterators.PrivateStringSAMCloseableIterator.next(StingSAMIteratorAdapter.java:100)
    at org.broadinstitute.sting.gatk.iterators.PrivateStringSAMCloseableIterator.next(StingSAMIteratorAdapter.java:84)
    at org.broadinstitute.sting.gatk.datasources.simpleDataSources.SAMDataSource$ReleasingIterator.next(SAMDataSource.java:803)
    at org.broadinstitute.sting.gatk.datasources.simpleDataSources.SAMDataSource$ReleasingIterator.next(SAMDataSource.java:769)
    at org.broadinstitute.sting.gatk.iterators.ReadFormattingIterator.next(ReadFormattingIterator.java:77)
    at org.broadinstitute.sting.gatk.iterators.ReadFormattingIterator.next(ReadFormattingIterator.java:19)
    at org.broadinstitute.sting.gatk.filters.CountingFilteringIterator.getNextRecord(CountingFilteringIterator.java:106)
    at org.broadinstitute.sting.gatk.filters.CountingFilteringIterator.<init>(CountingFilteringIterator.java:59)
    at org.broadinstitute.sting.gatk.datasources.simpleDataSources.SAMDataSource.applyDecoratingIterators(SAMDataSource.java:598)
    at org.broadinstitute.sting.gatk.datasources.simpleDataSources.SAMDataSource.getIterator(SAMDataSource.java:517)
    at org.broadinstitute.sting.gatk.datasources.simpleDataSources.SAMDataSource.seek(SAMDataSource.java:462)
    at org.broadinstitute.sting.gatk.executive.MicroScheduler.getReadIterator(MicroScheduler.java:150)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:56)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:209)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:109)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:239)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:87)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 1.0.4905):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
ERROR
ERROR MESSAGE: Unrecognized tag type: B
ERROR ------------------------------------------------------------------------------------------
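For reference, this is the GATK 2.x-style version of the command I am moving to (a sketch, untested here; as far as I can tell, the old -B:dbsnp,vcf binding was replaced by --dbsnp, and the "Unrecognized tag type: B" error suggests the reduced BAMs carry array-type (B) tags that the SAM library bundled with GATK 1.0.4905 cannot parse):

java -Xmx20g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -R human_g1k_v37.fasta \
    --dbsnp dbsnp_132.b37.vcf \
    -L baitgroupfile.picard \
    -I file1.reduced.bam \
    -I file2.reduced.bam \
    -o out.vcf \
    -stand_call_conf 50.0 \
    -stand_emit_conf 10.0 \
    -G Standard \
    -metrics out.metrics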
Comments (2)

Hello all thank for the great work.

I have run into an issue with ReduceReads and I was hoping you could offer some insight. I'm getting the stack trace below (attached file). I looked around the forums, and others who were getting stack trace issues with ReduceReads were told code fixes would remedy them, so I thought I would check with you. I also ran samtools flagstat.

Comments (2)

Hi, I'm just wondering whether it is a good idea to run my pipeline again with ReduceReads. I skipped it originally as I only have four (mouse) samples, but having re-read the documentation with the additional filters, I am now considering whether it might add value. Any thoughts appreciated.

Comments (1)

Hi all,

I am analyzing some whole-genome sequencing data. After preprocessing with Queue I got a large BAM file at the sample level (~200 GB/sample), and I wanted to use the ReduceReads module to reduce the BAM file size, running the following command:

/usr/java/latest/bin/java -Xmx16g -jar /path_to_GenomeAnalysisTK-2.3-9/GenomeAnalysisTK.jar \
    -R /path_to_human_g1k_v37.fasta \
    -T ReduceReads \
    -I /path_to_Queue/project.sample.clean.dedup.recal.bam \
    -o sample.reduced.bam \
    --generate_md5

After 8 hours, the estimated remaining time had grown to 6.9 days.

INFO 20:02:25,508 ProgressMeter - 1:120660726 5.63e+07 6.5 h 7.0 m 3.9% 7.0 d 6.7 d
INFO 20:03:25,509 ProgressMeter - 1:120660726 5.63e+07 6.5 h 7.0 m 3.9% 7.0 d 6.7 d
INFO 20:04:25,510 ProgressMeter - 1:120660726 5.63e+07 6.6 h 7.0 m 3.9% 7.0 d 6.8 d
INFO 20:05:25,511 ProgressMeter - 1:120660726 5.63e+07 6.6 h 7.0 m 3.9% 7.0 d 6.8 d
INFO 20:06:25,512 ProgressMeter - 1:120677835 5.63e+07 6.6 h 7.0 m 3.9% 7.1 d 6.8 d
INFO 20:07:25,528 ProgressMeter - 1:120677835 5.63e+07 6.6 h 7.0 m 3.9% 7.1 d 6.8 d
INFO 20:08:25,529 ProgressMeter - 1:120677835 5.63e+07 6.6 h 7.1 m 3.9% 7.1 d 6.8 d
INFO 20:09:25,530 ProgressMeter - 1:120677835 5.63e+07 6.6 h 7.1 m 3.9% 7.1 d 6.8 d
INFO 20:10:25,531 ProgressMeter - 1:120677835 5.63e+07 6.7 h 7.1 m 3.9% 7.1 d 6.9 d
INFO 20:11:25,532 ProgressMeter - 1:120677835 5.63e+07 6.7 h 7.1 m 3.9% 7.2 d 6.9 d
INFO 20:12:25,533 ProgressMeter - 1:120677835 5.63e+07 6.7 h 7.1 m 3.9% 7.2 d 6.9 d
INFO 20:13:25,534 ProgressMeter - 1:120677835 5.63e+07 6.7 h 7.2 m 3.9% 7.2 d 6.9 d
INFO 20:14:25,535 ProgressMeter - 1:120677835 5.63e+07 6.7 h 7.2 m 3.9% 7.2 d 6.9 d

The tool version is GenomeAnalysisTK-2.3-9

Is there anything wrong with my command? How could I speed up this procedure? Thanks a lot.

Comments (3)

ReduceReads is very slow for MT reads. After it gets past the MT, it runs much faster (see output below).

Any ideas why, and what to do to speed it up?

John

INFO 23:32:37,536 HelpFormatter - ---------------------------------------------------------------------------------
INFO 23:32:37,545 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.2-16-g9f648cb, Compiled 2012/12/04 03:46:58
INFO 23:32:37,545 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 23:32:37,545 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 23:32:37,551 HelpFormatter - Program Args: -R /unprotected/projects/genetics_program/resources/gatk_bundle/hg19/ucsc.hg19.fasta -T ReduceReads -I bam/LP6005113-DNA_E01.recal.bam -o LP6005113-DNA_E01.reduced.bam
INFO 23:32:37,551 HelpFormatter - Date/Time: 2013/01/28 23:32:37
INFO 23:32:37,551 HelpFormatter - ---------------------------------------------------------------------------------
INFO 23:32:37,551 HelpFormatter - ---------------------------------------------------------------------------------
INFO 23:32:37,605 GenomeAnalysisEngine - Strictness is SILENT
INFO 23:32:37,984 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 23:32:37,992 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 23:32:38,073 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.08
INFO 23:32:38,113 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 23:32:38,114 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 23:33:14,010 ProgressMeter - chrM:1992 4.00e+04 35.9 s 15.0 m 0.0% 93.5 w 93.5 w
INFO 23:35:18,038 ProgressMeter - chrM:2879 6.00e+04 2.7 m 44.4 m 0.0% 288.2 w 288.2 w
INFO 23:37:06,320 ProgressMeter - chrM:3259 7.00e+04 4.5 m 63.9 m 0.0% 427.0 w 427.0 w
INFO 23:39:03,457 ProgressMeter - chrM:3662 8.00e+04 6.4 m 80.3 m 0.0% 546.0 w 546.0 w
INFO 23:41:16,174 ProgressMeter - chrM:4087 9.00e+04 8.6 m 95.9 m 0.0% 657.7 w 657.7 w
INFO 23:43:46,243 ProgressMeter - chrM:4550 1.00e+05 11.1 m 111.3 m 0.0% 761.8 w 761.8 w
INFO 23:46:47,501 ProgressMeter - chrM:4973 1.10e+05 14.2 m 2.1 h 0.0% 886.1 w 886.1 w
INFO 23:49:57,085 ProgressMeter - chrM:5379 1.20e+05 17.3 m 2.4 h 0.0% 1002.1 w 1002.1 w
INFO 23:52:52,173 ProgressMeter - chrM:5823 1.30e+05 20.2 m 2.6 h 0.0% 1081.7 w 1081.7 w
INFO 23:54:28,697 ProgressMeter - chrM:7492 1.70e+05 21.8 m 2.1 h 0.0% 907.5 w 907.5 w
INFO 23:55:45,484 ProgressMeter - chrM:7883 1.80e+05 23.1 m 2.1 h 0.0% 913.0 w 913.0 w
INFO 23:57:16,597 ProgressMeter - chrM:8305 1.90e+05 24.6 m 2.2 h 0.0% 923.5 w 923.5 w
INFO 23:59:07,109 ProgressMeter - chrM:8731 2.00e+05 26.5 m 2.2 h 0.0% 944.1 w 944.1 w
INFO 00:01:16,623 ProgressMeter - chrM:9124 2.10e+05 28.6 m 2.3 h 0.0% 977.1 w 977.1 w
INFO 00:04:12,150 ProgressMeter - chrM:9526 2.20e+05 31.6 m 2.4 h 0.0% 1031.4 w 1031.4 w
INFO 00:06:51,054 ProgressMeter - chrM:9896 2.30e+05 34.2 m 2.5 h 0.0% 1076.2 w 1076.2 w
INFO 00:09:31,477 ProgressMeter - chrM:10244 2.40e+05 36.9 m 2.6 h 0.0% 1120.9 w 1120.9 w
INFO 00:12:57,847 ProgressMeter - chrM:10626 2.50e+05 40.3 m 2.7 h 0.0% 1181.3 w 1181.3 w
INFO 00:16:48,872 ProgressMeter - chrM:11139 2.60e+05 44.2 m 2.8 h 0.0% 1234.5 w 1234.5 w
INFO 00:20:54,282 ProgressMeter - chrM:11634 2.70e+05 48.3 m 3.0 h 0.0% 1291.4 w 1291.4 w
INFO 00:25:23,381 ProgressMeter - chrM:12098 2.80e+05 52.8 m 3.1 h 0.0% 1357.2 w 1357.2 w
INFO 00:30:01,695 ProgressMeter - chrM:12464 2.90e+05 57.4 m 3.3 h 0.0% 1433.2 w 1433.2 w
INFO 00:34:41,008 ProgressMeter - chrM:12805 3.00e+05 62.0 m 3.4 h 0.0% 1508.2 w 1508.2 w
INFO 00:39:41,462 ProgressMeter - chrM:13307 3.10e+05 67.1 m 3.6 h 0.0% 1568.4 w 1568.4 w
INFO 00:45:21,827 ProgressMeter - chrM:13764 3.20e+05 72.7 m 3.8 h 0.0% 1644.6 w 1644.6 w
INFO 00:51:15,645 ProgressMeter - chrM:14173 3.30e+05 78.6 m 4.0 h 0.0% 1726.7 w 1726.7 w
INFO 00:57:38,039 ProgressMeter - chrM:14639 3.40e+05 85.0 m 4.2 h 0.0% 1807.2 w 1807.2 w
INFO 01:06:06,413 ProgressMeter - chrM:15067 3.50e+05 93.5 m 4.5 h 0.0% 1930.9 w 1930.9 w
INFO 01:15:07,742 ProgressMeter - chrM:15463 3.60e+05 102.5 m 4.7 h 0.0% 2063.0 w 2063.0 w
INFO 01:23:17,067 ProgressMeter - chrM:15827 3.70e+05 110.6 m 5.0 h 0.0% 2176.0 w 2176.0 w
INFO 01:31:08,225 ProgressMeter - chrM:16237 3.80e+05 118.5 m 5.2 h 0.0% 2271.6 w 2271.5 w
INFO 01:32:08,631 ProgressMeter - chr1:3000534 1.17e+06 119.5 m 102.5 m 0.1% 12.3 w 12.3 w
INFO 01:33:09,058 ProgressMeter - chr1:5169965 1.91e+06 2.0 h 63.2 m 0.2% 7.2 w 7.2 w
INFO 01:34:09,530 ProgressMeter - chr1:7090404 2.65e+06 2.0 h 45.9 m 0.2% 5.3 w 5.3 w
INFO 01:35:10,334 ProgressMeter - chr1:8806475 3.32e+06 2.0 h 37.0 m 0.3% 4.3 w 4.3 w
INFO 01:36:10,654 ProgressMeter - chr1:10887467 4.08e+06 2.1 h 30.3 m 0.3% 3.5 w 3.5 w
INFO 01:37:10,892 ProgressMeter - chr1:12756332 4.77e+06 2.1 h 26.1 m 0.4% 3.0 w 3.0 w
INFO 01:38:11,087 ProgressMeter - chr1:14746000 5.29e+06 2.1 h 23.8 m 0.5% 18.5 d 18.4 d
INFO 01:39:11,327 ProgressMeter - chr1:16699493 6.02e+06 2.1 h 21.0 m 0.5% 16.5 d 16.4 d
INFO 01:40:11,606 ProgressMeter - chr1:18706430 6.86e+06 2.1 h 18.6 m 0.6% 14.8 d 14.8 d
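In the meantime I am considering splitting chrM out so the deep mitochondrial pileup does not dominate the run. A sketch of what I have in mind (untested; it assumes the engine-level -L / -XL interval arguments behave for ReduceReads the way they do for other walkers):

java -Xmx8g -jar GenomeAnalysisTK.jar -T ReduceReads \
    -R ucsc.hg19.fasta \
    -I bam/LP6005113-DNA_E01.recal.bam \
    -XL chrM \
    -o LP6005113-DNA_E01.nochrM.reduced.bam

java -Xmx8g -jar GenomeAnalysisTK.jar -T ReduceReads \
    -R ucsc.hg19.fasta \
    -I bam/LP6005113-DNA_E01.recal.bam \
    -L chrM \
    -o LP6005113-DNA_E01.chrM.reduced.bam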

Comments (2)

Hi there, I've tried to run ReduceReads for the first time and I got this error:

$ java -Xmx8g -jar /lustre1/tools/bin/GenomeAnalysisTK-2.3-6.jar -T ReduceReads -R /lustre1/genomes/hg19/fa/hg19.fa -I filein.bam -o fileout.bam
[…]

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.closeVariantRegions(SingleSampleCompressor.java:83)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.closeVariantRegionsInAllSamples(MultiSampleCompressor.java:94)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:76)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:67)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:387)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:87)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:226)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:215)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:254)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.3-6-gebbba25):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Is there something wrong with running RR with multiple samples?

d

Comments (3)

Hello,

I'm trying to call variants using UnifiedGenotyper on ca. 450 reduced BAMs in 100,000 bp chunks. It works fine for some of the chunks, but for others I get the following error message:

ERROR MESSAGE: SAM/BAM nameOfBam.bam is malformed: Unexpected compressed block length: 1

Can anyone explain to me why there is a problem with a specific BAM file when I call on, for example, chunk chr20:25400000-25500000 but not when I call on chunk chr20:10000000-10100000?

Thank you, Tota

Comments (11)

I have read on your recent slides for "Data Compression with Reduce Reads" that "Tumor and Normal samples (or any set of samples) get co-reduced, meaning that every variable region triggered by one sample will be forced in every sample."

I have data from 4 variant strains of an organism (my samples in the RG info), with 4 individuals per strain (my libraries in the RG info). Currently I have a BAM file for each of the 16 different libraries.

I want to run ReduceReads since I have quite high coverage, but I also want to preserve information across all of my samples wherever a site is non-consensus in even one of them, as there is no SNP information available for this organism and I don't want to lose any important data. Should I merge all BAM files for all samples before proceeding with ReduceReads with downsampling turned off, or just leave out ReduceReads?

Thanks Anna

Comments (23)

New GATK version... trying out ReduceReads again.

6 of 8 exomes I tried were processed by ReduceReads just fine, but two throw the exception "Removed too many insertions, header is now negative!" (at different genomic locations).

I did not find any mention of this error in the GATK forums; is this a known problem?

Command line:

java -Xmx6g -jar GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T ReduceReads -o test.rr.bam -I rr-too-many-insertions.bam

java -version:

java version "1.6.0_27"
Java(TM) SE Runtime Environment (build 1.6.0_27-b07)
Java HotSpot(TM) 64-Bit Server VM (build 20.2-b06, mixed mode)

Run log:

INFO  16:03:26,898 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:03:27,382 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-0-g9593e74, Compiled 2012/12/17 16:58:19 
INFO  16:03:27,383 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:03:27,383 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:03:27,388 HelpFormatter - Program Args: -R human_g1k_v37.fasta -T ReduceReads -o test.rr.bam -I rr-too-many-insertions.bam 
INFO  16:03:27,388 HelpFormatter - Date/Time: 2012/12/18 16:03:26 
INFO  16:03:27,388 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:03:27,388 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:03:27,471 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:03:27,577 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
INFO  16:03:27,585 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:03:27,620 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 
INFO  16:03:27,656 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  16:03:27,657 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining 
INFO  16:03:27,714 ReadShardBalancer$1 - Loading BAM index data for next contig 
INFO  16:03:27,717 ReadShardBalancer$1 - Done loading BAM index data for next contig 
INFO  16:03:27,739 ReadShardBalancer$1 - Loading BAM index data for next contig 
INFO  16:03:28,739 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Removed too many insertions, header is now negative!
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.HeaderElement.removeInsertionToTheRight(HeaderElement.java:151)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.updateHeaderCounts(SlidingWindow.java:881)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.removeFromHeader(SlidingWindow.java:816)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.compressVariantRegion(SlidingWindow.java:604)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegion(SlidingWindow.java:623)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegions(SlidingWindow.java:643)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.closeVariantRegions(SingleSampleCompressor.java:83)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.closeVariantRegionsInAllSamples(MultiSampleCompressor.java:94)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:76)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:67)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:387)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:87)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:226)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:215)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:254)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
    at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:94)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-0-g9593e74):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Removed too many insertions, header is now negative!
##### ERROR ------------------------------------------------------------------------------------------

(There is no progress listed here because this log is from after I bisected to find a narrow region where the problem is occurring.)

Comments (1)

I've tried using the output of the reduced BAMs as an input to CREST (after some preprocessing), but it hangs on chr7. Has anyone else used reduced BAMs in other programs? Is this output meant to be used only in GATK?

Thanks!

Comments (33)

Hello dear GATK Team,

When trying to run HaplotypeCaller on my exome files prepared with ReduceReads, I get the error stated below. As you can see, the newest GATK version is used. Also, UnifiedGenotyper does not produce any errors on the exact same data (90 SOLiD exomes created according to Best Practice v4).

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Too many deletions?
    at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:447)
    at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:396)
    at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:392)
    at org.broadinstitute.sting.gatk.walkers.annotator.DepthOfCoverage.annotate(DepthOfCoverage.java:56)
    at org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation.annotate(InfoFieldAnnotation.java:24)
    at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:223)
    at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:429)
    at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:104)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegion(TraverseActiveRegions.java:249)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.callWalkerMapOnActiveRegions(TraverseActiveRegions.java:204)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegions(TraverseActiveRegions.java:179)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:136)
    at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:29)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:74)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-3-gde33222):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
##### ERROR ------------------------------------------------------------------------------------------

The command line used (abbreviated):

java -Xmx30g -jar /home/common/GenomeAnalysisTK-2.2-3/GenomeAnalysisTK.jar \
  -R /home/common/hg19/ucschg19/ucsc.hg19.fasta \
  -T HaplotypeCaller \
  -I ReduceReads/XXXXX.ontarget.MarkDups.nRG.reor.Real.Recal.reduced.bam [x90]\
  --dbsnp /home/common/hg19/dbsnp_135.hg19.vcf \
  -o 93Ind_ped_reduced_HC_snps.raw.vcf \
  -ped familys.ped \
  --pedigreeValidationType SILENT \
  -stand_call_conf 20.0 \
  -stand_emit_conf 10.0
Comments (4)

Does the relationship between AD and DP still hold in a VCF produced from ReduceReads BAMs? That is, is the sum of AD <= DP, or can other scenarios now occur?

Also, is AD summarized to 1,0 or 0,1 for homozygous REF and ALT? Thanks.
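For concreteness, here is a toy VCF record (values are purely illustrative) of the relationship I mean, where the AD fields sum to 18 while DP is 20 because two reads are uninformative:

#CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO   FORMAT    SAMPLE1
20      1234567  .   G    A    50    PASS    DP=20  GT:AD:DP  0/1:12,6:20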

Comments (5)

Hi Team,

I have been running GATK2 ReduceReads on a large (100 GB) BAM file, and even though at the very beginning it runs very smoothly and predicts a week for finishing the task, after a few hours it gets totally stuck. We first thought that it could be a garbage collection (or Java memory allocation) issue, but the logs show that garbage collection works well.

The command is (similar behavior for smaller -Xms and -Xmx values):

java -Xmx30g -Xms30g -XX:+PrintGCTimeStamps -XX:+UseParallelOldGC -XX:+PrintGCDetails -Xloggc:gc.log -verbose:gc -jar $path $ref -T ReduceReads -I input.bam -o output.bam

The first few lines of the log file are

INFO 01:12:21,541 TraversalEngine - chr1:1094599 5.89e+05 9.9 m 16.8 m 0.0% 19.4 d 19.4 d
INFO 01:13:21,628 TraversalEngine - chr1:2112411 9.44e+05 10.9 m 11.6 m 0.1% 11.2 d 11.2 d
INFO 01:14:22,065 TraversalEngine - chr1:3051535 1.29e+06 11.9 m 9.3 m 0.1% 8.5 d 8.5 d
INFO 01:15:22,297 TraversalEngine - chr1:4084547 1.59e+06 12.9 m 8.1 m 0.1% 6.9 d 6.9 d
INFO 01:16:24,130 TraversalEngine - chr1:4719991 1.82e+06 13.9 m 7.7 m 0.2% 6.4 d 6.4 d

but after a short while it gets totally stuck: even at position 121485073 of chromosome 1 there is almost no progress at all, and the estimated finish time has gone over 11 weeks and is still increasing.

Any idea what the reason for this could be, and how we can solve the problem? The same command runs successfully on small (less than 5 GB) BAM files, though.

Thanks in advance. --Sina

Comments (3)

Hi, I'm running GATK version 2.1-8 with reads mapped to mm10. ReduceReads fails somewhere on chr2 with the above message. From previous posts I understood that this bug has appeared before. Could you please help me fix it? Thank you,

Ania

Comments (2)

Hi,

I'm trying to use GATK release 2.0 with my nine exome-seq samples. Following the Best Practices steps, I generated per-sample, ready-to-process .bam files and then used -T ReduceReads to generate .reduced.bam files for the next step (-T UnifiedGenotyper). When using these .reduced.bam files as UG input I receive this error message: "##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?" If I take my original .bam files as input, things work smoothly. Do you have any idea what causes the problem?

Thanks a lot, Samira

Here are the command lines I use:

java -Xmx4g -jar $GATKv4 \
    -R $GATK_BUNDLE/ucsc.hg19.fasta \
    -T ReduceReads \
    -L $capture_library.bed \
    -I $i.recal_s.bam \
    -o $i.reduced.bam

 java -jar $GATKv4 \
    -T HaplotypeCaller \
    -R $GATK_BUNDLE/ucsc.hg19.fasta \
    -I InputReducedBams.list \
    -L $capture_library.bed \
    --dbsnp $GATK_BUNDLE/dbsnp_135.hg19.vcf \
    -o raw.snp.indel.UnifiedGenotyper.rsv.vcf
Comments (2)

Hello everyone, I have a question about ReduceReads when using scatter/gather. In the argument details of ReduceReads you write for the parameter -nocmp_names: "... If you scatter/gather there is no guarantee that read name uniqueness will be maintained -- in this case we recommend not compressing."

Do you mean that if I use scatter/gather, I should use ReduceReads with the -nocmp_names option so that the read names will not be compressed, OR do you mean that I should not use ReduceReads at all when scatter/gathering?

I assume the former is meant; I just wanted to make sure. Thank you for your time and effort. Eva
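For what it's worth, this is the sketch I would run per scatter interval if the first reading is correct (placeholder paths; -nocmp_names keeps read names uncompressed so they cannot collide when the shards are gathered):

java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R reference.fa \
    -I sample.recal.bam \
    -L chr20 \
    -nocmp_names \
    -o sample.chr20.reduced.bam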

Comments (13)

We are attempting to see if using ReduceReads will help with the overwhelming file sizes for the SNP calling we are doing on whole-genome BAM files. We have been using a protocol similar to the one described in the Best Practices document: "Best: multi-sample realignment with known sites and recalibration". My question is: what is the best point in the pipeline to use ReduceReads?
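For concreteness, the placement I am considering, as a per-sample sketch (file names are placeholders; this assumes reduction goes last, after realignment and recalibration and just before variant calling, which is the ordering the .recal -> .reduced file names in these threads suggest):

java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R reference.fa \
    -I sampleA.dedup.realigned.recal.bam \
    -o sampleA.reduced.bam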

Comments (2)

Hi all, I am trying to use the new ReduceReads feature and I get an error every time. Can anyone tell me what the problem is? BTW, I am working on the yeast genome and not human, if it matters.

INFO 14:21:07,687 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:21:07,688 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-36-gf5c1c1a, Compiled 2012/08/08 20:17:07
INFO 14:21:07,688 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 14:21:07,688 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 14:21:07,689 HelpFormatter - Program Args: -R /home/mps/references/SK1_v2/fasta/SK1_v2.fixed.fa -T ReduceReads -I output.marked.realigned.fixed.recal.bam -o output.marked.realigned.fixed.recal.reduced.bam -l INFO
INFO 14:21:07,689 HelpFormatter - Date/Time: 2012/08/09 14:21:07
INFO 14:21:07,689 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:21:07,690 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:21:07,759 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:21:07,791 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:21:07,804 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 14:21:08,076 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 14:21:08,076 TraversalEngine - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 14:21:38,548 TraversalEngine - SK1.chr01:63354 3.90e+04 30.5 s 13.0 m 0.5% 98.2 m 97.7 m
INFO 14:22:08,706 TraversalEngine - SK1.chr01:79167 5.20e+04 60.6 s 19.4 m 0.6% 2.6 h 2.6 h
INFO 14:22:38,976 TraversalEngine - SK1.chr01:98653 6.90e+04 90.9 s 22.0 m 0.8% 3.1 h 3.1 h
INFO 14:23:10,903 TraversalEngine - SK1.chr01:114413 8.20e+04 2.0 m 25.0 m 0.9% 3.7 h 3.6 h
INFO 14:23:43,523 TraversalEngine - SK1.chr01:125477 9.20e+04 2.6 m 28.2 m 1.0% 4.2 h 4.2 h
INFO 14:24:15,215 TraversalEngine - SK1.chr01:145667 1.09e+05 3.1 m 28.6 m 1.2% 4.4 h 4.3 h
INFO 14:24:45,785 TraversalEngine - SK1.chr01:163339 1.23e+05 3.6 m 29.5 m 1.3% 4.5 h 4.5 h
INFO 14:25:17,660 TraversalEngine - SK1.chr01:179555 1.46e+05 4.2 m 28.5 m 1.5% 4.7 h 4.7 h
INFO 14:25:49,088 TraversalEngine - SK1.chr01:213605 1.71e+05 4.7 m 27.4 m 1.7% 4.5 h 4.4 h
INFO 14:25:51,716 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ArithmeticException: / by zero
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.downsampleVariantRegion(SlidingWindow.java:539)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegion(SlidingWindow.java:498)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegions(SlidingWindow.java:520)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.close(SlidingWindow.java:562)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.addAlignment(SingleSampleCompressor.java:64)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:70)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:67)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:344)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:83)
    at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:107)
    at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:52)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:71)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:269)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.0-36-gf5c1c1a):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: / by zero
ERROR ------------------------------------------------------------------------------------------
Comments (27)

Hi,

When I run ReduceReads, I get the following exception just when it's supposed to finish:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.util.NoSuchElementException
    at java.util.LinkedList$ListItr.next(Unknown Source)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.updateHeaderCounts(SlidingWindow.java:697)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.addRead(SlidingWindow.java:128)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.addAlignment(SingleSampleCompressor.java:73)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:70)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:67)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:347)
    at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:86)
    at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:107)
    at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:52)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:71)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:269)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.0-21-ga40b695):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

I run it with the standard arguments:

java -jar GenomeAnalysisTK.jar \
    --logging_level ERROR \
    -R hg19.fa \
    -T ReduceReads \
    -I in.bam \
    -o reduced.in.bam

Any suggestions?

Thanks, Thomas

Comments (1)

I'm working with ReduceReads and would like to use it in some kind of parallel mode. The presentation mentions that a 50x-way run may drastically reduce run time, but I'm not sure how to invoke this. I tried -nt and it complained. Should I be giving it multiple intervals and merging? If so, how does it deal with edge variants?

Thanks.
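To make the question concrete, this is the kind of interval scatter I was picturing (a sketch with placeholder names; it assumes non-overlapping per-chromosome intervals and does not answer the edge-variant question):

# One ReduceReads job per chromosome; -nocmp_names is suggested elsewhere
# when scattering, so read names stay unique across shards.
for chr in $(seq 1 22) X Y; do
    java -Xmx4g -jar GenomeAnalysisTK.jar \
        -T ReduceReads \
        -R reference.fa \
        -I sample.recal.bam \
        -L $chr \
        -nocmp_names \
        -o sample.chr$chr.reduced.bam
done

# Gather: merge the per-chromosome shards back into a single BAM.
samtools merge sample.reduced.bam sample.chr*.reduced.bam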