When you run ReduceReads, the algorithm finds regions of low variation in the genome and compresses them together. To represent each compressed region, we use a synthetic read that carries all the information downstream tools need to perform likelihood calculations over the reduced data.
These reads are called synthetic because they are not read by a sequencer; they are generated automatically by the GATK and can be extremely long. In a synthetic read, each base represents the consensus base for that genomic location, and its consensus quality score is stored at the equivalent offset in the quality score string.
ReduceReads has several filtering parameters for consensus regions. The consensus is created based on base qualities, mapping qualities and other parameters adjustable from the command line. All filters are described in the technical documentation of ReduceReads.
The consensus quality score of a consensus base is essentially the mean of the quality scores of all bases that passed all the filters and represent an observation of that base. It is stored in the quality score field of the SAM format.

n is the number of bases that contributed to the consensus base and q_i is the corresponding quality score of each base.
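Written out under the assumption that "essentially the mean" is a plain arithmetic mean (the symbol Q_consensus is introduced here for illustration only and is not from the GATK documentation):

```latex
Q_{\text{consensus}} = \frac{1}{n} \sum_{i=1}^{n} q_i
```

For example, three contributing bases with quality scores 30, 35 and 40 would give a consensus quality of (30 + 35 + 40) / 3 = 35.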
Insertion quality scores and Deletion quality scores (generated by BQSR) will undergo the same process and will be represented the same way.
The mapping quality of a synthetic read is a value representative of the mapping qualities of all the reads that contributed to it. This is an average of the root mean square of the mapping quality of all reads that contributed to the bases of the synthetic read. It is represented in the mapping quality score field of the SAM format.

where n is the number of reads and x_i is the mapping quality of each read.
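Assuming the value is the standard root mean square of the contributing mapping qualities (the symbol MQ_synthetic is introduced here for illustration only), this can be written as:

```latex
MQ_{\text{synthetic}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^{2}}
```

For example, three reads with mapping qualities 60, 60 and 30 would give sqrt((3600 + 3600 + 900) / 3) = sqrt(2700), or roughly 52.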
A synthetic read may come with up to two extra tags representing its original alignment information. Because of the many filters in ReduceReads, reads are hard-clipped to the area of interest. These hard clips are always represented in the CIGAR string with the H element and the length of the clipping in genomic coordinates. Sometimes hard clipping makes it impossible to retrieve the original alignment start or end of a read. In those cases, the read will contain extra tags with integer values representing its original alignment start or end.
Here are the two integer tags:
For all other reads, where this can still be obtained through the cigar string (i.e. using getAlignmentStart() or getUnclippedStart()), these tags are not created.
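For the case where the CIGAR string is sufficient, the idea can be sketched as follows. This is a minimal illustration, not the GATK or Picard implementation: the class and method names are hypothetical, and it assumes that the unclipped start is the alignment start minus the total length of the leading H and S elements.

```java
// Minimal sketch: recover an unclipped start from an alignment start and a CIGAR string.
// Hypothetical helper class, not part of the GATK API.
class UnclippedStartSketch {

    /** Sums the lengths of the leading hard-clip (H) and soft-clip (S) elements. */
    static int leadingClipLength(String cigar) {
        int total = 0;   // clipped bases seen so far
        int length = 0;  // length of the element currently being parsed
        for (int k = 0; k < cigar.length(); k++) {
            char c = cigar.charAt(k);
            if (Character.isDigit(c)) {
                length = length * 10 + (c - '0');
            } else {
                if (c != 'H' && c != 'S') {
                    break; // first non-clipping operator ends the leading clips
                }
                total += length;
                length = 0;
            }
        }
        return total;
    }

    /** Alignment start shifted left by the leading clipped bases. */
    static int unclippedStart(int alignmentStart, String cigar) {
        return alignmentStart - leadingClipLength(cigar);
    }

    public static void main(String[] args) {
        System.out.println(unclippedStart(1000, "5H10S85M"));
    }
}
```

When the clipped length recorded in the CIGAR no longer lets you reconstruct the original coordinate, this calculation is not enough, which is the situation the extra integer tags are meant to cover.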
The RR tag holds the observed depth (after filters) of every base that contributed to a reduced read. That means all bases that passed the mapping and base quality filters and had the same observation as the one in the reduced read.
The RR tag carries an array of bytes. For increased compression, it works like this: the first number represents the depth of the first base in the reduced read, and all subsequent numbers represent the depth offset from the first base. Therefore, to calculate the depth of base "i" using the RR array, one must use:
RR[0] + RR[i]
but make sure i > 0. Here is the code we use to return the depth of the i'th base:
return (i==0) ? firstCount : (byte) Math.min(firstCount + offsetCount, Byte.MAX_VALUE);
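To make the offset encoding concrete, here is a small self-contained sketch built around the same expression. The class name, method name and sample array are hypothetical; only the `firstCount` / `offsetCount` logic comes from the line above.

```java
// Minimal sketch of decoding per-base depth from an RR-style byte array.
// Hypothetical class and sample data, not the GATKSAMRecord implementation.
class ReducedReadDepthSketch {

    /** Depth of the i-th base: rr[0] for i == 0, otherwise rr[0] + rr[i], capped at Byte.MAX_VALUE. */
    static byte depthAt(byte[] rr, int i) {
        if (i < 0 || i >= rr.length) {
            throw new IndexOutOfBoundsException("base index out of range: " + i);
        }
        final byte firstCount = rr[0];
        if (i == 0) {
            return firstCount;
        }
        final byte offsetCount = rr[i]; // signed offset relative to the first base
        return (byte) Math.min(firstCount + offsetCount, Byte.MAX_VALUE);
    }

    public static void main(String[] args) {
        // First entry is an absolute depth, the rest are offsets from it.
        byte[] rr = {10, 0, 2, -3, 120};
        for (int i = 0; i < rr.length; i++) {
            System.out.println("base " + i + " depth = " + depthAt(rr, i));
        }
    }
}
```

With this sample array the decoded depths are 10, 10, 12, 7 and 127; the last value shows the cap at `Byte.MAX_VALUE`, since 10 + 120 would not fit in a signed byte.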
The GATK is 100% compatible with synthetic reads. You can use reduced BAM files in combination with non-reduced BAM files in any GATK analysis tool and it will work seamlessly.
If you are programming using the GATK framework, the GATKSAMRecord class carries all the necessary functionality to use synthetic reads transparently with methods like:
We have identified a major bug in ReduceReads -- GATK versions 2.0 and 2.1. The effect of the bug is that variant regions with more than 100 reads and fewer than 250 reads get downsampled to 0 reads.
This has now been fixed in the most recent release.
To check if you are using a buggy version, run the following:
samtools view -H $BAM
The header output will include a line like the following:
@PG ID:GATK ReduceReads VN:XXX
If XXX is 2.0 or 2.1, any results obtained with your current version are suspect, and you will need to upgrade to the most recent version then rerun your processing.
Our most sincere apologies for the inconvenience.
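The version check described above can be scripted once the header line has been extracted. The following is a minimal sketch, assuming the @PG line has the shape shown above and that the version string begins with `major.minor`; the class and method names are hypothetical and the sample lines are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: flag ReduceReads @PG header lines written by GATK 2.0 or 2.1.
// Hypothetical helper, not part of GATK or samtools.
class ReduceReadsVersionCheckSketch {

    // Captures the major and minor numbers from a "VN:major.minor..." field.
    private static final Pattern VERSION = Pattern.compile("VN:(\\d+)\\.(\\d+)");

    /** Returns true if the line is a ReduceReads @PG record with version 2.0 or 2.1. */
    static boolean isAffected(String pgLine) {
        if (!pgLine.startsWith("@PG") || !pgLine.contains("ReduceReads")) {
            return false;
        }
        Matcher m = VERSION.matcher(pgLine);
        if (!m.find()) {
            return false; // no recognizable version field
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        return major == 2 && (minor == 0 || minor == 1);
    }

    public static void main(String[] args) {
        // Illustrative header lines only.
        System.out.println(isAffected("@PG ID:GATK ReduceReads VN:2.1-8"));
        System.out.println(isAffected("@PG ID:GATK ReduceReads VN:2.2-3"));
    }
}
```

A line that is flagged here corresponds to the "XXX is 2.0 or 2.1" case, meaning the BAM should be regenerated with a newer release.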
GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
Does the relationship between AD and DP still hold in VCFs produced from ReduceReads BAMs? That is, is the sum of AD still <= DP, or can other scenarios now occur?
Also is AD summarized to 1,0 or 0,1 for homozygous REF and ALT? Thanks.
Hi Team,
I have been running GATK2 ReduceReads on a large (100 GB) BAM file. At the very beginning it runs very smoothly and predicts a week to finish the task, but after a few hours it gets totally stuck. We first thought that it could be a garbage collection (or Java memory allocation) issue, but the logs show that garbage collection works well.
The command is (similar behavior for smaller Xms and Xmx values) java -Xmx30g -Xms30g -XX:+PrintGCTimeStamps -XX:+UseParallelOldGC -XX:+PrintGCDetails -Xloggc:gc.log -verbose:gc -jar $path $ref -T ReduceReads -I input.bam -o output.bam
The first few lines of the log file are
INFO 01:12:21,541 TraversalEngine - chr1:1094599 5.89e+05 9.9 m 16.8 m 0.0% 19.4 d 19.4 d
INFO 01:13:21,628 TraversalEngine - chr1:2112411 9.44e+05 10.9 m 11.6 m 0.1% 11.2 d 11.2 d
INFO 01:14:22,065 TraversalEngine - chr1:3051535 1.29e+06 11.9 m 9.3 m 0.1% 8.5 d 8.5 d
INFO 01:15:22,297 TraversalEngine - chr1:4084547 1.59e+06 12.9 m 8.1 m 0.1% 6.9 d 6.9 d
INFO 01:16:24,130 TraversalEngine - chr1:4719991 1.82e+06 13.9 m 7.7 m 0.2% 6.4 d 6.4 d
but after a short while it gets totally stuck; even at position 121485073 of chromosome 1 there is almost no progress at all, and the estimated finish time goes over 11 weeks and is still increasing.
Any idea what the reason for this could be, and how we can solve the problem? The same command runs successfully on small (less than 5gig) Bam files though
Thanks in advance. --Sina
Hello dear GATK Team,
When trying to run HaplotypeCaller on my exome files prepared with ReduceReads, I get the error stated below. As you can see, the newest GATK version is used. Also, UnifiedGenotyper does not produce any errors on the exact same data (90 SOLiD exomes created according to Best Practices v4).
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Somehow the requested coordinate is not covered by the read. Too many deletions?
at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:447)
at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:396)
at org.broadinstitute.sting.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:392)
at org.broadinstitute.sting.gatk.walkers.annotator.DepthOfCoverage.annotate(DepthOfCoverage.java:56)
at org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation.annotate(InfoFieldAnnotation.java:24)
at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:223)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:429)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:104)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegion(TraverseActiveRegions.java:249)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.callWalkerMapOnActiveRegions(TraverseActiveRegions.java:204)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegions(TraverseActiveRegions.java:179)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:136)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:29)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:74)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-3-gde33222):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?
##### ERROR ------------------------------------------------------------------------------------------
The Command line used (abbreviated):
java -Xmx30g -jar /home/common/GenomeAnalysisTK-2.2-3/GenomeAnalysisTK.jar \
-R /home/common/hg19/ucschg19/ucsc.hg19.fasta \
-T HaplotypeCaller \
-I ReduceReads/XXXXX.ontarget.MarkDups.nRG.reor.Real.Recal.reduced.bam [x90]\
--dbsnp /home/common/hg19/dbsnp_135.hg19.vcf \
-o 93Ind_ped_reduced_HC_snps.raw.vcf \
-ped familys.ped \
--pedigreeValidationType SILENT \
-stand_call_conf 20.0 \
-stand_emit_conf 10.0
Hi,
when I run ReduceReads I get the following exception just when it's supposed to finish:
java.util.NoSuchElementException
at java.util.LinkedList$ListItr.next(Unknown Source)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.updateHeaderCounts(SlidingWindow.java:697)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.addRead(SlidingWindow.java:128)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.addAlignment(SingleSampleCompressor.java:73)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:70)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:67)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:347)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:86)
at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:107)
at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:52)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:71)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:269)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
I run it with the standard arguments:
java -jar GenomAnalysisTK.jar \
--logging_level ERROR \
-R hg19.fa \
-T ReduceReads \
-I in.bam \
-o reduced.in.bam
Any suggestions?
Thanks, Thomas
Hi,
I'm trying to use GATK release 2.0 with my nine exome-seq samples. Following the steps in the Best Practices, I generated per-sample, ready-to-process .bam files and then used -T ReduceReads to generate .reduced.bam files for the next step (-T UnifiedGenotyper). When using these .reduced.bam files as UG input I receive this error message: "##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Too many deletions?" If I take my original .bam files as input, things work smoothly. Do you have any idea what causes the problem?
Thanks a lot, Samira
Here are the command lines I use:
java -Xmx4g -jar $GATKv4 \
-R $GATK_BUNDLE/ucsc.hg19.fasta \
-T ReduceReads \
-L $capture_library.bed \
-I $i.recal_s.bam \
-o $i.reduced.bam
java -jar $GATKv4 \
-T HaplotypeCaller \
-R $GATK_BUNDLE/ucsc.hg19.fasta \
-I InputReducedBams.list \
-L $capture_library.bed \
--dbsnp GATK_BUNDLE/dbsnp_135.hg19.vcf \
-o raw.snp.indel.UnifiedGenotyper.rsv.vcf
Hi, I'm running GATK version 2.1-8 with reads mapped to mm10. ReduceReads fails somewhere on chr2 with above message. From previous posts I understood that this bug has appeared already? Could you please help me to fix it? Thank you,
Ania
Hi, I'm just wondering if it is a good idea to run my pipeline again with ReduceReads. I skipped it originally as I only have four (mouse) samples but having re-read the documentation with the additional filters, I am now considering if it might add value. Any thoughts appreciated.
Hi there, I've tried to run ReduceReads for the first time and I got this error:
`$ java -Xmx8g -jar /lustre1/tools/bin/GenomeAnalysisTK-2.3-6.jar -T ReduceReads -R /lustre1/genomes/hg19/fa/hg19.fa -I filein.bam -o fileout.bam […]
java.lang.NullPointerException
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.closeVariantRegions(SingleSampleCompressor.java:83)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.closeVariantRegionsInAllSamples(MultiSampleCompressor.java:94)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:76)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:67)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:387)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:87)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:226)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:215)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:254)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
` Is there something wrong with running RR with multiple samples?
I have read on your recent slides for "Data Compression with Reduce Reads" that "Tumor and Normal samples (or any set of samples) get co-reduced, meaning that every variable region triggered by one sample will be forced in every sample."
I have data from 4 variant strains of an organism, my samples in RG info, and 4 individuals for each strain, my libraries in RG info. Currently I have a bam file for each of the 16 different libraries.
I want to run ReduceReads because I have quite high coverage, but I also want to preserve information across all of my samples at sites that are non-consensus in just one of them, since there is no SNP information available for this organism and I don't want to lose any important data. Should I merge all BAM files for all samples before running ReduceReads with downsampling turned off? Or should I just leave out ReduceReads?
Thanks Anna
Hello everyone, I have a question about ReduceReads when using scatter/gather. In the argument details of ReduceReads you write for the parameter -nocmp_names: "... If you scatter/gather there is no guarantee that read name uniqueness will be maintained -- in this case we recommend not compressing."
Do you mean, that if I use scatter/gather, I should use ReduceReads with the -nocmp_names option so that the read names will not be compressed OR do you mean that I should not use ReduceReads at all when scatter/gathering.
I assume the first is meant, I just wanted to make sure. Thank you for your time and effort. Eva
Hi all,
I am analyzing some whole genome sequencing data. After preprocessing with Queue I got a large sample-level BAM file (~200 GB/sample), and I wanted to use the ReduceReads module to reduce the BAM file size. I ran the following command: /usr/java/latest/bin/java -Xmx16g -jar /path_to_GenomeAnalysisTK-2.3-9/GenomeAnalysisTK.jar -R /path_to_human_g1k_v37.fasta -T ReduceReads -I /path_to_Queue/project.sample.clean.dedup.recal.bam -o sample.reduced.bam --generate_md5
After 8 hours, the estimated time goes to 6.9 days.
INFO 20:02:25,508 ProgressMeter - 1:120660726 5.63e+07 6.5 h 7.0 m 3.9% 7.0 d 6.7 d
INFO 20:03:25,509 ProgressMeter - 1:120660726 5.63e+07 6.5 h 7.0 m 3.9% 7.0 d 6.7 d
INFO 20:04:25,510 ProgressMeter - 1:120660726 5.63e+07 6.6 h 7.0 m 3.9% 7.0 d 6.8 d
INFO 20:05:25,511 ProgressMeter - 1:120660726 5.63e+07 6.6 h 7.0 m 3.9% 7.0 d 6.8 d
INFO 20:06:25,512 ProgressMeter - 1:120677835 5.63e+07 6.6 h 7.0 m 3.9% 7.1 d 6.8 d
INFO 20:07:25,528 ProgressMeter - 1:120677835 5.63e+07 6.6 h 7.0 m 3.9% 7.1 d 6.8 d
INFO 20:08:25,529 ProgressMeter - 1:120677835 5.63e+07 6.6 h 7.1 m 3.9% 7.1 d 6.8 d
INFO 20:09:25,530 ProgressMeter - 1:120677835 5.63e+07 6.6 h 7.1 m 3.9% 7.1 d 6.8 d
INFO 20:10:25,531 ProgressMeter - 1:120677835 5.63e+07 6.7 h 7.1 m 3.9% 7.1 d 6.9 d
INFO 20:11:25,532 ProgressMeter - 1:120677835 5.63e+07 6.7 h 7.1 m 3.9% 7.2 d 6.9 d
INFO 20:12:25,533 ProgressMeter - 1:120677835 5.63e+07 6.7 h 7.1 m 3.9% 7.2 d 6.9 d
INFO 20:13:25,534 ProgressMeter - 1:120677835 5.63e+07 6.7 h 7.2 m 3.9% 7.2 d 6.9 d
INFO 20:14:25,535 ProgressMeter - 1:120677835 5.63e+07 6.7 h 7.2 m 3.9% 7.2 d 6.9 d
The tool version is GenomeAnalysisTK-2.3-9
Is there anything wrong with my command ? How could I speed up this procedure? Thanks a lot .
I'm working with ReduceReads and would like to use it in some kind of parallel mode. The presentation mentions that a 50x way run may drastically reduce run time but I'm not sure how to invoke this. I tried -nt and it complained. Should I be giving it multiple intervals and merging? If so, how does it deal with edge variants?
Thanks.
Hello all, thanks for the great work.
I have run into an issue with ReduceReads and I was hoping you could offer some insight. I'm getting the following stack trace (attached file). I looked around the forums, and others who were getting stack traces using ReduceReads were told that code fixes would remedy the issue, so I thought I would check with you. I also ran samtools flagstat.
ReduceReads is very slow for MT reads. After it gets by the MT, it runs much faster (See output below)
Any ideas why and what to do to speed it up?
John
INFO 23:32:37,536 HelpFormatter - ---------------------------------------------------------------------------------
INFO 23:32:37,545 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.2-16-g9f648cb, Compiled 2012/12/04 03:46:58
INFO 23:32:37,545 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 23:32:37,545 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 23:32:37,551 HelpFormatter - Program Args: -R /unprotected/projects/genetics_program/resources/gatk_bundle/hg19/ucsc.hg19.fasta -T ReduceReads -I bam/LP6005113-DNA_E01.recal.bam -o LP6005113-DNA_E01.reduced.bam
INFO 23:32:37,551 HelpFormatter - Date/Time: 2013/01/28 23:32:37
INFO 23:32:37,551 HelpFormatter - ---------------------------------------------------------------------------------
INFO 23:32:37,551 HelpFormatter - ---------------------------------------------------------------------------------
INFO 23:32:37,605 GenomeAnalysisEngine - Strictness is SILENT
INFO 23:32:37,984 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 23:32:37,992 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 23:32:38,073 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.08
INFO 23:32:38,113 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 23:32:38,114 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 23:33:14,010 ProgressMeter - chrM:1992 4.00e+04 35.9 s 15.0 m 0.0% 93.5 w 93.5 w
INFO 23:35:18,038 ProgressMeter - chrM:2879 6.00e+04 2.7 m 44.4 m 0.0% 288.2 w 288.2 w
INFO 23:37:06,320 ProgressMeter - chrM:3259 7.00e+04 4.5 m 63.9 m 0.0% 427.0 w 427.0 w
INFO 23:39:03,457 ProgressMeter - chrM:3662 8.00e+04 6.4 m 80.3 m 0.0% 546.0 w 546.0 w
INFO 23:41:16,174 ProgressMeter - chrM:4087 9.00e+04 8.6 m 95.9 m 0.0% 657.7 w 657.7 w
INFO 23:43:46,243 ProgressMeter - chrM:4550 1.00e+05 11.1 m 111.3 m 0.0% 761.8 w 761.8 w
INFO 23:46:47,501 ProgressMeter - chrM:4973 1.10e+05 14.2 m 2.1 h 0.0% 886.1 w 886.1 w
INFO 23:49:57,085 ProgressMeter - chrM:5379 1.20e+05 17.3 m 2.4 h 0.0% 1002.1 w 1002.1 w
INFO 23:52:52,173 ProgressMeter - chrM:5823 1.30e+05 20.2 m 2.6 h 0.0% 1081.7 w 1081.7 w
INFO 23:54:28,697 ProgressMeter - chrM:7492 1.70e+05 21.8 m 2.1 h 0.0% 907.5 w 907.5 w
INFO 23:55:45,484 ProgressMeter - chrM:7883 1.80e+05 23.1 m 2.1 h 0.0% 913.0 w 913.0 w
INFO 23:57:16,597 ProgressMeter - chrM:8305 1.90e+05 24.6 m 2.2 h 0.0% 923.5 w 923.5 w
INFO 23:59:07,109 ProgressMeter - chrM:8731 2.00e+05 26.5 m 2.2 h 0.0% 944.1 w 944.1 w
INFO 00:01:16,623 ProgressMeter - chrM:9124 2.10e+05 28.6 m 2.3 h 0.0% 977.1 w 977.1 w
INFO 00:04:12,150 ProgressMeter - chrM:9526 2.20e+05 31.6 m 2.4 h 0.0% 1031.4 w 1031.4 w
INFO 00:06:51,054 ProgressMeter - chrM:9896 2.30e+05 34.2 m 2.5 h 0.0% 1076.2 w 1076.2 w
INFO 00:09:31,477 ProgressMeter - chrM:10244 2.40e+05 36.9 m 2.6 h 0.0% 1120.9 w 1120.9 w
INFO 00:12:57,847 ProgressMeter - chrM:10626 2.50e+05 40.3 m 2.7 h 0.0% 1181.3 w 1181.3 w
INFO 00:16:48,872 ProgressMeter - chrM:11139 2.60e+05 44.2 m 2.8 h 0.0% 1234.5 w 1234.5 w
INFO 00:20:54,282 ProgressMeter - chrM:11634 2.70e+05 48.3 m 3.0 h 0.0% 1291.4 w 1291.4 w
INFO 00:25:23,381 ProgressMeter - chrM:12098 2.80e+05 52.8 m 3.1 h 0.0% 1357.2 w 1357.2 w
INFO 00:30:01,695 ProgressMeter - chrM:12464 2.90e+05 57.4 m 3.3 h 0.0% 1433.2 w 1433.2 w
INFO 00:34:41,008 ProgressMeter - chrM:12805 3.00e+05 62.0 m 3.4 h 0.0% 1508.2 w 1508.2 w
INFO 00:39:41,462 ProgressMeter - chrM:13307 3.10e+05 67.1 m 3.6 h 0.0% 1568.4 w 1568.4 w
INFO 00:45:21,827 ProgressMeter - chrM:13764 3.20e+05 72.7 m 3.8 h 0.0% 1644.6 w 1644.6 w
INFO 00:51:15,645 ProgressMeter - chrM:14173 3.30e+05 78.6 m 4.0 h 0.0% 1726.7 w 1726.7 w
INFO 00:57:38,039 ProgressMeter - chrM:14639 3.40e+05 85.0 m 4.2 h 0.0% 1807.2 w 1807.2 w
INFO 01:06:06,413 ProgressMeter - chrM:15067 3.50e+05 93.5 m 4.5 h 0.0% 1930.9 w 1930.9 w
INFO 01:15:07,742 ProgressMeter - chrM:15463 3.60e+05 102.5 m 4.7 h 0.0% 2063.0 w 2063.0 w
INFO 01:23:17,067 ProgressMeter - chrM:15827 3.70e+05 110.6 m 5.0 h 0.0% 2176.0 w 2176.0 w
INFO 01:31:08,225 ProgressMeter - chrM:16237 3.80e+05 118.5 m 5.2 h 0.0% 2271.6 w 2271.5 w
INFO 01:32:08,631 ProgressMeter - chr1:3000534 1.17e+06 119.5 m 102.5 m 0.1% 12.3 w 12.3 w
INFO 01:33:09,058 ProgressMeter - chr1:5169965 1.91e+06 2.0 h 63.2 m 0.2% 7.2 w 7.2 w
INFO 01:34:09,530 ProgressMeter - chr1:7090404 2.65e+06 2.0 h 45.9 m 0.2% 5.3 w 5.3 w
INFO 01:35:10,334 ProgressMeter - chr1:8806475 3.32e+06 2.0 h 37.0 m 0.3% 4.3 w 4.3 w
INFO 01:36:10,654 ProgressMeter - chr1:10887467 4.08e+06 2.1 h 30.3 m 0.3% 3.5 w 3.5 w
INFO 01:37:10,892 ProgressMeter - chr1:12756332 4.77e+06 2.1 h 26.1 m 0.4% 3.0 w 3.0 w
INFO 01:38:11,087 ProgressMeter - chr1:14746000 5.29e+06 2.1 h 23.8 m 0.5% 18.5 d 18.4 d
INFO 01:39:11,327 ProgressMeter - chr1:16699493 6.02e+06 2.1 h 21.0 m 0.5% 16.5 d 16.4 d
INFO 01:40:11,606 ProgressMeter - chr1:18706430 6.86e+06 2.1 h 18.6 m 0.6% 14.8 d 14.8 d
I've tried using the output of the Reduced Bams as an input to Crest (after some preprocessing) but it hangs on chr7. Has anyone else used the reduced bam in other programs? Is this output meant to only be used in GATK?
Thanks!
New gatk version... trying out ReduceReads again.
6 of 8 exomes I tried were processed by ReduceReads just fine, but two throw the exception Removed too many insertions, header is now negative! (at different genomic locations).
I did not find any mention of this error in the GATK forums, is this a known problem?
Command line: java -Xmx6g -jar GenomeAnalysisTK.jar -R human_g1k_v37.fasta -T ReduceReads -o test.rr.bam -I rr-too-many-insertions.bam
java -v: java version "1.6.0_27" Java(TM) SE Runtime Environment (build 1.6.0_27-b07) Java HotSpot(TM) 64-Bit Server VM (build 20.2-b06, mixed mode)
Run log:
INFO 16:03:26,898 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:03:27,382 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-0-g9593e74, Compiled 2012/12/17 16:58:19
INFO 16:03:27,383 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:03:27,383 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:03:27,388 HelpFormatter - Program Args: -R human_g1k_v37.fasta -T ReduceReads -o test.rr.bam -I rr-too-many-insertions.bam
INFO 16:03:27,388 HelpFormatter - Date/Time: 2012/12/18 16:03:26
INFO 16:03:27,388 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:03:27,388 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:03:27,471 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:03:27,577 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO 16:03:27,585 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:03:27,620 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 16:03:27,656 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 16:03:27,657 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 16:03:27,714 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO 16:03:27,717 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO 16:03:27,739 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO 16:03:28,739 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Removed too many insertions, header is now negative!
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.HeaderElement.removeInsertionToTheRight(HeaderElement.java:151)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.updateHeaderCounts(SlidingWindow.java:881)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.removeFromHeader(SlidingWindow.java:816)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.compressVariantRegion(SlidingWindow.java:604)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegion(SlidingWindow.java:623)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegions(SlidingWindow.java:643)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.closeVariantRegions(SingleSampleCompressor.java:83)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.closeVariantRegionsInAllSamples(MultiSampleCompressor.java:94)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:76)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:67)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:387)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:87)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:226)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsReduce.apply(TraverseReadsNano.java:215)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:254)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:94)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-0-g9593e74):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Removed too many insertions, header is now negative!
##### ERROR ------------------------------------------------------------------------------------------
(There is no progress listed here because this log is from after I bisected to find a narrow region where the problem is occurring.)
I am using UnifiedGenotyper to call SNPs in certain regions from custom capture data. I previously had the pipeline working, but now I am trying with files that have been reduced using ReduceReads, and also changed to a newer version. I have many bam files, but I also get the error when I try with just two. See below for my script and the error message.
Many thanks.
java -Xmx20g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
-R human_g1k_v37.fasta \
-B:dbsnp,vcf dbsnp_132.b37.vcf \
-L baitgroupfile.picard \
-I file1.reduced.bam \
-I file2.reduced.bam \
-o out.vcf \
-stand_call_conf 50.0 \
-stand_emit_conf 10.0 \
-G Standard \
-metrics out.metrics
net.sf.samtools.SAMFormatException: Unrecognized tag type: B
at net.sf.samtools.BinaryTagCodec.readValue(BinaryTagCodec.java:270)
at net.sf.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:220)
at net.sf.samtools.BAMRecord.decodeAttributes(BAMRecord.java:302)
at net.sf.samtools.BAMRecord.getAttribute(BAMRecord.java:282)
at net.sf.samtools.SAMRecord.getAttribute(SAMRecord.java:830)
at net.sf.picard.sam.MergingSamRecordIterator.next(MergingSamRecordIterator.java:132)
at net.sf.picard.sam.MergingSamRecordIterator.next(MergingSamRecordIterator.java:39)
at org.broadinstitute.sting.gatk.iterators.PrivateStringSAMCloseableIterator.next(StingSAMIteratorAdapter.java:100)
at org.broadinstitute.sting.gatk.iterators.PrivateStringSAMCloseableIterator.next(StingSAMIteratorAdapter.java:84)
at org.broadinstitute.sting.gatk.datasources.simpleDataSources.SAMDataSource$ReleasingIterator.next(SAMDataSource.java:803)
at org.broadinstitute.sting.gatk.datasources.simpleDataSources.SAMDataSource$ReleasingIterator.next(SAMDataSource.java:769)
at org.broadinstitute.sting.gatk.iterators.ReadFormattingIterator.next(ReadFormattingIterator.java:77)
at org.broadinstitute.sting.gatk.iterators.ReadFormattingIterator.next(ReadFormattingIterator.java:19)
at org.broadinstitute.sting.gatk.filters.CountingFilteringIterator.getNextRecord(CountingFilteringIterator.java:106)
at org.broadinstitute.sting.gatk.filters.CountingFilteringIterator.
Hello,
I'm trying to call variants using UnifiedGenotyper on about 450 reduced BAMs in 100000 bp chunks. It works fine for some of the chunks, but for others I get the following error message:
Can anyone explain to me why there is a problem with a specific bam file when I call on for example chunk chr20:25400000-25500000 but not when I call on chunk chr20:10000000-10100000?
Thank you, Tota
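For readers reproducing this kind of chunked run, here is a minimal sketch of how fixed-size interval strings could be generated for `-L`. The helper name `chunk_intervals` is hypothetical and not part of GATK. It also assumes non-overlapping, 1-based inclusive chunks, which differs slightly from the chunk boundaries quoted in the post above.

```python
def chunk_intervals(contig, start, end, size=100_000):
    """Split [start, end] (1-based, inclusive) on one contig into
    consecutive, non-overlapping 'contig:start-stop' strings of at
    most `size` bases each."""
    if size <= 0:
        raise ValueError("size must be positive")
    intervals = []
    pos = start
    while pos <= end:
        stop = min(pos + size - 1, end)   # last chunk may be shorter
        intervals.append(f"{contig}:{pos}-{stop}")
        pos = stop + 1
    return intervals
```

Logging which interval string was used for each job makes it easier to tie a failure back to a specific chunk and input file.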
We are trying to see whether using ReduceReads will help with the overwhelming file sizes in the SNP calling we are doing on whole-genome BAM files. We have been using a protocol similar to the one described in the Best Practices document: "Best: multi-sample realignment with known sites and recalibration". My question is: what is the best point in the pipeline to use ReduceReads?
Hi all, I am trying to use the new feature "ReduceReads" and I get an error every time. Can anyone tell me what the problem is? BTW, I am working on the yeast genome and not human, if that matters.
INFO 14:21:07,687 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:21:07,688 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-36-gf5c1c1a, Compiled 2012/08/08 20:17:07
INFO 14:21:07,688 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 14:21:07,688 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 14:21:07,689 HelpFormatter - Program Args: -R /home/mps/references/SK1_v2/fasta/SK1_v2.fixed.fa -T ReduceReads -I output.marked.realigned.fixed.recal.bam -o output.marked.realigned.fixed.recal.reduced.bam -l INFO
INFO 14:21:07,689 HelpFormatter - Date/Time: 2012/08/09 14:21:07
INFO 14:21:07,689 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:21:07,690 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:21:07,759 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:21:07,791 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:21:07,804 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 14:21:08,076 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 14:21:08,076 TraversalEngine - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 14:21:38,548 TraversalEngine - SK1.chr01:63354 3.90e+04 30.5 s 13.0 m 0.5% 98.2 m 97.7 m
INFO 14:22:08,706 TraversalEngine - SK1.chr01:79167 5.20e+04 60.6 s 19.4 m 0.6% 2.6 h 2.6 h
INFO 14:22:38,976 TraversalEngine - SK1.chr01:98653 6.90e+04 90.9 s 22.0 m 0.8% 3.1 h 3.1 h
INFO 14:23:10,903 TraversalEngine - SK1.chr01:114413 8.20e+04 2.0 m 25.0 m 0.9% 3.7 h 3.6 h
INFO 14:23:43,523 TraversalEngine - SK1.chr01:125477 9.20e+04 2.6 m 28.2 m 1.0% 4.2 h 4.2 h
INFO 14:24:15,215 TraversalEngine - SK1.chr01:145667 1.09e+05 3.1 m 28.6 m 1.2% 4.4 h 4.3 h
INFO 14:24:45,785 TraversalEngine - SK1.chr01:163339 1.23e+05 3.6 m 29.5 m 1.3% 4.5 h 4.5 h
INFO 14:25:17,660 TraversalEngine - SK1.chr01:179555 1.46e+05 4.2 m 28.5 m 1.5% 4.7 h 4.7 h
INFO 14:25:49,088 TraversalEngine - SK1.chr01:213605 1.71e+05 4.7 m 27.4 m 1.7% 4.5 h 4.4 h
INFO 14:25:51,716 GATKRunReport - Uploaded run statistics report to AWS S3
java.lang.ArithmeticException: / by zero
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.downsampleVariantRegion(SlidingWindow.java:539)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegion(SlidingWindow.java:498)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.closeVariantRegions(SlidingWindow.java:520)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SlidingWindow.close(SlidingWindow.java:562)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.SingleSampleCompressor.addAlignment(SingleSampleCompressor.java:64)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.MultiSampleCompressor.addAlignment(MultiSampleCompressor.java:70)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReadsStash.compress(ReduceReadsStash.java:67)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:344)
at org.broadinstitute.sting.gatk.walkers.compression.reducereads.ReduceReads.reduce(ReduceReads.java:83)
at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:107)
at org.broadinstitute.sting.gatk.traversals.TraverseReads.traverse(TraverseReads.java:52)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:71)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:269)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)