Note: only do this if you have been explicitly asked to do so.
You posted a question about a problem you had with GATK tools, we answered that we think it's a bug, and we asked you to submit a detailed bug report.
A snippet file is a slice of the original BAM file which contains the problematic region and is sufficient to reproduce the error. We need it in order to reproduce the problem on our end, which is the first necessary step to finding and fixing the bug. We ask you to provide this as a snippet rather than the full file so that you don't have to upload (and we don't have to process) huge giga-scale files.
-Largument and progressively narrowing down the interval
-Lto write the problematic region (with 500 bp padding on either side) to a new file -- this is your snippet file.
As reported here, a bug was found in VariantsToBinaryPED. Briefly, VariantsToBinaryPed expected the fam file to describe the samples in the same order as the input VCF file: if they were not in the same order, it did not correctly map sample IDs with the genotypes in the output binary PED.
We expect that in most use cases, the order would be the same (because PLINK uses lexicographic order, as does the GATK), so the bug would not impact results. However, as the user report demonstrates, in cases where order was different, the bug would seriously impact results.
We therefore recommend that anyone who has used VariantsToBinaryPED check their results for any inconsistencies in the kinship coefficients. Our apologies for the inconvenience to anyone who is affected by this bug, and big thanks again to user TimHughes for reporting the bug.
Finally, we have fixed the bug in GATK and released the fixed version under version number 2.7-4.
As reported here:
If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.
Thank you for your patience while we work to fix this issue.
Please be aware that if you have been using BaseRecalibrator scatter-gathered with Queue (GATK versions 2.0 and 2.1), your results may be wrong. You will need to redo the base recalibration of your data WITHOUT scatter-gathering.
This issue will be fixed in the next release (version 2.2). We apologize for any inconvenience this may cause you!
We have identified a major bug in ReduceReads -- GATK versions 2.0 and 2.1. The effect of the bug is that variant regions with more than 100 reads and fewer than 250 reads get downsampled to 0 reads.
This has now been fixed in the most recent release.
To check if you are using a buggy version, run the following:
samtools view -H $BAM
This will produce the following output:
@PG ID:GATK ReduceReads VN:XXX
If XXX is 2.0 or 2.1, any results obtained with your current version are suspect, and you will need to upgrade to the most recent version then rerun your processing.
Our most sincere apologies for the inconvenience.
I did search on the site and seems several have had a similar problem but I couldn't fix mine based on those. I've managed to get to this point succesfully with creating a realigned bam file with IndelRealigner. I'm using a non-model organism with self created masking list from a VCF run done before the realignment (list in bed format). I've had to use the -fixMisencodedQuals so far through out my runs. I checked the masking file so that the ranges do not go over the ranges that are specified in my list of regions to cover (-L option).
Is there anything else I would need to check still?
Command line view:
sulyba@hippu4:/fs/lustre/wrk/sulyba/stickleback_capture> java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -R gasAcu_combinedbac_inv7.fa -I realigned_FF1_inv7c.bam -knownSites Sites_to_Mask.bed -L Capture_Target_Regions.intervals -o recalc_FF1_inv7c.grp -plots recal_FF1_plots.grp.pdf -fixMisencodedQuals
INFO 12:34:10,502 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:34:10,510 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34, Compiled 2013/01/11 22:43:14 INFO 12:34:10,510 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 12:34:10,510 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 12:34:10,516 HelpFormatter - Program Args: -T BaseRecalibrator -R gasAcu_combinedbac_inv7.fa -I realigned_FF1_inv7c.bam -knownSites Sites_to_Mask.bed -L Capture_Target_Regions.intervals -o recalc_FF1_inv7c.grp -plots recal_FF1_plots.grp.pdf -fixMisencodedQuals INFO 12:34:10,516 HelpFormatter - Date/Time: 2013/02/01 12:34:10 INFO 12:34:10,517 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:34:10,517 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:34:10,551 ArgumentTypeDescriptor - Dynamically determined type of Sites_to_Mask.bed to be BED INFO 12:34:10,563 GenomeAnalysisEngine - Strictness is SILENT INFO 12:34:11,309 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 12:34:11,319 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:34:11,392 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07 INFO 12:34:11,416 RMDTrackBuilder - Loading Tribble index from disk for file Sites_to_Mask.bed INFO 12:34:11,660 GenomeAnalysisEngine - Processing 20350670 bp from intervals INFO 12:34:11,672 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 12:34:11,674 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining INFO 12:34:11,946 BaseRecalibrator - The covariates being used here: INFO 12:34:11,947 BaseRecalibrator - ReadGroupCovariate INFO 12:34:11,947 BaseRecalibrator - QualityScoreCovariate INFO 12:34:11,947 BaseRecalibrator - ContextCovariate INFO 12:34:11,947 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3 INFO 12:34:11,947 BaseRecalibrator - CycleCovariate INFO 12:34:11,952 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3] INFO 12:34:11,952 NestedIntegerArray - Pre-allocating first 2 dimensions INFO 12:34:11,952 NestedIntegerArray - Done pre-allocating first 2 dimensions INFO 12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3] INFO 12:34:11,953 NestedIntegerArray - Pre-allocating first 2 dimensions INFO 12:34:11,953 NestedIntegerArray - Done pre-allocating first 2 dimensions INFO 12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3] INFO 12:34:11,953 NestedIntegerArray - Pre-allocating first 2 dimensions INFO 12:34:11,953 NestedIntegerArray - Done pre-allocating first 2 dimensions INFO 12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1002, 3] INFO 12:34:11,954 NestedIntegerArray - Pre-allocating first 2 dimensions INFO 12:34:11,954 NestedIntegerArray - Done pre-allocating first 2 dimensions INFO 12:34:12,109 ReadShardBalancer$1 - Loading BAM index data for next contig INFO 12:34:12,120 ReadShardBalancer$1 - Done loading BAM index data for next contig INFO 12:34:12,644 ReadShardBalancer$1 - Loading BAM index data for next contig INFO 12:34:12,645 ReadShardBalancer$1 - Done loading BAM index data for next contig INFO 12:34:14,901 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.ArrayIndexOutOfBoundsException: -2 at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158) at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:246) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595) at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530) at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34): ##### ERROR ##### ERROR Please visit the wiki to see if this is a known problem ##### ERROR If not, please post the error, with stack trace, to the GATK forum ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: -2 ##### ERROR ------------------------------------------------------------------------------------------
Dear GATK Team,
I am running a pipeline on several high coverage human individuals that have been mapped using bwa and processed using samtools, picard and gatk. The bam-files pass ValidateSam from picard, but when I run the bqsr step some of them fails giving a Malformed read error (using -filterMBQ does not help in this case). I tracked down the error to bamfiles that ends with a paired end read where the mate maps in the beginning of the contig (in my case human mtDNA).
Eg, this will make it crash:
readX 177 MT 16558 37 7S2M2I10M80S = 294 -16176 GACCTGTGATCC...
readY 177 MT 16558 37 7S2M2I10M80S = 238 -16232 GACCTGTGATCC...
readZ 113 MT 16558 37 7S2M2I10M80S = 273 -16197 GACCTGTGATCC...
where a file ending like this wont crash:
readX 83 MT 16469 60 101M = 16246 -324 TGGGGGTAGCTAAAGTGAAC...
readY 147 MT 16469 60 101M = 16267 -303 TGGGGGTAGCTAAAGTGA...
readZ 147 MT 16469 60 101M = 16193 -377 TGGGGGTAGCTAAAGTGAAC...
I am running GATK v2.3-9-ge5ebf34, but the same error occurs using GATK v-2.2-3 (my previous version). I can genotype the files using UnifiedGenotyper without any problem as well.
This is the error:
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Array length mismatch detected. Malformed read? at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateFractionalErrorArray(BaseRecalibrator.java:380) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:246) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:468) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)
We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.
We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:
scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.
In this case, individual 3 has a homozygous reference genotype, despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous.Some of the bases used to call the G allele are of low quality (when looking more closely using IGV), but I would still expect the genotype to be heterozygous.
A reverse example:
scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.
Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.
I've also uploaded screenshots of IGV if that I hope will help clarify the problems we're having. We have used default parameters of GATK in almost all cases, and we did not used VQSR, as we did not have a list of high confidence SNPs at the time.
I was using PrintReads as part of Base Quality Recalibration. The resulting BAM header is incorrect as there are two @PG entries for GATK PrintReads. The bam file I am using is internal to the broad institute and is picard aggregated. There is an entry for GATK PrintReads in the original bam. Would it be possible for the @PG header lines to include a .A-Z like the other headers?
I'm trying to apply my recalibrated quality values with the following command:
java -jar GenomeAnalysisTK.jar -T PrintReads -R genome_noHap.fa -I realigned.bam -BQSR realigned.recal_data.grp -o realigned.recal.bam -nct 25
And get the following error. I couldn't find it elsewhere. Would be happy if I could get some advice.
Let me know if you need more info.
java.lang.ArrayIndexOutOfBoundsException: 0 at org.broadinstitute.sting.utils.recalibration.BaseRecalibration.recalibrateRead(BaseRecalibration.java:151) at org.broadinstitute.sting.utils.recalibration.BQSRReadTransformer.apply(BQSRReadTransformer.java:93) at org.broadinstitute.sting.gatk.walkers.readutils.PrintReads.map(PrintReads.java:251) at org.broadinstitute.sting.gatk.walkers.readutils.PrintReads.map(PrintReads.java:98) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:230) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:218) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722)
I'm running into a problem with vcfs that have single ended break ends. (These are produced by an old version of Strelka .) Tribble doesn't recognize "." as valid in alternative alleles.
Single break ends are valid in the vcf standard and the files validate according to Vcftools.
Others have run into this problem as well: https://groups.google.com/forum/#!searchin/strelka-discuss/gatk/strelka-discuss/gJfsyjZNZXA/ExDXZiVWW_kJ
##### ERROR stack trace org.broad.tribble.TribbleException: The provided VCF file is malformed at approximately line number 1807: Unparsable vcf record with allele .CCCAGGAGGACTCACTGCCGCTGTCACCTCTGCTGCCACCACTGTTGCCAC, for input source: /cga/tcga-gsc/benchmark/Indels/strelkaPON/NA18606.mapped.ILLUMINA.bwa.CHB.exome.20111114.bam-NA18608.mapped.ILLUMINA.bwa.CHB.exome.20111114.bam/final.indels.vcf at org.broadinstitute.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:715) at org.broadinstitute.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:527) at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseSingleAltAllele(AbstractVCFCodec.java:553) at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:494) at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:291) at org.broadinstitute.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:234) at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:213) at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:45) at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:73) at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:35) at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:284) at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:264) at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:225) at org.broadinstitute.sting.tools.CatVariants.execute(CatVariants.java:239) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.tools.CatVariants.main(CatVariants.java:258) ##### ERROR ------------------------------------------------------------------------------------------
Example vcf line
19 36002413 . C .CCCAGGAGGACTCACTGCCGCTGTCACCTCTGCTGCCACCACTGTTGCCAC . PASS IHP=1;NT=ref;QSI=82;QSI_NT=82;SGT=ref->hom;SOMATIC;SVTYPE=BND;TQSI=1;TQSI_NT=1 DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50 49:49:42,44:0,0:7,6:43.72:0.85:0.00 11:11:0,0:6,6:5,5:14.61:0.48:0.0
A full vcf is available at: /humgen/gsa-scr1/pub/incoming/BreakendBug/breakend.vcf
Dear GATK team,
I think I found a bug as shown in the attached screenshot. The highlighted line of the vcf file after using UnifiedGenotyper, VariantRecalibrator, ApplyRecalibration, ReadBackedPhasing, SelectVariants, CombineVariants, and SnpEff (following Best Practices) shows X 307897 . C CGTGAAGG 9494.14 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=8.010;DP=18;EFF=DOWNSTREAM(MODIFIER||||PPP2R3B|protein_coding|CODING|ENST00000381625|),INTRON(MODIFIER||||PPP2R3B|protein_coding|CODING|ENST00000390665|4),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000445792|5),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000475859|2),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000477110|1),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000496630|3),UPSTREAM(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000468169|),UPSTREAM(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000477636|),UPSTREAM(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000484364|);FS=0.000;InbreedingCoeff=0.3019;MQ=45.55;MQ0=0;MQRankSum=-8.092;QD=9.55;ReadPosRankSum=2.020;VQSLOD=2.27;culprit=ReadPosRankSum;set=29_BOTH GT:AD:DP:GQ:PL ./. 0/1:0,14:9:99:367,42,0
To be sure, I checked the two samples' bams at the PrintReads stage but do not see such a long indel. Is this a potentially similar bug to http://gatkforums.broadinstitute.org/discussion/2516/rare-homozygous-indel-bug-in-gatk-2-4-9 ?
Thank you for your attention.