
Note: only do this if you have been explicitly asked to do so.

### Scenario:

You posted a question about a problem you had with GATK tools, we answered that we think it's a bug, and we asked you to submit a detailed bug report.

### Here's what you need to provide:

• The exact command line that you used when you had the problem (in a text file)
• The full log output (program output in the console) from the start of the run to the end or error message (in a text file)
• A snippet of the BAM file if applicable and the index (.bai) file associated with it
• If a non-standard reference (i.e. not available in our resource bundle) was used, we need the .fasta, .fai, and .dict files for the reference
• Any other relevant files such as recalibration plots

A snippet file is a slice of the original BAM file which contains the problematic region and is sufficient to reproduce the error. We need it in order to reproduce the problem on our end, which is the first necessary step to finding and fixing the bug. We ask you to provide this as a snippet rather than the full file so that you don't have to upload (and we don't have to process) huge giga-scale files.

### Here's how you create a snippet file:

• Look at the error message and see if it cites a specific position where the error occurred
• If not, identify what region caused the problem by running with the -L argument and progressively narrowing down the interval
• Once you have the region, use PrintReads with -L to write the problematic region (with 500 bp padding on either side) to a new file -- this is your snippet file.
• Test your command line on this snippet file to make sure you can still reproduce the error on it.
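The padding arithmetic in that last step can be sketched as follows (a minimal illustration; `padded_interval` is a hypothetical helper, not a GATK tool):

```python
def padded_interval(chrom, start, end, padding=500):
    """Build a GATK-style -L interval string covering the problematic
    region plus `padding` bp on either side (clamped at position 1)."""
    return f"{chrom}:{max(1, start - padding)}-{end + padding}"

# A single failing position at 20:1372268 becomes a ~1000 bp window:
print(padded_interval("20", 1372268, 1372268))  # 20:1371768-1372768
```

You would then pass a string like that to PrintReads with -L and rerun your original command on the resulting snippet.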

### And finally, here's how you send us the files:

• Put all those files into a .zip or .tar.gz archive
• Upload them onto our FTP server as explained here (make sure you use the proper UPLOAD credentials)
• Post in the original discussion thread that you have done this
• Be sure to tell us the name of your archive file!
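The archiving step could be scripted like this (a sketch using Python's standard tarfile module; the function name and file names are illustrative):

```python
import tarfile
from pathlib import Path

def make_bug_report_archive(archive_path, files):
    """Bundle the command-line file, log, BAM snippet, index, etc.
    into a single .tar.gz for upload."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in files:
            # Flatten directory structure: store each file by its base name
            tar.add(f, arcname=Path(f).name)
    return archive_path
```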

### We will get back to you (hopefully with a bug fix!) as soon as we can.

Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.

Because -ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.

Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.

### Genotype refinement workflow with all the trimmings

As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.

### Non-diploids, rejoice!

This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.
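The "work out the ploidy from the composition of the GT field" idea can be illustrated in a few lines (a sketch; `ploidy_from_gt` is a hypothetical name, not GATK code):

```python
import re

def ploidy_from_gt(gt):
    """Infer a sample's ploidy at a site from its GT value: it is simply
    the number of allele entries, split on '/' (unphased) or '|' (phased)."""
    return len(re.split(r"[/|]", gt))

print(ploidy_from_gt("1"))        # haploid: 1
print(ploidy_from_gt("0/1"))      # diploid: 2
print(ploidy_from_gt("0|1"))      # phased diploid: 2
print(ploidy_from_gt("0/1/1/0"))  # tetraploid: 4
```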

### HaplotypeCaller gets physical

You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don’t, go read this now. Go on, we’ll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That’s great, but until now, those variants were unphased, meaning the HC didn’t give you any information about whether any two variants’ alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, you’d want to know whether you had this:

[figure: the two variant alleles on the same physical haplotype]

or this:

[figure: the two variant alleles on different haplotypes]

But HC wouldn’t tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).

In a nutshell, phased records will look like this:

    1   1372243  .  T  <NON_REF>    .  .  END=1372267  <snip>  <snip>
    1   1372268  .  G  A,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB  0/1:30,40,0:70:99:0|1:1372268_G_A:<snip>
    1   1372269  .  G  T,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB  0/1:30,41,0:71:99:0|1:1372268_G_A:<snip>
    1   1372270  .  C  <NON_REF>    .  .  END=1372299  <snip>  <snip>


You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).
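To make the encoding concrete, here is a sketch of how records sharing a PID could be grouped into phase sets (a hypothetical helper, assuming the FORMAT layout shown above):

```python
from collections import defaultdict

def phase_groups(records):
    """Group variant records by their PID sample annotation.
    Each record is (pos, format_keys, sample_field), as in a GVCF line."""
    groups = defaultdict(list)
    for pos, fmt, sample in records:
        # Pair up FORMAT keys with the sample's colon-separated values
        fields = dict(zip(fmt.split(":"), sample.split(":")))
        if "PID" in fields:
            groups[fields["PID"]].append(pos)
    return dict(groups)

records = [
    (1372268, "GT:AD:DP:GQ:PGT:PID", "0/1:30,40,0:70:99:0|1:1372268_G_A"),
    (1372269, "GT:AD:DP:GQ:PGT:PID", "0/1:30,41,0:71:99:0|1:1372268_G_A"),
]
print(phase_groups(records))  # {'1372268_G_A': [1372268, 1372269]}
```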

The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.

Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).

We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the --recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.

### Variant annotations finally make sense

Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.

Better late than never (right?), here are the version highlights for GATK 3.2. Overall, this release is essentially a collection of bug fixes and incremental improvements that we wanted to push out to not keep folks waiting while we're working on the next big features. Most of the bug fixes are related to the HaplotypeCaller and its "reference confidence model" mode (which you may know as -ERC GVCF). But there are also a few noteworthy improvements/changes in other tools which I'll go over below.

### Working out the kinks in the "reference confidence model" workflow

The "reference confidence model" workflow, which I hope you have heard of by now, is that awesome new workflow we released in March 2014, which was the core feature of the GATK 3.0 version. It solves the N+1 problem and allows you to perform joint variant analysis on ridiculously large cohorts without having to enslave the entire human race and turn people into batteries to power a planet-sized computing cluster. More on that later (omg we're writing a paper on it, finally!).

You can read the full list of improvements we've made to the tools involved in the workflow (mainly HaplotypeCaller and GenotypeGVCFs) in Eric's (unusually detailed) Release Notes for this version. The ones you are most likely to care about are that the "missing PLs" bug is fixed, GenotypeGVCFs now accepts arguments that allow it to emulate the HC's genotyping capabilities more closely (such as --includeNonVariantSites), the AB annotation is fully functional, reference DPs are no longer dropped, and CatVariants now accepts lists of VCFs as input. OK, so that last one is not really specific to the reference model pipeline, but that's where it really comes in handy (imagine generating a command line with thousands of VCF filenames -- it's not pretty).
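As a trivial example of why list input helps, you can generate the list file programmatically instead of building a giant command line (a sketch; the function and file names are illustrative):

```python
from pathlib import Path

def write_vcf_list(vcf_paths, list_file):
    """Write one VCF path per line to a list file, so a tool can be
    pointed at a single file instead of thousands of per-VCF arguments."""
    Path(list_file).write_text("\n".join(str(p) for p in vcf_paths) + "\n")
    return list_file
```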

### HaplotypeCaller now emits post-realignment coverage metrics

The coverage metrics (DP and AD) reported by HaplotypeCaller are now those calculated after the HC's reassembly step, based on the reads having been realigned to the most likely haplotypes. So the metrics you see in the variant record should match what you see if you use the -bamout option and visualize the reassembled ActiveRegion in a genome browser such as IGV. Note that if any of this is not making sense to you, say so in the comments and we'll point you to the new HaplotypeCaller documentation! Or, you know, look for it in the Guide.

### R you up to date on your libraries?

We updated the plotting scripts used by BQSR and VQSR to use the latest version of ggplot2, to get rid of some deprecated function issues. If your Rscripts are suddenly failing, you'll need to update your R libraries.

### A sincere apology to GATK-based tool developers

We're sorry for making you jump through all these hoops recently. As if the switch to Maven wasn't enough, we have now completed a massive reorganization/renaming of the codebase that will probably cause you some headaches when you port your tools to the newest version. But we promise this is the last big wave, and ultimately this will make your life easier once we get the GATK core framework to be a proper maven artifact.

In a nutshell, the base name of the codebase has changed from sting to gatk (which hopefully makes more sense), and the most common effect is that sting.gatk classpath segments are now gatk.tools. This, by the way, is why we had a bunch of broken documentation links; most of these have been fixed (yay symlinks) but there may be a few broken URLs remaining. If you see something, say something, and we'll fix it.

We discovered today that we made an error in the documentation article that describes the RNAseq Best Practices workflow. The error is not critical but is likely to cause an increased rate of False Positive calls in your dataset.

The error was made in the description of the "Split & Trim" pre-processing step. We originally wrote that you need to reassign mapping qualities to 60 using the ReassignMappingQuality read filter. However, this causes all MAPQs in the file to be reassigned to 60, whereas what you want to do is reassign MAPQs only for good alignments which STAR identifies with MAPQ 255. This is done with a different read filter, called ReassignOneMappingQuality. The correct command is therefore:

    java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
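The per-read difference between the two filters can be sketched like this (illustrative Python, not GATK code; the function names are made up):

```python
def reassign_one_mapping_quality(mapq, match_from=255, match_to=60):
    """ReassignOneMappingQuality-style behavior: only reads whose MAPQ
    equals `match_from` (STAR's 'unique alignment' 255) are rewritten
    to `match_to`; every other read keeps its original value."""
    return match_to if mapq == match_from else mapq

def reassign_mapping_quality(mapq, target=60):
    """ReassignMappingQuality-style behavior (the erroneous filter):
    every read gets the target MAPQ regardless of its original value."""
    return target

print([reassign_one_mapping_quality(q) for q in [255, 3, 60]])  # [60, 3, 60]
print([reassign_mapping_quality(q) for q in [255, 3, 60]])      # [60, 60, 60]
```

With the wrong filter, a poorly mapped read (e.g. MAPQ 3) ends up looking like a confidently mapped one, which is exactly the source of the extra false positives described below.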


In our hands we see a bump in the rate of FP calls from 4% to 8% when the wrong filter is used. We don't see any significant amount of false negatives (lost true positives) with the bad command, although we do see a few more true positives show up in the results of the bad command. So basically the effect is to excessively increase sensitivity, at the expense of specificity, because poorly mapped reads are taken into account with a "good" mapping quality, where they would normally be discarded.

This effect will be stronger in datasets with lower overall quality, so your results may vary. Let us know if you observe any really dramatic effects, but we don't expect that to happen.

To be clear, we do recommend re-processing your data if you can, but if that is not an option, keep in mind how this affects the rate of false positive discovery in your data.

We apologize for this error (which has now been corrected in the documentation) and for the inconvenience it may cause you.

This is not exactly new (it was fixed in GATK 3.0) but it's come to our attention that many people are unaware of this bug, so we want to spread the word since it might have some important impacts on people's results.

Affected versions: 2.x versions up to 2.8 (not sure when it started)

Affected tool: SelectVariants

Trigger conditions: Extracting a subset of samples with SelectVariants while using multi-threading (-nt)

Effects: Genotype-level fields (such as AD) swapped among samples

This bug no longer affects any tools in versions 3.0 and above, but callsets generated with earlier versions may need to be checked for consistency of genotype-level annotations. Our sincere apologies if you have been affected by this bug, and our thanks to the users who reported experiencing this issue.

As reported here, a bug was found in VariantsToBinaryPed. Briefly, VariantsToBinaryPed expected the fam file to describe the samples in the same order as the input VCF file: if they were not in the same order, it did not correctly map sample IDs to the genotypes in the output binary PED.

We expect that in most use cases, the order would be the same (because PLINK uses lexicographic order, as does the GATK), so the bug would not impact results. However, as the user report demonstrates, in cases where order was different, the bug would seriously impact results.

We therefore recommend that anyone who has used VariantsToBinaryPED check their results for any inconsistencies in the kinship coefficients. Our apologies for the inconvenience to anyone who is affected by this bug, and big thanks again to user TimHughes for reporting the bug.

Finally, we have fixed the bug in GATK and released the fixed version under version number 2.7-4.

As reported here:

If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.

Thank you for your patience while we work to fix this issue.

### Latest update: we found that the three tools (PrintReads, HaplotypeCaller and UnifiedGenotyper) had different issues that only produced the same symptom (the ArrayIndexOutOfBoundsException error). The underlying issues have all been fixed in version 2.5. If you encounter an ArrayIndexOutOfBoundsException in later versions, it is almost certainly caused by a different problem, so please open a new thread to report the issue.

We have identified a major bug in ReduceReads -- GATK versions 2.0 and 2.1. The effect of the bug is that variant regions with more than 100 reads and fewer than 250 reads get downsampled to 0 reads.

This has now been fixed in the most recent release.

To check if you are using a buggy version, run the following:

    samtools view -H BAM

This will produce output including the following line:

    @PG ID:GATK ReduceReads VN:XXX

If XXX is 2.0 or 2.1, any results obtained with your current version are suspect, and you will need to upgrade to the most recent version and rerun your processing. Our most sincere apologies for the inconvenience.

### We have discovered a bug that seriously impacts the results of BQSR/BaseRecalibrator when it is run with the scatter-gather functionality of Queue.

To be clear, the bug does NOT affect BaseRecalibrator runs performed "normally", i.e. WITHOUT Queue's scatter-gather.

### Consequences/Solution:

Please be aware that if you have been using BaseRecalibrator scatter-gathered with Queue (GATK versions 2.0 and 2.1), your results may be wrong. You will need to redo the base recalibration of your data WITHOUT scatter-gathering. This issue will be fixed in the next release (version 2.2). We apologize for any inconvenience this may cause you!

Hello, I'm trying to figure out what's wrong in my script for this variant filtration argument. Here is what I ran:

    java -jar tools/GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \
        -V vcftoolsextractGATK_snps.vcf \
        -o GATK_darwinfinch_filter.vcf \
        --filterExpression QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2 \
        --filterName “darwinfinchfilter”

I got the following error message: "2: No such file or directory." However, I know my file path was set correctly, because if I remove the last two arguments and only run:

    java -jar tools/GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \
        -V vcftoolsextractGATK_snps.vcf \
        -o GATK_darwinfinch_filter.vcf

it runs fine.
I also tried Geraldine's example filterExpression:

    java -jar tools/GenomeAnalysisTK.jar \
        -T VariantFiltration \
        -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \
        -V vcftoolsextractGATK_snps.vcf \
        -o GATK_geraldinefilter.vcf \
        --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \
        --filterName “Geraldine_snp_filter"

and I got the following error: Unmatched ".

I am trying to reproduce running the GATK pipeline (Best Practices) on Power8, but during the RealignerTargetCreator phase I run into the following problem:

    INFO  06:05:39,926 HelpFormatter - --------------------------------------------------------------------------------
    INFO  06:05:39,928 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
    INFO  06:05:39,929 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO  06:05:39,929 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO  06:05:39,933 HelpFormatter - Program Args: -T RealignerTargetCreator -R genome/hg19.fasta -I results/out_sorted_dup.bam -o results/out_target_intervals.interval_list -known vcf/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known vcf/1000G_phase1.indels.hg19.sites.vcf
    INFO  06:05:39,941 HelpFormatter - Executing as admin@power8-2xl on Linux 3.14.17-100.fc19.ppc64p7 ppc64; OpenJDK 64-Bit Server VM 1.7.0_65-mockbuild_2014_07_26_03_56-b00.
    INFO  06:05:39,942 HelpFormatter - Date/Time: 2015/03/09 06:05:39
    INFO  06:05:39,942 HelpFormatter - --------------------------------------------------------------------------------
    INFO  06:05:39,942 HelpFormatter - --------------------------------------------------------------------------------
    INFO  06:05:40,016 GenomeAnalysisEngine - Strictness is SILENT
    INFO  06:05:40,150 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO  06:05:40,159 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO  06:05:40,182 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
    INFO  06:05:40,461 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
    INFO  06:05:41,576 GenomeAnalysisEngine - Done preparing for traversal
    INFO  06:05:41,577 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO  06:05:41,578 ProgressMeter -            | processed |    time |  per 1M |           |   total | remaining
    INFO  06:05:41,578 ProgressMeter -   Location |     sites | elapsed |   sites | completed | runtime |   runtime
    INFO  06:05:43,206 GATKRunReport - Uploaded run statistics report to AWS S3
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR stack trace
    java.lang.ArrayIndexOutOfBoundsException: -94
        at org.broadinstitute.gatk.utils.BaseUtils.convertIUPACtoN(BaseUtils.java:216)
        at org.broadinstitute.gatk.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:288)
        at org.broadinstitute.gatk.engine.datasources.providers.LocusReferenceView.initializeReferenceSequence(LocusReferenceView.java:150)
        at org.broadinstitute.gatk.engine.datasources.providers.LocusReferenceView.<init>(LocusReferenceView.java:126)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:90)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
    ##### ERROR
    ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: -94

This is the command I used:

    java -Xmx32g -jar ~/workspace/GATK/tools/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R genome/hg19.fasta -I results/out_sorted_dup.bam -o results/out_target_intervals.interval_list -known vcf/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known vcf/1000G_phase1.indels.hg19.sites.vcf

I tried to look for a similar problem on the forum, but could not find anything. Could you advise how to resolve this situation? The error message suggests this may be a bug. Thanks in advance!

Validating the vcf file which comes out of VariantAnnotator results in an error: the GC content is a float, not an integer (as stated in the INFO column). At the moment we are doing a search and replace ourselves, but it would be nice if this could be fixed in a next version of GATK.
I was pulling my hair out over this one. I was applying a hard filter to a genotyped gVCF, using JEXL to access variant context attributes to decide which filter setting to apply. The filter was:

    vc.getGenotype("%sample%").isHomRef() ? vc.getGenotype("%sample%").getAD().size() == 1 ? DP < 10 : ( DP - MQ0 ) < 10 or ( MQ0 - (1.0 * DP) ) >= 0.1 or MQRankSum <= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81 : false

In pseudocode it says:

    if ( isHomRef )
        then if ( getAD().size() == 1 )
            then DP < 10
            else ( DP - MQ0 ) < 10 or ( MQ0 - (1.0 * DP) ) >= 0.1 or MQRankSum >= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81
        else ignore record

The idea being that for records where not all reads contained the reference allele, we would filter out those positions where there was evidence to suggest that the reads supporting an alternate allele were of a significantly better quality. However, running this filter I keep getting the warning (snipped for clarity):

    WARN [SNIP]... MQRankSum <= 3.2905 [SNIP]... : false;' undefined variable MQRankSum

So I thought the filter was failing. However, just as a test, I changed the direction of MQRankSum from >= 3.2905 to <= 3.2905 (a bit nonsensical; it should basically apply the filter to almost all HomRef positions that had any reads supporting an alternate allele). I still get the warning, but I found the filter was applied to variant records as it should be. E.g. the following went from PASS to BAD_HOMREF:

    Supercontig_1.1  87  .  C  A  .  BAD_HOMREF  BaseQRankSum=2.79;DP=40;MQ=39.95;MQ0=0;MQRankSum=-2.710e+00;ReadPosRankSum=0.819;VariantType=SNP  GT:AB:AD:DP  0/0:0.870:34,5:39

So the filter is being correctly applied, but I am not sure why all the warnings are being generated. Is this a bug? Have I done something wrong?

Hello, I have a problem with HaplotypeCaller. I saw that this error has been reported as linked to the -nct option, but I did not use it.
Strangely enough, I think my command worked some time ago (before a server restart?). All the reference files I used are from the GATK bundle. Sletort

Hi, I'm using 3.3-0-g37228af. This command gets around a feature that looks like a bug to me:

    gatk -T HaplotypeCaller -R ref.fa -I in.bam -o /dev/stdout 2> log.txt | grep -v -e PairHMM -e "JNI call" | bgzip -s - > out.vcf.gz

As you see, one has to remove three lines of the stdout stream written by PairHMM. Like all other logging information, they should go to stderr. The three lines look like this:

    Using un-vectorized C++ implementation of PairHMM
    Time spent in setup for JNI call : 3.3656100000000003E-4
    Total compute time in PairHMM computeLikelihoods() : 0.008854789

Regards, Ari

Hi, I'm using SelectVariants to extract variants with a PASS filter, and those that are biallelic SNPs, with the command below using GATK 3.2-2:

    java -Xmx{MEM} -jar ${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} \
        -T SelectVariants \
        --variant "1_"${VCF_FILE}"excCpGRepeat.vcf" \
        -ef \
        -L $CHROM \
        --restrictAllelesTo BIALLELIC \
        -selectType SNP \
        -o "2"${VCF_FILE}"_PASSBiSNP_excCpGRepeat.vcf"

I split my vcf using SnpSift, so I run it on each chromosome separately. For 12 of the 38 chromosomes I get the error - the others work fine. I re-extracted one of the chromosomes that failed, to be sure it wasn't a SnpSift issue, and it still fails. Any ideas on what the problem might be?

    INFO  15:26:17,318 HelpFormatter - --------------------------------------------------------------------------------
    INFO  15:26:17,322 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03
    INFO  15:26:17,322 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO  15:26:17,323 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO  15:26:17,328 HelpFormatter - Program Args: -R /u/home/c/projectdata/c/canids/reference/canfam31/canfam31_chr1NOTchr01/canfam31_chr1NOTchr01.fa -T SelectVariants --variant 1.chr35_excCpGRepeat.vcf -ef -L chr35 --restrictAllelesTo BIALLELIC -selectType SNP -o 2_chr35_PASSBiSNP_excCpGRepeat.vcf
    INFO  15:26:17,338 HelpFormatter - Executing as @n263 on Linux 2.6.32-431.20.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_55-mockbuild_2014_04_16_12_11-b00.
    INFO  15:26:17,338 HelpFormatter - Date/Time: 2014/11/20 15:26:17
    INFO  15:26:17,339 HelpFormatter - --------------------------------------------------------------------------------
    INFO  15:26:17,339 HelpFormatter - --------------------------------------------------------------------------------
    INFO  15:26:17,469 GenomeAnalysisEngine - Strictness is SILENT
    INFO  15:26:17,715 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO  15:26:19,240 GATKRunReport - Uploaded run statistics report to AWS S3

    ##### ERROR ------------------------------------------------------------------------------------------

I hope it can be easily fixed, since I desperately need my pipeline working for my current analysis ;)

Hi, I get an ArrayIndexOutOfBoundsException when using this tool (see attached log). Is there a possible workaround or a new version I could use?

Thanks, Johannes

In some cases BaseCoverageDistribution takes ages to finish. I ran 4 BaseCoverageDistribution commands in parallel, and the smallest sample (by far) still is not finished while the others are. Seems like a bug to me? I will attach a screenshot.

Gidday,

Not a question, more of a bug report - I'm using the new v3.1 best practices pipeline, so I'd successfully produced my per-sample (n=23 in total) gVCFs with no worries.

Then I used GenotypeGVCFs to combine them as follows, including a dbSNP ROD (the Sanger Mouse Genome Project's SNP calls for 17 samples, in vcf format... not gvcf!):

    java -Djava.io.tmpdir=/tmp -Xmx28g -jar ./tmp/GenomeAnalysisTK_3.1-1/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 8 -R ./mm10.fa --dbsnp ./tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.snps.rsIDdbSNPv137.vcf.ordinalsorted.vcf -V GenotypeGVCFs.run1.sample.list -o ./CombinedGenotyping.run1.vcf -A InbreedingCoeff -A FisherStrand -A QualByDepth -A ChromosomeCounts

So I was very surprised to see that my output CombinedGenotyping vcf has 40 samples in it, not 23 - and of course, 23 + 17 = 40. Checking the VCF header itself confirms that the genotype calls from the 17 Sanger strains have been included in the output vcf, not just the rsIDs as intended(?). I'm guessing that this combining of .g.vcfs and the extra ROD isn't the expected behaviour...!

Cheers!

Using GATK 3.1-1, I seem to be unable to get the "AB" (AlleleBalance) annotation for the calls using the HaplotypeCaller -> GenotypeGVCFs pipeline, and I'm not sure how to get it. Our current pipeline (GATK 2.7-4, UnifiedGenotyper) requires this field to perform filtering, so this annotation is essential for us to upgrade to GATK 3.1.

My current pipeline of commands is as follows:

    $ GenomeAnalysisTK-3.1-1 -T HaplotypeCaller -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -I -L targets.GRCh37.bed -stand_emit_conf 10 -mbq 20 --downsample_to_coverage 300 -ERC GVCF -pairHMM VECTOR_LOGLESS_CACHING -variant_index_type LINEAR -variant_index_parameter 128000 -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -A AlleleBalance -A QualByDepth -o sample_gvcf.vcf.gz

    $ GenomeAnalysisTK-3.1-1 -T GenotypeGVCFs -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -V -L targets.GRCh37.bed -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -o joint_vcf.vcf.gz

Above, note that if "-A AlleleBalance" is given to GenotypeGVCFs, GATK crashes with a NullPointerException (AlleleBalance.java, line 66).

The command above is heavily adapted from the current pipeline; do you know what I might be doing wrong with the new and improved HaplotypeCaller?

Thanks so much for your help, and if you need any further information, please let me know.

I recently ran the HaplotypeCaller using both the new --pairHMM options and -nct 2, and I encountered a crash in the program (8 of 96 runs crashed). I have rerun the analysis without the "-nct 2" option, and it appears to work as expected, so I'm not in a hurry. I am currently using GATK 3.1-1; below is the specific command I used as well as the stack trace that resulted.

Command: GenomeAnalysisTK -T HaplotypeCaller -R human_g1k_v37_decoy.fasta -I HG00593@123910712.bam --dbsnp dbsnp_137.b37.vcf.gz -L targets.GRCh37.bed -A QualByDepth -stand_emit_conf 10 -mbq 20 --downsample_to_coverage 300 -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -A AlleleBalance -o HG00593.vcf.gz -sample_rename_mapping_file rename_file -nct 2 -ERC GVCF -pairHMM VECTOR_LOGLESS_CACHING -variant_index_type LINEAR -variant_index_parameter 128000

Stack trace:

##### ERROR stack trace

java.util.ConcurrentModificationException at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(LinkedHashMap.java:394) at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:413) at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:412) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.addMiscellaneousAllele(GenotypingEngine.java:257) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoods(GenotypingEngine.java:227) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:880) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722)

If you require anything else from me, please let me know.

I was softclipping some 84-bp-long single-end reads using the following command:

java -Xmx2g -jar /GATK_Path/GenomeAnalysisTK.jar -T ClipReads -I imput_file.bam -o output_file.bam -R $ALIGNREF --cyclesToTrim "1-4,80-84" --clipRepresentation SOFTCLIP_BASES --fix_misencoded_quality_scores

There was no error while running, but when I looked at the file afterwards, every read had a CIGAR string of 84S. I'm fairly sure this is a bug. The reads were aligned to the human mitochondrial genome.

I am using GATK v2.3-9-ge5ebf34 on Mac OS 10.7.5
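For reference, here is what we would expect --cyclesToTrim to produce for an 84 bp read (a sketch of our reading of the option, not the GATK source): clipping cycles 1-4 and 80-84 should yield 4S75M5S, not 84S.

```python
def expected_cigar(read_len, windows):
    """CIGAR expected from soft-clipping the given cycle windows.

    windows: list of (start, end) 1-based inclusive cycle ranges to clip.
    Assumes the unclipped middle is a simple match block.
    """
    clipped = set()
    for start, end in windows:
        clipped.update(range(start, end + 1))
    lead = 0
    while lead + 1 in clipped:          # leading soft-clip length
        lead += 1
    tail = 0
    while read_len - tail in clipped:   # trailing soft-clip length
        tail += 1
    mid = read_len - lead - tail
    return "%dS%dM%dS" % (lead, mid, tail)
```

With read_len=84 and windows [(1, 4), (80, 84)], this gives 4S75M5S.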

This crops up when running with "-mode INDEL". I'm not sure why there is no data. (See the attached log file with stack trace.)

All input files are non-empty (except the .R file). A similar execution using "-mode SNP" completes with no problems. Since I'm simply looking to get the scripting and flags correct, I've used a public data set. Could it be that I was unlucky and chose something that has no indels relative to the reference, and that is what causes the error? Could there be a more graceful way to terminate?
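One way to test the no-indels hypothesis is a rough count of indel-like records (any ALT allele whose length differs from REF) in the VQSR input. An illustrative sketch, ignoring symbolic alleles and other edge cases:

```python
def count_indel_like(vcf_lines):
    """Rough count of VCF records where any ALT length differs from REF."""
    n = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue                      # skip headers and blanks
        cols = line.rstrip("\n").split("\t")
        ref, alts = cols[3], cols[4].split(",")
        if any(len(a) != len(ref) for a in alts):
            n += 1
    return n
```

If this returns 0 for the input call set, the INDEL-mode failure would at least be consistent with an empty training set.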

Hi, I'd like to report a weird result from HaplotypeCaller.

We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B.

However, HaplotypeCaller called two heterozygous variants: one deletion (which failed the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (which passed all the quality filters). Neither of these variants was called by UnifiedGenotyper (and the variants UnifiedGenotyper did call in the HNF1B region were all homozygous, as we expected).

There are three things I want to highlight:

1. The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion and very low for the substitution. If there really were a strand bias, it should affect both variants given how close together they are.
2. The deletion's ReadPosRankSumTest score failed the filter because the deletion is always called near the ends of reads. The ReadPosRankSumTest score for the substitution is missing.
3. The substitution was called 0/1, but looking at the AD, there are 206 reads supporting the ref allele and 0 reads supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype.

Then I looked into the bam files. It turns out that all the alternative alleles of the substitution came from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype.

Therefore I'm really confused why the substitution has a 0/1 genotype called by HaplotypeCaller and why it passed the filter.
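For anyone screening for this class of call, here is a quick way to flag heterozygous genotypes whose AD shows zero alt-supporting reads (a hedged sketch, not GATK code; note that AD counts only informative reads, so a flagged call is suspicious rather than provably wrong):

```python
def flag_suspect_hets(fmt, sample):
    """True for a 0/1 call whose AD shows no reads supporting the alt allele."""
    vals = dict(zip(fmt.split(":"), sample.split(":")))
    gt, ad = vals.get("GT", "./."), vals.get("AD")
    if gt not in ("0/1", "0|1") or not ad or ad == ".":
        return False
    ref_n, alt_n = (int(x) for x in ad.split(",")[:2])
    return alt_n == 0
```

The case described above (GT=0/1, AD=206,0) would be flagged.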

Many Thanks

Betty

Hi, I've just been having a go with the new HaplotypeCaller method and I've been getting some odd or malformed lines in the VCF file, for example:

For example, the FORMAT column declares that each sample record should have 4 fields, but instead we have samples with only 2 fields.

Two examples are shown here:

See Block 1
GT:DP:GQ:PL 1/1:.:3:32,3,0 ./.:0


Or where the format block declares 5 fields and we get 3 instead:

GT:AD:DP:GQ:PL  0/0:1,1,0,0,0,0,0,0,0:2:3:0,3,24,3,24,24        ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,40,3,41,43        ./.:.:3


Full blocks, Block 1

chr1    3489714 .       G       A       9667.23 .       AC=154;AF=1.00;AN=154;DP=0;FS=0.000;InbreedingCoeff=-0.0391;MLEAC=154;MLEAF=1.00;MQ=0.00;MQ0=0;EFF=INTRAGENIC(MODIFIER|||||TIAM1||CODING|||1),INTRON(MODIFIER||||649|TIAM1|protein_coding|CODING|ENSBTAT00000064124|1|1);CSQ=A|ENSBTAG00000017839|ENSBTAT00000064124|Transcript|intron_variant||||||||1/5||1|TIAM1|HGNC|      GT:DP:GQ:PL     1/1:.:3:32,3,0  1/1:.:18:207,18,0     1/1:.:6:78,6,0  1/1:.:12:140,12,0       1/1:.:9:101,9,0 ./.:0   1/1:.:9:97,9,0  1/1:.:9:96,9,0  1/1:.:21:244,21,0       1/1:.:12:138,12,0       1/1:.:12:124,12,0       1/1:.:9:105,9,0 1/1:.:15:164,15,0     1/1:.:15:153,15,0       1/1:.:27:265,27,0       1/1:.:12:125,12,0       1/1:.:18:214,18,0       ./.:0   1/1:.:9:108,9,0 1/1:.:15:169,15,0       ./.:0   1/1:.:6:76,6,0  1/1:.:6:66,6,0  1/1:.:12:140,12,0     1/1:.:3:28,3,0  1/1:.:3:10,3,0  1/1:.:12:128,12,0       ./.:0   1/1:.:18:181,18,0       1/1:.:9:98,9,0  1/1:.:15:161,15,0       1/1:.:15:185,15,0       1/1:.:12:133,12,0       1/1:.:15:175,15,0     1/1:.:18:178,18,0       1/1:.:12:133,12,0       1/1:.:9:105,9,0 1/1:.:12:141,12,0       1/1:.:15:166,15,0       1/1:.:9:108,9,0 1/1:.:15:160,15,0       1/1:.:27:267,27,0       1/1:.:21:218,21,0    1/1:.:9:107,9,0  1/1:.:3:28,3,0  1/1:.:9:80,9,0  1/1:.:6:46,6,0  ./.:0   1/1:.:6:61,6,0  1/1:.:21:241,21,0       1/1:.:15:161,15,0       1/1:.:6:82,6,0  1/1:.:12:143,12,0       1/1:.:9:109,9,0 1/1:.:21:249,21,0     1/1:.:6:40,6,0  1/1:.:9:94,9,0  1/1:.:15:185,15,0       1/1:.:12:129,12,0       1/1:.:12:132,12,0       ./.:0   1/1:.:21:207,21,0       1/1:.:12:136,12,0       1/1:.:12:109,12,0       1/1:.:18:192,18,0     ./.:0   1/1:.:9:68,9,0  1/1:.:12:138,12,0       1/1:.:6:73,6,0  1/1:.:9:105,9,0 1/1:.:9:98,9,0  1/1:.:6:65,6,0  ./.:0   1/1:.:6:65,6,0  ./.:0   1/1:.:6:58,6,0  1/1:.:12:131,12,0       ./.:0   ./.:01/1:.:3:38,3,0   1/1:.:3:37,3,0  1/1:.:21:227,21,0       1/1:.:12:131,12,0       1/1:.:6:66,6,0  1/1:.:9:100,9,0 1/1:.:21:209,21,0       
1/1:.:6:63,6,0  1/1:.:6:69,6,0


Block 2

chr1    55248   .       ACCC    A,CCCC  179.69  .       AC=13,6;AF=0.100,0.046;AN=130;BaseQRankSum=0.736;ClippingRankSum=0.736;DP=347;FS=0.000;InbreedingCoeff=0.2231;MLEAC=10,4;MLEAF=0.077,0.031;MQ=53.55;MQ0=0;MQRankSum=0.736;QD=4.99;ReadPosRankSum=0.736;EFF=INTERGENIC(MODIFIER||||||||||1),INTERGENIC(MODIFIER||||||||||2);CSQ=-||||intergenic_variant|||||||||||||     GT:AD:DP:GQ:PL  0/0:1,1,0,0,0,0,0,0,0:2:3:0,3,24,3,24,24        ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,40,3,41,43        0/0:1,0,0,0,0,0,0,0,0:1:2:0,2,44,3,45,46        0/0:.:5:0:0,0,103,0,103,103     0/0:3,0,1,0,0,0,0,0,0:4:9:0,9,81,9,81,81        0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2   ./.:.:1 0/0:.:3:0:0,0,41,0,41,41        0/0:1,0,0,0,0,0,0,0,0:1:9:0,9,73,9,73,73        0/1:1,0,0,1,0,0,0,0,0:2:22:28,0,73,28,22,46     0/0:1,0,0,0,0,0,0,0,0:1:4:0,4,25,4,25,25        0/0:2,0,0,0,0,0,0,0,0:2:21:0,21,273,21,273,273  0/1:5,0,0,1,0,0,0,0,0:6:11:11,0,235,26,158,175  ./.:0,0,0,0,1,0,0,0,0:1 0/0:.:4:0:0,0,46,0,46,46        ./.:.:3 ./.:.:2 0/1:1,0,1,1,0,0,0,0,0:3:21:28,0,44,31,21,52     0/0:.:4:0:0,0,77,0,77,77        0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2   0/0:.:2:6:0,6,51,6,51,51        0/0:.:6:2:0,2,151,2,151,151     0/0:2,0,1,0,0,0,0,0,0:3:7:0,7,74,7,74,74        ./.:.:0 0/0:2,0,0,0,0,0,0,0,0:2:5:0,7,59,5,37,35        1/2:0,0,0,1,0,0,0,0,0:1:1:27,1,40,26,0,25       0/0:2,0,0,0,0,0,0,0,0:2:9:0,9,83,9,83,83        ./.:.:0 2/2:0,0,0,0,0,3,0,0,0:3:9:59,59,59,9,9,0        ./.:.:16        0/0:2,0,0,0,0,0,0,0,0:2:7:0,7,59,7,59,59        0/0:2,0,0,0,0,0,0,0,0:2:9:0,9,83,9,83,83        0/0:2,0,0,0,0,0,0,0,0:2:11:0,11,68,11,68,68     0/2:6,0,0,0,0,2,0,0,0:8:18:18,39,384,0,346,340  ./.:.:1 0/1:3,0,0,1,0,0,0,0,0:4:25:25,0,96,34,100,134   0/0:2,0,0,0,0,0,0,0,0:2:12:0,12,105,12,105,105  0/1:1,0,0,1,0,0,0,0,0:2:17:17,0,94,21,26,44     0/0:.:2:6:0,6,64,6,64,64        0/0:1,0,0,0,0,0,0,0,0:1:2:0,2,24,3,25,27        0/0:.:2:6:0,6,48,6,48,48        0/0:.:2:6:0,6,63,6,63,63        0/0:0,0,1,0,0,0,0,0,0:1:0:0,0,1,0,1,1   ./.:.:0 
0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,6,3,8,9   0/0:1,0,0,0,0,0,0,0,0:1:15:0,15,124,15,124,124  ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:4:0,4,19,4,19,19        0/1:2,0,0,1,0,0,0,0,0:3:28:28,0,66,34,70,104    0/0:.:4:0:0,0,18,0,18,18        0/2:2,0,0,0,0,4,0,0,0:6:57:68,74,143,0,69,57    ./.:.:0 0/0:4,0,0,0,0,0,0,0,1:5:15:0,15,109,15,109,109  0/0:.:10:0:0,0,182,0,182,182    0/0:.:1:3:0,3,31,3,31,31        0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2   0/0:2,0,0,0,0,0,0,0,0:2:5:0,5,76,6,78,79        0/0:1,1,0,0,0,0,0,0,0:2:4:0,4,28,4,28,28        ./.:.:1 1/1:0,0,0,2,0,0,0,0,0:2:6:67,6,0,67,6,67        ./.:.:1 0/0:.:2:0:0,0,14,0,14,14        ./.:.:0 ./.:.:1 0/0:.:5:1:0,1,120,1,120,120     0/0:.:4:0:0,0,8,0,8,8   0/0:.:6:0:0,0,94,0,94,94        0/1:1,0,0,2,0,0,0,0,0:3:1:27,0,1,29,6,35        0/0:.:2:0:0,0,3,0,3,3   ./.:.:6 ./.:.:0 ./.:.:0 0/2:3,0,0,0,0,2,0,0,0:5:27:46,27,89,0,62,81     ./.:.:0 0/0:.:1:0:0,0,6,0,6,6   0/0:.:9:0:0,0,86,0,86,86        ./.:.:1 ./.:.:0 1/1:.:.:0:1,1,0,1,0,0   0/0:0,0,0,0,0,0,0,0,0:0:3:0,3,26,3,26,26        0/1:4,0,0,2,0,0,0,0,0:6:49:49,0,93,61,100,160   0/0:.:3:0:0,0,14,0,14,14        0/0:.:8:0:0,0,60,0,60,60        0/0:2,1,0,0,0,0,0,0,0:3:6:0,6,37,6,37,37        ./.:.:2 0/0:.:9:0:0,0,81,0,81,81        0/0:0,0,0,0,0,0,0,0,0:0:1:0,1,2,1,2,2


Any idea what the issue is?
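Records like this can be screened for with a quick per-line check of sample field counts against the FORMAT column (an illustrative sketch, not GATK code). One caveat: the VCF spec does allow trailing fields to be dropped from a sample, so a shorter sample record is not automatically invalid; this just locates the lines in question.

```python
def check_sample_fields(vcf_line):
    """Return (expected field count, list of short samples) for one VCF line.

    Each short sample is reported as (sample_index, sample_text, field_count).
    """
    cols = vcf_line.rstrip("\n").split("\t")
    fmt_keys = cols[8].split(":")           # e.g. ["GT", "DP", "GQ", "PL"]
    short = []
    for i, sample in enumerate(cols[9:]):
        n = len(sample.split(":"))
        if n < len(fmt_keys):
            short.append((i, sample, n))
    return len(fmt_keys), short
```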

Hi, I wanted to place what I believe is a bug report.

Bug: If the haplotype caller assembles reads into identical haplotypes, then depending on arbitrary factors, it will emit different variant events for this haplotype, meaning that gVCFs from samples with the same genotype will appear to have different genotypes when their VCFs are viewed.

Longer Explanation: Consider the alignment shown below of a reference genome (top) and a reconstructed haplotype aligned to it (aligned as in either the first or the second case):

GACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCC
GACCCCGA--GGGCC---------------GGGCCCTCC
GACCCCGA---------GGGCC--------GGGCCCTCC


This alignment is tricky because the "GGGCC" can be positioned in one of two locations, as shown above, and both of these are equivalently optimal with an affine scoring matrix. Which alignment should be returned is thus questionable, but presumably the GATK should always give the same alignment.

However, the GATK does not, and because of this one sample will have a 2D5M15D mutation, while the other can have a 9D5M8D mutation listed, even if both have the exact same genotype. In particular, if the active region or length of the assembled regions change, which can easily happen, then even if every other base is matched, different variants will appear at this locus, based on how the SW alignment is done.

I have tried to track this bug down, and it seems due to how the SW aligner class selects between equally optimal alignments (I believe while constructing the backtrack matrix but have not verified).

Reproducible Example: The higher-level problem is that VCFs emitted from identical samples, but with slightly different reads/intervals, show different variants at this site despite having the same underlying genotypes and assembled haplotypes. Tracking showed this was due to the length of the assembled regions varying a bit. A simple example that shows the basic flaw in the relevant code is given below.

Does this seem like the type of thing that could be an easy fix? I can also put this code as a branch on gsa_unstable if it would make things easier. I am also assuming that the CombineGVCF step will not fix this error, but have not validated the behavior and for my purposes it would still be problematic.

Warm wishes, N

//Two examples of an assembled haplotype and reference genome,
//exactly the same save the window size
byte[] fullRef="GCGTCGCAGTCTTAAGGCCCCGCCTTTTCAGACAGCTTCCGCTGGGCCTGGGCCGCTGCGGGGCGGTCACGGCCCCTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGACCGGGCCGAGCCGGGGGAAGGGCTCCGGTGACT".getBytes();
byte[] fullHap = "GCGTCGCAGTCTTAAGGCCCCGCCTTTTCAGACAGCTTCCGCTGGGCCTGGGCCGCTGCGGGGCGGTCACGGCCCCTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGA--GGGCC---------------GGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGACCGGGCCGAGCCGGGGGAAGGGCTCCGGTGACT".replace("-","").getBytes();
byte[] longRef =                                                                            "CTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGA".getBytes();
byte[] longHap =                                                                           "CTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGA---------GGGCC--------GGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGA".replace("-","").getBytes();
//A simplified version of the getCigar routine in the HaplotypeCaller
final SmithWaterman alignment1 = new SWPairwiseAlignment( fullRef, fullHap, CigarUtils.NEW_SW_PARAMETERS );
final SmithWaterman alignment2 = new SWPairwiseAlignment( longRef, longHap, CigarUtils.NEW_SW_PARAMETERS );
//The cigars of alignment1 and alignment2 are not the same (beyond the window offset), but should be
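To make the equivalence concrete: the two CIGARs named in the post (8M 2D 5M 15D 9M versus 8M 9D 5M 8D 9M) spell out the identical haplotype against the same reference, so the difference is purely representational. A small Python sketch (not GATK code) demonstrating this:

```python
# Reference window from the toy alignment above.
ref = "GACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCC"

def apply_cigar(ref, cigar):
    """Spell out the haplotype described by a (length, op) CIGAR over ref."""
    out, i = [], 0
    for length, op in cigar:
        if op == "M":            # match/mismatch: copy reference bases
            out.append(ref[i:i + length])
            i += length
        elif op == "D":          # deletion: skip reference bases
            i += length
    return "".join(out)

# The two equally optimal alignments from the post.
cigar_a = [(8, "M"), (2, "D"), (5, "M"), (15, "D"), (9, "M")]
cigar_b = [(8, "M"), (9, "D"), (5, "M"), (8, "D"), (9, "M")]
```

Both CIGARs reconstruct the same haplotype sequence, which is exactly why two samples carrying the same allele can end up with different-looking variant records.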


I was working with FastaAlternateReferenceMaker today. I wanted to use it on two samples. The first sample worked fine. The second sample produced an error about a few REF values not conforming (e.g. :A) in the vcf. I wrote a script to correct the errors and remove the : from the vcf file. I tried to run it again, but received memory allocation errors. So I dealt with that, and am now allocating 6g for a 15G VCF and a 188M fasta reference. I got it running (I think) and then started getting the NegativeArraySizeException java error. I saw this had come up as a bug in earlier versions of GATK, but it seemed like it had been resolved... perhaps not? Anyhow, this 'quick little run' has taken up the better part of my day and I would be super grateful for some assistance troubleshooting this problem.

Command:

$ java -Xmx6g -jar ../../shays/GATK/GenomeAnalysisTK.jar -l ERROR -R ../up/chrom6.2.fa -T FastaAlternateReferenceMaker -o D-chr6.cons.fa --variant D-c6.vcf

Trace:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NegativeArraySizeException at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:101) at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:120) at org.broadinstitute.variant.vcf.VCFCodec.readHeader(VCFCodec.java:85) at org.broad.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:76) at org.broad.tribble.index.IndexFactory$FeatureIterator.readHeader(IndexFactory.java:366) at org.broad.tribble.index.IndexFactory$FeatureIterator.(IndexFactory.java:354) at org.broad.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:281) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:388) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:274) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:211) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:140) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:873) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:725) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:259) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.5-2-gf57256b):
##### ERROR
##### ERROR Please check the documentation guide to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------

I've run into the following bug while running GenotypeGVCFs:

##### ERROR MESSAGE: cannot merge genotypes from samples without PLs; sample <ID redacted> does not have likelihoods at position 1:1115551

The input file in question is a gVCF produced by merging a large number of smaller gVCFs using CombineGVCFs (all tasks were run using version 3.1). What's happening is that the position 1115551 doesn't exist in that particular sample:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT <Sample_ID>
1 1115550 . AC A,<NON_REF> 118.73 . BaseQRankSum=-0.377;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.72;MQ0=0;MQRankSum=-1.093;ReadPosRankSum=-0.811 GT:AD:DP:GQ:PL:SB 0/1:7,6,0:13:99:156,0,188,177,207,384:4,3,3,3
1 1115552 . C <NON_REF> . . END=1115552 GT:DP:GQ:MIN_DP:PL 0/0:15:0:15:0,0,31

But when the sample is combined with other samples, that position gets filled in with a simple "0/0", without any PLs (or any of the other fields, including AD, DP, GQ, etc.), which causes GenotypeGVCFs to choke. I can imagine there might be other scenarios that will result in a "0/0" genotype field, so perhaps the easiest way to fix this would be to make sure that any "0/0" actually gets output as "./.:.:.:.:.".

Thanks, Grace

I tried specifying intervals using a vcf file to HaplotypeCaller and I'm not getting the appropriate hits, because it appears that GATK is interpreting the vcf file as 0-based.

When GATK finds a read for which a corresponding @RG line is missing in the header, the error message implies that the read itself is lacking an RG tag rather than the header. Could this be fixed so that the two error conditions are differentiated? It will save people time when debugging their pipelines if they don't have to go looking at the wrong thing.

##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/lustre/blah/DDD_MAIN5247030.bam} is malformed: Read HS7_7515:4:2101:12189:66438#2 is missing the read group (RG) tag, which is required by the GATK. Please use to fix this problem

The reads have the RG tag, but an @RG line matching their ID does not exist in the header:

901282:HS7_7515:4:2101:12189:66438#2 99 1 37000590 60 75M = 37000629 114 * * X0:i:1 X1:i:0 BC:Z:CGATGTAT BD:Z:* MD:Z:75 PG:Z:MarkDuplicates RG:Z:1#2 BI:Z:* AM:i:37 NM:i:0 SM:i:37 MQ:i:60 QT:Z:BCAADFFE XT:A:U BQ:Z:*
901283:HS7_7515:4:2101:12189:66438#2 147 1 37000629 60 75M = 37000590 -114 * * X0:i:1 X1:i:0 BD:Z:* MD:Z:75 PG:Z:MarkDuplicates RG:Z:1#2 BI:Z:* AM:i:37 NM:i:0 SM:i:37 MQ:i:60 XT:A:U BQ:Z:*

P.S. Your spam filter is stopping me posting discussions with URLs in them; could you whitelist any gatkforums dot broad institute dot org URLs?

I'm running into a problem with vcfs that have single-ended break ends. (These are produced by an old version of Strelka.) Tribble doesn't recognize "." as valid in alternative alleles. Single break ends are valid in the VCF standard, and the files validate according to Vcftools.

Others have run into this problem as well: https://groups.google.com/forum/#!searchin/strelka-discuss/gatk/strelka-discuss/gJfsyjZNZXA/ExDXZiVWW_kJ

Example error:

##### ERROR stack trace
org.broad.tribble.TribbleException: The provided VCF file is malformed at approximately line number 1807: Unparsable vcf record with allele .CCCAGGAGGACTCACTGCCGCTGTCACCTCTGCTGCCACCACTGTTGCCAC, for input source: /cga/tcga-gsc/benchmark/Indels/strelkaPON/NA18606.mapped.ILLUMINA.bwa.CHB.exome.20111114.bam-NA18608.mapped.ILLUMINA.bwa.CHB.exome.20111114.bam/final.indels.vcf at org.broadinstitute.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:715) at org.broadinstitute.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:527) at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseSingleAltAllele(AbstractVCFCodec.java:553) at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:494) at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:291) at org.broadinstitute.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:234) at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:213) at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:45) at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:73) at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:35) at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:284) at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:264) at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:225)
##### ERROR ------------------------------------------------------------------------------------------


Example vcf line

19  36002413    .   C   .CCCAGGAGGACTCACTGCCGCTGTCACCTCTGCTGCCACCACTGTTGCCAC    .   PASS    IHP=1;NT=ref;QSI=82;QSI_NT=82;SGT=ref->hom;SOMATIC;SVTYPE=BND;TQSI=1;TQSI_NT=1  DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50   49:49:42,44:0,0:7,6:43.72:0.85:0.00 11:11:0,0:6,6:5,5:14.61:0.48:0.0


A full vcf is available at: /humgen/gsa-scr1/pub/incoming/BreakendBug/breakend.vcf
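Until the parser accepts them, one workaround is to screen out single-breakend records before handing the file to GATK. A hedged sketch of the detection logic (my own code; a bare "." ALT means a missing allele and is still valid, so it is deliberately excluded):

```python
def is_single_breakend(alt_field):
    """True if any ALT allele uses single-breakend notation ('.SEQ' or 'SEQ.')."""
    return any(len(a) > 1 and (a.startswith(".") or a.endswith("."))
               for a in alt_field.split(","))
```

Records where this returns True could be written to a side file and re-merged after the GATK steps.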

Hi,

I was using PrintReads as part of Base Quality Score Recalibration. The resulting BAM header is incorrect, as there are two @PG entries with the same ID for GATK PrintReads. The bam file I am using is internal to the Broad Institute and is Picard-aggregated; there is already an entry for GATK PrintReads in the original bam. Would it be possible for the @PG header lines to include a .A-Z suffix like the other header lines?

Thanks Aaron

##### ERROR MESSAGE: SAM/BAM file /btl/projects/aaron/Mouse/G42214/TET-OT-827ctl.recalibrated.bam is malformed: Program record with group id GATK PrintReads already exists in SAMFileHeader!
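As a stopgap for the duplicate-ID error above, the @PG IDs in an extracted SAM header can be uniquified before re-heading the BAM. An illustrative sketch (my own code; note it does not update PP fields that reference a renamed ID):

```python
def uniquify_pg_ids(header_lines):
    """Append .1, .2, ... to repeated @PG ID fields in a SAM header."""
    seen = {}
    out = []
    for line in header_lines:
        if line.startswith("@PG"):
            fields = line.split("\t")
            for i, f in enumerate(fields):
                if f.startswith("ID:"):
                    pg_id = f[3:]
                    n = seen.get(pg_id, 0)
                    if n:                                   # repeat: rename
                        fields[i] = "ID:%s.%d" % (pg_id, n)
                    seen[pg_id] = n + 1
            line = "\t".join(fields)
        out.append(line)
    return out
```

The fixed header could then be applied with samtools reheader (or Picard's ReplaceSamHeader).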

Hello all,

We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.

We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:

scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.

In this case, individual 3 has a homozygous reference genotype despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous. Some of the bases used to call the G allele are of low quality (when looking more closely using IGV), but I would still expect the genotype to be heterozygous.

A reverse example:

scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.

Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.

I've also uploaded screenshots from IGV that I hope will help clarify the problems we're having. We have used GATK's default parameters in almost all cases, and we did not use VQSR, as we did not have a list of high-confidence SNPs at the time.
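As a back-of-envelope check on the counts in the first example (this is NOT the GATK genotyping model, which also weighs base and mapping qualities): compare the binomial likelihood of the observed AD under a hom-ref hypothesis (alt reads are sequencing errors, rate assumed ~1%) versus a het hypothesis (alt reads drawn at ~50%).

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# AD for individual 1 above: 20 ref reads, 9 alt reads.
ref_n, alt_n = 20, 9
n = ref_n + alt_n
p_hom_ref = binom_pmf(alt_n, n, 0.01)   # alt reads as errors
p_het = binom_pmf(alt_n, n, 0.5)        # alt reads from a real allele
```

On raw counts alone the het hypothesis is overwhelmingly favored, so the hom-ref call for individual 3 must be driven by the low base/mapping qualities rather than the depths.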

Dear GATK team,

I think I found a bug, as shown in the attached screenshot. The highlighted line of the vcf file, after using UnifiedGenotyper, VariantRecalibrator, ApplyRecalibration, ReadBackedPhasing, SelectVariants, CombineVariants, and SnpEff (following Best Practices), shows:

X 307897 . C CGTGAAGG 9494.14 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=8.010;DP=18;EFF=DOWNSTREAM(MODIFIER||||PPP2R3B|protein_coding|CODING|ENST00000381625|),INTRON(MODIFIER||||PPP2R3B|protein_coding|CODING|ENST00000390665|4),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000445792|5),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000475859|2),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000477110|1),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000496630|3),UPSTREAM(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000468169|),UPSTREAM(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000477636|),UPSTREAM(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000484364|);FS=0.000;InbreedingCoeff=0.3019;MQ=45.55;MQ0=0;MQRankSum=-8.092;QD=9.55;ReadPosRankSum=2.020;VQSLOD=2.27;culprit=ReadPosRankSum;set=29_BOTH GT:AD:DP:GQ:PL ./. 0/1:0,14:9:99:367,42,0

To be sure, I checked the two samples' bams at the PrintReads stage but do not see such a long indel. Is this a potentially similar bug to http://gatkforums.broadinstitute.org/discussion/2516/rare-homozygous-indel-bug-in-gatk-2-4-9 ?

Jamie

Hi,

I'm trying to apply my recalibrated quality values with the following command:

java -jar GenomeAnalysisTK.jar -T PrintReads -R genome_noHap.fa -I realigned.bam -BQSR realigned.recal_data.grp -o realigned.recal.bam -nct 25

And get the following error. I couldn't find it elsewhere. Would be happy if I could get some advice.

Best,

Karl

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 0 at org.broadinstitute.sting.utils.recalibration.BaseRecalibration.recalibrateRead(BaseRecalibration.java:151) at org.broadinstitute.sting.utils.recalibration.BQSRReadTransformer.apply(BQSRReadTransformer.java:93) at org.broadinstitute.sting.gatk.walkers.readutils.PrintReads.map(PrintReads.java:251) at org.broadinstitute.sting.gatk.walkers.readutils.PrintReads.map(PrintReads.java:98) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:230) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:218) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722)

##### ERROR ------------------------------------------------------------------------------------------

Hi,

I did search the site, and it seems several people have had a similar problem, but I couldn't fix mine based on those threads. I've successfully gotten to the point of creating a realigned bam file with IndelRealigner. I'm working with a non-model organism, using a self-created masking list from a VCF run done before the realignment (list in BED format). I've had to use -fixMisencodedQuals throughout my runs. I checked the masking file so that its ranges do not go beyond the ranges specified in my list of regions to cover (-L option).

Is there anything else I would need to check still?
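Since -fixMisencodedQuals keeps coming up, one more thing worth double-checking is whether the input really is Phred+64 encoded. A rough heuristic over a sample of quality strings (illustrative only, not GATK code): real Phred+33 data almost always contains some character below '@' (ASCII 64), while Phred+64 data cannot.

```python
def looks_phred64(qual_strings):
    """Heuristic: if no quality character falls below '@' (ASCII 64),
    the qualities are likely Phred+64 and -fixMisencodedQuals applies."""
    lo = min(min(q) for q in qual_strings if q)
    return ord(lo) >= 64
```

Feeding this a few thousand QUAL strings from the FASTQ/BAM should confirm whether the flag is actually needed; applying it to already-Phred+33 data corrupts the qualities.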

Command line view:

sulyba@hippu4:/fs/lustre/wrk/sulyba/stickleback_capture> java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -R gasAcu_combinedbac_inv7.fa -I realigned_FF1_inv7c.bam -knownSites Sites_to_Mask.bed -L Capture_Target_Regions.intervals -o recalc_FF1_inv7c.grp -plots recal_FF1_plots.grp.pdf -fixMisencodedQuals

INFO  12:34:10,502 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:34:10,510 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34, Compiled 2013/01/11 22:43:14
INFO  12:34:10,510 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  12:34:10,516 HelpFormatter - Program Args: -T BaseRecalibrator -R gasAcu_combinedbac_inv7.fa -I realigned_FF1_inv7c.bam -knownSites Sites_to_Mask.bed -L Capture_Target_Regions.intervals -o recalc_FF1_inv7c.grp -plots recal_FF1_plots.grp.pdf -fixMisencodedQuals
INFO  12:34:10,516 HelpFormatter - Date/Time: 2013/02/01 12:34:10
INFO  12:34:10,517 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:34:10,517 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:34:10,551 ArgumentTypeDescriptor - Dynamically determined type of Sites_to_Mask.bed to be BED
INFO  12:34:10,563 GenomeAnalysisEngine - Strictness is SILENT
INFO  12:34:11,309 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  12:34:11,319 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:34:11,392 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07
INFO  12:34:11,660 GenomeAnalysisEngine - Processing 20350670 bp from intervals
INFO  12:34:11,672 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  12:34:11,946 BaseRecalibrator - The covariates being used here:
INFO  12:34:11,947 BaseRecalibrator -   QualityScoreCovariate
INFO  12:34:11,947 BaseRecalibrator -   ContextCovariate
INFO  12:34:11,947 ContextCovariate -           Context sizes: base substitution model 2, indel substitution model 3
INFO  12:34:11,947 BaseRecalibrator -   CycleCovariate
INFO  12:34:11,952 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3]
INFO  12:34:11,952 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,952 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3]
INFO  12:34:11,953 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3]
INFO  12:34:11,953 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1002, 3]
INFO  12:34:11,954 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,954 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:12,109 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO  12:34:12,120 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO  12:34:12,644 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO  12:34:12,645 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO  12:34:14,901 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: -2
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: -2
##### ERROR ------------------------------------------------------------------------------------------


Dear GATK Team,

I am running a pipeline on several high-coverage human individuals that have been mapped with bwa and processed with samtools, Picard, and GATK. The BAM files pass Picard's ValidateSamFile, but when I run the BQSR step some of them fail with a Malformed read error (using -filterMBQ does not help in this case). I tracked the error down to BAM files that end with a paired-end read whose mate maps at the beginning of the contig (in my case human mtDNA).

E.g., a file ending like this will make it crash:

readX 177 MT 16558 37 7S2M2I10M80S = 294 -16176 GACCTGTGATCC...
readY 177 MT 16558 37 7S2M2I10M80S = 238 -16232 GACCTGTGATCC...
readZ 113 MT 16558 37 7S2M2I10M80S = 273 -16197 GACCTGTGATCC...
[END]

whereas a file ending like this won't crash:

readX 83 MT 16469 60 101M = 16246 -324 TGGGGGTAGCTAAAGTGAAC...
readY 147 MT 16469 60 101M = 16267 -303 TGGGGGTAGCTAAAGTGA...
readZ 147 MT 16469 60 101M = 16193 -377 TGGGGGTAGCTAAAGTGAAC...
[END]

I am running GATK v2.3-9-ge5ebf34, but the same error occurs with GATK v2.2-3 (my previous version). I can also genotype the files with UnifiedGenotyper without any problem.
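The crashing records above are recognizable by their implausibly large template lengths (TLEN, SAM column 9, e.g. -16176), which arise when one mate sits near the start of the circular mtDNA contig and the other near its end. A minimal sketch for flagging such pairs in SAM text (the function name and the 10 kb threshold are illustrative assumptions, not anything from GATK):

```python
def find_wraparound_mates(sam_lines, max_tlen=10000):
    """Flag read records whose template length (TLEN, column 9) is
    implausibly large, as happens when one mate maps near the start of a
    circular contig (e.g. mtDNA) and the other near the end."""
    flagged = []
    for line in sam_lines:
        if line.startswith("@"):              # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, rname, pos = fields[0], fields[2], int(fields[3])
        tlen = int(fields[8])
        if abs(tlen) > max_tlen:
            flagged.append((qname, rname, pos, tlen))
    return flagged
```

Run over `samtools view` output, this should flag the trailing reads in the crashing files (TLEN around -16000 on a ~16.6 kb mtDNA) while leaving files like the second example untouched, which would make it easy to confirm the pattern across all the failing BAMs.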

This is the error:

##### ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Array length mismatch detected. Malformed read?
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateFractionalErrorArray(BaseRecalibrator.java:380)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:246)
at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:468)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)