Tagged with #bug
1 documentation article | 10 announcements | 34 forum discussions



Created 2012-11-28 16:47:50 | Updated 2014-06-26 14:32:56 | Tags: official forum bug
Comments (0)

Note: only do this if you have been explicitly asked to do so.

Scenario:

You posted a question about a problem you had with GATK tools, we answered that we think it's a bug, and we asked you to submit a detailed bug report.

Here's what you need to provide:

  • The exact command line that you used when you had the problem (in a text file)
  • The full log output (program output in the console) from the start of the run to the end or error message (in a text file)
  • A snippet of the BAM file if applicable and the index (.bai) file associated with it
  • If a non-standard reference (i.e. not available in our resource bundle) was used, we need the .fasta, .fai, and .dict files for the reference
  • Any other relevant files such as recalibration plots

A snippet file is a slice of the original BAM file which contains the problematic region and is sufficient to reproduce the error. We need it in order to reproduce the problem on our end, which is the first necessary step to finding and fixing the bug. We ask you to provide this as a snippet rather than the full file so that you don't have to upload (and we don't have to process) huge giga-scale files.

Here's how you create a snippet file:

  • Look at the error message and see if it cites a specific position where the error occurred
  • If not, identify what region caused the problem by running with -L argument and progressively narrowing down the interval
  • Once you have the region, use PrintReads with -L to write the problematic region (with 500 bp padding on either side) to a new file -- this is your snippet file.
  • Test your command line on this snippet file to make sure you can still reproduce the error on it.

And finally, here's how you send us the files:

  • Put all those files into a .zip or .tar.gz archive
  • Upload them onto our FTP server as explained here (make sure you use the proper UPLOAD credentials)
  • Post in the original discussion thread that you have done this
  • Be sure to tell us the name of your archive file!

We will get back to you --hopefully with a bug fix!-- as soon as we can.


Created 2015-07-09 23:14:09 | Updated 2015-07-09 23:14:27 | Tags: patch bug bug-fixed topstory
Comments (3)

This patch release fixes several issues that we felt were annoying enough to warrant an interim version release; see detailed list below.

Oh, and in case you're curious about the patch number, the *-46 numeral refers to the number of code commits made since the 3.4-0 release. Out of this, about a dozen are related to the bug fixes listed above; another dozen relate to new work that is not yet publicly available (gotta wait 'til 3.5...) and the rest are automated commits produced when merging completed work into the master codebase.


Modifications related to the new "spanning deletion" allele

In version 3.4, we introduced some functionality to handle cases where a site that is variant in one sample is covered by a deletion in another sample (see release notes for details). Some of that functionality involved using a new symbolic <*:DEL> notation to represent this in multi-sample VCFs. Unfortunately we later realized that this was not the best way to represent this, so we have now switched to a more spec-compliant notation, which is simply *. We've also added some refinements to correctly handle cases where alleles need to be remapped, and/or where there are multiple overlapping spanning deletions in the population (which we encountered twice in a population of ~10K samples, to give you an idea of how rarely that happens in humans). The new code is fully backward-compatible, so if you have any files that contain the <*:DEL> notation, these will be understood correctly by GATK. Note however that they will not be converted to the new notation -- they will just be passed through unchanged.


VectorLoglessPairHMM accelerated library

Some annoying gcc compiler version issues caused many of you to be unable to utilize Intel's accelerated PairHMM library to speed up HaplotypeCaller. We've sorted those out so now everyone should be able to utilize it provided their computing platform is running reasonably recent (i.e. not Jurassic-era) software.


MNP merging in ReadBackedPhasing

There was a minor bug in the ReadBackedPhasing tool that was causing some variants to be merged into MNPs when they shouldn't be; that's now fixed.


Speed issues in DepthOfCoverage

A user reported a dramatic slowdown in the DepthOfCoverage and shortly thereafter, provided a code patch that resolved to problem. Self-fixing bug reports are our favorite kind, thank you @mnw21cam!


Enhanced variant selection/filtering

OK, technically not a bug fix, so this shouldn't be in a patch, according to my esteemed software engineer colleagues, but I couldn't wait to share it with y'all. This is something people often ask how to do and until now it was beyond painful. But hark! You can now tell SelectVariants to filter out variants where a given number or proportion of samples have filtered genotypes. See the usage example in the tool doc (toward the end of the list). Now tell me you're not happy that we included it in the patch. C'mon, I dare you.


New read filter for overly clipped reads

Again with a not-a-bug-fix thing, sorry -- there's a new read filter to get rid of reads that have too much soft-clipping going on to be honest. This is a sign that there's something seriously wrong with them, as we recently found out while analyzing a bunch of cheek swab samples that turned out to be heavily contaminated with bacterial DNA. It's always fun to see some seriously messed-up samples go by... and it's nice that now we can do something to rescue them.


Various small fry

We made some minor changes to documentation (VQSLOD, QD) and error handling (clarified error message for Contigs Out of Order error, made GVCF indexing parameters not required for gzipped output).


Created 2015-05-22 01:58:47 | Updated 2015-07-09 22:27:40 | Tags: release bug version-highlights topstory
Comments (32)

Folks, I’m all out of banter for this one, so let’s go straight to the facts. GATK 3.4 contains a shedload of improvements and bug fixes, including some new functionality that we hope you’ll find useful. The full list is available in the detailed release notes.

None of the recent changes involves any disruption to the Best Practice workflow (I hear some sighs of relief) but you’ll definitely want to check out the tweaks we made to the joint discovery tools (HaplotypeCaller, CombineGVCFs and GenotypeGVCFs), which are rapidly maturing as they log more flight time at Broad and in the wild.


Key changes to the joint discovery tools

Let’s start at the very beginning with HaplotypeCaller (a very good place to start). On the usability front, we’ve finally given in to the nigh-universal complaint about the required variant indexing arguments (--variant_index_type LINEAR --variant_index_parameter 128000) being obnoxious and a waste of characters. So, tadaa, they are no longer required, as long as you name your output file with the extension .g.vcf so that the engine knows what level of compression to use to write the gVCF index (which leads to better performance in downstream tools). We think this naming convention makes a lot of sense anyway, as it’s a great way to distinguish gVCFs from regular VCFs on sight, so we hope most of you will adopt it. That said, we stopped short of making this convention mandatory (for now…) so you don’t have to change all your scripts and conventions if you don’t want to. All that will happen (assuming you still specify the variant index parameters as previously) is that you’ll get a warning in the log telling you that you could use the new convention.

Where we’ve been a bit more dictatorial is that we’ve completely disabled the use of -dcov with HaplotypeCaller because it was causing very buggy behavior due to an unforeseen complication in how different levels of downsampling are applied in HaplotypeCaller. We know that the default setting does the right thing, and there’s almost no legitimate reason to change it, so we’re disabling this for the greater good pending a fix (which may be a long time coming due to the complexity of the code involved).

Next up, CombineGVCFs gets a new option to break up reference blocks at every N sites. The new argument --breakBandsAtMultiplesOf Nwill ensure that no reference blocks in the combined gVCF span genomic positions that are multiples of N. This is meant to enable scatter-gather parallelization of joint genotyping on whole-genome data, as a workaround to some annoying limitations of the GATK engine that make it unsafe to use -L intervals that might start within the span of a block record. For exome data, joint genotyping can easily be parallelized by scatter-gathering across exome capture target intervals, because we know that there won’t be any hom-ref block records spanning the target interval boundaries. In contrast, in whole-genome data, there is no equivalent predictable termination of block records, so it’s not possible to know up front where it would be safe to set scatter-gather interval start and end points -- until now!

And finally, GenotypeGVCFs gets an important bug fix, and a very useful new annotation.

The bug is something that has arisen mostly (though not exclusively) from large cohort studies. What happened is that, when a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, GenotypeGVCFs would emit a homozygous reference genotype for sample B at that position -- which is obviously incorrect. The fix is that now, sample B will be genotyped as having a symbolic <*:DEL> allele representing the deletion.

The new annotation is called RGQ for Reference Genotype Quality. It is a new sample-level annotation that will be added by GenotypeGVCFs to monomorphic sites if you use the -allSites argument to emit non-variant sites to the output VCF. This is obviously super useful for evaluating the level of confidence of those sites called homozygous-reference.


New RNAseq tool: ASEReadCounter

This new coverage analysis tool is designed to count read depth in a way that is appropriate for allele-specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. The default output format produced by this tool is a structured text file intended to be consumed by the statistical analysis toolkit MAMBA. A paper by Stephane Castel and colleagues describing the complete ASE analysis workflow is available as a preprint on bioarxiv.


New documentation features: “Common Problems” and “Issue Tracker”

We’ve added two new documentation resources to the Guide.

One is a new category of documentation articles called Common Problems, to cover topics that are a specialized subset of FAQs: problems that many users encounter, which are typically due to misunderstandings about input requirements or about the expected behavior of the tools, or complications that arise from certain experimental designs. This category is being actively worked on and we welcome suggestions of additional topics that it should cover.

The second is an Issue Tracker that lists issues that have been reported as well as features or enhancements that have been requested. If you encounter a problem that you think might be a bug (or you have a feature request in mind), you can check this page to see if it’s something we already know about. If you have submitted a bug report, you can use the issue tracker to check whether your issue is in the backlog, in the queue, or is being actively worked on. In future we’ll add some functionality to enable voting on what issues or features should be prioritized, so stay tuned for an announcement on that!


Created 2014-10-23 23:09:56 | Updated | Tags: Troll release bug version-highlights topstory
Comments (18)

Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.

Because -ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.

Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.


Genotype refinement workflow with all the trimmings

As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.


Non-diploids, rejoice!

This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.


HaplotypeCaller gets physical

You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don’t, go read this now. Go on, we’ll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That’s great, but until now, those variants were unphased, meaning the HC didn’t give you any information about whether any two variants’ alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, you’d want to know whether you had this:

or this:

But HC wouldn’t tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).

In a nutshell, phased records will look like this:

1   1372243  .  T  <NON_REF>  .  .  END=1372267  <snip>  <snip>
1   1372268  .  G  A,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,40,0:70:99:0|1:1372268_G_A:<snip>
1   1372269  .  G  T,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,41,0:71:99:0|1:1372268_G_A:<snip>
1   1372270  .  C  <NON_REF>  .  .  END=1372299  <snip>  <snip>

You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).

The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.


Heads or tails

Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).

We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the --recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.


Variant annotations finally make sense

Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.


Created 2014-07-30 20:26:12 | Updated | Tags: release bug version-highlights topstory reference-model gatk-3-2
Comments (2)

Better late than never (right?), here are the version highlights for GATK 3.2. Overall, this release is essentially a collection of bug fixes and incremental improvements that we wanted to push out to not keep folks waiting while we're working on the next big features. Most of the bug fixes are related to the HaplotypeCaller and its "reference confidence model" mode (which you may know as -ERC GVCF). But there are also a few noteworthy improvements/changes in other tools which I'll go over below.


Working out the kinks in the "reference confidence model" workflow

The "reference confidence model" workflow, which I hope you have heard of by now, is that awesome new workflow we released in March 2014, which was the core feature of the GATK 3.0 version. It solves the N+1 problem and allows you to perform joint variant analysis on ridiculously large cohorts without having to enslave the entire human race and turning people into batteries to power a planet-sized computing cluster. More on that later (omg we're writing a paper on it, finally!).

You can read the full list of improvements we've made to the tools involved in the workflow (mainly HaplotypeCaller and Genotype GVCFs) in Eric's (unusually detailed) Release Notes for this version. The ones you are most likely to care about are that the "missing PLs" bug is fixed, GenotypeGVCFs now accepts arguments that allow it to emulate the HC's genotyping capabilities more closely (such as --includeNonVariantSites), the AB annotation is fully functional, reference DPs are no longer dropped, and CatVariants now accepts lists of VCFs as input. OK, so that last one is not really specific to the reference model pipeline, but that's where it really comes in handy (imagine generating a command line with thousands of VCF filenames -- it's not pretty).

HaplotypeCaller now emits post-realignment coverage metrics

The coverage metrics (DP and AD) reported by HaplotypeCaller are now those calculated after the HC's reassembly step, based on the reads having been realigned to the most likely haplotypes. So the metrics you see in the variant record should match what you see if you use the -bamout option and visualize the reassembled ActiveRegion in a genome browser such as IGV. Note that if any of this is not making sense to you, say so in the comments and we'll point you to the new HaplotypeCaller documentation! Or, you know, look for it in the Guide.

R you up to date on your libraries?

We updated the plotting scripts used by BQSR and VQSR to use the latest version of ggplot2, to get rid of some deprecated function issues. If your Rscripts are suddenly failing, you'll need to update your R libraries.

A sincere apology to GATK-based tool developers

We're sorry for making you jump through all these hoops recently. As if the switch to Maven wasn't enough, we have now completed a massive reorganization/renaming of the codebase that will probably cause you some headaches when you port your tools to the newest version. But we promise this is the last big wave, and ultimately this will make your life easier once we get the GATK core framework to be a proper maven artifact.

In a nutshell, the base name of the codebase has changed from sting to gatk (which hopefully makes more sense), and the most common effect is that sting.gatk classpath segments are now gatk.tools. This, by the way, is why we had a bunch of broken documentation links; most of these have been fixed (yay symlinks) but there may be a few broken URLs remaining. If you see something, say something, and we'll fix it.


Created 2014-06-11 21:20:08 | Updated | Tags: best-practices bug error rnaseq topstory
Comments (2)

We discovered today that we made an error in the documentation article that describes the RNAseq Best Practices workflow. The error is not critical but is likely to cause an increased rate of False Positive calls in your dataset.

The error was made in the description of the "Split & Trim" pre-processing step. We originally wrote that you need to reassign mapping qualities to 60 using the ReassignMappingQuality read filter. However, this causes all MAPQs in the file to be reassigned to 60, whereas what you want to do is reassign MAPQs only for good alignments which STAR identifies with MAPQ 255. This is done with a different read filter, called ReassignOneMappingQuality. The correct command is therefore:

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

In our hands we see a bump in the rate of FP calls from 4% to 8% when the wrong filter is used. We don't see any significant amount of false negatives (lost true positives) with the bad command, although we do see a few more true positives show up in the results of the bad command. So basically the effect is to excessively increase sensitivity, at the expense of specificity, because poorly mapped reads are taken into account with a "good" mapping quality, where they would normally be discarded.

This effect will be stronger in datasets with lower overall quality, so your results may vary. Let us know if you observe any really dramatic effects, but we don't expect that to happen.

To be clear, we do recommend re-processing your data if you can, but if that is not an option, keep in mind how this affects the rate of false positive discovery in your data.

We apologize for this error (which has now been corrected in the documentation) and for the inconvenience it may cause you.


Created 2014-04-11 05:00:50 | Updated | Tags: selectvariants bug multithreading ad bug-fixed nt
Comments (1)

This is not exactly new (it was fixed in GATK 3.0) but it's come to our attention that many people are unaware of this bug, so we want to spread the word since it might have some important impacts on people's results.

Affected versions: 2.x versions up to 2.8 (not sure when it started)

Affected tool: SelectVariants

Trigger conditions: Extracting a subset of samples with SelectVariants while using multi-threading (-nt)

Effects: Genotype-level fields (such as AD) swapped among samples

This bug no longer affects any tools in versions 3.0 and above, but callsets generated with earlier versions may need to be checked for consistency of genotype-level annotations. Our sincere apologies if you have been affected by this bug, and our thanks to the users who reported experiencing this issue.


Created 2013-10-11 13:36:20 | Updated | Tags: variantstobinaryped bug bug-fixed
Comments (0)

As reported here, a bug was found in VariantsToBinaryPED. Briefly, VariantsToBinaryPed expected the fam file to describe the samples in the same order as the input VCF file: if they were not in the same order, it did not correctly map sample IDs with the genotypes in the output binary PED.

We expect that in most use cases, the order would be the same (because PLINK uses lexicographic order, as does the GATK), so the bug would not impact results. However, as the user report demonstrates, in cases where order was different, the bug would seriously impact results.

We therefore recommend that anyone who has used VariantsToBinaryPED check their results for any inconsistencies in the kinship coefficients. Our apologies for the inconvenience to anyone who is affected by this bug, and big thanks again to user TimHughes for reporting the bug.

Finally, we have fixed the bug in GATK and released the fixed version under version number 2.7-4.


Created 2013-03-14 12:29:25 | Updated 2014-12-01 17:34:07 | Tags: unifiedgenotyper official haplotypecaller printreads bug
Comments (36)

As reported here:

  • http://gatkforums.broadinstitute.org/discussion/2342/unifiedgenotyper-causes-arrayindexoutofboundsexception-3-how-to-fix-it
  • http://gatkforums.broadinstitute.org/discussion/2343/printreads-yet-another-arrayindexoutofboundsexception
  • http://gatkforums.broadinstitute.org/discussion/2345/arrayindexoutofboundsexception-in-haplotypecaller

If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.

Thank you for your patience while we work to fix this issue.

Latest update: we found that the three tools (PrintRead, HaplotypeCaller and UnifiedGenotyper) had different issues that only produced the same symptom (the ArrayIndexOutOfBoundsException error). The underlying issues have all been fixed in version 2.5. If you encounter an ArrayIndexOutOfBoundsException in later versions, it is certainly caused by a different problem, so please open a new thread to report the issue.


Created 2012-09-24 20:29:35 | Updated 2012-09-24 20:29:35 | Tags: reducereads bug
Comments (0)

We have identified a major bug in ReduceReads -- GATK versions 2.0 and 2.1. The effect of the bug is that variant regions with more than 100 reads and fewer than 250 reads get downsampled to 0 reads.

This has now been fixed in the most recent release.

To check if you are using a buggy version, run the following:

    samtools view -H $BAM

This will produce the following output:

    @PG ID:GATK ReduceReads VN:XXX

If XXX is 2.0 or 2.1, any results obtained with your current version are suspect, and you will need to upgrade to the most recent version then rerun your processing.

Our most sincere apologies for the inconvenience.


Created 2012-09-20 20:43:12 | Updated 2012-09-20 20:43:12 | Tags: bqsr baserecalibrator queue scatter-gather bug
Comments (0)

We have discovered a bug that seriously impacts the results of BQSR/ BaseRecalibrator when it is run with the scatter-gather functionality of Queue. To be clear, the bug does NOT affect BaseRecalibrator runs performed "normally", i.e. WITHOUT Queue's scatter-gather.

Consequences/ Solution:

Please be aware that if you have been using BaseRecalibrator scatter-gathered with Queue (GATK versions 2.0 and 2.1), your results may be wrong. You will need to redo the base recalibration of your data WITHOUT scatter-gathering.

This issue will be fixed in the next release (version 2.2). We apologize for any inconvenience this may cause you!


Created 2015-08-27 18:54:14 | Updated 2015-08-27 18:54:44 | Tags: haplotypecaller bug
Comments (0)

Haplotype Caller output this record, how can it have an AD of 0,0? 7 21584892 . T TAA 257.77 . AC=1;AF=0.500;AN=2;DP=118;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=70.00;SOR=4.804 GT:AD:GQ:PL 0/1:0,0:99:125,0,112

GATK version is 3.4-46


Created 2015-06-23 16:19:39 | Updated | Tags: vcf developer bam bug error
Comments (4)

I hava a question wish to get help from the developers: I am using GATK with two modles: 1, I just use the UnifiedGenotyper to call the variants from a prepared bam file, then I get a vcf file. (Call it A.vcf) 2, I run the UnifiedGenotype by Chr, one by one, say, by using the "-L" arg, then I get a sets of small vcf files.

But, the result of ChrY is confusing me a lot.... the ChrY part in the A.vcf is QUITE different from the small vcf file that generated by "-L chrY", the difference seems to be larger than 50%. That means, the result is DIFFERENT for chrY. However, I have also checked the other Chromosomes, the difference is slight. ONLY the ChrY has this problem.

Our script pasted here:

java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T RealignerTargetCreator \ -I ${sampleName}.bam \ -o ${sampleName}.intervals \ -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -known 1000G_phase1.indels.hg19.sites.vcf

java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T IndelRealigner \ -targetIntervals ${sampleName}.intervals \ -I ${sampleName}.bam \ -o ${sampleName}.realigned.bam \ -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -known 1000G_phase1.indels.hg19.sites.vcf

samtools index ${sampleName}.realigned.bam

java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T BaseRecalibrator \ -nct 8 \ -I ${sampleName}.realigned.bam \ -knownSites dbsnp_138.hg19.vcf \ -knownSites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ -knownSites 1000G_phase1.indels.hg19.sites.vcf \ -o ${sampleName}.recal_data.grp

java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T PrintReads \ -nct 8 \ -I ${sampleName}.realigned.bam \ -BQSR ${sampleName}.recal_data.grp \ -o ${sampleName}.realigned.recal.bam

samtools index ${sampleName}.realigned.recal.bam

java -Xmx30g -jar /data/SG/Env/software_installed/GenomeAnalysisTK.jar \ -L xx \ # I add -L option here when I do step 2. when I generate A.vcf ,I didn't add -L here -R ucsc.hg19.fasta \ -T UnifiedGenotyper \ -nct 8 \ -glm BOTH \ -I ${sampleName}.realigned.recal.bam \ -D dbsnp_138.hg19.vcf \ -o ${sampleName}.vcf \ #here A.vcf or small vcf generated -stand_call_conf 50.0 \ -stand_emit_conf 10.0 \ -dcov 200 \ -A AlleleBalance -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A RMSMappingQuality -A InbreedingCoeff -A Coverage

Wish you guys can offer me some help.

Thanks,


Created 2015-05-23 05:08:22 | Updated 2015-05-23 05:12:25 | Tags: haplotypecaller bug gatk
Comments (7)

Hi All,

I cannot perform joint genotype calling using gVCFs using the GenotypeGVCF command on my dataset containing 352 mtDNA samples.

Does any one has run into this problem before, or has suggestions on how to get this working.

I have attached the standard output for a hanging run.

I can generate VCF files individually no problems.

INFO 17:06:45,732 GenomeAnalysisEngine - Strictness is SILENT INFO 17:06:45,867 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 17:06:47,721 MicroScheduler - Running the GATK in parallel mode with 4 total threads, 1 CPU thread(s) for each of 4 data thread(s), of 20 processors available on this machine INFO 17:06:47,791 GenomeAnalysisEngine - Preparing for traversal INFO 17:06:47,793 GenomeAnalysisEngine - Done preparing for traversal INFO 17:06:47,794 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:06:47,794 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 17:06:47,795 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 17:06:48,436 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files INFO 17:07:17,800 ProgressMeter - gi|251831106|ref|NC_012920.1|:1 0.0 30.0 s 49.6 w 0.0% 15250.3 w 15250.3 w Waiting for data... (interrupt to abort) GATK VERSION: The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41 java version "1.7.0_51"

Thanks for your time. James Boocock


Created 2015-03-20 17:30:23 | Updated | Tags: variantfiltration bug
Comments (14)

Hello, I'm trying to figure out what's wrong in my script for this variant filtration argument. Here is what I ran:

java -jar tools/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \ -V vcftoolsextractGATK_snps.vcf \ -o GATK_darwinfinch_filter.vcf \ --filterExpression QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2 \ --filterName “darwinfinchfilter”

I got the following error message: "2: No such file or directory."

However, I know my file path was set correctly as if i remove the last two arguments and only ran: java -jar tools/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \ -V vcftoolsextractGATK_snps.vcf \ -o GATK_darwinfinch_filter.vcf

It would run.

I also tried Geraldine example filterExpression:

java -jar tools/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \ -V vcftoolsextractGATK_snps.vcf \ -o GATK_geraldinefilter.vcf \ --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \ --filterName “Geraldine_snp_filter"

and I got the following error: Unmatched ".


Created 2015-03-09 10:08:14 | Updated 2015-03-09 10:10:04 | Tags: realignertargetcreator bug gatk-runtime-error
Comments (1)

I am trying to reproduce running the GATK pipeline (best practices) on Power8, but during the RealignerTargetCreator phase, I run into the following problem: INFO 06:05:39,926 HelpFormatter - -------------------------------------------------------------------------------- INFO 06:05:39,928 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 INFO 06:05:39,929 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 06:05:39,929 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 06:05:39,933 HelpFormatter - Program Args: -T RealignerTargetCreator -R genome/hg19.fasta -I results/out_sorted_dup.bam -o results/out_target_intervals.interval_list -known vcf/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known vcf/1000G_phase1.indels.hg19.sites.vcf INFO 06:05:39,941 HelpFormatter - Executing as admin@power8-2xl on Linux 3.14.17-100.fc19.ppc64p7 ppc64; OpenJDK 64-Bit Server VM 1.7.0_65-mockbuild_2014_07_26_03_56-b00. INFO 06:05:39,942 HelpFormatter - Date/Time: 2015/03/09 06:05:39 INFO 06:05:39,942 HelpFormatter - -------------------------------------------------------------------------------- INFO 06:05:39,942 HelpFormatter - -------------------------------------------------------------------------------- INFO 06:05:40,016 GenomeAnalysisEngine - Strictness is SILENT INFO 06:05:40,150 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 06:05:40,159 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 06:05:40,182 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 INFO 06:05:40,461 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 06:05:41,576 GenomeAnalysisEngine - Done preparing for traversal INFO 06:05:41,577 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 06:05:41,578 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 06:05:41,578 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 06:05:43,206 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.ArrayIndexOutOfBoundsException: -94 at org.broadinstitute.gatk.utils.BaseUtils.convertIUPACtoN(BaseUtils.java:216) at org.broadinstitute.gatk.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:288) at org.broadinstitute.gatk.engine.datasources.providers.LocusReferenceView.initializeReferenceSequence(LocusReferenceView.java:150) at org.broadinstitute.gatk.engine.datasources.providers.LocusReferenceView.(LocusReferenceView.java:126) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:90) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: -94

This is the command I used: java jar ~Xmx32g -jar ~/workspace/GATK/tools/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R genome/hg19.fasta -I results/out_sorted_dup.bam -o results/out_target_intervals.interval_list -known vcf/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known vcf/1000G_phase1.indels.hg19.sites.vcf

I tried to look for a similar problem on the forum, but could not find anything. Could you advise how to resolve this situation? The error message suggests this may be a bug..

Thanks in advance!


Created 2015-02-20 09:51:54 | Updated | Tags: bug
Comments (5)

Validating the vcf file which comes out of the VariantAnnotator is resulting in an error. The GC Content is a float and not an integer (which is stated in the INFO column). At the moment we are do a search and replace ourselves, but it would be nice if it can be fixed in a next version of GATK.


Created 2015-01-27 10:30:58 | Updated | Tags: variantfiltration jexl bug variantcontext hardfilters
Comments (8)

I was pulling my hair out over this one.

I was applying a hard filter to a genotyped gVCF using JEXL to access variant context attributes to decide what filter setting I would apply. The filter was

"vc.getGenotype("%sample%").isHomRef() ? vc.getGenotype("%sample%").getAD().size() == 1 ? DP < 10 : ( DP - MQ0 ) < 10 or ( MQ0 - (1.0 * DP) ) >= 0.1 or MQRankSum <= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81 : false"

In pseudocode it says:

 `if ( isHomRef ) then
   if ( getAD().size() == 1 ) then DP < 10 else
      ( DP - MQ0 ) < 10 or ( MQ0 - (1.0 * DP) ) >= 0.1 or MQRankSum >= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81 else ignore record`

The idea being that for records where not all reads contained the reference allele, we would filter out those positions where there was evidence to suggest that the reads supporting an alternate allele were of a significantly better quality. However, running this filter I keep getting the warning (snipped for clarity):

WARN [SNIP]... MQRankSum <= 3.2905 [SNIP]... : false;' undefined variable MQRankSum

So I thought the filter was failing. However, just as a test, I changed the direction of MQRankSum from >=3.2905 to <=3.2905 (a bit nonsensical, it should basically apply the filter to almost all HomRef positions that had any reads supporting an alternate allele).

I still get the warning but I found the filter was applied to variant records as it should be. e.g. the following went from PASS to BAD_HOMREF:

Supercontig_1.1 87 . C A . BAD_HOMREF BaseQRankSum=2.79;DP=40;MQ=39.95;MQ0=0;MQRankSum=-2.710e+00;ReadPosRankSum=0.819;VariantType=SNP GT:AB:AD:DP 0/0:0.870:34,5:39

So the filter is being correctly applied, but I am not sure why all the warnings are being generated? Is this a bug? Have I done something wrong?


Created 2015-01-23 08:37:49 | Updated | Tags: haplotypecaller bug
Comments (8)

Hello,

I have a pb with HaplotypeCaller.
I saw that this error has been reported as linked with -nct option, but I did not use it.

Strangely enough, I think my command worked some times ago (before server restart ?)

All reference files I used are from the GATK bundle.

Sletort


Created 2014-12-12 12:51:21 | Updated | Tags: haplotypecaller bug pairhmm stdout
Comments (1)

Hi,

I'm using 3.3-0-g37228af.

This command gets around a feature that looks like a bug to me:

 gatk -T HaplotypeCaller -R ref.fa -I in.bam -o /dev/stdout 2> log.txt | grep -v -e PairHMM -e "JNI call" | bgzip -s - > out.vcf.gz

As you see, one has to remove three lines of the stdout stream written by PairHMM. Like all other logging information, they should go to stderr.

The three lines look like this:

Using un-vectorized C++ implementation of PairHMM
Time spent in setup for JNI call : 3.3656100000000003E-4
Total compute time in PairHMM computeLikelihoods() : 0.008854789

Regard,s Ari


Created 2014-11-20 23:34:28 | Updated | Tags: selectvariants bug
Comments (1)

Hi I'm using select variants to extract out variants with PASS filter, and those that are biallelic SNPs with the command below using gatk 3.2-2.

java -Xmx${MEM} -jar ${gatk_dir}/GenomeAnalysisTK.jar -R ${ref_dir}/${genome} \ -T SelectVariants \ --variant "1_"${VCF_FILE}"excCpGRepeat.vcf" \ -ef \ -L $CHROM \ --restrictAllelesTo BIALLELIC \ -selectType SNP \ -o "2"${VCF_FILE}"_PASSBiSNP_excCpGRepeat.vcf"

I split my vcf using snpsift - so I run it on each chromosome seperately. For 12 of the 38 chromosomes I get the error - the others work fine. I re-extracted one of the chromosomes that failed, to be sure it wasn't a snpsift issue, and it still fails. Any ideas on what the problem might be?

INFO 15:26:17,318 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:26:17,322 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03 INFO 15:26:17,322 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:26:17,323 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:26:17,328 HelpFormatter - Program Args: -R /u/home/c/projectdata/c/canids/reference/canfam31/canfam31_chr1NOTchr01/canfam31_chr1NOTchr01.fa -T SelectVariants --variant 1.chr35_excCpGRepeat.vcf -ef -L chr35 --restrictAllelesTo BIALLELIC -selectType SNP -o 2_chr35_PASSBiSNP_excCpGRepeat.vcf INFO 15:26:17,338 HelpFormatter - Executing as @n263 on Linux 2.6.32-431.20.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_55-mockbuild_2014_04_16_12_11-b00. INFO 15:26:17,338 HelpFormatter - Date/Time: 2014/11/20 15:26:17 INFO 15:26:17,339 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:26:17,339 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:26:17,469 GenomeAnalysisEngine - Strictness is SILENT INFO 15:26:17,715 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:26:19,240 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:189) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.loadFromDisk(RMDTrackBuilder.java:336) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.attemptToLockAndLoadIndexFromDisk(RMDTrackBuilder.java:320) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:279) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:225) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:148) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:990) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:772) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:285) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:185) ... 14 more Caused by: java.io.EOFException at htsjdk.tribble.util.LittleEndianInputStream.readFully(LittleEndianInputStream.java:138) at htsjdk.tribble.util.LittleEndianInputStream.readLong(LittleEndianInputStream.java:80) at htsjdk.tribble.index.linear.LinearIndex$ChrIndex.read(LinearIndex.java:271) at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:363) at htsjdk.tribble.index.linear.LinearIndex.(LinearIndex.java:101) ... 19 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: java.lang.reflect.InvocationTargetException
ERROR ------------------------------------------------------------------------------------------

Created 2014-11-05 23:06:28 | Updated | Tags: bug
Comments (2)

This used to work, but it's broken in the latest release at least. CombineVariants with the --assumeIdenticalSamples option complains if there are multiple vcfs with duplicate sample names.


Created 2014-10-10 21:40:48 | Updated | Tags: haplotypecaller bug
Comments (10)

I am attempting to use the HaplotypeCaller to call indels and SNPs in genomes from the Illumina Platinum Genomes pedigree (http://www.illumina.com/platinumgenomes/). I did these steps before running haplotype caller:

  • alignment with BWA mem
  • mark duplicates with Picard
  • Indel realignment

I am now trying to run haplotype caller (version 3.2-2-gec30cee) with the realigned BAM files and am getting the following error:

##### ERROR stack trace
org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Somehow the requested coordinate is not covered by the read. Alignment 10357 | 86M15H
at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:573)
at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinate(ReadUtils.java:429)
at org.broadinstitute.gatk.utils.sam.ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(ReadUtils.java:425)
at org.broadinstitute.gatk.tools.walkers.annotator.BaseQualityRankSumTest.getElementForRead(BaseQualityRankSumTest.java:76)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.getElementForRead(RankSumTest.java:200)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.fillQualsFromLikelihoodMap(RankSumTest.java:179)
at org.broadinstitute.gatk.tools.walkers.annotator.RankSumTest.annotate(RankSumTest.java:102)
at org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation.annotate(InfoFieldAnnotation.java:49)
at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContextForActiveRegion(VariantAnnotatorEngine.java:217)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:265)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:941)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:218)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Somehow the requested coordinate is not covered by the read. Alignment 10357 | 86M15H
##### ERROR ------------------------------------------------------------------------------------------

Notably, I don’t get this error when analyzing the same BAMs with UnifiedGenotyper. I have isolated a BAM file with ~1300 reads that can recreate the problem but couldn’t narrow it down further. The command I used is:

java -jar GenomeAnalysisTK.jar \
 -T HaplotypeCaller  \
 -R Homo_sapiens_assembly19_withchr.fasta \
 -I del.bam \
 -o del.hc.vcf \
 --output_mode EMIT_ALL_SITES

Created 2014-09-19 11:13:28 | Updated 2014-09-19 11:17:23 | Tags: variantrecalibrator bug
Comments (6)

As adviced by the own GATK program I post this potential bug in the forum:

This is the command I used to call the program from my Perl pipeline:

java -Xmx4g -Djava.io.tmpdir=/tmp -jar /opt/gatk/gatk3.2-3/GenomeAnalysisTK.jar 
-T VariantRecalibrator 
-R /storage/Genomes/GATK/hg19.fasta -input "$path"/"$name".recalibrated_snps_raw_indels.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 
/storage/Genomes/GATK/Mills_and_1000G_gold_standard.indels.hg19.vcf 
-an DP -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL 
-tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 
-recalFile "$path"/"$name".recalibrate_INDEL.recal 
-tranchesFile "$path"/"$name".recalibrate_INDEL.tranches 
-rscriptFile "$path"/"$name".recalibrate_INDEL_plots.R 

"INFO 13:04:24,773 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:04:24,776 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03 INFO 13:04:24,776 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 13:04:24,777 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 13:04:24,783 HelpFormatter - Program Args: -T VariantRecalibrator -R /storage/Genomes/GATK/hg19.fasta -input /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrated_snps_raw_indels.vcf -resource:mills,known=true,training=true,truth=true,prior=12.0 /storage/Genomes/GATK/Mills_and_1000G_gold_standard.indels.hg19.vcf -an DP -an FS -an MQRankSum -an ReadPosRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrate_INDEL.recal -tranchesFile /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrate_INDEL.tranches -rscriptFile /storage/Runs/CIC/ANALYSIS/IVI_01_ExomeReseq_Sept2014/FASTQ/trio/LR1_8233_PE.recalibrate_INDEL_plots.R INFO 13:04:24,786 HelpFormatter - Executing as jllavin@nextera on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_21-b11. INFO 13:04:24,787 HelpFormatter - Date/Time: 2014/09/19 13:04:24 INFO 13:04:24,787 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:04:24,787 HelpFormatter - -------------------------------------------------------------------------------- INFO 13:04:29,468 GenomeAnalysisEngine - Strictness is SILENT INFO 13:04:30,018 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 13:04:30,244 GenomeAnalysisEngine - Preparing for traversal INFO 13:04:30,271 GenomeAnalysisEngine - Done preparing for traversal INFO 13:04:30,271 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 13:04:30,272 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 13:04:30,272 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 13:04:30,279 TrainingSet - Found mills track: Known = true Training = true Truth = true Prior = Q12.0 INFO 13:05:00,823 ProgressMeter - chr5:94965875 1168956.0 30.0 s 26.0 s 31.1% 96.0 s 66.0 s INFO 13:05:24,266 VariantDataManager - DP: mean = 16.62 standard deviation = 10.59 INFO 13:05:24,268 VariantDataManager - FS: mean = 1.14 standard deviation = 2.63 INFO 13:05:24,270 VariantDataManager - MQRankSum: mean = -0.16 standard deviation = 1.11 INFO 13:05:24,273 VariantDataManager - ReadPosRankSum: mean = -0.04 standard deviation = 0.99 INFO 13:05:24,300 VariantDataManager - Annotations are now ordered by their information content: [DP, FS, MQRankSum, ReadPosRankSum] INFO 13:05:24,301 VariantDataManager - Training with 1829 variants after standard deviation thresholding. INFO 13:05:24,305 GaussianMixtureModel - Initializing model with 100 k-means iterations... INFO 13:05:24,475 VariantRecalibratorEngine - Finished iteration 0. INFO 13:05:24,535 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.20003 INFO 13:05:24,564 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 4.20358 INFO 13:05:24,582 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.00341 INFO 13:05:24,593 VariantRecalibratorEngine - Convergence after 18 iterations! INFO 13:05:24,611 VariantRecalibratorEngine - Evaluating full set of 3202 variants... INFO 13:05:24,612 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000. INFO 13:05:26,140 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: No data found. at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:392) at org.broadinstitute.gatk.tools.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:138) at org.broadinstitute.gatk.engine.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: No data found.
ERROR ------------------------------------------------------------------------------------------

"

I hope it can be easily fixed since I desperately need my pipeline working for my current analysis
;)


Created 2014-08-21 12:56:21 | Updated | Tags: bug
Comments (11)

Hi, I get an ArrayIndexOutOfBoundsException when using this tool (see attached log). Is there a possible workaround or a new version I could use?

Thanks, Johannes


Created 2014-06-18 18:42:35 | Updated 2014-06-18 18:43:39 | Tags: bug basecoveragedistribution
Comments (5)

In some cases BaseCoverageDistribution takes ages to finish. I ran 4 BaseCoverageDistribution commands in parallel, and the smallest sample (by far) still is not finished while the others are. Seems like a bug to me? I will attach a screenshot.


Created 2014-06-16 13:08:44 | Updated | Tags: bug genotypegvcfs bug-report-or-wrong-command
Comments (6)

Gidday,

Not a question, more of a bug report - I'm using the new v3.1 best practices pipeline, so I'd successfully produced my per-sample (n=23 in total) gVCFs with no worries.

Then I used GenotypeGVCFs to combine them as follows, including a dbSNP ROD (the Sanger Mouse Genome Project's SNP calls for 17 samples, in vcf format... not gvcf!):

java -Djava.io.tmpdir=/tmp -Xmx28g -jar ./tmp/GenomeAnalysisTK_3.1-1/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 8 -R ./mm10.fa --dbsnp ./tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.snps.rsIDdbSNPv137.vcf.ordinalsorted.vcf -V GenotypeGVCFs.run1.sample.list -o ./CombinedGenotyping.run1.vcf -A InbreedingCoeff -A FisherStrand -A QualByDepth -A ChromosomeCounts

So I was very surprised to see that my output CombinedGenotyping vcf has 40 samples in it, not 23 - and of course, 23 + 17 = 40. Checking the VCF headers itself confirms that the genotype calls from the 17 Sanger strains have been included in the output vcf, not just the rsIDs as intended(?). I'm guessing that this combining of .g.vcfs and extra ROD isn't the expected behaviour...!

Cheers!


Created 2014-06-03 16:25:34 | Updated | Tags: allelebalance allelebalancebysample haplotypecaller bug genotypegvcfs
Comments (10)

Using GATK 3.1-1, I seem to be unable to get the "AB" (AlleleBalance) annotation for the calls using the HaplotypeCaller -> GenotypeGVCFs pipeline, and I'm not sure how to get it. Our current pipeline (GATK 2.7-4, UnifiedGenotyper) requires this field to perform filtering, so this annotation is essential for us to upgrade to GATK 3.1.

My current pipeline of commands is as follows:

$ GenomeAnalysisTK-3.1-1 -T HaplotypeCaller -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -I -L targets.GRCh37.bed -stand_emit_conf 10 -mbq 20 --downsample_to_coverage 300 -ERC GVCF -pairHMM VECTOR_LOGLESS_CACHING -variant_index_type LINEAR -variant_index_parameter 128000 -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -A AlleleBalance -A QualByDepth -o sample_gvcf.vcf.gz

$ GenomeAnalysisTK-3.1-1 -T GenotypeGVCFs -R human_g1k_v37_decoy.fasta --dbsnp dbsnp_137.b37.vcf.gz -V -L targets.GRCh37.bed -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -o joint_vcf.vcf.gz

Above, note that if "-A AlleleBalance" is given to GenotypeGVCFs, GATK crashes with a NullPointerException (AlleleBalance.java, line 66).

The command above is heavily adapted from the current pipeline; do you know what I might be doing wrong with the new and improved HaplotypeCaller?

Thanks so much for your help, and if you need any further information, please let me know.


Created 2014-06-03 14:36:12 | Updated | Tags: haplotypecaller bug crash
Comments (1)

I recently ran the HaplotypeCaller using both the new --pairHMM options and -nct 2, and I encountered a crash in the program (8 of 96 runs crashed). I have rerun the analysis without the "-nct 2" option, and it appears to work as expected, so I'm not in a hurry. I am currently using GATK 3.1-1; below is the specific command I used as well as the stack trace that resulted.

Command: GenomeAnalysisTK -T HaplotypeCaller -R human_g1k_v37_decoy.fasta -I HG00593@123910712.bam --dbsnp dbsnp_137.b37.vcf.gz -L targets.GRCh37.bed -A QualByDepth -stand_emit_conf 10 -mbq 20 --downsample_to_coverage 300 -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A GCContent -A AlleleBalanceBySample -A AlleleBalance -o HG00593.vcf.gz -sample_rename_mapping_file rename_file -nct 2 -ERC GVCF -pairHMM VECTOR_LOGLESS_CACHING -variant_index_type LINEAR -variant_index_parameter 128000

Stack trace:

ERROR stack trace

java.util.ConcurrentModificationException at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(LinkedHashMap.java:394) at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:413) at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:412) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.addMiscellaneousAllele(GenotypingEngine.java:257) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoods(GenotypingEngine.java:227) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:880) at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:141) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708) at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722)

If you require anything else from me, please let me know.


Created 2014-06-02 22:01:51 | Updated | Tags: clipreads bug softclip fixmisencodedquals
Comments (1)

I was softclipping some 84-bp-long single end reads using the following command:

java -Xmx2g -jar /GATK_Path/GenomeAnalysisTK.jar -T ClipReads -I imput_file.bam -o output_file.bam -R $ALIGNREF --cyclesToTrim "1-4,80-84" --clipRepresentation SOFTCLIP_BASES --fix_misencoded_quality_scores

There was no error while running, but when I looked at the file afterwards, every read had a cigar string of 84S. I'm fairly sure this is a bug. I was aligned to the human mitochondrial genome.

I am using GATK v2.3-9-ge5ebf34 on Mac OS 10.75


Created 2014-04-30 17:38:38 | Updated | Tags: variantrecalibrator bug indels no-data-found
Comments (5)

This crops up when running with "-mode INDEL". Not sure why there is no data. (See attached log file with stack trace.)

All input files are non-empty (except the .R file). A similar execution using "-mode SNP" completes with no problems. Since I'm simply looking to get the scripting and flags correct, I've used a public data set. Could it be that I'm unlucky and chose something that has no indels from the reference, which is causing the error? Could there be a more graceful method of termination?


Created 2014-04-16 14:11:37 | Updated | Tags: unifiedgenotyper haplotypecaller bug gatk
Comments (12)

HI, I'd like to report a weird result from HaplotypeCaller.

We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B.

However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected)

Please see the VCF table:

There are three things I want to highlight: 1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other. 2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing. 3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype.

Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype.

Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter.

Many Thanks

Betty


Created 2014-04-15 20:38:58 | Updated 2014-04-15 20:46:35 | Tags: vcf bug hapllotypecaller
Comments (12)

Hi I've just been having a go with the new Haplotype Caller method and I've getting some odd or malformed lines in the VCF file for example:

For example the format line has declared we should have 4 fields for each Sample record but instead we have samples with 2 records.

Two examples are shown here:

See Block 1
GT:DP:GQ:PL 1/1:.:3:32,3,0 ./.:0

Or where the format block declares 5 fields and we get 3 instead:

GT:AD:DP:GQ:PL  0/0:1,1,0,0,0,0,0,0,0:2:3:0,3,24,3,24,24        ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,40,3,41,43        ./.:.:3

Full blocks, Block 1

chr1    3489714 .       G       A       9667.23 .       AC=154;AF=1.00;AN=154;DP=0;FS=0.000;InbreedingCoeff=-0.0391;MLEAC=154;MLEAF=1.00;MQ=0.00;MQ0=0;EFF=INTRAGENIC(MODIFIER|||||TIAM1||CODING|||1),INTRON(MODIFIER||||649|TIAM1|protein_coding|CODING|ENSBTAT00000064124|1|1);CSQ=A|ENSBTAG00000017839|ENSBTAT00000064124|Transcript|intron_variant||||||||1/5||1|TIAM1|HGNC|      GT:DP:GQ:PL     1/1:.:3:32,3,0  1/1:.:18:207,18,0     1/1:.:6:78,6,0  1/1:.:12:140,12,0       1/1:.:9:101,9,0 ./.:0   1/1:.:9:97,9,0  1/1:.:9:96,9,0  1/1:.:21:244,21,0       1/1:.:12:138,12,0       1/1:.:12:124,12,0       1/1:.:9:105,9,0 1/1:.:15:164,15,0     1/1:.:15:153,15,0       1/1:.:27:265,27,0       1/1:.:12:125,12,0       1/1:.:18:214,18,0       ./.:0   1/1:.:9:108,9,0 1/1:.:15:169,15,0       ./.:0   1/1:.:6:76,6,0  1/1:.:6:66,6,0  1/1:.:12:140,12,0     1/1:.:3:28,3,0  1/1:.:3:10,3,0  1/1:.:12:128,12,0       ./.:0   1/1:.:18:181,18,0       1/1:.:9:98,9,0  1/1:.:15:161,15,0       1/1:.:15:185,15,0       1/1:.:12:133,12,0       1/1:.:15:175,15,0     1/1:.:18:178,18,0       1/1:.:12:133,12,0       1/1:.:9:105,9,0 1/1:.:12:141,12,0       1/1:.:15:166,15,0       1/1:.:9:108,9,0 1/1:.:15:160,15,0       1/1:.:27:267,27,0       1/1:.:21:218,21,0    1/1:.:9:107,9,0  1/1:.:3:28,3,0  1/1:.:9:80,9,0  1/1:.:6:46,6,0  ./.:0   1/1:.:6:61,6,0  1/1:.:21:241,21,0       1/1:.:15:161,15,0       1/1:.:6:82,6,0  1/1:.:12:143,12,0       1/1:.:9:109,9,0 1/1:.:21:249,21,0     1/1:.:6:40,6,0  1/1:.:9:94,9,0  1/1:.:15:185,15,0       1/1:.:12:129,12,0       1/1:.:12:132,12,0       ./.:0   1/1:.:21:207,21,0       1/1:.:12:136,12,0       1/1:.:12:109,12,0       1/1:.:18:192,18,0     ./.:0   1/1:.:9:68,9,0  1/1:.:12:138,12,0       1/1:.:6:73,6,0  1/1:.:9:105,9,0 1/1:.:9:98,9,0  1/1:.:6:65,6,0  ./.:0   1/1:.:6:65,6,0  ./.:0   1/1:.:6:58,6,0  1/1:.:12:131,12,0       ./.:0   ./.:01/1:.:3:38,3,0   1/1:.:3:37,3,0  1/1:.:21:227,21,0       1/1:.:12:131,12,0       1/1:.:6:66,6,0  1/1:.:9:100,9,0 1/1:.:21:209,21,0       1/1:.:6:63,6,0  1/1:.:6:69,6,0

Block 2

chr1    55248   .       ACCC    A,CCCC  179.69  .       AC=13,6;AF=0.100,0.046;AN=130;BaseQRankSum=0.736;ClippingRankSum=0.736;DP=347;FS=0.000;InbreedingCoeff=0.2231;MLEAC=10,4;MLEAF=0.077,0.031;MQ=53.55;MQ0=0;MQRankSum=0.736;QD=4.99;ReadPosRankSum=0.736;EFF=INTERGENIC(MODIFIER||||||||||1),INTERGENIC(MODIFIER||||||||||2);CSQ=-||||intergenic_variant|||||||||||||     GT:AD:DP:GQ:PL  0/0:1,1,0,0,0,0,0,0,0:2:3:0,3,24,3,24,24        ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,40,3,41,43        0/0:1,0,0,0,0,0,0,0,0:1:2:0,2,44,3,45,46        0/0:.:5:0:0,0,103,0,103,103     0/0:3,0,1,0,0,0,0,0,0:4:9:0,9,81,9,81,81        0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2   ./.:.:1 0/0:.:3:0:0,0,41,0,41,41        0/0:1,0,0,0,0,0,0,0,0:1:9:0,9,73,9,73,73        0/1:1,0,0,1,0,0,0,0,0:2:22:28,0,73,28,22,46     0/0:1,0,0,0,0,0,0,0,0:1:4:0,4,25,4,25,25        0/0:2,0,0,0,0,0,0,0,0:2:21:0,21,273,21,273,273  0/1:5,0,0,1,0,0,0,0,0:6:11:11,0,235,26,158,175  ./.:0,0,0,0,1,0,0,0,0:1 0/0:.:4:0:0,0,46,0,46,46        ./.:.:3 ./.:.:2 0/1:1,0,1,1,0,0,0,0,0:3:21:28,0,44,31,21,52     0/0:.:4:0:0,0,77,0,77,77        0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2   0/0:.:2:6:0,6,51,6,51,51        0/0:.:6:2:0,2,151,2,151,151     0/0:2,0,1,0,0,0,0,0,0:3:7:0,7,74,7,74,74        ./.:.:0 0/0:2,0,0,0,0,0,0,0,0:2:5:0,7,59,5,37,35        1/2:0,0,0,1,0,0,0,0,0:1:1:27,1,40,26,0,25       0/0:2,0,0,0,0,0,0,0,0:2:9:0,9,83,9,83,83        ./.:.:0 2/2:0,0,0,0,0,3,0,0,0:3:9:59,59,59,9,9,0        ./.:.:16        0/0:2,0,0,0,0,0,0,0,0:2:7:0,7,59,7,59,59        0/0:2,0,0,0,0,0,0,0,0:2:9:0,9,83,9,83,83        0/0:2,0,0,0,0,0,0,0,0:2:11:0,11,68,11,68,68     0/2:6,0,0,0,0,2,0,0,0:8:18:18,39,384,0,346,340  ./.:.:1 0/1:3,0,0,1,0,0,0,0,0:4:25:25,0,96,34,100,134   0/0:2,0,0,0,0,0,0,0,0:2:12:0,12,105,12,105,105  0/1:1,0,0,1,0,0,0,0,0:2:17:17,0,94,21,26,44     0/0:.:2:6:0,6,64,6,64,64        0/0:1,0,0,0,0,0,0,0,0:1:2:0,2,24,3,25,27        0/0:.:2:6:0,6,48,6,48,48        0/0:.:2:6:0,6,63,6,63,63        0/0:0,0,1,0,0,0,0,0,0:1:0:0,0,1,0,1,1   ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:1:0,1,6,3,8,9   0/0:1,0,0,0,0,0,0,0,0:1:15:0,15,124,15,124,124  ./.:.:0 0/0:1,0,0,0,0,0,0,0,0:1:4:0,4,19,4,19,19        0/1:2,0,0,1,0,0,0,0,0:3:28:28,0,66,34,70,104    0/0:.:4:0:0,0,18,0,18,18        0/2:2,0,0,0,0,4,0,0,0:6:57:68,74,143,0,69,57    ./.:.:0 0/0:4,0,0,0,0,0,0,0,1:5:15:0,15,109,15,109,109  0/0:.:10:0:0,0,182,0,182,182    0/0:.:1:3:0,3,31,3,31,31        0/0:0,0,1,0,0,0,0,0,0:1:1:0,1,2,1,2,2   0/0:2,0,0,0,0,0,0,0,0:2:5:0,5,76,6,78,79        0/0:1,1,0,0,0,0,0,0,0:2:4:0,4,28,4,28,28        ./.:.:1 1/1:0,0,0,2,0,0,0,0,0:2:6:67,6,0,67,6,67        ./.:.:1 0/0:.:2:0:0,0,14,0,14,14        ./.:.:0 ./.:.:1 0/0:.:5:1:0,1,120,1,120,120     0/0:.:4:0:0,0,8,0,8,8   0/0:.:6:0:0,0,94,0,94,94        0/1:1,0,0,2,0,0,0,0,0:3:1:27,0,1,29,6,35        0/0:.:2:0:0,0,3,0,3,3   ./.:.:6 ./.:.:0 ./.:.:0 0/2:3,0,0,0,0,2,0,0,0:5:27:46,27,89,0,62,81     ./.:.:0 0/0:.:1:0:0,0,6,0,6,6   0/0:.:9:0:0,0,86,0,86,86        ./.:.:1 ./.:.:0 1/1:.:.:0:1,1,0,1,0,0   0/0:0,0,0,0,0,0,0,0,0:0:3:0,3,26,3,26,26        0/1:4,0,0,2,0,0,0,0,0:6:49:49,0,93,61,100,160   0/0:.:3:0:0,0,14,0,14,14        0/0:.:8:0:0,0,60,0,60,60        0/0:2,1,0,0,0,0,0,0,0:3:6:0,6,37,6,37,37        ./.:.:2 0/0:.:9:0:0,0,81,0,81,81        0/0:0,0,0,0,0,0,0,0,0:0:1:0,1,2,1,2,2

Any idea what the issue is?


Created 2014-04-15 18:07:28 | Updated | Tags: haplotypecaller bug smithwaterman
Comments (5)

Hi, I wanted to place what I believe is a bug report.

Bug: If the haplotype caller assembles reads into identical haplotypes, then depending on arbitrary factors, it will emit different variant events for this haplotype, meaning that gVCFs from samples with the same genotype will appear to have different genotypes when their VCFs are viewed.

Longer Explanation : Consider the alignment shown below or a reference genome (top) and a reconstructed haplotype aligned to it (aligned as in either the first or second case):

GACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCC
GACCCCGA--GGGCC---------------GGGCCCTCC
GACCCCGA---------GGGCC--------GGGCCCTCC

This alignment is tricky because the "GGGCC" can be positioned in one of two locations, as shown above, and both of these are equivalently optimal with an affine scoring matrix. Which alignment should be returned is thus questionable, but presumably the GATK should always give the same alignment.

However, the GATK does not, and because of this one sample will have a 2D5M15D mutation, while the other can have a 9D5M8D mutation listed, even if both have the exact same genotype. In particular, if the active region or length of the assembled regions change, which can easily happen, then even if every other base is matched, different variants will appear at this locus, based on how the SW alignment is done.

I have tried to track this bug down, and it seems due to how the SW aligner class selects between equally optimal alignments (I believe while constructing the backtrack matrix but have not verified).

Reproducible Example: The higher level problem is that VCFs emitted from identical samples but with slightly different reads/intervals show different variants at this site despite having the same underlying genotypes and assembled haplotypes. Tracking showed this was due to the length of assembled regions varying a bit. A simple example which simply shows the basic flaw in the relevant code is given below.

Does this seem like the type of thing that could be an easy fix? I can also put this code as a branch on gsa_unstable if it would make things easier. I am also assuming that the CombineGVCF step will not fix this error, but have not validated the behavior and for my purposes it would still be problematic.

Warm wishes, N

//Two examples of an assmebled haplotype and reference genome,
//exactly the same save the window size
byte[] fullRef="GCGTCGCAGTCTTAAGGCCCCGCCTTTTCAGACAGCTTCCGCTGGGCCTGGGCCGCTGCGGGGCGGTCACGGCCCCTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGACCGGGCCGAGCCGGGGGAAGGGCTCCGGTGACT".getBytes();
    byte[] fullHap = "GCGTCGCAGTCTTAAGGCCCCGCCTTTTCAGACAGCTTCCGCTGGGCCTGGGCCGCTGCGGGGCGGTCACGGCCCCTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGA--GGGCC---------------GGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGACCGGGCCGAGCCGGGGGAAGGGCTCCGGTGACT".replace("-","").getBytes();
    byte[] longRef =                                                                            "CTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGA".getBytes();
    byte[] longHap =                                                                           "CTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGA---------GGGCC--------GGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGA".replace("-","").getBytes();
    //a simplified version of the getCigar routine in the haplotype caller
    final String SW_PAD = "NNNNNNNNNN";
    final String paddedRef = SW_PAD + new String(fullRef) + SW_PAD;
    final String paddedPath = SW_PAD + new String(fullHap) + SW_PAD;
    final SmithWaterman alignment = new SWPairwiseAlignment( paddedRef.getBytes(), paddedPath.getBytes(), CigarUtils.NEW_SW_PARAMETERS );
    final SmithWaterman alignment2 = new SWPairwiseAlignment( fullRef, fullHap, CigarUtils.NEW_SW_PARAMETERS );
    //These cigars are not the same, but should be
    Cigar rawPadded = alignment.getCigar();
    Cigar notPadded= alignment2.getCigar();

Created 2014-04-07 14:25:26 | Updated | Tags: bug java7 fastaalternatereferencemaker negativearraysizeexception
Comments (1)

I was working with FastaAlternateReferenceMaker today. I wanted to use it on two samples. The first sample worked fine. The second sample came up with an error about a few REF values not conforming (e.g. :A) in the vcf. I wrote a script to correct the errors and remove the : from the vcf file. I tried to run it again, but received memory allocation errors. So, I dealt with that and am now allocating 6g for a 15G VCF and a 188M fasta reference. I got it running (I think) and then started getting the NegativeArraySizeException java error. I saw this had come up as a bug in earlier versions of GATK, but it seemed like it had been resolved...perhaps not? Anyhow, this 'quick little run' has taken up the better part of my day and I would be super greatful for some assistance troubleshooting this problem.

Command: $ java -Xmx6g -jar ../../shays/GATK/GenomeAnalysisTK.jar -l ERROR -R ../up/chrom6.2.fa -T FastaAlternateReferenceMaker -o D-chr6.cons.fa --variant D-c6.vcf

Trace:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NegativeArraySizeException at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:101) at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:120) at org.broadinstitute.variant.vcf.VCFCodec.readHeader(VCFCodec.java:85) at org.broad.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:76) at org.broad.tribble.index.IndexFactory$FeatureIterator.readHeader(IndexFactory.java:366) at org.broad.tribble.index.IndexFactory$FeatureIterator.(IndexFactory.java:354) at org.broad.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:281) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:388) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:274) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:211) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:140) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:873) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:725) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:259) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.5-2-gf57256b):
ERROR
ERROR Please check the documentation guide to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Created 2014-04-04 15:34:10 | Updated | Tags: bug genotypegvcfs combinegvcfs
Comments (10)

I've run into the following bug while running GenotypeGVCFs:

##### ERROR MESSAGE: cannot merge genotypes from samples without PLs; sample <ID redacted> does not have likelihoods at position 1:1115551

The input file in question is a gVCF produced by merging a large number of smaller gVCFs using CombineGVCFs (all tasks were run using version 3.1). What's happening is that the position 1115551 doesn't exist in that particular sample:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  <Sample_ID>
1       1115550 .       AC      A,<NON_REF>     118.73  .       BaseQRankSum=-0.377;DP=15;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.72;MQ0=0;MQRankSum=-1.093;ReadPosRankSum=-0.811      GT:AD:DP:GQ:PL:SB       0/1:7,6,0:13:99:156,0,188,177,207,384:4,3,3,3
1       1115552 .       C       <NON_REF>       .       .       END=1115552     GT:DP:GQ:MIN_DP:PL      0/0:15:0:15:0,0,31

But when the sample is combined with other samples, that position gets filled in with a simple "0/0", without any PLs (or any of the other fields, including AD, DP, GQ, etc.), which causes the GenotypeGVCFs to choke.

I can imagine there might be other scenarios that will result in a "0/0" genotype field, so perhaps the easiest way to fix this would be to make sure that any "0/0" actually gets output as "./.:.:.:.:.".

Thanks,

Grace


Created 2014-01-31 17:16:23 | Updated | Tags: bug
Comments (1)

I tried specifying intervals using a vcf file to HaplotypeCaller and I'm not getting the appropriate hits, because it appears that GATK is interpreting the vcf file as 0-based.


Created 2014-01-15 12:46:24 | Updated 2014-01-15 12:47:55 | Tags: bug
Comments (1)

When GATK finds a read for which a corresponding @RG tag is missing in the header, the error message given implies that the read itself is lacking an RG tag rather than the header. Could this be fixed please so that the two error conditions are differentiated? It will save people time when debugging their pipelines if they don't have to go looking at the wrong thing.

ERROR MESSAGE: SAM/BAM file SAMFileReader{/lustre/blah/DDD_MAIN5247030.bam} is malformed: Read HS7_7515:4:2101:12189:66438#2 is missing the read group (RG) tag, which is required by the GATK. Please use to fix this problem

The reads have the RG tag but an @RG tag matching their ID does not exist in the header.

901282:HS7_7515:4:2101:12189:66438#2 99 1 37000590 60 75M = 37000629 114 * * X0:i:1 X1:i:0 BC:Z:CGATGTAT BD:Z:* MD:Z:75 PG:Z:MarkDuplicates RG:Z:1#2 BI:Z:* AM:i:37 NM:i:0 SM:i:37 MQ:i:60 QT:Z:BCAADFFE XT:A:U BQ:Z:* 901283:HS7_7515:4:2101:12189:66438#2 147 1 37000629 60 75M = 37000590 -114 * * X0:i:1 X1:i:0 BD:Z:* MD:Z:75 PG:Z:MarkDuplicates RG:Z:1#2 BI:Z:* AM:i:37 NM:i:0 SM:i:37 MQ:i:60 XT:A:U BQ:Z:*

P.S. your spam filter is stopping me posting discussions with URLs in, could you whitelist any gatkforums dot broad institute dot org urls?


Created 2013-10-24 15:22:35 | Updated | Tags: vcfcodec tribble bug strelka breakend
Comments (7)

I'm running into a problem with vcfs that have single ended break ends. (These are produced by an old version of Strelka .) Tribble doesn't recognize "." as valid in alternative alleles.

Single break ends are valid in the vcf standard and the files validate according to Vcftools.

Others have run into this problem as well: https://groups.google.com/forum/#!searchin/strelka-discuss/gatk/strelka-discuss/gJfsyjZNZXA/ExDXZiVWW_kJ

example error

##### ERROR stack trace
org.broad.tribble.TribbleException: The provided VCF file is malformed at approximately line number 1807: Unparsable vcf record with allele .CCCAGGAGGACTCACTGCCGCTGTCACCTCTGCTGCCACCACTGTTGCCAC, for input source: /cga/tcga-gsc/benchmark/Indels/strelkaPON/NA18606.mapped.ILLUMINA.bwa.CHB.exome.20111114.bam-NA18608.mapped.ILLUMINA.bwa.CHB.exome.20111114.bam/final.indels.vcf
at org.broadinstitute.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:715)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:527)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseSingleAltAllele(AbstractVCFCodec.java:553)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:494)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:291)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:234)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:213)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:45)
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:73)
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:35)
at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:284)
at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:264)
at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:225)
at org.broadinstitute.sting.tools.CatVariants.execute(CatVariants.java:239)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.tools.CatVariants.main(CatVariants.java:258)
##### ERROR ------------------------------------------------------------------------------------------

Example vcf line

19  36002413    .   C   .CCCAGGAGGACTCACTGCCGCTGTCACCTCTGCTGCCACCACTGTTGCCAC    .   PASS    IHP=1;NT=ref;QSI=82;QSI_NT=82;SGT=ref->hom;SOMATIC;SVTYPE=BND;TQSI=1;TQSI_NT=1  DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50   49:49:42,44:0,0:7,6:43.72:0.85:0.00 11:11:0,0:6,6:5,5:14.61:0.48:0.0

A full vcf is available at: /humgen/gsa-scr1/pub/incoming/BreakendBug/breakend.vcf


Created 2013-09-13 15:46:18 | Updated | Tags: printreads bam bug
Comments (3)

Hi,

I was using PrintReads as part of Base Quality Recalibration. The resulting BAM header is incorrect as there are two @PG entries for GATK PrintReads. The bam file I am using is internal to the broad institute and is picard aggregated. There is an entry for GATK PrintReads in the original bam. Would it be possible for the @PG header lines to include a .A-Z like the other headers?

Thanks Aaron

ERROR A USER ERROR has occurred (version 2.4-9-g532efad):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: SAM/BAM file /btl/projects/aaron/Mouse/G42214/TET-OT-827ctl.recalibrated.bam is malformed: Program record with group id GATK PrintReads already exists in SAMFileHeader!

Created 2013-07-02 06:38:26 | Updated | Tags: gatk2 vcf pipeline bug
Comments (9)

Hello all,

We've just started using GATK in order to perform variant calling in a non-model teleost fish. The fish genome is highly repetitive (>65%), and also suffers from the whole genome duplication event common in teleosts (e.g. zebrafish). Additionally, the fish strain we use is highly inbred, which should result in a highly homogenous genome. We have generated a genome assembly and a de novo repeat library based on NGS data (manuscript submitted) before mapping the reads from four individuals (male and female) to the genome via bowtie2. Variants were called using UnifiedGenotyper.

We generally get a very good list of variants, but it seems that we're getting a number of false positives and negatives when calling variants. Some of these appear to be due to paralogues, but some seem to be errors in the actual genotype call. For example:

scaffold00001 1199020 . T G 44.35 . AC=1;AF=0.167;AN=6;BaseQRankSum=-7.420;DP=110;Dels=0.00;FS=152.859;HaplotypeScore=3.6965;MLEAC=1;MLEAF=0.167;MQ=42.00;MQ0=0;MQRankSum=-1.972;QD=1.53;ReadPosRankSum=-2.777;SB=-4.096e+00 GT:AD:DP:GQ:PL 0/1:20,9:29:79:79,0,588 0/0:16,7:23:12:0,12,447 0/0:39,18:57:65:0,65,1426 ./.

In this case, individual 3 has a homozygous reference genotype, despite having a 31% minor allele frequency. Individual 1 also has a 31% minor allele frequency, but is called heterozygous.Some of the bases used to call the G allele are of low quality (when looking more closely using IGV), but I would still expect the genotype to be heterozygous.

A reverse example:

scaffold00458 298207 . A G 64.81 . AC=2;AF=0.333;AN=6;BaseQRankSum=3.027;DP=64;Dels=0.00;FS=5.080;HaplotypeScore=0.0000;MLEAC=2;MLEAF=0.333;MQ=16.26;MQ0=0;MQRankSum=3.177;QD=1.16;ReadPosRankSum=-3.252;SB=0.439 GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,207 0/1:20,1:21:13:13,0,152 0/1:31,4:35:90:90,0,102 ./.

Here, individual 2 is called heterozygous, but there is only a single read which supports the minor allele. Additionally, when looking at IGV, you can see that the read in question has a number of mismatches, suggesting it originates from another area of the genome.

I've also uploaded screenshots of IGV if that I hope will help clarify the problems we're having. We have used default parameters of GATK in almost all cases, and we did not used VQSR, as we did not have a list of high confidence SNPs at the time.


Created 2013-04-24 08:42:51 | Updated | Tags: bug
Comments (1)

Dear GATK team,

I think I found a bug as shown in the attached screenshot. The highlighted line of the vcf file after using UnifiedGenotyper, VariantRecalibrator, ApplyRecalibration, ReadBackedPhasing, SelectVariants, CombineVariants, and SnpEff (following Best Practices) shows X 307897 . C CGTGAAGG 9494.14 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=8.010;DP=18;EFF=DOWNSTREAM(MODIFIER||||PPP2R3B|protein_coding|CODING|ENST00000381625|),INTRON(MODIFIER||||PPP2R3B|protein_coding|CODING|ENST00000390665|4),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000445792|5),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000475859|2),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000477110|1),INTRON(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000496630|3),UPSTREAM(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000468169|),UPSTREAM(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000477636|),UPSTREAM(MODIFIER||||PPP2R3B|retained_intron|CODING|ENST00000484364|);FS=0.000;InbreedingCoeff=0.3019;MQ=45.55;MQ0=0;MQRankSum=-8.092;QD=9.55;ReadPosRankSum=2.020;VQSLOD=2.27;culprit=ReadPosRankSum;set=29_BOTH GT:AD:DP:GQ:PL ./. 0/1:0,14:9:99:367,42,0

To be sure, I checked the two samples' bams at the PrintReads stage but do not see such a long indel. Is this a potentially similar bug to http://gatkforums.broadinstitute.org/discussion/2516/rare-homozygous-indel-bug-in-gatk-2-4-9 ?

Thank you for your attention.

Jamie


Created 2013-03-14 09:24:14 | Updated | Tags: printreads bug arrayindexoutofboundsexception
Comments (9)

Hi,

I'm trying to apply my recalibrated quality values with the following command:

java -jar GenomeAnalysisTK.jar -T PrintReads -R genome_noHap.fa -I realigned.bam -BQSR realigned.recal_data.grp -o realigned.recal.bam -nct 25

And get the following error. I couldn't find it elsewhere. Would be happy if I could get some advice.

Let me know if you need more info.

Best,

Karl

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 0 at org.broadinstitute.sting.utils.recalibration.BaseRecalibration.recalibrateRead(BaseRecalibration.java:151) at org.broadinstitute.sting.utils.recalibration.BQSRReadTransformer.apply(BQSRReadTransformer.java:93) at org.broadinstitute.sting.gatk.walkers.readutils.PrintReads.map(PrintReads.java:251) at org.broadinstitute.sting.gatk.walkers.readutils.PrintReads.map(PrintReads.java:98) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:230) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:218) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.4-3-g2a7af43):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: 0
ERROR ------------------------------------------------------------------------------------------

Created 2013-02-01 10:44:13 | Updated 2013-02-04 14:38:40 | Tags: baserecalibrator bug arrayindexoutofboundsexception
Comments (4)

Hi,

I did search on the site and seems several have had a similar problem but I couldn't fix mine based on those. I've managed to get to this point succesfully with creating a realigned bam file with IndelRealigner. I'm using a non-model organism with self created masking list from a VCF run done before the realignment (list in bed format). I've had to use the -fixMisencodedQuals so far through out my runs. I checked the masking file so that the ranges do not go over the ranges that are specified in my list of regions to cover (-L option).

Is there anything else I would need to check still?

Command line view:

sulyba@hippu4:/fs/lustre/wrk/sulyba/stickleback_capture> java -jar ./GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T BaseRecalibrator -R gasAcu_combinedbac_inv7.fa -I realigned_FF1_inv7c.bam -knownSites Sites_to_Mask.bed -L Capture_Target_Regions.intervals -o recalc_FF1_inv7c.grp -plots recal_FF1_plots.grp.pdf -fixMisencodedQuals

INFO  12:34:10,502 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:34:10,510 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34, Compiled 2013/01/11 22:43:14
INFO  12:34:10,510 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  12:34:10,510 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  12:34:10,516 HelpFormatter - Program Args: -T BaseRecalibrator -R gasAcu_combinedbac_inv7.fa -I realigned_FF1_inv7c.bam -knownSites Sites_to_Mask.bed -L Capture_Target_Regions.intervals -o recalc_FF1_inv7c.grp -plots recal_FF1_plots.grp.pdf -fixMisencodedQuals
INFO  12:34:10,516 HelpFormatter - Date/Time: 2013/02/01 12:34:10
INFO  12:34:10,517 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:34:10,517 HelpFormatter - --------------------------------------------------------------------------------
INFO  12:34:10,551 ArgumentTypeDescriptor - Dynamically determined type of Sites_to_Mask.bed to be BED
INFO  12:34:10,563 GenomeAnalysisEngine - Strictness is SILENT
INFO  12:34:11,309 GenomeAnalysisEngine - Downsampling Settings: No downsampling
INFO  12:34:11,319 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  12:34:11,392 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07
INFO  12:34:11,416 RMDTrackBuilder - Loading Tribble index from disk for file Sites_to_Mask.bed
INFO  12:34:11,660 GenomeAnalysisEngine - Processing 20350670 bp from intervals
INFO  12:34:11,672 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  12:34:11,674 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining
INFO  12:34:11,946 BaseRecalibrator - The covariates being used here:
INFO  12:34:11,947 BaseRecalibrator -   ReadGroupCovariate
INFO  12:34:11,947 BaseRecalibrator -   QualityScoreCovariate
INFO  12:34:11,947 BaseRecalibrator -   ContextCovariate
INFO  12:34:11,947 ContextCovariate -           Context sizes: base substitution model 2, indel substitution model 3
INFO  12:34:11,947 BaseRecalibrator -   CycleCovariate
INFO  12:34:11,952 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 3]
INFO  12:34:11,952 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,952 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 3]
INFO  12:34:11,953 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1012, 3]
INFO  12:34:11,953 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:11,953 NestedIntegerArray - Creating NestedIntegerArray with dimensions [1, 94, 1002, 3]
INFO  12:34:11,954 NestedIntegerArray - Pre-allocating first 2 dimensions
INFO  12:34:11,954 NestedIntegerArray - Done pre-allocating first 2 dimensions
INFO  12:34:12,109 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO  12:34:12,120 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO  12:34:12,644 ReadShardBalancer$1 - Loading BAM index data for next contig
INFO  12:34:12,645 ReadShardBalancer$1 - Done loading BAM index data for next contig
INFO  12:34:14,901 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: -2
        at org.broadinstitute.sting.utils.baq.BAQ.calcEpsilon(BAQ.java:158)
        at org.broadinstitute.sting.utils.baq.BAQ.hmm_glocal(BAQ.java:246)
        at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:542)
        at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:595)
        at org.broadinstitute.sting.utils.baq.BAQ.calcBAQFromHMM(BAQ.java:530)
        at org.broadinstitute.sting.utils.baq.BAQ.baqRead(BAQ.java:663)
        at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateBAQArray(BaseRecalibrator.java:428)
        at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:243)
        at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:248)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:219)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:91)
        at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:55)
        at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: -2
##### ERROR ------------------------------------------------------------------------------------------

Created 2013-01-29 21:22:05 | Updated | Tags: baserecalibrator bug
Comments (11)

Dear GATK Team,

I am running a pipeline on several high coverage human individuals that have been mapped using bwa and processed using samtools, picard and gatk. The bam-files pass ValidateSam from picard, but when I run the bqsr step some of them fails giving a Malformed read error (using -filterMBQ does not help in this case). I tracked down the error to bamfiles that ends with a paired end read where the mate maps in the beginning of the contig (in my case human mtDNA).

Eg, this will make it crash:

readX 177 MT 16558 37 7S2M2I10M80S = 294 -16176 GACCTGTGATCC...
readY 177 MT 16558 37 7S2M2I10M80S = 238 -16232 GACCTGTGATCC...
readZ 113 MT 16558 37 7S2M2I10M80S = 273 -16197 GACCTGTGATCC...
[END]

where a file ending like this wont crash:

readX 83 MT 16469 60 101M = 16246 -324 TGGGGGTAGCTAAAGTGAAC...
readY 147 MT 16469 60 101M = 16267 -303 TGGGGGTAGCTAAAGTGA...
readZ 147 MT 16469 60 101M = 16193 -377 TGGGGGTAGCTAAAGTGAAC...
[END]

I am running GATK v2.3-9-ge5ebf34, but the same error occurs using GATK v-2.2-3 (my previous version). I can genotype the files using UnifiedGenotyper without any problem as well.

This is the error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Array length mismatch detected. Malformed read? at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.calculateFractionalErrorArray(BaseRecalibrator.java:380) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:246) at org.broadinstitute.sting.gatk.walkers.bqsr.BaseRecalibrator.map(BaseRecalibrator.java:112) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:203) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:191) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler$MapReduceJob.run(NanoScheduler.java:468) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.3-9-ge5ebf34):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Array length mismatch detected. Malformed read?
ERROR ------------------------------------------------------------------------------------------

Cheers,

Simon