Created 2014-07-30 20:26:12 | Updated | Tags: release bug version-highlights topstory reference-model gatk-3-2

Better late than never (right?), here are the version highlights for GATK 3.2. Overall, this release is essentially a collection of bug fixes and incremental improvements that we wanted to push out to not keep folks waiting while we're working on the next big features. Most of the bug fixes are related to the HaplotypeCaller and its "reference confidence model" mode (which you may know as -ERC GVCF). But there are also a few noteworthy improvements/changes in other tools which I'll go over below.

### Working out the kinks in the "reference confidence model" workflow

The "reference confidence model" workflow, which I hope you have heard of by now, is that awesome new workflow we released in March 2014, which was the core feature of the GATK 3.0 version. It solves the N+1 problem and allows you to perform joint variant analysis on ridiculously large cohorts without having to enslave the entire human race and turn people into batteries to power a planet-sized computing cluster. More on that later (omg we're writing a paper on it, finally!).

You can read the full list of improvements we've made to the tools involved in the workflow (mainly HaplotypeCaller and GenotypeGVCFs) in Eric's (unusually detailed) Release Notes for this version. The ones you are most likely to care about are that the "missing PLs" bug is fixed, GenotypeGVCFs now accepts arguments that allow it to emulate the HC's genotyping capabilities more closely (such as --includeNonVariantSites), the AB annotation is fully functional, reference DPs are no longer dropped, and CatVariants now accepts lists of VCFs as input. OK, so that last one is not really specific to the reference model pipeline, but that's where it really comes in handy (imagine generating a command line with thousands of VCF filenames -- it's not pretty).
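
For what it's worth, building such a list file takes only a few lines in any scripting language. Here is a minimal Python sketch, assuming a list file is simply one VCF path per line; the helper name `write_vcf_list`, the directory layout, and the `*.vcf` pattern are made up for illustration and are not part of GATK.

```python
from pathlib import Path


def write_vcf_list(vcf_dir, list_path, pattern="*.vcf"):
    """Write every file in vcf_dir matching pattern to list_path, one path per line.

    Paths are sorted by name only to make the output reproducible; this is an
    illustrative choice, not a requirement taken from the GATK documentation.
    Returns the number of paths written.
    """
    vcfs = sorted(Path(vcf_dir).glob(pattern))
    Path(list_path).write_text("".join(f"{vcf}\n" for vcf in vcfs))
    return len(vcfs)
```

The resulting file can then be passed to the tool in place of the individual filenames; check the tool documentation for any ordering requirements among the inputs.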

### HaplotypeCaller now emits post-realignment coverage metrics

The coverage metrics (DP and AD) reported by HaplotypeCaller are now those calculated after the HC's reassembly step, based on the reads having been realigned to the most likely haplotypes. So the metrics you see in the variant record should match what you see if you use the -bamout option and visualize the reassembled ActiveRegion in a genome browser such as IGV. Note that if any of this is not making sense to you, say so in the comments and we'll point you to the new HaplotypeCaller documentation! Or, you know, look for it in the Guide.

### R you up to date on your libraries?

We updated the plotting scripts used by BQSR and VQSR to use the latest version of ggplot2, to get rid of some deprecated function issues. If your Rscripts are suddenly failing, you'll need to update your R libraries.

### A sincere apology to GATK-based tool developers

We're sorry for making you jump through all these hoops recently. As if the switch to Maven wasn't enough, we have now completed a massive reorganization/renaming of the codebase that will probably cause you some headaches when you port your tools to the newest version. But we promise this is the last big wave, and ultimately this will make your life easier once we get the GATK core framework to be a proper Maven artifact.

In a nutshell, the base name of the codebase has changed from sting to gatk (which hopefully makes more sense), and the most common effect is that sting.gatk classpath segments are now gatk.tools. This, by the way, is why we had a bunch of broken documentation links; most of these have been fixed (yay symlinks) but there may be a few broken URLs remaining. If you see something, say something, and we'll fix it.
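
If you have a lot of source files to port, a first pass can be scripted. The following is a rough Python sketch that applies only the two renames described above, in that order; it assumes the packages sit under the `org.broadinstitute` prefix, and the function name and the `ExampleWalker` class in the usage lines are hypothetical. Some classes did not follow the common pattern (stack traces from 3.2 show packages under `gatk.engine`, for instance), so expect to fix a few imports by hand after the compiler complains.

```python
import re

# Order matters: the more specific "sting.gatk" rule must run before the
# general "sting" rule, otherwise it would never get a chance to match.
_RENAME_RULES = [
    (re.compile(r"\borg\.broadinstitute\.sting\.gatk\b"), "org.broadinstitute.gatk.tools"),
    (re.compile(r"\borg\.broadinstitute\.sting\b"), "org.broadinstitute.gatk"),
]


def port_package_names(java_source):
    """Apply the two package renames to a string of Java source code.

    This is a first-pass text substitution only; it does not know about
    classes that moved to other packages during the reorganization.
    """
    for pattern, replacement in _RENAME_RULES:
        java_source = pattern.sub(replacement, java_source)
    return java_source


if __name__ == "__main__":
    # ExampleWalker is a made-up class name used purely for illustration.
    print(port_package_names("import org.broadinstitute.sting.gatk.walkers.ExampleWalker;"))
```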

Created 2014-07-15 03:54:06 | Updated 2014-10-23 17:58:36 | Tags: variantrecalibrator haplotypecaller selectvariants variantannotator release-notes catvariants genotypegvcfs gatk-3-2

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.

## Haplotype Caller

• Various improvements were made to the assembly engine and likelihood calculation, which leads to more accurate genotype likelihoods (and hence better genotypes).
• Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
• The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
• Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
• Fixed small bug where indel discovery was inconsistent based on the active-region size.
• Removed scary warning messages for "VectorPairHMM".
• Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
• When we subset PLs because alleles are removed during genotyping we now also subset the AD.
• Fixed bug where reference sample depth was dropped in the DP annotation.
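
To make the PL/AD subsetting item above concrete, here is a minimal Python sketch for diploid genotypes. It relies on the standard VCF ordering of genotype likelihoods, where genotype j/k (with j <= k) sits at index k(k+1)/2 + j. The function names are illustrative, and shifting the subset PLs so the smallest is 0 is an assumption about the convention, not a description of HaplotypeCaller's internal code.

```python
def pl_index(j, k):
    """Index of diploid genotype j/k in a VCF-ordered PL array."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def subset_pls_and_ad(pls, ad, keep):
    """Restrict PLs and AD to the allele indices in keep (reference is 0).

    keep must be in ascending order. PLs are re-based so the best remaining
    genotype has PL 0 (assumed convention).
    """
    new_pls = []
    for b in range(len(keep)):
        for a in range(b + 1):
            new_pls.append(pls[pl_index(keep[a], keep[b])])
    floor = min(new_pls)
    new_pls = [pl - floor for pl in new_pls]
    new_ad = [ad[i] for i in keep]  # AD has one entry per allele
    return new_pls, new_ad
```

For example, with PLs `[50, 0, 40, 60, 30, 90]` and AD `[10, 8, 1]` for alleles ref, alt1, alt2, keeping `[0, 1]` gives PLs `[50, 0, 40]` and AD `[10, 8]`: both fields now describe the same two alleles.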

## Variant Recalibrator

• The -mode argument is now required.
• The plotting script now uses theme() instead of the deprecated opts() function, so it works with recent versions of the ggplot2 R library.

## AnalyzeCovariates

• The plotting script now uses theme() instead of the deprecated opts() function, so it works with recent versions of the ggplot2 R library.

## Variant Annotator

• SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

## Genotype GVCFs

• Added missing arguments so that now it models more closely what's available in the Haplotype Caller.
• Fixed recurring error about missing PLs.
• No longer pulls the headers from all input rods including dbSNP, rather just from the input variants.
• --includeNonVariantSites should now be working.

## Select Variants

• The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

## Indel Realigner

• Now throws a user error when it encounters reads with I operators greater than the number of read bases.
• Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.

## CalculateGenotypePosteriors

• Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
• Now computes genotype posteriors using likelihoods from all members of the trio.
• Added annotations for calling potential de novo mutations.
• Now uses PP tag instead of GP tag because posteriors are Phred-scaled.
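
The Phred-scaled posteriors mentioned above are easier to picture with numbers. Here is a minimal Python sketch, assuming each genotype's posterior is proportional to its likelihood times its prior, and that the result is re-based so the most probable genotype gets 0, as is done for PLs. The function names and the example priors are made up; this is not the tool's actual code.

```python
import math


def phred(probability):
    """Convert a probability to the Phred scale: -10 * log10(p)."""
    return -10.0 * math.log10(probability)


def phred_posteriors(pls, priors):
    """Combine Phred-scaled likelihoods (PLs) with genotype priors.

    Multiplying probabilities is the same as adding Phred values, so the
    unnormalized posterior is PL + phred(prior). Subtracting the minimum
    re-bases the values so the best genotype scores 0 (assumed convention).
    """
    raw = [pl + phred(prior) for pl, prior in zip(pls, priors)]
    best = min(raw)
    return [int(round(value - best)) for value in raw]
```

With flat priors the output equals the input: `phred_posteriors([0, 30, 60], [1/3, 1/3, 1/3])` returns `[0, 30, 60]`. With a strong prior for the first genotype, `phred_posteriors([20, 0, 30], [0.998, 0.001, 0.001])` returns `[0, 10, 40]`, so the prior changes which genotype is most probable.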

## Cat Variants

• Can now process .list files with -V.
• Can now handle BCF and Block-Compressed VCF files.

## Validate Variants

• Now works with gVCF files.
• By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.

## FastaAlternateReferenceMaker

• Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.

## Miscellaneous

• Refactored Maven directories and Java packages, replacing "sting" with "gatk".
• Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
• Added a new read transformer that refactors NDN cigar elements to one N element.
• Now a Tabix index is created for block-compressed output formats.
• Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
• Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
• We now check that output files are specified in a writeable location.
• We now allow blank lines in a (non-BAM) list file.
• Added legibility improvements to the Progress Meter.
• Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
• Made IntervalSharder respect the IntervalMergingRule specified on the command line.
• Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.
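
As an illustration of the NDN read transformer item above, here is a small Python sketch of the idea applied to a CIGAR string. It assumes the merged N element takes the sum of the three lengths (N and D both consume reference bases); the function name is made up and this is not the GATK implementation.

```python
import re

# One CIGAR element: a length followed by a single operator character.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def collapse_ndn(cigar):
    """Replace every N-D-N run in a CIGAR string with a single N element."""
    merged = []
    for length, op in _CIGAR_ELEMENT.findall(cigar):
        length = int(length)
        # If the last two elements kept so far are N then D and the current
        # one is N, fold all three into one N spanning the same reference.
        if (op == "N" and len(merged) >= 2
                and merged[-1][1] == "D" and merged[-2][1] == "N"):
            d_len, _ = merged.pop()
            n_len, _ = merged.pop()
            merged.append((n_len + d_len + length, "N"))
        else:
            merged.append((length, op))
    return "".join(f"{length}{op}" for length, op in merged)
```

For example, `collapse_ndn("10M5N2D3N10M")` returns `"10M10N10M"`, while a CIGAR with no N-D-N run, such as `"50M"`, comes back unchanged.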

Created 2014-09-04 21:50:27 | Updated | Tags: haplotypecaller gatk-3-2 concurrentmodificationexception

Hi,

I encountered a ConcurrentModificationException on several bam files using HaplotypeCaller 3.2-2 that ran just fine on 3.1 using the same number of threads. Here is the stack trace:

    java.util.ConcurrentModificationException
        at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(LinkedHashMap.java:390)
        at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:409)
        at java.util.LinkedHashMap$EntryIterator.next(LinkedHashMap.java:408)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.addMiscellaneousAllele(HaplotypeCallerGenotypingEngine.java:292)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:255)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:941)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:218)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

And here are the program args for one of them:

    Program Args: -T HaplotypeCaller -R human_genome/human_g1k_v37.fasta -I <my_bam> -ERC GVCF -nct 16 --dbsnp /tmp/4437712/dbsnp_135.b37.vcf --variant_index_type LINEAR --variant_index_parameter 128000 -o /tmp/4437712/<my_sample>.vcf.gz

Let me know if you need more information.

Created 2014-09-04 21:45:51 | Updated 2014-09-04 21:50:59 | Tags: haplotypecaller nullpointerexception gatk-3-2 null-pointer-exception

Hi,

I encountered a NullPointerException in 3.2-2 using the HaplotypeCaller on a BAM that ran just fine on 3.1. Here is the stack trace:

    java.lang.NullPointerException
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeDiploidHaplotypeLikelihoods(PairHMMLikelihoodCalculationEngine.java:421)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.PairHMMLikelihoodCalculationEngine.computeDiploidHaplotypeLikelihoods(PairHMMLikelihoodCalculationEngine.java:395)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.calculateGLsForThisEvent(HaplotypeCallerGenotypingEngine.java:421)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:257)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:941)
        at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:218)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:708)
        at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:704)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

And here are the program args:

    Program Args: -T HaplotypeCaller -R human_genome/human_g1k_v37.fasta -I <my_bam> -ERC GVCF -nct 16 --dbsnp /tmp/4437712/dbsnp_135.b37.vcf --variant_index_type LINEAR --variant_index_parameter 128000 -o /tmp/4437712/<my_sample>.vcf.gz