Version History
The current version is 3.5

These articles highlight the key improvements in major and minor version releases (for example, 3.4) and explain their significance. To view a complete list of changes per release (including minor changes and bug fixes), please see the release notes (next tab).

Created 2015-11-25 07:37:00 | Updated 2015-11-25 14:21:18 | Tags: haplotypecaller release mutect version-highlights topstory mutect2

The last GATK 3.x release of the year 2015 has arrived!

The major feature in GATK 3.5 is the eagerly awaited MuTect2 (beta version), which brings somatic SNP and Indel calling to GATK. This is just the beginning of GATK’s scope expansion into the somatic variant domain, so expect some exciting news about copy number variation in the next few weeks! Meanwhile, more on MuTect2 awesomeness below.

In addition, we’ve got all sorts of variant context annotation-related treats for you in the 3.5 goodie bag -- both new annotations and new capabilities for existing annotations, listed below.

In the variant manipulation space, we enhanced or fixed functionality in several tools including LeftAlignAndTrimVariants, FastaAlternateReferenceMaker and VariantEval modules. And in the variant calling/genotyping space, we’ve made some performance improvements across the board to HaplotypeCaller and GenotypeGVCFs (mostly by cutting out crud and making the code more efficient) including a few improvements specifically for haploids. Read the detailed release notes for more on these changes. Note that GenotypeGVCFs will now emit no-calls at sites where RGQ=0 in acknowledgment of the fact that those sites are essentially uncallable.

We’ve got good news for you if you’re the type who worries about disk space (whether by temperament or by necessity): we finally have CRAM support -- and some recommendations for keeping the output of BQSR down to reasonable file sizes, detailed below.

Finally, be sure to check out the detailed release notes for the usual variety show of minor features (including a new Queue job runner that enables local parallelism), bug fixes and deprecation notices (a few tools have been removed from the codebase, in the spirit of slimming down ahead of the holiday season).

Introducing MuTect2 (beta): calling somatic SNPs and Indels natively in GATK

MuTect2 is the next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller.

The original MuTect (Cibulskis et al., 2013) was built on top of the GATK engine by the Cancer Genome Analysis group at the Broad Institute, and was distributed as a separate package. By all accounts it did a great job calling somatic SNPs, and was part of the winning entries for multiple DREAM challenges (including some submitted by groups outside the Broad). However it was not able to call indels; and the less said about the indel caller that accompanied it (first named SomaticIndelDetector then Indelocator) the better.

This new incarnation of MuTect leverages much of the HaplotypeCaller’s internal machinery (including the all-important graph assembly bit) to call both SNPs and indels together. Yet it retains key parts of the original MuTect’s internal genotyping engine that allow it to model somatic variation appropriately. This is a major differentiation point compared to HaplotypeCaller, which has expectations about ploidy and allele frequencies that make it unsuitable for calling somatic variants.

As a convenience add-on to MuTect2, we also integrated the cross-sample contamination estimation tool ContEst into GATK 3.5. Note that while the previous public version of this tool relied on genotyping chip data for its operation, this version of the tool has been upgraded to enable on-the-fly genotyping for the case where genotyping data is not available. Documentation of this feature will be provided in the near future. Both MuTect2 and ContEst are now featured in the Tool Documentation section of the Guide. Stay tuned for pipeline-level documentation on performing somatic variant discovery, to be added to the Best Practices docs in the near future.

Please note that this release of MuTect2 is a beta version intended for research purposes only and should not be applied in production/clinical work. MuTect2 has not yet undergone the same degree of scrutiny and validation as the original MuTect since it is so new. Early validation results suggest that MuTect2 has a tendency to generate more false positives as compared to the original MuTect; for example, it seems to overcall somatic mutations at low allele frequencies, so for now we recommend applying post-processing filters, e.g. by hard-filtering calls with low minor allele frequencies. Rest assured that data is being generated and the tools are being improved as we speak. We’re also looking forward to feedback from you, the user community, to help us make it better faster.

Finally, note also that MuTect2 is distributed under the same restricted license as the original MuTect; for-profit users are required to seek a license to use it. To be clear, while MuTect2 is released as part of GATK, the commercial licensing has not been consolidated under a single license. Therefore, current holders of a GATK license will still need to contact our licensing office if they wish to use MuTect2.
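
For those who want to see what this looks like in practice, here is a minimal sketch of a tumor-normal MuTect2 command. The file names are placeholders, and in real use you would typically also supply appropriate resource files and intervals as described in the tool documentation:

java -jar GenomeAnalysisTK.jar \
    -T MuTect2 \
    -R reference.fasta \
    -I:tumor tumor.bam \
    -I:normal normal.bam \
    -o somatic_calls.vcf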

Annotate this: new and improved variant context annotations

Whew that was a long wall of text on MuTect2, wasn’t it. Let’s talk about something else now. Annotations! Not functional annotations, mind you -- we’re not talking about e.g. predicting synonymous vs. non-synonymous mutations here. I mean variant context annotations, i.e. all those statistics calculated during the variant calling process which we mostly use to estimate how confident we are that the variants are real vs. artifacts (for filtering and related purposes).

So we have two new annotations, BaseCountsBySample (what it says on the can) and ExcessHet (for excess heterozygosity, i.e. the number of heterozygote calls made in excess of the Hardy-Weinberg expectations), as well as a set of new annotations that are allele-specific versions of existing annotations (with AS_ prefix standing for Allele-Specific) which you can browse here. Right now we’re simply experimenting with these allele-specific annotations to determine what would be the best way to make use of them to improve variant filtering. In the meantime, feel free to play around with them (via e.g. VariantsToTable) and let us know if you come up with any interesting observations. Crowdsourcing is all the rage, let’s see if it gets us anywhere on this one!

We also made some improvements to the StrandAlleleCountsBySample annotation, to how VQSR handles MQ, and to how VariantAnnotator makes use of external resources -- and we fixed that annoying bug where default annotations were getting dropped. All of which you can read about in the detailed release notes.

These Three Awesome File Hacks Will Restore Your Faith In Humanity’s Ability To Free Up Some Disk Space

CRAM support! Long-awaited by many, lovingly implemented by Vadim Zalunin at EBI and colleagues at the Sanger Institute. We haven’t done extensive testing, and there are a few tickets for improvements that are planned at the htsjdk level -- but it works well enough that we’re comfortable releasing it under a beta designation. Meaning have fun with it, but do your own thorough testing before putting it into production or throwing out your old BAMs!

Static binning of base quality scores. In a nutshell, binning (or quantizing) the base qualities in a BAM file means that instead of recording all possible quality values separately, we group them into bins represented by a single value (by default, 10, 20, 30 or 40). By doing this we end up having to record fewer separate numbers, which through the magic of BAM compression yields substantially smaller files. The idea is that we don’t actually need to be able to differentiate between quality scores at a very high resolution -- if the binning scheme is set up appropriately, it doesn’t make any difference to the variant discovery process downstream. This is not a new concept, but now the GATK engine has an argument to enable binning quality scores during the base recalibration (BQSR) process using a static binning scheme that we have determined produces optimal results in our hands. The level of compression is of course adjustable if you’d like to set your own tradeoff between compression and base quality resolution. We have validated that this type of binning (with our chosen default parameters) does not have any noticeable adverse effect on germline variant discovery. However we are still looking into some possible effects on somatic variant discovery, so we can’t yet recommend binning for that application.

Disable indel quality scores. The Base Recalibration process produces indel quality scores in addition to the regular base qualities. They are stored in the BI and BD tags of the read records, taking up a substantial amount of space in the resulting BAM files. There has been a lot of discussion about whether these indel quals are worth the file size inflation. Well, we’ve done a lot of testing and we’ve now decided that no, for most use cases the indel quals don’t make enough of a difference to justify the extra file size. The one exception is PacBio data, where the indel quals do seem to help model that technology’s indel-related errors. But for the rest, we’re now comfortable recommending the use of the --disable_indel_quals argument when writing out the recalibrated BAM file with PrintReads.
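
Putting that recommendation into context, a recalibrated-BAM-writing command might look roughly like this (file names are placeholders; the static binning described above is controlled by a separate engine argument that can be added to this same command -- see the engine documentation for details):

java -jar GenomeAnalysisTK.jar \
    -T PrintReads \
    -R reference.fasta \
    -I input.bam \
    -BQSR recal_data.table \
    --disable_indel_quals \
    -o recalibrated.bam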

Created 2015-05-22 01:58:47 | Updated 2015-07-09 22:27:40 | Tags: release bug version-highlights topstory

Folks, I’m all out of banter for this one, so let’s go straight to the facts. GATK 3.4 contains a shedload of improvements and bug fixes, including some new functionality that we hope you’ll find useful. The full list is available in the detailed release notes.

None of the recent changes involves any disruption to the Best Practice workflow (I hear some sighs of relief) but you’ll definitely want to check out the tweaks we made to the joint discovery tools (HaplotypeCaller, CombineGVCFs and GenotypeGVCFs), which are rapidly maturing as they log more flight time at Broad and in the wild.

Key changes to the joint discovery tools

Let’s start at the very beginning with HaplotypeCaller (a very good place to start). On the usability front, we’ve finally given in to the nigh-universal complaint about the required variant indexing arguments (--variant_index_type LINEAR --variant_index_parameter 128000) being obnoxious and a waste of characters. So, tadaa, they are no longer required, as long as you name your output file with the extension .g.vcf so that the engine knows what level of compression to use to write the gVCF index (which leads to better performance in downstream tools). We think this naming convention makes a lot of sense anyway, as it’s a great way to distinguish gVCFs from regular VCFs on sight, so we hope most of you will adopt it. That said, we stopped short of making this convention mandatory (for now…) so you don’t have to change all your scripts and conventions if you don’t want to. All that will happen (assuming you still specify the variant index parameters as previously) is that you’ll get a warning in the log telling you that you could use the new convention.
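 
As an illustration, here is a minimal sketch of a HaplotypeCaller command using the new naming convention (file names are placeholders). Because the output file name ends in .g.vcf, the two index arguments are no longer needed:

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    --emitRefConfidence GVCF \
    -o sample1.g.vcf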

Where we’ve been a bit more dictatorial is that we’ve completely disabled the use of -dcov with HaplotypeCaller because it was causing very buggy behavior due to an unforeseen complication in how different levels of downsampling are applied in HaplotypeCaller. We know that the default setting does the right thing, and there’s almost no legitimate reason to change it, so we’re disabling this for the greater good pending a fix (which may be a long time coming due to the complexity of the code involved).

Next up, CombineGVCFs gets a new option to break up reference blocks at every N sites. The new argument --breakBandsAtMultiplesOf N will ensure that no reference blocks in the combined gVCF span genomic positions that are multiples of N. This is meant to enable scatter-gather parallelization of joint genotyping on whole-genome data, as a workaround to some annoying limitations of the GATK engine that make it unsafe to use -L intervals that might start within the span of a block record. For exome data, joint genotyping can easily be parallelized by scatter-gathering across exome capture target intervals, because we know that there won’t be any hom-ref block records spanning the target interval boundaries. In contrast, in whole-genome data, there is no equivalent predictable termination of block records, so it’s not possible to know up front where it would be safe to set scatter-gather interval start and end points -- until now!
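
In practice, that looks something like this (file names are placeholders, and the band size of 1000000 is just an illustrative value; pick whatever multiple suits your scatter intervals):

java -jar GenomeAnalysisTK.jar \
    -T CombineGVCFs \
    -R reference.fasta \
    -V sample1.g.vcf \
    -V sample2.g.vcf \
    --breakBandsAtMultiplesOf 1000000 \
    -o combined.g.vcf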

And finally, GenotypeGVCFs gets an important bug fix, and a very useful new annotation.

The bug is something that has arisen mostly (though not exclusively) from large cohort studies. What happened is that, when a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, GenotypeGVCFs would emit a homozygous reference genotype for sample B at that position -- which is obviously incorrect. The fix is that now, sample B will be genotyped as having a symbolic <*:DEL> allele representing the deletion.

The new annotation is called RGQ for Reference Genotype Quality. It is a new sample-level annotation that will be added by GenotypeGVCFs to monomorphic sites if you use the -allSites argument to emit non-variant sites to the output VCF. This is obviously super useful for evaluating the level of confidence of those sites called homozygous-reference.
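
For example, a GenotypeGVCFs command that emits non-variant sites (and hence RGQ values) might look roughly like this (file names are placeholders):

java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R reference.fasta \
    -V combined.g.vcf \
    -allSites \
    -o genotyped_allsites.vcf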

New RNAseq tool: ASEReadCounter

This new coverage analysis tool is designed to count read depth in a way that is appropriate for allele-specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. The default output format produced by this tool is a structured text file intended to be consumed by the statistical analysis toolkit MAMBA. A paper by Stephane Castel and colleagues describing the complete ASE analysis workflow is available as a preprint on bioRxiv.

New documentation features: “Common Problems” and “Issue Tracker”

We’ve added two new documentation resources to the Guide.

One is a new category of documentation articles called Common Problems, to cover topics that are a specialized subset of FAQs: problems that many users encounter, which are typically due to misunderstandings about input requirements or about the expected behavior of the tools, or complications that arise from certain experimental designs. This category is being actively worked on and we welcome suggestions of additional topics that it should cover.

The second is an Issue Tracker that lists issues that have been reported as well as features or enhancements that have been requested. If you encounter a problem that you think might be a bug (or you have a feature request in mind), you can check this page to see if it’s something we already know about. If you have submitted a bug report, you can use the issue tracker to check whether your issue is in the backlog, in the queue, or is being actively worked on. In future we’ll add some functionality to enable voting on what issues or features should be prioritized, so stay tuned for an announcement on that!

Created 2014-10-23 23:09:56 | Updated | Tags: Troll release bug version-highlights topstory

Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.

Because -ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.

Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.

Genotype refinement workflow with all the trimmings

As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.

Non-diploids, rejoice!

This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.
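
As a sketch, calling a haploid sample in GVCF mode looks roughly like this (file names are placeholders; set -ploidy to whatever is appropriate for your organism):

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I haploid_sample.bam \
    -ploidy 1 \
    --emitRefConfidence GVCF \
    -o haploid_sample.g.vcf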

HaplotypeCaller gets physical

You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don't, go read this now. Go on, we'll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That's great, but until now, those variants were unphased, meaning the HC didn't give you any information about whether any two variants' alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, two nearby heterozygous variants could have both of their alternate alleles on the same chromosome copy, or split across the two copies; those are very different situations.

But HC wouldn't tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).

In a nutshell, phased records will look like this:

1   1372243  .  T  <NON_REF>  .  .  END=1372267  <snip>  <snip>
1   1372268  .  G  A,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,40,0:70:99:0|1:1372268_G_A:<snip>
1   1372269  .  G  T,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,41,0:71:99:0|1:1372268_G_A:<snip>
1   1372270  .  C  <NON_REF>  .  .  END=1372299  <snip>  <snip>

You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).

The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.

Heads or tails

Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).

We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the --recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.

Variant annotations finally make sense

Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.

Created 2014-07-30 20:26:12 | Updated | Tags: release bug version-highlights topstory reference-model gatk-3-2

Better late than never (right?), here are the version highlights for GATK 3.2. Overall, this release is essentially a collection of bug fixes and incremental improvements that we wanted to push out to not keep folks waiting while we're working on the next big features. Most of the bug fixes are related to the HaplotypeCaller and its "reference confidence model" mode (which you may know as -ERC GVCF). But there are also a few noteworthy improvements/changes in other tools which I'll go over below.

Working out the kinks in the "reference confidence model" workflow

The "reference confidence model" workflow, which I hope you have heard of by now, is that awesome new workflow we released in March 2014, which was the core feature of the GATK 3.0 version. It solves the N+1 problem and allows you to perform joint variant analysis on ridiculously large cohorts without having to enslave the entire human race and turn people into batteries to power a planet-sized computing cluster. More on that later (omg we're writing a paper on it, finally!).

You can read the full list of improvements we've made to the tools involved in the workflow (mainly HaplotypeCaller and GenotypeGVCFs) in Eric's (unusually detailed) Release Notes for this version. The ones you are most likely to care about are that the "missing PLs" bug is fixed, GenotypeGVCFs now accepts arguments that allow it to emulate the HC's genotyping capabilities more closely (such as --includeNonVariantSites), the AB annotation is fully functional, reference DPs are no longer dropped, and CatVariants now accepts lists of VCFs as input. OK, so that last one is not really specific to the reference model pipeline, but that's where it really comes in handy (imagine generating a command line with thousands of VCF filenames -- it's not pretty).

HaplotypeCaller now emits post-realignment coverage metrics

The coverage metrics (DP and AD) reported by HaplotypeCaller are now those calculated after the HC's reassembly step, based on the reads having been realigned to the most likely haplotypes. So the metrics you see in the variant record should match what you see if you use the -bamout option and visualize the reassembled ActiveRegion in a genome browser such as IGV. Note that if any of this is not making sense to you, say so in the comments and we'll point you to the new HaplotypeCaller documentation! Or, you know, look for it in the Guide.

R you up to date on your libraries?

We updated the plotting scripts used by BQSR and VQSR to use the latest version of ggplot2, to get rid of some deprecated function issues. If your Rscripts are suddenly failing, you'll need to update your R libraries.

A sincere apology to GATK-based tool developers

We're sorry for making you jump through all these hoops recently. As if the switch to Maven wasn't enough, we have now completed a massive reorganization/renaming of the codebase that will probably cause you some headaches when you port your tools to the newest version. But we promise this is the last big wave, and ultimately this will make your life easier once we get the GATK core framework to be a proper maven artifact.

In a nutshell, the base name of the codebase has changed from sting to gatk (which hopefully makes more sense), and the most common effect is that classpath segments that used to contain sting now contain gatk instead. This, by the way, is why we had a bunch of broken documentation links; most of these have been fixed (yay symlinks) but there may be a few broken URLs remaining. If you see something, say something, and we'll fix it.

Created 2014-03-18 00:36:21 | Updated 2014-03-20 14:10:47 | Tags: release performance version-highlights topstory pairhmm

This may seem crazy considering we released the big 3.0 version not two weeks ago, but yes, we have a new version for you already! It's a bit of a special case because this release is all about the hardware-based optimizations we had previously announced. What we hadn't announced yet was that this is the fruit of a new collaboration with a team at Intel (which you can read more about here), so we were waiting for everyone to be ready for the big reveal.

Intel inside GATK

So basically, the story is that we've started collaborating with the Intel Bio Team to enable key parts of the GATK to run more efficiently on certain hardware configurations. For our first project together, we tackled the PairHMM algorithm, which is responsible for a large proportion of the runtime of HaplotypeCaller analyses. The resulting optimizations, which are the main feature in version 3.1, produce significant speedups for HaplotypeCaller runs on a wide range of hardware.

We will continue working with Intel to further improve the performance of GATK tools that have historically been afflicted with performance issues and long runtimes (hello BQSR). As always, we hope these new features will make your life easier, and we welcome your feedback in the forum!

In practice

Note that these optimizations currently work on Linux systems only, and will not work on Mac or Windows operating systems. In the near future we will add support for Mac OS. We have no plans to add support for Windows since the GATK itself does not run on Windows.

Please note also that to take advantage of these optimizations, you need to opt-in by adding the following flag to your GATK command: -pairHMM VECTOR_LOGLESS_CACHING.
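
For example, a HaplotypeCaller command with the optimization enabled might look roughly like this (file names are placeholders; the -pairHMM flag is the only addition relative to a regular run):

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    -pairHMM VECTOR_LOGLESS_CACHING \
    -o sample1.vcf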

Here is a handy little table of the speedups you can expect depending on the hardware and operating system you are using. The configurations given here are the minimum requirements for benefiting from the expected speedup ranges shown in the third column. Keep in mind that these numbers are based on tests in controlled conditions; in the wild, your mileage may vary.

Linux kernel version   | Architecture / Processor                           | Expected speedup | Instruction set
Any 64-bit Linux       | Any x86 64-bit                                     | 1-1.5x           | Non-vector
Linux 2.6 or newer     | Penryn (Core 2 or newer)                           | 1.3-1.8x         | SSE 4.1
Linux 2.6.30 or newer  | SandyBridge (i3, i5, i7, Xeon E3, E5, E7 or newer) | 2-2.5x           | AVX

To find out exactly which processor is in your machine, you can run this command in the terminal:

$ cat /proc/cpuinfo | grep "model name"                                                                                    
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

In this example, the machine has 4 cores (8-threads), so you see the answer 8 times. With the model name (here i7-2600) you can look up your hardware's relevant capabilities in the Wikipedia page on vector extensions.

Alternatively, Intel has provided us with some links to lists of processors categorized by architecture, in which you can look up your hardware:

  • Penryn processors

  • Sandy Bridge processors

Finally, a few notes to clarify some concepts regarding Linux kernels vs. distributions and processors vs. architectures:

  • SandyBridge and Penryn are microarchitectures; essentially, these are sets of instructions built into the CPU. Core 2, Core i3, i5, i7, and Xeon E3, E5, E7 are processors that implement a specific microarchitecture to make use of the relevant improvements (see table above).

  • The Linux kernel has no connection with the Linux distribution (e.g. Ubuntu, RedHat etc). Any distribution can use any kernel it wants. There are "default kernels" shipped with each distribution, but covering that is beyond the scope of this article (there are at least 300 Linux distributions out there). You can always install whatever kernel version you want.

  • The kernel version 2.6.30 was released in 2009, so we expect every sane person or IT department out there to be using something better than this.

Created 2014-03-17 23:32:16 | Updated | Tags: release rnaseq version-highlights multisample gatk-3-0 reference-model joint-analysis

Better late than never, here is the now-traditional "Highlights" document for GATK version 3.0, which was released two weeks ago. It will be a very short one since we've already gone over the new features in detail in separate articles --but it's worth having a recap of everything in one place. So here goes.

Work smarter, not harder

We are delighted to present our new Best Practices workflow for variant calling in which multisample calling is replaced by a winning combination of single-sample calling in gVCF mode and joint genotyping analysis. This allows us to both bypass performance issues and solve the so-called "N+1 problem" in one fell swoop. For full details of why and how this works, please see this document. In the near future, we will update our Best Practices page to make it clear that the new workflow is now the recommended way to go for calling variants on cohorts of samples. We've already received some pretty glowing feedback from early adopters, so be sure to try it out for yourself!

Jumping on the RNAseq bandwagon

All the cool kids were doing it, so we had to join the party. It took a few months of experimentation, a couple of new tools and some tweaks to the HaplotypeCaller, but you can now call variants on RNAseq with GATK! This document details our Best Practices recommendations for doing so, along with a non-trivial number of caveats that you should keep in mind as you go.

Goodbye to ReduceReads

Nice try, but no. This tool is obsolete now that we have the gVCF/reference model pipeline (see above). Note that this means that GATK 3.0 will not support BAM files that were processed using ReduceReads!

Changes for developers

We've switched the build system from Ant to Maven, which should make it much easier to use GATK as a library against which you can develop your own tools. And on a related note, we're also making significant changes to the internal structure of the GATK codebase. Hopefully this will not have too much impact on external projects, but there will be a doc very shortly describing how the new build system works and how the codebase is structured.

Hardware optimizations held for 3.1

For reasons that will be made clear in the near future, we decided to hold the previously announced hardware optimizations until version 3.1, which will be released very soon. Stay tuned!

Created 2013-12-19 21:17:54 | Updated 2014-02-07 18:37:02 | Tags: official release version-highlights topstory

Better late than never, here are the highlights of the most recent version release, GATK 2.8. This should be short and sweet because as releases go, 2.8 is light on new features, and is best described as a collection of bug fixes, which are all* dutifully listed in the corresponding release notes document. That said, two of the changes we've made deserve some additional explanation.

* Up to now (this release included) we have not listed updates/patches to Queue in the release notes, but will start doing so from the next version onward.

VQSR & bad variants: no more guessing games

In the last release (2.7, for those of you keeping score at home) we trumpeted that the old -percentBad argument of VariantRecalibrator had been replaced by the shiny new -numBad argument, and that this was going to be awesome for all sorts of good reasons, improve stability and whatnot. Weeeeeeell it turned out that wasn't quite the case. It worked really well on the subset of analyses that we tested it on initially, but once we expanded to different datasets (and the complaints started rolling in on the forum) we realized that it actually made things worse in some cases because the default value was less appropriate than what -percentBad would have produced. This left people guessing as to what value would work for their particular dataset, with a great big range to choose from and very little useful information to assist in the choice.

So, long story short, we (and by "we" I mean Ryan) built in a new function that allows the VariantRecalibrator to determine for itself the number of variants that is appropriate to use for the "bad" model depending on the data. So the short-lived -numBad argument is gone too, replaced by... nothing. No new argument to specify; just let the VariantRecalibrator do its thing.

Of course if you really want to, you can override the default behavior and tweak the internal thresholds. See the tool doc here; and remember that a good rule of thumb is that if you can't figure out which arguments are involved based on that doc, you probably shouldn't be messing with this advanced functionality.

Reference calculation model

This is still a rather experimental feature, so we're still making changes as we go. The two big changes worth mentioning here are that you can now run this on reduced reads, and that we've changed the indexing routine to optimize the compression level. The latter shouldn't have any immediate impact on normal users, but it was necessary for a new feature project we've been working on behind the scenes (the single-sample-to-joint-discovery pipeline we have been alluding to in recent forum discussions). The reason we're mentioning it now is that if you use -ERC GVCF output, you'll need to specify a couple of new arguments as well (-variant_index_type LINEAR and -variant_index_parameter 128000, with those exact values). This useful little fact didn't quite make it into the documentation before we released, and not specifying them leads to an error message, so... there you go. No error message for you!
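
Concretely, a 2.8-era HaplotypeCaller command in -ERC GVCF mode with those arguments added looks roughly like this (file names are placeholders):

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fasta \
    -I sample1.bam \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000 \
    -o sample1.g.vcf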

What's up, doc?

That's all for tool changes. In addition to those, we have made a number of corrections in the tool documentation pages, updated the Best Practices (mostly layout, tiny bit of content update related to the VQSR -numBad deprecation) and made some minor changes to the website, e.g. updated the list of publications that cite the GATK and improved the Guide index somewhat (but that's still a work in progress).

 Note: There are no version highlights available for versions earlier than 2.2.

These are the release notes issued for all major and minor version releases (for example, 3.4). At this time, we do not provide release notes for subversion changes (for example, 3.4-46) but they are typically accompanied by a blog post, and you can view the latest changes in the Change log (next tab).

Created 2015-11-25 07:10:45 | Updated 2015-11-25 14:27:50 | Tags: Promote haplotypecaller release-notes mutect gatk3 mutect2

GATK 3.5 was released on November 25, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.

New tools

  • MuTect2: somatic SNP and indel caller based on HaplotypeCaller and the original MuTect.
  • ContEst: estimation of cross-sample contamination (primarily for use in somatic variant discovery).
  • GatherBqsrReports: utility to gather recalibration tables from scatter-parallelized BaseRecalibrator runs.

Variant Context Annotations

  • Added allele-specific version of existing annotations: AS_BaseQualityRankSumTest, AS_FisherStrand, AS_MappingQualityRankSumTest, AS_RMSMappingQuality, AS_RankSumTest, AS_ReadPosRankSumTest, AS_StrandOddsRatio, AS_QualByDepth and AS_InbreedingCoeff.

  • Added BaseCountsBySample annotation. Intended to provide insight into the pileup of bases used by HaplotypeCaller in the calling process, which may differ from the pileup observed in the original bam file because of the local realignment and additional filtering performed internally by HaplotypeCaller. Can only be requested from HaplotypeCaller, not VariantAnnotator.

  • Added ExcessHet annotation. Estimates excess heterozygosity in a population of samples. Related to but distinct from InbreedingCoeff, which estimates evidence for inbreeding in a population. ExcessHet scales more reliably to large cohort sizes.

  • Added FractionInformativeReads annotation. Reports the number of reads that were considered informative by HaplotypeCaller (over all samples).

  • Enforced calculating GenotypeAnnotations before InfoFieldAnnotations. This ensures that the AD value is available to use in the QD calculation.

  • Reorganized standard annotation groups processing to ensure that all default annotations always get annotated regardless of what is specified on the command line. This fixes a bug where default annotations were getting dropped when the command line included annotation requests.

  • Made GenotypeGVCFs subset StrandAlleleCounts intelligently, i.e. subset the SAC values to the called alleles. Previously, when the StrandAlleleCountsBySample (SAC) annotation was present in GVCFs, GenotypeGVCFs carried it over to the final VCF essentially unchanged. This was problematic because SAC includes the counts for all alleles originally present (including NON-REF) even when some are not called in the final VCF. When the full list of original alleles is no longer available, parsing SAC could become difficult if not impossible.

  • Added new MQ jittering functionality to improve how VQSR handles MQ.

  • VariantAnnotator can now annotate FILTER field from an external resource. Usage: --resource:foo resource.vcf --expression foo.FILTER

  • VariantAnnotator can now check allele concordance when annotating with an external resource. Usage: --resourceAlleleConcordance

  • Bug fix: The annotation framework was improved to allow for the collection of sufficient statistics during GVCF creation which are then used to compute the final annotation during the genotyping. This avoids the use of median as the representative annotation from the collection of values (one from each sample). TL;DR annotations will be more accurate when using the GVCF workflow for joint discovery.

Variant manipulation tools

  • Allowed overriding hard-coded cutoff for allele length in ValidateVariants and in LeftAlignAndTrimVariants. Usage: --reference_window_stop N where N is the desired cutoff.

  • Also in LeftAlignAndTrimVariants, trimming multiallelic alleles is now the default behavior.

  • Fixed ability to mask out snps with --snpmask in FastaAlternateReferenceMaker.

  • Also in FastaAlternateReferenceMaker, fixed merging of contiguous intervals properly, and made the tool produce more informative contig names.

  • Fixed a bug in CombineVariants that occurred when one record has a spanning deletion and needs a padded reference allele.

  • Added a new VariantEval evaluation module, MetricsCollection, that summarizes metrics from several EV modules.

  • Enabled family-level stratification in MendelianViolationEvaluator of VariantEval (if a ped file is provided), making it possible to count Mendelian violations for each family in a callset with multiple families.

GVCF tools

  • Various improvements to the tools’ performance, especially HaplotypeCaller, by making the code more efficient and cutting out crud.

  • GenotypeGVCFs now emits a no-call (./.) when the evidence is too ambiguous to make a call at all (e.g. all the PLs are zero). Previously this would have led to a hom-ref call with RGQ=0.

  • Fixed a bug in GenotypeGVCFs that sometimes generated invalid VCFs for haploid callsets. The tool was carrying over the AD from alleles that had been trimmed out, causing field length mismatches.

  • Changed the genotyping implementation for haploid organisms to address performance problems reported when running GenotypeGVCFs on haploid callsets. Note that this change may lead to a slight loss of sensitivity at low-coverage sites -- let us know if you observe anything dramatic.

Genotyping engine tweaks

  • Ensured inputPriors get used if they are specified to the genotyper (previously they were ignored). Also improved docs on --heterozygosity and --indel_heterozygosity priors.

  • Fixed bug that affected the --ignoreInputSamples behavior of CalculateGenotypePosteriors.

  • Limited emission of the scary warning message about max number of alleles (“this tool is set to genotype at most x alleles but we found more; only x will be used”) to a single occurrence unless DEBUG logging mode is activated. Otherwise it fills up our output logs.

Miscellaneous tool fixes

  • Added option to OverclippedReadFilter to not require soft-clips on both ends. Contributed by Jacob Silterra.

  • Fixed a bug in IndelRealigner where the tool was incorrectly "fixing" mates when supplementary alignments are present. The patch involves ignoring supplementary alignments.

  • Fixed a bug in CatVariants. Previously, VCF files were being sorted solely on the base pair position of the first record, ignoring the chromosome. This can become problematic when merging files from different chromosomes, especially if you have multiple VCFs per chromosome. Contributed by John Wallace.

Engine-level behaviors and capabilities

  • Support for reading and writing CRAM files. Some improvements are still expected in htsjdk. Contributed by Vadim Zalunin at EBI and collaborators at the Sanger Institute.

  • Added the ability to enforce 4.2 version output of the VCF spec when processing older files. Use case: the 4.2 spec specifies that GQ must be an integer; by default we don’t enforce it (so if reading an older file that used decimals, we don’t change it) but the new argument --forceValidOutput converts the values on request. Not made default because of some performance slowdown -- so writing VCFs is now fast by default, compliant by choice.

  • Made interval-list output format dependent on the file extension (for RealignerTargetCreator). If the extension is .interval_list, output will be formatted as a proper Picard interval list (with sequence dictionary). Otherwise it will be a basic GATK interval list as previously.

  • Added static binning capability for base recalibration (BQSR).


Queue

  • Added a new JobRunner called ParallelShell that will run jobs locally on one node concurrently as specified by the DAG, with the option to limit the maximum number of concurrently running jobs using the flag maximumNumberOfJobsToRunConcurrently. Contributed by Johan Dahlberg.

  • Updated extension for Picard CalculateHsMetrics to include PER_TARGET_COVERAGE argument and added extension for Picard CollectWgsMetrics.

Deprecation notice


  • BeagleOutputToVCF, VariantsToBeagleUnphased, ProduceBeagleInput. These are tools for handling Beagle data. The latest versions of Beagle support VCF input and output, so there is no longer any reason for us to provide converters.
  • ReadAdaptorTrimmer and VariantValidationAssessor. These were experimental tools which we think are not useful and not operating on a sufficiently sound basis.
  • BaseCoverageDistribution and CoveredByNSamplesSites. These tools were redundant with DiagnoseTargets and/or DepthOfCoverage.
  • LiftOverVariants and FilterLiftedVariants. The Picard liftover tool LiftoverVCF works better and is easier to operate.
  • Use Picard SortVCF instead.
  • ListAnnotations. This was intended as a utility for listing annotations easily from command line, but it has not proved useful.


Miscellaneous

  • Made various documentation improvements.
  • Updated date and street address in license text.
  • Moved htsjdk & picard to version 1.141

Created 2015-05-15 04:52:05 | Updated 2015-11-25 07:08:50 | Tags: haplotypecaller release-notes genotypegvcfs gatk3

GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.

New tool

  • ASEReadCounter: A tool to count read depth in a way that is appropriate for allele specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. See Highlights for more details.

HaplotypeCaller & GenotypeGVCFs

  • Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
  • Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
  • Removed MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotations that HaplotypeCaller is prevented from computing. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
  • Clamp the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
  • Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
  • Initialize annotations so that --disableDithering actually works.
  • Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
  • Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
  • Added the RGQ (Reference Genotype Quality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic hom-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
  • Removed GenotypeSummaries from default annotations.
  • Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
  • Disallow changing -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
  • Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
  • Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
  • Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
  • Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
  • Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
  • Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).


CombineGVCFs

  • Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
  • Fixed a bug that caused the tool to stop processing after the first contig.
  • Fixed a bug where the wrong REF allele was output to the combined gVCF.


  • Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
  • VQSR VCF header command line now contains annotations and tranche levels.


SelectVariants

  • Added -trim argument to trim (simplify) alleles to a minimal representation.
  • Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
  • Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.
  • Added --keepOriginalDP argument.


  • Improvements to the allele trimming functionalities.
  • Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.


  • Fixed user-reported bug featuring "trio" family with two children, one parent.
  • Added error handling for genotypes that are called but have no PLs.

Various tools

  • BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
  • DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
  • SplitSamFile: Fixed a bug that caused a NullPointerException.
  • SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
  • IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
  • CombineVariants: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
  • VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
  • GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
  • ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
  • CatVariants: Now allows different input / output file types, and spaces in directory names.
  • VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.

Read Filters

  • Added erroneous CIGAR length to criteria for BadCigarFilter.
  • Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
  • Handle X and = CIGAR operators appropriately
  • Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).


Annotations

  • Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
  • Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
  • Changed GC Content value type from Integer to Float.
  • Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
  • Made annotators emit a warning if they can't be applied.

GATK Engine & common features

  • Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
  • Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
  • Handle X and = CIGAR operators appropriately.
  • Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
  • Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now going to stderr.
  • Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
  • Fixed BCF writing when FORMAT annotations contain arrays.


Queue

  • Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
  • Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
  • Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.


Documentation

  • Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.

For developers

  • Upgraded SLF4J to allow new convenient logging syntaxes.
  • Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
  • Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
  • Collected VCF IDs and header lines into one place (GATKVCFConstants).
  • Made various changes that lead to reduced build times.

Created 2014-10-23 18:53:52 | Updated 2015-05-12 17:24:14 | Tags: Troll haplotypecaller ploidy release-notes genotypegvcfs gatk3 genotyperefinement

GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

Haplotype Caller

  • Improved the accuracy of dangling head merging in the HC assembler (now enabled by default).
  • Physical phasing information is output by default in new sample-level PID and PGT tags.
  • Added the --sample_name argument. This is a shortcut for people who have multi-sample BAMs but would like to use -ERC GVCF mode with a particular one of those samples.
  • Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
  • Fixed IndexOutOfBounds error associated with tail merging.

Variant Recalibrator

  • New --ignore_all_filters option. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.


  • Support added for generalized ploidy. The global ploidy is specified with the -ploidy argument.
  • Fixed a bug where ADs were assumed to be in the same order whenever the number of alleles matched.
  • Changed the default GVCF GQ bands from 5,20,60 to 1..60 by 1s, 60..90 by 10s, and 99, in order to give finer resolution.
  • Bug fix in the exact model when calling multi-allelic variants. QUAL field is now more accurate.

RNAseq analysis

  • Bug fixes for working with unmapped reads.


  • New annotation for low- and high-confidence possible de novos (only annotates biallelics).
  • FamilyLikelihoodsUtils now adds joint likelihood and joint posterior annotations.
  • Population priors based on the discovered allele count are now only applied when 10 or more samples are present.


  • Fixed rare bug triggered by hash collision between sample names.


  • Updated the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection.
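
A minimal illustrative command (a sketch only; sample and file names are placeholders):

    # Subset to one sample while keeping the original allele count annotations
    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V cohort.vcf \
        -sn NA12878 \
        --keepOriginalAC \
        -o NA12878.subset.vcf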


  • Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header.
  • The performance penalty associated with filtering by read group has been essentially eliminated.


  • StrandOddsRatio is now a standard annotation that is output by default.
  • We used to output zero for FS when no data was available at a site; now we simply omit the FS annotation in that case.
  • Extensive rewrite of the annotation documentation.


Queue

  • Fixed Queue bug with bad localhost addresses.
  • Fixed issue related to spaces in job names that were fine in GridEngine 6 but break in (Son of) GE8.
  • Improved the scatter contigs algorithm to be fairer when splitting many contigs into few parts (contributed by @smowton).


  • We now generate PHP files instead of HTML.
  • We now output a JSON version of the tool documentation that can be used to generate wrappers for GATK commands.


  • Output arguments --no_cmdline_in_header, --sites_only, and --bcf for VCF files, and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files moved to the engine level.
  • htsjdk updated to version 1.120.1620.

Created 2014-07-15 03:54:06 | Updated 2014-10-23 17:58:36 | Tags: variantrecalibrator haplotypecaller selectvariants variantannotator release-notes catvariants genotypegvcfs gatk-3-2
Comments (13)

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.

Haplotype Caller

  • Various improvements were made to the assembly engine and likelihood calculation, leading to more accurate genotype likelihoods (and hence better genotypes).
  • Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
  • The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
  • Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
  • Fixed small bug where indel discovery was inconsistent based on the active-region size.
  • Removed scary warning messages for "VectorPairHMM".
  • Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
  • When we subset PLs because alleles are removed during genotyping we now also subset the AD.
  • Fixed bug where reference sample depth was dropped in the DP annotation.

Variant Recalibrator

  • The -mode argument is now required.
  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.


  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

Variant Annotator

  • SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

Genotype GVCFs

  • Added missing arguments so that the tool now more closely matches what's available in the Haplotype Caller.
  • Fixed recurring error about missing PLs.
  • No longer pulls headers from all input rods (including dbSNP); headers are now taken only from the input variants.
  • --includeNonVariantSites should now be working.
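
For example (a minimal sketch; file names are placeholders):

    # Joint-genotype two gVCFs and emit non-variant sites as well
    java -jar GenomeAnalysisTK.jar \
        -T GenotypeGVCFs \
        -R reference.fasta \
        -V sample1.g.vcf \
        -V sample2.g.vcf \
        --includeNonVariantSites \
        -o all_sites.vcf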

Select Variants

  • The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

Indel Realigner

  • Now throws a user error when it encounters reads with I operators greater than the number of read bases.
  • Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.


  • Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
  • Now computes genotype posteriors using likelihoods from all members of the trio.
  • Added annotations for calling potential de novo mutations.
  • Now uses PP tag instead of GP tag because posteriors are Phred-scaled.

Cat Variants

  • Can now process .list files with -V.
  • Can now handle BCF and Block-Compressed VCF files.

Validate Variants

  • Now works with gVCF files.
  • By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.
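
A sketch of how the exclusion argument is used (file names are placeholders, and the ALLELES validation type shown here is only an example; see the tool documentation for the full list of validation types):

    # Run all strict validations except one excluded type
    java -jar GenomeAnalysisTK.jar \
        -T ValidateVariants \
        -R reference.fasta \
        -V calls.g.vcf \
        --validationTypeToExclude ALLELES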


Fasta Alternate Reference Maker

  • Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.
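
An illustrative command (a minimal sketch; file and sample names are placeholders):

    # Build an alternate reference FASTA, encoding the chosen sample's heterozygous sites as IUPAC ambiguity codes
    java -jar GenomeAnalysisTK.jar \
        -T FastaAlternateReferenceMaker \
        -R reference.fasta \
        -V cohort.vcf \
        --use_IUPAC_sample NA12878 \
        -o NA12878.alternate.fasta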


  • Refactored maven directories and java packages replacing "sting" with "gatk".
  • Extended the on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument (see the sketch after this list).
  • Added a new read transformer that refactors NDN cigar elements to one N element.
  • Now a Tabix index is created for block-compressed output formats.
  • Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
  • Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
  • We now check that output files are specified in a writeable location.
  • We now allow blank lines in a (non-BAM) list file.
  • Added legibility improvements to the Progress Meter.
  • Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
  • Made IntervalSharder respect the IntervalMergingRule specified on the command line.
  • Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.
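
A sketch of the sample-renaming usage (the mapping file layout shown here, one input file path followed by the new sample name per line, is an assumption made for illustration; file names are placeholders, so check the engine documentation for the exact format):

    # rename_map.txt (assumed format: input file path, whitespace, new sample name)
    /data/project/old_name_1.vcf  SAMPLE_A
    /data/project/old_name_2.vcf  SAMPLE_B

    # The mapping can then be applied on the fly by a GATK command, e.g.:
    java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R reference.fasta \
        -V /data/project/old_name_1.vcf \
        --sample_rename_mapping_file rename_map.txt \
        -o renamed.vcf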

Created 2014-03-17 16:52:43 | Updated 2014-03-19 15:13:51 | Tags: variantrecalibrator haplotypecaller randomlysplitvariants release-notes fastaalternatereferencemaker
Comments (0)

GATK 3.1 was released on March 18, 2014. Highlights are listed below. Read the detailed version history overview here.

Haplotype Caller

  • Added new capabilities to the Haplotype Caller to use hardware-based optimizations. These can be enabled with --pair_hmm_implementation VECTOR_LOGLESS_CACHING (see the example after this list). Please see the 3.1 Version Highlights for more details about the expected speedups and some background on the collaboration that made these optimizations possible.
  • Fixed bugs in computing the weights of edges in the assembly graph. This was causing bad genotypes to be output when running the Haplotype Caller over multiple samples simultaneously (as opposed to creating gVCFs in the new recommended pipeline, which was working as expected).
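
For example (a minimal sketch; file names are placeholders):

    # Enable the hardware-optimized PairHMM implementation
    java -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R reference.fasta \
        -I sample.bam \
        --pair_hmm_implementation VECTOR_LOGLESS_CACHING \
        -o sample.vcf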

Variant Recalibrator

  • Fixed issue where output could be non-deterministic with very large data sets.


  • Fixed several bugs where bad inputs were causing the tool to crash instead of exiting gracefully with an error message.


  • RandomlySplitVariants can now produce splits consisting of more than 2 output files.
  • FastaAlternateReferenceMaker can now output heterozygous sites using IUPAC ambiguity encoding.
  • Picard, Tribble, and Variant jars updated to version 1.109.1722.

Created 2014-03-04 06:37:01 | Updated 2014-03-17 22:48:07 | Tags: readbackedphasing reducereads release-notes genotypegvcfs combinegvcfs splitncigarreads
Comments (2)

GATK 3.0 was released on March 5, 2014. Highlights are listed below. Read the detailed version history overview here.

One important change for those who prefer to build from source is that we now use maven instead of ant. See the relevant documentation for building the GATK with our new build system.


SplitNCigarReads

  • This is a new GATK tool to be used for variant calling in RNA-seq data. Its purpose is to split reads that contain N Cigar operators (due to a limitation in the GATK that we will eventually handle internally) and to trim (and generally clean up) imperfect alignments.

Haplotype Caller

  • Fixed bug where dangling tail merging in the assembly graph occasionally created a cycle.
  • Added experimental code to retrieve dangling heads in the assembly graph, which is needed for calling variants in RNA-seq data.
  • Generally improved gVCF output by making it more accurate. This includes many updates so that the single sample gVCFs can be accurately genotyped together by GenotypeGVCFs.
  • Fixed a bug in the PairHMM class where the transition probability was miscalculated resulting in probabilities larger than 1.
  • Fixed bug in the function to find the best paths from an alignment graph which was causing bad genotypes to be emitted when running with multiple samples together.


Combine GVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline with large cohorts. Its purpose is to combine any number of gVCF files into a single merged gVCF. One would use this tool for hierarchical merges of the data when there are too many samples in the project to pass to GenotypeGVCFs all at once.
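
A minimal illustration of one hierarchical merge step (a sketch; file names are placeholders):

    # Merge a batch of per-sample gVCFs into a single combined gVCF
    java -jar GenomeAnalysisTK.jar \
        -T CombineGVCFs \
        -R reference.fasta \
        -V sample1.g.vcf \
        -V sample2.g.vcf \
        -V sample3.g.vcf \
        -o batch1.g.vcf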


Genotype GVCFs

  • This is a new GATK tool to be used in the Haplotype Caller pipeline. Its purpose is to take any number of gVCF files and genotype them jointly in order to produce a VCF with raw SNP and indel calls.


  • This is a new GATK tool that might be useful to some. Given a VCF file, this tool will generate simulated reads that support the variants present in the file.

Unified Genotyper

  • Fixed bug when clipping long reads in the HMM; some reads were incorrectly getting clipped.

Variant Recalibrator

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).

Reduce Reads

  • Removed from the GATK. It was a valiant attempt, but ultimately we found a better way to process large cohorts. Reduced BAMs are no longer supported in the GATK.

Variant Annotator

  • Improved the FisherStrand (FS) calculation when used in large cohorts. When the table gets too large, we normalize it down to values that are more reasonable. Also, we don't include a particular sample's contribution unless we observe both ref and alt counts for it. We expect to improve on this even further in a future release.
  • Improved the QualByDepth (QD) calculation when used in large cohorts. Now, when the AD annotation is present for a given genotype, we only use its depth for QD if the variant depth is greater than 1. Note that this only works in the gVCF pipeline for now.
  • In addition, fixed the normalization for indels in QD (which was over-penalizing larger events).

Combine Variants

  • Added the capability to pass in a single file containing a list of VCFs (must end in ".list") instead of having to enumerate all of the files on the command-line. Duplicate entries are not allowed in the list (but the same file can be present in separate lists).
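
For example, instead of repeating -V for every file, one could point -V at a list file (a sketch; file names are placeholders):

    # vcfs.list: one VCF path per line, no duplicate entries
    /data/project/batch1.vcf
    /data/project/batch2.vcf

    java -jar GenomeAnalysisTK.jar \
        -T CombineVariants \
        -R reference.fasta \
        -V vcfs.list \
        -o combined.vcf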

Select Variants

  • Fixed a huge bug where selecting out a subset of samples while using multi-threading (-nt) caused genotype-level fields (e.g. AD) to get swapped among samples. This was a bad one.
  • Fixed a bug where selecting out a subset of samples at multi-allelic sites occasionally caused the alternate alleles to be re-ordered but the AD values were not updated accordingly.


  • Fixed bug where it wasn't checking for underflow and occasionally produced bad likelihoods.
  • It no longer strips out the AD annotation from genotypes.
  • AC/AF/AN counts are updated after fixing genotypes.
  • Updated to handle cases where the AC (and MLEAC) annotations are not good (e.g. they are greater than AN somehow).

Indel Realigner

  • Fixed bug where a realigned read can sometimes get partially aligned off the end of the contig.

Read Backed Phasing

  • Updated the tool to use the VCF 4.1 framework for phasing; it now uses HP tags instead of '|' to convey phase information.


  • Thanks to Phillip Dexheimer for several Queue related fixes and patches.
  • Thanks to Nicholas Clarke for patches to the timer which occasionally had negative elapsed times.
  • Providing an empty BAM list now results in a user error.
  • Fixed a bug in the gVCF writer where it was dropping the first few reference blocks at the beginnings of all but the first chromosome. Also, several unnecessary INFO field annotations were dropped from the output.
  • Logger output now goes to STDERR instead of STDOUT.
  • Picard, Tribble, and Variant jars updated to version 1.107.1683.
Comments (2)

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here.

Note that this release is smaller than previous ones. We are working hard on some new tools and frameworks that we hope to make available to everyone in our next release.

Unified Genotyper

  • Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

Haplotype Caller

  • Improved the indexing scheme for gVCF outputs using the reference calculation model.
  • The reference calculation model now works with reduced reads.
  • Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
  • Fixed bug where homozygous reference records that aren't GVCF blocks were being treated incorrectly.

Variant Recalibrator

  • Disabled tranche plots in INDEL mode.
  • Various VQSR optimizations in both runtime and accuracy. In particular: for very large whole-genome datasets with over 2M variants overlapping the training data, the training set used to build the model is randomly downsampled; annotations are ordered by the difference in means between known and novel variants instead of by their standard deviation; the training set quality score threshold has been removed; the negative model now uses 2 Gaussians by default; and the numBad argument has been removed, with the cutoffs now chosen by the model itself based on the LOD scores.

Reduce Reads

  • Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

Diagnose Targets

  • Added calculation for GC content.
  • Added an option to filter the bases based on their quality scores.

Combine Variants

  • Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

Select Variants

  • Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.


  • SplitSamFile now produces an index with the BAM.
  • Length metric updates to QualifyMissingIntervals.
  • Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
  • Picard jar updated to version 1.104.1628.
  • Tribble jar updated to version 1.104.1628.
  • Variant jar updated to version 1.104.1628.

 Note: There are no release notes available for versions earlier than 2.0.

These are the latest commit messages logged in the Github repository. Commit messages are short summaries that describe the changes made to the codebase. You can view the complete development history here.

Commit date | Summary
2nd November 2015 Update pom versions for the 3.5 release
2nd November 2015 Merge remote-tracking branch 'unstable/master'
2nd November 2015 Merge pull request #1236 from broadinstitute/gvda_prep_M2_release_1201
2nd November 2015 Prep MuTect2 and ContEst for release
2nd November 2015 Merge pull request #1243 from broadinstitute/gvda_rodwalkers_dont_downsample_1179
2nd November 2015 Fix bug where gatkdocs of RodWalkers reported default LocusWalker downsampling settings
2nd November 2015 Merge pull request #1231 from broadinstitute/mf_fixBQSRIntegrationTest
2nd November 2015 Merge pull request #1239 from broadinstitute/gvda_straggler_doc_fixes_1237
2nd November 2015 Merge pull request #1241 from broadinstitute/gvda_undeprecate_FindCoveredIntervals_1240
2nd November 2015 Yet more doc improvements prior to 3.5 release
2nd November 2015 Restore FindCoveredIntervals + add docs
2nd November 2015 Merge pull request #1233 from broadinstitute/rhl_fix_example_bcf_dict
2nd November 2015 Make the header sequence dictionary match reference
2nd November 2015 Merge pull request #1235 from broadinstitute/gvda_deprecate_useless_tools_1192
2nd November 2015 Merge pull request #1209 from broadinstitute/gvda_yet_more_doc_fixes_1195
2nd November 2015 Deprecate tools that were outdated or redundant
2nd November 2015 Assorted documentation fixes, enhancements and reorganization.
2nd November 2015 Merge pull request #1230 from broadinstitute/rhl_htsjdk_picard_1.141
2nd November 2015 Move htsjdk & picard to version 1.141
2nd November 2015 Merge pull request #1223 from broadinstitute/rhl_dict_check


Click here to view older changes on Github

For 2.x releases, we collected the Guide documentation into a Guide Book in PDF format. Each Guide Book release was versioned, so if you performed some analyses with an older version of the GATK, you can go back and look at the documentation that matched that version exactly. Note however that the Tool Documentation (GATKDocs, containing detailed argument lists) was not included in the versioned Guide Book since it can be generated directly from the source code. We currently have no immediate plans to generate PDF Guide Books for 3.x versions due to low apparent interest from the community, but let us know if you think we should resurrect this feature.

 Note: There are no PDF files of the Guide Book available for versions earlier than 2.3-9.