Better late than never (right?), here are the version highlights for GATK 3.2. Overall, this release is essentially a collection of bug fixes and incremental improvements that we wanted to push out to not keep folks waiting while we're working on the next big features. Most of the bug fixes are related to the HaplotypeCaller and its "reference confidence model" mode (which you may know as
-ERC GVCF). But there are also a few noteworthy improvements/changes in other tools which I'll go over below.
The "reference confidence model" workflow, which I hope you have heard of by now, is that awesome new workflow we released in March 2014, which was the core feature of the GATK 3.0 version. It solves the N+1 problem and allows you to perform joint variant analysis on ridiculously large cohorts without having to enslave the entire human race and turning people into batteries to power a planet-sized computing cluster. More on that later (omg we're writing a paper on it, finally!).
You can read the full list of improvements we've made to the tools involved in the workflow (mainly HaplotypeCaller and Genotype GVCFs) in Eric's (unusually detailed) Release Notes for this version. The ones you are most likely to care about are that the "missing PLs" bug is fixed, GenotypeGVCFs now accepts arguments that allow it to emulate the HC's genotyping capabilities more closely (such as
--includeNonVariantSites), the AB annotation is fully functional, reference DPs are no longer dropped, and CatVariants now accepts lists of VCFs as input. OK, so that last one is not really specific to the reference model pipeline, but that's where it really comes in handy (imagine generating a command line with thousands of VCF filenames -- it's not pretty).
The coverage metrics (DP and AD) reported by HaplotypeCaller are now those calculated after the HC's reassembly step, based on the reads having been realigned to the most likely haplotypes. So the metrics you see in the variant record should match what you see if you use the
-bamout option and visualize the reassembled ActiveRegion in a genome browser such as IGV. Note that if any of this is not making sense to you, say so in the comments and we'll point you to the new HaplotypeCaller documentation! Or, you know, look for it in the Guide.
We updated the plotting scripts used by BQSR and VQSR to use the latest version of ggplot2, to get rid of some deprecated function issues. If your Rscripts are suddenly failing, you'll need to update your R libraries.
We're sorry for making you jump through all these hoops recently. As if the switch to Maven wasn't enough, we have now completed a massive reorganization/renaming of the codebase that will probably cause you some headaches when you port your tools to the newest version. But we promise this is the last big wave, and ultimately this will make your life easier once we get the GATK core framework to be a proper maven artifact.
In a nutshell, the base name of the codebase has changed from
gatk (which hopefully makes more sense), and the most common effect is that
sting.gatk classpath segments are now
gatk.tools. This, by the way, is why we had a bunch of broken documentation links; most of these have been fixed (yay symlinks) but there may be a few broken URLs remaining. If you see something, say something, and we'll fix it.
We discovered today that we made an error in the documentation article that describes the RNAseq Best Practices workflow. The error is not critical but is likely to cause an increased rate of False Positive calls in your dataset.
The error was made in the description of the "Split & Trim" pre-processing step. We originally wrote that you need to reassign mapping qualities to 60 using the ReassignMappingQuality read filter. However, this causes all MAPQs in the file to be reassigned to 60, whereas what you want to do is reassign MAPQs only for good alignments which STAR identifies with MAPQ 255. This is done with a different read filter, called ReassignOneMappingQuality. The correct command is therefore:
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
In our hands we see a bump in the rate of FP calls from 4% to 8% when the wrong filter is used. We don't see any significant amount of false negatives (lost true positives) with the bad command, although we do see a few more true positives show up in the results of the bad command. So basically the effect is to excessively increase sensitivity, at the expense of specificity, because poorly mapped reads are taken into account with a "good" mapping quality, where they would normally be discarded.
This effect will be stronger in datasets with lower overall quality, so your results may vary. Let us know if you observe any really dramatic effects, but we don't expect that to happen.
To be clear, we do recommend re-processing your data if you can, but if that is not an option, keep in mind how this affects the rate of false positive discovery in your data.
We apologize for this error (which has now been corrected in the documentation) and for the inconvenience it may cause you.
Calling all Belgians! (and immediate neighbors)
In case you didn't hear of this through your local institutions, I'm excited to announce that we are doing a GATK workshop in Belgium in two weeks (June 24-26 to be precise). The workshop, which is open and free to the scientific community, will be held at the Royal Institute of Natural Sciences in Brussels.
This is SUPER EXCITING to me because as a small child I spent many hours drooling in front of the Institute Museum's stunningly beautiful Iguanodons, likely leaving grubby handprints all over the glass cases, to the shame and annoyance of my parents. I also happen to have attended the Lycee Emile Jacqmain which is located in the same park, right next to the Museum (also within a stone's throw of the more recently added European Parliament) so for me this is a real trip into the past. Complete with dinosaurs!
That said, I expect you may find this workshop exciting for very different reasons, such as learning how the GATK can empower your research and hearing about the latest cutting-edge developments that you can expect for version 3.2.
See this website or the attached flyer for practical details (but note that the exact daily program may be slightly different than announced due to the latest changes in GATK) and be sure to register (it's required for admission!) by emailing cvangestel at naturalsciences.be with your name and affiliation.
Please note that the hands-on sessions (to be held on the third day) are already filled to capacity. The tutorial materials will be available on our website in the days following the workshop.
In a nutshell: if you're using a version of GATK older than 2.7, you need to request a key to disable Phone Home (if you don't already have one). See below for a full explanation.
As you may know, the GATK includes a reporting mechanism called Phone Home that sends us runtime statistics about usage and bugs. These statistics (which you can read more about here) help us make development decisions, e.g. prioritize bug fixes and new features, as well as track adoption of new versions and tools.
The system uses an AWS cloud service as a data repository, called a "bucket", which GATK accesses using an encryption key. We currently have two active AWS keys for the Phone Home bucket. GATK versions 2.6 and older use one AWS key, and versions 2.7 and later use another AWS key.
For practical reasons, we need to deactivate the old AWS key, which means that GATK jobs run using a version older than 2.7 may end with a Phone Home failure unless the system is deactivated. To be clear, the GATK analysis itself will complete successfully, but the GATK may exit with a fail code. This may cause issues in pipelines and potentially fill up error logs.
The best way to avoid these problems is to upgrade to the latest version of GATK, which you should seriously consider anyway in order to get the best possible results out of your analysis. However, if you are unable to upgrade to a recent version, we recommend that you disable the Phone Home system. To do so, you will need to follow these two simple steps:
Request a GATK key using this online form. In the "justification" field, please write "AWS key deprecation", indicate which version of GATK you are using, and if possible let us know why you are unable to upgrade to a recent version. You can expect to receive your key within 1 business day.
Once you have the key, apply it to all GATK jobs by adding
-et NO_ET -K your.key (where
your.key is the path to the key file you obtained from us) to every GATK command line.
We expect this workaround will suppress any issues EXCEPT in versions prior to 1.5, in which the Phone Home system cannot be deactivated. For versions 1.4 and older, there is nothing we can do, and you will have to either put up with the errors, or upgrade to a newer version (which you should really really do anyway!).
Let us know in the comments if you have any trouble or questions.
This may seem crazy considering we released the big 3.0 version not two weeks ago, but yes, we have a new version for you already! It's a bit of a special case because this release is all about the hardware-based optimizations we had previously announced. What we hadn't announced yet was that this is the fruit of a new collaboration with a team at Intel (which you can read more about here), so we were waiting for everyone to be ready for the big reveal.
So basically, the story is that we've started collaborating with the Intel Bio Team to enable key parts of the GATK to run more efficiently on certain hardware configurations. For our first project together, we tackled the PairHMM algorithm, which is responsible for a large proportion of the runtime of HaplotypeCaller analyses. The resulting optimizations, which are the main feature in version 3.1, produce significant speedups for HaplotypeCaller runs on a wide range of hardware.
We will continue working with Intel to further improve the performance of GATK tools that have historically been afflicted with performance issues and long runtimes (hello BQSR). As always, we hope these new features will make your life easier, and we welcome your feedback in the forum!
Note that these optimizations currently work on Linux systems only, and will not work on Mac or Windows operating systems. In the near future we will add support for Mac OS. We have no plans to add support for Windows since the GATK itself does not run on Windows.
Please note also that to take advantage of these optimizations, you need to opt-in by adding the following flag to your GATK command:
Here is a handy little table of the speedups you can expect depending on the hardware and operating system you are using. The configurations given here are the minimum requirements for benefiting from the expected speedup ranges shown in the third column. Keep in mind that these numbers are based on tests in controlled conditions; in the wild, your mileage may vary.
|Linux kernel version||Architecture / Processor||Expected speedup||Instruction set|
|Any 64-bit Linux||Any x86 64-bit||1-1.5x||Non-vector|
|Linux 2.6 or newer||Penryn (Core 2 or newer)||1.3-1.8x||SSE 4.1|
|Linux 2.6.30 or newer||SandyBridge (i3, i5, i7, Xeon E3, E5, E7 or newer)||2-2.5x||AVX|
To find out exactly which processor is in your machine, you can run this command in the terminal:
$ cat /proc/cpuinfo | grep "model name" model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
In this example, the machine has 4 cores (8-threads), so you see the answer 8 times. With the model name (here i7-2600) you can look up your hardware's relevant capabilities in the Wikipedia page on vector extensions.
Alternatively, Intel has provided us with some links to lists of processors categorized by architecture, in which you can look up your hardware:
Finally, a few notes to clarify some concepts regarding Linux kernels vs. distributions and processors vs. architectures:
SandyBridge and Penryn are microarchitectures; essentially, these are sets of instructions built into the CPU. Core 2, core i3, i4, i7, Xeon e3, e5, e7 are the processors that will implement a specific architecture to make use of the relevant improvements (see table above).
The Linux kernel has no connection with Linux distribution (e.g. Ubuntu, RedHat etc). Any distribution can use any kernel they want. There are "default kernels" shipped with each distribution, but that's beyond the scope of this article to cover (there are at least 300 Linux distributions out there). But you can always install whatever kernel version you want.
The kernel version 2.6.30 was released in 2009, so we expect every sane person or IT out there to be using something better than this.
Okay, we realize the name's a bit of a mouthful, and we're willing to tweak it if anyone has any good ideas. But never mind that. It's difficult to overstate the importance of this new approach to joint variant discovery (but I'll give it a shot) so we're really stoked to finally be able to share the details of how it's is going to work in practice.
You're probably going to be surprised at how simple it is in practice (not that it was particularly easy to actually build, mind you). The gory details are in the new document here, but here's an overview of how it looks within the Best Practices workflow you all know and (hopefully) love:
The first surprise is that instead of calling variants on multiple samples, you now just run HaplotypeCaller on each sample individually. "Oh no," I hear you cry, "but the results were so much better when I called multiple samples together!". Well yeah, but it took forever. Bear with me for a minute.
The key here is that you run HaplotypeCaller in gVCF mode. This outputs a so-called genomic VCF, which contains a record of the genotype likelihoods and annotations for every single site in the genome (or exome), whether or not there is evidence of variation. This essentially boils down all the useful information that can be gleaned from the BAM files, and makes it unnecessary to refer back to the BAM in later steps.
So you repeat that for all your samples (which goes reasonably fast since per-sample calling is pretty tractable nowadays). Optionally, you can add in a step to combine gVCF files if you're working on a really large cohort. Then in the next step, you just run a separate genotyping tool on all the gVCFs (or combined gVCFs) together, which gives you the same output (raw SNPs and indel calls) that you would have got from one-step multisample calling.
See, that's the beauty of the new workflow. A lot less work (for the computer) for equivalent results. And the ability to process samples incrementally and perform joint discovery on cohort sizes that would have previously got you hauled off to the funny farm.
Let us know what you think!
We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.
This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.
Please see the new document here for full details about how to run this workflow in practice.
In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.
Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.
Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.
For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.
Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.
We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!
Previously, we covered the spirit of GATK 3.0 (what our intentions are for this new release, and what we’re hoping to achieve). Let’s now have a look at the top three features you can look forward to in 3.0, in no particular order:
At this point everyone knows that the HaplotypeCaller is fabulous (you know this, right?) but beyond a certain number of samples that you’re trying to call jointly, it just grinds to a crawl, and any further movement is on the scale of continental drift. Obviously this is a major obstacle if you’re trying to do any kind of work at scale beyond a handful of samples, and that’s why it hasn’t been used in recent large-cohort projects despite showing best-in-class performance in terms of discovery power.
The major culprit in this case is the PairHMM algorithm, which takes up the lion’s share of HC runtime. With the help of external collaborators (to be credited in a follow-up post) we rewrote the code of the PairHMM to make it orders of magnitude faster, especially on specialized hardware like GPU and FPGA chips (but you’ll still see a speedup on “regular” hardware).
We plan to follow up on this by doing similar optimizations on the other “slowpoke” algorithms that are responsible for long runtimes in GATK tools.
Some problems in variant calling can’t be solved by Daft Punk hardware upgrades (better faster stronger) alone. Beyond the question of speed, a major issue with multi-sample variant discovery is that you have to wait until all the samples are available to call variants on them. Then, if later you want to add some more samples to your cohort, you have to re-call all of them together, old and new. This, also known as the “N+1 problem”, is a huge pain in the anatomy.
The underlying idea of the “single-sample pipeline for joint variant discovery” is to decouple the two steps in the variant calling process: identifying evidence of variation, and interpreting the evidence. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and a heck of a lot faster) on one sample at a time.
The new pipeline allows us to process each sample as it comes off the sequencing machine, up to the first step of variant calling. Cumulatively, this will produce a database of per-sample, per-site allele frequencies. Then it’s just a matter of running a joint analysis on the database, which can be done incrementally each time a new sample is added or at certain intervals or timepoints, depending on the research needs, because this step runs quickly and cheaply.
We’ll go into the details of exactly how this works in a follow-up post. For now, the take-home message is that it’s a “single-sample pipeline” because you do the heavy-lifting per-sample (and just once, ever), but you are empowered to perform “joint discovery” because you interpret the evidence from each sample in light of what you see in all the other samples, and you can do this at any point in the project timeline.
Our Best Practices recommendations for calling variants on DNA sequence data have proved to be wildly popular with the scientific community, presumably because it takes a lot of the guesswork out of running GATK, and provides a large degree of reproducibility.
Now, we’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences the early data processing steps, as well as in the calling step. We do not yet have RNAseq-specific recommendations for variant filtering/recalibration, but will be developing those in the coming weeks.
We’ll go into the details of the RNAseq Best Practices in a follow-up post, but in a nutshell, these are the key differences: use STAR for alignment, add an exon splitting and cleanup step, and tell the variant caller to take the splits into account. The latter involves some new code added to the variant callers; it is available to both HaplotypeCaller and UnifiedGenotyper, but UG is currently missing a whole lot of indels, so we do recommend using only HC in the immediate future.
Keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing. We will be improving these recommendations progressively as we go, and we hope that the researcher community will help us by providing feedback of their experiences applying our recommendations to their data.
Yep, you read that right, the next release of GATK is going to be the big Three-Oh!
You may have noticed that the 2.8 release was really slim. We explained in the release notes, perhaps a tad defensively, that it was because we’d been working on some ambitious new features that just weren’t ready for prime time. And that was true. Now we’ve got a couple of shiny new toys to show for it that we think you’re really going to like.
But GATK 3.0 is not really about the new features (otherwise we’d just call it 2.9). It’s about a shift in the way we approach the problems that we want to solve -- and to some extent, a shift in the scope of problems we choose to tackle.
We’ll explain what this entails in much more detail in a series of blog posts over the next few days, but let me reassure you right now on one very important point: there is nothing in the upcoming release that will disrupt your existing workflows. What it will do is offer you new paths for discovery that we believe will empower research on a scale that has previously not been possible.
And lest you think this is all just vaporware, here’s a sample of what we have in hand right now: variant calling on RNA-Seq, and a multisample variant discovery workflow liberated from the shackles of time and scaling issues.
Stay tuned for details!
Better late than never, here are the highlights of the most recent version release, GATK 2.8. This should be short and sweet because as releases go, 2.8 is light on new features, and is best described as a collection of bug fixes, which are all* dutifully listed in the corresponding release notes document. That said, two of the changes we've made deserve some additional explanation.
* Up to now (this release included) we have not listed updates/patches to Queue in the release notes, but will start doing so from the next version onward.
In the last release (2.7, for those of you keeping score at home) we trumpeted that the old
-percentBad argument of VariantRecalibrator had been replaced by the shiny new
-numBad argument, and that this was going to be awesome for all sorts of good reasons, improve stability and whatnot. Weeeeeeell it turned out that wasn't quite the case. It worked really well on the subset of analyses that we tested it on initially, but once we expanded to different datasets (and the complaints started rolling in on the forum) we realized that it actually made things worse in some cases because the default value was less appropriate than what
-percentBad would have produced. This left people guessing as to what value would work for their particular dataset, with a great big range to choose from and very little useful information to assist in the choice.
So, long story short, we (and by "we" I mean Ryan) built in a new function that allows the VariantRecalibrator to determine for itself the amount of variants that is appropriate to use for the "bad" model depending on the data. So the short-lived
-numBad argument is gone too, replaced by... nothing. No new argument to specify; just let the VariantRecalibrator do its thing.
Of course if you really want to, you can override the default behavior and tweak the internal thresholds. See the tool doc here; and remember that a good rule of thumb is that if you can't figure out which arguments are involved based on that doc, you probably shouldn't be messing with this advanced functionality.
This is still a rather experimental feature, so we're still making changes as we go. The two big changes worth mentioning here are that you can now run this on reduced reads, and that we've changed the indexing routine to optimize the compression level. The latter shouldn't have any immediate impact on normal users, but it was necessary for a new feature project we've been working on behind the scenes (the single-sample-to-joint-discovery pipeline we have been alluding to in recent forum discussions). The reason we're mentioning it now is that if you use
-ERC GVCF output, you'll need to specify a couple of new arguments as well (
-variant_index_type LINEAR and
-variant_index_parameter 128000, with those exact values). This useful little fact didn't quite make it into the documentation before we released, and not specifying them leads to an error message, so... there you go. No error message for you!
That's all for tool changes. In addition to those, we have made a number of corrections in the tool documentation pages, updated the Best Practices (mostly layout, tiny bit of content update related to the VQSR -numBad deprecation) and made some minor changes to the website, e.g. updated the list of publications that cite the GATK and improved the Guide index somewhat (but that's still a work in progress).
Heads up, people: our generous overlords at the Broad Institute are giving us all time off from December 23 until January 1st (included). So we will effectively be off-duty starting tomorrow evening (Friday Dec 20) for the entire Christmas and New Year period, only to return on January 2nd. During that time, we will all be busy stuffing ourselves with food and enjoying the company of our loved ones, and we hope many of you will have the opportunity to do the same, regardless of your cultural affiliations (I for one will be raising a glass of mulled wine in honor of my Gaul ancestors and the winter solstice). For those of you who will be working, I'm afraid no-one from the GATK team will be around to answer forum questions (unless one of us really needs an excuse to get away from the in-laws) so I encourage you to try to answer each other's questions in the meantime. To all, good luck, happy holidays and/or our deepest sympathy, as applicable. See you next year!
We're very pleased to announce that we have finally finished our big rewrite of the Best Practices documentation. We hope that the new format, which you can find here, will prove more user-friendly, searchable and overall more helpful than the previous version.
We have a few more improvements in mind (e.g. a clickable image map of the workflow) and there may be a few bugs here and there to iron out. So please feel free to comment on this announcement and give us feedback, whether flattering or critical, so we can improve it to help you as much as possible.