Better late than never, here is the now-traditional "Highlights" document for GATK version 3.0, which was released two weeks ago. It will be a very short one since we've already gone over the new features in detail in separate articles --but it's worth having a recap of everything in one place. So here goes.
We are delighted to present our new Best Practices workflow for variant calling in which multisample calling is replaced by a winning combination of single-sample calling in gVCF mode and joint genotyping analysis. This allows us to both bypass performance issues and solve the so-called "N+1 problem" in one fell swoop. For full details of why and how this works, please see this document. In the near future, we will update our Best Practices page to make it clear that the new workflow is now the recommended way to go for calling variants on cohorts of samples. We've already received some pretty glowing feedback from early adopters, so be sure to try it out for yourself!
All the cool kids were doing it, so we had to join the party. It took a few months of experimentation, a couple of new tools and some tweaks to the HaplotypeCaller, but you can now call variants on RNAseq with GATK! This document details our Best Practices recommendations for doing so, along with a non-trivial number of caveats that you should keep in mind as you go.
Nice try, but no. This tool is obsolete now that we have the gVCF/reference model pipeline (see above). Note that this means that GATK 3.0 will not support BAM files that were processed using ReduceReads!
We've switched the build system from Ant to Maven, which should make it much easier to use GATK as a library against which you can develop your own tools. And on a related note, we're also making significant changes to the internal structure of the GATK codebase. Hopefully this will not have too much impact on external projects, but there will be a doc very shortly describing how the new build system works and how the codebase is structured.
For reasons that will be made clear in the near future, we decided to hold the previously announced hardware optimizations until version 3.1, which will be released very soon. Stay tuned!
Okay, we realize the name's a bit of a mouthful, and we're willing to tweak it if anyone has any good ideas. But never mind that. It's difficult to overstate the importance of this new approach to joint variant discovery (but I'll give it a shot) so we're really stoked to finally be able to share the details of how it's is going to work in practice.
You're probably going to be surprised at how simple it is in practice (not that it was particularly easy to actually build, mind you). The gory details are in the new document here, but here's an overview of how it looks within the Best Practices workflow you all know and (hopefully) love:
The first surprise is that instead of calling variants on multiple samples, you now just run HaplotypeCaller on each sample individually. "Oh no," I hear you cry, "but the results were so much better when I called multiple samples together!". Well yeah, but it took forever. Bear with me for a minute.
The key here is that you run HaplotypeCaller in gVCF mode. This outputs a so-called genomic VCF, which contains a record of the genotype likelihoods and annotations for every single site in the genome (or exome), whether or not there is evidence of variation. This essentially boils down all the useful information that can be gleaned from the BAM files, and makes it unnecessary to refer back to the BAM in later steps.
So you repeat that for all your samples (which goes reasonably fast since per-sample calling is pretty tractable nowadays). Optionally, you can add in a step to combine gVCF files if you're working on a really large cohort. Then in the next step, you just run a separate genotyping tool on all the gVCFs (or combined gVCFs) together, which gives you the same output (raw SNPs and indel calls) that you would have got from one-step multisample calling.
See, that's the beauty of the new workflow. A lot less work (for the computer) for equivalent results. And the ability to process samples incrementally and perform joint discovery on cohort sizes that would have previously got you hauled off to the funny farm.
Let us know what you think!
Previously, we covered the spirit of GATK 3.0 (what our intentions are for this new release, and what we’re hoping to achieve). Let’s now have a look at the top three features you can look forward to in 3.0, in no particular order:
At this point everyone knows that the HaplotypeCaller is fabulous (you know this, right?) but beyond a certain number of samples that you’re trying to call jointly, it just grinds to a crawl, and any further movement is on the scale of continental drift. Obviously this is a major obstacle if you’re trying to do any kind of work at scale beyond a handful of samples, and that’s why it hasn’t been used in recent large-cohort projects despite showing best-in-class performance in terms of discovery power.
The major culprit in this case is the PairHMM algorithm, which takes up the lion’s share of HC runtime. With the help of external collaborators (to be credited in a follow-up post) we rewrote the code of the PairHMM to make it orders of magnitude faster, especially on specialized hardware like GPU and FPGA chips (but you’ll still see a speedup on “regular” hardware).
We plan to follow up on this by doing similar optimizations on the other “slowpoke” algorithms that are responsible for long runtimes in GATK tools.
Some problems in variant calling can’t be solved by Daft Punk hardware upgrades (better faster stronger) alone. Beyond the question of speed, a major issue with multi-sample variant discovery is that you have to wait until all the samples are available to call variants on them. Then, if later you want to add some more samples to your cohort, you have to re-call all of them together, old and new. This, also known as the “N+1 problem”, is a huge pain in the anatomy.
The underlying idea of the “single-sample pipeline for joint variant discovery” is to decouple the two steps in the variant calling process: identifying evidence of variation, and interpreting the evidence. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and a heck of a lot faster) on one sample at a time.
The new pipeline allows us to process each sample as it comes off the sequencing machine, up to the first step of variant calling. Cumulatively, this will produce a database of per-sample, per-site allele frequencies. Then it’s just a matter of running a joint analysis on the database, which can be done incrementally each time a new sample is added or at certain intervals or timepoints, depending on the research needs, because this step runs quickly and cheaply.
We’ll go into the details of exactly how this works in a follow-up post. For now, the take-home message is that it’s a “single-sample pipeline” because you do the heavy-lifting per-sample (and just once, ever), but you are empowered to perform “joint discovery” because you interpret the evidence from each sample in light of what you see in all the other samples, and you can do this at any point in the project timeline.
Our Best Practices recommendations for calling variants on DNA sequence data have proved to be wildly popular with the scientific community, presumably because it takes a lot of the guesswork out of running GATK, and provides a large degree of reproducibility.
Now, we’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences the early data processing steps, as well as in the calling step. We do not yet have RNAseq-specific recommendations for variant filtering/recalibration, but will be developing those in the coming weeks.
We’ll go into the details of the RNAseq Best Practices in a follow-up post, but in a nutshell, these are the key differences: use STAR for alignment, add an exon splitting and cleanup step, and tell the variant caller to take the splits into account. The latter involves some new code added to the variant callers; it is available to both HaplotypeCaller and UnifiedGenotyper, but UG is currently missing a whole lot of indels, so we do recommend using only HC in the immediate future.
Keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing. We will be improving these recommendations progressively as we go, and we hope that the researcher community will help us by providing feedback of their experiences applying our recommendations to their data.
I am trying to run the HaplotypeCaller on a bam file with multiple samples. It runs successfully without the ERC GVCF option, e.g.
java -jar /home/unix/csmillie/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I test.bam
But when I try running it with the ERC GVCF option, I get an error:
java -jar /home/unix/csmillie/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I test.bam --emitRefConfidence GVCF --sample_name TCGGCTGAGAAC
I am using Java 1.7. I have validated the bam file with Picard. The bam file has the appropriate header, with tab-separated read groups that look like this: @RG ID:3 SM:TCGGCTGAGAAC PL:Illumina
The stack trace is below. If anyone can help I would really appreciate it! I am running this on an interactive node on the Broad cluster, in case it helps with debugging. Thanks!
hw-uger-1001:~/data/csmillie/test $ java -jar /home/unix/csmillie/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I test.bam --emitRefConfidence GVCF --sample_name TCGGCTGAGAAC INFO 09:56:10,853 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:56:10,855 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56 INFO 09:56:10,855 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 09:56:10,855 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 09:56:10,859 HelpFormatter - Program Args: -T HaplotypeCaller -R ref.fasta -I test.bam --emitRefConfidence GVCF --sample_name TCGGCTGAGAAC INFO 09:56:10,877 HelpFormatter - Executing as email@example.com on Linux 2.6.32-573.12.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_71-b14. INFO 09:56:10,877 HelpFormatter - Date/Time: 2016/02/08 09:56:10 INFO 09:56:10,878 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:56:10,878 HelpFormatter - -------------------------------------------------------------------------------- INFO 09:56:11,500 GenomeAnalysisEngine - Strictness is SILENT INFO 09:56:12,598 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 09:56:12,606 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 09:56:12,760 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.15 INFO 09:56:12,972 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 INFO 09:56:13,128 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 09:56:13,732 GenomeAnalysisEngine - Done preparing for traversal INFO 09:56:13,733 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 09:56:13,733 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 09:56:13,734 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime INFO 09:56:13,734 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output INFO 09:56:13,735 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output INFO 09:56:14,806 GATKRunReport - Uploaded run statistics report to AWS S3
java.lang.NullPointerException at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isGVCF(HaplotypeCaller.java:1251) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.initializeReferenceConfidenceModel(HaplotypeCaller.java:728) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.initialize(HaplotypeCaller.java:659) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)