Since the GATK first started gaining traction in the research community ca. 2010, its development has sustained a fairly rapid pace, with a new major version (1, 2 and 3 so far) coming out about every 2 years. Each major version was a straight continuation of the same codebase, distinguished by significantly new tools and capabilities (e.g. HaplotypeCaller in 2.0, the GVCF workflow in 3.0).
This year, we're on track for a new major version, but we're breaking the mold of the classic GATK codebase. Over the past 18 months, in parallel to the ongoing 3.x development effort, we built a brand new GATK engine that is faster, more scalable and can support new types of analysis that weren't possible in the original GATK framework. Now we're hard at work porting the classic GATK tools over to the new framework, as well as developing some new ones (copy number!). The resulting toolkit will be formally released as GATK 4 later this year. If you're keen to try it out, it's already available as an alpha preview; I'll follow up on this with more details soon.
But that's not all. At the same time, we kept working on the GATK 3.x package in order to continue delivering improvements to the research community. Now we're just about ready to release version 3.6 -- nothing yuuugely different, but quite a few bug fixes and feature enhancements (especially in the GVCF workflow tools) that have been widely requested. Again, details to follow. Oh, and it supports Java 8! Which previous versions do not -- it may look like they do because they run on it without crashing, but there could be silent correctness errors. Which are the worst; I prefer a good honest in-your-face run-busting error any day.
So there you have it; version 3.6 coming out sometime next week (-ish), and GATK 4 coming out later this year, probably Fall timeframe. In between, we'll have one last 3.x release, either a patch release (3.6-x) or a proper minor release (3.7) depending on how substantial are the changes involved, to immortalize the last state of the classic GATK before it gets encased in amber.
Today, several members of our extended group are talking at the BioIT World meeting in Boston, and the Broad mothership is putting out a handful of announcements that are related to GATK. Among other communications there's a press release accompanied by a blog post on the Broad Institute blog, which unveil a landmark agreement we have reached with several major cloud vendors. I'd like to take a few minutes to discuss what is at stake, both in terms of what we're doing, and of how this will affect the wider GATK community.
These announcements all boil down to two things: we built a platform to run the Broad's GATK analysis pipelines in the cloud instead of our local cluster, and we're making that platform accessible to the wider community following a "Software as a Service" (SaaS) model.
Now, before we get any further into discussing what that entails, I want to reassure everyone that we will continue to provide the GATK software as a downloadable executable that can be used anywhere, whether locally on your laptop, on your institution's server farm or computing cluster, or on a cloud platform if you've already got that set up for yourself. The cloud-based service we're announcing is just one more option that we're making available for running GATK. And it should go without saying that we'll continue to provide the same level of support as we have in the past to everyone through the GATK forum; our commitment to that mission is absolute and unwavering.
Alright, so what's happening exactly? Read on to find out!
As discussed recently on this very blog, we've been migrating a substantial portion of the Broad's production genomic analysis pipelines to the cloud. This move was motivated in large part by a need for greater elasticity to deal with the onslaught of massive projects periodically hammering our datacenter (I'm looking at you, @dgmacarthur) as well as a drive toward increased cost-efficiency. But it was also a recognition that the mind-boggling rate at which genomic data is generated (roughly doubling every 8 months!) means we have to adapt how we share and interact with these frankly staggering amounts of data.
To that end, we've been working elbow to elbow with Google engineers for the past eighteen months; in short, they taught us how to cloud and we taught them how to genome. Together we built a system capable of operating our GATK Best Practices production pipelines at scale on the Google Cloud Platform (GCP), using Cromwell and WDL to define and execute the actual workflows. We've also been working closely with a team from the Intel Life Sciences division to solve some of the key challenges involved in scaling up to the next order of dataset magnitude, resulting in a new kind of database that will enable us to perform joint calling on tens, even hundreds of thousands of genomes at a time.
We're already running the Broad's whole genomes on this new platform, and eventually we plan to migrate most if not all our research pipelines (exomes, RNA etc) as well. As a corollary, all of the analysis results produced by the cloud-based pipeline are delivered to researchers through cloud-based workspaces within which they can kick off further analyses. That way, what happens on the cloud stays on the cloud, as far into the process as possible (in part to minimize egress charges).
From my perspective the most immediate upshot of this is that it finally puts us within reach of the holy grail of reproducibility: given the pipeline WDL scripts and resource datasets (both of which we plan to share freely) anyone will be able to reproduce our pipeline processing on their own instance of the Google Cloud with complete independence.
That being said, standing up and administering your own cloud-based service is not exactly trivial, and we know there's a lot of demand for push-button solutions, so we built our system to double as a Software as a Service (SaaS) platform that we can make publicly available for the convenience of the wider community. We plan to make this service accessible to everyone, Broadies and non-Broadies alike, including commercial/for-profit organizations, under the same conditions. Exact pricing has yet to be determined, but it will certainly include the cloud vendor's compute costs, and there will be no separate licensing cost for for-profit use.
We're also opening resale of GATK as a service to commercial SaaS vendors in order to maximize the options available to the community. Illumina has signed on as the first to offer GATK as a service through BaseSpace via Cromwell+WDL, and we're working with all the major cloud computing vendors mentioned above to ensure that the Cromwell+WDL pipelining solution will work as seamlessly and cost-effectively on their platforms as it does today on Google Cloud Platform.
Our ultimate goal here is to reduce the amount of effort that goes into standing up and maintaining implementations of GATK Best Practices worldwide, so that all those resources can be refocused on more interesting work. Personally, I expect that these new developments will contribute to making the GATK Best Practices more readily accessible and affordable to all, and I'm looking forward to being able to announce the availability of the new service later this year!
Today I'm delighted to introduce WDL, pronounced widdle, a new workflow description language that is designed from the ground up as a human-readable and -writable way to express tasks and workflows.
As a lab-grown biologist (so to speak), I think analysis pipelines are amazing. The thought that you can write a script that describes a complex chain of processing and analytical operations, then all you need to do every time you have new data is feed it through the pipe, and out come results -- that's awesome. Back in the day, I learned to run BLAST jobs by copy-pasting gene sequences into the NCBI BLAST web browser window. When I found out you could write a script that takes a collection of sequences, submit them directly to the BLAST server, then extracts the results and does some more computation on them… mind blown. And this was in Perl, so there was pain involved, but it was worth it -- for a few weeks afterward I almost had a social life! In grad school! Then it was back to babysitting cell cultures day and night until I could find another excuse to do some "in silico" work. Where "in silico" was the hand-wavy way of saying we were doing some stuff with computers. Those were simpler times.
So it's been kind of a disappointment for the last decade or so that writing pipelines was still so flipping hard. I mean smart, robust pipelines that can understand things like parallelism, dependencies of inputs and outputs between tasks, and resume intelligently if they get interrupted. Sure, in the GATK world we have Queue, which is very useful in many respects, but it's tailored to specific computing cluster environments like LSF, and writing scala scripts is really not trivial if you're new to it. I wouldn't call Queue user-friendly by a long shot. Plus, we only really use it internally for development work -- to run our production pipelines, we've actually been using a more robust and powerful system called Zamboni. It's a real workhorse, and has gracefully handled the exponential growth of sequencer output up to this point, but it's more complicated to use -- I have to admit I haven't managed to wrap my head around how it works.
Fortunately I don't have to try anymore: our engineers have developed a new pipelining solution that involves WDL, the Workflow Description Language I mentioned earlier, and an execution engine called Cromwell that can run WDL scripts anywhere, whether locally or on the cloud.
It's portable, super user-friendly by design, and open-source (under BSD) -- we’re eager to share it with the world!
In a nutshell, WDL was designed to make it easy to write scripts that describe analysis tasks and chain those tasks into workflows, with built-in support for advanced features like scatter-gather parallelism. This came at the cost of some additional headaches for the engineering team when they built Cromwell, the execution engine that runs WDL scripts -- but that was a conscious design decision, to push off the complexity onto the engineers rather than leave it in the lap of pipeline authors.
You can learn more about WDL at https://software.broadinstitute.org/wdl/, a website we put together to provide detailed documentation on how to write WDL scripts and run them on Cromwell. It's further enriched by a number of tutorials and real GATK-focused workflows provided as examples, and in the near future we'll also share our GATK Best Practices production pipeline scripts.
Yes, I just said that, we're going to be sharing full-on production scripts! In the past we refused to share scripts because they weren't portable across systems and they were difficult to read, so we were concerned that sharing them would generate more support burden than we could handle. But now with that we have a solution that is both portable and user-friendly, we're not so worried -- and we're very excited about this as an opportunity to improve reproducibility among those who adopt our Best Practices in their work.
Of course it won't be the first time someone comes up with a revolutionary way to do things that ends up not working all that well in the real world. To that I say -- have you ever heard of the phrase "eating your own dog food", or "dogfooding" for short? It refers to a practice in software engineering where developers use their own product for running critical systems. I'm happy to say that we are now dogfooding Cromwell and WDL in our own production pipelines; a substantial part of the Broad's own analysis work, which involves processing terabytes worth of whole genome sequencing data, is now done on the cloud using Cromwell to execute pipelines written entirely in WDL.
That's how we know it works at scale, it's solid, and it does what we need it to do. That being said, there are a few additional features that are defined in WDL but not yet fully implemented in Cromwell, as well as a wishlist of convenience functions requested by our own methods development team, which the engineering team will pursue over the coming months. So its capabilities will keep growing as we add convenience functions to address our own and the wider community's pipelining needs.
Finally, I want to reiterate that WDL itself is meant to be very accessible to people who are not engineers or programmers; our own methods development team and various scientific collaborators have all commented that it's very easy to write and understand. So please try it out, and let us know what you think in the WDL forum!
Cancer pipelines, copy number variation, GATK in the Cloud... We have lots of exciting new features coming down the pipe, and you'll be hearing more about all of them in the weeks and months to come. There are one or two announcements that I guarantee will knock some socks off!
Yet in all that excitement (alright, I'm the excited one, you're just trying to see through the thick mist of vaporware I just spouted) let's not forget that we currently have in hand a toolkit that is used by thousands of you every day, containing dozens of tools that perform complicated operations on ridiculously large datasets.
Are they perfect? No. They're pretty darn good, if we do say so ourselves, but they have their quirks. Sometimes even bugs. Aah! The humanity!
So let's talk about problems. Let's talk about what trips people up and bogs down genomic research on a day-to-day basis. Setting aside the big stuff, like compute resources, scaling and performance blockers, which are all interesting topics for another time... Let's talk about all the insultingly small, aggravating, pulling-your-hair-out, this-should-work-why-isn't-this-stupid-thing-working kind of problems.
Because you know what, most of them are solved problems. Yes, I'm saying that most of your problems (related to GATK, that is -- any unresolved childhood issues are yours to deal with) have a solution already, sitting either in the collective consciousness of the GATK team, or in a deep dark corner of our humongous and reportedly intimidating pile of documentation.
We've made some important headway dealing with this through the forum, but I think we can do better. By blogging.
No, I'm serious. The blog format is great for communicating the kind of content that we're currently having a hard time getting across effectively. If you're not convinced, I'm not going explain why I think so; I'm just going to show you. Or rather, I'm going to let my team show you.
Comms Team, assemble!
Technically we've had this blog for years, but we've only really used it to do what -- announce releases, workshop schedules, and post holiday notices? That's not a blog, that's a bulletin board. Which is so nineties.
The last GATK 3.x release of the year 2015 has arrived!
The major feature in GATK 3.5 is the eagerly awaited MuTect2 (beta version), which brings somatic SNP and Indel calling to GATK. This is just the beginning of GATK’s scope expansion into the somatic variant domain, so expect some exciting news about copy number variation in the next few weeks! Meanwhile, more on MuTect2 awesomeness below.
In addition, we’ve got all sorts of variant context annotation-related treats for you in the 3.5 goodie bag -- both new annotations and new capabilities for existing annotations, listed below.
In the variant manipulation space, we enhanced or fixed functionality in several tools including LeftAlignAndTrimVariants, FastaAlternateReferenceMaker and VariantEval modules. And in the variant calling/genotyping space, we’ve made some performance improvements across the board to HaplotypeCaller and GenotypeGVCFs (mostly by cutting out crud and making the code more efficient) including a few improvements specifically for haploids. Read the detailed release notes for more on these changes. Note that GenotypeGVCFs will now emit no-calls at sites where RGQ=0 in acknowledgment of the fact that those sites are essentially uncallable.
We’ve got good news for you if you’re the type who worries about disk space (whether by temperament or by necessity): we finally have CRAM support -- and some recommendations for keeping the output of BQSR down to reasonable file sizes, detailed below.
Finally, be sure to check out the detailed release notes for the usual variety show of minor features (including a new Queue job runner that enables local parallelism), bug fixes and deprecation notices (a few tools have been removed from the codebase, in the spirit of slimming down ahead of the holiday season).
MuTect2 is the next-generation somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller.
The original MuTect (Cibulskis et al., 2013) was built on top of the GATK engine by the Cancer Genome Analysis group at the Broad Institute, and was distributed as a separate package. By all accounts it did a great job calling somatic SNPs, and was part of the winning entries for multiple DREAM challenges (including some submitted by groups outside the Broad). However it was not able to call indels; and the less said about the indel caller that accompanied it (first named SomaticIndelDetector then Indelocator) the better.
This new incarnation of MuTect leverages much of the HaplotypeCaller’s internal machinery (including the all-important graph assembly bit) to call both SNPs and indels together. Yet it retains key parts of the original MuTect’s internal genotyping engine that allow it to model somatic variation appropriately. This is a major differentiation point compared to HaplotypeCaller, which has expectations about ploidy and allele frequencies that make it unsuitable for calling somatic variants.
As a convenience add-on to MuTect2, we also integrated the cross-sample contamination estimation tool ContEst into GATK 3.5. Note that while the previous public version of this tool relied on genotyping chip data for its operation, this version of the tool has been upgraded to enable on-the-fly genotyping for the case where genotyping data is not available. Documentation of this feature will be provided in the near future. Both MuTect2 and ContEst are now featured in the Tool Documentation section of the Guide. Stay tuned for pipeline-level documentation on performing somatic variant discovery, to be added to the Best Practices docs in the near future.
Please note that this release of MuTect2 is a beta version intended for research purposes only and should not be applied in production/clinical work. MuTect2 has not yet undergone the same degree of scrutiny and validation as the original MuTect since it is so new. Early validation results suggest that MuTect2 has a tendency to generate more false positives as compared to the original MuTect; for example, it seems to overcall somatic mutations at low allele frequencies, so for now we recommend applying post-processing filters, e.g. by hard-filtering calls with low minor allele frequencies. Rest assured that data is being generated and the tools are being improved as we speak. We’re also looking forward to feedback from you, the user community, to help us make it better faster.
Finally, note also that MuTect2 is distributed under the same restricted license as the original MuTect; for-profit users are required to seek a license to use it (please email email@example.com). To be clear, while MuTect2 is released as part of GATK, the commercial licensing has not been consolidated under a single license. Therefore, current holders of a GATK license will still need to contact our licensing office if they wish to use MuTect2.
Whew that was a long wall of text on MuTect2, wasn’t it. Let’s talk about something else now. Annotations! Not functional annotations, mind you -- we’re not talking about e.g. predicting synonymous vs. non-synonymous mutations here. I mean variant context annotations, i.e. all those statistics calculated during the variant calling process which we mostly use to estimate how confident we are that the variants are real vs. artifacts (for filtering and related purposes).
So we have two new annotations, BaseCountsBySample (what it says on the can) and ExcessHet (for excess heterozygosity, i.e. the number of heterozygote calls made in excess of the Hardy-Weinberg expectations), as well as a set of new annotations that are allele-specific versions of existing annotations (with
AS_ prefix standing for Allele-Specific) which you can browse here. Right now we’re simply experimenting with these allele-specific annotations to determine what would be the best way to make use of them to improve variant filtering. In the meantime, feel free to play around with them (via e.g. VariantsToTable) and let us know if you come up with any interesting observations. Crowdsourcing is all the rage, let’s see if it gets us anywhere on this one!
We also made some improvements to the StrandAlleleCountsBySample annotation, to how VQSR handles MQ, and to how VariantAnnotator makes use of external resources -- and we fixed that annoying bug where default annotations were getting dropped. All of which you can read about in the detailed release notes.
CRAM support! Long-awaited by many, lovingly implemented by Vadim Zalunin at EBI and colleagues at the Sanger Institute. We haven’t done extensive testing, and there are a few tickets for improvements that are planned at the htsjdk level -- but it works well enough that we’re comfortable releasing it under a beta designation. Meaning have fun with it, but do your own thorough testing before putting it into production or throwing out your old BAMs!
Static binning of base quality scores. In a nutshell, binning (or quantizing) the base qualities in a BAM file means that instead of recording all possible quality values separately, we group them into bins represented by a single value (by default, 10, 20, 30 or 40). By doing this we end up having to record fewer separate numbers, which through the magic of BAM compression yields substantially smaller files. The idea is that we don’t actually need to be able to differentiate between quality scores at a very high resolution -- if the binning scheme is set up appropriately, it doesn’t make any difference to the variant discovery process downstream. This is not a new concept, but now the GATK engine has an argument to enable binning quality scores during the base recalibration (BQSR) process using a static binning scheme that we have determined produces optimal results in our hands. The level of compression is of course adjustable if you’d like to set your own tradeoff between compression and base quality resolution. We have validated that this type of binning (with our chosen default parameters) does not have any noticeable adverse effect on germline variant discovery. However we are still looking into some possible effects on somatic variant discovery, so we can’t yet recommend binning for that application.
Disable indel quality scores. The Base Recalibration process produces indel quality scores in addition to the regular base qualities. They are stored in the BI and BD tags of the read records, taking up a substantial amount of space in the resulting BAM files. There has been a lot of discussion about whether these indel quals are worth the file size inflation. Well, we’ve done a lot of testing and we’ve now decided that no, for most use cases the indel quals don’t make enough of a difference to justify the extra file size. The one exception to this is when processing PacBio data, it seems that indel quals may help model the indel-related errors of that technology. But for the rest, we’re now comfortable recommending the use of the
--disable_indel_quals argument when writing out the recalibrated BAM file with PrintReads.
These are the materials that were presented at the November 2015 GATK workshop at the Broad Institute in Cambridge, MA.
|Slide decks presented on Day 1||Google Drive Folder|
|Workshop handout document (agenda and talk abstracts)||PDF on Google Drive|
|Variant Discovery Tutorial (Day 2 AM) (=ASHG15 Tutorial)||Google Drive Folder|
|Callset Filtering and Evaluation Tutorial (Day 2 PM)||Google Drive Folder|
We are scheduled to do a hands-on workshop at the ASHG 2015 meeting in Baltimore (see below for program details).
The workshop files are available here:
Attendees will receive a printout of the worksheet at the workshop. We will not provide printouts of the appendix document.
If you are registered to attend, you must have downloaded the materials and followed the instructions before the workshop starts, otherwise you will not be able to follow along and your workshop experience will be unsatisfying. We certainly don't want that to happen, so be sure to do your homework as follows:
ASGH15_GATKdirectory, look at the contents and read the extremely brief
Go over the second part of the
ASHG2015Tutorial-Appendix document to acquaint yourself with the technical requirements of the workshop and follow the detailed installation instructions. In particular:
If you are new to the command line or GATK, we strongly recommend reviewing the sections of the document about tool syntax (pages 8-10) and practicing basic Unix commands. You may also benefit from spending some time gaining familiarity with IGV.
Room 343, Level 3; Convention Center
7:15am - 8:45am
Fri, Oct 09
Accredited session: 1.5
By invitation or pre-registration only
The Genome Analysis Toolkit or GATK, developed at the Broad Institute, is currently one of the most widely used software tools for variant discovery and genotyping in whole genome and exome data. This workshop will present the core “Best Practices” developed by the GATK team, which are crucial to getting the most accurate and meaningful results out of any variant discovery analysis. The Best Practices workflow is a complete solution for variant discovery that goes from data preparation to variant calling and filtering. We will focus especially on the latest methodological innovations that can enable researchers to obtain best-in-class results from their data. We will cover the underlying principles of the methods that are utilized, as well as how to apply them in practice, demonstrating usage and walking attendees through the process on a real data set. The workshop will include a “hands-on” component (involving a downloadable test data set and course materials) so that attendees who wish to do so can follow along and run the commands themselves on their laptop. In addition to the principal instructor, teaching assistants will be available to answer questions during the hands-on parts. Through this workshop, attendees will be empowered to leverage the power of GATK and apply our Best Practices variant discovery workflow in their own research.
Workshop requirements: You must bring your laptop. Your laptop should have full battery power and must have a wireless card.
To attend this event, a separate ticket must be purchased during the meeting registration process. Limited tickets are available and sell out early. Tickets will not be available for sale on site. Registration cost includes breakfast.
This patch release fixes several issues that we felt were annoying enough to warrant an interim version release; see detailed list below.
Oh, and in case you're curious about the patch number, the *-46 numeral refers to the number of code commits made since the 3.4-0 release. Out of this, about a dozen are related to the bug fixes listed above; another dozen relate to new work that is not yet publicly available (gotta wait 'til 3.5...) and the rest are automated commits produced when merging completed work into the master codebase.
In version 3.4, we introduced some functionality to handle cases where a site that is variant in one sample is covered by a deletion in another sample (see release notes for details). Some of that functionality involved using a new symbolic
<*:DEL> notation to represent this in multi-sample VCFs. Unfortunately we later realized that this was not the best way to represent this, so we have now switched to a more spec-compliant notation, which is simply
*. We've also added some refinements to correctly handle cases where alleles need to be remapped, and/or where there are multiple overlapping spanning deletions in the population (which we encountered twice in a population of ~10K samples, to give you an idea of how rarely that happens in humans). The new code is fully backward-compatible, so if you have any files that contain the
<*:DEL> notation, these will be understood correctly by GATK. Note however that they will not be converted to the new notation -- they will just be passed through unchanged.
Some annoying gcc compiler version issues caused many of you to be unable to utilize Intel's accelerated PairHMM library to speed up HaplotypeCaller. We've sorted those out so now everyone should be able to utilize it provided their computing platform is running reasonably recent (i.e. not Jurassic-era) software.
There was a minor bug in the ReadBackedPhasing tool that was causing some variants to be merged into MNPs when they shouldn't be; that's now fixed.
A user reported a dramatic slowdown in the DepthOfCoverage and shortly thereafter, provided a code patch that resolved to problem. Self-fixing bug reports are our favorite kind, thank you @mnw21cam!
OK, technically not a bug fix, so this shouldn't be in a patch, according to my esteemed software engineer colleagues, but I couldn't wait to share it with y'all. This is something people often ask how to do and until now it was beyond painful. But hark! You can now tell SelectVariants to filter out variants where a given number or proportion of samples have filtered genotypes. See the usage example in the tool doc (toward the end of the list). Now tell me you're not happy that we included it in the patch. C'mon, I dare you.
Again with a not-a-bug-fix thing, sorry -- there's a new read filter to get rid of reads that have too much soft-clipping going on to be honest. This is a sign that there's something seriously wrong with them, as we recently found out while analyzing a bunch of cheek swab samples that turned out to be heavily contaminated with bacterial DNA. It's always fun to see some seriously messed-up samples go by... and it's nice that now we can do something to rescue them.
We made some minor changes to documentation (VQSLOD, QD) and error handling (clarified error message for Contigs Out of Order error, made GVCF indexing parameters not required for gzipped output).
It’s an exciting week here at GATK HQ because we finally get to tell you about this very cool project we’ve been working on: today, Broad and Google announced plans to make GATK available as a service on Google Cloud Platform as part of Google Genomics!
Our goal with this project is to enable researchers around the world to run the latest version of the GATK Best Practices pipeline without the hassle of figuring out the plumbing or storing huge amounts of genomic data locally. The initial alpha rollout will be limited to a small set of users. Based on our experience supporting the GATK user community, we think this will dramatically lower the barriers to top-notch genomic analysis capabilities, especially for researchers who don’t have access to the kind of dedicated compute infrastructure and engineering teams required for analyzing genomic data at scale.
Rest assured, if you’re the do-it-yourselfer type, the DIY GATK option is still here; you’ll still be able to download the latest version of GATK from the Broad website, set it up on your own system and run it locally. We will continue documenting and supporting this option with the same level of dedication as ever.
You can read more about the details in the Broad Institute press release and on the Google Cloud Platform blog. Feel free to ask questions in the comments thread of course. We’ll keep you posted here as we move forward.
Folks, I’m all out of banter for this one, so let’s go straight to the facts. GATK 3.4 contains a shedload of improvements and bug fixes, including some new functionality that we hope you’ll find useful. The full list is available in the detailed release notes.
None of the recent changes involves any disruption to the Best Practice workflow (I hear some sighs of relief) but you’ll definitely want to check out the tweaks we made to the joint discovery tools (HaplotypeCaller, CombineGVCFs and GenotypeGVCFs), which are rapidly maturing as they log more flight time at Broad and in the wild.
Let’s start at the very beginning with HaplotypeCaller (a very good place to start). On the usability front, we’ve finally given in to the nigh-universal complaint about the required variant indexing arguments (
--variant_index_type LINEAR --variant_index_parameter 128000) being obnoxious and a waste of characters. So, tadaa, they are no longer required, as long as you name your output file with the extension
.g.vcf so that the engine knows what level of compression to use to write the gVCF index (which leads to better performance in downstream tools). We think this naming convention makes a lot of sense anyway, as it’s a great way to distinguish gVCFs from regular VCFs on sight, so we hope most of you will adopt it. That said, we stopped short of making this convention mandatory (for now…) so you don’t have to change all your scripts and conventions if you don’t want to. All that will happen (assuming you still specify the variant index parameters as previously) is that you’ll get a warning in the log telling you that you could use the new convention.
Where we’ve been a bit more dictatorial is that we’ve completely disabled the use of
-dcov with HaplotypeCaller because it was causing very buggy behavior due to an unforeseen complication in how different levels of downsampling are applied in HaplotypeCaller. We know that the default setting does the right thing, and there’s almost no legitimate reason to change it, so we’re disabling this for the greater good pending a fix (which may be a long time coming due to the complexity of the code involved).
Next up, CombineGVCFs gets a new option to break up reference blocks at every N sites. The new argument
--breakBandsAtMultiplesOf Nwill ensure that no reference blocks in the combined gVCF span genomic positions that are multiples of N. This is meant to enable scatter-gather parallelization of joint genotyping on whole-genome data, as a workaround to some annoying limitations of the GATK engine that make it unsafe to use
-L intervals that might start within the span of a block record. For exome data, joint genotyping can easily be parallelized by scatter-gathering across exome capture target intervals, because we know that there won’t be any hom-ref block records spanning the target interval boundaries. In contrast, in whole-genome data, there is no equivalent predictable termination of block records, so it’s not possible to know up front where it would be safe to set scatter-gather interval start and end points -- until now!
And finally, GenotypeGVCFs gets an important bug fix, and a very useful new annotation.
The bug is something that has arisen mostly (though not exclusively) from large cohort studies. What happened is that, when a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, GenotypeGVCFs would emit a homozygous reference genotype for sample B at that position -- which is obviously incorrect. The fix is that now, sample B will be genotyped as having a symbolic
<*:DEL> allele representing the deletion.
The new annotation is called
RGQ for Reference Genotype Quality. It is a new sample-level annotation that will be added by GenotypeGVCFs to monomorphic sites if you use the
-allSites argument to emit non-variant sites to the output VCF. This is obviously super useful for evaluating the level of confidence of those sites called homozygous-reference.
This new coverage analysis tool is designed to count read depth in a way that is appropriate for allele-specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. The default output format produced by this tool is a structured text file intended to be consumed by the statistical analysis toolkit MAMBA. A paper by Stephane Castel and colleagues describing the complete ASE analysis workflow is available as a preprint on bioarxiv.
We’ve added two new documentation resources to the Guide.
One is a new category of documentation articles called Common Problems, to cover topics that are a specialized subset of FAQs: problems that many users encounter, which are typically due to misunderstandings about input requirements or about the expected behavior of the tools, or complications that arise from certain experimental designs. This category is being actively worked on and we welcome suggestions of additional topics that it should cover.
The second is an Issue Tracker that lists issues that have been reported as well as features or enhancements that have been requested. If you encounter a problem that you think might be a bug (or you have a feature request in mind), you can check this page to see if it’s something we already know about. If you have submitted a bug report, you can use the issue tracker to check whether your issue is in the backlog, in the queue, or is being actively worked on. In future we’ll add some functionality to enable voting on what issues or features should be prioritized, so stay tuned for an announcement on that!
The presentation slides from the 2015 "GATK in the UK" workshop (April 20-24) are available on DropBox here.
All slides will ultimately be posted in the Presentations section of the Guide.
We have some important news to share with you regarding the licensing of GATK and MuTect. The licensing agreement between us and Appistry will end effective April 15, 2015; from that point on, the tools will continue to be licensed through Broad for commercial entities that will be running the GATK and MuTect internally or as part of their own hardware offering. Current licensed users will transition to Broad Institute when their current license expires.
For our academic and non-profit GATK and MuTect users, the licensing transition will be essentially transparent. You will still be able to use the GATK and MuTect for free, and access the source code through the existing public repository. The support forum and documentation website will also remain operational and freely accessible to all as they been previously.
Since our commercial users will now get their license --and their support!-- directly from Broad, they can expect to see some clear benefits:
We have heard from our licensed user community that you would like greater access to the GATK support team. Licensing through Broad will remove intermediaries and allow all GATK and MuTect users --including those who purchase a license-- more direct contact with and support from our team.
Most current tools. Getting your GATK license through Broad license will give you access to the most cutting-edge tools and features available without sacrificing support. Under Broad licensing, you will still have the option of purchasing a license for either of two packages, “GATK” or “GATK + Cancer Tools”.
Our development team is driven by the goal of building tools to enable better science and push the boundaries of genome analysis. Revenue from GATK and MuTect licensing enables these goals by directly feeding into GATK and MuTect development, in the form of critical codebase maintenance and bug fixing work, as well as expansion of the support team. This enables us to keep pace with the growth of the user community and the ever-increasing demand for GATK and MuTect support.
This is a significant new milestone in the life of GATK and MuTect, and we recognize that there are going to be a lot of questions and discussions on this topic since it will affect many of you in the research community. We’ve put together some FAQs (below the fold) that we hope will answer your most pressing questions; feel free to comment and suggest additional points that you think should be covered there.
Note that we are still working on defining some of the finer points of the support model and pricing structure, so we can’t address those quite yet -- but feel free to email firstname.lastname@example.org if you have some burning question and/or concern that you’d like to discuss regarding licensing and/or pricing in particular. Rest assured that once the model has been finalized, we will make the full details (including pricing) available on our website in order to ensure full transparency.
Academic/non-profit users: No change. The licensing terms remain the same and the GATK remains free to use for not-for-profit research and educational purposes. The current free user support model will remain available through the online forum.
Currently licensed Commercial/for-profit users: Appistry will continue to provide full GATK support for the remainder of your current license term. After that point, Broad Institute can offer you a GATK license directly. This will offer you immediate access to the latest version of GATK. We can work directly with you on your specific licensing questions at email@example.com. For support questions, GATK product upgrade information or other suggestions, please comment in the discussion below or send us a private message.
Prospective commercial/for-profit users: Broad Institute can offer you a GATK license directly. This will offer you immediate access to the latest version of GATK. We can work directly with you on your specific licensing questions at firstname.lastname@example.org. For support questions, GATK product upgrade information or other suggestions, please comment in the discussion below or send us a private message.
No. There will only be one version for all users. We will provide our licensed users total support for the very latest version. This means they will always be able to use the most cutting-edge tools and features available without sacrificing support.
Part of the Broad Institute’s mission is to share our tools and research as broadly as possible to enable others to do transformational research. We started developing GATK several years ago and, since then, have constantly upgraded it, thanks to the hard work and dedication of many talented programmers, developers and genomic researchers in our group and beyond. That is why GATK remains the most advanced, accurate and reliable toolkit for variant discovery available anywhere (if we do say so ourselves). But please understand that building, maintaining, testing and constantly improving GATK is neither easy nor free. This is why we charge commercial users a licensing fee and funnel these resources back into upgrades of the tool itself – it allows us to continue to offer GATK for free to academic and non-profit organizations while ensuring it is always the best-of-the-best in an emerging and rapidly-changing field of research.
In addition to the GATK package, we will also offer a package that bundles GATK with MuTect and ContEst. That package will not include the SomaticIndelDetector, but a replacement for that functionality is in preparation.
No. Tools originating from the Picard toolkit will remain free and fully open-source for everyone. We are preparing to integrate them into a part of GATK that will be under a BSD license.
We are preparing to enable this in order to facilitate sharing of scientific methods, but we need input from the community first. To that end, we’d like to invite researchers who are developing or have developed such pipeline images to contact us in order to discuss options. We are envisioning simple technical solutions to ensure that users of these images are made fully aware of their own legal responsibility relative to the GATK licensing status, in a way that minimizes the burden on the researchers who distribute them.
The slides from the 2015 BroadE GATK Best Practices workshop presentations are accessible at this Dropbox link.
The video recordings of the workshop talks are online here. We'll post links to the videos (along with copies of the corresponding slides) in the Presentations section of the Guide. They will also be available on the Broad's YouTube and iTunesU channels.
We're going to be doing two back-to-back workshops in Edinburgh and Cambridge (the original, accept no substitutes) later this Spring, on April 20-21 and 23-24 respectively. The workshop program for both will be our typical one-day Best Practices lectures marathon followed by a half-day of lectures on supplemental topics (QC, non-humans, etc) and a half-day hands-on sessions for beginners to get their hands dirty with some real data.
Cheers to our hosts and we hope to see lots of you there!
bonus points to whoever gets the title reference -- and sings it in the correct tune
Warning: the following content may shock or distress our more sensitive users, as we discuss the cold-blooded elimination of some tools from the GATK.
Alright, now that I've got your attention (hopefully -- if not, what does it take?), here's the deal. We have got to a point where the GATK is a widely, even massively used toolkit (thanks to you, dear users). And it's pretty darn robust -- it's what the Broad's Genomic Platform uses in production to churn out exomes like there's no tomorrow. But it has technical limitations that are 1) a frequent source of pain on your end and 2) increasingly hampering development of new methods on our end.
The good news is that we have a plan for addressing (read: blasting away) these limitations. But part of this plan will involve streamlining GATK by getting rid of tools that are not useful or are inferior to alternative tools from other packages that we're not trying to compete with (e.g. Picard tools).
Some tools that are safe from elimination: all the tools used in the Best Practices, and a couple of utilities that we use a lot ourselves. But everything else is up for review -- and that's where you come in: we need your input to decide what to keep, what to throw away, and what to consider rewriting from scratch (yep, this is an option).
This link will take you to a SurveyMonkey page that lists the tools currently on the chopping block:
Act now to save your favorite non-BP tools! Or help us get rid of the crud. Whichever way you want to look at it, we appreciate your feedback!
We all know how HaplotypeCaller analyses can take a long time. IBM is now providing a native implementation of the PairHMM algorithm that leverages the new hardware available in their POWER8 systems. This optimization currently work on the following systems: Ubuntu14 and RHEL7 with POWER8.
To take advantage of this optimization, you need to do the following:
Here is an example for running on a P8 system with Ubuntu:
java -Xmx32g -Djava.library.path=/path/to/PairHMM_P8_Ubuntu -jar $GATK_PATH/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R $REFERENCE -I $INPUT_BAM --dbsnp $SNP_VCF \ -stand_emit_conf 10 -stand_call_conf 50 \ --pair_hmm_implementation VECTOR_LOGLESS_CACHING \ -o $OUTPUT_VCF
The latest version of the library accelerates HaplotypeCaller using the same floating precision as Java on POWER8. Also, exploiting Simultaneous Multithreading (SMT) along with the vector instructions, it can accelerate HaplotypeCaller more than the previous version, in particular, in the single-thread mode (no
-nct option is specified). For example, the PairHMM computation, which the library accelerates, consumes about a half of the HaplotypeCaller runtime in the single-thread mode, 1.88x speed-up will be expected. The library uses 37% of the available processors by default. The number of threads can be tuned by setting the environment variable
PHMM_N_THREADS. The source code is available here.
If you have any questions or issues (aside from downloading the file), please contact Yinhue Cheng at IBM (email@example.com) or Takeshi Ogasawara at IBM Japan (TAKESHI@jp.ibm.com).
Disclaimer: Please note that these libraries are not an official IBM product. You use it entirely at your own risk, and neither IBM nor the author assumes any liability whatsoever, nor do they assume responsibility for maintenance. Please report comments and corrections to firstname.lastname@example.org.
Consider this a public service announcement, since most GATK users probably also use Picard tools routinely. The recently released version 1.124 of the Picard tools includes many lovely improvements, bug fixes and even a few new tools (see release notes for full details) -- but you'll want to pay attention to one major change in particular.
From this version onward, the Picard release will contain a single JAR file containing all the tools, instead of one JAR file per tool as it was before. This means that when you invoke a Picar tool, you'll invoke a single JAR, then specify which tool (which they call CLP for Command Line Program) you want to run. This should feel completely familiar if you already use GATK regularly, but it does mean you'll need to update any scripts that use Picard tools to the new paradigm. Other than that, there's no change of syntax; Picard will still use e.g.
I=input.bam where GATK would use
We will need to update some of our own documentation accordingly over the near future; please bear with us as we go through this process, and let us know by commenting in this thread if you find any docs that have yet to be updated.
Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.
-ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.
Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.
As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.
This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the
-ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the
-ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the
You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don’t, go read this now. Go on, we’ll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That’s great, but until now, those variants were unphased, meaning the HC didn’t give you any information about whether any two variants’ alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, you’d want to know whether you had this:
But HC wouldn’t tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).
In a nutshell, phased records will look like this:
1 1372243 . T <NON_REF> . . END=1372267 <snip> <snip> 1 1372268 . G A,<NON_REF> . . <snip> GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,40,0:70:99:0|1:1372268_G_A:<snip> 1 1372269 . G T,<NON_REF> . . <snip> GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,41,0:71:99:0|1:1372268_G_A:<snip> 1 1372270 . C <NON_REF> . . END=1372299 <snip> <snip>
You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).
The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.
Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).
We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the
--recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.
Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.
We're looking for a developer (software engineer or equivalent) to join our team on a part-time basis as a software engineering consultant.
The mission is to take on tasks that are non-critical but still important, such as fixing bugs and implementing minor feature requests, usability improvements and so on in the GATK codebase. The idea is to take much of the maintenance burden off of the core development team so they can focus on developing new tools and methods.
We already have one person employed in this capacity, and it's working out very well. However there is quite a bit more to do than he can find time for, and therefore we'd like to hire a second consultant to pick up the extra work in parallel. (We were hoping to just clone him but ran into some difficulties getting IRB approval.)
This position does not require expert knowledge of GATK, but familiarity with the GATK tools is a big plus. The main language is Java, with a small side of R and Scala. We also have a growing corpus of C++ code that is not yet in the scope of this position, but may move into scope some months down the line.
Note that this opportunity is amenable to remote work so you don't need to be local to the Boston area, but it is limited to US residents with valid work authorization (no H1B sponsorship possible, sorry).
Drop us a line in the comments below or private-message me (@Geraldine_VdAuwera) to discuss details.
Here is the official announcement for the upcoming workshop in Philadelphia. Registration is not necessary for the lecture sessions, but it is required for the hands-on sessions (see link further below).
We look forward to seeing you there!
The Center for Genetics and Complex Traits (CGACT) and the Institute for Biomedical Informatics (IBI) of the University of Pennsylvania Perelman School of Medicine announce a Workshop on Variants Discovery in Next Generation Sequence Data on September 18 and 19, 2014.
This workshop will focus on the core steps involved in calling variants with the Broad Institute¹s Genome Analysis Toolkit (GATK), using the "Best Practices" developed by the GATK team, and will be presented by Dr. Geraldine Van der Auwera of the Broad Institute and other instructors from the GATK team. Participants will learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of their dataset.
The workshop will take place over two consecutive days (September 18 and 19, 2014). In the morning lecture sessions, attendees will learn the rationale, theory, and real-life applications of GATK Best Practices for variant discovery in high-throughput DNA sequencing data, including recommendations for additional experimental designs and datatypes such as RNAseq. In the afternoon hands-on sessions, attendees will learn to interact with the GATK tools and apply them effectively through interactive exercises and tutorials.
The morning lecture sessions will take place on Thursday, September 18, from 9:00 am to 12:30 pm, and Friday, September 19, from 9:00 am to 11:30 am, in the Dunlop Auditorium of Stemmler Hall, University of Pennsylvania, 3450 Hamilton Walk, Philadelphia, PA 19104. Both morning sessions are open to all participants and registration is not required.
The afternoon hands-on sessions will take place on Thursday, September 18, from 2:00 pm to 5:30 pm, and Friday, September 19, from 1:00 pm to 4:30 pm. The September 18 hands-on session is aimed mainly at beginners (though familiarity with the command line environment is expected). The September 19 hands-on session is aimed at more advanced users who are already familiar with the basic GATK functions. Attendance to the hands-on sessions is limited to 20 participants each day, and precedence will be given to members of the University of Pennsylvania or its affiliated hospitals and research institutes (HUP, CHOP, Monell, Wistar, etc.).
Registration for the hands-on sessions is mandatory and open through Friday, August 29th at http://ibi.upenn.edu/?p=996.
Better late than never (right?), here are the version highlights for GATK 3.2. Overall, this release is essentially a collection of bug fixes and incremental improvements that we wanted to push out to not keep folks waiting while we're working on the next big features. Most of the bug fixes are related to the HaplotypeCaller and its "reference confidence model" mode (which you may know as
-ERC GVCF). But there are also a few noteworthy improvements/changes in other tools which I'll go over below.
The "reference confidence model" workflow, which I hope you have heard of by now, is that awesome new workflow we released in March 2014, which was the core feature of the GATK 3.0 version. It solves the N+1 problem and allows you to perform joint variant analysis on ridiculously large cohorts without having to enslave the entire human race and turning people into batteries to power a planet-sized computing cluster. More on that later (omg we're writing a paper on it, finally!).
You can read the full list of improvements we've made to the tools involved in the workflow (mainly HaplotypeCaller and Genotype GVCFs) in Eric's (unusually detailed) Release Notes for this version. The ones you are most likely to care about are that the "missing PLs" bug is fixed, GenotypeGVCFs now accepts arguments that allow it to emulate the HC's genotyping capabilities more closely (such as
--includeNonVariantSites), the AB annotation is fully functional, reference DPs are no longer dropped, and CatVariants now accepts lists of VCFs as input. OK, so that last one is not really specific to the reference model pipeline, but that's where it really comes in handy (imagine generating a command line with thousands of VCF filenames -- it's not pretty).
The coverage metrics (DP and AD) reported by HaplotypeCaller are now those calculated after the HC's reassembly step, based on the reads having been realigned to the most likely haplotypes. So the metrics you see in the variant record should match what you see if you use the
-bamout option and visualize the reassembled ActiveRegion in a genome browser such as IGV. Note that if any of this is not making sense to you, say so in the comments and we'll point you to the new HaplotypeCaller documentation! Or, you know, look for it in the Guide.
We updated the plotting scripts used by BQSR and VQSR to use the latest version of ggplot2, to get rid of some deprecated function issues. If your Rscripts are suddenly failing, you'll need to update your R libraries.
We're sorry for making you jump through all these hoops recently. As if the switch to Maven wasn't enough, we have now completed a massive reorganization/renaming of the codebase that will probably cause you some headaches when you port your tools to the newest version. But we promise this is the last big wave, and ultimately this will make your life easier once we get the GATK core framework to be a proper maven artifact.
In a nutshell, the base name of the codebase has changed from
gatk (which hopefully makes more sense), and the most common effect is that
sting.gatk classpath segments are now
gatk.tools. This, by the way, is why we had a bunch of broken documentation links; most of these have been fixed (yay symlinks) but there may be a few broken URLs remaining. If you see something, say something, and we'll fix it.
We discovered today that we made an error in the documentation article that describes the RNAseq Best Practices workflow. The error is not critical but is likely to cause an increased rate of False Positive calls in your dataset.
The error was made in the description of the "Split & Trim" pre-processing step. We originally wrote that you need to reassign mapping qualities to 60 using the ReassignMappingQuality read filter. However, this causes all MAPQs in the file to be reassigned to 60, whereas what you want to do is reassign MAPQs only for good alignments which STAR identifies with MAPQ 255. This is done with a different read filter, called ReassignOneMappingQuality. The correct command is therefore:
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
In our hands we see a bump in the rate of FP calls from 4% to 8% when the wrong filter is used. We don't see any significant amount of false negatives (lost true positives) with the bad command, although we do see a few more true positives show up in the results of the bad command. So basically the effect is to excessively increase sensitivity, at the expense of specificity, because poorly mapped reads are taken into account with a "good" mapping quality, where they would normally be discarded.
This effect will be stronger in datasets with lower overall quality, so your results may vary. Let us know if you observe any really dramatic effects, but we don't expect that to happen.
To be clear, we do recommend re-processing your data if you can, but if that is not an option, keep in mind how this affects the rate of false positive discovery in your data.
We apologize for this error (which has now been corrected in the documentation) and for the inconvenience it may cause you.
Calling all Belgians! (and immediate neighbors)
In case you didn't hear of this through your local institutions, I'm excited to announce that we are doing a GATK workshop in Belgium in two weeks (June 24-26 to be precise). The workshop, which is open and free to the scientific community, will be held at the Royal Institute of Natural Sciences in Brussels.
This is SUPER EXCITING to me because as a small child I spent many hours drooling in front of the Institute Museum's stunningly beautiful Iguanodons, likely leaving grubby handprints all over the glass cases, to the shame and annoyance of my parents. I also happen to have attended the Lycee Emile Jacqmain which is located in the same park, right next to the Museum (also within a stone's throw of the more recently added European Parliament) so for me this is a real trip into the past. Complete with dinosaurs!
That said, I expect you may find this workshop exciting for very different reasons, such as learning how the GATK can empower your research and hearing about the latest cutting-edge developments that you can expect for version 3.2.
See this website or the attached flyer for practical details (but note that the exact daily program may be slightly different than announced due to the latest changes in GATK) and be sure to register (it's required for admission!) by emailing cvangestel at naturalsciences.be with your name and affiliation.
Please note that the hands-on sessions (to be held on the third day) are already filled to capacity. The tutorial materials will be available on our website in the days following the workshop.
In a nutshell: if you're using a version of GATK older than 2.7, you need to request a key to disable Phone Home (if you don't already have one). See below for a full explanation.
As you may know, the GATK includes a reporting mechanism called Phone Home that sends us runtime statistics about usage and bugs. These statistics (which you can read more about here) help us make development decisions, e.g. prioritize bug fixes and new features, as well as track adoption of new versions and tools.
The system uses an AWS cloud service as a data repository, called a "bucket", which GATK accesses using an encryption key. We currently have two active AWS keys for the Phone Home bucket. GATK versions 2.6 and older use one AWS key, and versions 2.7 and later use another AWS key.
For practical reasons, we need to deactivate the old AWS key, which means that GATK jobs run using a version older than 2.7 may end with a Phone Home failure unless the system is deactivated. To be clear, the GATK analysis itself will complete successfully, but the GATK may exit with a fail code. This may cause issues in pipelines and potentially fill up error logs.
The best way to avoid these problems is to upgrade to the latest version of GATK, which you should seriously consider anyway in order to get the best possible results out of your analysis. However, if you are unable to upgrade to a recent version, we recommend that you disable the Phone Home system. To do so, you will need to follow these two simple steps:
Request a GATK key using this online form. In the "justification" field, please write "AWS key deprecation", indicate which version of GATK you are using, and if possible let us know why you are unable to upgrade to a recent version. You can expect to receive your key within 1 business day.
-et NO_ET -K your.key(where
your.keyis the path to the key file you obtained from us) to every GATK command line.
We expect this workaround will suppress any issues EXCEPT in versions prior to 1.5, in which the Phone Home system cannot be deactivated. For versions 1.4 and older, there is nothing we can do, and you will have to either put up with the errors, or upgrade to a newer version (which you should really really do anyway!).
Let us know in the comments if you have any trouble or questions.
This may seem crazy considering we released the big 3.0 version not two weeks ago, but yes, we have a new version for you already! It's a bit of a special case because this release is all about the hardware-based optimizations we had previously announced. What we hadn't announced yet was that this is the fruit of a new collaboration with a team at Intel (which you can read more about here), so we were waiting for everyone to be ready for the big reveal.
So basically, the story is that we've started collaborating with the Intel Bio Team to enable key parts of the GATK to run more efficiently on certain hardware configurations. For our first project together, we tackled the PairHMM algorithm, which is responsible for a large proportion of the runtime of HaplotypeCaller analyses. The resulting optimizations, which are the main feature in version 3.1, produce significant speedups for HaplotypeCaller runs on a wide range of hardware.
We will continue working with Intel to further improve the performance of GATK tools that have historically been afflicted with performance issues and long runtimes (hello BQSR). As always, we hope these new features will make your life easier, and we welcome your feedback in the forum!
Note that these optimizations currently work on Linux systems only, and will not work on Mac or Windows operating systems. In the near future we will add support for Mac OS. We have no plans to add support for Windows since the GATK itself does not run on Windows.
Please note also that to take advantage of these optimizations, you need to opt-in by adding the following flag to your GATK command:
Here is a handy little table of the speedups you can expect depending on the hardware and operating system you are using. The configurations given here are the minimum requirements for benefiting from the expected speedup ranges shown in the third column. Keep in mind that these numbers are based on tests in controlled conditions; in the wild, your mileage may vary.
|Linux kernel version||Architecture / Processor||Expected speedup||Instruction set|
|Any 64-bit Linux||Any x86 64-bit||1-1.5x||Non-vector|
|Linux 2.6 or newer||Penryn (Core 2 or newer)||1.3-1.8x||SSE 4.1|
|Linux 2.6.30 or newer||SandyBridge (i3, i5, i7, Xeon E3, E5, E7 or newer)||2-2.5x||AVX|
To find out exactly which processor is in your machine, you can run this command in the terminal:
$ cat /proc/cpuinfo | grep "model name" model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
In this example, the machine has 4 cores (8-threads), so you see the answer 8 times. With the model name (here i7-2600) you can look up your hardware's relevant capabilities in the Wikipedia page on vector extensions.
Alternatively, Intel has provided us with some links to lists of processors categorized by architecture, in which you can look up your hardware:
Finally, a few notes to clarify some concepts regarding Linux kernels vs. distributions and processors vs. architectures:
SandyBridge and Penryn are microarchitectures; essentially, these are sets of instructions built into the CPU. Core 2, core i3, i4, i7, Xeon e3, e5, e7 are the processors that will implement a specific architecture to make use of the relevant improvements (see table above).
The Linux kernel has no connection with Linux distribution (e.g. Ubuntu, RedHat etc). Any distribution can use any kernel they want. There are "default kernels" shipped with each distribution, but that's beyond the scope of this article to cover (there are at least 300 Linux distributions out there). But you can always install whatever kernel version you want.
Okay, we realize the name's a bit of a mouthful, and we're willing to tweak it if anyone has any good ideas. But never mind that. It's difficult to overstate the importance of this new approach to joint variant discovery (but I'll give it a shot) so we're really stoked to finally be able to share the details of how it's is going to work in practice.
You're probably going to be surprised at how simple it is in practice (not that it was particularly easy to actually build, mind you). The gory details are in the new document here, but here's an overview of how it looks within the Best Practices workflow you all know and (hopefully) love:
The first surprise is that instead of calling variants on multiple samples, you now just run HaplotypeCaller on each sample individually. "Oh no," I hear you cry, "but the results were so much better when I called multiple samples together!". Well yeah, but it took forever. Bear with me for a minute.
The key here is that you run HaplotypeCaller in gVCF mode. This outputs a so-called genomic VCF, which contains a record of the genotype likelihoods and annotations for every single site in the genome (or exome), whether or not there is evidence of variation. This essentially boils down all the useful information that can be gleaned from the BAM files, and makes it unnecessary to refer back to the BAM in later steps.
So you repeat that for all your samples (which goes reasonably fast since per-sample calling is pretty tractable nowadays). Optionally, you can add in a step to combine gVCF files if you're working on a really large cohort. Then in the next step, you just run a separate genotyping tool on all the gVCFs (or combined gVCFs) together, which gives you the same output (raw SNPs and indel calls) that you would have got from one-step multisample calling.
See, that's the beauty of the new workflow. A lot less work (for the computer) for equivalent results. And the ability to process samples incrementally and perform joint discovery on cohort sizes that would have previously got you hauled off to the funny farm.
Let us know what you think!
We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.
This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.
Please see the new document here for full details about how to run this workflow in practice.
In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.
Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.
Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.
For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.
Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.
We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!
Previously, we covered the spirit of GATK 3.0 (what our intentions are for this new release, and what we’re hoping to achieve). Let’s now have a look at the top three features you can look forward to in 3.0, in no particular order:
At this point everyone knows that the HaplotypeCaller is fabulous (you know this, right?) but beyond a certain number of samples that you’re trying to call jointly, it just grinds to a crawl, and any further movement is on the scale of continental drift. Obviously this is a major obstacle if you’re trying to do any kind of work at scale beyond a handful of samples, and that’s why it hasn’t been used in recent large-cohort projects despite showing best-in-class performance in terms of discovery power.
The major culprit in this case is the PairHMM algorithm, which takes up the lion’s share of HC runtime. With the help of external collaborators (to be credited in a follow-up post) we rewrote the code of the PairHMM to make it orders of magnitude faster, especially on specialized hardware like GPU and FPGA chips (but you’ll still see a speedup on “regular” hardware).
We plan to follow up on this by doing similar optimizations on the other “slowpoke” algorithms that are responsible for long runtimes in GATK tools.
Some problems in variant calling can’t be solved by Daft Punk hardware upgrades (better faster stronger) alone. Beyond the question of speed, a major issue with multi-sample variant discovery is that you have to wait until all the samples are available to call variants on them. Then, if later you want to add some more samples to your cohort, you have to re-call all of them together, old and new. This, also known as the “N+1 problem”, is a huge pain in the anatomy.
The underlying idea of the “single-sample pipeline for joint variant discovery” is to decouple the two steps in the variant calling process: identifying evidence of variation, and interpreting the evidence. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and a heck of a lot faster) on one sample at a time.
The new pipeline allows us to process each sample as it comes off the sequencing machine, up to the first step of variant calling. Cumulatively, this will produce a database of per-sample, per-site allele frequencies. Then it’s just a matter of running a joint analysis on the database, which can be done incrementally each time a new sample is added or at certain intervals or timepoints, depending on the research needs, because this step runs quickly and cheaply.
We’ll go into the details of exactly how this works in a follow-up post. For now, the take-home message is that it’s a “single-sample pipeline” because you do the heavy-lifting per-sample (and just once, ever), but you are empowered to perform “joint discovery” because you interpret the evidence from each sample in light of what you see in all the other samples, and you can do this at any point in the project timeline.
Our Best Practices recommendations for calling variants on DNA sequence data have proved to be wildly popular with the scientific community, presumably because it takes a lot of the guesswork out of running GATK, and provides a large degree of reproducibility.
Now, we’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences the early data processing steps, as well as in the calling step. We do not yet have RNAseq-specific recommendations for variant filtering/recalibration, but will be developing those in the coming weeks.
We’ll go into the details of the RNAseq Best Practices in a follow-up post, but in a nutshell, these are the key differences: use STAR for alignment, add an exon splitting and cleanup step, and tell the variant caller to take the splits into account. The latter involves some new code added to the variant callers; it is available to both HaplotypeCaller and UnifiedGenotyper, but UG is currently missing a whole lot of indels, so we do recommend using only HC in the immediate future.
Keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing. We will be improving these recommendations progressively as we go, and we hope that the researcher community will help us by providing feedback of their experiences applying our recommendations to their data.
Yep, you read that right, the next release of GATK is going to be the big Three-Oh!
You may have noticed that the 2.8 release was really slim. We explained in the release notes, perhaps a tad defensively, that it was because we’d been working on some ambitious new features that just weren’t ready for prime time. And that was true. Now we’ve got a couple of shiny new toys to show for it that we think you’re really going to like.
But GATK 3.0 is not really about the new features (otherwise we’d just call it 2.9). It’s about a shift in the way we approach the problems that we want to solve -- and to some extent, a shift in the scope of problems we choose to tackle.
We’ll explain what this entails in much more detail in a series of blog posts over the next few days, but let me reassure you right now on one very important point: there is nothing in the upcoming release that will disrupt your existing workflows. What it will do is offer you new paths for discovery that we believe will empower research on a scale that has previously not been possible.
And lest you think this is all just vaporware, here’s a sample of what we have in hand right now: variant calling on RNA-Seq, and a multisample variant discovery workflow liberated from the shackles of time and scaling issues.
Stay tuned for details!
The presentation videos for:
are available here: http://www.broadinstitute.org/gatk/guide/events?id=3391
Better late than never, here are the highlights of the most recent version release, GATK 2.8. This should be short and sweet because as releases go, 2.8 is light on new features, and is best described as a collection of bug fixes, which are all* dutifully listed in the corresponding release notes document. That said, two of the changes we've made deserve some additional explanation.
* Up to now (this release included) we have not listed updates/patches to Queue in the release notes, but will start doing so from the next version onward.
In the last release (2.7, for those of you keeping score at home) we trumpeted that the old
-percentBad argument of VariantRecalibrator had been replaced by the shiny new
-numBad argument, and that this was going to be awesome for all sorts of good reasons, improve stability and whatnot. Weeeeeeell it turned out that wasn't quite the case. It worked really well on the subset of analyses that we tested it on initially, but once we expanded to different datasets (and the complaints started rolling in on the forum) we realized that it actually made things worse in some cases because the default value was less appropriate than what
-percentBad would have produced. This left people guessing as to what value would work for their particular dataset, with a great big range to choose from and very little useful information to assist in the choice.
So, long story short, we (and by "we" I mean Ryan) built in a new function that allows the VariantRecalibrator to determine for itself the amount of variants that is appropriate to use for the "bad" model depending on the data. So the short-lived
-numBad argument is gone too, replaced by... nothing. No new argument to specify; just let the VariantRecalibrator do its thing.
Of course if you really want to, you can override the default behavior and tweak the internal thresholds. See the tool doc here; and remember that a good rule of thumb is that if you can't figure out which arguments are involved based on that doc, you probably shouldn't be messing with this advanced functionality.
This is still a rather experimental feature, so we're still making changes as we go. The two big changes worth mentioning here are that you can now run this on reduced reads, and that we've changed the indexing routine to optimize the compression level. The latter shouldn't have any immediate impact on normal users, but it was necessary for a new feature project we've been working on behind the scenes (the single-sample-to-joint-discovery pipeline we have been alluding to in recent forum discussions). The reason we're mentioning it now is that if you use
-ERC GVCF output, you'll need to specify a couple of new arguments as well (
-variant_index_type LINEAR and
-variant_index_parameter 128000, with those exact values). This useful little fact didn't quite make it into the documentation before we released, and not specifying them leads to an error message, so... there you go. No error message for you!
That's all for tool changes. In addition to those, we have made a number of corrections in the tool documentation pages, updated the Best Practices (mostly layout, tiny bit of content update related to the VQSR -numBad deprecation) and made some minor changes to the website, e.g. updated the list of publications that cite the GATK and improved the Guide index somewhat (but that's still a work in progress).
Heads up, people: our generous overlords at the Broad Institute are giving us all time off from December 23 until January 1st (included). So we will effectively be off-duty starting tomorrow evening (Friday Dec 20) for the entire Christmas and New Year period, only to return on January 2nd. During that time, we will all be busy stuffing ourselves with food and enjoying the company of our loved ones, and we hope many of you will have the opportunity to do the same, regardless of your cultural affiliations (I for one will be raising a glass of mulled wine in honor of my Gaul ancestors and the winter solstice). For those of you who will be working, I'm afraid no-one from the GATK team will be around to answer forum questions (unless one of us really needs an excuse to get away from the in-laws) so I encourage you to try to answer each other's questions in the meantime. To all, good luck, happy holidays and/or our deepest sympathy, as applicable. See you next year!
We're very pleased to announce that we have finally finished our big rewrite of the Best Practices documentation. We hope that the new format, which you can find here, will prove more user-friendly, searchable and overall more helpful than the previous version.
We have a few more improvements in mind (e.g. a clickable image map of the workflow) and there may be a few bugs here and there to iron out. So please feel free to comment on this announcement and give us feedback, whether flattering or critical, so we can improve it to help you as much as possible.