It’s an exciting week here at GATK HQ because we finally get to tell you about this very cool project we’ve been working on: today, Broad and Google announced plans to make GATK available as a service on Google Cloud Platform as part of Google Genomics!
Our goal with this project is to enable researchers around the world to run the latest version of the GATK Best Practices pipeline without the hassle of figuring out the plumbing or storing huge amounts of genomic data locally. The initial alpha rollout will be limited to a small set of users. Based on our experience supporting the GATK user community, we think this will dramatically lower the barriers to top-notch genomic analysis capabilities, especially for researchers who don’t have access to the kind of dedicated compute infrastructure and engineering teams required for analyzing genomic data at scale.
Rest assured, if you’re the do-it-yourselfer type, the DIY GATK option is still here; you’ll still be able to download the latest version of GATK from the Broad website, set it up on your own system and run it locally. We will continue documenting and supporting this option with the same level of dedication as ever.
You can read more about the details in the Broad Institute press release and on the Google Cloud Platform blog. Feel free to ask questions in the comments thread of course. We’ll keep you posted here as we move forward.
Folks, I’m all out of banter for this one, so let’s go straight to the facts. GATK 3.4 contains a shedload of improvements and bug fixes, including some new functionality that we hope you’ll find useful. The full list is available in the detailed release notes.
None of the recent changes involves any disruption to the Best Practice workflow (I hear some sighs of relief) but you’ll definitely want to check out the tweaks we made to the joint discovery tools (HaplotypeCaller, CombineGVCFs and GenotypeGVCFs), which are rapidly maturing as they log more flight time at Broad and in the wild.
Let’s start at the very beginning with HaplotypeCaller (a very good place to start). On the usability front, we’ve finally given in to the nigh-universal complaint about the required variant indexing arguments (
--variant_index_type LINEAR --variant_index_parameter 128000) being obnoxious and a waste of characters. So, tadaa, they are no longer required, as long as you name your output file with the extension
.g.vcf so that the engine knows what level of compression to use to write the gVCF index (which leads to better performance in downstream tools). We think this naming convention makes a lot of sense anyway, as it’s a great way to distinguish gVCFs from regular VCFs on sight, so we hope most of you will adopt it. That said, we stopped short of making this convention mandatory (for now…) so you don’t have to change all your scripts and conventions if you don’t want to. All that will happen (assuming you still specify the variant index parameters as previously) is that you’ll get a warning in the log telling you that you could use the new convention.
Where we’ve been a bit more dictatorial is that we’ve completely disabled the use of
-dcov with HaplotypeCaller because it was causing very buggy behavior due to an unforeseen complication in how different levels of downsampling are applied in HaplotypeCaller. We know that the default setting does the right thing, and there’s almost no legitimate reason to change it, so we’re disabling this for the greater good pending a fix (which may be a long time coming due to the complexity of the code involved).
Next up, CombineGVCFs gets a new option to break up reference blocks at every N sites. The new argument
--breakBandsAtMultiplesOf Nwill ensure that no reference blocks in the combined gVCF span genomic positions that are multiples of N. This is meant to enable scatter-gather parallelization of joint genotyping on whole-genome data, as a workaround to some annoying limitations of the GATK engine that make it unsafe to use
-L intervals that might start within the span of a block record. For exome data, joint genotyping can easily be parallelized by scatter-gathering across exome capture target intervals, because we know that there won’t be any hom-ref block records spanning the target interval boundaries. In contrast, in whole-genome data, there is no equivalent predictable termination of block records, so it’s not possible to know up front where it would be safe to set scatter-gather interval start and end points -- until now!
And finally, GenotypeGVCFs gets an important bug fix, and a very useful new annotation.
The bug is something that has arisen mostly (though not exclusively) from large cohort studies. What happened is that, when a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, GenotypeGVCFs would emit a homozygous reference genotype for sample B at that position -- which is obviously incorrect. The fix is that now, sample B will be genotyped as having a symbolic
<DEL> allele representing the deletion.
The new annotation is called
RGQ for Reference Genotype Quality. It is a new sample-level annotation that will be added by GenotypeGVCFs to monomorphic sites if you use the
-allSites argument to emit non-variant sites to the output VCF. This is obviously super useful for evaluating the level of confidence of those sites called homozygous-reference.
This new coverage analysis tool is designed to count read depth in a way that is appropriate for allele-specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low qual reads and bases and keeping only properly paired reads. The default output format produced by this tool is a structured text file intended to be consumed by the statistical analysis toolkit MAMBA. A paper by Stephane Castel and colleagues describing the complete ASE analysis workflow is available as a preprint on bioarxiv.
We’ve added two new documentation resources to the Guide.
One is a new category of documentation articles called Common Problems, to cover topics that are a specialized subset of FAQs: problems that many users encounter, which are typically due to misunderstandings about input requirements or about the expected behavior of the tools, or complications that arise from certain experimental designs. This category is being actively worked on and we welcome suggestions of additional topics that it should cover.
The second is an Issue Tracker that lists issues that have been reported as well as features or enhancements that have been requested. If you encounter a problem that you think might be a bug (or you have a feature request in mind), you can check this page to see if it’s something we already know about. If you have submitted a bug report, you can use the issue tracker to check whether your issue is in the backlog, in the queue, or is being actively worked on. In future we’ll add some functionality to enable voting on what issues or features should be prioritized, so stay tuned for an announcement on that!
The presentation slides from the 2015 "GATK in the UK" workshop (April 20-24) are available on DropBox here.
All slides will ultimately be posted in the Presentations section of the Guide.
We have some important news to share with you regarding the licensing of GATK and MuTect. The licensing agreement between us and Appistry will end effective April 15, 2015; from that point on, the tools will continue to be licensed through Broad for commercial entities that will be running the GATK and MuTect internally or as part of their own hardware offering. Current licensed users will transition to Broad Institute when their current license expires.
For our academic and non-profit GATK and MuTect users, the licensing transition will be essentially transparent. You will still be able to use the GATK and MuTect for free, and access the source code through the existing public repository. The support forum and documentation website will also remain operational and freely accessible to all as they been previously.
Since our commercial users will now get their license --and their support!-- directly from Broad, they can expect to see some clear benefits:
We have heard from our licensed user community that you would like greater access to the GATK support team. Licensing through Broad will remove intermediaries and allow all GATK and MuTect users --including those who purchase a license-- more direct contact with and support from our team.
Most current tools. Getting your GATK license through Broad license will give you access to the most cutting-edge tools and features available without sacrificing support. Under Broad licensing, you will still have the option of purchasing a license for either of two packages, “GATK” or “GATK + Cancer Tools”.
Up-to-date Best Practices recommendations.
You will get our latest recommendations, informed by the latest in our internal analysis and internal R&D work, directly from us. So you can be confident that you have access to the freshest information at all times.
Our development team is driven by the goal of building tools to enable better science and push the boundaries of genome analysis. Revenue from GATK and MuTect licensing enables these goals by directly feeding into GATK and MuTect development, in the form of critical codebase maintenance and bug fixing work, as well as expansion of the support team. This enables us to keep pace with the growth of the user community and the ever-increasing demand for GATK and MuTect support.
This is a significant new milestone in the life of GATK and MuTect, and we recognize that there are going to be a lot of questions and discussions on this topic since it will affect many of you in the research community. We’ve put together some FAQs (below the fold) that we hope will answer your most pressing questions; feel free to comment and suggest additional points that you think should be covered there.
Note that we are still working on defining some of the finer points of the support model and pricing structure, so we can’t address those quite yet -- but feel free to email email@example.com if you have some burning question and/or concern that you’d like to discuss regarding licensing and/or pricing in particular. Rest assured that once the model has been finalized, we will make the full details (including pricing) available on our website in order to ensure full transparency.
Academic/non-profit users: No change. The licensing terms remain the same and the GATK remains free to use for not-for-profit research and educational purposes. The current free user support model will remain available through the online forum.
Currently licensed Commercial/for-profit users: Appistry will continue to provide full GATK support for the remainder of your current license term. After that point, Broad Institute can offer you a GATK license directly. This will offer you immediate access to the latest version of GATK. We can work directly with you on your specific licensing questions at firstname.lastname@example.org. For support questions, GATK product upgrade information or other suggestions, please comment in the discussion below or send us a private message.
Prospective commercial/for-profit users: Broad Institute can offer you a GATK license directly. This will offer you immediate access to the latest version of GATK. We can work directly with you on your specific licensing questions at email@example.com. For support questions, GATK product upgrade information or other suggestions, please comment in the discussion below or send us a private message.
No. There will only be one version for all users. We will provide our licensed users total support for the very latest version. This means they will always be able to use the most cutting-edge tools and features available without sacrificing support.
Part of the Broad Institute’s mission is to share our tools and research as broadly as possible to enable others to do transformational research. We started developing GATK several years ago and, since then, have constantly upgraded it, thanks to the hard work and dedication of many talented programmers, developers and genomic researchers in our group and beyond. That is why GATK remains the most advanced, accurate and reliable toolkit for variant discovery available anywhere (if we do say so ourselves). But please understand that building, maintaining, testing and constantly improving GATK is neither easy nor free. This is why we charge commercial users a licensing fee and funnel these resources back into upgrades of the tool itself – it allows us to continue to offer GATK for free to academic and non-profit organizations while ensuring it is always the best-of-the-best in an emerging and rapidly-changing field of research.
In addition to the GATK package, we will also offer a package that bundles GATK with MuTect and ContEst. That package will not include the SomaticIndelDetector, but a replacement for that functionality is in preparation.
No. Tools originating from the Picard toolkit will remain free and fully open-source for everyone. We are preparing to integrate them into a part of GATK that will be under a BSD license.
We are preparing to enable this in order to facilitate sharing of scientific methods, but we need input from the community first. To that end, we’d like to invite researchers who are developing or have developed such pipeline images to contact us in order to discuss options. We are envisioning simple technical solutions to ensure that users of these images are made fully aware of their own legal responsibility relative to the GATK licensing status, in a way that minimizes the burden on the researchers who distribute them.
The slides from the 2015 BroadE GATK Best Practices workshop presentations are accessible at this Dropbox link.
The video recordings of the workshop talks are online here. We'll post links to the videos (along with copies of the corresponding slides) in the Presentations section of the Guide. They will also be available on the Broad's YouTube and iTunesU channels.
We're going to be doing two back-to-back workshops in Edinburgh and Cambridge (the original, accept no substitutes) later this Spring, on April 20-21 and 23-24 respectively. The workshop program for both will be our typical one-day Best Practices lectures marathon followed by a half-day of lectures on supplemental topics (QC, non-humans, etc) and a half-day hands-on sessions for beginners to get their hands dirty with some real data.
Cheers to our hosts and we hope to see lots of you there!
bonus points to whoever gets the title reference -- and sings it in the correct tune
Warning: the following content may shock or distress our more sensitive users, as we discuss the cold-blooded elimination of some tools from the GATK.
Alright, now that I've got your attention (hopefully -- if not, what does it take?), here's the deal. We have got to a point where the GATK is a widely, even massively used toolkit (thanks to you, dear users). And it's pretty darn robust -- it's what the Broad's Genomic Platform uses in production to churn out exomes like there's no tomorrow. But it has technical limitations that are 1) a frequent source of pain on your end and 2) increasingly hampering development of new methods on our end.
The good news is that we have a plan for addressing (read: blasting away) these limitations. But part of this plan will involve streamlining GATK by getting rid of tools that are not useful or are inferior to alternative tools from other packages that we're not trying to compete with (e.g. Picard tools).
Some tools that are safe from elimination: all the tools used in the Best Practices, and a couple of utilities that we use a lot ourselves. But everything else is up for review -- and that's where you come in: we need your input to decide what to keep, what to throw away, and what to consider rewriting from scratch (yep, this is an option).
This link will take you to a SurveyMonkey page that lists the tools currently on the chopping block:
Act now to save your favorite non-BP tools! Or help us get rid of the crud. Whichever way you want to look at it, we appreciate your feedback!
We all know how HaplotypeCaller analyses can take a long time. IBM is now providing a native implementation of the PairHMM algorithm that leverages the new hardware available in their POWER8 systems. This optimization currently work on the following systems: Ubuntu14 and RHEL7 with POWER8.
To take advantage of this optimization, you need to do the following:
Here is an example for running on a P8 system with Ubuntu:
java -Xmx32g -Djava.library.path=/path/to/PairHMM_P8_Ubuntu -jar $GATK_PATH/GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R $REFERENCE -I $INPUT_BAM --dbsnp $SNP_VCF \ -stand_emit_conf 10 -stand_call_conf 50 \ --pair_hmm_implementation VECTOR_LOGLESS_CACHING \ -o $OUTPUT_VCF
You can expect a speedup in the range of 1-1.7x depending on the hardware, OS and test cases.
If you have any questions or issues (aside from downloading the file), please contact Yinhue Cheng at IBM (firstname.lastname@example.org).
Consider this a public service announcement, since most GATK users probably also use Picard tools routinely. The recently released version 1.124 of the Picard tools includes many lovely improvements, bug fixes and even a few new tools (see release notes for full details) -- but you'll want to pay attention to one major change in particular.
From this version onward, the Picard release will contain a single JAR file containing all the tools, instead of one JAR file per tool as it was before. This means that when you invoke a Picar tool, you'll invoke a single JAR, then specify which tool (which they call CLP for Command Line Program) you want to run. This should feel completely familiar if you already use GATK regularly, but it does mean you'll need to update any scripts that use Picard tools to the new paradigm. Other than that, there's no change of syntax; Picard will still use e.g.
I=input.bam where GATK would use
We will need to update some of our own documentation accordingly over the near future; please bear with us as we go through this process, and let us know by commenting in this thread if you find any docs that have yet to be updated.
Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.
-ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.
Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.
As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.
This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the
-ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the
-ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the
You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don’t, go read this now. Go on, we’ll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That’s great, but until now, those variants were unphased, meaning the HC didn’t give you any information about whether any two variants’ alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, you’d want to know whether you had this:
But HC wouldn’t tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).
In a nutshell, phased records will look like this:
1 1372243 . T <NON_REF> . . END=1372267 <snip> <snip> 1 1372268 . G A,<NON_REF> . . <snip> GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,40,0:70:99:0|1:1372268_G_A:<snip> 1 1372269 . G T,<NON_REF> . . <snip> GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,41,0:71:99:0|1:1372268_G_A:<snip> 1 1372270 . C <NON_REF> . . END=1372299 <snip> <snip>
You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).
The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.
Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).
We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the
--recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.
Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.
We're looking for a developer (software engineer or equivalent) to join our team on a part-time basis as a software engineering consultant.
The mission is to take on tasks that are non-critical but still important, such as fixing bugs and implementing minor feature requests, usability improvements and so on in the GATK codebase. The idea is to take much of the maintenance burden off of the core development team so they can focus on developing new tools and methods.
We already have one person employed in this capacity, and it's working out very well. However there is quite a bit more to do than he can find time for, and therefore we'd like to hire a second consultant to pick up the extra work in parallel. (We were hoping to just clone him but ran into some difficulties getting IRB approval.)
This position does not require expert knowledge of GATK, but familiarity with the GATK tools is a big plus. The main language is Java, with a small side of R and Scala. We also have a growing corpus of C++ code that is not yet in the scope of this position, but may move into scope some months down the line.
Note that this opportunity is amenable to remote work so you don't need to be local to the Boston area, but it is limited to US residents with valid work authorization (no H1B sponsorship possible, sorry).
Drop us a line in the comments below or private-message me (@Geraldine_VdAuwera) to discuss details.
Here is the official announcement for the upcoming workshop in Philadelphia. Registration is not necessary for the lecture sessions, but it is required for the hands-on sessions (see link further below).
We look forward to seeing you there!
The Center for Genetics and Complex Traits (CGACT) and the Institute for Biomedical Informatics (IBI) of the University of Pennsylvania Perelman School of Medicine announce a Workshop on Variants Discovery in Next Generation Sequence Data on September 18 and 19, 2014.
This workshop will focus on the core steps involved in calling variants with the Broad Institute¹s Genome Analysis Toolkit (GATK), using the "Best Practices" developed by the GATK team, and will be presented by Dr. Geraldine Van der Auwera of the Broad Institute and other instructors from the GATK team. Participants will learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of their dataset.
The workshop will take place over two consecutive days (September 18 and 19, 2014). In the morning lecture sessions, attendees will learn the rationale, theory, and real-life applications of GATK Best Practices for variant discovery in high-throughput DNA sequencing data, including recommendations for additional experimental designs and datatypes such as RNAseq. In the afternoon hands-on sessions, attendees will learn to interact with the GATK tools and apply them effectively through interactive exercises and tutorials.
The morning lecture sessions will take place on Thursday, September 18, from 9:00 am to 12:30 pm, and Friday, September 19, from 9:00 am to 11:30 am, in the Dunlop Auditorium of Stemmler Hall, University of Pennsylvania, 3450 Hamilton Walk, Philadelphia, PA 19104. Both morning sessions are open to all participants and registration is not required.
The afternoon hands-on sessions will take place on Thursday, September 18, from 2:00 pm to 5:30 pm, and Friday, September 19, from 1:00 pm to 4:30 pm. The September 18 hands-on session is aimed mainly at beginners (though familiarity with the command line environment is expected). The September 19 hands-on session is aimed at more advanced users who are already familiar with the basic GATK functions. Attendance to the hands-on sessions is limited to 20 participants each day, and precedence will be given to members of the University of Pennsylvania or its affiliated hospitals and research institutes (HUP, CHOP, Monell, Wistar, etc.).
Registration for the hands-on sessions is mandatory and open through Friday, August 29th at http://ibi.upenn.edu/?p=996.
Better late than never (right?), here are the version highlights for GATK 3.2. Overall, this release is essentially a collection of bug fixes and incremental improvements that we wanted to push out to not keep folks waiting while we're working on the next big features. Most of the bug fixes are related to the HaplotypeCaller and its "reference confidence model" mode (which you may know as
-ERC GVCF). But there are also a few noteworthy improvements/changes in other tools which I'll go over below.
The "reference confidence model" workflow, which I hope you have heard of by now, is that awesome new workflow we released in March 2014, which was the core feature of the GATK 3.0 version. It solves the N+1 problem and allows you to perform joint variant analysis on ridiculously large cohorts without having to enslave the entire human race and turning people into batteries to power a planet-sized computing cluster. More on that later (omg we're writing a paper on it, finally!).
You can read the full list of improvements we've made to the tools involved in the workflow (mainly HaplotypeCaller and Genotype GVCFs) in Eric's (unusually detailed) Release Notes for this version. The ones you are most likely to care about are that the "missing PLs" bug is fixed, GenotypeGVCFs now accepts arguments that allow it to emulate the HC's genotyping capabilities more closely (such as
--includeNonVariantSites), the AB annotation is fully functional, reference DPs are no longer dropped, and CatVariants now accepts lists of VCFs as input. OK, so that last one is not really specific to the reference model pipeline, but that's where it really comes in handy (imagine generating a command line with thousands of VCF filenames -- it's not pretty).
The coverage metrics (DP and AD) reported by HaplotypeCaller are now those calculated after the HC's reassembly step, based on the reads having been realigned to the most likely haplotypes. So the metrics you see in the variant record should match what you see if you use the
-bamout option and visualize the reassembled ActiveRegion in a genome browser such as IGV. Note that if any of this is not making sense to you, say so in the comments and we'll point you to the new HaplotypeCaller documentation! Or, you know, look for it in the Guide.
We updated the plotting scripts used by BQSR and VQSR to use the latest version of ggplot2, to get rid of some deprecated function issues. If your Rscripts are suddenly failing, you'll need to update your R libraries.
We're sorry for making you jump through all these hoops recently. As if the switch to Maven wasn't enough, we have now completed a massive reorganization/renaming of the codebase that will probably cause you some headaches when you port your tools to the newest version. But we promise this is the last big wave, and ultimately this will make your life easier once we get the GATK core framework to be a proper maven artifact.
In a nutshell, the base name of the codebase has changed from
gatk (which hopefully makes more sense), and the most common effect is that
sting.gatk classpath segments are now
gatk.tools. This, by the way, is why we had a bunch of broken documentation links; most of these have been fixed (yay symlinks) but there may be a few broken URLs remaining. If you see something, say something, and we'll fix it.
We discovered today that we made an error in the documentation article that describes the RNAseq Best Practices workflow. The error is not critical but is likely to cause an increased rate of False Positive calls in your dataset.
The error was made in the description of the "Split & Trim" pre-processing step. We originally wrote that you need to reassign mapping qualities to 60 using the ReassignMappingQuality read filter. However, this causes all MAPQs in the file to be reassigned to 60, whereas what you want to do is reassign MAPQs only for good alignments which STAR identifies with MAPQ 255. This is done with a different read filter, called ReassignOneMappingQuality. The correct command is therefore:
java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
In our hands we see a bump in the rate of FP calls from 4% to 8% when the wrong filter is used. We don't see any significant amount of false negatives (lost true positives) with the bad command, although we do see a few more true positives show up in the results of the bad command. So basically the effect is to excessively increase sensitivity, at the expense of specificity, because poorly mapped reads are taken into account with a "good" mapping quality, where they would normally be discarded.
This effect will be stronger in datasets with lower overall quality, so your results may vary. Let us know if you observe any really dramatic effects, but we don't expect that to happen.
To be clear, we do recommend re-processing your data if you can, but if that is not an option, keep in mind how this affects the rate of false positive discovery in your data.
We apologize for this error (which has now been corrected in the documentation) and for the inconvenience it may cause you.
Calling all Belgians! (and immediate neighbors)
In case you didn't hear of this through your local institutions, I'm excited to announce that we are doing a GATK workshop in Belgium in two weeks (June 24-26 to be precise). The workshop, which is open and free to the scientific community, will be held at the Royal Institute of Natural Sciences in Brussels.
This is SUPER EXCITING to me because as a small child I spent many hours drooling in front of the Institute Museum's stunningly beautiful Iguanodons, likely leaving grubby handprints all over the glass cases, to the shame and annoyance of my parents. I also happen to have attended the Lycee Emile Jacqmain which is located in the same park, right next to the Museum (also within a stone's throw of the more recently added European Parliament) so for me this is a real trip into the past. Complete with dinosaurs!
That said, I expect you may find this workshop exciting for very different reasons, such as learning how the GATK can empower your research and hearing about the latest cutting-edge developments that you can expect for version 3.2.
See this website or the attached flyer for practical details (but note that the exact daily program may be slightly different than announced due to the latest changes in GATK) and be sure to register (it's required for admission!) by emailing cvangestel at naturalsciences.be with your name and affiliation.
Please note that the hands-on sessions (to be held on the third day) are already filled to capacity. The tutorial materials will be available on our website in the days following the workshop.
In a nutshell: if you're using a version of GATK older than 2.7, you need to request a key to disable Phone Home (if you don't already have one). See below for a full explanation.
As you may know, the GATK includes a reporting mechanism called Phone Home that sends us runtime statistics about usage and bugs. These statistics (which you can read more about here) help us make development decisions, e.g. prioritize bug fixes and new features, as well as track adoption of new versions and tools.
The system uses an AWS cloud service as a data repository, called a "bucket", which GATK accesses using an encryption key. We currently have two active AWS keys for the Phone Home bucket. GATK versions 2.6 and older use one AWS key, and versions 2.7 and later use another AWS key.
For practical reasons, we need to deactivate the old AWS key, which means that GATK jobs run using a version older than 2.7 may end with a Phone Home failure unless the system is deactivated. To be clear, the GATK analysis itself will complete successfully, but the GATK may exit with a fail code. This may cause issues in pipelines and potentially fill up error logs.
The best way to avoid these problems is to upgrade to the latest version of GATK, which you should seriously consider anyway in order to get the best possible results out of your analysis. However, if you are unable to upgrade to a recent version, we recommend that you disable the Phone Home system. To do so, you will need to follow these two simple steps:
Request a GATK key using this online form. In the "justification" field, please write "AWS key deprecation", indicate which version of GATK you are using, and if possible let us know why you are unable to upgrade to a recent version. You can expect to receive your key within 1 business day.
Once you have the key, apply it to all GATK jobs by adding
-et NO_ET -K your.key (where
your.key is the path to the key file you obtained from us) to every GATK command line.
We expect this workaround will suppress any issues EXCEPT in versions prior to 1.5, in which the Phone Home system cannot be deactivated. For versions 1.4 and older, there is nothing we can do, and you will have to either put up with the errors, or upgrade to a newer version (which you should really really do anyway!).
Let us know in the comments if you have any trouble or questions.
This may seem crazy considering we released the big 3.0 version not two weeks ago, but yes, we have a new version for you already! It's a bit of a special case because this release is all about the hardware-based optimizations we had previously announced. What we hadn't announced yet was that this is the fruit of a new collaboration with a team at Intel (which you can read more about here), so we were waiting for everyone to be ready for the big reveal.
So basically, the story is that we've started collaborating with the Intel Bio Team to enable key parts of the GATK to run more efficiently on certain hardware configurations. For our first project together, we tackled the PairHMM algorithm, which is responsible for a large proportion of the runtime of HaplotypeCaller analyses. The resulting optimizations, which are the main feature in version 3.1, produce significant speedups for HaplotypeCaller runs on a wide range of hardware.
We will continue working with Intel to further improve the performance of GATK tools that have historically been afflicted with performance issues and long runtimes (hello BQSR). As always, we hope these new features will make your life easier, and we welcome your feedback in the forum!
Note that these optimizations currently work on Linux systems only, and will not work on Mac or Windows operating systems. In the near future we will add support for Mac OS. We have no plans to add support for Windows since the GATK itself does not run on Windows.
Please note also that to take advantage of these optimizations, you need to opt-in by adding the following flag to your GATK command:
Here is a handy little table of the speedups you can expect depending on the hardware and operating system you are using. The configurations given here are the minimum requirements for benefiting from the expected speedup ranges shown in the third column. Keep in mind that these numbers are based on tests in controlled conditions; in the wild, your mileage may vary.
|Linux kernel version||Architecture / Processor||Expected speedup||Instruction set|
|Any 64-bit Linux||Any x86 64-bit||1-1.5x||Non-vector|
|Linux 2.6 or newer||Penryn (Core 2 or newer)||1.3-1.8x||SSE 4.1|
|Linux 2.6.30 or newer||SandyBridge (i3, i5, i7, Xeon E3, E5, E7 or newer)||2-2.5x||AVX|
To find out exactly which processor is in your machine, you can run this command in the terminal:
$ cat /proc/cpuinfo | grep "model name" model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
In this example, the machine has 4 cores (8-threads), so you see the answer 8 times. With the model name (here i7-2600) you can look up your hardware's relevant capabilities in the Wikipedia page on vector extensions.
Alternatively, Intel has provided us with some links to lists of processors categorized by architecture, in which you can look up your hardware:
Finally, a few notes to clarify some concepts regarding Linux kernels vs. distributions and processors vs. architectures:
SandyBridge and Penryn are microarchitectures; essentially, these are sets of instructions built into the CPU. Core 2, core i3, i4, i7, Xeon e3, e5, e7 are the processors that will implement a specific architecture to make use of the relevant improvements (see table above).
The Linux kernel has no connection with Linux distribution (e.g. Ubuntu, RedHat etc). Any distribution can use any kernel they want. There are "default kernels" shipped with each distribution, but that's beyond the scope of this article to cover (there are at least 300 Linux distributions out there). But you can always install whatever kernel version you want.
The kernel version 2.6.30 was released in 2009, so we expect every sane person or IT out there to be using something better than this.
Okay, we realize the name's a bit of a mouthful, and we're willing to tweak it if anyone has any good ideas. But never mind that. It's difficult to overstate the importance of this new approach to joint variant discovery (but I'll give it a shot) so we're really stoked to finally be able to share the details of how it's is going to work in practice.
You're probably going to be surprised at how simple it is in practice (not that it was particularly easy to actually build, mind you). The gory details are in the new document here, but here's an overview of how it looks within the Best Practices workflow you all know and (hopefully) love:
The first surprise is that instead of calling variants on multiple samples, you now just run HaplotypeCaller on each sample individually. "Oh no," I hear you cry, "but the results were so much better when I called multiple samples together!". Well yeah, but it took forever. Bear with me for a minute.
The key here is that you run HaplotypeCaller in gVCF mode. This outputs a so-called genomic VCF, which contains a record of the genotype likelihoods and annotations for every single site in the genome (or exome), whether or not there is evidence of variation. This essentially boils down all the useful information that can be gleaned from the BAM files, and makes it unnecessary to refer back to the BAM in later steps.
So you repeat that for all your samples (which goes reasonably fast since per-sample calling is pretty tractable nowadays). Optionally, you can add in a step to combine gVCF files if you're working on a really large cohort. Then in the next step, you just run a separate genotyping tool on all the gVCFs (or combined gVCFs) together, which gives you the same output (raw SNPs and indel calls) that you would have got from one-step multisample calling.
See, that's the beauty of the new workflow. A lot less work (for the computer) for equivalent results. And the ability to process samples incrementally and perform joint discovery on cohort sizes that would have previously got you hauled off to the funny farm.
Let us know what you think!
We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.
This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.
Please see the new document here for full details about how to run this workflow in practice.
In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.
Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.
Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.
For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.
Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.
We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!
Previously, we covered the spirit of GATK 3.0 (what our intentions are for this new release, and what we’re hoping to achieve). Let’s now have a look at the top three features you can look forward to in 3.0, in no particular order:
At this point everyone knows that the HaplotypeCaller is fabulous (you know this, right?) but beyond a certain number of samples that you’re trying to call jointly, it just grinds to a crawl, and any further movement is on the scale of continental drift. Obviously this is a major obstacle if you’re trying to do any kind of work at scale beyond a handful of samples, and that’s why it hasn’t been used in recent large-cohort projects despite showing best-in-class performance in terms of discovery power.
The major culprit in this case is the PairHMM algorithm, which takes up the lion’s share of HC runtime. With the help of external collaborators (to be credited in a follow-up post) we rewrote the code of the PairHMM to make it orders of magnitude faster, especially on specialized hardware like GPU and FPGA chips (but you’ll still see a speedup on “regular” hardware).
We plan to follow up on this by doing similar optimizations on the other “slowpoke” algorithms that are responsible for long runtimes in GATK tools.
Some problems in variant calling can’t be solved by Daft Punk hardware upgrades (better faster stronger) alone. Beyond the question of speed, a major issue with multi-sample variant discovery is that you have to wait until all the samples are available to call variants on them. Then, if later you want to add some more samples to your cohort, you have to re-call all of them together, old and new. This, also known as the “N+1 problem”, is a huge pain in the anatomy.
The underlying idea of the “single-sample pipeline for joint variant discovery” is to decouple the two steps in the variant calling process: identifying evidence of variation, and interpreting the evidence. Only the second step needs to be done jointly on all samples, while the first step can be done just as well (and a heck of a lot faster) on one sample at a time.
The new pipeline allows us to process each sample as it comes off the sequencing machine, up to the first step of variant calling. Cumulatively, this will produce a database of per-sample, per-site allele frequencies. Then it’s just a matter of running a joint analysis on the database, which can be done incrementally each time a new sample is added or at certain intervals or timepoints, depending on the research needs, because this step runs quickly and cheaply.
We’ll go into the details of exactly how this works in a follow-up post. For now, the take-home message is that it’s a “single-sample pipeline” because you do the heavy-lifting per-sample (and just once, ever), but you are empowered to perform “joint discovery” because you interpret the evidence from each sample in light of what you see in all the other samples, and you can do this at any point in the project timeline.
Our Best Practices recommendations for calling variants on DNA sequence data have proved to be wildly popular with the scientific community, presumably because it takes a lot of the guesswork out of running GATK, and provides a large degree of reproducibility.
Now, we’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences the early data processing steps, as well as in the calling step. We do not yet have RNAseq-specific recommendations for variant filtering/recalibration, but will be developing those in the coming weeks.
We’ll go into the details of the RNAseq Best Practices in a follow-up post, but in a nutshell, these are the key differences: use STAR for alignment, add an exon splitting and cleanup step, and tell the variant caller to take the splits into account. The latter involves some new code added to the variant callers; it is available to both HaplotypeCaller and UnifiedGenotyper, but UG is currently missing a whole lot of indels, so we do recommend using only HC in the immediate future.
Keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing. We will be improving these recommendations progressively as we go, and we hope that the researcher community will help us by providing feedback of their experiences applying our recommendations to their data.
Yep, you read that right, the next release of GATK is going to be the big Three-Oh!
You may have noticed that the 2.8 release was really slim. We explained in the release notes, perhaps a tad defensively, that it was because we’d been working on some ambitious new features that just weren’t ready for prime time. And that was true. Now we’ve got a couple of shiny new toys to show for it that we think you’re really going to like.
But GATK 3.0 is not really about the new features (otherwise we’d just call it 2.9). It’s about a shift in the way we approach the problems that we want to solve -- and to some extent, a shift in the scope of problems we choose to tackle.
We’ll explain what this entails in much more detail in a series of blog posts over the next few days, but let me reassure you right now on one very important point: there is nothing in the upcoming release that will disrupt your existing workflows. What it will do is offer you new paths for discovery that we believe will empower research on a scale that has previously not been possible.
And lest you think this is all just vaporware, here’s a sample of what we have in hand right now: variant calling on RNA-Seq, and a multisample variant discovery workflow liberated from the shackles of time and scaling issues.
Stay tuned for details!
The presentation videos for:
are available here: http://www.broadinstitute.org/gatk/guide/events?id=3391
Better late than never, here are the highlights of the most recent version release, GATK 2.8. This should be short and sweet because as releases go, 2.8 is light on new features, and is best described as a collection of bug fixes, which are all* dutifully listed in the corresponding release notes document. That said, two of the changes we've made deserve some additional explanation.
* Up to now (this release included) we have not listed updates/patches to Queue in the release notes, but will start doing so from the next version onward.
In the last release (2.7, for those of you keeping score at home) we trumpeted that the old
-percentBad argument of VariantRecalibrator had been replaced by the shiny new
-numBad argument, and that this was going to be awesome for all sorts of good reasons, improve stability and whatnot. Weeeeeeell it turned out that wasn't quite the case. It worked really well on the subset of analyses that we tested it on initially, but once we expanded to different datasets (and the complaints started rolling in on the forum) we realized that it actually made things worse in some cases because the default value was less appropriate than what
-percentBad would have produced. This left people guessing as to what value would work for their particular dataset, with a great big range to choose from and very little useful information to assist in the choice.
So, long story short, we (and by "we" I mean Ryan) built in a new function that allows the VariantRecalibrator to determine for itself the amount of variants that is appropriate to use for the "bad" model depending on the data. So the short-lived
-numBad argument is gone too, replaced by... nothing. No new argument to specify; just let the VariantRecalibrator do its thing.
Of course if you really want to, you can override the default behavior and tweak the internal thresholds. See the tool doc here; and remember that a good rule of thumb is that if you can't figure out which arguments are involved based on that doc, you probably shouldn't be messing with this advanced functionality.
This is still a rather experimental feature, so we're still making changes as we go. The two big changes worth mentioning here are that you can now run this on reduced reads, and that we've changed the indexing routine to optimize the compression level. The latter shouldn't have any immediate impact on normal users, but it was necessary for a new feature project we've been working on behind the scenes (the single-sample-to-joint-discovery pipeline we have been alluding to in recent forum discussions). The reason we're mentioning it now is that if you use
-ERC GVCF output, you'll need to specify a couple of new arguments as well (
-variant_index_type LINEAR and
-variant_index_parameter 128000, with those exact values). This useful little fact didn't quite make it into the documentation before we released, and not specifying them leads to an error message, so... there you go. No error message for you!
That's all for tool changes. In addition to those, we have made a number of corrections in the tool documentation pages, updated the Best Practices (mostly layout, tiny bit of content update related to the VQSR -numBad deprecation) and made some minor changes to the website, e.g. updated the list of publications that cite the GATK and improved the Guide index somewhat (but that's still a work in progress).
Heads up, people: our generous overlords at the Broad Institute are giving us all time off from December 23 until January 1st (included). So we will effectively be off-duty starting tomorrow evening (Friday Dec 20) for the entire Christmas and New Year period, only to return on January 2nd. During that time, we will all be busy stuffing ourselves with food and enjoying the company of our loved ones, and we hope many of you will have the opportunity to do the same, regardless of your cultural affiliations (I for one will be raising a glass of mulled wine in honor of my Gaul ancestors and the winter solstice). For those of you who will be working, I'm afraid no-one from the GATK team will be around to answer forum questions (unless one of us really needs an excuse to get away from the in-laws) so I encourage you to try to answer each other's questions in the meantime. To all, good luck, happy holidays and/or our deepest sympathy, as applicable. See you next year!
We're very pleased to announce that we have finally finished our big rewrite of the Best Practices documentation. We hope that the new format, which you can find here, will prove more user-friendly, searchable and overall more helpful than the previous version.
We have a few more improvements in mind (e.g. a clickable image map of the workflow) and there may be a few bugs here and there to iron out. So please feel free to comment on this announcement and give us feedback, whether flattering or critical, so we can improve it to help you as much as possible.