The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.
GATK Best Practices
Recommended workflows for variant discovery analysis with GATK
What can you do with this?
The GATK Best Practices workflows provide step-by-step recommendations for performing variant discovery analysis in high-throughput sequencing (HTS) data. They enable discovery of SNPs and small indels (no size limit in theory but adjustments may be required to call indels > 50 bp) in DNA and RNAseq. They do not yet enable discovery of structural variants (SVs) or copy number variants (CNVs). Although they were originally designed for human genome research, the GATK Best Practices are widely used (with adaptations as described in the documentation) for analysis of non-human organisms of all kinds, including non-diploids.
What's in the box?
Our recommendations describe in detail the core processing steps required to go from raw reads coming off the sequencing machine all the way to a variant callset that can be used in downstream analyses. Wherever we can, we try to provide guidance on experimental design, quality control (QC) and pipeline implementation options (how to set it up in practice). Please understand, however, that these depend on many factors, including the sequencing technology and the hardware infrastructure at your disposal, so you may need to adapt our recommendations to your specific situation.
Break it down
We currently have two separate workflows, one for Germline DNA and one for RNAseq, and we are developing a workflow for Somatic DNA. All of them are divided into three sequential phases:
Use the buttons in the figure above to start exploring the Best Practices workflows in detail.
Important notes on scope and limitations
Origin of the Practices
We develop and validate these workflows in collaboration with many investigators within the Broad Institute's network of affiliates. They are deployed at scale in the Broad's production pipelines -- a very large scale indeed. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable (no pun intended). However, our testing does focus largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. See the FAQs and Common Problems documentation in particular for help with that. Note that we may not be able to provide recommendations on how to deal with very different experimental designs or divergent datatypes (such as Ion Torrent).
Beware legacy commands (or, trust but verify)
If someone hands you a script and tells you "this implements the GATK Best Practices", start by asking which version it was written for and when it was written. Both our software and our usage recommendations evolve in step with the rapid pace of technological and methodological innovation in the field of genomics, so what was Best Practice last year (let alone in 2010) may no longer be applicable. Even if all the steps seem to be in accordance with our docs (same tools in the same order), you should still check every single argument in the commands. If anything is different, figure out what it does. It's one or two hours of your life that can save you days of troubleshooting. We're working on a way to produce versioned Best Practices documents that will mitigate this problem, but in the meantime, protect yourself by being thorough.
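As a rough illustration of the "check every single argument" advice, the sketch below compares the arguments of a command taken from the documentation against those in an inherited script. It is not part of GATK: the helper names (parse_args, diff_commands) are hypothetical, and the two example command lines and their values are placeholders chosen for illustration, not recommended settings. It also assumes simple "-flag value" style arguments, so anything more unusual would still need to be checked by hand.

```python
import shlex


def parse_args(command):
    """Split a command line into {flag: [values]}.

    Tokens that appear before the first flag are stored under the key "".
    A token counts as a flag if it starts with "-" and is not a negative number.
    """
    parsed = {}
    flag = ""
    for token in shlex.split(command):
        if token.startswith("-") and not token[1:2].isdigit():
            flag = token
            parsed.setdefault(flag, [])
        else:
            parsed.setdefault(flag, []).append(token)
    return parsed


def diff_commands(documented, legacy):
    """Report flags that differ between a documented and a legacy command."""
    doc, old = parse_args(documented), parse_args(legacy)
    shared = sorted(set(doc) & set(old))
    return {
        "only_in_documented": sorted(set(doc) - set(old)),
        "only_in_legacy": sorted(set(old) - set(doc)),
        "changed": {f: (doc[f], old[f]) for f in shared if doc[f] != old[f]},
    }


if __name__ == "__main__":
    # Placeholder commands for illustration only; values are not recommendations.
    documented = ("java -jar GenomeAnalysisTK.jar -T HaplotypeCaller "
                  "-R ref.fasta -I sample.bam -stand_call_conf 30 -o out.vcf")
    legacy = ("java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper "
              "-R ref.fasta -I sample.bam -stand_call_conf 50 -dcov 200 -o out.vcf")
    report = diff_commands(documented, legacy)
    for section, content in report.items():
        print(section, content)
```

For the two placeholder commands, the report flags -dcov as present only in the legacy script and lists -T and -stand_call_conf as changed. Each item in the report is something to look up in the documentation for the version you are actually running.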
It's lonely at the edge
Several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If you are working at the cutting edge of your field and no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware that some issues are currently still without a good solution. On the bright side, if you generate those resources yourself, you can make a big impact.