Introduction to GATK


What is GATK?
Simply what it says on the can: a Toolkit for Genome Analysis


Say you have ten exomes and you want to identify the rare mutations they all have in common – the GATK can do that. Or you need to know which mutations are specific to a group of patients, as opposed to a healthy cohort – the GATK can do that too. In fact, the GATK is the industry standard for such analyses.


But wait, there's more!

Because of the way it is built, the GATK is highly generic and can be applied to all kinds of datasets and genome analysis problems. It can be used for discovery as well as for validation. It's just as happy handling exomes as whole genomes. It can use data generated with a variety of different sequencing technologies. It can handle RNAseq data. And although it was originally developed for human genetics, the GATK has evolved to handle genome data from any organism, with any level of ploidy. Your plant has six copies of each chromosome? Bring it on.

Organisms

The GATK can handle a variety of organism genomes in addition to humans.

 

Toolbox and toolchain

The toolkit provides a wide set of tools that can be chained into workflows, taking advantage of the common architecture and powerful engine.

So what's in the can?

At the heart of the GATK is an industrial-strength infrastructure and engine that handle data access, conversion and traversal, as well as high-performance computing features. On top of that lives a rich ecosystem of specialized tools, called "walkers", that you can use out of the box, individually or chained into scripted workflows, to perform anything from simple data diagnostics to complex "reads-to-results" analyses.

Some typical workflows are detailed on the next page of this section. Please see the Tool Documentation section for a complete list of tools and their capabilities.



Using the GATK
Get started today


Platform and requirements

The GATK is designed to run on Linux and other POSIX-compatible platforms. Yes, that includes MacOS X! If you are on any of the above, see the Downloads section for downloading and installation instructions. If you're stuck with Windows, you're not completely out of luck – it's possible to use the GATK with Cygwin, although we can't provide any specific support for that. If you're on something else... no, there are no plans to port the GATK to Android or iOS in the near future.

You will need to have Java installed to run the GATK, and some tools additionally require R to generate PDF plots. Version requirements and installation instructions for both can be found in the Documentation Guide.
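If you're not sure whether those prerequisites are already on your machine, a quick check along these lines can tell you. This is only a minimal sketch: `check_prereq` is a hypothetical helper name, and it assumes that R's command-line front end is available as `Rscript`. It does not check versions, so the Documentation Guide remains the reference for those.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a command is available on the PATH.
check_prereq() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Java is needed to run the GATK; R (assumed here to be invoked as Rscript)
# is only needed by the tools that generate PDF plots.
check_prereq java
check_prereq Rscript
```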



The GATK runs on Java


The GATK runs on Java, straight from the command-line.

Interface

Now here's the kicker: the GATK does not have a graphical user interface. All tools are called via the command-line interface.

If that is not something you are used to, or you have no idea what that even means, don't worry. It's easier to learn than you might think, and there are many good online tutorials that can help you get comfortable with the command-line environment. Before you know it you'll be writing scripts to chain tools together into workflows... You don't need to have any programming experience to use the GATK, but you might pick some up along the way!


Command structure and tool arguments

All the GATK tools are called using the same basic command structure. Here's a simple example that counts the number of sequence reads in a BAM file:

java -jar GenomeAnalysisTK.jar \
	-T CountReads \
	-R example_reference.fasta \
	-I example_reads.bam

The -jar argument invokes the GATK engine itself, and the -T argument tells it which tool you want to run. Arguments like -R for the genome reference and -I for the input file are also given to the GATK engine and can be used with all the tools (see the complete list of available arguments for the GATK engine). Most tools also take additional arguments that are specific to their function. These are listed for each tool on that tool's documentation page, all easily accessible through the Tool Documentation index.
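To make that structure concrete, here is a small dry-run sketch that assembles a command from its parts and prints it instead of running it. The function name `gatk_cmd` is hypothetical (it is not part of the GATK), and the only flags it uses are the ones described above: -T, -R and -I. Anything passed after the first three values is appended untouched, which is where tool-specific arguments would go.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (but do not run) a GATK command.
# Usage: gatk_cmd TOOL REFERENCE INPUT [tool-specific arguments...]
gatk_cmd() {
    local tool="$1" reference="$2" input="$3"
    shift 3
    # Engine-level arguments come first; any remaining arguments are passed through as-is.
    echo java -jar GenomeAnalysisTK.jar -T "$tool" -R "$reference" -I "$input" "$@"
}

# Prints the CountReads command from the example above on a single line.
gatk_cmd CountReads example_reference.fasta example_reads.bam
```

Printing the command first is a cheap way to check what a script is about to launch before committing to a long run.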

Output of the example command

The GATK outputs structured command information, status messages and result summaries to the console.


Please see this page for more detailed tutorials on using the GATK tools.



Typical Workflows
From sequencing reads to actionable results


When you're isolating DNA in the lab, you don't treat the work like isolated, disconnected tasks. Every task is a step in a well-documented protocol, carefully developed to optimize yield and purity, and to ensure reproducibility and consistency across all samples and experiments.

We believe working with NGS data should be exactly the same.

That's why we have developed industry-standard workflows that are optimized to produce the most accurate results from your dataset, with the most efficiency in terms of both manual handling and computational cost.


NGS data processing

Whatever the sequencing technology you're using, you need to process the raw dataset to make it suitable for analysis. These data processing workflows guide you through the necessary steps, with detailed explanations of each operation, why it is required and what transformations are applied to the data.

data processing workflow

Variant discovery, genotyping and filtering

Finding sequence variation within and between samples is fairly straightforward. Distinguishing what part of that variation is real and assigning the right genotypes is a heck of a lot more difficult. These variant discovery, genotyping and filtering workflows and protocols help you choose the parameters that are most appropriate for your dataset and guide you through the necessary steps to produce a variant callset that you can trust. Various options are available depending on whether you're working with whole genomes or exomes and according to the type, number and coverage depth of your samples.

variant discovery, genotyping and filtering workflow

Best practices for calling variants with the GATK

This reads-to-results variant calling workflow lays out the best practices recommended by our group for all the steps involved in calling variants with the GATK. It is used in production at the Broad Institute on every genome that rolls out of the sequencing facility. Be sure to check out the series of workshop videos dedicated to this workflow!

Best Practices workflow

See the Best Practices documentation for more detail.



High Performance
Built for scalability and parallelism


The GATK was built from the ground up with performance in mind.

Map/Reduce: it's not just for Google anymore

Every GATK walker is built using the Map/Reduce framework, which is basically a strategy to speed up performance by breaking down large iterative tasks into shorter segments and then merging the overall results.
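As a toy illustration of the idea (and emphatically not the GATK's own implementation), the two steps can be sketched in plain shell. The "map" step emits one value per input record and the "reduce" step merges those values into a single result. The function names `map_count` and `reduce_sum` and the three made-up records are assumptions for the example.

```shell
#!/usr/bin/env bash
# Map: emit the value 1 for every input record (one record per line).
map_count() {
    while IFS= read -r _record; do
        echo 1
    done
}

# Reduce: merge the emitted values into a single total.
reduce_sum() {
    local total=0 value
    while read -r value; do
        total=$((total + value))
    done
    echo "$total"
}

# Three made-up records stand in for sequence reads; prints 3.
printf 'read1\nread2\nread3\n' | map_count | reduce_sum
```

Because each record is mapped independently of the others, the map step is the part that can be split across segments of the data, with the reduce step combining the partial results afterwards.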

Multi-threading

The GATK takes advantage of the latest processors using multi-threading, i.e. running on multiple cores of the same machine. Multi-threading is enabled simply by using the -nt and -nct command-line arguments. See this article for details.
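Here is what that might look like in practice, again as a dry run that only prints the command. The sketch assumes that the tool being run (CountReads, reusing the file names from the earlier example) accepts -nct. Not every tool supports both threading arguments, so check the tool's documentation page before relying on this. The function name `gatk_threaded_cmd` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a CountReads command that requests N threads via -nct.
# Assumption: the chosen tool accepts -nct; verify this in its documentation.
gatk_threaded_cmd() {
    local threads="${1:-1}"
    echo java -jar GenomeAnalysisTK.jar -T CountReads -R example_reference.fasta -I example_reads.bam -nct "$threads"
}

# Prints the command with "-nct 4" appended.
gatk_threaded_cmd 4
```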




Queue and scatter-gather

Queue uses a scatter-gather process to parallelize operations.

Out on the farm with Queue

Queue is a companion program that allows the GATK to take parallelization to the next level: running jobs on a high-performance computing cluster, or server farm.

Queue manages the entire process of breaking down big jobs into many smaller ones (scatter) then collecting and merging results when they are done (gather).
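The scatter-gather pattern itself is easy to sketch without Queue. The toy example below is plain shell, not Queue's scripting interface, and the function name `scatter_gather_count` is hypothetical. It counts the records in a file by splitting it into chunks, processing each chunk as an independent background job, and then merging the partial counts.

```shell
#!/usr/bin/env bash
# Toy scatter-gather: count the lines of INPUT in chunks of CHUNK_LINES lines each.
# Usage: scatter_gather_count INPUT CHUNK_LINES
scatter_gather_count() {
    local input="$1" chunk_lines="$2" workdir chunk total=0 n
    # Nothing to scatter if the input is missing or empty.
    [ -s "$input" ] || { echo 0; return; }
    workdir="$(mktemp -d)"
    # Scatter: break the big job down into smaller pieces.
    split -l "$chunk_lines" "$input" "$workdir/chunk."
    # Run one independent job per piece (here, a simple line count) in the background.
    for chunk in "$workdir"/chunk.*; do
        wc -l < "$chunk" > "$chunk.count" &
    done
    wait
    # Gather: collect the partial results and merge them.
    for chunk in "$workdir"/chunk.*.count; do
        read -r n < "$chunk"
        total=$((total + n))
    done
    rm -rf "$workdir"
    echo "$total"
}
```

On a cluster, each piece would be submitted as a separate job rather than a background process, but the shape of the work is the same: split, run independently, merge.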

At the Broad, we use a Queue pipeline to run GATK analyses on hundreds, even thousands of exomes, on our cluster of hundreds of nodes.

 
See this article for more details on parallelism with the GATK.



Getting Help
Don't panic! Help is at hand.


The GATK has a reputation for being wicked complicated, and it's not entirely undeserved. With great power comes great complexity... But don't panic! Help is at hand.

Our crack team of space monkeys has put together a revolutionary documentation and support system to help you get the answers you need, fast and with the least amount of pain. It's composed of reference guides, tutorials, videos and a forum that are all cross-connected, so that relevant material from various sources is automatically aggregated for you, whether you're browsing an index or searching by keyword.

All that time you'll save on looking for information and troubleshooting? Feel free to use it to get even more work done... or finally have a life!


The Hitchhiker's Guide to the GATK


If you've ever wondered "Is there a tool that can do X?" or "Does this tool have an argument to do Y?", the Technical Documentation should be your first stop.

Every tool in the GATK has its own article detailing what it does and how it does it, as well as all the available options, default parameter values and argument names. There are also articles detailing options and arguments of the GATK engine, which are common to all GATK tools, and documenting companion software such as Queue.

In addition to this technical documentation, the Guide also features more "meta" articles detailing methods and workflows including Best Practices recommendations for study design and analysis. Or if you're looking to write your own walkers and Queue scripts, check out the Developer Zone!


FAQs, tutorials and videos


The Guide is further enriched by a regularly updated collection of Frequently Asked Questions, as well as tutorials – some in video form – that will guide you step by step through various tasks such as running analyses and troubleshooting errors.

And because we know there are few things more frustrating than trawling through tutorials that are either too basic, too advanced or otherwise not appropriate for your needs, each tutorial is clearly labeled to identify the intended audience type (Analyst or Developer), level (Basic, Intermediate, or Advanced) and prerequisite knowledge.


Community forum


Finally, if you've exhausted all these avenues and still haven't found the answer to your question, ask the forum! It's powered by rainbows and staffed by unicorns who love to answer USER ERROR questions.

Well, maybe not unicorns, but a team of computational biologists and software engineers who work hard to address your problems quickly and accurately. If something's not clearly documented, we'll answer your question and improve the docs accordingly. If you think you found a bug, we'll track it down and fix it. Just Ask the Team.



Licensing & Source Code


Free for academics, fee for commercial use

The GATK and its sister program, MuTect, are increasingly used not just by academic researchers, but by commercial companies who have needs, in terms of support and production readiness, that are beyond what our small development team can provide. In order to meet that demand, we release GATK and MuTect under a mixed licensing model, in which researchers at academic and non-profit organizations can access the tools and source code for free, while for-profit organizations are asked to purchase a license. The revenue generated by this model is then used to fund and build out our support team and infrastructure to accommodate the demand for support in the community, as well as to invest more resources in improving development speed, functionality and stability overall.

Direct licensing and support through Broad

Until now, we have relied on a commercial partner to provide licensing and premium support services. Starting April 16, 2015, we will be providing licensing and support directly to commercial entities that will be running the GATK or MuTect internally or as part of their own hardware offering. Current licensed users will transition to Broad Institute when their current license expires. This new model will allow licensed customers better access to the GATK and MuTect development and support teams, full support for the latest releases of our tools, and the most up-to-date Best Practice recommendations that are based on our team's extensive analysis and R&D work.



For more information about the upcoming transition from Appistry to Broad licensing, please read our recent announcement on the GATK blog or contact softwarelicensing@broadinstitute.org directly with any questions about licensing, pricing or the availability of premium support.