Introduction to GATK


What is GATK?
Simply what it says on the can: a Toolkit for Genome Analysis


Say you have ten exomes and you want to identify the rare mutations they all have in common – the GATK can do that. Or you need to know which mutations are specific to a group of patients, as opposed to a healthy cohort – the GATK can do that too. In fact, the GATK is the industry standard for such analyses.


But wait, there's more!

Because of the way it is built, the GATK is highly generic and can be applied to all kinds of datasets and genome analysis problems. It can be used for discovery as well as for validation. It's just as happy handling exomes as whole genomes. It can use data generated with a variety of different sequencing technologies. And although it was originally developed for human genetics, the GATK has evolved to handle genome data from any organism, with any level of ploidy. Your plant has six copies of each chromosome? Bring it on.

Organisms

The GATK can handle a variety of organism genomes in addition to humans.

 

Toolbox and toolchain

The toolkit provides a wide set of tools that can be chained into workflows, taking advantage of the common architecture and powerful engine.

So what's in the can?

At the heart of the GATK is an industrial-strength infrastructure and engine that handle data access, conversion and traversal, as well as high-performance computing features. On top of that lives a rich ecosystem of specialized tools, called "walkers", that you can use out of the box, individually or chained into scripted workflows, to perform anything from simple data diagnostics to complex "reads-to-results" analyses.

Some typical workflows are detailed on the next page of this section. Please see the Technical Documentation section for a complete list of tools and their capabilities.



Using the GATK
Get started today


Platform and requirements

The GATK is designed to run on Linux and other POSIX-compatible platforms. Yes, that includes MacOS X! If you are on any of the above, see the Downloads section for downloading and installation instructions. If you're stuck with Windows, you're not completely out of luck – it's possible to use the GATK with Cygwin, although we can't provide any specific support for that. If you're on something else... no, there are no plans to port the GATK to Android or iOS in the near future.

You will need to have Java installed to run the GATK, and some tools additionally require R to generate PDF plots. Version requirements and installation instructions for both can be found in the Documentation Guide.
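
If you want a quick sanity check before downloading anything, something like the following sketch can tell you whether both prerequisites are on your PATH. The helper name `check_tool` is made up for this example, and the script only checks that the commands exist; it does not verify that their versions meet the requirements listed in the Documentation Guide.

```shell
#!/bin/sh
# Report whether a command is available on the PATH.
# check_tool is a hypothetical helper name, not part of the GATK.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Java is required by every GATK tool.
check_tool java || echo "Install Java before running the GATK."
# R is only needed by the tools that generate PDF plots.
check_tool Rscript || echo "Install R if you need the plotting tools."
```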



The GATK runs on Java


The GATK runs on Java, straight from the command-line.

Interface

Now here's the kicker: the GATK does not have a graphical user interface. All tools are called via the command-line interface.

If that is not something you are used to, or you have no idea what that even means, don't worry. It's easier to learn than you might think, and there are many good online tutorials that can help you get comfortable with the command-line environment. Before you know it you'll be writing scripts to chain tools together into workflows... You don't need to have any programming experience to use the GATK, but you might pick some up along the way!


Command structure and tool arguments

All the GATK tools are called using the same basic command structure. Here's a simple example that counts the number of sequence reads in a BAM file:

java -jar GenomeAnalysisTK.jar \
	-T CountReads \
	-R example_reference.fasta \
	-I example_reads.bam

The -jar argument invokes the GATK engine itself, and the -T argument tells it which tool you want to run. Arguments like -R for the genome reference and -I for the input file are also given to the GATK engine and can be used with all the tools (see the complete list of available arguments for the GATK engine). Most tools also take additional arguments that are specific to their function. These are listed for each tool on that tool's documentation page, all easily accessible through the Technical Documentation index.
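
Because every call follows the same pattern (engine, then tool, then arguments), it can be handy to wrap it in a small shell function. The sketch below is only an illustration: `gatk_cmd` and the `GATK_JAR` variable are names invented here, and the function prints the command instead of running it so you can review it first.

```shell
#!/usr/bin/env bash
# Assemble a GATK command line from its parts: the engine (-jar),
# the tool (-T) and the remaining arguments.
# gatk_cmd and GATK_JAR are illustrative names, not part of the GATK.
GATK_JAR="${GATK_JAR:-GenomeAnalysisTK.jar}"

gatk_cmd() {
    local tool="$1"; shift
    # Everything after the tool name is passed through unchanged:
    # engine arguments such as -R and -I, then any tool-specific ones.
    echo java -jar "$GATK_JAR" -T "$tool" "$@"
}

# Prints the CountReads command from the example above.
gatk_cmd CountReads -R example_reference.fasta -I example_reads.bam
```

To actually run the command, drop the `echo` inside the function.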

Output of the example command

The GATK outputs structured command information, status messages and result summaries to the console.


Please see this page for more detailed tutorials on using the GATK tools.



Typical Workflows
From sequencing reads to actionable results


When you're isolating DNA in the lab, you don't treat the work like isolated, disconnected tasks. Every task is a step in a well-documented protocol, carefully developed to optimize yield and purity and to ensure reproducibility as well as consistency across all samples and experiments.

We believe working with NGS data should be exactly the same.

That's why we have developed industry-standard workflows that are optimized to produce the most accurate results from your dataset, with the most efficiency in terms of both manual handling and computational cost.


NGS data processing

Whatever the sequencing technology you're using, you need to process the raw dataset to make it suitable for analysis. These data processing workflows guide you through the necessary steps, with detailed explanations of each operation, why it is required and what transformations are applied to the data.

data processing workflow

Variant discovery, genotyping and filtering

Finding sequence variation within and between samples is fairly straightforward. Distinguishing what part of that variation is real and assigning the right genotypes is a heck of a lot more difficult. These variant discovery, genotyping and filtering workflows and protocols help you choose the parameters that are most appropriate for your dataset and guide you through the necessary steps to produce a variant callset that you can trust. Various options are available depending on whether you're working with whole genomes or exomes and according to the type, number and coverage depth of your samples.

variant discovery, genotyping and filtering workflow

Best practices for calling variants with the GATK

This reads-to-results variant calling workflow lays out the best practices recommended by our group for all the steps involved in calling variants with the GATK. It is used in production at the Broad Institute on every genome that rolls out of the sequencing facility. Be sure to check out the series of workshop videos dedicated to this workflow!

Best Practices workflow

Other workflows are available here.



High Performance
Built for scalability and parallelism


The GATK was built from the ground up with performance in mind.

Map/Reduce: it's not just for Google anymore

Every GATK walker is built using the Map/Reduce framework, which is basically a strategy to speed up performance by breaking down large iterative tasks into shorter segments, then merging the overall results.
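
To make the idea concrete, here is a toy Map/Reduce in plain shell. It has nothing to do with how the GATK engine is actually implemented; it just shows the shape of the strategy: a map step that handles each segment independently, and a reduce step that merges the partial results. The function names `map_count_a` and `reduce_sum` and the sample sequences are invented for this sketch.

```shell
#!/usr/bin/env bash
# Toy Map/Reduce: count the A bases across a set of sequence segments.
# map_count_a and reduce_sum are illustrative names only.

# Map: process one segment on its own and emit a partial result.
map_count_a() {
    local only_a="${1//[^A]/}"   # delete every character except A
    echo "${#only_a}"            # the remaining length is the A count
}

# Reduce: merge the partial results (one number per line) into a total.
reduce_sum() {
    local total=0 n
    while read -r n; do
        total=$((total + n))
    done
    echo "$total"
}

segments=("GATTACA" "AAAC" "TTGG")
for s in "${segments[@]}"; do
    map_count_a "$s"
done | reduce_sum   # prints 6
```

Because each map call depends only on its own segment, the segments could be processed in any order, or at the same time, without changing the total.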

Multi-threading

The GATK takes advantage of the latest processors using multi-threading, i.e. running on multiple cores of the same machine. Multi-threading is enabled simply by using the -nt and -nct command line arguments. See this article for details.
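
As a rough sketch, this is what adding a thread-count argument to the earlier CountReads example might look like. It assumes that the tool you are running accepts -nct; which tools support -nt, -nct or both is covered in the article linked above. The helper `pick_threads` is a made-up name, and the cap of 4 threads is an arbitrary choice for illustration.

```shell
#!/usr/bin/env bash
# pick_threads is a hypothetical helper: it prints the number of online
# cores on this machine, capped at the limit given as its first argument.
pick_threads() {
    local limit="$1" cores
    cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
    if [ "$cores" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$cores"
    fi
}

threads="$(pick_threads 4)"

# Print the command for review; remove the echo to run it.
echo java -jar GenomeAnalysisTK.jar \
    -T CountReads \
    -R example_reference.fasta \
    -I example_reads.bam \
    -nct "$threads"
```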




Queue and scatter-gather

Queue uses a scatter-gather process to parallelize operations.

Out on the farm with Queue

Queue is a companion program that allows the GATK to take parallelization to the next level: running jobs on a high-performance computing cluster, or server farm.

Queue manages the entire process of breaking down big jobs into many smaller ones (scatter) then collecting and merging results when they are done (gather).
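
The scatter and gather steps can be sketched in a few lines of plain shell. To be clear, this is not how Queue is implemented and it runs everything on one machine rather than a cluster; it only illustrates the pattern. `run_job` is a placeholder for a real per-interval command (for example, a GATK call restricted to one region with the engine's -L argument), and `scatter_gather` and the interval names are invented for this sketch.

```shell
#!/usr/bin/env bash
# Scatter-gather sketch in plain shell; NOT the Queue implementation.
# run_job stands in for a real per-interval job.
run_job() {
    echo "processed $1"
}

scatter_gather() {
    local workdir i=0 j interval
    workdir="$(mktemp -d)"

    # Scatter: start one background job per interval.
    for interval in "$@"; do
        run_job "$interval" > "$workdir/part_$i.txt" &
        i=$((i + 1))
    done
    wait   # block until every scattered job has finished

    # Gather: merge the partial outputs in their original order.
    for ((j = 0; j < i; j++)); do
        cat "$workdir/part_$j.txt"
    done
    rm -rf "$workdir"
}

scatter_gather chr1 chr2 chr3
```

Writing each job's output to its own numbered file is what lets the gather step restore the original order, no matter which job finishes first.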

At the Broad, we use a Queue pipeline to run GATK analyses on hundreds, even thousands of exomes, on our cluster of hundreds of nodes.

 
See this article for more details on parallelism with the GATK.



Getting Help
Don't panic! Help is at hand.


The GATK has a reputation for being wicked complicated, and it's not entirely undeserved. With great power comes great complexity... But don't panic! Help is at hand.

Our crack team of space monkeys has put together a revolutionary documentation and support system to help you get the answers you need, fast and with the least amount of pain. It's composed of reference guides, tutorials, videos and a forum that are all cross-connected, so that relevant material from various sources is automatically aggregated for you, whether you're browsing an index or searching by keyword.

All that time you'll save on looking for information and troubleshooting? Feel free to use it to get even more work done... or finally have a life!


The Hitchhiker's Guide to the GATK


If you've ever wondered "Is there a tool that can do X?" or "Does this tool have an argument to do Y?", the Technical Documentation should be your first stop.

Every tool in the GATK has its own article detailing what it does and how it does it, as well as all the available options, default parameter values and argument names. There are also articles detailing options and arguments of the GATK engine, which are common to all GATK tools, and documenting companion software such as Queue.

In addition to this technical documentation, the Guide also features more "meta" articles detailing methods and workflows including Best Practices recommendations for study design and analysis. Or if you're looking to write your own walkers and Queue scripts, check out the Developer Zone!


FAQs, tutorials and videos


The Guide is further enriched by a regularly updated collection of Frequently Asked Questions, as well as tutorials – some in video form – that will guide you step by step through various tasks such as running analyses and troubleshooting errors.

And because we know there are few things more frustrating than trawling through tutorials that are either too basic, too advanced or otherwise not appropriate for your needs, each tutorial is clearly labeled to identify the intended audience type ( Analyst or Developer ), level ( Basic, Intermediate, or Advanced ) and prerequisite knowledge.


Community forum


Finally, if you've exhausted all these avenues and still haven't found the answer to your question, ask the forum! It's powered by rainbows and staffed by unicorns who love to answer USER ERROR questions.

Well, maybe not unicorns, but a team of computational biologists and software engineers who work hard to address your problems quickly and accurately. If something's not clearly documented, we'll answer your question and improve the docs accordingly. If you think you found a bug, we'll track it down and fix it. Just Ask the Team.



Licensing & Source Code


Free for academics, fee for commercial use

The GATK is increasingly used not just by academic researchers, but by commercial companies who have needs, in terms of support and production readiness, that are beyond what our small development team can provide. Charging a licensing fee for commercial use gives us the means to provide upgraded support for those users, as well as invest more resources to improve development speed, functionality and stability overall. That being said, we remain committed to providing free access to the full-featured GATK for non-commercial use by the academic scientific community.

Mixed closed/open-source model

The entire GATK engine and infrastructure (i.e. the programming framework) as well as a large number of utility tools are open-source and provided to all free of charge under the original MIT license. The source code for the full GATK suite (which also includes the Best Practices tools) is freely available for non-commercial use to the academic research community, while commercial users may obtain access to the source code if they purchase a license from our partner, Appistry. Please see the license text for full details.


Appistry

Introducing Appistry, our exclusive partner for commercial licensing and support

We have selected Appistry to be the exclusive partner who will handle for-profit GATK licensing and support. If you have any questions, please visit their page of Frequently Asked Questions about for-profit GATK, and contact them directly to inquire about pricing or any "special case" scenario you feel may apply to your use of the GATK.


Which GATK package is right for you?


GATK packages

If you would like to discuss why we have made these changes to the licensing model, please see our own discussion thread on licensing and source code in GATK 2.

