The Genome Analysis Toolkit
What is the GATK?
The GATK is two things. First, it is a structured software library that makes it easy to write efficient analysis tools for next-generation sequencing data. Second, it is a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include a depth of coverage analyzer, a quality score recalibrator, a SNP/indel caller, and a local realigner.
We aim to work well with both samtools and Picard by providing tools that complement those available in the two packages. Our SNP calling pipeline (Q score recalibration -> multiple sequence realignment -> SNP/indel calling) is a particular area of focus, and we have been pushing to make these capabilities as general-purpose and powerful as possible. Our group's mandate is to ensure the success of the human medical resequencing projects we've undertaken at the Broad over the next 2-3 years. This involves providing a robust, production-quality development library that underlies tools for common analysis problems (like SNP calling), as well as enabling exploratory research on NGS data.
Take a look at File:CBBO 100709 v3.pptx.pdf to view a presentation that provides an introduction to some of the capabilities of the GATK and its application to the 1000 Genomes project.
Introduction
The massive size of the data generated by next-generation DNA sequencing projects, such as the 1000 Genomes Project, whose pilot data freeze alone includes nearly five terabases of sequence, requires sophisticated data management, from data storage to analysis. The complexity of managing access to such enormous amounts of data usually leads developers to produce analysis tools that are feature-poor, inefficient, and brittle. Worse, many of the first-line users of next-generation sequencing data are professional biologists whose ability to analyze their own data and answer their scientific questions is severely hampered by the complexity of accessing and manipulating the data files produced by these machines. A key challenge shared by all users of next-generation sequencing, from advanced algorithm developers at large-scale sequencing centers to first-year graduate students working with an Illumina instrument for the first time, is how to easily write tools that efficiently analyze their next-generation sequencer data.
The Genome Analysis Toolkit (GATK) is a structured programming framework designed to enable rapid development of efficient and robust analysis tools for next-generation DNA sequencers. The GATK solves the data management challenge by separating data access patterns from analysis algorithms, using the functional programming philosophy of Map/Reduce (see Figure). Consequently, the GATK is structured into data traversals and data walkers that interact through a programming contract: the traversal provides a series of units of data to the walker, and the walker consumes each datum and generates an output for it (see Figure). Because many tools that analyze next-generation sequencing data access the data in very similar ways, the GATK can provide a small but nearly comprehensive set of traversal types that satisfy the data access needs of the majority of analysis tools. For example, the traversals “by each sequencer read” and “by every read covering each locus in a genome” are common to many tools, such as those that count reads, build base quality histograms, report average coverage of the genome, and call SNPs. Because these few traversals are shared among many tools, the core GATK development team can optimize them for correctness, stability, CPU performance, and memory footprint, and in many cases can even parallelize calculations automatically. Moreover, since the traversal engine encapsulates the complexity of efficiently accessing next-generation sequencing data, researchers and developers are free to focus on their specific analysis algorithms (see Figure). This not only vastly improves the productivity of developers, who can quickly write new analyses, but also results in tools that are efficient and robust and that benefit from improvements to a common data management engine.
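The traversal/walker contract described above can be illustrated with a minimal, self-contained Java sketch. It is not the GATK API: every name in it (Walker, Traversal, ToyRead, CountBasesWalker, WalkerSketch) is hypothetical, and the "reads" are plain in-memory strings rather than records from a BAM file. It only shows the division of labor, in which the traversal owns data access and iteration while the walker supplies a map step and a reduce step.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical contract (an assumption for illustration, not the GATK's own
// classes): a walker maps each datum to a value and folds the values together.
interface Walker<D, M, R> {
    M map(D datum);            // analysis applied to a single unit of data
    R reduceInit();            // starting value of the accumulator
    R reduce(M value, R sum);  // fold one map result into the accumulator
}

// The traversal owns data access and iteration order; the walker never sees
// where the data comes from.
final class Traversal {
    static <D, M, R> R traverse(Iterable<D> data, Walker<D, M, R> walker) {
        R sum = walker.reduceInit();
        for (D datum : data) {
            sum = walker.reduce(walker.map(datum), sum);
        }
        return sum;
    }
}

// Toy stand-in for a sequencer read: just a string of bases.
final class ToyRead {
    final String bases;
    ToyRead(String bases) { this.bases = bases; }
}

// Example walker for a "by each read" traversal: total number of bases.
final class CountBasesWalker implements Walker<ToyRead, Integer, Long> {
    public Integer map(ToyRead read) { return read.bases.length(); }
    public Long reduceInit() { return 0L; }
    public Long reduce(Integer value, Long sum) { return sum + value; }
}

class WalkerSketch {
    public static void main(String[] args) {
        List<ToyRead> reads = Arrays.asList(
                new ToyRead("ACGT"), new ToyRead("GGA"), new ToyRead("T"));
        long total = Traversal.traverse(reads, new CountBasesWalker());
        System.out.println("total bases = " + total); // prints "total bases = 8"
    }
}
```

In this sketch, a new analysis means writing only a new Walker implementation, while the Traversal code stays the same. That is the property that lets a shared engine be optimized once for every tool built on it.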
Before Using the GATK
Having trouble? Getting help
If you are having any problems with the GATK, the first thing to do is to make sure that your input files adhere to our official policy. See here for details.
We support the GATK via a Get Satisfaction forum at http://getsatisfaction.com/gsa. This is the place to go to search through previously asked questions and to post new ones; please do a quick search before posting a question as there's a good chance it has been asked before! More information about joining the GATK development and user community can be found at Providing feedback and getting help.
Getting Started
- Running the GATK for the first time
- Built-in command-line arguments
- Built-in walkers
- GATK Error Messages
- GATK licensing
Using the GATK for Variant Detection
The following pages describe the current best practices for calling variants from various types of next-generation sequencing data. Please note that as we improve our tools and methods, many of the steps are likely to change.
GATK publications and news items
- GATK framework paper -- July 2010
- Broad's GATK Aims to Help Developers Keep Pace with Rapidly Evolving Sequencing Tools -- July 2010
- A Foundation for Next-Generation Analysis Tools -- August 2010
BAM Processing and Analysis Tools
- Local realignment around indels -- correct alignment errors due to indels
- Base quality score recalibration -- correct inaccurate base quality scores
- Read Clipping
- Depth of Coverage v3.0 -- how much data do I have?
- Callable Loci Walker
Variant Discovery Tools
- Unified Genotyper -- call SNPs (and, soon, indels) and assign genotypes
- Variant quality score recalibration -- determine the reliability of individual SNP call sites, in a probabilistic framework
- Smart Merging of Batched Calls
- Indel Genotyper V2.0 -- indel discovery
- Interface with BEAGLE imputation software
Variant Evaluation and Manipulation Tools
- VariantEval -- calculate general purpose metrics like percent in dbSNP, genotype concordance, Ti/Tv ratios, etc.
- Variant Annotator -- richly annotate variant calls
- Variant Filtration -- filter calls based on annotations
- GenomicAnnotator -- annotate variant calls with classic genomic annotations
- Analyze Annotations
- Combine Variants
- Select Variants
- VariantsToVCF -- convert variant calls to VCF format
Sequenom Utilities
- Creating Sequenom Probe Files -- Using the GATK to design Sequenom probe files
- Sequenom Validation Converter -- be sure that your .ped file uses the proper naming convention
Fastq and Fasta Utilities
- FastqToBam
- BamToFastq
- Creating a Fasta Reference
- Creating a Fasta Alternate Reference (incorporating variants into the reference)
Miscellaneous Experimental (and Potentially Unstable) Tools
Tools in this class are shared with the external world but aren't supported in any way by GSA. These tools may change or be removed entirely without notice, as they represent exploratory or research projects. They are included here because some users, including intra-Broad users and GSA members, may be experimenting with them. Please be understanding if you have issues with these tools.
- Sting BWA/C bindings
- HLA caller algorithm -- determine HLA types at class I (A, B, C) and class II (DRB1, DQA1, DQB1, DPA1, DPB1) loci.
- Adding Sample data to an analysis
If you are looking for historical information see Deprecated walkers.
GATK file formats
See here.
Programming in the GATK
Writing Walkers
- Your first walker
- Command-line arguments
- Walker data requirements
- Collecting output
- Documenting walkers
- Building a new release and redistributing walkers
- Writing and working with ReferenceOrderedData classes
- ReadBackedPileup -- Introduction to version 2, Nov. 24, 2009
Contributing to the GATK
- Building the GATK
- Configuring IntelliJ
- GATK architecture
- Submitting patches
- Adding and updating dependencies
- Archiving files
- Coding standards
- Running findbugs
- Updating the Tribble library
- GSA members only
Advanced features
- Parallelism and the GATK
- Using GATK from Matlab
- Writing walkers in Scala
- Seeing deletion spanning reads in LocusWalkers
- ROD walkers
- Writing unit / regression tests for walkers
- Report format tool
- Using JEXL expressions
- Representing Indels and other complex events and working with them with Variant Contexts
