The Genome Analysis Toolkit
Introduction
What is the GATK?
The GATK is two things. First, it is a structured software library that makes it very easy to write efficient analysis tools for next-generation sequencing data. Second, it is a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include a depth of coverage analyzer, a quality score recalibrator, a SNP/indel caller, and a local realigner.
We aim to work well with both samtools and Picard by providing tools that complement those available in the two packages. Our SNP calling pipeline (Q score recalibration -> multiple sequence realignment -> SNP/indel calling) is a particular area of focus, and we have been pushing to make these capabilities as general-purpose and powerful as possible. Our group's mandate is to ensure the success of the human medical resequencing projects we've undertaken at the Broad over the next 2-3 years, which involves providing a robust, production-quality development library that underlies tools for common analysis problems (like SNP calling) as well as enabling exploratory research on NGS data.
Take a look at File:CBBO 100709 v3.pptx.pdf to view a presentation that provides an introduction to some of the capabilities of the GATK and its application to the 1000 Genomes project.
History
The massive size of the data generated by next-generation DNA sequencing projects, such as the 1000 Genomes Project, whose pilot data freeze alone includes nearly five terabases of sequence, requires sophisticated data management, from data storage to analysis. The complexity of managing access to such enormous amounts of data usually results in feature-poor, inefficient, and brittle analysis tools. Worse, many of the first-line users of next-generation sequencing data are professional biologists whose ability to analyze their own data and answer their scientific questions is severely hampered by the complexity of accessing and manipulating the data files produced by these machines. A key challenge shared by all users of next-generation sequencing data, from advanced algorithm developers at large-scale sequencing centers to first-year graduate students working with an Illumina instrument for the first time, is how to easily write tools that efficiently analyze their data.
The Genome Analysis Toolkit (GATK) is a structured programming framework designed to enable rapid development of efficient and robust analysis tools for next-generation DNA sequencers. The GATK solves the data management challenge by separating data access patterns from analysis algorithms, using the functional programming philosophy of Map/Reduce (see Figure). Consequently, the GATK is structured into data traversals and data walkers that interact through a programming contract in which the traversal provides a series of units of data to the walker, and the walker consumes each datum to generate an output for it (see Figure). Because many tools that analyze next-generation sequencing data access the data in a very similar way, the GATK can provide a small but nearly comprehensive set of traversal types that satisfy the data access needs of the majority of analysis tools. For example, traversals “by each sequencer read” and “by every read covering each locus in a genome” are common to many tools, such as those for counting reads, building base quality histograms, reporting average coverage of the genome, and calling SNPs. Because these few traversals are shared among many tools, the core GATK development team can optimize them for correctness, stability, CPU performance, and memory footprint, and in many cases can even parallelize calculations automatically. Moreover, since the traversal engine encapsulates the complexity of efficiently accessing the next-generation sequencing data, researchers and developers are free to focus on their specific analysis algorithms (see Figure). This not only vastly improves the productivity of developers, who can quickly write new analyses, but also results in tools that are efficient and robust and that benefit from improvements to a common data management engine.
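The traversal/walker contract described above can be illustrated with a small sketch. This is not the GATK API (the GATK itself is a Java framework); the names `Read`, `traverse_reads`, `traverse_loci`, and `run_walker` are hypothetical and exist only to show how a traversal supplies units of data while a walker supplies the map and reduce steps.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Read:
    """Toy aligned read (hypothetical type): 0-based start position and bases."""
    start: int
    bases: str


def traverse_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """Traversal 'by each sequencer read': yields one read at a time."""
    yield from reads


def traverse_loci(reads: Iterable[Read]) -> Iterator[Tuple[int, List[Read]]]:
    """Traversal 'by every read covering each locus': yields (position, reads)."""
    pileup = defaultdict(list)
    for read in reads:
        for pos in range(read.start, read.start + len(read.bases)):
            pileup[pos].append(read)
    for pos in sorted(pileup):
        yield pos, pileup[pos]


def run_walker(data, map_fn, reduce_fn, initial):
    """Engine side of the contract: feed each datum to map, fold results with reduce."""
    total = initial
    for datum in data:
        total = reduce_fn(total, map_fn(datum))
    return total


reads = [Read(0, "ACGT"), Read(2, "GTTA"), Read(5, "AC")]

# Walker 1: count reads (map emits 1 per read, reduce sums).
read_count = run_walker(traverse_reads(reads),
                        map_fn=lambda read: 1,
                        reduce_fn=lambda acc, x: acc + x,
                        initial=0)

# Walker 2: average coverage over covered loci
# (map emits (depth, 1) per locus, reduce sums both components).
depth_sum, n_loci = run_walker(traverse_loci(reads),
                               map_fn=lambda locus: (len(locus[1]), 1),
                               reduce_fn=lambda acc, x: (acc[0] + x[0], acc[1] + x[1]),
                               initial=(0, 0))
mean_coverage = depth_sum / n_loci
```

Both walkers reuse the same `run_walker` engine and differ only in their map and reduce functions, which is the separation of data access from analysis logic that the paragraph above describes. In this toy version the locus traversal builds the whole pileup in memory; a real engine would stream over sorted reads instead.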
GATK Tools Documentation
Detailed documentation of every GATK tool: [1]
Using the GATK
- Introduction
- Prerequisites
- Downloading the GATK
- Signing up for email notifications whenever there's a new GATK release
- Input files for the GATK
- GATK resource bundle -- a collection of standard files for working with human resequencing data with the GATK
- Running the GATK for the first time
- Common problems when running the GATK
- GATKdocs
General GATK Arguments and Features
- Examples from basic walkers
- Built-in command-line arguments
- JEXL expressions for selecting subsets of VCF records
Supported GATK Tools
Variant Detection
- Best Practice Variant Detection with the GATK v3 -- Please note that as we improve our tools and methods, many of the steps are likely to change.
Quality Control and Simple Analysis Tools
- GC Content -- report the GC content of the reference or by specified interval.
- Count Loci -- counts the number of loci in the BAM file
- Count Pairs -- counts the number of read pairs encountered in a BAM file.
- Count Reads -- counts the number of reads in the BAM file
- Print Reads -- prints reads from a BAM file
BAM Processing and Analysis Tools
- Local realignment around indels -- correct alignment errors due to indels
- Base quality score recalibration -- correct inaccurate base quality scores
- Read Clipping
- Depth of Coverage v3.0 -- how much data do I have?
- Callable Loci Walker
- Data Processing Pipeline -- A Queue script for NGS data processing. It runs the tools needed to prepare BAM files for GATK analysis and is intended as an instructional example.
- PacBio Data Processing Guidelines -- Guidelines to process data from Pacific Biosciences RS using the GATK
Variant Discovery Tools
- Unified Genotyper -- discover SNPs and indels and assign genotypes
- Variant quality score recalibration -- determine the reliability of individual SNP call sites, in a probabilistic framework
- Interface with BEAGLE imputation software
- Read-backed phasing algorithm -- uses the sequence reads to perform physical phasing of the variants
- Phasing By Transmission -- phase variants based on family transmission patterns
- Merging batched call sets
- Smart Merging of Batched Calls -- Deprecated
Cancer-specific Variant Discovery Tools
- MuTect -- Somatic mutation SNP caller from the Broad's Cancer group
- Somatic Indel Detector -- previously called 'Indel Genotyper V2.0'
Variant Evaluation and Manipulation Tools
- VariantEval -- calculate general purpose metrics like percent in dbSNP, genotype concordance, Ti/Tv ratios, etc.
- Variant Annotator -- richly annotate variant calls
- Variant Filtration -- filter calls based on annotations
- Combine Variants -- Combines VCF files
- Select Variants -- Generates a new VCF file containing the selected variants from an input VCF file
- VariantsToVCF -- convert variant calls to VCF format
- VariantsToTable -- convert variant calls to a tabular format
- Left-Align Indels -- left-align indels within VCF files
Validation Utilities
- Validate Variants -- strict validation of the info within VCF files
- Creating Variant Validation Sets -- Using the GATK to randomly select variants for validation
- Creating Amplicon Sequences -- Using the GATK to design Sequenom probe files
- Variant Validation Assessor -- Assessing the output of a validation exercise
- Converting ped to vcf
Companion Utilities
- Creating a Fasta Reference
- Creating a Fasta Alternate Reference (incorporating variants into the reference)
- ReorderSam -- Reorder a BAM file to match the contig order in another reference fasta
- ReplaceReadGroups -- Adds or modifies read groups in a BAM file
Miscellaneous Experimental (and Potentially Unstable) Tools
- Sting BWA/C bindings
- Genotype and Validate -- genotypes a dataset and validates the calls of another dataset using the unified genotyper.
- liftOverVCF.pl
- HLA caller algorithm -- determine HLA types at class I (A, B, C) and class II (DRB1, DQA1, DQB1, DPA1, DPB1) loci.
Tools in this class are shared with the external world but aren't supported in any way by GSA. They may change or be removed entirely without notice, as they represent exploratory or research projects. They are included here because some users, including intra-Broad users and GSA members, may be experimenting with them. Please be understanding if you have issues with these tools.
If you are looking for historical information see Deprecated walkers.
Queue and the GATK-Pipeline
At the Broad Institute the GSA team runs a production-scale NGS data processing pipeline using Queue:
- Queue -- the GATK companion pipeline execution engine
- GSA QC Methodology.
- How to run the FullCallingPipeline.q script on your own to duplicate our standard pipeline methodology.
Getting help
If you are having any problems with the GATK, the first thing to do is to make sure that your input files adhere to our official policy. See here for details.
We support the GATK via a Get Satisfaction forum at http://getsatisfaction.com/gsa. This is the place to go to search through previously asked questions and to post new ones; please do a quick search before posting a question as there's a good chance it has been asked before! More information about joining the GATK development and user community can be found at Providing feedback and getting help.
New to the GATK? Confused about input or output formats? Need to know what a particular error message means? See our Frequently Asked Questions page for the answers to some of the most common GATK-related questions.
GATK Development
- Developing your own tools on top of the GATK-Engine infrastructure
- GATK development process
- Collaborating with the GATK -- how to get your tools and patches in the official release of the GATK
- The first GATK presentation ever given -- July 2009
- 20-Line Lifesavers: Coding simple solutions in the GATK -- Sept. 2011
- Thanks to Kiran Garimella and the Eli Lilly team for creating and sharing this content!
- Using Git (For GSA/Broad developers only)
- Git FAQ (For GSA/Broad developers only)
Release Notes
History of notes for previous versions of the GATK release
The full development logs and individual changes to the public GATK are available at the GitHub GATK repository.
Miscellaneous information about the GATK
- GATK licensing
- Phone home
- Using JEXL expressions
- Understanding the Unified Genotyper's VCF files
- Parallelism and the GATK -- information on how to run GATK tools with shared memory and distributed parallelism
- Per-base alignment qualities (BAQ) in the GATK -- Implementation of Heng Li's Base Alignment Qualities
- VCF Streaming -- Highly experimental feature
- Managing User Input and the ROD system
Private Genome Analysis Toolkit
If you are a member of the Broad Institute, please see the Private Genome Analysis Toolkit Wiki for Broad-internal information about the GATK, from upcoming infrastructure to private tools. This wiki can also serve as a staging area for wiki entries intended for the public wiki but not yet released.
GATK publications and data sets
If you are using the GATK in your work, you should follow these guidelines when Citing the GATK in your publications.
- A framework for variation discovery and genotyping using next-generation DNA sequencing data -- April 2011
Please visit Data sets for a framework for variation discovery and genotyping using next-generation DNA sequencing data to obtain the raw NGS data sets and the derived VCF call sets
