The Genome Analysis Toolkit

From GSA

Introduction

What is the GATK?

The GATK is two things. First, it is a structured software library that makes it easy to write efficient analysis tools for next-generation sequencing data. Second, it is a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include depth of coverage analyzers, a quality score recalibrator, a SNP/indel caller, and a local realigner.

We aim to work well with both samtools and Picard by providing tools that complement those available in the two packages. Our SNP calling pipeline (quality score recalibration -> multiple sequence realignment -> SNP/indel calling) is a particular area of focus, and we have been pushing to make these capabilities as general-purpose and powerful as possible. Our group's mandate is to ensure the success of the human medical resequencing projects we've undertaken at the Broad over the next 2-3 years. That involves providing a robust, production-quality development library that underlies tools for common analysis problems (like SNP calling), as well as enabling exploratory research on NGS data.

Take a look at File:CBBO 100709 v3.pptx.pdf to view a presentation that provides an introduction to some of the capabilities of the GATK and its application to the 1000 Genomes project.

GATK Map/Reduce framework


History

The massive volume of data generated by next-generation DNA sequencing projects, such as the 1000 Genomes Project, whose pilot data freeze alone includes nearly five terabases of sequence, requires sophisticated data management, from data storage to analysis. The complexity of managing access to such enormous amounts of data usually results in feature-poor, inefficient, and brittle analysis tools. Worse, many of the first-line users of next-generation sequencing data are professional biologists whose ability to analyze their own data and answer their scientific questions is severely hampered by the complexity of accessing and manipulating the data files produced by these machines. A key challenge shared by all users of next-generation sequencing data, from advanced algorithm developers at large-scale sequencing centers to first-year graduate students working with an Illumina instrument for the first time, is how to easily write tools that analyze their data efficiently.

Programming against the GATK

The Genome Analysis Toolkit (GATK) is a structured programming framework designed to enable rapid development of efficient and robust analysis tools for next-generation DNA sequencers. The GATK solves the data management challenge by separating data access patterns from analysis algorithms, using the functional programming philosophy of Map/Reduce (see Figure). Consequently, the GATK is structured into data traversals and data walkers that interact through a programming contract: the traversal provides a series of units of data to the walker, and the walker consumes each datum and generates an output for it (see Figure). Because many tools that analyze next-generation sequencing data access the data in very similar ways, the GATK can provide a small but nearly comprehensive set of traversal types that satisfy the data access needs of the majority of analysis tools. For example, traversals “by each sequencer read” and “by every read covering each locus in a genome” are common to many tools, such as those that count reads, build base quality histograms, report average coverage of the genome, and call SNPs. Because these few traversals are shared among many tools, the core GATK development team can optimize them for correctness, stability, CPU performance, and memory footprint, and in many cases can even parallelize calculations automatically. Moreover, since the traversal engine encapsulates the complexity of efficiently accessing the next-generation sequencing data, researchers and developers are free to focus on their specific analysis algorithms (see Figure). This not only vastly improves the productivity of developers, who can quickly write new analyses, but also results in tools that are efficient and robust and that benefit from improvements to a common data management engine.
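The traversal/walker contract described above can be illustrated with a small toy model. The sketch below is written in Python for brevity and is not the GATK API: every name in it (`Read`, `Walker`, `reduce_init`, `traverse_reads`, `traverse_loci`, `run`) is a hypothetical stand-in chosen for illustration, and the in-memory list of reads replaces the real work of reading sequencing files. It shows two traversals ("by each read" and "by every read covering each locus") feeding two walkers that contain only analysis logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Toy aligned read (hypothetical type): 0-based start and base string."""
    start: int
    bases: str


class Walker:
    """Analysis logic only: map one datum, then fold the result into a total."""

    def reduce_init(self):
        raise NotImplementedError

    def map(self, datum):
        raise NotImplementedError

    def reduce(self, value, accumulator):
        raise NotImplementedError


class CountReadsWalker(Walker):
    """Counts the reads it is shown; used with the by-read traversal."""

    def reduce_init(self):
        return 0

    def map(self, read):
        return 1

    def reduce(self, value, accumulator):
        return accumulator + value


class CoverageWalker(Walker):
    """Accumulates (total depth, number of loci); used with the by-locus traversal."""

    def reduce_init(self):
        return (0, 0)

    def map(self, locus):
        _position, covering_reads = locus
        return len(covering_reads)

    def reduce(self, value, accumulator):
        total_depth, loci = accumulator
        return (total_depth + value, loci + 1)


def traverse_reads(reads):
    """Traversal 'by each sequencer read'."""
    yield from reads


def traverse_loci(reads, genome_length):
    """Traversal 'by every read covering each locus' (naive scan, for clarity only)."""
    for position in range(genome_length):
        covering = [r for r in reads if r.start <= position < r.start + len(r.bases)]
        yield position, covering


def run(walker, traversal):
    """The engine side of the contract: feed each datum to the walker and fold."""
    accumulator = walker.reduce_init()
    for datum in traversal:
        accumulator = walker.reduce(walker.map(datum), accumulator)
    return accumulator


if __name__ == "__main__":
    reads = [Read(0, "ACGT"), Read(2, "GTAC"), Read(5, "TT")]
    print("reads:", run(CountReadsWalker(), traverse_reads(reads)))
    depth, loci = run(CoverageWalker(), traverse_loci(reads, 8))
    print("mean coverage:", depth / loci)
```

In this toy model the walkers never touch the data source, so the same `run` loop could, in principle, be swapped for a more efficient or parallel one without changing any walker, which is the separation the paragraph above describes.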

GATK Tools Documentation

Detailed documentation of every GATK tool: [1]


Using the GATK


General GATK Arguments and Features


Supported GATK Tools

Variant Detection

Quality Control and Simple Analysis Tools

  • GC Content -- reports the GC content of the reference or of a specified interval.
  • Count Loci -- counts the number of loci in a BAM file.
  • Count Pairs -- counts the number of read pairs encountered in a BAM file.
  • Count Reads -- counts the number of reads in a BAM file.
  • Print Reads -- prints reads from a BAM file.

BAM Processing and Analysis Tools

Variant Discovery Tools

Cancer-specific Variant Discovery Tools

Variant Evaluation and Manipulation Tools

Validation Utilities

Companion Utilities

Miscellaneous Experimental (and Potentially Unstable) Tools

Tools in this class are shared with the external world but aren't supported in any way by GSA. These tools may change or be removed entirely without notice, as they represent exploratory or research projects. They are included here because some users, including intra-Broad users and GSA members, may be experimenting with them. Please be understanding if you have issues with these tools.

If you are looking for historical information, see Deprecated walkers.

Queue and the GATK-Pipeline

At the Broad Institute the GSA team runs a production-scale NGS data processing pipeline using Queue:

Getting help

If you are having any problems with the GATK, the first thing to do is to make sure that your input files adhere to our official policy. See here for details.

We support the GATK via a Get Satisfaction forum at http://getsatisfaction.com/gsa. This is the place to go to search through previously asked questions and to post new ones; please do a quick search before posting a question as there's a good chance it has been asked before! More information about joining the GATK development and user community can be found at Providing feedback and getting help.

New to the GATK? Confused about input or output formats? Need to know what a particular error message means? See our Frequently Asked Questions page for the answers to some of the most common GATK-related questions.


GATK Development

Release Notes

History of notes for previous versions of the GATK release

The full development logs and individual changes to the public GATK are available at the GitHub GATK repository.


Miscellaneous information about the GATK

Private Genome Analysis Toolkit

If you are a member of the Broad Institute, please see the Private Genome Analysis Toolkit Wiki for Broad-internal information about the GATK, from upcoming infrastructure to private tools. This wiki can also serve as a staging area for wiki entries intended for the public wiki but not yet released.


GATK publications and data sets

If you are using the GATK in your work, you should follow these guidelines when Citing the GATK in your publications.

Please visit Data sets for a framework for variation discovery and genotyping using next-generation DNA sequencing data to obtain the raw NGS data sets and the derived VCF call sets.
