The Genome Analysis Toolkit

What is the GATK?

The GATK is two things. First, it is a structured software library that makes it easy to write efficient analysis tools for next-generation sequencing data. Second, it is a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include depth of coverage analyzers, a quality score recalibrator, a SNP/indel caller, and a local realigner.

We aim to work well with both samtools and Picard by providing tools that complement those available in the two packages. Our SNP calling pipeline (Q score recalibration -> multiple sequence realignment -> SNP/indel calling) is a particular area of focus, and we have been pushing to make these capabilities as general-purpose and powerful as possible. Our group's mandate is to ensure the success of the human medical resequencing projects we've undertaken at the Broad over the next 2-3 years. This involves providing a robust, production-quality development library that underlies tools for common analysis problems (like SNP calling), as well as enabling exploratory research on NGS data.

Take a look at File:CBBO 100709 v3.pptx.pdf to view a presentation that provides an introduction to some of the capabilities of the GATK and its application to the 1000 Genomes project.

GATK Map/Reduce framework

Introduction

Next-generation DNA sequencing projects generate massive amounts of data: the pilot data freeze of the 1000 Genomes Project alone includes nearly five terabases of sequence. Data at this scale requires sophisticated management, from storage through to analysis. The complexity of managing access to such enormous amounts of data usually results in analysis tools that are feature-poor, inefficient, and brittle. Worse, many of the first-line users of next-generation sequencing data are professional biologists whose ability to analyze their own data to answer their scientific questions is severely hampered by the complexity of accessing and manipulating the data files produced by these machines. A key challenge shared by all users of next-generation sequencing, from advanced algorithm developers at large-scale sequencing centers to first-year graduate students working with an Illumina instrument for the first time, is how to easily write tools that efficiently analyze their sequencing data.

Programming against the GATK

The Genome Analysis Toolkit (GATK) is a structured programming framework designed to enable rapid development of efficient and robust analysis tools for next-generation DNA sequencers. The GATK solves the data management challenge by separating data access patterns from analysis algorithms, using the functional programming philosophy of Map/Reduce (see Figure). Consequently, the GATK is structured into data traversals and data walkers that interact through a programming contract: the traversal provides a series of units of data to the walker, and the walker consumes each datum to generate an output for that datum (see Figure). Because many tools that analyze next-generation sequencing data access the data in a very similar way, the GATK can provide a small but nearly comprehensive set of traversal types that satisfy the data access needs of the majority of analysis tools. For example, the traversals “by each sequencer read” and “by every read covering each locus in a genome” are common to many tools, such as those that count reads, build base quality histograms, report average coverage of the genome, and call SNPs. Because these few traversals are shared among many tools, the core GATK development team can optimize them for correctness, stability, CPU performance, and memory footprint, and in many cases can even parallelize calculations automatically. Moreover, since the traversal engine encapsulates the complexity of efficiently accessing the next-generation sequencing data, researchers and developers are free to focus on their specific analysis algorithms (see Figure). This not only vastly improves the productivity of developers, who can quickly write new analyses, but also results in tools that are efficient and robust and that benefit from improvements to a common data management engine.
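
To make the traversal/walker contract concrete, here is a minimal, self-contained sketch in Java. It is an illustration only and not the GATK API: the names `WalkerSketch`, `Read`, `Walker`, `traverse`, `pileups`, `CountReads`, `TotalDepth`, and `demoReads` are all hypothetical, the reads are toy data, and the "by locus" grouping is deliberately naive. The point is the separation of roles: `traverse` owns the iteration, while each walker only supplies a `map` step and a `reduce` step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WalkerSketch {

    /** Hypothetical aligned read: 0-based start position plus its bases. */
    public static final class Read {
        final int start;
        final String bases;
        Read(int start, String bases) { this.start = start; this.bases = bases; }
        boolean covers(int locus) { return locus >= start && locus < start + bases.length(); }
    }

    /** The contract: map each datum to a value, then fold the values together. */
    public interface Walker<D, M, R> {
        M map(D datum);
        R reduceInit();
        R reduce(M value, R sum);
    }

    /** Traversal engine: handles iteration and knows nothing about the analysis. */
    public static <D, M, R> R traverse(Iterable<D> data, Walker<D, M, R> walker) {
        R sum = walker.reduceInit();
        for (D datum : data) {
            sum = walker.reduce(walker.map(datum), sum);
        }
        return sum;
    }

    /** "By locus" view: for each position, the reads covering it (naive scan). */
    public static List<List<Read>> pileups(List<Read> reads, int genomeLength) {
        List<List<Read>> result = new ArrayList<>();
        for (int locus = 0; locus < genomeLength; locus++) {
            List<Read> pileup = new ArrayList<>();
            for (Read r : reads) {
                if (r.covers(locus)) pileup.add(r);
            }
            result.add(pileup);
        }
        return result;
    }

    /** Read walker: counts reads. */
    public static final class CountReads implements Walker<Read, Integer, Integer> {
        public Integer map(Read read) { return 1; }
        public Integer reduceInit() { return 0; }
        public Integer reduce(Integer value, Integer sum) { return sum + value; }
    }

    /** Locus walker: sums the depth (number of covering reads) over all loci. */
    public static final class TotalDepth implements Walker<List<Read>, Integer, Integer> {
        public Integer map(List<Read> pileup) { return pileup.size(); }
        public Integer reduceInit() { return 0; }
        public Integer reduce(Integer value, Integer sum) { return sum + value; }
    }

    /** Toy data: three short reads on a six-base "genome". */
    public static List<Read> demoReads() {
        return Arrays.asList(new Read(0, "ACGT"), new Read(2, "GTAC"), new Read(3, "TA"));
    }

    public static void main(String[] args) {
        List<Read> reads = demoReads();
        int genomeLength = 6;
        int readCount = traverse(reads, new CountReads());
        int totalDepth = traverse(pileups(reads, genomeLength), new TotalDepth());
        System.out.println("reads = " + readCount);
        System.out.println("mean coverage = " + (double) totalDepth / genomeLength);
    }
}
```

Both walkers run through the same `traverse` method; only the unit of data changes (single reads versus per-locus pileups). In a design like this, any optimization of the traversal, such as smarter pileup construction or parallel execution, would benefit every walker without changes to the analysis code.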

Before Using the GATK

Having trouble? Getting help

If you are having any problems with the GATK, the first thing to do is to make sure that your input files adhere to our official policy. See here for details.

We support the GATK via a Get Satisfaction forum at http://getsatisfaction.com/gsa. This is the place to go to search through previously asked questions and to post new ones; please do a quick search before posting a question as there's a good chance it has been asked before! More information about joining the GATK development and user community can be found at Providing feedback and getting help.

Getting Started

Using the GATK for Variant Detection

The following pages describe the current best practices for calling variants from various types of next-generation sequencing data. Please note that as we improve our tools and methods, many of the steps are likely to change.

GATK publications and news items

BAM Processing and Analysis Tools

Variant Discovery Tools

Variant Evaluation and Manipulation Tools

Sequenom Utilities

Fastq and Fasta Utilities

Miscellaneous Experimental (and Potentially Unstable) Tools

Tools in this class are shared with the external world but aren't supported in any way by GSA. They may change or be removed entirely without notice, as they represent exploratory or research projects. They are included here because some users, including intra-Broad users and GSA members, may be experimenting with them. Please be understanding if you have issues with these tools.

If you are looking for historical information see Deprecated walkers.

GATK file formats

See here.

Programming in the GATK

Writing Walkers

Contributing to the GATK

Advanced features

Simple GATK scripts
