The Genome Analysis Toolkit

What is the GATK?

The GATK is two things. First, it is a structured software library that makes it easy to write efficient analysis tools for next-generation sequencing data. Second, it is a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include depth of coverage analyzers, a quality score recalibrator, a SNP/indel caller, and a local realigner.

We aim to work well with both samtools and Picard by providing tools that complement those available in the two packages. Our SNP calling pipeline (Q score recalibration -> multiple sequence realignment -> SNP/indel calling) is a particular area of focus, and we have been pushing to make these capabilities as general-purpose and powerful as possible. Our group's mandate is to ensure the success of the human medical resequencing projects we've undertaken at the Broad over the next 2-3 years. This involves providing a robust, production-quality development library that underlies tools for common analysis problems (like SNP calling), as well as enabling exploratory research on NGS data.

See File:CBBO 100709 v3.pptx.pdf for a presentation that introduces some of the capabilities of the GATK and its application to the 1000 Genomes project.

GATK Map/Reduce framework

Introduction

The massive size of the data generated by next-generation DNA sequencing projects, such as the 1000 Genomes Project, whose pilot data freeze alone includes nearly five terabases of sequence, requires sophisticated data management, from storage to analysis. The complexity of managing access to such enormous amounts of data usually leads developers to produce feature-poor, inefficient, and brittle analysis tools. Worse, many of the first-line users of next-generation sequencing data are professional biologists whose ability to analyze their own data and answer their scientific questions is severely hampered by the complexity of accessing and manipulating the data files produced by these machines. A key challenge shared by all users of next-generation sequencing data, from advanced algorithm developers at large-scale sequencing centers to first-year graduate students working with an Illumina instrument for the first time, is how to easily write tools that analyze their data efficiently.

Programming against the GATK

The Genome Analysis Toolkit (GATK) is a structured programming framework designed to enable rapid development of efficient and robust analysis tools for next-generation DNA sequencers. The GATK solves the data management challenge by separating data access patterns from analysis algorithms, using the functional programming philosophy of Map/Reduce (see Figure). Consequently, the GATK is structured into data traversals and data walkers that interact through a programming contract: the traversal provides a series of units of data to the walker, and the walker consumes each datum and generates an output for it (see Figure). Because many tools that analyze next-generation sequencing data access the data in a very similar way, the GATK can provide a small but nearly comprehensive set of traversal types that satisfy the data access needs of the majority of analysis tools. For example, traversals “by each sequencer read” and “by every read covering each locus in a genome” are common to many tools, such as those that count reads, build base quality histograms, report average coverage of the genome, and call SNPs. Because this small number of traversals is shared among many tools, the core GATK development team can optimize them for correctness, stability, CPU performance, and memory footprint, and in many cases can even parallelize calculations automatically. Moreover, since the traversal engine encapsulates the complexity of efficiently accessing next-generation sequencing data, researchers and developers are free to focus on their specific analysis algorithms (see Figure). This not only vastly improves the productivity of developers, who can quickly write new analyses, but also results in tools that are efficient and robust and that benefit from every improvement to the common data management engine.
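
The traversal/walker contract described above can be sketched in a few lines of Java. This is a minimal illustration only, not the GATK API: the names Walker, Read, ReadTraversal, ReadCounter, BaseCounter and WalkerSketch are hypothetical, the reads are a hard-coded in-memory list, and a real traversal engine would stream reads from sequencing data files. What it tries to show is the division of labor: the traversal owns the iteration over the data, and each walker supplies only a per-datum map step and a reduce step that combines the results.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical walker contract: D = datum type, M = map result, R = reduce result. */
interface Walker<D, M, R> {
    M map(D datum);            // computation applied to a single datum
    R reduceInit();            // starting value of the accumulator
    R reduce(M value, R sum);  // fold one map result into the accumulator
}

/** Hypothetical stand-in for a sequencer read: an alignment start and its bases. */
final class Read {
    final int start;
    final String bases;

    Read(int start, String bases) {
        this.start = start;
        this.bases = bases;
    }
}

/** A "by each sequencer read" traversal; all data access lives here, not in the walker. */
final class ReadTraversal {
    static <M, R> R traverse(List<Read> reads, Walker<Read, M, R> walker) {
        R sum = walker.reduceInit();
        for (Read read : reads) {
            sum = walker.reduce(walker.map(read), sum);
        }
        return sum;
    }
}

/** Counts reads: map emits 1 per read, reduce adds the ones up. */
final class ReadCounter implements Walker<Read, Integer, Long> {
    public Integer map(Read read) { return 1; }
    public Long reduceInit() { return 0L; }
    public Long reduce(Integer value, Long sum) { return sum + value; }
}

/** Counts sequenced bases: map emits the read length, reduce adds the lengths up. */
final class BaseCounter implements Walker<Read, Integer, Long> {
    public Integer map(Read read) { return read.bases.length(); }
    public Long reduceInit() { return 0L; }
    public Long reduce(Integer value, Long sum) { return sum + value; }
}

final class WalkerSketch {
    /** Three made-up reads with lengths 4, 4 and 2. */
    static List<Read> exampleReads() {
        return Arrays.asList(
                new Read(1, "ACGT"),
                new Read(2, "CGTA"),
                new Read(4, "TA"));
    }

    public static void main(String[] args) {
        List<Read> reads = exampleReads();
        // The same traversal serves both analyses; only the walker changes.
        System.out.println("reads = " + ReadTraversal.traverse(reads, new ReadCounter()));
        System.out.println("bases = " + ReadTraversal.traverse(reads, new BaseCounter()));
    }
}
```

Both walkers are a handful of lines because neither knows how the reads are stored or iterated. In this sketch, any improvement to ReadTraversal.traverse (for example, splitting the list into chunks and combining partial sums) would benefit every walker without changing walker code, which is the property the paragraph above attributes to the shared traversal engine.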

Getting Started

Using the GATK

GATK Tools in 1.0 Release Version

Stable Early Access GATK Tools

Experimental GATK Tools (only available with full SVN checkout)

Validating variation using the GATK

Simple GATK scripts

GATK file formats

Programming in the GATK

Writing Walkers

Contributing to the GATK

Advanced features