Frequently Asked Questions
The following are some frequently asked questions regarding the GATK. Common data formatting and usage questions relevant to both new and long-time users are addressed here. Check here for the information on file formats, usage notes, error modes, etc.
General questions
What is the GATK?
The Genome Analysis Toolkit, or "GATK", is two things:
- It's a cross-platform application programming interface (API), written in Java, specifically designed for working with gargantuan (up to hundreds of terabytes) next-generation sequencing (NGS) datasets.
- It's a set of tools built upon that API for performing certain processing and analysis tasks on NGS data.
What is the scope of the GATK?
The scope of the GATK begins with resequencing datasets where the reads have already been base-called and aligned to the reference sequence (for a review of NGS methods and terminology, see #Where can I get more information about next-generation sequencing concepts and terms?).
The scope of the GATK roughly ends with the generation of a highly sensitive and specific set of SNP and indel calls, ready for downstream analysis. These calls can be generated for a single sample or hundreds of samples considered by the calling algorithm simultaneously.
In between the limits of the scope are dozens of tools for processing the NGS data for optimal results. Our tools include:
- a method for recalibrating base quality scores
- a method for fixing mismapped reads proximal or spanning indels unrecognized by the original mapping software
- a method for calling SNPs on a single sample or multiple samples considered simultaneously
- a method for calling indels on a single sample
- Tools for subsetting and combining variant callsets
- a tool for evaluating read coverage over the genome in a variety of ways
- a tool for assessing the quality of a variant callset and its concordance to other callsets
- Much more...
How does the GATK handle these huge NGS datasets?
Imagine a simple question like, "What's the depth of coverage at position A of the genome?" You are now given billions of reads that are aligned to the genome but not ordered in any particular way (except perhaps in the order they were emitted by the sequencer). This simple question then becomes very difficult to answer efficiently, because the algorithm is forced to examine every single read in succession, since any one of them might span position A. The algorithm must now take several hours in order to compute this value.
Instead, imagine the billions of reads are now sorted in reference order (that is to say, on each chromosome, the reads are stored on disk in the same order they appear on the chromosome). Now, answering the question above is trivial, as the algorithm can jump to the desired location, examine only the reads that span the position, and return immediately after those reads (and only those reads) are inspected. The total number of reads that need to be interrogated is only a handful, rather than several billion, and the processing time is seconds, not hours.
This reference-ordered sorting enables the GATK to process terabytes of data quickly and without tremendous memory overhead. Most GATK tools run very quickly and in less than 2 gigabytes of RAM. Without this sorting, the GATK cannot operate correctly. Thus, it is a fundamental rule of working with the GATK, bringing us to our next question:
What is the Central Dogma of the GATK?
The Central Dogma of the GATK is as follows:
All datasets (reads, alignments, quality scores, variants, dbSNP information, gene tracks, interval lists - everything) must be sorted in order of one of the canonical references sequences.
What versions of Java does the GATK support?
Officially the GATK currently supports only Java 1.6, however as of April 2012 we've committed ourselves to testing compatibility with Java 1.7 as well before each major release. By the end of 2012 we expect that Java 1.7 will replace 1.6 as the version of Java we officially support.
At this time we only support the official Sun/Oracle Java JDKs; alternate Java implementations such as OpenJDK are not supported.
See the Prerequisites page for more information.
File format questions
Sequencer output
What file formats do you support for sequencer output?
The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.
How do I get my data into BAM format?
The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.
What are the formatting requirements for my BAM file(s)?
All BAM files must satisfy the following requirements:
- It must be aligned to one of the references mentioned on the wiki
- It must be sorted in coordinate order (not by queryname and not "unsorted")
- It must list the read groups with sample names in the header
- Every read must belong to a read group
- The BAM file must pass Picard validation.
See the BAM specification for more information.
What is the canonical ordering of human reference contigs in a BAM file?
It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is listed here.
How can I tell if my BAM file is sorted properly?
The easiest way to do it is to download Samtools and run the following command to examine the header of your file:
$ samtools view -H /path/to/my.bam @HD VN:1.0 GO:none SO:coordinate @SQ SN:1 LN:247249719 @SQ SN:2 LN:242951149 @SQ SN:3 LN:199501827 @SQ SN:4 LN:191273063 @SQ SN:5 LN:180857866 @SQ SN:6 LN:170899992 @SQ SN:7 LN:158821424 @SQ SN:8 LN:146274826 @SQ SN:9 LN:140273252 @SQ SN:10 LN:135374737 @SQ SN:11 LN:134452384 @SQ SN:12 LN:132349534 @SQ SN:13 LN:114142980 @SQ SN:14 LN:106368585 @SQ SN:15 LN:100338915 @SQ SN:16 LN:88827254 @SQ SN:17 LN:78774742 @SQ SN:18 LN:76117153 @SQ SN:19 LN:63811651 @SQ SN:20 LN:62435964 @SQ SN:21 LN:46944323 @SQ SN:22 LN:49691432 @SQ SN:X LN:154913754 @SQ SN:Y LN:57772954 @SQ SN:MT LN:16571 @SQ SN:NT_113887 LN:3994 ...
If the order of the contigs here matches the canonical ordering mentioned on our wiki, and the 'SO:coordinate' flag appears in your header, then your contig and read ordering satisfies the GATK requirements.
My BAM file isn't sorted that way. How can I fix it?
Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.
How can I tell if my BAM file has read group and sample information?
A quick Unix command using Samtools will do the trick:
$ samtools view -H /path/to/my.bam | grep '^@RG' @RG ID:0 PL:solid PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414 CN:bcm @RG ID:1 PL:solid PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414 CN:bcm @RG ID:2 PL:LS454 PU:R_2008_10_02_06_06_12_FLX01080312_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC @RG ID:3 PL:LS454 PU:R_2008_10_02_06_07_08_rig19_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC @RG ID:4 PL:LS454 PU:R_2008_10_02_17_50_32_FLX03080339_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC ...
The presence of the '@RG' tags indicate the presence of read groups. Each read group has a "SM" tag, indicating the sample from which the reads belonging to that read group originate.
In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,
$ samtools view /path/to/my.bam | grep '^@RG' EAS139_44:2:61:681:18781 35 1 1 0 51M = 9 59 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 EAS139_44:7:84:1300:7601 35 1 1 0 51M = 12 62 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3 MF:i:18 Aq:i:0 NM:i:1 UQ:i:5 H0:i:0 H1:i:85 EAS139_44:8:59:118:13881 35 1 1 0 51M = 2 52 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 EAS139_46:3:75:1326:2391 35 1 1 0 51M = 12 62 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31 ...
membership in a read group is specified by the 'RG:Z:*' tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 2 (sample NA12414).
My BAM file doesn't have read group and sample information. Do I really need it?
Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.
What's the meaning of the standard read group fields
For technical details, see the sam specification on the Samtools website.
| Tag in SAM spec | Importance | SAM spec definition | Meaning |
|---|---|---|---|
| ID | Required | Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read group
IDs may be modified when merging SAM files in order to handle collisions. |
Ideally, this should be a globally unique identify across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the RG:Z field, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model. |
| SM | Sample. Use pool name where a pool is being sequenced. | Required. As important as ID. | The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper. |
| PL | Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO. | Important. Not currently used in the GATK, but was in the past, and may return. The only way to known the sequencing technology used to generate the sequencing data | It's a good idea to use this field |
| LB | DNA preparation library identify | Essential for MarkDuplicates | MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes. |
We do not require value for the CN, DS, DT, PG, PI, or PU fields.
A concrete example may be instructive. Support I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an illumina hiseq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:
Dad's data: @RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200 @RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 @RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400 Mom's data: @RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200 @RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 @RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400 Kid's data: @RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200 @RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400 @RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library)
My BAM file doesn't have read group and sample information. How do I add it?
See this post on our help forum for information on how to add read group information to a file where it is absent.
How do I know if my BAM file is valid?
Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing "STRICT" validation stringency work best with the GATK.
What's the best way to create a subset of my BAM file containing only reads over a small interval?
You can use the GATK to
GATK -I full.bam -T PrintReads -L chr1:10-20 -o subset.bam
and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK
Variant callsets
What file formats do you support for variant callsets?
We support the Variant Call Format (VCF) for variant callsets. No other file formats are supported.
How do I know if my VCF file is valid?
VCFTools contains a validation tool that's quite helpful in this regard.
Interval lists
What file formats do you support for interval lists?
We support three types of interval lists, as mentioned on the Input_files_for_the_GATK#Intervals page. Interval lists should preferentially be formatted as Picard-style interval lists, with an explicit sequence dictionary, as this prevents accidental misuse (e.g. hg18 intervals on an hg19 file). Note that this file is 1-based, not 0-based (first position in the genome is position 1).
I have two (or more) sequencing experiments with different target intervals. How can I combine them?
One relatively easy way to combine your intervals is to use the online tool, Galaxy, using the "Get Data"->Upload command to upload your intervals, and the "Operate on Genomic Intervals" command to compute the intersection or union of your intervals (depending on your needs).
Getting more help
Where can I get more information about next-generation sequencing concepts and terms?
The following links should be help as a review or an introduction to concepts and terminology related to next-generation sequencing:
- DNA sequencing (Wikipedia) - Basic review of the sequencing process.
- Sequencing technologies, the next generation, (M. Metzker, Nature Reviews - Genetics) - an excellent, detailed overview of the myriad next-gen sequencing methdologies.
- Next-generation sequencing: adjusting to data overload (M. Baker, Nature Methods) - nice piece explaining the problems inherent in trying to analyze terabytes of data (the GATK addresses this issue by requiring all datasets be in reference order, so only small chunks of the genome need to be in memory at once).
Where can I get more help with the GATK tools?
We run a GetSatisfaction support forum where GATK core contributors and users of the toolkit ask and answer questions. This is always a good place to see what other people have asked and, if your question has been posted yet (please check!), for you to add a question.