This document explains the concepts involved and how they are applied within the GATK (and Queue where applicable). For specific configuration recommendations, see the companion document on parallelizing GATK tools.
Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (i.e. waiting for each operation to finish before starting the next one).
Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.
This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.
The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see doc for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.
Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.
Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.
OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?
Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.
Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:
open the file, count the number of lines in the file, tell us the number, close the fileNote that tell us the number can mean writing it to the console, or storing it somewhere for use later on.
Now let's say we want to know the number of words on each line. The set of instructions would be:
open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the numberAnd so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.
So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:
open the file, index the lines
read the first line, count the number of words, tell us the number
read the second line, count the number of words, tell us the numberread the third line, count the number of words, tell us the number[repeat for all lines]
collect final results and close the file
Here, the read the Nth line steps can be performed in parallel, because they are all independent operations.
You'll notice that we added a step, index the lines. That's a little bit of peliminary work that allows us to perform the read the Nth line steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.
Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.
There are three different modes of parallelism offered by the GATK, and to really understand the difference you first need to understand what are the different levels of computing that are involved.
By levels of computing, we mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster.
Core: the level below the machine. On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we follow the Latin terminology) but the CPUs on some professional-grade machines can have tens or hundreds of cores.
Machine: the middle of the scale. For most of us, the machine is the laptop or desktop computer. Really we should refer to the CPU specifically, since that's the relevant part that does the processing, but the most common usage is to say machine. Except if the machine is part of a cluster, in which case it's called a node.
Cluster: the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called nodes) networked together. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a server farm or a load-sharing facility.
Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of multi-threading at the core and machine levels, and scatter-gather at the cluster level.
In computing, a thread of execution is a set of instructions that the program issues to the processor to get work done. In single-threading mode, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In multi-threading mode, the program may send several threads to the processor at the same time.

Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results.
If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. He can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that.
Hey, if you have a better example, let us know in the forum and we'll use that instead.
Alright, now that you understand the idea of multithreading, let's get practical: how do we do get the GATK to use multi-threading?
There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively. They can be combined, since they act at different levels of computing:
-nt / --num_threads controls the number of data threads sent to the processor (acting at the machine level)
-nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread (acting at the core level).
Not all GATK tools can use these options due to the nature of the analyses that they perform and how they traverse the data. Even in the case of tools that are used sequentially to perform a multi-step process, the individual tools may not support the same options. For example, at time of writing (Dec. 2012), of the tools involved in local realignment around indels, RealignerTargetCreator supports -nt but not -nct, while IndelRealigner does not support either of these options.
In addition, there are some important technical details that affect how these options can be used with optimal results. Those are explained along with specific recommendations for the main GATK tools in a companion document on parallelizing the GATK.
If you Google it, you'll find that the term scatter-gather can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.
So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?
As you know by now, multithreading specifically refers to what happens internally when the program (in our case, the GATK) sends several sets of instructions to the processor to achieve the instructions that you originally gave it in a single command-line. In contrast, the scatter-gather strategy as used by the GATK involves a separate program, called Queue, which generates separate GATK jobs (each with its own command-line) to achieve the instructions given in a so-called Qscript (i.e. a script written for Queue in a programming language called Scala).

At the simplest level, the Qscript can involve a single GATK tool*. In that case Queue will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, Queue will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).
Note that Queue has additional capabilities, such as managing the use of multiple GATK tools in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. To learn more about pipelining the GATK with Queue, please see the Queue documentation.
So you see, scatter-gather is a very different process from multi-threading because the parallelization happens outside of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But Queue can dispatch scattered GATK jobs to different machines in a computing cluster by interfacing with your cluster's job management software.
That being said, multithreading has the great advantage that cores and machines all have access to shared machine memory with very high bandwidth capacity. In contrast, the multiple machines on a network used for scatter-gather are fundamentally limited by network costs.
The good news is that you can combine scatter-gather and multithreading: use Queue to scatter GATK jobs to different nodes on your cluster, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.
Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.
Our current best practice for making SNP and indel calls is divided into four sequential steps: initial mapping, refinement of the initial reads, multi-sample indel and SNP calling, and finally variant quality score recalibration. These steps are the same for targeted resequencing, whole exomes, deep whole genomes, and low-pass whole genomes.

Example commands for each tool are available on the individual tool's wiki entry. There is also a list of which resource files to use with which tool.
Note that due to the specific attributes of a project the specific values used in each of the commands may need to be selected/modified by the analyst. Care should be taken by the analyst running our tools to understand what each parameter does and to evaluate which value best fits the data and project design.
There are four major organizational units for next-generation DNA sequencing processes that used throughout this documentation:
This document describes how to call variation within a single analysis cohort, comprised for one or many samples, each of one or many libraries that were sequenced on at least one lane of an NGS machine.
Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost by running these commands across all of your data simultaneously.
In order to help individuals get up to speed, evaluate their command lines, and generally become familiar with the GATK tools we recommend you download the raw and realigned, recalibrated NA12878 test data from the GATK resource bundle. It should be possible to apply all of the approaches outlined below to get excellent results for realignment, recalibration, SNP calling, indel calling, filtering, and variant quality score recalibration using this data.
In our GATK 2.0 slide archive.
The GATK data processing pipeline assumes that one of the many NGS read aligners (see this review) has been applied to your raw FASTQ files. For Illumina data we recommend BWA because it is accurate, fast, well-supported, open-source, and emits SAM files natively (which are then easy to convert to BAM).
The three key processes used here are:
There are several options here from the easy and fast basic protocol to the more comprehensive but computationally expensive pipeline. For example, there are two types of realignment which constitute a vastly different amount of processing power required:
This protocol uses lane-level local realignment around known indels (very fast, as there's no sample level processing) to clean up lane-level alignments. This results in better quality scores, as they are less biased for indel alignment artefacts.
for each lane.bam
dedup.bam <- MarkDuplicate(lane.bam)
realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
recal.bam <- recal(realigned.bam)
Here we are essentially just merging the recalibrated lane.bams for a sample, dedupping the reads, and calling it quite. It doesn't perform indel realignment across lanes, so it leaves in some indels artifacts. For humans, which now have an extensive list of indels (get them from the GATK bundle!) the lane-level realignment around known indels is going to make up for the lack of cross-lane realignment. This protocol is appropriate if you are going to use callers like the HaplotypeCaller, UnifiedGenotyper with BAQ, or samtools with BAQ that are less sensitive to the initial alignment of reads, or if your project has limited coverage per sample (< 8x) where per-sample indel realignment isn't more empowered than per-lane realignment. For other situations or for organisms with limited database of segregating indels, it's better to use the advanced protocol if you have deep enough data per sample.
for each sample
recals.bam <- merged lane-level recal.bams for sample
dedup.bam <- MarkDuplicates(recals.bam)
sample.bam <- dedup.bam
As with the basic protocol, this protocol assumes the per-lane processing has been already completed. This protocol is essentially the basic protocol but with per-sample indel realignment.
for each sample
recals.bam <- merged lane-level recal.bams for sample
dedup.bam <- MarkDuplicates(recals.bam)
realigned.bam <- realign(dedup.bam) [with known sites included if available]
sample.bam <- realigned.bam
This is the protocol we use at the Broad in our fully automated pipeline because it gives an optimal balance of performance, accuracy and convenience.
Rather than doing the lane level cleaning and recalibration, this process aggregates all of the reads for each sample and then does a full dedupping, realign, and recalibration, yielding the best single-sample results. The big change here is sample-level cleaning followed by recalibration, giving you the most accurate quality scores possible for a single sample.
for each sample
lanes.bam <- merged lane.bams for sample
dedup.bam <- MarkDuplicates(lanes.bam)
realigned.bam <- realign(dedup.bam) [with known sites included if available]
recal.bam <- recal(realigned.bam)
sample.bam <- recal.bam
This protocol can be hard to implement in practice unless you can afford to wait until all of the data is available to do data processing for your samples.
ReduceReads is a novel (perhaps even breakthrough?) GATK 2.0 data compression algorithm. The purpose of ReducedReads is to take a BAM file with NGS data and reduce it down to just the information necessary to make accurate SNP and indel calls, as well as genotype reference sites (hard to achieve) using GATK tools like UnifiedGenotyper or HaplotypeCaller. ReduceReads accepts as an input a BAM file and produces a valid BAM file (it works in IGV!) but with a few extra tags that the GATK can use to make accurate calls.
You can find more information about reduced reads in some of our presentations in the archive.
ReduceReads works well for exomes or high-coverage (at least 20x average coverage) whole genome BAM files. In this case we highly recommend using ReduceReads to minimize the file sizes. Note that ReduceReads performs a lossy compression of the sequencing data that works well with the downstream GATK tools, but may not be supported by external tools. Also, we recommend that you archive your original BAM file, or at least a copy of your original FASTQs, as ReduceReads is highly lossy and doesn't quality as an archive data compression format.
Using ReduceReads on your BAM files will cut down the sizes to approximately 1/100 of their original sizes, allowing the GATK to process tens of thousands of samples simultaneously without excessive IO and processing burdens. Even for single samples ReduceReads cuts the memory requirements, IO burden, and CPU costs of downstream tools significantly (10x or more) and so we recommend you preprocess analysis-ready BAM files with ReducedReads.
for each sample
sample.reduced.bam <- ReduceReads(sample.bam)
After the raw data processing step, the GATK variant detection process assumes that you have aligned, duplicate marked, and recalibrated BAM files for all of the samples in your cohort. Because the GATK can dynamically merge BAM files, it isn't critical to have merged files by lane into sample bams, or even samples bams into cohort bams. In general we try to create sample level bams for deep data sets (deep WG or exomes) and merged cohort files by chromosome for WG low-pass.
For this part of the this document, I'm going to assume that you have a single realigned, recalibrated, dedupped BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort. Note that some of the data processing steps, such as multiple sample local realignment, will merge BAMS for many samples into a single BAM. If you've gone down this route, you just need to modify the GATK commands as necessary to take not multiple BAMs, one for each sample, but a single BAM for all samples.
The next step in the standard GATK data processing pipeline, whole genome or targeted, deep or shallow, is to apply the Haplotype Caller or Unified Genotyper to identify sites among the cohort samples that are statistically non-reference. This will produce a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort. It's in this stage that we use the meta-data in the BAM files extensively -- read groups for reads, with samples, platforms, etc -- to enable us to do the multi-sample merging and genotyping correctly. It was a pain for data processing, yes, but now life is easy for downstream calling and analysis.
A common question is the confidence score threshold to use for variant detection. We recommend:
raw.vcf <- HaplotypeCaller(sample1.bam, sample2.bam, ..., sampleN.bam)
raw.vcf <- UnifiedGenotyper(sample1.bam, sample2.bam, ..., sampleN.bam)
-nt option. However you can use Queue to parallelize execution of HaplotypeCaller.This raw VCF file should be as sensitive to variation as you'll get without imputation. At this stage, you can assess things like sensitivity to known variant sites or genotype chip concordance. The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference. All of the subsequent steps are designed to separate out the false positive machine artifacts from the true positive genetic variants.
The process used here is the Variant quality score recalibrator which builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant in the callset is a true genetic variant or a machine/alignment artifact. All filtering criteria are learned from the data itself.
Take a look at our FAQ page for specific recommendations on which training sets and command-line arguments to use with various project designs. It is expected that analysis will be required by the user when determining the best way to use this tool for different projects. These recommendations are only meant as jumping off points!
We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:
snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples. Also, for experiments that employ targeted resequencing of a small region (for example, a few hundred genes), VQSR may not be empowered regardless of the number of samples in the experiment. For users with experiments containing fewer exome samples or with a small target region there are several options to explore (listed in priority order of what we think will give the best results):
--maxGaussians 4 --percentBad 0.05 to your command line. Note that this is very dependent on your dataset, and you may need to try some very different settings. It may even not work at all. Unfortunately we cannot give you any specific advice, so please do not post questions on the forum asking for help finding the right parameters. These recommended arguments for VariantFiltration are only to used when ALL other options are not available:
You will need to compose filter expressions (see here, here and here for details) to filter on the following annotations and values:
For SNPs:
QD < 2.0MQ < 40.0FS > 60.0HaplotypeScore > 13.0 MQRankSum < -12.5ReadPosRankSum < -8.0For indels:
QD < 2.0ReadPosRankSum < -20.0InbreedingCoeff < -0.8FS > 200.0Note that the InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
For shallow-coverage (<10x): you cannot use filtering to reliably separate true positives from false positives. You must use the protocol involving variant quality score recalibration.
The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
That said, all of the caveats about determining the right parameters, etc, are annoying and are largely eliminated by variant quality score recalibration.
The GATK uses two files to access and safety check access to the reference files: a .dict dictionary of the contig names and sizes and a .fai fasta index file to allow efficient random access to the reference bases. You have to generate these files in order to be able to use a Fasta file as reference.
NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.
We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.
> java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done.
Runtime.totalMemory()=2112487424
44.922u 2.308s 0:47.09 100.2% 0+0k 0+0io 2pf+0w
This produces a SAM-style header file describing the contents of our fasta file.
> cat Homo_sapiens_assembly18.dict
@HD VN:1.0 SO:unsorted
@SQ SN:chrM LN:16571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ SN:chr1 LN:247249719 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9ebc6df9496613f373e73396d5b3b6b6
@SQ SN:chr2 LN:242951149 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:b12c7373e3882120332983be99aeb18d
@SQ SN:chr3 LN:199501827 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0e48ed7f305877f66e6fd4addbae2b9a
@SQ SN:chr4 LN:191273063 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cf37020337904229dca8401907b626c2
@SQ SN:chr5 LN:180857866 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:031c851664e31b2c17337fd6f9004858
@SQ SN:chr6 LN:170899992 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfe8005c536131276d448ead33f1b583
@SQ SN:chr7 LN:158821424 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:74239c5ceee3b28f0038123d958114cb
@SQ SN:chr8 LN:146274826 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1eb00fe1ce26ce6701d2cd75c35b5ccb
@SQ SN:chr9 LN:140273252 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:ea244473e525dde0393d353ef94f974b
@SQ SN:chr10 LN:135374737 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4ca41bf2d7d33578d2cd7ee9411e1533
@SQ SN:chr11 LN:134452384 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:425ba5eb6c95b60bafbf2874493a56c3
@SQ SN:chr12 LN:132349534 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d17d70060c56b4578fa570117bf19716
@SQ SN:chr13 LN:114142980 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c4f3084a20380a373bbbdb9ae30da587
@SQ SN:chr14 LN:106368585 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c1ff5d44683831e9c7c1db23f93fbb45
@SQ SN:chr15 LN:100338915 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:5cd9622c459fe0a276b27f6ac06116d8
@SQ SN:chr16 LN:88827254 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3e81884229e8dc6b7f258169ec8da246
@SQ SN:chr17 LN:78774742 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2a5c95ed99c5298bb107f313c7044588
@SQ SN:chr18 LN:76117153 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3d11df432bcdc1407835d5ef2ce62634
@SQ SN:chr19 LN:63811651 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2f1a59077cfad51df907ac25723bff28
@SQ SN:chr20 LN:62435964 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f126cdf8a6e0c7f379d618ff66beb2da
@SQ SN:chr21 LN:46944323 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f1b74b7f9f4cdbaeb6832ee86cb426c6
@SQ SN:chr22 LN:49691432 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2041e6a0c914b48dd537922cca63acb8
@SQ SN:chrX LN:154913754 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d7e626c80ad172a4d7c95aadb94d9040
@SQ SN:chrY LN:57772954 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:62f69d0e82a12af74bad85e2e4a8bd91
@SQ SN:chr1_random LN:1663265 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cc05cb1554258add2eb62e88c0746394
@SQ SN:chr2_random LN:185571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:18ceab9e4667a25c8a1f67869a4356ea
@SQ SN:chr3_random LN:749256 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cc571e918ac18afa0b2053262cadab6
@SQ SN:chr4_random LN:842648 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cab2949ccf26ee0f69a875412c93740
@SQ SN:chr5_random LN:143687 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:05926bdbff978d4a0906862eb3f773d0
@SQ SN:chr6_random LN:1875562 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d62eb2919ba7b9c1d382c011c5218094
@SQ SN:chr7_random LN:549659 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:28ebfb89c858edbc4d71ff3f83d52231
@SQ SN:chr8_random LN:943810 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0ed5b088d843d6f6e6b181465b9e82ed
@SQ SN:chr9_random LN:1146434 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf
@SQ SN:chr10_random LN:113275 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:50be2d2c6720dabeff497ffb53189daa
@SQ SN:chr11_random LN:215294 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfc93adc30c621d5c83eee3f0d841624
@SQ SN:chr13_random LN:186858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:563531689f3dbd691331fd6c5730a88b
@SQ SN:chr15_random LN:784346 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bf885e99940d2d439d83eba791804a48
@SQ SN:chr16_random LN:105485 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:dd06ea813a80b59d9c626b31faf6ae7f
@SQ SN:chr17_random LN:2617613 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:34d5e2005dffdfaaced1d34f60ed8fc2
@SQ SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f3814841f1939d3ca19072d9e89f3fd7
@SQ SN:chr19_random LN:301858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:420ce95da035386cc8c63094288c49e2
@SQ SN:chr21_random LN:1679693 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:a7252115bfe5bb5525f34d039eecd096
@SQ SN:chr22_random LN:257318 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4f2d259b82f7647d3b668063cf18378b
@SQ SN:chrX_random LN:1719168 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f4d71e0758986c15e5455bf3e14e5d6f
We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.
> samtools faidx Homo_sapiens_assembly18.fasta
108.446u 3.384s 2:44.61 67.9% 0+0k 0+0io 0pf+0w
This produces a text file with one record per line for each of the fasta contigs. Each record is of the: contig, size, location, basesPerLine, bytesPerLine. The index file produced above looks like:
> cat Homo_sapiens_assembly18.fasta.fai
chrM 16571 6 50 51
chr1 247249719 16915 50 51
chr2 242951149 252211635 50 51
chr3 199501827 500021813 50 51
chr4 191273063 703513683 50 51
chr5 180857866 898612214 50 51
chr6 170899992 1083087244 50 51
chr7 158821424 1257405242 50 51
chr8 146274826 1419403101 50 51
chr9 140273252 1568603430 50 51
chr10 135374737 1711682155 50 51
chr11 134452384 1849764394 50 51
chr12 132349534 1986905833 50 51
chr13 114142980 2121902365 50 51
chr14 106368585 2238328212 50 51
chr15 100338915 2346824176 50 51
chr16 88827254 2449169877 50 51
chr17 78774742 2539773684 50 51
chr18 76117153 2620123928 50 51
chr19 63811651 2697763432 50 51
chr20 62435964 2762851324 50 51
chr21 46944323 2826536015 50 51
chr22 49691432 2874419232 50 51
chrX 154913754 2925104499 50 51
chrY 57772954 3083116535 50 51
chr1_random 1663265 3142044962 50 51
chr2_random 185571 3143741506 50 51
chr3_random 749256 3143930802 50 51
chr4_random 842648 3144695057 50 51
chr5_random 143687 3145554571 50 51
chr6_random 1875562 3145701145 50 51
chr7_random 549659 3147614232 50 51
chr8_random 943810 3148174898 50 51
chr9_random 1146434 3149137598 50 51
chr10_random 113275 3150306975 50 51
chr11_random 215294 3150422530 50 51
chr13_random 186858 3150642144 50 51
chr15_random 784346 3150832754 50 51
chr16_random 105485 3151632801 50 51
chr17_random 2617613 3151740410 50 51
chr18_random 4262 3154410390 50 51
chr19_random 301858 3154414752 50 51
chr21_random 1679693 3154722662 50 51
chr22_random 257318 3156435963 50 51
chrX_random 1719168 3156698441 50 51
VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. See this page for detailed specifications.
VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.
That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from NGS data, such as the UnifiedGenotyper and the HaplotypeCaller, is especially complex. This document describes some specific features and annotations used in the VCF files output by the GATK tools.
The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from our favorite test sample, NA12878:
##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
It seems a bit complex, but the structure of the file is actually quite simple:
[HEADER LINES]
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that here everything is a SNP, but some could be indels or CNVs.
The first 6 columns of the VCF, which represent the observed variation, are easy to understand because they have a single, well-defined meaning.
CHROM and POS : The CHROM and POS gives the contig on which the variant occurs. For indels this is actually the base preceding the event, due to how indels are represented in a VCF.
ID: The dbSNP rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP.
REF and ALT: The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event).
QUAL: The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of NGS data is used for variant calling.
FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis.
For more details about these fields, please see this page.
In the excerpt shown above, here is how we interpret the line corresponding to each variant:
The genotype fields of the VCF look more complicated but they're actually not that hard to interpret once you understand that they're just sets of tags and values. Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:
chr1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
Looking at that last column, here is what the tags mean:
GT : The genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:
GQ: The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT. In the diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood that the sample is 0/0, 0/1/, or 1/1 under the model built for the NGS dataset.
AD and DP: These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. See the Technical Documentation for details on AD (DepthPerAlleleBySample) and DP (Coverage).
PL: This field provides the likelihoods of the given genotypes (here, 0/0, 0/1, and 1/1). These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the heterozygous case, this is L(data given that the true genotype is 0/1). The most likely genotype (given in the GT field) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.
With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282.
chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
At this site, the called genotype is GT = 0/1, which is C/T. The confidence (GQ=25.92) isn't so good, largely because there were only a total of 4 reads at this site (DP=4), 1 of which was ref (=had the reference base) and 3 of which were alt (=had the alternate base) (AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value), whereas there's a serious chance that the subject is hom-var (=homozygous with the variant allele) since PL(1/1) = 26 = 10^(-2.6) = 0.25%. Either way, though, it's clear that the subject is definitely not home-ref (=homozygous with the reference allele) here since PL(0/0) = 103 = 10^(-10.3) which is a very small number.
Finally, variants in a VCF can be annotated with a variety of additional tags, either by the built-in tools or with others that you add yourself. The way they're formatted is similar to what we saw in the Genotype fields, except instead of being in two separate fields (tags and values, respectively) the annotation tags and values are grouped together, so tag-value pairs are written one after another.
chr1 873762 [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473
chr1 877664 [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185
chr1 899282 [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148
Here are some commonly used built-in annotations and what they mean:
| Annotation tag in VCF | Meaning |
|---|---|
| AC,AF,AN | See the Technical Documentation for Chromosome Counts. |
| DB | If present, then the variant is in dbSNP. |
| DP | See the Technical Documentation for Coverage. |
| DS | Were any of the samples downsampled because of too much coverage? |
| Dels | See the Technical Documentation for SpanningDeletions. |
| MQ and MQ0 | See the Technical Documentation for RMS Mapping Quality and Mapping Quality Zero. |
| BaseQualityRankSumTest | See the Technical Documentation for Base Quality Rank Sum Test. |
| MappingQualityRankSumTest | See the Technical Documentation for Mapping Quality Rank Sum Test. |
| ReadPosRankSumTest | See the Technical Documentation for Read Position Rank Sum Test. |
| HRun | See the Technical Documentation for Homopolymer Run. |
| HaplotypeScore | See the Technical Documentation for Haplotype Score. |
| QD | See the Technical Documentation for Qual By Depth. |
| VQSLOD | Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model. |
| FS | See the Technical Documentation for Fisher Strand |
| SB | How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls). |
Run a basic analysis command on example data, parallelized with Queue.
One very cool feature of Queue is that you can test your script by doing a "dry run". That means Queue will prepare the analysis and build the scatter commands, but not actually run them. This makes it easier to check the sanity of your script and command.
Here we're going to set up a dry run of a CountReads analysis. You should be familiar with the CountReads walker and the example files from the bundles, as used in the basic "GATK for the first time" tutorial. In addition, we're going to use the example QScript called ExampleCountReads.scala provided in the Queue package download.
Type the following command:
java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam
where -S ExampleCountReads.scala specifies which QScript we want to run, -R exampleFASTA.fasta specifies the reference sequence, and -I exampleBAM.bam specifies the file of aligned reads we want to analyze.
After a few seconds you should see output that looks nearly identical to this:
INFO 00:30:45,527 QScriptManager - Compiling 1 QScript
INFO 00:30:52,869 QScriptManager - Compilation complete
INFO 00:30:53,284 HelpFormatter - ----------------------------------------------------------------------
INFO 00:30:53,284 HelpFormatter - Queue v2.0-36-gf5c1c1a, Compiled 2012/08/08 20:18:21
INFO 00:30:53,284 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 00:30:53,284 HelpFormatter - Fro support and documentation go to http://www.broadinstitute.org/gatk
INFO 00:30:53,285 HelpFormatter - Program Args: -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam
INFO 00:30:53,285 HelpFormatter - Date/Time: 2012/08/09 00:30:53
INFO 00:30:53,285 HelpFormatter - ----------------------------------------------------------------------
INFO 00:30:53,285 HelpFormatter - ----------------------------------------------------------------------
INFO 00:30:53,290 QCommandLine - Scripting ExampleCountReads
INFO 00:30:53,364 QCommandLine - Added 1 functions
INFO 00:30:53,364 QGraph - Generating graph.
INFO 00:30:53,388 QGraph - -------
INFO 00:30:53,402 QGraph - Pending: 'java' '-Xmx1024m' '-Djava.io.tmpdir=/Users/vdauwera/sandbox/Q2/resources/tmp' '-cp' '/Users/vdauwera/sandbox/Q2/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'CountReads' '-I' '/Users/vdauwera/sandbox/Q2/resources/exampleBAM.bam' '-R' '/Users/vdauwera/sandbox/Q2/resources/exampleFASTA.fasta'
INFO 00:30:53,403 QGraph - Log: /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads-1.out
INFO 00:30:53,403 QGraph - Dry run completed successfully!
INFO 00:30:53,404 QGraph - Re-run with "-run" to execute the functions.
INFO 00:30:53,409 QCommandLine - Script completed successfully with 1 total jobs
INFO 00:30:53,410 QCommandLine - Writing JobLogging GATKReport to file /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads.jobreport.txt
If you don't see this, check your spelling (GATK commands are case-sensitive), check that the files are in your working directory, and if necessary, re-check that the GATK and Queue are properly installed.
If you do see this output, congratulations! You just successfully ran you first Queue dry run!
Once you have verified that the Queue functions have been generated successfully, you can execute the pipeline by appending -run to the command line.
Instead of this command, which we used earlier:
java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam
this time you type this:
java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam -run
See the difference?
You should see output that looks nearly identical to this:
INFO 00:56:33,688 QScriptManager - Compiling 1 QScript
INFO 00:56:39,327 QScriptManager - Compilation complete
INFO 00:56:39,487 HelpFormatter - ----------------------------------------------------------------------
INFO 00:56:39,487 HelpFormatter - Queue v2.0-36-gf5c1c1a, Compiled 2012/08/08 20:18:21
INFO 00:56:39,488 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 00:56:39,488 HelpFormatter - Fro support and documentation go to http://www.broadinstitute.org/gatk
INFO 00:56:39,489 HelpFormatter - Program Args: -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam -run
INFO 00:56:39,490 HelpFormatter - Date/Time: 2012/08/09 00:56:39
INFO 00:56:39,490 HelpFormatter - ----------------------------------------------------------------------
INFO 00:56:39,491 HelpFormatter - ----------------------------------------------------------------------
INFO 00:56:39,498 QCommandLine - Scripting ExampleCountReads
INFO 00:56:39,569 QCommandLine - Added 1 functions
INFO 00:56:39,569 QGraph - Generating graph.
INFO 00:56:39,589 QGraph - Running jobs.
INFO 00:56:39,623 FunctionEdge - Starting: 'java' '-Xmx1024m' '-Djava.io.tmpdir=/Users/vdauwera/sandbox/Q2/resources/tmp' '-cp' '/Users/vdauwera/sandbox/Q2/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'CountReads' '-I' '/Users/vdauwera/sandbox/Q2/resources/exampleBAM.bam' '-R' '/Users/vdauwera/sandbox/Q2/resources/exampleFASTA.fasta'
INFO 00:56:39,623 FunctionEdge - Output written to /Users/GG/codespace/GATK/Q2/resources/ExampleCountReads-1.out
INFO 00:56:50,301 QGraph - 0 Pend, 1 Run, 0 Fail, 0 Done
INFO 00:57:09,827 FunctionEdge - Done: 'java' '-Xmx1024m' '-Djava.io.tmpdir=/Users/vdauwera/sandbox/Q2/resources/tmp' '-cp' '/Users/vdauwera/sandbox/Q2/resources/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'CountReads' '-I' '/Users/vdauwera/sandbox/Q2/resources/exampleBAM.bam' '-R' '/Users/vdauwera/sandbox/Q2/resources/exampleFASTA.fasta'
INFO 00:57:09,828 QGraph - 0 Pend, 0 Run, 0 Fail, 1 Done
INFO 00:57:09,835 QCommandLine - Script completed successfully with 1 total jobs
INFO 00:57:09,835 QCommandLine - Writing JobLogging GATKReport to file /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads.jobreport.txt
INFO 00:57:10,107 QCommandLine - Plotting JobLogging GATKReport to file /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads.jobreport.pdf
WARN 00:57:18,597 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info.
Great! It works!
The results of the traversal will be written to a file in the current directory. The name of the file will be printed in the output, ExampleCountReads.out in this example.
If for some reason the run was interrupted, in most cases you can resume by just launching the command. Queue will pick up where it left off without redoing the parts that ran successfully.
Run with -bsub to run on LSF, or for early Grid Engine support see Queue with Grid Engine.
See also QFunction and Command Line Options for more info on Queue options.
Run a basic analysis command on example data.
A very simple analysis that you can do with the GATK is getting a count of the reads in a BAM file. The GATK is capable of much more powerful analyses, but this is a good starting example because there are very few things that can go wrong.
So we are going to count the reads in the file exampleBAM.bam, which you can find in the GATK resource bundle along with its associated index (same file name with .bai extension), as well as the example reference exampleFASTA.fasta and its associated index (same file name with .fai extension) and dictionary (same file name with .dict extension). Copy them to your working directory so that your directory contents look like this:
[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x 9 vdauwera CHARLES\Domain Users 306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera CHARLES\Domain Users 204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 20 Apr 10 07:39 exampleFASTA.fasta.fai
Type the following command:
java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
where -T CountReads specifies which analysis tool we want to use, -R exampleFASTA.fasta specifies the reference sequence, and -I exampleBAM.bam specifies the file of aligned reads we want to analyze.
For any analysis that you want to run on a set of aligned reads, you will always need to use at least these three arguments:
-T for the tool name, which specifices the corresponding analysis-R for the reference sequence file-I for the input BAM file of aligned readsThey don't have to be in that order in your command, but this way you can remember that you need them if you TRI...
After a few seconds you should see output that looks like to this:
INFO 16:17:45,945 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:45,946 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41
INFO 16:17:45,947 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:17:45,947 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:17:45,947 HelpFormatter - Program Args: -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
INFO 16:17:45,947 HelpFormatter - Date/Time: 2012/07/25 16:17:45
INFO 16:17:45,947 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:45,948 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:45,950 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:17:45,982 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:17:45,993 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 16:17:46,060 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 16:17:46,060 TraversalEngine - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 16:17:46,061 Walker - [REDUCE RESULT] Traversal result is: 33
INFO 16:17:46,061 TraversalEngine - Total runtime 0.00 secs, 0.00 min, 0.00 hours
INFO 16:17:46,100 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO 16:17:46,729 GATKRunReport - Uploaded run statistics report to AWS S3
Depending on the GATK release, you may see slightly different information output, but you know everything is running correctly if you see the line:
INFO 21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33
somewhere in your output.
If you don't see this, check your spelling (GATK commands are case-sensitive), check that the files are in your working directory, and if necessary, re-check that the GATK is properly installed.
If you do see this output, congratulations! You just successfully ran you first GATK analysis!
Basically the output you see means that the CountReadsWalker (which you invoked with the command line option -T CountReads) counted 33 reads in the exampleBAM.bam file, which is exactly what we expect to see.
Wait, what is this walker thing?
In the GATK jargon, we call the tools walkers because the way they work is that they walk through the dataset --either along the reference sequence (LocusWalkers), or down the list of reads in the BAM file (ReadWalkers)-- collecting the requested information along the way.
Now that you're rocking the read counts, you can start to expand your use of the GATK command line.
Let's say you don't care about counting reads anymore; now you want to know the number of loci (positions on the genome) that are covered by one or more reads. The name of the tool, or walker, that does this is CountLoci. Since the structure of the GATK command is basically always the same, you can simply switch the tool name, right?
Instead of this command, which we used earlier:
java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
this time you type this:
java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam
See the difference?
You should see something like this output:
INFO 16:18:26,183 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:18:26,185 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41
INFO 16:18:26,185 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:18:26,185 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:18:26,186 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam
INFO 16:18:26,186 HelpFormatter - Date/Time: 2012/07/25 16:18:26
INFO 16:18:26,186 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:18:26,186 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:18:26,189 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:18:26,222 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:18:26,233 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 16:18:26,351 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 16:18:26,351 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining
2052
INFO 16:18:26,411 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours
INFO 16:18:26,450 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO 16:18:27,124 GATKRunReport - Uploaded run statistics report to AWS S3
Great! But wait -- where's the result? Last time the result was given on this line:
INFO 21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33
But this time there is no line that says [REDUCE RESULT]! Is something wrong?
Not really. The program ran just fine -- but we forgot to give it an output file name. You see, the CountLoci walker is set up to output the result of its calculations to a text file, unlike CountReads, which is perfectly happy to output its result to the terminal screen.
So we repeat the command, but this time we specify an output file, like this:
java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt
where -o (lowercase o, not zero) is used to specify the output.
You should get essentially the same output on the terminal screen as previously (but notice the difference in the line that contains Program Args -- the new argument is included):
INFO 16:29:15,451 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:29:15,453 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41
INFO 16:29:15,453 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:29:15,453 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:29:15,453 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt
INFO 16:29:15,454 HelpFormatter - Date/Time: 2012/07/25 16:29:15
INFO 16:29:15,454 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:29:15,454 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:29:15,457 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:29:15,488 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:29:15,499 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 16:29:15,618 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 16:29:15,618 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 16:29:15,679 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours
INFO 16:29:15,718 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO 16:29:16,712 GATKRunReport - Uploaded run statistics report to AWS S3
This time however, if we look inside the working directory, there is a newly created file there called output.txt.
[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x 9 vdauwera CHARLES\Domain Users 306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera CHARLES\Domain Users 204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 20 Apr 10 07:39 exampleFASTA.fasta.fai
-rw-r--r-- 1 vdauwera CHARLES\Domain Users 5 Jul 25 16:29 output.txt
This file contains the result of the analysis:
[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% cat output.txt
2052
This means that there are 2052 loci in the reference sequence that are covered by at least one or more reads in the BAM file.
Okay then, but why not show the full, correct command in the first place? Because this was a good opportunity for you to learn a few of the caveats of the GATK command system, which may save you a lot of frustration later on.
Beyond the common basic arguments that almost all GATK walkers require, most of them also have specific requirements or options that are important to how they work. You should always check what are the specific arguments that are required, recommended and/or optional for the walker you want to use before starting an analysis.
Fortunately the GATK is set up to complain (i.e. terminate with an error message) if you try to run it without specifying a required argument. For example, if you try to run this:
java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta
the GATK will spit out a wall of text, including the basic usage guide that you can invoke with the --help option, and more importantly, the following error message:
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.0-22-g40f97eb):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Walker requires reads but none were provided.
##### ERROR ------------------------------------------------------------------------------------------
You see the line that says ERROR MESSAGE: Walker requires reads but none were provided? This tells you exactly what was wrong with your command.
So the GATK will not run if a walker does not have all the required inputs. That's a good thing! But in the case of our first attempt at running CountLoci, the -o argument is not required by the GATK to run -- it's just highly desirable if you actually want the result of the analysis!
There will be many other cases of walkers with arguments that are not strictly required, but highly desirable if you want the results to be meaningful.
So, at the risk of getting repetitive, always read the documentation of each walker that you want to use!
Test that the GATK is correctly installed, and that the supporting tools like Java are in your path.
The command we're going to run is a very simple command that asks the GATK to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your GATK package is installed correctly.
Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.
Type the following command:
java -jar <path to GenomeAnalysisTK.jar> --help
replacing the <path to GenomeAnalysisTK.jar> bit with the path you have set up in your command-line environment.
You should see usage output similar to the following:
usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-I <input_file>] [-L
<intervals>] [-R <reference_sequence>] [-B <rodBind>] [-D <DBSNP>] [-H
<hapmap>] [-hc <hapmap_chip>] [-o <out>] [-e <err>] [-oe <outerr>] [-A] [-M
<maximum_reads>] [-sort <sort_on_the_fly>] [-compress <bam_compression>] [-fmq0] [-dfrac
<downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-S
<validation_strictness>] [-U] [-P] [-dt] [-tblw] [-nt <numthreads>] [-l
<logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h]
-T,--analysis_type <analysis_type> Type of analysis to run
-I,--input_file <input_file> SAM or BAM file(s)
-L,--intervals <intervals> A list of genomic intervals over which
to operate. Can be explicitly specified
on the command line or in a file.
-R,--reference_sequence <reference_sequence> Reference sequence file
-B,--rodBind <rodBind> Bindings for reference-ordered data, in
the form <name>,<type>,<file>
-D,--DBSNP <DBSNP> DBSNP file
-H,--hapmap <hapmap> Hapmap file
-hc,--hapmap_chip <hapmap_chip> Hapmap chip file
-o,--out <out> An output file presented to the walker.
Will overwrite contents if file exists.
-e,--err <err> An error output file presented to the
walker. Will overwrite contents if file
exists.
-oe,--outerr <outerr> A joint file for 'normal' and error
output presented to the walker. Will
overwrite contents if file exists.
...
If you see this message, your GATK installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.
Let's try to figure out what's not working.
First, make sure that your Java version is at least 1.6, by typing the following command:
java -version
You should see something similar to the following text:
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
If the version is less then 1.6, install the newest version of Java onto the system. If you instead see something like
java: Command not found
make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.
On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.
Test that Queue is correctly installed, and that the supporting tools like Java are in your path.
The command we're going to run is a very simple command that asks Queue to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your Queue package is installed correctly.
Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.
Type the following command:
java -jar <path to Queue.jar> --help
replacing the <path to Queue.jar> bit with the path you have set up in your command-line environment.
You should see usage output similar to the following:
usage: java -jar Queue.jar -S <script> [-jobPrefix <job_name_prefix>] [-jobQueue <job_queue>] [-jobProject <job_project>]
[-jobSGDir <job_scatter_gather_directory>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
<temp_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>] [-emailTLS] [-emailSSL] [-emailUser
<emailUsername>] [-emailPass <emailPassword>] [-emailPassFile <emailPasswordFile>] [-bsub] [-run] [-dot <dot_graph>]
[-expandedDot <expanded_dot_graph>] [-startFromScratch] [-status] [-statusFrom <status_email_from>] [-statusTo
<status_email_to>] [-keepIntermediates] [-retry <retry_failed>] [-l <logging_level>] [-log <log_to_file>] [-quiet]
[-debug] [-h]
-S,--script <script> QScript scala file
-jobPrefix,--job_name_prefix <job_name_prefix> Default name prefix for compute farm jobs.
-jobQueue,--job_queue <job_queue> Default queue for compute farm jobs.
-jobProject,--job_project <job_project> Default project for compute farm jobs.
-jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory> Default directory to place scatter gather
output for compute farm jobs.
-memLimit,--default_memory_limit <default_memory_limit> Default memory limit for jobs, in gigabytes.
-runDir,--run_directory <run_directory> Root directory to run functions from.
-tempDir,--temp_directory <temp_directory> Temp directory to pass to functions.
-emailHost,--emailSmtpHost <emailSmtpHost> Email SMTP host. Defaults to localhost.
-emailPort,--emailSmtpPort <emailSmtpPort> Email SMTP port. Defaults to 465 for ssl,
otherwise 25.
-emailTLS,--emailUseTLS Email should use TLS. Defaults to false.
-emailSSL,--emailUseSSL Email should use SSL. Defaults to false.
-emailUser,--emailUsername <emailUsername> Email SMTP username. Defaults to none.
-emailPass,--emailPassword <emailPassword> Email SMTP password. Defaults to none. Not
secure! See emailPassFile.
-emailPassFile,--emailPasswordFile <emailPasswordFile> Email SMTP password file. Defaults to none.
-bsub,--bsub_all_jobs Use bsub to submit jobs
-run,--run_scripts Run QScripts. Without this flag set only
performs a dry run.
-dot,--dot_graph <dot_graph> Outputs the queue graph to a .dot file. See:
http://en.wikipedia.org/wiki/DOT_language
-expandedDot,--expanded_dot_graph <expanded_dot_graph> Outputs the queue graph of scatter gather to
a .dot file. Otherwise overwrites the
dot_graph
-startFromScratch,--start_from_scratch Runs all command line functions even if the
outputs were previously output successfully.
-status,--status Get status of jobs for the qscript
-statusFrom,--status_email_from <status_email_from> Email address to send emails from upon
completion or on error.
-statusTo,--status_email_to <status_email_to> Email address to send emails to upon
completion or on error.
-keepIntermediates,--keep_intermediate_outputs After a successful run keep the outputs of
any Function marked as intermediate.
-retry,--retry_failed <retry_failed> Retry the specified number of times after a
command fails. Defaults to no retries.
-l,--logging_level <logging_level> Set the minimum level of logging, i.e.
setting INFO get's you INFO up to FATAL,
setting ERROR gets you ERROR and FATAL level
logging.
-log,--log_to_file <log_to_file> Set the logging location
-quiet,--quiet_output_mode Set the logging to quiet mode, no output to
stdout
-debug,--debug_mode Set the logging file string to include a lot
of debugging information (SLOW!)
-h,--help Generate this help message
If you see this message, your Queue installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.
Let's try to figure out what's not working.
First, make sure that your Java version is at least 1.6, by typing the following command:
java -version
You should see something similar to the following text:
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
If the version is less then 1.6, install the newest version of Java onto the system. If you instead see something like
java: Command not found
make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.
On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.
GATK-Queue is command-line scripting framework for defining multi-stage genomic analysis pipelines combined with an execution manager that runs those pipelines from end-to-end. Often processing genome data includes several steps to produces outputs, for example our BAM to VCF calling pipeline include among other things:
Running these tools one by one in series may often take weeks for processing, or would require custom scripting to try and optimize using parallel resources.
With a Queue script users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.
You have two options: donwload the binary distribution (prepackaged, ready to run program) or build it from source.
This is obviously the easiest way to go. Links are on the Downloads page.
Briefly, here's what you need to know/do:
Queue is part of the Sting repository. Download the source from our repository on Github. Run the following command:
git clone git://github.com/broadgsa/gatk.git Sting
Use ant to build the source.
cd Sting
ant queue
Queue uses the Ivy dependency manager to fetch all other dependencies. Just make sure you have suitable versions of the JDK and Ant!
See this article on how to test your installation of Queue.
See this article on running Queue for the first time for full details.
Queue arguments can be listed by running with --help
java -jar dist/Queue.jar --help
To list the arguments required by a QScript, add the script with -S and run with --help.
java -jar dist/Queue.jar -S script.scala --help
Note that by default queue runs in a "dry" mode, as explained in the link above. After verifying the generated commands execute the pipeline by adding -run.
See QFunction and Command Line Options for more info on adjusting Queue options.
Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.
Every QScript includes the following steps:
add() to Queue for dispatch and monitoringThe basic command-line to run the Queue pipelines on the command line is
java -jar Queue.jar -S <script>.scala
See the main article Queue QScripts for more info on QScripts.
While most QScripts are analysis pipelines that are custom-built for specific projects, some have been released as supported tools. See
The latest version of the example files are available in the Sting github repository under public/scala/qscript/examples
Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.
Note that Queue attempts to generate a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally the script requires the gsalib to be installed on the machine, which is typically done by providing its path in your .Rprofile file:
bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile .libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")
The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the completed commands. This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves
This feature only works for command line and LSF execution models. SGE should be easy to add for a motivated individual but we cannot test this capabilities here at the Broad. Please send us a patch if you do extend Queue to support SGE.
Queue emits a queue.dot file to help visualize your commands. You can open this file in programs like DOT, OmniGraffle, etc to view your pipelines. By default the system will print out your LSF command lines, but this can be too much in a complex pipeline.
To clarify your pipeline, override the dotString() function:
class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
@Input(doc="foo") var bam = bamIn
@Input(doc="foo") var bamIndex = bai(bamIn)
@Output(doc="foo") var recalData = recalDataIn
memoryLimit = Some(4)
override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
def commandLine = gatkCommandLine("CountCovariates") + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}
Here we only see CountCovariates my.bam [-OQ], for example, in the dot file. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here:
The GATK runs natively on most if not all flavors of UNIX, which includes MacOSX, Linux and BSD. It is possible to get it running on Windows using Cygwin, but we don't provide any support nor instructions for that.
The GATK is a Java-based program, so you'll need to have Java installed on your machine. The Java version should be at 1.6 (at this time we don't support 1.7). You can check what version you have by typing java -version at the command line. This article has some more details about what to do if you don't have the right version. Note that at this time we only support the Sun/Oracle Java JDK; OpenJDK is not supported.
The GATK does not have a Graphical User Interface (GUI). You don't open it by clicking on the .jar file; you have to use the Console (or Terminal) to input commands. If this is all new to you, we recommend you first learn about that and follow some online tutorials before trying to use the GATK. It's not difficult but you'll need to learn some jargon and get used to living without a mouse...
The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. Only the standard ACGT bases are accepted; no non-standard bases (W, for example) are tolerated. Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see this article for more information on preparing FASTA reference sequences for use with the GATK.
If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.
Our Best Practice recommendation is that you use a standard GATK reference from the GATK resource bundle.
The only input format for NGS reads that the GATK supports is the [Sequence Alignment/Map (SAM)] format. See [SAM/BAM] for more details on the SAM/BAM format as well as Samtools and Picard, two complementary sets of utilities for working with SAM/BAM files.
In addition to being in SAM format, we require the following additional constraints in order to use your file with the GATK:
.bam file extension).Below is an example well-formed SAM field header and fields from the 1000 Genomes Project:
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
@SQ SN:4 LN:191154276 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:23dccd106897542ad87d2765d28a19a1
@SQ SN:5 LN:180915260 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:0740173db9ffd264d728f32784845cd7
@SQ SN:6 LN:171115067 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1d3a93a248d92a729ee764823acbbc6b
@SQ SN:7 LN:159138663 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:618366e953d6aaad97dbe4777c29375e
@SQ SN:8 LN:146364022 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:96f514a9929e410c6651697bded59aec
@SQ SN:9 LN:141213431 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:3e273117f15e0a400f01055d9f393768
@SQ SN:10 LN:135534747 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:988c28e000e84c26d552359af1ea2e1d
@SQ SN:11 LN:135006516 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:98c59049a2df285c76ffb1c6db8f8b96
@SQ SN:12 LN:133851895 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:51851ac0e1a115847ad36449b0015864
@SQ SN:13 LN:115169878 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:283f8d7892baa81b510a015719ca7b0b
@SQ SN:14 LN:107349540 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:98f3cae32b2a2e9524bc19813927542e
@SQ SN:15 LN:102531392 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:e5645a794a8238215b2cd77acb95a078
@SQ SN:16 LN:90354753 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:fc9b1a7b42b97a864f56b348b06095e6
@SQ SN:17 LN:81195210 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:351f64d4f4f9ddd45b35336ad97aa6de
@SQ SN:18 LN:78077248 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c
@SQ SN:19 LN:59128983 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1aacd71f30db8e561810913e0b72636d
@SQ SN:20 LN:63025520 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:0dec9660ec1efaaf33281c0d5ea2560f
@SQ SN:21 LN:48129895 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:2979a6085bfe28e3ad6f552f361ed74d
@SQ SN:22 LN:51304566 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:a718acaa6135fdca8357d5bfe94211dd
@SQ SN:X LN:155270560 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:7e0e2e580297b7764e31dbc80c2540dd
@SQ SN:Y LN:59373566 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1fa3474750af0948bdf97d5a0ee52e51
@SQ SN:MT LN:16569 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:c68f52674c9fb33aef52dcf399755519
@RG ID:ERR000162 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR000252 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001684 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001685 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001686 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001687 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001688 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001689 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001690 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002307 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002308 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002309 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002310 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002311 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002312 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002313 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002434 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@PG ID:GATK TableRecalibration VN:v2.2.16 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, DinucCovariate, CycleCovariate], use_original_quals=true, defau
t_read_group=DefaultReadGroup, default_platform=Illumina, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, except on_if_no_tile=false, pQ=5, maxQ=40, smoothing=137 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:b4eb71ee878d3706246b7c1dbef69299
@PG ID:bwa VN:0.5.5
ERR001685.4315085 16 1 9997 25 35M * 0 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT ?8:C7ACAABBCBAAB?CCAABBEBA@ACEBBB@? XT:A:U XN:i:4 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001685 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:>>:>2>>>>>>>>>>>>>>>>>>?>>>>??>???>
ERR001689.1165834 117 1 9997 0 * = 9997 0 CCGATCTAGGGTTAGGGTTAGGGTTAGGGTTAGGG >7AA<@@C?@?B?B??>9?B??>A?B???BAB??@ RG:Z:ERR001689 OQ:Z:>:<<8<<<><<><><<>7<>>>?>>??>???????
ERR001689.1165834 185 1 9997 25 35M = 9997 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT 758A:?>>8?=@@>>?;4<>=??@@==??@?==?8 XT:A:U XN:i:4 SM:i:25 AM:i:0 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001689 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:;74>7><><><>>>>><:<>>>>>>>>>>>>>>>>
ERR001688.2681347 117 1 9998 0 * = 9998 0 CGATCTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG 5@BA@A6B???A?B??>B@B??>B@B??>BAB??? RG:Z:ERR001688 OQ:Z:=>>>><4><<?><??????????????????????
The GATK requires that the BAM file be sorted in the same order as the reference. Unfortunately, many BAM files have headers that are sorted in some other order -- lexicographical order is a common alternative. To resort the BAM file please use ReorderSam.
The GATK accept interval files for processing subsets of the genome in Picard-style interval lists. These files have a .interval_list extension and look like this:
@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens
@SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens
@SQ SN:3 LN:198022430 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:fdfd811849cc2fadebc929bb925902e5 SP:Homo Sapiens
@SQ SN:4 LN:191154276 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:23dccd106897542ad87d2765d28a19a1 SP:Homo Sapiens
@SQ SN:5 LN:180915260 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:0740173db9ffd264d728f32784845cd7 SP:Homo Sapiens
@SQ SN:6 LN:171115067 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1d3a93a248d92a729ee764823acbbc6b SP:Homo Sapiens
@SQ SN:7 LN:159138663 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:618366e953d6aaad97dbe4777c29375e SP:Homo Sapiens
@SQ SN:8 LN:146364022 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:96f514a9929e410c6651697bded59aec SP:Homo Sapiens
@SQ SN:9 LN:141213431 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:3e273117f15e0a400f01055d9f393768 SP:Homo Sapiens
@SQ SN:10 LN:135534747 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:988c28e000e84c26d552359af1ea2e1d SP:Homo Sapiens
@SQ SN:11 LN:135006516 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:98c59049a2df285c76ffb1c6db8f8b96 SP:Homo Sapiens
@SQ SN:12 LN:133851895 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:51851ac0e1a115847ad36449b0015864 SP:Homo Sapiens
@SQ SN:13 LN:115169878 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:283f8d7892baa81b510a015719ca7b0b SP:Homo Sapiens
@SQ SN:14 LN:107349540 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:98f3cae32b2a2e9524bc19813927542e SP:Homo Sapiens
@SQ SN:15 LN:102531392 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:e5645a794a8238215b2cd77acb95a078 SP:Homo Sapiens
@SQ SN:16 LN:90354753 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:fc9b1a7b42b97a864f56b348b06095e6 SP:Homo Sapiens
@SQ SN:17 LN:81195210 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:351f64d4f4f9ddd45b35336ad97aa6de SP:Homo Sapiens
@SQ SN:18 LN:78077248 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c SP:Homo Sapiens
@SQ SN:19 LN:59128983 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1aacd71f30db8e561810913e0b72636d SP:Homo Sapiens
@SQ SN:20 LN:63025520 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:0dec9660ec1efaaf33281c0d5ea2560f SP:Homo Sapiens
@SQ SN:21 LN:48129895 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:2979a6085bfe28e3ad6f552f361ed74d SP:Homo Sapiens
@SQ SN:22 LN:51304566 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a718acaa6135fdca8357d5bfe94211dd SP:Homo Sapiens
@SQ SN:X LN:155270560 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:7e0e2e580297b7764e31dbc80c2540dd SP:Homo Sapiens
@SQ SN:Y LN:59373566 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1fa3474750af0948bdf97d5a0ee52e51 SP:Homo Sapiens
@SQ SN:MT LN:16569 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:c68f52674c9fb33aef52dcf399755519 SP:Homo Sapiens
1 30366 30503 + target_1
1 69089 70010 + target_2
1 367657 368599 + target_3
1 621094 622036 + target_4
1 861320 861395 + target_5
1 865533 865718 + target_6
...
consisting of a SAM-file-like sequence dictionary (the header), and targets in the form of .dict extension) and your intervals into one file.
You can also specify a list of intervals in a .interval_list file formatted as
Finally, we also accept BED style interval lists. Warning: this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1.
The GATK can associate arbitrary reference ordered data (ROD) files with named tracks for all tools. Some tools require specific ROD data files for processing, and developers are free to write tools that access arbitrary data sets using the ROD interface. The general ROD system has the following syntax:
-argumentName:name,type file
Where name is the name in the GATK tool (like "eval" in VariantEval), type is the type of the file, such as VCF or dbSNP, and file is the path to the file containing the ROD data.
The GATK supports several common file formats for reading ROD data:
Note that we no longer support the PED format. See here for converting .ped files to VCF.
Since September, 2010, the GATK has had a "phone-home" feature that sends us information about each GATK run via the Broad filesystem (within the Broad) and Amazon's S3 cloud storage service (outside the Broad). This feature is enabled by default.
The information provided by the phone-home feature is critical in driving improvements to the GATK
Below are two example GATK Run Reports showing exactly what information is sent to us each time the GATK phones home.
<GATK-run-report>
<id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
<start-time>2012/03/10 20.21.19</start-time>
<end-time>2012/03/10 20.21.19</end-time>
<run-time>0</run-time>
<walker-name>CountReads</walker-name>
<svn-version>1.4-483-g63ecdb2</svn-version>
<total-memory>85000192</total-memory>
<max-memory>129957888</max-memory>
<user-name>depristo</user-name>
<host-name>10.0.1.10</host-name>
<java>Apple Inc.-1.6.0_26</java>
<machine>Mac OS X-x86_64</machine>
<iterations>105</iterations>
</GATK-run-report>
<GATK-run-report>
<id>yX3AnltsqIlXH9kAQqTWHQUd8CQ5bikz</id>
<exception>
<message>Failed to parse Genome Location string: 20:10,000,000-10,000,001x</message>
<stacktrace class="java.util.ArrayList">
<string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:377)</string>
<string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
<string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
<string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
<string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
</stacktrace>
<cause>
<message>Position: '10,000,001x' contains invalid chars.</message>
<stacktrace class="java.util.ArrayList">
<string>org.broadinstitute.sting.utils.GenomeLocParser.parsePosition(GenomeLocParser.java:411)</string>
<string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:374)</string>
<string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
<string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
<string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
<string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
</stacktrace>
<is-user-exception>false</is-user-exception>
</cause>
<is-user-exception>true</is-user-exception>
</exception>
<start-time>2012/03/10 20.19.52</start-time>
<end-time>2012/03/10 20.19.52</end-time>
<run-time>0</run-time>
<walker-name>CountReads</walker-name>
<svn-version>1.4-483-g63ecdb2</svn-version>
<total-memory>85000192</total-memory>
<max-memory>129957888</max-memory>
<user-name>depristo</user-name>
<host-name>10.0.1.10</host-name>
<java>Apple Inc.-1.6.0_26</java>
<machine>Mac OS X-x86_64</machine>
<iterations>0</iterations>
</GATK-run-report>
Note that as of GATK 1.5 we no longer collect information about the command-line executed, the working directory, or tmp directory.
The GATK is currently in the process of evolving to require interaction with Amazon S3 as a normal part of each run. For this reason, and because the information contained in the GATK run reports is so critical in driving improvements to the GATK, we strongly discourage our users from disabling the phone-home feature.
At the same time, we recognize that some of our users do have legitimate reasons for needing to run the GATK with phone-home disabled, and we don't wish to make it impossible for these users to run the GATK.
Technical reasons: Your local network might have restrictions in place that don't allow the GATK to access external resources, or you might need to run the GATK in a network-less environment.
Organizational reasons: Your organization's policies might forbid the dissemination of one or more pieces of information contained in the GATK run report.
For such users we have provided an -et NO_ET option in the GATK to disable the phone-home feature. To use this option in GATK 1.5 and later, you need to contact us to request a key. Instructions for doing so are below.
To obtain a GATK key, please fill out the request form.
Running the GATK with a key is simple: you just need to append a -K your.key argument to your customary command line, where your.key is the path to the key file you obtained from us:
java -jar dist/GenomeAnalysisTK.jar \
-T PrintReads \
-I public/testdata/exampleBAM.bam \
-R public/testdata/exampleFASTA.fasta \
-et NO_ET \
-K your.key
The -K argument is only necessary when running the GATK with the NO_ET option.
If you get an error message from the GATK saying that your key is corrupt, unreadable, or has been revoked, please email '''gsahelp@broadinstitute.org''' to ask for a replacement key.
If you get an error message stating that the GATK public key could not be located or read, then something is likely wrong with your build of the GATK. If you're running the binary release, try downloading it again. If you're compiling from source, try doing an ant clean and re-compiling. If all else fails, please ask for help on our community forum.
We use the phone home data for three main purposes. First, we monitor the input logs for errors that occur in the GATK, and proactively fix them in the codebase. Second, we monitor the usage rates of the GATK in general and specific versions of the GATK to explain how widely used the GATK is to funding agencies and other potential supporters. Finally, we monitor adoption rates of specific GATK tools to understand how quickly new tools reach our users. Many of these analyses require us to aggregate the data by unique user, which is why we still collect the username of the individual who ran the GATK (as you can see in the plots). Examples of all three uses are shown in the Tableau graphs below, which update each night and are sent to the GATK members each morning for review.
You probably know by now that GATK-Lite is a free-for-everyone and completely open-source version of the GATK (licensed under the original MIT license).
But what's in the box? What can GATK-Lite do -- or rather, what can it not do that the full version (let's call it GATK-Full) can? And what does that mean exactly, in terms of functionality, reliability and power?
To really understand the differences between GATK-Lite and GATK-Full, you need some more information on how the GATK works, and how we work to develop and improve it.
As explained here, the engine handles all the common work that's related to data access, conversion and traversal, as well as high-performance computing features. The engine is supported by an infrastructure of software libraries. If the GATK was a car, that would be the engine and chassis. What we call the **tools* are attached on top of that, and they provide the various analytical and processing functionalities like variant calling and base or variant recalibration. On your car, that would be headlights, airbags and so on.

We do all our development work on a single codebase. This means that everything --the engine and all tools-- is on one common workbench. There are not different versions that we work on in parallel -- that would be crazy to manage! That's why the version numbers of GATK-Lite and GATK-Full always match: if the latest GATK-Full version is numbered 2.1-13, then the latest GATK-Lite is also numbered 2.1-13.
The most important consequence of this setup is that when we make improvements to the infrastructure and engine, the same improvements will end up in GATK Lite and in GATK Full. So for the purposes of power, speed and robustness of the GATK that is determined by the engine, there is no difference between them.
For the tools, it's a little more complicated -- but not much. When we "build" the GATK binaries (the .jar files), we put everything from the workbench into the Full build, but we only put a subset into the Lite build. Note that this Lite subset is pretty big -- it contains all the tools that were previously available in GATK 1.x versions, and always will. We also reserve the right to add previews or not-fully-featured versions of the new tools that are in Full, at our discretion, to the Lite build.
We have a new tool that performs a brand new function (which wasn't available in GATK 1.x), and we only include it in the Full build.
We have a tool that has some new add-on capabilities (which weren't possible in GATK 1.x); we put the tool in both the Lite and the Full build, but the add-ons are only available in the Full build.

Reprising the car analogy, GATK-Lite and GATK-Full are like two versions of the same car -- the basic version and the fully-equipped one. They both have the exact same engine, and most of the equipment (tools) is the same -- for example, they both have the same airbag system, and they both have headlights. But there are a few important differences:
The GATK-Full car comes with a GPS (sat-nav for our UK friends), for which the Lite car has no equivalent. You could buy a portable GPS unit from a third-party store for your Lite car, but it might not be as good, and certainly not as convenient, as the Full car's built-in one.
Both cars have windows of course, but the Full car has power windows, while the Lite car doesn't. The Lite windows can open and close, but you have to operate them by hand, which is much slower.
The underlying engine is exactly the same in both GATK-Lite and GATK-Full. Most functionalities are available in both builds, performed by the same tools. Some functionalities are available in both builds, but they are performed by different tools, and the tool in the Full build is better. New, cutting-edge functionalities are only available in the Full build, and there is no equivalent in the Lite build.
We hope this clears up some of the confusion surrounding GATK-Lite. If not, please leave a comment and we'll do our best to clarify further!
One of the key challenges of working with next-gen sequence data is that input files are usually very large. We can’t just make the program open the files, load all the data into memory and perform whatever analysis is needed on all of it in one go. It’s just too much work, even for supercomputers.
Instead, we make the program cut the job into smaller tasks that the computer can easily process separately. Then we have it combine the results of each step into the final result.
Map/Reduce is the technique we use to achieve this. It consists of three steps formally called filter, map and reduce. Let’s apply it to an example case where we want to find out what is the average depth of coverage in our dataset for a certain region of the genome.
filter determines what subset of the data needs to be processed in each task. In our example, the program lists all the reference positions in our region of interest.
map applies the function, i.e. performs the analysis on each subset of data. In our example, for each position in the list, the program looks into the BAM file, pulls out the pileup of bases and outputs the depth of coverage at that position.
reduce combines the elements in the list of results output by the map function. In our example, the program takes the coverage numbers that were calculated separately for all the reference positions and calculates their average, which is the final result we want.
This may seem trivial for such a simple example, but it is a very powerful method with many advantages. Among other things, it makes it relatively easy to parallelize operations, which makes the tools run much faster on large datasets.
All the tools in the GATK are built from the ground up to take advantage of this method. That’s why we call them walkers: because they “walk” across the genome, getting things done.
Note that even though it’s not included in the Map/Reduce technique’s name, the filter step is very important. It determines what data get presented to the tool for analysis, selecting only the appropriate data for each task and discarding anything that’s not relevant. This is a key part of the Map/Reduce technique, because that’s what makes each task “bite-sized” enough for the computer to handle easily.
Each tool has filters that are tailored specifically for the type of analysis it performs. The filters rely on traversal engines, which are little programs that are designed to “traverse” the data (i.e. walk through the data) in specific ways.
There are three major types of traversal: Locus Traversal, Read Traversal and Active Region Traversal. In our interval coverage example, the tool’s filter uses the Locus Traversal engine, which walks through the data by locus, i.e. by position along the reference genome. Because of that, the tool is classified as a Locus Walker. Similarly, the Read Traversal engine is used, you’ve guessed it, by Read Walkers.
The GATK engine comes packed with many other ways to walk through the genome and get the job done seamlessly, but those are the ones you’ll encounter most often.
A primer on parallelism with the GATK How can I use parallelism to make GATK tools run faster?
Inside of the Broad, the latest bundle will always be available in:
/humgen/gsa-hpprojects/GATK/bundle/current
with a subdirectory containing for each reference sequence and associated data files.
External users can download these files (or corresponding .gz versions) from the GSA FTP Server in the directory bundle. Gzipped files should be unzipped before attempting to use them. Note that there is no "current link" on the FTP; users should download the highest numbered directory under current (this is the most recent data set).
Additionally, these files all have supplementary indices, statistics, and other QC data available.
Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome-build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience that this might cause.
Also includes a chain file to lift over to b37.
Includes the 1000 Genomes pilot b36 formated reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome-build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience that this might cause.
Also includes a chain file to lift over to b37.
Includes the UCSC-style hg19 reference along with all lifted over VCF files.