Detailed information about command line options for BaseRecalibrator can be found here.
The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.
New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular these new probabilities fit very naturally as the gap penalties in an HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.
This process is accomplished by analyzing the covariation among several features of a base. For example:
These covariates are then subsequently applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.
For example, pre-calibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machine, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Also, mismatches are strongly associated with sequencing context, in that the dinucleotide AC is often much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but identify subsets of high-quality bases by separating the low-quality end of read bases AC bases from the high-quality TG bases at the start of the read. See below for examples of pre and post corrected values.
The system was designed for users to be able to easily add new covariates to the calculations. For users wishing to add their own covariate simply look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to have a getValue method defined which looks at the read and associated sequence context and pulls out the desired information such as machine cycle.
Detailed information about command line options for BaseRecalibrator can be found here.
This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:
For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
To create a recalibrated BAM you can use GATK's PrintReads with the engine on-the-fly recalibration capability. Here is a typical command line to do so:
java -jar GenomeAnalysisTK.jar \ -T PrintReads \ -R reference.fasta \ -I input.bam \ -BQSR recalibration_report.grp \ -o output.bam
After computing covariates in the initial BAM File, we then walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.
This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibration to recalibrate the quality scores in input.bam, and writing out a new BAM file output.bam with recalibrated QUAL field values.
Effectively the new quality score is:
Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In additional, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
The recalibration report is a [GATKReport](http://gatk.vanillaforums.com/discussion/1244/what-is-a-gatkreport) and not only contains the main result of the analysis, but it is also used as an input to all subsequent analyses on the data. The recalibration report contains the following 5 tables:
This is the table that contains all the arguments used to run BQSRv2 for this dataset. This is important for the on-the-fly recalibration step to use the same parameters used in the recalibration step (context sizes, covariates, ...).
Example Arguments table:
#:GATKTable:true:1:17::; #:GATKTable:Arguments:Recalibration argument collection values used in this run Argument Value covariate null default_platform null deletions_context_size 6 force_platform null insertions_context_size 6 ...
The GATK offers native support to quantize base qualities. The GATK quantization procedure uses a statistical approach to determine the best binning system that minimizes the error introduced by amalgamating the different qualities present in the specific dataset. When running BQSRv2, a table with the base counts for each base quality is generated and a 'default' quantization table is generated. This table is a required parameter for any other tool in the GATK if you want to quantize your quality scores.
The default behavior (currently) is to use no quantization when performing on-the-fly recalibration. You can override this by using the engine argument -qq. With -qq 0 you don't quantize qualities, or -qq N you recalculate the quantization bins using N bins on the fly. Note that quantization is completely experimental now and we do not recommend using it unless you are a super advanced user.
Example Arguments table:
#:GATKTable:true:2:94:::; #:GATKTable:Quantized:Quality quantization map QualityScore Count QuantizedScore 0 252 0 1 15972 1 2 553525 2 3 2190142 9 4 5369681 9 9 83645762 9 ...
This table contains the empirical quality scores for each read group, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.
#:GATKTable:false:6:18:%s:%s:%.4f:%.4f:%d:%d:; #:GATKTable:RecalTable0: ReadGroup EventType EmpiricalQuality EstimatedQReported Observations Errors SRR032768 D 40.7476 45.0000 2642683174 222475 SRR032766 D 40.9072 45.0000 2630282426 213441 SRR032764 D 40.5931 45.0000 2919572148 254687 SRR032769 D 40.7448 45.0000 2850110574 240094 SRR032767 D 40.6820 45.0000 2820040026 241020 SRR032765 D 40.9034 45.0000 2441035052 198258 SRR032766 M 23.2573 23.7733 2630282426 12424434 SRR032768 M 23.0281 23.5366 2642683174 13159514 SRR032769 M 23.2608 23.6920 2850110574 13451898 SRR032764 M 23.2302 23.6039 2919572148 13877177 SRR032765 M 23.0271 23.5527 2441035052 12158144 SRR032767 M 23.1195 23.5852 2820040026 13750197 SRR032766 I 41.7198 45.0000 2630282426 177017 SRR032768 I 41.5682 45.0000 2642683174 184172 SRR032769 I 41.5828 45.0000 2850110574 197959 SRR032764 I 41.2958 45.0000 2919572148 216637 SRR032765 I 41.5546 45.0000 2441035052 170651 SRR032767 I 41.5192 45.0000 2820040026 198762
This table contains the empirical quality scores for each read group and original quality score, for mismatches insertions and deletions. This is not different from the table used in the old table recalibration walker.
#:GATKTable:false:6:274:%s:%s:%s:%.4f:%d:%d:; #:GATKTable:RecalTable1: ReadGroup QualityScore EventType EmpiricalQuality Observations Errors SRR032767 49 M 33.7794 9549 3 SRR032769 49 M 36.9975 5008 0 SRR032764 49 M 39.2490 8411 0 SRR032766 18 M 17.7397 16330200 274803 SRR032768 18 M 17.7922 17707920 294405 SRR032764 45 I 41.2958 2919572148 216637 SRR032765 6 M 6.0600 3401801 842765 SRR032769 45 I 41.5828 2850110574 197959 SRR032764 6 M 6.0751 4220451 1041946 SRR032767 45 I 41.5192 2820040026 198762 SRR032769 6 M 6.3481 5045533 1169748 SRR032768 16 M 15.7681 12427549 329283 SRR032766 16 M 15.8173 11799056 309110 SRR032764 16 M 15.9033 13017244 334343 SRR032769 16 M 15.8042 13817386 363078 ...
This table has the empirical qualities for each covariate used in the dataset. The default covariates are cycle and context. In the current implementation, context is of a fixed size (default 6). Each context and each cycle will have an entry on this table stratified by read group and original quality score.
#:GATKTable:false:8:1003738:%s:%s:%s:%s:%s:%.4f:%d:%d:; #:GATKTable:RecalTable2: ReadGroup QualityScore CovariateValue CovariateName EventType EmpiricalQuality Observations Errors SRR032767 16 TACGGA Context M 14.2139 817 30 SRR032766 16 AACGGA Context M 14.9938 1420 44 SRR032765 16 TACGGA Context M 15.5145 711 19 SRR032768 16 AACGGA Context M 15.0133 1585 49 SRR032764 16 TACGGA Context M 14.5393 710 24 SRR032766 16 GACGGA Context M 17.9746 1379 21 SRR032768 45 CACCTC Context I 40.7907 575849 47 SRR032764 45 TACCTC Context I 43.8286 507088 20 SRR032769 45 TACGGC Context D 38.7536 37525 4 SRR032768 45 GACCTC Context I 46.0724 445275 10 SRR032766 45 CACCTC Context I 41.0696 575664 44 SRR032769 45 TACCTC Context I 43.4821 490491 21 SRR032766 45 CACGGC Context D 45.1471 65424 1 SRR032768 45 GACGGC Context D 45.3980 34657 0 SRR032767 45 TACGGC Context D 42.7663 37814 1 SRR032767 16 AACGGA Context M 15.9371 1647 41 SRR032764 16 GACGGA Context M 18.2642 1273 18 SRR032769 16 CACGGA Context M 13.0801 1442 70 SRR032765 16 GACGGA Context M 15.9934 1271 31 ...
The memory requirements of the recalibrator will vary based on the type of JVM running the application and the number of read groups in the input bam file.
If the application reports 'java.lang.OutOfMemoryError: Java heap space', increase the max heap size provided to the JVM by adding ' -Xmx????m' to the jvm_args variable in RecalQual.py, where '????' is the maximum available memory on the processing computer.
I've tried recalibrating my data using a downloaded file, such as NA12878 on 454, and apply the table to any of the chromosome BAM files always fails due to hitting my memory limit. I've tried giving it as much as 15GB but that still isn't enough.
All of our big merged files for 454 are running with -Xmx16000m arguments to the JVM -- it's enough to process all of the files. 32GB might make the 454 runs a lot faster though.
I have a recalibration file calculated over the entire genome (such as for the 1000 genomes trio) but I split my file into pieces (such as by chromosome). Can the recalibration tables safely be applied to the per chromosome BAM files?
Yes they can. The original tables needed to be calculated over the whole genome but they can be applied to each piece of the data set independently.
I'm working on a genome that doesn't really have a good SNP database yet. I'm wondering if it still makes sense to run base quality score recalibration without known SNPs.
The base quality score recalibrator treats every reference mismatch as indicative of machine error. True polymorphisms are legitimate mismatches to the reference and shouldn't be counted against the quality of a base. We use a database of known polymorphisms to skip over most polymorphic sites. Unfortunately without this information the data becomes almost completely unusable since the quality of the bases will be inferred to be much much lower than it actually is as a result of the reference-mismatching SNP sites.
However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works:
For users concerned about run time please note this small analysis below showing the approximate number of reads per read group that are required to achieve a given level of recalibration performance. The analysis was performed with 51 base pair Illumina reads on pilot data from the 1000 Genomes Project. Downsampling can be achieved by specifying a genome interval using the -L option. For users concerned only with recalibration accuracy please disregard this plot and continue to use all available data when generating the recalibration table.
Our current best practice for making SNP and indel calls is divided into four sequential steps: initial mapping, refinement of the initial reads, multi-sample indel and SNP calling, and finally variant quality score recalibration. These steps are the same for targeted resequencing, whole exomes, deep whole genomes, and low-pass whole genomes.

Example commands for each tool are available on the individual tool's wiki entry. There is also a list of which resource files to use with which tool.
Note that due to the specific attributes of a project the specific values used in each of the commands may need to be selected/modified by the analyst. Care should be taken by the analyst running our tools to understand what each parameter does and to evaluate which value best fits the data and project design.
There are four major organizational units for next-generation DNA sequencing processes that used throughout this documentation:
This document describes how to call variation within a single analysis cohort, comprised for one or many samples, each of one or many libraries that were sequenced on at least one lane of an NGS machine.
Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost by running these commands across all of your data simultaneously.
In order to help individuals get up to speed, evaluate their command lines, and generally become familiar with the GATK tools we recommend you download the raw and realigned, recalibrated NA12878 test data from the GATK resource bundle. It should be possible to apply all of the approaches outlined below to get excellent results for realignment, recalibration, SNP calling, indel calling, filtering, and variant quality score recalibration using this data.
In our GATK 2.0 slide archive.
The GATK data processing pipeline assumes that one of the many NGS read aligners (see this review) has been applied to your raw FASTQ files. For Illumina data we recommend BWA because it is accurate, fast, well-supported, open-source, and emits SAM files natively (which are then easy to convert to BAM).
The three key processes used here are:
There are several options here from the easy and fast basic protocol to the more comprehensive but computationally expensive pipeline. For example, there are two types of realignment which constitute a vastly different amount of processing power required:
This protocol uses lane-level local realignment around known indels (very fast, as there's no sample level processing) to clean up lane-level alignments. This results in better quality scores, as they are less biased for indel alignment artefacts.
for each lane.bam
dedup.bam <- MarkDuplicate(lane.bam)
realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
recal.bam <- recal(realigned.bam)
Here we are essentially just merging the recalibrated lane.bams for a sample, dedupping the reads, and calling it quite. It doesn't perform indel realignment across lanes, so it leaves in some indels artifacts. For humans, which now have an extensive list of indels (get them from the GATK bundle!) the lane-level realignment around known indels is going to make up for the lack of cross-lane realignment. This protocol is appropriate if you are going to use callers like the HaplotypeCaller, UnifiedGenotyper with BAQ, or samtools with BAQ that are less sensitive to the initial alignment of reads, or if your project has limited coverage per sample (< 8x) where per-sample indel realignment isn't more empowered than per-lane realignment. For other situations or for organisms with limited database of segregating indels, it's better to use the advanced protocol if you have deep enough data per sample.
for each sample
recals.bam <- merged lane-level recal.bams for sample
dedup.bam <- MarkDuplicates(recals.bam)
sample.bam <- dedup.bam
As with the basic protocol, this protocol assumes the per-lane processing has been already completed. This protocol is essentially the basic protocol but with per-sample indel realignment.
for each sample
recals.bam <- merged lane-level recal.bams for sample
dedup.bam <- MarkDuplicates(recals.bam)
realigned.bam <- realign(dedup.bam) [with known sites included if available]
sample.bam <- realigned.bam
This is the protocol we use at the Broad in our fully automated pipeline because it gives an optimal balance of performance, accuracy and convenience.
Rather than doing the lane level cleaning and recalibration, this process aggregates all of the reads for each sample and then does a full dedupping, realign, and recalibration, yielding the best single-sample results. The big change here is sample-level cleaning followed by recalibration, giving you the most accurate quality scores possible for a single sample.
for each sample
lanes.bam <- merged lane.bams for sample
dedup.bam <- MarkDuplicates(lanes.bam)
realigned.bam <- realign(dedup.bam) [with known sites included if available]
recal.bam <- recal(realigned.bam)
sample.bam <- recal.bam
This protocol can be hard to implement in practice unless you can afford to wait until all of the data is available to do data processing for your samples.
ReduceReads is a novel (perhaps even breakthrough?) GATK 2.0 data compression algorithm. The purpose of ReducedReads is to take a BAM file with NGS data and reduce it down to just the information necessary to make accurate SNP and indel calls, as well as genotype reference sites (hard to achieve) using GATK tools like UnifiedGenotyper or HaplotypeCaller. ReduceReads accepts as an input a BAM file and produces a valid BAM file (it works in IGV!) but with a few extra tags that the GATK can use to make accurate calls.
You can find more information about reduced reads in some of our presentations in the archive.
ReduceReads works well for exomes or high-coverage (at least 20x average coverage) whole genome BAM files. In this case we highly recommend using ReduceReads to minimize the file sizes. Note that ReduceReads performs a lossy compression of the sequencing data that works well with the downstream GATK tools, but may not be supported by external tools. Also, we recommend that you archive your original BAM file, or at least a copy of your original FASTQs, as ReduceReads is highly lossy and doesn't quality as an archive data compression format.
Using ReduceReads on your BAM files will cut down the sizes to approximately 1/100 of their original sizes, allowing the GATK to process tens of thousands of samples simultaneously without excessive IO and processing burdens. Even for single samples ReduceReads cuts the memory requirements, IO burden, and CPU costs of downstream tools significantly (10x or more) and so we recommend you preprocess analysis-ready BAM files with ReducedReads.
for each sample
sample.reduced.bam <- ReduceReads(sample.bam)
After the raw data processing step, the GATK variant detection process assumes that you have aligned, duplicate marked, and recalibrated BAM files for all of the samples in your cohort. Because the GATK can dynamically merge BAM files, it isn't critical to have merged files by lane into sample bams, or even samples bams into cohort bams. In general we try to create sample level bams for deep data sets (deep WG or exomes) and merged cohort files by chromosome for WG low-pass.
For this part of the this document, I'm going to assume that you have a single realigned, recalibrated, dedupped BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort. Note that some of the data processing steps, such as multiple sample local realignment, will merge BAMS for many samples into a single BAM. If you've gone down this route, you just need to modify the GATK commands as necessary to take not multiple BAMs, one for each sample, but a single BAM for all samples.
The next step in the standard GATK data processing pipeline, whole genome or targeted, deep or shallow, is to apply the Haplotype Caller or Unified Genotyper to identify sites among the cohort samples that are statistically non-reference. This will produce a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort. It's in this stage that we use the meta-data in the BAM files extensively -- read groups for reads, with samples, platforms, etc -- to enable us to do the multi-sample merging and genotyping correctly. It was a pain for data processing, yes, but now life is easy for downstream calling and analysis.
A common question is the confidence score threshold to use for variant detection. We recommend:
raw.vcf <- HaplotypeCaller(sample1.bam, sample2.bam, ..., sampleN.bam)
raw.vcf <- UnifiedGenotyper(sample1.bam, sample2.bam, ..., sampleN.bam)
-nt option. However you can use Queue to parallelize execution of HaplotypeCaller.This raw VCF file should be as sensitive to variation as you'll get without imputation. At this stage, you can assess things like sensitivity to known variant sites or genotype chip concordance. The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference. All of the subsequent steps are designed to separate out the false positive machine artifacts from the true positive genetic variants.
The process used here is the Variant quality score recalibrator which builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant in the callset is a true genetic variant or a machine/alignment artifact. All filtering criteria are learned from the data itself.
Take a look at our FAQ page for specific recommendations on which training sets and command-line arguments to use with various project designs. It is expected that analysis will be required by the user when determining the best way to use this tool for different projects. These recommendations are only meant as jumping off points!
We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:
snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples. Also, for experiments that employ targeted resequencing of a small region (for example, a few hundred genes), VQSR may not be empowered regardless of the number of samples in the experiment. For users with experiments containing fewer exome samples or with a small target region there are several options to explore (listed in priority order of what we think will give the best results):
--maxGaussians 4 --percentBad 0.05 to your command line. Note that this is very dependent on your dataset, and you may need to try some very different settings. It may even not work at all. Unfortunately we cannot give you any specific advice, so please do not post questions on the forum asking for help finding the right parameters. These recommended arguments for VariantFiltration are only to used when ALL other options are not available:
You will need to compose filter expressions (see here, here and here for details) to filter on the following annotations and values:
For SNPs:
QD < 2.0MQ < 40.0FS > 60.0HaplotypeScore > 13.0 MQRankSum < -12.5ReadPosRankSum < -8.0For indels:
QD < 2.0ReadPosRankSum < -20.0InbreedingCoeff < -0.8FS > 200.0Note that the InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
For shallow-coverage (<10x): you cannot use filtering to reliably separate true positives from false positives. You must use the protocol involving variant quality score recalibration.
The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
That said, all of the caveats about determining the right parameters, etc, are annoying and are largely eliminated by variant quality score recalibration.
New in GATK 2.0 is the capability of UnifiedGenotyper to natively call non-diploid organisms. Three use cases are currently supported:
In order to enable this feature, users need to set the -ploidy argument to desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).
Note that all other UnifiedGenotyper arguments work in the same way.
A full minimal command line would look for example like
java -jar GenomeAnalysisTK.jar \
-R reference.fasta \
-I myReads.bam \
-T UnifiedGenotyper \
-ploidy 3
The glm argument works in the same way as in the diploid case - set to [INDEL|SNP|BOTH] to specify which variants to discover and/or genotype.
Many of these limitations will be gradually removed in the following weeks as we iron out details and fix issues in the GATK 2.0 beta.
Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.
Some annotations do not work in non-diploid cases. In particular, current InbreedingCoeff is omitted. Annotations which do work and are supported in non-diploid use cases are the following: QUAL, QD, SB, FS, AC, AF and Genotype annotations such as PL, AD, GT, etc.
The interaction between non-diploid calling and other experimental tools like HaplotypeCaller or ReduceReads is currently not supported.
Whereas it's entirely possible to use VQSR to filter non-diploid calls, we currently have no experience with this and can hence offer no support nor best practices for this.
Only a maximum of 4 alleles can be genotyped. This is not relevant for the SNP case, but discovering or genotyping more than this number of indel alleles will not work and an arbitrary set of 4 alleles will be chosen at a site.
Users should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). No other file formats are supported.
The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.
All BAM files must satisfy the following requirements:
See the BAM specification for more information.
It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:
Human genome reference consortium standard ordering and names (b3x): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...
UCSC convention (hg1x): chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...
The easiest way to do it is to download Samtools and run the following command to examine the header of your file:
$ samtools view -H /path/to/my.bam
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:1 LN:247249719
@SQ SN:2 LN:242951149
@SQ SN:3 LN:199501827
@SQ SN:4 LN:191273063
@SQ SN:5 LN:180857866
@SQ SN:6 LN:170899992
@SQ SN:7 LN:158821424
@SQ SN:8 LN:146274826
@SQ SN:9 LN:140273252
@SQ SN:10 LN:135374737
@SQ SN:11 LN:134452384
@SQ SN:12 LN:132349534
@SQ SN:13 LN:114142980
@SQ SN:14 LN:106368585
@SQ SN:15 LN:100338915
@SQ SN:16 LN:88827254
@SQ SN:17 LN:78774742
@SQ SN:18 LN:76117153
@SQ SN:19 LN:63811651
@SQ SN:20 LN:62435964
@SQ SN:21 LN:46944323
@SQ SN:22 LN:49691432
@SQ SN:X LN:154913754
@SQ SN:Y LN:57772954
@SQ SN:MT LN:16571
@SQ SN:NT_113887 LN:3994
...
If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.
Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.
A quick Unix command using Samtools will do the trick:
$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0 PL:solid PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414 CN:bcm
@RG ID:1 PL:solid PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414 CN:bcm
@RG ID:2 PL:LS454 PU:R_2008_10_02_06_06_12_FLX01080312_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC
@RG ID:3 PL:LS454 PU:R_2008_10_02_06_07_08_rig19_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC
@RG ID:4 PL:LS454 PU:R_2008_10_02_17_50_32_FLX03080339_retry LB:HL#01_NA11881 PI:0 SM:NA11881 CN:454MSC
...
The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.
In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,
$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781 35 1 1 0 51M = 9 59 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601 35 1 1 0 51M = 12 62 TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3 MF:i:18 Aq:i:0 NM:i:1 UQ:i:5 H0:i:0 H1:i:85
EAS139_44:8:59:118:13881 35 1 1 0 51M = 2 52 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391 35 1 1 0 51M = 12 62 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0 MF:i:18 Aq:i:0 NM:i:0 UQ:i:0 H0:i:85 H1:i:31
...
membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).
Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information.
For technical details, see the SAM specification on the Samtools website.
| Tag | Importance | SAM spec definition | Meaning |
|---|---|---|---|
ID |
Required | Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read groupIDs may be modified when merging SAM files in order to handle collisions. |
Ideally, this should be a globally unique identify across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the RG:Z field, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model. |
SM |
Sample. Use pool name where a pool is being sequenced. | Required. As important as ID. |
The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper. |
PL |
Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO. | Important. Not currently used in the GATK, but was in the past, and may return. The only way to known the sequencing technology used to generate the sequencing data . | It's a good idea to use this field. |
LB |
DNA preparation library identify | Essential for MarkDuplicates | MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes. |
We do not require value for the CN, DS, DT, PG, PI, or PU fields.
A concrete example may be instructive. Suppose I have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, I would create 12 bam files, with the following @RG fields in the header:
Dad's data:
@RG ID:FLOWCELL1.LANE1 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:ILLUMINA LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:ILLUMINA LB:LIB-DAD-2 SM:DAD PI:400
Mom's data:
@RG ID:FLOWCELL1.LANE5 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:ILLUMINA LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:ILLUMINA LB:LIB-MOM-2 SM:MOM PI:400
Kid's data:
@RG ID:FLOWCELL2.LANE1 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE2 PL:ILLUMINA LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE3 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
@RG ID:FLOWCELL2.LANE4 PL:ILLUMINA LB:LIB-KID-2 SM:KID PI:400
Note the hierarchical relationship between read groups (unique for each lane) to libraries (sequenced on two lanes) and samples (across four lanes, two lanes for each library).
Use Picard's AddOrReplaceReadGroups tool to add read group information.
Picard contains a tool called ValidateSamFile that can be used for this. BAMs passing STRICT validation stringency work best with the GATK.
You can use the GATK to do the following:
GATK -I full.bam -T PrintReads -L chr1:10-20 -o subset.bam
and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.
We support the Variant Call Format (VCF) for variant callsets. No other file formats are supported.
VCFTools contains a validation tool that will allow you to verify it.
No, we like VCF and we think it's important to have a good standard format. Multiplying formats just makes life hard for everyone, both developers and analysts.
We support three types of interval lists, as mentioned here. Interval lists should preferentially be formatted as Picard-style interval lists, with an explicit sequence dictionary, as this prevents accidental misuse (e.g. hg18 intervals on an hg19 file). Note that this file is 1-based, not 0-based (first position in the genome is position 1).
One relatively easy way to combine your intervals is to use the online tool Galaxy, using the Get Data -> Upload command to upload your intervals, and the Operate on Genomic Intervals command to compute the intersection or union of your intervals (depending on your needs).
The GATK can be particular about the ordering of a BAM file. If you find yourself in the not uncommon situation of having created or received BAM files sorted in a bad order, you can use the tool ReorderSam to generate a new BAM file where the reads have been reordered to match a well-ordered reference file.
java -jar picard/ReorderSam.jar I= lexicographc.bam O= kayrotypic.bam REFERENCE= Homo_sapiens_assembly18.kayrotypic.fasta
This tool requires you have a correctly sorted version of the reference sequence you used to align your reads. This tool will drop reads that don't have equivalent contigs in the new reference (potentially bad, but maybe not). If contigs have the same name in the bam and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a lift over tool!
The tool, though once in the GATK, is now part of the Picard package.
The DPP script was only provided has an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have taken the difficult decision of removing the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how Qscripts work, and read the Best Practices documentation to understand what are the processing steps and what parameters you need to set/adjust.
The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK.
Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking and base score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis ready BAM files (one per sample sequenced).
This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.
Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible however, with some effort, to modify it for your needs.
| Argument (short-name) | Argument (long-name) | Description |
|---|---|---|
| -i <BAM file / BAM list> | --input <BAM file / BAM list> | input BAM file - or list of BAM files. |
| -R <fasta> | --reference <fasta> | Reference fasta file. |
| -D <vcf> | --dbsnp <dbsnp vcf> | dbsnp ROD to use (must be in VCF format). |
| Argument (short-name) | Argument (long-name) | Description |
|---|---|---|
| -indels <vcf> | --extra_indels <vcf> | VCF files to use as reference indels for Indel Realignment. |
| -bwa <path> | --path_to_bwa <path> | The path to the binary of bwa (usually BAM files have already been mapped - but if you want to remap this is the option) |
| -outputDir <path> | --output_directory <path> | Output path for the processed BAM files. |
| -L <GATK interval string> | --gatk_interval_string <GATK interval string> | the -L interval string to be used by GATK - output bams at interval only |
| -intervals <GATK interval file> | --gatk_interval_file <GATK interval file> | an intervals file to be used by GATK - output bams at intervals |
| Argument (short-name) | Argument (long-name) | Description |
|---|---|---|
| -p <name> | --project <name> | the project name determines the final output (BAM file) base name. Example NA12878 yields NA12878.processed.bam |
| -knowns | --knowns_only | Perform cleaning on knowns only. |
| -sw | --use_smith_waterman | Perform cleaning using Smith Waterman |
| -bwase | --use_bwa_single_ended | Decompose input BAM file and fully realign it using BWA and assume Single Ended reads |
| -bwape | --use_bwa_pair_ended | Decompose input BAM file and fully realign it using BWA and assume Pair Ended reads |
Data processing pipeline of the best practices for raw data processing, from sequencer data (fastq files) to analysis read reads (bam file):

Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:
This option is for datasets that have already been processed using a different pipeline or different criteria, and you want to reprocess it using this pipeline. One example is a BAM file that has been processed at the lane level, or did not perform some of the best practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before creating sample level bams and entering the pipeline.
This is the where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration.
This is a two step process. First we create targets using the Realigner Target Creator (either for knowns only, or including data indels), then we realign the targets using the Indel Realigner (see [Local realignment around indels]) with an optional smith waterman realignment. The Indel Realigner also fixes mate pair information for reads that get realigned.
This is a crucial step that re-adjusts the quality score using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate, Dinucleotide Covariate
The Data Processing Pipeline produces 3 types of output for each file: a fully processed bam file, a validation report on the input bam and output bam files, a analysis before and after base quality score recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output.
The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.
We validate each unprocessed sample level BAM file and each final processed sample level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't enforce that the input BAM has to pass all validation, but we provide the log of the validation as an informative companion to your input. The validation file is named : <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.
Notice that even if your BAM file fails validation, the pipeline can still go through successfully. The validation is a strict report on how your BAM file is looking. Some errors are not critical, but the output files (both pre.validation and post.validation) should give you some input on how to make your dataset better organized in the BAM format.
PDF plots of the base qualities are generated before and after recalibration for further analysis on the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named : <project name>.<sample name>.pre and <project name>.<sample name>.post.
Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa)
java \ -Xmx4g \ -Djava.io.tmpdir=/path/to/tmpdir \ -jar path/to/GATK/Queue.jar \ -S path/to/DataProcessingPipeline.scala \ -p myFancyProjectName \ -i myDataSet.list \ -R reference.fasta \ -D dbSNP.vcf \ -run
Performing realignment and the full data processing pipeline in one pair-ended bam file
java \ -Xmx4g \ -Djava.io.tmpdir=/path/to/tmpdir \ -jar path/to/Queue.jar \ -S path/to/DataProcessingPipeline.scala \ -bwa path/to/bwa \ -i test.bam \ -R reference.fasta \ -D dbSNP.vcf \ -p myProjectWithRealignment \ -bwape \ -run
Genotype and Validate is a tool to asses the quality of a technology dataset for calling SNPs and Indels given a secondary (validation) datasource.
The simplest scenario is when you have a VCF of hand annotated SNPs and Indels, and you want to know how well a particular technology performs calling these snps. With a dataset (BAM file) generated by the technology in test, and the hand annotated VCF, you can run GenotypeAndValidate to asses the accuracy of the calls with the new technology's dataset.
Another option is to validate the calls on a VCF file, using a deep coverage BAM file that you trust the calls on. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare to the calls in the VCF file and produce a truth table.
Usage of GenotypeAndValidate and its command line arguments are described here.
The annotations can be either true positive (T) or false positive (F). 'T' means it is known to be a true SNP/Indel, while a 'F' means it is known not to be a SNP/Indel but the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F.
GenotypeAndValidate has two outputs. The truth table and the optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this:
| ALT | REF | Predictive Value | |
|---|---|---|---|
| called alt | True Positive (TP) | False Positive (FP) | Positive PV |
| called ref | False Negative (FN) | True Negative (TN) | Negative PV |
The positive predictive value (PPV) is the proportion of subjects with positive test results who are correctly diagnose.
The negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.
The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).
You should always use -BTI alleles, so that the GATK only looks at the sites on the VCF file, speeds up the process a lot. (this will soon be added as a default gatk engine mode)
The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per new insertion or deletion. (i.e. ACTG/- will count as 4 genotyper calls, but it's only one line in the VCF).
Genotypes BAM file from new technology using the VCF as a truth dataset:
java \
-jar /GenomeAnalysisTK.jar \
-T GenotypeAndValidate \
-R human_g1k_v37.fasta \
-I myNewTechReads.bam \
-alleles handAnnotatedVCF.vcf \
-BTI alleles \
-o gav.vcf
An annotated VCF example (info field clipped for clarity)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1
1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./.
13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./.
1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./.
Using a BAM file as the truth dataset:
java \
-jar /GenomeAnalysisTK.jar \
-T GenotypeAndValidate \
-R human_g1k_v37.fasta \
-I myTruthDataset.bam \
-alleles callsToValidate.vcf \
-BTI alleles \
-bt \
-o gav.vcf
Example truth table of PacBio reads (BAM) to validate HiSeq annotated dataset (VCF) using the GenotypeAndValidate walker:

WARNING: unfortunately we do not have the resources to directly support the HLA typer at this time. As such this tool is no longer under active development or supported by our group. The source code is available in the GATK as is. This tool may or may not work without substantial experimentation by an analyst.
Inherited DNA sequence variation in the major histocompatibilty complex (MHC) on human chromosome 6 significantly influences the inherited risk for autoimmune diseases and the host response to pathogenic infections. Collecting allelic sequence information at the classical human leukocyte antigen (HLA) genes is critical for matching in organ transplantation and for genetic association studies, but is complicated due to the high degree of polymorphism across the MHC. Next-generation sequencing offers a cost-effective alternative to Sanger-based sequencing, which has been the standard for classical HLA typing. To bridge the gap between traditional typing and newer sequencing technologies, we developed a generic algorithm to call HLA alleles at 4-digit resolution from next-generation sequence data.
The HLA-specific walkers/tools (FindClosestHLA, CalculateBaseLikelihoods, and HLACaller) are available as a separate download from our FTP site and as source code only. Instructions for obtaining and compiling them are as follows:
Download the source code (in a tar ball):
location: ftp://gsapubftp-anonymous@ftp.broadinstitute.org password: <blank> subdirectory: HLA/
Untar the file, 'cd' into the untar'ed directory and compile with 'ant'
Remember that we no longer support this tool, so if you encounter issues with any of these steps please do NOT post them to our support forum.
Algorithmic components of the HLA caller:

The HLA caller algorithm, developed as part of the open-source GATK, examines sequence reads aligned to the classical HLA loci taking SAM/BAM formatted files as input and calculates, for each locus, the posterior probabilities for all pairs of classical alleles based on three key considerations: (1) genotype calls at each base position, (2) phase information of nearby variants, and (3) population-specific allele frequencies. See the diagram below for a visualization of the heuristic. The output of the algorithm is a list of HLA allele pairs with the highest posterior probabilities.
Functionally, the HLA caller was designed to run in three steps:
The "FindClosestAllele" walker detects misaligned reads by comparing each read to the dictionary of HLA alleles (reads with < 75% SNP homology to the closest matching allele are removed)
The "CalculateBaseLikelihoods" walker calculates the likelihoods for each genotype at each position within the HLA loci and finds the polymorphic sites in relation to the reference
The "HLAcaller" walker reads the output of the previous steps, and makes the likelihood / probability calculations based on base genotypes, phase information, and allele frequencies.
These inputs are required:
You can download the latter four here.
The GATK contains a wealth of tools for analysis of sequencing data. Required inputs include an aligned bam file and reference fasta file. The following example shows how to calculate depth of coverage.
Usage:
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -I input.bam -R ref.fasta -L input.intervals > output.doc
Arguments:
-T (required) name of walker/function-I (required) Input (.bam) file. -R (required) Genomic reference (.fasta) file.-L (optional) Interval or list of genomic intervals to run the genotyper on.The FindClosestHLA walker traverses each read and compares it to all overlapping HLA alleles (at specific polymorphic sites), and identifies the closest matching alleles. This is useful for detecting misalignments (low concordance with best-matching alleles), and helps narrow the list of candidate alleles (narrowing the search space reduces computational speed) for subsequent analysis by the HLACaller walker. Inputs include the HLA dictionary, a list of polymorphic sites in the HLA, and the exons of interest. Output is a file (output.filter) that includes the closest matching alleles and statistics for each read.
Usage:
java -jar GenomeAnalysisTK.jar -T FindClosestHLA -I input.bam -R ref.fasta -L HLA_EXONS.intervals -HLAdictionary HLA_DICTIONARY.txt \
-PolymorphicSites HLA_POLYMORPHIC_SITES.txt -useInterval HLA_EXONS.intervals | grep -v INFO > output.filter
Arguments:
-HLAdictionary (required) HLA_DICTIONARY.txt file-PolymorphicSites (required) HLA_POLYMORPHIC_SITES.txt file-useInterval (required) HLA_EXONS.intervals fileCalculateBestLikelihoods walker traverses each base position to determine the likelihood for each of the 10 diploid genotypes. These calculations are used later by HLACaller to determine likelihoods for HLA allele pairs based on genotypes, as well as determining the polymorphic sites used in the phasing algorithm. Inputs include aligned bam input, (optional) results from FindClosestHLA (to remove misalignments), and cutoff values for inclusion or exclusion of specific reads. Output is a file (output.baselikelihoods) that contains base likelihoods at each position.
Usage:
java -jar GenomeAnalysisTK.jar -T CalculateBaseLikelihoods -I input.bam -R ref.fasta -L HLA_EXONS.intervals -filter output.filter \
-maxAllowedMismatches 6 -minRequiredMatches 0 | grep -v "INFO" | grep -v "MISALIGNED" > output.baselikelihoods
Arguments:
-filter (optional) file = output of FindClosestHLA walker (output.filter - to exclude misaligned reads in genotype calculations)-maxAllowedMismatches (optional) max number of mismatches tolerated between a read and the closest allele (default = 6)-minRequiredMatches (optional) min number of base matches required between a read and the closest allele (default = 0)The HLACaller walker calculates the likelihoods for observing pairs of HLA alleles given the data based on genotype, phasing, and allele frequency information. It traverses through each read as part of the phasing algorithm to determine likelihoods based on phase information. The inputs include an aligned bam files, the outputs from FindClosestHLA and CalculateBaseLikelihoods, the HLA dictionary and allele frequencies, and optional cutoffs for excluding specific reads due to misalignment (maxAllowedMismatches and minRequiredMatches).
Usage:
java -jar GenomeAnalysisTK.jar -T HLACaller -I input.bam -R ref.fasta -L HLA_EXONS.intervals -filter output.filter -baselikelihoods output.baselikelihoods\
-maxAllowedMismatches 6 -minRequiredMatches 5 -HLAdictionary HLA_DICTIONARY.txt -HLAfrequencies HLA_FREQUENCIES.txt | grep -v "INFO" > output.calls
Arguments:
-baseLikelihoods (required) output of CalculateBaseLikelihoods walker (output.baselikelihoods - genotype likelihoods / list of polymorphic sites from the data)-HLAdictionary (required) HLA_DICTIONARY.txt file-HLAfrequencies (required) HLA_FREQUENCIES.txt file-useInterval (required) HLA_EXONS.intervals file-filter (optional) file = output of FindClosestAllele walker (to exclude misaligned reads in genotype calculations)-maxAllowedMismatches (optional) max number of mismatched bases tolerated between a read and the closest allele (default = 6)-minRequiredMatches (optional) min number of base matches required between a read and the closest allele (default = 5)-minFreq (option) minimum allele frequency required to consider the HLA allele (default = 0.0).Extract sequences from the HLA loci and make a new bam file:
use Java-1.6
set HLA=/seq/NKseq/sjia/HLA_CALLER
set GATK=/seq/NKseq/sjia/Sting/dist/GenomeAnalysisTK.jar
set REF=/humgen/1kg/reference/human_b36_both.fasta
cp $HLA/samheader NA12878.HLA.sam
java -jar $GATK -T PrintReads \
-I /seq/dirseq/ftp/NA12878_exome/NA12878.bam -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta \
-L $HLA/HLA.intervals | grep -v RESULT | sed 's/chr6/6/g' >> NA12878.HLA.sam
/home/radon01/sjia/bin/SamToBam.csh NA12878.HLA
Use FindClosestHLA to find closest matching HLA alleles and to detect possible misalignments:
java -jar $GATK -T FindClosestHLA -I NA12878.HLA.bam -R $REF -L $HLA/HLA_EXONS.intervals -useInterval $HLA/HLA_EXONS.intervals \
-HLAdictionary $HLA/HLA_DICTIONARY.txt -PolymorphicSites $HLA/HLA_POLYMORPHIC_SITES.txt | grep -v INFO > NA12878.HLA.filter
READ_NAME START-END S %Match Matches Discord Alleles
20FUKAAXX100202:7:63:8309:75917 30018423-30018523 1.0 1.000 1 0 HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,...
20GAVAAXX100126:3:24:13495:18608 30018441-30018541 1.0 1.000 3 0 HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,...
20FUKAAXX100202:8:44:16857:92134 30018442-30018517 1.0 1.000 1 0 HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,HLA_A*010105,...
20FUKAAXX100202:8:5:4309:85338 30018452-30018552 1.0 1.000 3 0 HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,HLA_A*110105,...
20GAVAAXX100126:3:28:7925:160832 30018453-30018553 1.0 1.000 3 0 HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,HLA_A*110105,...
20FUKAAXX100202:1:2:10539:169258 30018459-30018530 1.0 1.000 1 0 HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,.
20FUKAAXX100202:8:43:18611:44456 30018460-30018560 1.0 1.000 3 0 HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,...
Use CalculateBaseLikelihoods to determine genotype likelihoods at every base position:
java -jar $GATK -T CalculateBaseLikelihoods -I NA12878.HLA.bam -R $REF -L $HLA/HLA_EXONS.intervals \
-filter NA12878.HLA.filter -maxAllowedMismatches 6 -minRequiredMatches 0 | grep -v INFO | grep -v MISALIGNED > NA12878.HLA.baselikelihoods
chr:pos Ref Counts AA AC AG AT CC CG CT GG GT TT
6:30018513 G A[0]C[0]T[1]G[39] -113.58 -113.58 -13.80 -113.29 -113.58 -13.80 -113.29 -3.09 -13.50 -113.11
6:30018514 C A[0]C[39]T[0]G[0] -119.91 -13.00 -119.91 -119.91 -2.28 -13.00 -13.00 -119.91 -119.91 -119.91
6:30018515 T A[0]C[0]T[39]G[0] -118.21 -118.21 -118.21 -13.04 -118.21 -118.21 -13.04 -118.21 -13.04 -2.35
6:30018516 C A[0]C[38]T[1]G[0] -106.91 -13.44 -106.91 -106.77 -3.05 -13.44 -13.30 -106.91 -106.77 -106.66
6:30018517 C A[0]C[38]T[0]G[0] -103.13 -13.45 -103.13 -103.13 -3.64 -13.45 -13.45 -103.13 -103.13 -103.13
6:30018518 C A[0]C[38]T[0]G[0] -112.23 -12.93 -112.23 -112.23 -2.71 -12.93 -12.93 -112.23 -112.23 -112.23
...
Run HLACaller using outputs from previous steps to determine the most likely alleles at each locus:
java -jar $GATK -T HLACaller -I NA12878.HLA.bam -R $REF -L $HLA/HLA_EXONS.intervals -useInterval $HLA/HLA_EXONS.intervals \
-bl NA12878.HLA.baselikelihoods -filter NA12878.HLA.filter -maxAllowedMismatches 6 -minRequiredMatches 5 \
-HLAdictionary $HLA/HLA_DICTIONARY.txt -HLAfrequencies $HLA/HLA_FREQUENCIES.txt > NA12878.HLA.info
grep -v INFO NA12878.HLA.info > NA12878.HLA.calls
Locus A1 A2 Geno Phase Frq1 Frq2 L Prob Reads1 Reads2 Locus EXP White Black Asian
A 0101 1101 -1229.5 -15.2 -0.82 -0.73 -1244.7 1.00 180 191 229 1.62 -1.99 -3.13 -2.07
B 0801 5601 -832.3 -37.3 -1.01 -2.15 -872.1 1.00 58 59 100 1.17 -3.31 -4.10 -3.95
C 0102 0701 -1344.8 -37.5 -0.87 -0.86 -1384.2 1.00 91 139 228 1.01 -2.35 -2.95 -2.31
DPA1 0103 0201 -842.1 -1.8 -0.12 -0.79 -846.7 1.00 72 48 120 1.00 -0.90 -INF -1.27
DPB1 0401 1401 -991.5 -18.4 -0.45 -1.55 -1010.7 1.00 64 48 113 0.99 -2.24 -3.14 -2.64
DQA1 0101 0501 -1077.5 -15.9 -0.90 -0.62 -1095.4 1.00 160 77 247 0.96 -1.53 -1.60 -1.87
DQB1 0201 0501 -709.6 -18.6 -0.77 -0.76 -729.7 0.95 50 87 137 1.00 -1.76 -1.54 -2.23
DRB1 0101 0301 -1513.8 -317.3 -1.06 -0.94 -1832.6 1.00 52 32 101 0.83 -1.99 -2.83 -2.34
Make a SAM/BAM file of the called alleles:
awk '{if (NR > 1){print $1 "*" $2 "\n" $1 "*" $3}}' NA12878.HLA.calls | sort -u > NA12878.HLA.calls.unique
cp $HLA/samheader NA12878.HLA.calls.sam
awk '{split($1,a,"*"); print "grep \"" a[1] "[*]" a[2] "\" '$HLA/HLA_DICTIONARY.sam' >> 'NA12878.HLA'.tmp";}' NA12878.HLA.calls.unique | sh
sort -k4 -n NA12878.HLA.tmp >> NA12878.HLA.calls.sam
/home/radon01/sjia/bin/SamToBam.csh NA12878.HLA.calls
rm NA12878.HLA.tmp
There exist a few performance / accuracy tradeoffs in the HLA caller, as in any algorithm. The following are a few key considerations that the user should keep in mind when using the software for HLA typing.
In polymorphic regions of the genome like the HLA, misaligned reads (presence of erroneous reads or lack of proper sequences) and sequencing errors (indels, systematic PCR errors) may cause the HLA caller to call rare alleles with polymorphisms at the affected bases. The user can manually spot these errors when the algorithm calls a rare allele (Frq1 and Frq2 columns in the output of HLACaller indicate log10 of the allele frequencies). Alternatively, the user can choose to consider only non-rare alleles (use the "-minFreq 0.001" option in HLACaller) to make the algorithm (faster and) more robust against sequencing or alignment errors. The drawback to this approach is that the algorithm may not be able to correctly identifying rare alleles when they are truly present. We recommend using the -minFreq option for genome-wide sequencing datasets, but not for high-quality (targeted PCR 454) data specifically captured for HLA typing in large cohorts.
The FindClosestAllele walker (optional step) is recommended for two reasons:
The ability to detect misalignments for reads that don't match very well to the closest appearing HLA allele - removing these misaligned reads improves calling accuracy.
Creating a list of closest-matching HLA alleles reduces the search space (over 3,000 HLA alleles across the class I and class II loci) that HLACaller has to iterate through, reducing computational burdon.
However, using this pre-processing step is not without costs:
Any cutoff chosen for %concordance, min base matches, or max base mismatches will not distinguish between correctly aligned and misaligned reads 100% of the time - there is a chance that correctly aligned reads may be removed, and misaligned reads not removed.
The list of closest-matching alleles in some cases may not contain the true allele if there is sufficient sequencing error, in which case the true allele will not be considered by the HLACaller walker. In our experience, the advantages of using this pre-processing FindClosestAllele walker greatly outweighs the disadvantages, as recommend including it in the pipeline long as the user understands the possible risks of using this function.
The HLA caller algorithm was was developed by Xiaoming (Sherman) Jia with generous support of the GATK team (especially Mark Depristo, Eric Banks), and Paul de Bakker.
We make various files available for public download from the GSA FTP server, such as the GATK resource bundle and presentation slides. We also maintain a public upload feature for processing bug reports from users.
There are two logins to choose from depending on whether you want to upload or download something:
location: ftp.broadinstitute.org
username: gsapubftp-anonymous
password: <blank>
location: ftp.broadinstitute.org
username: gsapubftp
password: 5WvQWSfi
The GATK uses two files to access and safety check access to the reference files: a .dict dictionary of the contig names and sizes and a .fai fasta index file to allow efficient random access to the reference bases. You have to generate these files in order to be able to use a Fasta file as reference.
NOTE: Picard and samtools treat spaces in contig names differently. We recommend that you avoid using spaces in contig names.
We use CreateSequenceDictionary.jar from Picard to create a .dict file from a fasta file.
> java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:11 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict
[Fri Jun 19 14:09:58 EDT 2009] net.sf.picard.sam.CreateSequenceDictionary done.
Runtime.totalMemory()=2112487424
44.922u 2.308s 0:47.09 100.2% 0+0k 0+0io 2pf+0w
This produces a SAM-style header file describing the contents of our fasta file.
> cat Homo_sapiens_assembly18.dict
@HD VN:1.0 SO:unsorted
@SQ SN:chrM LN:16571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d2ed829b8a1628d16cbeee88e88e39eb
@SQ SN:chr1 LN:247249719 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9ebc6df9496613f373e73396d5b3b6b6
@SQ SN:chr2 LN:242951149 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:b12c7373e3882120332983be99aeb18d
@SQ SN:chr3 LN:199501827 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0e48ed7f305877f66e6fd4addbae2b9a
@SQ SN:chr4 LN:191273063 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cf37020337904229dca8401907b626c2
@SQ SN:chr5 LN:180857866 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:031c851664e31b2c17337fd6f9004858
@SQ SN:chr6 LN:170899992 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfe8005c536131276d448ead33f1b583
@SQ SN:chr7 LN:158821424 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:74239c5ceee3b28f0038123d958114cb
@SQ SN:chr8 LN:146274826 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1eb00fe1ce26ce6701d2cd75c35b5ccb
@SQ SN:chr9 LN:140273252 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:ea244473e525dde0393d353ef94f974b
@SQ SN:chr10 LN:135374737 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4ca41bf2d7d33578d2cd7ee9411e1533
@SQ SN:chr11 LN:134452384 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:425ba5eb6c95b60bafbf2874493a56c3
@SQ SN:chr12 LN:132349534 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d17d70060c56b4578fa570117bf19716
@SQ SN:chr13 LN:114142980 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c4f3084a20380a373bbbdb9ae30da587
@SQ SN:chr14 LN:106368585 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:c1ff5d44683831e9c7c1db23f93fbb45
@SQ SN:chr15 LN:100338915 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:5cd9622c459fe0a276b27f6ac06116d8
@SQ SN:chr16 LN:88827254 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3e81884229e8dc6b7f258169ec8da246
@SQ SN:chr17 LN:78774742 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2a5c95ed99c5298bb107f313c7044588
@SQ SN:chr18 LN:76117153 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:3d11df432bcdc1407835d5ef2ce62634
@SQ SN:chr19 LN:63811651 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2f1a59077cfad51df907ac25723bff28
@SQ SN:chr20 LN:62435964 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f126cdf8a6e0c7f379d618ff66beb2da
@SQ SN:chr21 LN:46944323 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f1b74b7f9f4cdbaeb6832ee86cb426c6
@SQ SN:chr22 LN:49691432 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:2041e6a0c914b48dd537922cca63acb8
@SQ SN:chrX LN:154913754 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d7e626c80ad172a4d7c95aadb94d9040
@SQ SN:chrY LN:57772954 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:62f69d0e82a12af74bad85e2e4a8bd91
@SQ SN:chr1_random LN:1663265 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:cc05cb1554258add2eb62e88c0746394
@SQ SN:chr2_random LN:185571 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:18ceab9e4667a25c8a1f67869a4356ea
@SQ SN:chr3_random LN:749256 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cc571e918ac18afa0b2053262cadab6
@SQ SN:chr4_random LN:842648 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:9cab2949ccf26ee0f69a875412c93740
@SQ SN:chr5_random LN:143687 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:05926bdbff978d4a0906862eb3f773d0
@SQ SN:chr6_random LN:1875562 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:d62eb2919ba7b9c1d382c011c5218094
@SQ SN:chr7_random LN:549659 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:28ebfb89c858edbc4d71ff3f83d52231
@SQ SN:chr8_random LN:943810 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:0ed5b088d843d6f6e6b181465b9e82ed
@SQ SN:chr9_random LN:1146434 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:1e3d2d2f141f0550fa28a8d0ed3fd1cf
@SQ SN:chr10_random LN:113275 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:50be2d2c6720dabeff497ffb53189daa
@SQ SN:chr11_random LN:215294 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bfc93adc30c621d5c83eee3f0d841624
@SQ SN:chr13_random LN:186858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:563531689f3dbd691331fd6c5730a88b
@SQ SN:chr15_random LN:784346 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:bf885e99940d2d439d83eba791804a48
@SQ SN:chr16_random LN:105485 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:dd06ea813a80b59d9c626b31faf6ae7f
@SQ SN:chr17_random LN:2617613 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:34d5e2005dffdfaaced1d34f60ed8fc2
@SQ SN:chr18_random LN:4262 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f3814841f1939d3ca19072d9e89f3fd7
@SQ SN:chr19_random LN:301858 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:420ce95da035386cc8c63094288c49e2
@SQ SN:chr21_random LN:1679693 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:a7252115bfe5bb5525f34d039eecd096
@SQ SN:chr22_random LN:257318 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:4f2d259b82f7647d3b668063cf18378b
@SQ SN:chrX_random LN:1719168 UR:file:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/Homo_sapiens_assembly18.fasta M5:f4d71e0758986c15e5455bf3e14e5d6f
We use the faidx command in samtools to prepare the fasta index file. This file describes byte offsets in the fasta file for each contig, allowing us to compute exactly where a particular reference base at contig:pos is in the fasta file.
> samtools faidx Homo_sapiens_assembly18.fasta
108.446u 3.384s 2:44.61 67.9% 0+0k 0+0io 0pf+0w
This produces a text file with one record per line for each of the fasta contigs. Each record is of the: contig, size, location, basesPerLine, bytesPerLine. The index file produced above looks like:
> cat Homo_sapiens_assembly18.fasta.fai
chrM 16571 6 50 51
chr1 247249719 16915 50 51
chr2 242951149 252211635 50 51
chr3 199501827 500021813 50 51
chr4 191273063 703513683 50 51
chr5 180857866 898612214 50 51
chr6 170899992 1083087244 50 51
chr7 158821424 1257405242 50 51
chr8 146274826 1419403101 50 51
chr9 140273252 1568603430 50 51
chr10 135374737 1711682155 50 51
chr11 134452384 1849764394 50 51
chr12 132349534 1986905833 50 51
chr13 114142980 2121902365 50 51
chr14 106368585 2238328212 50 51
chr15 100338915 2346824176 50 51
chr16 88827254 2449169877 50 51
chr17 78774742 2539773684 50 51
chr18 76117153 2620123928 50 51
chr19 63811651 2697763432 50 51
chr20 62435964 2762851324 50 51
chr21 46944323 2826536015 50 51
chr22 49691432 2874419232 50 51
chrX 154913754 2925104499 50 51
chrY 57772954 3083116535 50 51
chr1_random 1663265 3142044962 50 51
chr2_random 185571 3143741506 50 51
chr3_random 749256 3143930802 50 51
chr4_random 842648 3144695057 50 51
chr5_random 143687 3145554571 50 51
chr6_random 1875562 3145701145 50 51
chr7_random 549659 3147614232 50 51
chr8_random 943810 3148174898 50 51
chr9_random 1146434 3149137598 50 51
chr10_random 113275 3150306975 50 51
chr11_random 215294 3150422530 50 51
chr13_random 186858 3150642144 50 51
chr15_random 784346 3150832754 50 51
chr16_random 105485 3151632801 50 51
chr17_random 2617613 3151740410 50 51
chr18_random 4262 3154410390 50 51
chr19_random 301858 3154414752 50 51
chr21_random 1679693 3154722662 50 51
chr22_random 257318 3156435963 50 51
chrX_random 1719168 3156698441 50 51
Imagine a simple question like, "What's the depth of coverage at position A of the genome?"
First, you are given billions of reads that are aligned to the genome but not ordered in any particular way (except perhaps in the order they were emitted by the sequencer). This simple question is then very difficult to answer efficiently, because the algorithm is forced to examine every single read in succession, since any one of them might span position A. The algorithm must now take several hours in order to compute this value.
Instead, imagine the billions of reads are now sorted in reference order (that is to say, on each chromosome, the reads are stored on disk in the same order they appear on the chromosome). Now, answering the question above is trivial, as the algorithm can jump to the desired location, examine only the reads that span the position, and return immediately after those reads (and only those reads) are inspected. The total number of reads that need to be interrogated is only a handful, rather than several billion, and the processing time is seconds, not hours.
This reference-ordered sorting enables the GATK to process terabytes of data quickly and without tremendous memory overhead. Most GATK tools run very quickly and with less than 2 gigabytes of RAM. Without this sorting, the GATK cannot operate correctly. Thus, it is a fundamental rule of working with the GATK, which is the reason for the Central Dogma of the GATK:
VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. See this page for detailed specifications.
VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.
That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from NGS data, such as the UnifiedGenotyper and the HaplotypeCaller, is especially complex. This document describes some specific features and annotations used in the VCF files output by the GATK tools.
The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from our favorite test sample, NA12878:
##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
It seems a bit complex, but the structure of the file is actually quite simple:
[HEADER LINES]
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that here everything is a SNP, but some could be indels or CNVs.
The first 6 columns of the VCF, which represent the observed variation, are easy to understand because they have a single, well-defined meaning.
CHROM and POS : The CHROM and POS gives the contig on which the variant occurs. For indels this is actually the base preceding the event, due to how indels are represented in a VCF.
ID: The dbSNP rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP.
REF and ALT: The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event).
QUAL: The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of NGS data is used for variant calling.
FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis.
For more details about these fields, please see this page.
In the excerpt shown above, here is how we interpret the line corresponding to each variant:
The genotype fields of the VCF look more complicated but they're actually not that hard to interpret once you understand that they're just sets of tags and values. Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:
chr1 873762 . T G [CLIPPED] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
Looking at that last column, here is what the tags mean:
GT : The genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:
GQ: The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT. In the diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood that the sample is 0/0, 0/1/, or 1/1 under the model built for the NGS dataset.
AD and DP: These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. See the Technical Documentation for details on AD (DepthPerAlleleBySample) and DP (Coverage).
PL: This field provides the likelihoods of the given genotypes (here, 0/0, 0/1, and 1/1). These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the heterozygous case, this is L(data given that the true genotype is 0/1). The most likely genotype (given in the GT field) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.
With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282.
chr1 899282 rs28548431 C T [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
At this site, the called genotype is GT = 0/1, which is C/T. The confidence (GQ=25.92) isn't so good, largely because there were only a total of 4 reads at this site (DP=4), 1 of which was ref (=had the reference base) and 3 of which were alt (=had the alternate base) (AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value), whereas there's a serious chance that the subject is hom-var (=homozygous with the variant allele) since PL(1/1) = 26 = 10^(-2.6) = 0.25%. Either way, though, it's clear that the subject is definitely not home-ref (=homozygous with the reference allele) here since PL(0/0) = 103 = 10^(-10.3) which is a very small number.
Finally, variants in a VCF can be annotated with a variety of additional tags, either by the built-in tools or with others that you add yourself. The way they're formatted is similar to what we saw in the Genotype fields, except instead of being in two separate fields (tags and values, respectively) the annotation tags and values are grouped together, so tag-value pairs are written one after another.
chr1 873762 [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473
chr1 877664 [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185
chr1 899282 [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148
Here are some commonly used built-in annotations and what they mean:
| Annotation tag in VCF | Meaning |
|---|---|
| AC,AF,AN | See the Technical Documentation for Chromosome Counts. |
| DB | If present, then the variant is in dbSNP. |
| DP | See the Technical Documentation for Coverage. |
| DS | Were any of the samples downsampled because of too much coverage? |
| Dels | See the Technical Documentation for SpanningDeletions. |
| MQ and MQ0 | See the Technical Documentation for RMS Mapping Quality and Mapping Quality Zero. |
| BaseQualityRankSumTest | See the Technical Documentation for Base Quality Rank Sum Test. |
| MappingQualityRankSumTest | See the Technical Documentation for Mapping Quality Rank Sum Test. |
| ReadPosRankSumTest | See the Technical Documentation for Read Position Rank Sum Test. |
| HRun | See the Technical Documentation for Homopolymer Run. |
| HaplotypeScore | See the Technical Documentation for Haplotype Score. |
| QD | See the Technical Documentation for Qual By Depth. |
| VQSLOD | Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model. |
| FS | See the Technical Documentation for Fisher Strand |
| SB | How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls). |
Run a basic analysis command on example data, parallelized with Queue.
One very cool feature of Queue is that you can test your script by doing a "dry run". That means Queue will prepare the analysis and build the scatter commands, but not actually run them. This makes it easier to check the sanity of your script and command.
Here we're going to set up a dry run of a CountReads analysis. You should be familiar with the CountReads walker and the example files from the bundles, as used in the basic "GATK for the first time" tutorial. In addition, we're going to use the example QScript called ExampleCountReads.scala provided in the Queue package download.
Type the following command:
java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam
where -S ExampleCountReads.scala specifies which QScript we want to run, -R exampleFASTA.fasta specifies the reference sequence, and -I exampleBAM.bam specifies the file of aligned reads we want to analyze.
After a few seconds you should see output that looks nearly identical to this:
INFO 00:30:45,527 QScriptManager - Compiling 1 QScript
INFO 00:30:52,869 QScriptManager - Compilation complete
INFO 00:30:53,284 HelpFormatter - ----------------------------------------------------------------------
INFO 00:30:53,284 HelpFormatter - Queue v2.0-36-gf5c1c1a, Compiled 2012/08/08 20:18:21
INFO 00:30:53,284 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 00:30:53,284 HelpFormatter - Fro support and documentation go to http://www.broadinstitute.org/gatk
INFO 00:30:53,285 HelpFormatter - Program Args: -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam
INFO 00:30:53,285 HelpFormatter - Date/Time: 2012/08/09 00:30:53
INFO 00:30:53,285 HelpFormatter - ----------------------------------------------------------------------
INFO 00:30:53,285 HelpFormatter - ----------------------------------------------------------------------
INFO 00:30:53,290 QCommandLine - Scripting ExampleCountReads
INFO 00:30:53,364 QCommandLine - Added 1 functions
INFO 00:30:53,364 QGraph - Generating graph.
INFO 00:30:53,388 QGraph - -------
INFO 00:30:53,402 QGraph - Pending: 'java' '-Xmx1024m' '-Djava.io.tmpdir=/Users/vdauwera/sandbox/Q2/resources/tmp' '-cp' '/Users/vdauwera/sandbox/Q2/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'CountReads' '-I' '/Users/vdauwera/sandbox/Q2/resources/exampleBAM.bam' '-R' '/Users/vdauwera/sandbox/Q2/resources/exampleFASTA.fasta'
INFO 00:30:53,403 QGraph - Log: /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads-1.out
INFO 00:30:53,403 QGraph - Dry run completed successfully!
INFO 00:30:53,404 QGraph - Re-run with "-run" to execute the functions.
INFO 00:30:53,409 QCommandLine - Script completed successfully with 1 total jobs
INFO 00:30:53,410 QCommandLine - Writing JobLogging GATKReport to file /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads.jobreport.txt
If you don't see this, check your spelling (GATK commands are case-sensitive), check that the files are in your working directory, and if necessary, re-check that the GATK and Queue are properly installed.
If you do see this output, congratulations! You just successfully ran you first Queue dry run!
Once you have verified that the Queue functions have been generated successfully, you can execute the pipeline by appending -run to the command line.
Instead of this command, which we used earlier:
java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam
this time you type this:
java -Djava.io.tmpdir=tmp -jar Queue.jar -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam -run
See the difference?
You should see output that looks nearly identical to this:
INFO 00:56:33,688 QScriptManager - Compiling 1 QScript
INFO 00:56:39,327 QScriptManager - Compilation complete
INFO 00:56:39,487 HelpFormatter - ----------------------------------------------------------------------
INFO 00:56:39,487 HelpFormatter - Queue v2.0-36-gf5c1c1a, Compiled 2012/08/08 20:18:21
INFO 00:56:39,488 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 00:56:39,488 HelpFormatter - Fro support and documentation go to http://www.broadinstitute.org/gatk
INFO 00:56:39,489 HelpFormatter - Program Args: -S ExampleCountReads.scala -R exampleFASTA.fasta -I exampleBAM.bam -run
INFO 00:56:39,490 HelpFormatter - Date/Time: 2012/08/09 00:56:39
INFO 00:56:39,490 HelpFormatter - ----------------------------------------------------------------------
INFO 00:56:39,491 HelpFormatter - ----------------------------------------------------------------------
INFO 00:56:39,498 QCommandLine - Scripting ExampleCountReads
INFO 00:56:39,569 QCommandLine - Added 1 functions
INFO 00:56:39,569 QGraph - Generating graph.
INFO 00:56:39,589 QGraph - Running jobs.
INFO 00:56:39,623 FunctionEdge - Starting: 'java' '-Xmx1024m' '-Djava.io.tmpdir=/Users/vdauwera/sandbox/Q2/resources/tmp' '-cp' '/Users/vdauwera/sandbox/Q2/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'CountReads' '-I' '/Users/vdauwera/sandbox/Q2/resources/exampleBAM.bam' '-R' '/Users/vdauwera/sandbox/Q2/resources/exampleFASTA.fasta'
INFO 00:56:39,623 FunctionEdge - Output written to /Users/GG/codespace/GATK/Q2/resources/ExampleCountReads-1.out
INFO 00:56:50,301 QGraph - 0 Pend, 1 Run, 0 Fail, 0 Done
INFO 00:57:09,827 FunctionEdge - Done: 'java' '-Xmx1024m' '-Djava.io.tmpdir=/Users/vdauwera/sandbox/Q2/resources/tmp' '-cp' '/Users/vdauwera/sandbox/Q2/resources/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'CountReads' '-I' '/Users/vdauwera/sandbox/Q2/resources/exampleBAM.bam' '-R' '/Users/vdauwera/sandbox/Q2/resources/exampleFASTA.fasta'
INFO 00:57:09,828 QGraph - 0 Pend, 0 Run, 0 Fail, 1 Done
INFO 00:57:09,835 QCommandLine - Script completed successfully with 1 total jobs
INFO 00:57:09,835 QCommandLine - Writing JobLogging GATKReport to file /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads.jobreport.txt
INFO 00:57:10,107 QCommandLine - Plotting JobLogging GATKReport to file /Users/vdauwera/sandbox/Q2/resources/ExampleCountReads.jobreport.pdf
WARN 00:57:18,597 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info.
Great! It works!
The results of the traversal will be written to a file in the current directory. The name of the file will be printed in the output, ExampleCountReads.out in this example.
If for some reason the run was interrupted, in most cases you can resume by just launching the command. Queue will pick up where it left off without redoing the parts that ran successfully.
Run with -bsub to run on LSF, or for early Grid Engine support see Queue with Grid Engine.
See also QFunction and Command Line Options for more info on Queue options.
Run a basic analysis command on example data.
A very simple analysis that you can do with the GATK is getting a count of the reads in a BAM file. The GATK is capable of much more powerful analyses, but this is a good starting example because there are very few things that can go wrong.
So we are going to count the reads in the file exampleBAM.bam, which you can find in the GATK resource bundle along with its associated index (same file name with .bai extension), as well as the example reference exampleFASTA.fasta and its associated index (same file name with .fai extension) and dictionary (same file name with .dict extension). Copy them to your working directory so that your directory contents look like this:
[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x 9 vdauwera CHARLES\Domain Users 306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera CHARLES\Domain Users 204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 20 Apr 10 07:39 exampleFASTA.fasta.fai
Type the following command:
java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
where -T CountReads specifies which analysis tool we want to use, -R exampleFASTA.fasta specifies the reference sequence, and -I exampleBAM.bam specifies the file of aligned reads we want to analyze.
For any analysis that you want to run on a set of aligned reads, you will always need to use at least these three arguments:
-T for the tool name, which specifices the corresponding analysis-R for the reference sequence file-I for the input BAM file of aligned readsThey don't have to be in that order in your command, but this way you can remember that you need them if you TRI...
After a few seconds you should see output that looks like to this:
INFO 16:17:45,945 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:45,946 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41
INFO 16:17:45,947 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:17:45,947 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:17:45,947 HelpFormatter - Program Args: -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
INFO 16:17:45,947 HelpFormatter - Date/Time: 2012/07/25 16:17:45
INFO 16:17:45,947 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:45,948 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:45,950 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:17:45,982 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:17:45,993 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 16:17:46,060 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 16:17:46,060 TraversalEngine - Location processed.reads runtime per.1M.reads completed total.runtime remaining
INFO 16:17:46,061 Walker - [REDUCE RESULT] Traversal result is: 33
INFO 16:17:46,061 TraversalEngine - Total runtime 0.00 secs, 0.00 min, 0.00 hours
INFO 16:17:46,100 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO 16:17:46,729 GATKRunReport - Uploaded run statistics report to AWS S3
Depending on the GATK release, you may see slightly different information output, but you know everything is running correctly if you see the line:
INFO 21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33
somewhere in your output.
If you don't see this, check your spelling (GATK commands are case-sensitive), check that the files are in your working directory, and if necessary, re-check that the GATK is properly installed.
If you do see this output, congratulations! You just successfully ran you first GATK analysis!
Basically the output you see means that the CountReadsWalker (which you invoked with the command line option -T CountReads) counted 33 reads in the exampleBAM.bam file, which is exactly what we expect to see.
Wait, what is this walker thing?
In the GATK jargon, we call the tools walkers because the way they work is that they walk through the dataset --either along the reference sequence (LocusWalkers), or down the list of reads in the BAM file (ReadWalkers)-- collecting the requested information along the way.
Now that you're rocking the read counts, you can start to expand your use of the GATK command line.
Let's say you don't care about counting reads anymore; now you want to know the number of loci (positions on the genome) that are covered by one or more reads. The name of the tool, or walker, that does this is CountLoci. Since the structure of the GATK command is basically always the same, you can simply switch the tool name, right?
Instead of this command, which we used earlier:
java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam
this time you type this:
java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam
See the difference?
You should see something like this output:
INFO 16:18:26,183 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:18:26,185 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41
INFO 16:18:26,185 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:18:26,185 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:18:26,186 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam
INFO 16:18:26,186 HelpFormatter - Date/Time: 2012/07/25 16:18:26
INFO 16:18:26,186 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:18:26,186 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:18:26,189 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:18:26,222 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:18:26,233 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 16:18:26,351 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 16:18:26,351 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining
2052
INFO 16:18:26,411 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours
INFO 16:18:26,450 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO 16:18:27,124 GATKRunReport - Uploaded run statistics report to AWS S3
Great! But wait -- where's the result? Last time the result was given on this line:
INFO 21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33
But this time there is no line that says [REDUCE RESULT]! Is something wrong?
Not really. The program ran just fine -- but we forgot to give it an output file name. You see, the CountLoci walker is set up to output the result of its calculations to a text file, unlike CountReads, which is perfectly happy to output its result to the terminal screen.
So we repeat the command, but this time we specify an output file, like this:
java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt
where -o (lowercase o, not zero) is used to specify the output.
You should get essentially the same output on the terminal screen as previously (but notice the difference in the line that contains Program Args -- the new argument is included):
INFO 16:29:15,451 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:29:15,453 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41
INFO 16:29:15,453 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:29:15,453 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:29:15,453 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt
INFO 16:29:15,454 HelpFormatter - Date/Time: 2012/07/25 16:29:15
INFO 16:29:15,454 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:29:15,454 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:29:15,457 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:29:15,488 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:29:15,499 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01
INFO 16:29:15,618 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 16:29:15,618 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 16:29:15,679 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours
INFO 16:29:15,718 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO 16:29:16,712 GATKRunReport - Uploaded run statistics report to AWS S3
This time however, if we look inside the working directory, there is a newly created file there called output.txt.
[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x 9 vdauwera CHARLES\Domain Users 306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera CHARLES\Domain Users 204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera CHARLES\Domain Users 20 Apr 10 07:39 exampleFASTA.fasta.fai
-rw-r--r-- 1 vdauwera CHARLES\Domain Users 5 Jul 25 16:29 output.txt
This file contains the result of the analysis:
[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% cat output.txt
2052
This means that there are 2052 loci in the reference sequence that are covered by at least one or more reads in the BAM file.
Okay then, but why not show the full, correct command in the first place? Because this was a good opportunity for you to learn a few of the caveats of the GATK command system, which may save you a lot of frustration later on.
Beyond the common basic arguments that almost all GATK walkers require, most of them also have specific requirements or options that are important to how they work. You should always check what are the specific arguments that are required, recommended and/or optional for the walker you want to use before starting an analysis.
Fortunately the GATK is set up to complain (i.e. terminate with an error message) if you try to run it without specifying a required argument. For example, if you try to run this:
java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta
the GATK will spit out a wall of text, including the basic usage guide that you can invoke with the --help option, and more importantly, the following error message:
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.0-22-g40f97eb):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Walker requires reads but none were provided.
##### ERROR ------------------------------------------------------------------------------------------
You see the line that says ERROR MESSAGE: Walker requires reads but none were provided? This tells you exactly what was wrong with your command.
So the GATK will not run if a walker does not have all the required inputs. That's a good thing! But in the case of our first attempt at running CountLoci, the -o argument is not required by the GATK to run -- it's just highly desirable if you actually want the result of the analysis!
There will be many other cases of walkers with arguments that are not strictly required, but highly desirable if you want the results to be meaningful.
So, at the risk of getting repetitive, always read the documentation of each walker that you want to use!
Test that the GATK is correctly installed, and that the supporting tools like Java are in your path.
The command we're going to run is a very simple command that asks the GATK to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your GATK package is installed correctly.
Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.
Type the following command:
java -jar <path to GenomeAnalysisTK.jar> --help
replacing the <path to GenomeAnalysisTK.jar> bit with the path you have set up in your command-line environment.
You should see usage output similar to the following:
usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-I <input_file>] [-L
<intervals>] [-R <reference_sequence>] [-B <rodBind>] [-D <DBSNP>] [-H
<hapmap>] [-hc <hapmap_chip>] [-o <out>] [-e <err>] [-oe <outerr>] [-A] [-M
<maximum_reads>] [-sort <sort_on_the_fly>] [-compress <bam_compression>] [-fmq0] [-dfrac
<downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-S
<validation_strictness>] [-U] [-P] [-dt] [-tblw] [-nt <numthreads>] [-l
<logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h]
-T,--analysis_type <analysis_type> Type of analysis to run
-I,--input_file <input_file> SAM or BAM file(s)
-L,--intervals <intervals> A list of genomic intervals over which
to operate. Can be explicitly specified
on the command line or in a file.
-R,--reference_sequence <reference_sequence> Reference sequence file
-B,--rodBind <rodBind> Bindings for reference-ordered data, in
the form <name>,<type>,<file>
-D,--DBSNP <DBSNP> DBSNP file
-H,--hapmap <hapmap> Hapmap file
-hc,--hapmap_chip <hapmap_chip> Hapmap chip file
-o,--out <out> An output file presented to the walker.
Will overwrite contents if file exists.
-e,--err <err> An error output file presented to the
walker. Will overwrite contents if file
exists.
-oe,--outerr <outerr> A joint file for 'normal' and error
output presented to the walker. Will
overwrite contents if file exists.
...
If you see this message, your GATK installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.
Let's try to figure out what's not working.
First, make sure that your Java version is at least 1.6, by typing the following command:
java -version
You should see something similar to the following text:
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
If the version is less then 1.6, install the newest version of Java onto the system. If you instead see something like
java: Command not found
make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.
On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.
Test that Queue is correctly installed, and that the supporting tools like Java are in your path.
The command we're going to run is a very simple command that asks Queue to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your Queue package is installed correctly.
Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.
Type the following command:
java -jar <path to Queue.jar> --help
replacing the <path to Queue.jar> bit with the path you have set up in your command-line environment.
You should see usage output similar to the following:
usage: java -jar Queue.jar -S <script> [-jobPrefix <job_name_prefix>] [-jobQueue <job_queue>] [-jobProject <job_project>]
[-jobSGDir <job_scatter_gather_directory>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
<temp_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>] [-emailTLS] [-emailSSL] [-emailUser
<emailUsername>] [-emailPass <emailPassword>] [-emailPassFile <emailPasswordFile>] [-bsub] [-run] [-dot <dot_graph>]
[-expandedDot <expanded_dot_graph>] [-startFromScratch] [-status] [-statusFrom <status_email_from>] [-statusTo
<status_email_to>] [-keepIntermediates] [-retry <retry_failed>] [-l <logging_level>] [-log <log_to_file>] [-quiet]
[-debug] [-h]
-S,--script <script> QScript scala file
-jobPrefix,--job_name_prefix <job_name_prefix> Default name prefix for compute farm jobs.
-jobQueue,--job_queue <job_queue> Default queue for compute farm jobs.
-jobProject,--job_project <job_project> Default project for compute farm jobs.
-jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory> Default directory to place scatter gather
output for compute farm jobs.
-memLimit,--default_memory_limit <default_memory_limit> Default memory limit for jobs, in gigabytes.
-runDir,--run_directory <run_directory> Root directory to run functions from.
-tempDir,--temp_directory <temp_directory> Temp directory to pass to functions.
-emailHost,--emailSmtpHost <emailSmtpHost> Email SMTP host. Defaults to localhost.
-emailPort,--emailSmtpPort <emailSmtpPort> Email SMTP port. Defaults to 465 for ssl,
otherwise 25.
-emailTLS,--emailUseTLS Email should use TLS. Defaults to false.
-emailSSL,--emailUseSSL Email should use SSL. Defaults to false.
-emailUser,--emailUsername <emailUsername> Email SMTP username. Defaults to none.
-emailPass,--emailPassword <emailPassword> Email SMTP password. Defaults to none. Not
secure! See emailPassFile.
-emailPassFile,--emailPasswordFile <emailPasswordFile> Email SMTP password file. Defaults to none.
-bsub,--bsub_all_jobs Use bsub to submit jobs
-run,--run_scripts Run QScripts. Without this flag set only
performs a dry run.
-dot,--dot_graph <dot_graph> Outputs the queue graph to a .dot file. See:
http://en.wikipedia.org/wiki/DOT_language
-expandedDot,--expanded_dot_graph <expanded_dot_graph> Outputs the queue graph of scatter gather to
a .dot file. Otherwise overwrites the
dot_graph
-startFromScratch,--start_from_scratch Runs all command line functions even if the
outputs were previously output successfully.
-status,--status Get status of jobs for the qscript
-statusFrom,--status_email_from <status_email_from> Email address to send emails from upon
completion or on error.
-statusTo,--status_email_to <status_email_to> Email address to send emails to upon
completion or on error.
-keepIntermediates,--keep_intermediate_outputs After a successful run keep the outputs of
any Function marked as intermediate.
-retry,--retry_failed <retry_failed> Retry the specified number of times after a
command fails. Defaults to no retries.
-l,--logging_level <logging_level> Set the minimum level of logging, i.e.
setting INFO get's you INFO up to FATAL,
setting ERROR gets you ERROR and FATAL level
logging.
-log,--log_to_file <log_to_file> Set the logging location
-quiet,--quiet_output_mode Set the logging to quiet mode, no output to
stdout
-debug,--debug_mode Set the logging file string to include a lot
of debugging information (SLOW!)
-h,--help Generate this help message
If you see this message, your Queue installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.
Let's try to figure out what's not working.
First, make sure that your Java version is at least 1.6, by typing the following command:
java -version
You should see something similar to the following text:
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
If the version is less then 1.6, install the newest version of Java onto the system. If you instead see something like
java: Command not found
make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.
On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.
BEAGLE is a state of the art software package for analysis of large-scale genetic data sets with hundreds of thousands of markers genotyped on thousands of samples. BEAGLE can
The GATK provides and experimental interface to BEAGLE. Currently, the only use cases supported by this interface are a) inferring missing genotype data from call sets (e.g. for lack of coverage in low-pass data), b) Genotype inference for unrelated individuals.
The basic workflow for this interface is as follows:
After variants are called and possibly filtered, the GATK walker ProduceBeagleInput will take the resulting VCF as input, and will produce a likelihood file in BEAGLE format.
Before running BEAGLE, we need to first take an input VCF file with genotype likelihoods and produce the BEAGLE likelihoods file using walker ProduceBealgeInput, as described in detail in its documentation page.
For each variant in inputvcf.vcf, ProduceBeagleInput will extract the genotype likelihoods, convert from log to linear space, and produce a BEAGLE input file in Genotype likelihoods file format (see BEAGLE documentation for more details). Essentially, this file is a text file in tabular format, a snippet of which is pasted below:
marker alleleA alleleB NA07056 NA07056 NA07056 NA11892 NA11892 NA11892
20:60251 T C 10.00 1.26 0.00 9.77 2.45 0.00
20:60321 G T 10.00 5.01 0.01 10.00 0.31 0.00
20:60467 G C 9.55 2.40 0.00 9.55 1.20 0.00
Note that BEAGLE only supports biallelic sites. Markers can have an arbitrary label, but they need to be in chromosomal order. Sites that are not genotyped in the input VCF (i.e. which are annotated with a "./." string and have no Genotype Likelihood annotation) are assigned a likelihood value of (0.33, 0.33, 0.33).
IMPORTANT: Due to BEAGLE memory restrictions, it's strongly recommended that BEAGLE be run on a separate chromosome-by-chromosome basis. In the current use case, BEAGLE uses RAM in a manner approximately proportional to the number of input markers. After BEAGLE is run and an output VCF is produced as described below, CombineVariants can be used to combine resulting VCF's, using the "-variantMergeOptions UNION" argument.
We currently only support a subset of BEAGLE functionality - only unphased, unrelated input likelihood data is supported. To run imputation analysis, run for example
java -Xmx4000m -jar path_to_beagle/beagle.jar like=path_to_beagle_output/beagle_output out=myrun
Extra BEAGLE arguments can be added as required.
Empirically, Beagle can run up to about ~800,000 markers with 4 GB of RAM. Larger chromosomes require additional memory.
BEAGLE will produce several output files. The following shell commands unzip the output files in preparation for their being processed, and put them all in the same place:
# unzip gzip'd files, force overwrite if existing
gunzip -f path_to_beagle_output/myrun.beagle_output.gprobs.gz
gunzip -f path_to_beagle_output/myrun.beagle_output.phased.gz
#rename also Beagle likelihood file to mantain consistency
mv path_to_beagle_output/beagle_output path_to_beagle_output/myrun.beagle_output.like
Once BEAGLE files are produced, we can update our original VCF with BEAGLE's data. This is done using the BeagleOutputToVCF tool.
The walker looks for the files specified with the -B(type,BEAGLE,file) triplets as above for the output posterior genotype probabilities, the output r^2 values and the output phased genotypes. The order in which these are given in the command line is arbitrary, but all three must be present for correct operation.
The output VCF has the new genotypes that Beagle produced, and several annotations are also updated. By default, the walker will update the per-genotype annotations GQ (Genotype Quality), the genotypes themselves, as well as the per-site annotations AF (Allele Frequency), AC (Allele Count) and AN (Allele Number).
The resulting VCF can now be used for further downstream analysis.
Assuming you have broken up your calls into Beagle by chromosome (as recommended above), you can use the CombineVariants tool to merge the resulting VCFs into a single callset.
java -jar /path/to/dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R reffile.fasta \
--out genome_wide_output.vcf \
-V:input1 beagle_output_chr1.vcf \
-V:input2 beagle_output_chr2.vcf \
.
.
.
-V:inputX beagle_output_chrX.vcf \
-type UNION -priority input1,input2,...,inputX
Three-stage procedure:
Create a master set of sites from your N batch VCFs that you want to genotype in all samples. At this stage you need to determine how you want to resolve disagreements among the VCFs. This is your master sites VCF.
Take the master sites VCF and genotype each sample BAM file at these sites
(Optionally) Merge the single sample VCFs into a master VCF file
The first step of batch merging is to create a master set of sites that you want to genotype in all samples. To make this problem concrete, suppose I have two VCF files:
Batch 1:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891
20 9999996 . A ATC . PASS . GT:GQ 0/1:30
20 10000000 . T G . PASS . GT:GQ 0/1:30
20 10000117 . C T . FAIL . GT:GQ 0/1:30
20 10000211 . C T . PASS . GT:GQ 0/1:30
20 10001436 . A AGG . PASS . GT:GQ 1/1:30
Batch 2:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 9999996 . A ATC . PASS . GT:GQ 0/1:30
20 10000117 . C T . FAIL . GT:GQ 0/1:30
20 10000211 . C T . FAIL . GT:GQ 0/1:30
20 10000598 . T A . PASS . GT:GQ 1/1:30
20 10001436 . A AGGCT . PASS . GT:GQ 1/1:30
In order to merge these batches, I need to make a variety of bookkeeping and filtering decisions, as outlined in the merged VCF below:
Master VCF:
20 9999996 . A ATC . PASS . GT:GQ 0/1:30 [pass in both]
20 10000000 . T G . PASS . GT:GQ 0/1:30 [only in batch 1]
20 10000117 . C T . FAIL . GT:GQ 0/1:30 [fail in both]
20 10000211 . C T . FAIL . GT:GQ 0/1:30 [pass in 1, fail in 2, choice in unclear]
20 10000598 . T A . PASS . GT:GQ 1/1:30 [only in batch 2]
20 10001436 . A AGGCT . PASS . GT:GQ 1/1:30 [A/AGG in batch 1, A/AGGCT in batch 2, including this site may be problematic]
These issues fall into the following categories:
There are two difficult situations that must be addressed by the needs of the project merging batches:
Unfortunately, we cannot determine which is actually the correct choice, especially given the goals of the project. We leave it up the project bioinformatician to handle these cases when creating the master VCF. We are hopeful that at some point in the future we'll have a consensus approach to handle such merging, but until then this will be a manual process.
The GATK tool CombineVariants can be used to merge multiple VCF files, and parameter choices will allow you to handle some of the above issues. With tools like SelectVariants one can slice-and-dice the merged VCFs to handle these complexities as appropriate for your project's needs. For example, the above master merge can be produced with the following CombineVariants:
java -jar dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R human_g1k_v37.fasta \
-V:one,VCF combine.1.vcf -V:two,VCF combine.2.vcf \
--sites_only \
-minimalVCF \
-o master.vcf
producing the following VCF:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO
20 9999996 . A ACT . PASS set=Intersection
20 10000000 . T G . PASS set=one
20 10000117 . C T . FAIL set=FilteredInAll
20 10000211 . C T . PASS set=filterIntwo-one
20 10000598 . T A . PASS set=two
20 10001436 . A AGG,AGGCT . PASS set=Intersection
Having created the master set of sites to genotype, along with their alleles, as in the previous section, you now use the UnifiedGenotyper to genotype each sample independently at the master set of sites. This GENOTYPE_GIVEN_ALLELES mode of the UnifiedGenotyper will jump into the sample BAM file, and calculate the genotype and genotype likelihoods of the sample at the site for each of the genotypes available for the REF and ALT alleles. For example, for site 10000211, the UnifiedGenotyper would evaluate the likelihoods of the CC, CT, and TT genotypes for the sample at this site, choose the most likely configuration, and generate a VCF record containing the genotype call and the likelihoods for the three genotype configurations.
As a concrete example command line, you can genotype the master.vcf file using in the bundle sample NA12878 with the following command:
java -Xmx2g -jar dist/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R bundle/b37/human_g1k_v37.fasta \
-I bundle/b37/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam \
-alleles master.vcf \
-L master.vcf \
-gt_mode GENOTYPE_GIVEN_ALLELES \
-out_mode EMIT_ALL_SITES \
-stand_call_conf 0.0 \
-glm BOTH \
-G none \
The -L master.vcf argument tells the UG to only genotype the sites in the master file. If you don't specify this, the UG will genotype the master sites in GGA mode, but it will also genotype all other sites in the genome in regular mode.
The last item,-G ` prevents the UG from computing annotations you don't need. This command produces something like the following output:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 9999996 . A ACT 4576.19 . . GT:DP:GQ:PL 1/1:76:99:4576,229,0
20 10000000 . T G 0 . . GT:DP:GQ:PL 0/0:79:99:0,238,3093
20 10000211 . C T 857.79 . . GT:AD:DP:GQ:PL 0/1:28,27:55:99:888,0,870
20 10000598 . T A 1800.57 . . GT:AD:DP:GQ:PL 1/1:0,48:48:99:1834,144,0
20 10001436 . A AGG,AGGCT 1921.12 . . GT:DP:GQ:PL 0/2:49:84.06:1960,2065,0,2695,222,84
Several things should be noted here:
This genotyping command can be performed independently per sample, and so can be parallelized easily on a farm with one job per sample, as in the following:
foreach sample in samples:
run UnifiedGenotyper command above with -I $sample.bam -o $sample.vcf
end
You can use a similar command for CombineVariants above to merge back together all of your single sample genotyping runs. Suppose all of my UnifiedGenotyper jobs have completed, and I have VCF files named sample1.vcf, sample2.vcf, to sampleN.vcf. The single command:
java -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37.fasta -V:sample1 sample1.vcf -V:sample2 sample2.vcf [repeat until] -V:sampleN sampleN.vcf -o combined.vcf
GATK-Queue is command-line scripting framework for defining multi-stage genomic analysis pipelines combined with an execution manager that runs those pipelines from end-to-end. Often processing genome data includes several steps to produces outputs, for example our BAM to VCF calling pipeline include among other things:
Running these tools one by one in series may often take weeks for processing, or would require custom scripting to try and optimize using parallel resources.
With a Queue script users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.
You have two options: donwload the binary distribution (prepackaged, ready to run program) or build it from source.
This is obviously the easiest way to go. Links are on the Downloads page.
Briefly, here's what you need to know/do:
Queue is part of the Sting repository. Download the source from our repository on Github. Run the following command:
git clone git://github.com/broadgsa/gatk.git Sting
Use ant to build the source.
cd Sting
ant queue
Queue uses the Ivy dependency manager to fetch all other dependencies. Just make sure you have suitable versions of the JDK and Ant!
See this article on how to test your installation of Queue.
See this article on running Queue for the first time for full details.
Queue arguments can be listed by running with --help
java -jar dist/Queue.jar --help
To list the arguments required by a QScript, add the script with -S and run with --help.
java -jar dist/Queue.jar -S script.scala --help
Note that by default queue runs in a "dry" mode, as explained in the link above. After verifying the generated commands execute the pipeline by adding -run.
See QFunction and Command Line Options for more info on adjusting Queue options.
Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.
Every QScript includes the following steps:
add() to Queue for dispatch and monitoringThe basic command-line to run the Queue pipelines on the command line is
java -jar Queue.jar -S <script>.scala
See the main article Queue QScripts for more info on QScripts.
While most QScripts are analysis pipelines that are custom-built for specific projects, some have been released as supported tools. See
The latest version of the example files are available in the Sting github repository under public/scala/qscript/examples
Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.
Note that Queue attempts to generate a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally the script requires the gsalib to be installed on the machine, which is typically done by providing its path in your .Rprofile file:
bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile .libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")
The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the completed commands. This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves
This feature only works for command line and LSF execution models. SGE should be easy to add for a motivated individual but we cannot test this capabilities here at the Broad. Please send us a patch if you do extend Queue to support SGE.
Queue emits a queue.dot file to help visualize your commands. You can open this file in programs like DOT, OmniGraffle, etc to view your pipelines. By default the system will print out your LSF command lines, but this can be too much in a complex pipeline.
To clarify your pipeline, override the dotString() function:
class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
@Input(doc="foo") var bam = bamIn
@Input(doc="foo") var bamIndex = bai(bamIn)
@Output(doc="foo") var recalData = recalDataIn
memoryLimit = Some(4)
override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
def commandLine = gatkCommandLine("CountCovariates") + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}
Here we only see CountCovariates my.bam [-OQ], for example, in the dot file. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here:
Processing data originated in the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented in numerous occasions. The guidelines we describe in this document were the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (september 2011) for the datasets tested, but there is no guarantee that it will be the best for every dataset and that other pipelines won't supersede it in the future.
The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and allow for relaxing the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea on what parameters to use check suggestions given by the BWA author in the BWA manual page that are specific to Pacbio. The goal is to account for Pacific Biosciences RS known error mode and benefit from the long reads for a high scoring overall match. (for older versions, you can use the filtered_subreads.fasta and combine the base quality scores extracted from the h5 files using Pacific Biosciences SMRT pipeline python tools)
To produce a BAM file that is sorted by coordinate with adequate read group information we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practices to have your BAM file conform to these specifications.
Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according the type of data you have, with the exception of indel realignment, because the tool has not been adapted for Pacific Biosciences RS data.
You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.
Be aware that the Unified Genotyper has cutoffs for base quality score and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here, you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold and modify as necessary.
To account for the high insertion and deletion error rate of the Pacific Biosciences data instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Bioscience data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.
The MapReduce architecture of the GATK allows most walkers in the GATK to be run in a parallel-processing mode. The GATK supports two basic parallel processing models known as shared memory and scatter-gather.
Shared memory parallelism
Parallelism within a single multi-threading process with access to a large, shared RAM. Shared memory parallelism is stable and supported by many tools that access pileups of bases.
Scatter/gather (SG) parallelism
In SG parallelism, the target genomic regions are divided up into N independent GATK instances that are run separately on a single machine or across a computing cluster. The outputs of each independent walker, are merged together once all are completed. SG works very efficiently in the GATK, provided the output of a walker is independent per site (e.g. Unified Genotyper) or per chromosome (e.g. Indel Realigner). SG parallelism is a completely stable approach in the GATK, and used routinely by the GATK team in processing large data sets; it is also natively supported by GATK-Queue, which automatically scatters and gathers GATK processes given a desired N number of processes to execute simultaneously.
Note that parallel-processing will significantly speed up data processing but may produce statistically insignificant differences. While this non-determinism is not ideal in practice the minute differences have been mathematically meaningless while producing consistent results in a reasonable amount of time for whole genome and whole exome data. However, if absolute determinism is more important than speed we recommend you do not use parallelism with the GATK.
There are costs and benefits to each type of parallelism in the GATK, as outlined in the following table.
Comparison of standard parallelism approaches in the GATK
| Property | Shared Memory | Scatter/Gather |
|---|---|---|
| Stability | Stable | Stable | Retired in codebase |
| Applicable walker types | By locus and by ROD only. ReadWalkers are not supported. | All walker types. ReadWalkers can only be split safely by chromosome in general |
| Example walkers | UnifiedGenotyper, BaseRecalibrator, VariantEval | All walkers, including ReadWalkers like IndelRealigner |
| Scalability | Fewer than 32 cores. Each thread operates completely independently, so N threads requires N times more memory than 1 thread alone. Best scaling at 8 or fewer threads. | Hundreds of processes. Limited by capabilities of the underlying storage system, in general. Isilon-class storage can run thousands of jobs effectively. |
| How to enable | Use the -nt argument in the Engine, on any walker that supports shared memory parallelism (the engine will tell you if it does not). |
1. Provide -L interval lists to the GATK; a different one for each of the N independent GATK tools. For example -L chr1 for first process, and -L chr2 for the second. When all processes have finished, merge the output together, as appropriate (e.g. use MergeSam.jar for BAMs, and cat or CombineVariants for VCFs). 2. Use GATK-Queue to automatically divide up your GATK jobs. For example, using this.scatterCount = 10 argument will result in 10 independent processes. |
| Pros | - Easy to enable. - Single output file merged together by internally by the GATK engine - Efficiently uses multi-core processors sharing a single memory space | - Works for all walker types, including ReadWalkers - Scales to hundreds of independent jobs - Easy to enable with single -L argument - Directly supported and managed by GATK-Queue - Totally independent processing per interval -- failed parts can be easily resumed without repeating already successfully processed regions |
| Cons | - Limited to fewer than 32 processors without significant overhead - Limited to cores physically present on the machine, cannot take advantage of computing cluster resources - Does not work for ReadWalkers (Table Recalibrator, Indel Realigner) | - Requires manual preparation of sliced genomic intervals for processing (if you aren't using GATK-Queue). - For ReadWalkers and other tools that can only be processed by chromosome, leading to load balancing problems (chr1 is much bigger than chr22) - Sensitive to data density variation over the genome. Dividing chr20 processing in 63 1MB chunks leads to 10x differences in completion times due to data pileups around the centromere, for example. - Must wait until all parts of the scatter have completed before gathering, making the process sensitive to farm scheduling problems. If a job will run for M minutes, but waits Z minutes to start on the farm, the entire SG process will not complete for at least M + Z minutes. |
Almost certainly, either shared memory or scatter/gather parallelism is the right choice for your NGS pipeline. Our go-to option for parallelism here at the Broad Institute is S/G, since S/G allows you to cut up your jobs into hundreds of pieces to run on our standard computing farm, using GATK-Queue. When I have a small job that needs to be run quickly, am testing out program options or need a quick VariantEval result, I'm using shared memory parallelism with ~10 threads on a single large computer with a lot (>64 GB) of memory.
Basically, if I have N processors, and I want to choose between shared memory or S/G, here's how I would choose:
If all N processors are on a single computer, and my walker supports it, use shared memory parallelism
If not, use S/G
The GATK currently supports a hierarchical version of parallelism. In this form of parallelism, data is divided into shards, each shard is map/reduced independently, and the results are combined with a 'tree reduce' step. While the framework handles much of the heavy lifting of data division required for parallelism, each walker must individually be reviewed to ensure that it isn't tracking state internally in a non-threadsafe way. Many tools support shared memory parallelism, including critical tools such as:
Please review the source to discover if your walker is parallelizable, or attempt to run it with -nt 2 and if it the engine doesn't complain the walker is parallelized.
In shared memory parallelism, each thread operates on a 16 kbp shard of reference sequence in a completely independent memory space from all other threads. Up to 24 threads can run efficiently in this design, but greater parallelism is limited by resource starvation from the single reduce thread and/or memory inefficiency by keeping each thread’s data totally independent. See gatkParallelism performance 082112 these plots for an analysis of the scalability of several key GATK tools as a function of nt.
Run the GATK, using the -nt command-line argument to specify the number of threads that the GATK should attempt to use.
[[image:HierarchicalParallelism.png|thumb|Shared memory parallelism architecture]]
First, be aware that some walkers may, by design, require a rewrite for complete parallelization.
When implementing a standard (non-parallelized) walker, one must implement the reduce method, which combines an individual datum returned by the map function with the aggregate of the prior map calls. When implementing a parallelizable walker, one must also implement the org.broadinstitute.sting.gatk.walkers.TreeReducible interface and the treeReduce() function. The TreeReduce function tells the GATK how to combine two adjacent reduces, rather than one map result and one reduce.
The GATK supports writing to output files from either the map or the reduce when running in parallel. However, only unbuffered writers are currently supported. Please use PrintStream rather than PrintWriter at this time.
The GATK's support for parallelism is currently limited. The following classes of walkers are not supported by our parallelization framework:
Note that each thread operates completely independently in the current GATK implementation of shared memory parallelism. So if you provide 1Gb to the GATK with nt 1, then you should provide 4Gb to run with nt 4. If you don't do this, it is possible to starve out the GATK so that it runs much much slower.
The performance of the multi-threaded GATK is really dependent on whether you are IO or CPU limited and the relative overhead of the parallelism on your computer. Additionally, nt can start to have very high overheads with nt > 20 due to our implementation and memory contention issues.
The best option for nt is a value less or equal to the number of available cores with sufficient memory to run each threads (nt x amount provided to 1 core), capped additionally by the available IO bandwidth so that the individual threads don't starve each other.
Scatter / gather is a simple approach for process-independent parallelism with the GATK. First you scatter multiple independent GATK instances out over a network of computing nodes, each operating on slightly different genomic intervals, and when they all complete, the output of each is gathered up into a merged output dataset. In the GATK S/G is extremely easy to use, as all of the GATK tools support the -L option to operate over only genomic specific intervals, and many tools emit files that can be merged together across multiple regions. Unified Genotyper, for example, can operate over the whole genome in one process, or on each chromosome independently. The output of this later approach, after merging together, should be the same as the whole genome results, minus any slight differences due to random number effects. The simplicity and efficiency of S/G parallelism makes this a key option for getting things done quickly with the GATK.
Using S/G parallelism is extremely easy, either by hand or using the automated Scatter/Gather in GATK-Queue. Suppose I have the following command line:
java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1
This runs a single process of the GATK over chr1, and let's say it takes an hour when I run it. In order to run it with S/G parallelism mode, just partition chr1 into two independent parts:
This file distributed.tracker.txt will contain genomic locations and GATK process ids that are processing each location, in text format, so you can cat it. If you run at the command line:
gsa1> java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1:1-125,000,000 -o my.1.vcf &
gsa1> java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1:125,000,001-249,250,621 -o my.2.vcf &
When these two jobs finish, I just merge the two VCFs together and I've got a complete data set in half the time.
To call variants with the GATK using pedigree information, you should base your workflow on the Best Practices recommendations -- the principles detailed there all apply to pedigree analysis.
But there is one crucial addition: you should make sure to pass a pedigree file (PED file) to all GATK walkers that you use in your workflow. Some will deliver better results if they see the pedigree data.
At the moment there are two of the standard annotations affected by pedigree:
In the specific case of trios, an additional GATK walker, PhaseByTransmission, should be used to obtain trio-aware genotypes as well as phase by descent.
The annotations mentioned above have been adapted for PED files starting with GATK v.1.6. If you already have VCF files generated by an older version of the GATK or have not passed a PED file while running the UnifiedGenotyper or VariantAnnotator, you should do the following:
-G StandardAnnotation to VariantAnnotator. Make sure you pass your PED file to the VariantAnnotator as well!The PED files used as input for these tools are based on PLINK pedigree files. The general description can be found here.
For these tools, the PED files must contain only the first 6 columns from the PLINK format PED file, and no alleles, like a FAM file in PLINK.
The GATK provides an implementation of the Per-Base Alignment Qualities (BAQ) developed by Heng Li in late 2010. See this SamTools page for more details.
The BAQ algorithm is applied by the GATK engine itself, which means that all GATK walkers can potentially benefit from it. By default, BAQ is OFF, meaning that the engine will not use BAQ quality scores at all.
The GATK engine accepts the argument -baq with the following enum values:
public enum CalculationMode {
OFF, // don't apply a BAQ at all, the default
CALCULATE_AS_NECESSARY, // do HMM BAQ calculation on the fly, as necessary, if there's no tag
RECALCULATE // do HMM BAQ calculation on the fly, regardless of whether there's a tag present
}
If you want to enable BAQ, the usual thing to do is CALCULATE_AS_NECESSARY, which will calculate BAQ values if they are not in the BQ read tag. If your reads are already tagged with BQ values, then the GATK will use those. RECALCULATE will always recalculate the BAQ, regardless of the tag, which is useful if you are experimenting with the gap open penalty (see below).
If you are really an expert, the GATK allows you to specify the BAQ gap open penalty (-baqGOP) to use in the HMM. This value should be 40 by default, a good value for whole genomes and exomes for highly sensitive calls. However, if you are analyzing exome data only, you may want to use 30, which seems to result in more specific call set. We continue to play with these values some. Some walkers, where BAQ would corrupt their analyses, forbid the use of BAQ and will throw an exception if -baq is provided.
For UnifiedGenotyper to get more specific SNP calls.
For PrintReads to write out a BAM file with BAQ tagged reads
For TableRecalibrator or IndelRealigner to write out a BAM file with BAQ tagged reads. Make sure you use -baq RECALCULATE so the engine knows to recalculate the BAQ after these tools have updated the base quality scores or the read alignments. Note that both of these tools will not use the BAQ values on input, but will write out the tags for analysis tools that will use them.
Note that some tools should not have BAQ applied to them.
This last option will be a particularly useful for people who are already doing base quality score recalibration. Suppose I have a pipeline that does:
RealignerTargetCreator
IndelRealigner
CountCovariates
TableRecalibrate
UnifiedGenotyper
A highly efficient BAQ extended pipeline would look like
RealignerTargetCreator
IndelRealigner // don't bother with BAQ here, since we will calculate it in table recalibrator
CountCovariates
TableRecalibrate -baq RECALCULATE // now the reads will have a BAQ tag added. Slows the tool down some
UnifiedGenotyper -baq CALCULATE_AS_NECESSARY // UG will use the tags from TableRecalibrate, keeping UG fast
Walkers can control via the @BAQMode annotation how the BAQ calculation is applied. Can either be as a tag, by overwriting the qualities scores, or by only returning the baq-capped qualities scores. Additionally, walkers can be set up to have the BAQ applied to the incoming reads (ON_INPUT, the default), to output reads (ON_OUTPUT), or HANDLED_BY_WALKER, which means that calling into the BAQ system is the responsibility of the individual walker.
This script can be used for sorting an input file based on a reference.
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
sub usage {
print "\nUsage:\n";
print "sortByRef.pl [--k POS] INPUT REF_DICT\n\n";
print " Sorts lines of the input file INFILE according\n";
print " to the reference contig order specified by the\n";
print " reference dictionary REF_DICT (.fai file).\n";
print " The sort is stable. If -k option is not specified,\n";
print " it is assumed that the contig name is the first\n";
print " field in each line.\n\n";
print " INPUT input file to sort. If '-' is specified, \n";
print " then reads from STDIN.\n";
print " REF_DICT .fai file, or ANY file that has contigs, in the\n";
print " desired soting order, as its first column.\n";
print " --k POS : contig name is in the field POS (1-based)\n";
print " of input lines.\n\n";
exit(1);
}
my $pos = 1;
GetOptions( "k:i" => \$pos );
$pos--;
usage() if ( scalar(@ARGV) == 0 );
if ( scalar(@ARGV) != 2 ) {
print "Wrong number of arguments\n";
usage();
}
my $input_file = $ARGV[0];
my $dict_file = $ARGV[1];
open(DICT, "< $dict_file") or die("Can not open $dict_file: $!");
my %ref_order;
my $n = 0;
while ( <DICT> ) {
chomp;
my ($contig, $rest) = split "\t";
die("Dictionary file is probably corrupt: multiple instances of contig $contig") if ( defined $ref_order{$contig} );
$ref_order{$contig} = $n;
$n++;
}
close DICT;
#we have loaded contig ordering now
my $INPUT;
if ($input_file eq "-" ) {
$INPUT = "STDIN";
} else {
open($INPUT, "< $input_file") or die("Can not open $input_file: $!");
}
my %temp_outputs;
while ( <$INPUT> ) {
my @fields = split '\s';
die("Specified field position exceeds the number of fields:\n$_")
if ( $pos >= scalar(@fields) );
my $contig = $fields[$pos];
if ( $contig =~ m/:/ ) {
my @loc = split(/:/, $contig);
# print $contig . " " . $loc[0] . "\n";
$contig = $loc[0]
}
chomp $contig if ( $pos == scalar(@fields) - 1 ); # if last field in line
my $order;
if ( defined $ref_order{$contig} ) { $order = $ref_order{$contig}; }
else {
$order = $n; # input line has contig that was not in the dict;
$n++; # this contig will go at the end of the output,
# after all known contigs
}
my $fhandle;
if ( defined $temp_outputs{$order} ) { $fhandle = $temp_outputs{$order} }
else {
#print "opening $order $$ $_\n";
open( $fhandle, " > /tmp/sortByRef.$$.$order.tmp" ) or
die ( "Can not open temporary file $order: $!");
$temp_outputs{$order} = $fhandle;
}
# we got the handle to the temp file that keeps all
# lines with contig $contig
print $fhandle $_; # send current line to its corresponding temp file
}
close $INPUT;
foreach my $f ( values %temp_outputs ) { close $f; }
# now collect back into single output stream:
for ( my $i = 0 ; $i < $n ; $i++ ) {
# if we did not have any lines on contig $i, then there's
# no temp file and nothing to do
next if ( ! defined $temp_outputs{$i} ) ;
my $f;
open ( $f, "< /tmp/sortByRef.$$.$i.tmp" );
while ( <$f> ) { print ; }
close $f;
unlink "/tmp/sortByRef.$$.$i.tmp";
}
For a complete, detailed argument reference, refer to the GATK document page here.
DepthOfCoverage is a coverage profiler for a (possibly multi-sample) bam file. It uses a granular histogram that can be user-specified to present useful aggregate coverage data. It reports the following metrics over the entire .bam file:
Because the common question "What proportion of my targeted bases are well-powered to discover SNPs?" is answered by the last matrix on the above list, it is strongly recommended that this walker be run on all samples simultaneously.
For humans, DepthOfCoverage can also be configured to output these statistics aggregated over genes, by providing it with a RefSeq ROD.
DepthOfCoverage also outputs, by default, the total coverage at every locus, and the coverage per sample and/or read group. This behavior can optionally be turned off, or switched to base count mode, where base counts will be output at each locus, rather than total depth.
To get a summary of coverage by each gene, you may supply a refseq (or alternative) gene list via the argument
-geneList /path/to/gene/list.txt
The provided gene list must be of the following format:
585 NM_001005484 chr1 + 58953 59871 58953 59871 1 58953, 59871, 0 OR4F5 cmpl cmpl 0,
587 NM_001005224 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F3 cmpl cmpl 0,
587 NM_001005277 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F16 cmpl cmpl 0,
587 NM_001005221 chr1 + 357521 358460 357521 358460 1 357521, 358460, 0 OR4F29 cmpl cmpl 0,
589 NM_001005224 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F3 cmpl cmpl 0,
589 NM_001005277 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F16 cmpl cmpl 0,
589 NM_001005221 chr1 - 610958 611897 610958 611897 1 610958, 611897, 0 OR4F29 cmpl cmpl 0,
If you are on the broad network, the properly-formatted file containing refseq genes and transcripts is located at
/humgen/gsa-hpprojects/GATK/data/refGene.sorted.txt
If you supply the -geneList argument, DepthOfCoverage v3.0 will output an additional summary file that looks as follows:
Gene_Name Total_Cvg Avg_Cvg Sample_1_Total_Cvg Sample_1_Avg_Cvg Sample_1_Cvg_Q3 Sample_1_Cvg_Median Sample_1_Cvg_Q1
SORT1 594710 238.27 594710 238.27 165 245 330
NOTCH2 3011542 357.84 3011542 357.84 222 399 >500
LMNA 563183 186.73 563183 186.73 116 187 262
NOS1AP 513031 203.50 513031 203.50 91 191 290
Note that the gene coverage will be aggregated only over samples (not read groups, libraries, or other types). The -geneList argument also requires specific intervals within genes to be given (say, the particular exons you are interested in, or the entire gene), and it functions by aggregating coverage from the interval level to the gene level, by referencing each interval to the gene in which it falls. Because by-gene aggregation looks for intervals that overlap genes, -geneList is ignored if -omitIntervals is thrown.
Introduction
SelectVariants is a GATK tool used to subset a VCF file by many arbitrary criteria listed in the command line options below. The output VCF wiil have the AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) annotations updated as necessary to accurately reflect the file's new contents.
Select Variants operates on VCF files (ROD Tracks) provided in the command line using the GATK's built in --variant option. You can provide multiple tracks for Select Variants but at least one must be named 'variant' and this will be the file all your analysis will be based of. Other tracks can be named as you please. Options requiring a reference to a ROD track name will use the track name provided in the -B option to refer to the correct VCF file (e.g. --discordance / --concordance ). All other analysis will be done in the 'variant' track.
Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, i.e. "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented here in the FAQ article on JEXL expressions; it is particularly important to note the section on working with complex expressions.
For a complete, detailed argument reference, refer to the GATK document page here.
Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.
BOB MARY LINDA
1/0:20 0/0:30 1/1:50
In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).
Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:
BOB MARY
1/0:20 0/0:30
The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.
Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like
BOB MARY
1/0:20 ./.:.
with AN=2, AC=1, AF=0.5, and DP=20.
SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:
1 82154 rs4477212 A G . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205
1 534247 SNP1-524110 C T . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471
1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942
Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().
isVariant => is there an ALT allele?
isPolymorphic => is some sample non-ref in the samples?
In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this.
For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:
1 82154 rs4477212 A . . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205
1 534247 SNP1-524110 C . . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471
1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942
If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.
Some VCFs may have repeated header entries with the same key name, for instance:
##fileformat=VCFv3.3
##FILTER=ABFilter,"AB > 0.75"
##FILTER=HRunFilter,"HRun > 3.0"
##FILTER=QDFilter,"QD < 5.0"
##UG_bam_file_used=file1.bam
##UG_bam_file_used=file2.bam
##UG_bam_file_used=file3.bam
##UG_bam_file_used=file4.bam
##UG_bam_file_used=file5.bam
##source=UnifiedGenotyper
##source=VariantFiltration
##source=AnnotateVCFwithMAF
...
Here, the "UG_bam_file_used" and "source" header lines appear multiple times. When SelectVariants is run on such a file, the program will emit warnings that these repeated header lines are being discarded, resulting in only the first instance of such a line being written to the resulting VCF. This behavior is not ideal, but expected under the current architecture.
For information on how to construct regular expressions for use with this tool, see the "Summary of regular-expression constructs" section here.
2 SNPs with significant strand bias

Several SNPs with excessive coverage

For a complete, detailed argument reference, refer to the GATK document page here.
In addition to true variation, variant callers emit a number of false-positives. Some of these false-positives can be detected and rejected by various statistical tests. VariantAnnotator provides a way of annotating variant calls as preparation for executing these tests.
Description of the haplotype score annotation

The list below is not comprehensive. Please use the --list argument to get a list of all possible annotations available. Also, see the FAQ article on understanding the Unified Genotyper's VCF files for a description of some of the more standard annotations.
Note that technically the VariantAnnotator does not require reads (from a BAM file) to run; if no reads are provided, only those Annotations which don't use reads (e.g. Chromosome Counts) will be added. But most Annotations do require reads. When running the tool we recommend that you add the -L argument with the variant rod to your command line for efficiency and speed.
The variant quality score recalibrator builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant is a true genetic variant or a machine artifact. VQSR must be run twice in succession in order to build a separate error model for SNPs and INDELs. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R path/to/reference/human_g1k_v37.fasta \ -input raw.input.vcf \ -recalFile path/to/output.recal \ -tranchesFile path/to/output.tranches \ -nt 4 \ [SPECIFY TRUTH AND TRAINING SETS] \ [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING] \ [SPECIFY WHICH CLASS OF VARIATION TO MODEL] \
For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. In addition we take the highest confidence SNPs from the project's callset. These datasets are available in the GATK resource bundle. Arguments for VariantRecalibrator command:
-percentBad 0.01 -minNumBad 1000 \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \ -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.sites.vcf \ -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \ -an QD -an MQRankSum -an ReadPosRankSum -an FS -an DP \ -mode SNP \
Note that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, DP, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.
Also, using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.
Some of these annotations might not be the best for your particular dataset. For example, InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included.
Depth of coverage (the DP annotation invoked by Coverage) should not be used when working with hybrid capture datasets since there is extreme variation in the depth to which targets are captured! In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.
Additionally, the UnifiedGenotyper produces a statistic called the HaplotypeScore which should be used for SNPs. This statistic isn't necessary for the HaplotypeCaller because that mathematics is already built into the likelihood function itself when calling full haplotypes.
In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP and/or indel callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:
When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle. Arguments for VariantRecalibrator:
--maxGaussians 4 -percentBad 0.01 -minNumBad 1000 \ -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.b37.vcf \ -an DP -an FS -an ReadPosRankSum -an MQRankSum \ -mode INDEL \
Note that indels use a different set of annotations than SNPs. Most annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner. This covariation is not necessarily indicative of being an error in the same way that it is for SNPs.
InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. If your study design has more than 10 samples then it is recommended to be included.
The power of the VQSR is that it assigns a calibrated probability to every putative mutation in the callset. The user is then able to decide at what point on the theoretical ROC curve their project wants to live. Some projects, for example, are interested in finding every possible mutation and can tolerate a higher false positive rate. On the other hand, some projects want to generate a ranked list of mutations that they are very certain are real and well supported by the underlying data. The VQSR provides the necessary statistical machinery to effectively apply this sensitivity/specificity tradeoff.
java -Xmx3g -jar GenomeAnalysisTK.jar \ -T ApplyRecalibration \ -R reference/human_g1k_v37.fasta \ -input raw.input.vcf \ -tranchesFile path/to/input.tranches \ -recalFile path/to/input.recal \ -o path/to/output.recalibrated.filtered.vcf \ [SPECIFY THE DESIRED LEVEL OF SENSITIVITY TO TRUTH SITES] \ [SPECIFY WHICH CLASS OF VARIATION WAS MODELED] \
For SNPs we used HapMap 3.3 and the Omni 2.5M chip as our truth set. The default recommendation is to achieve 99.9% sensitivity to the accessible truth sites. Naturally projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity.
--ts_filter_level 99.9 \ -mode SNP \
For indels we use the Mills / 1000 Genomes indel truth set described above. The default recommendation is to achieve 99.9% sensitivity to the accessible truth sites. Naturally projects involving a higher degree of diversity in terms of world populations can expect to achieve a higher truth sensitivity.
--ts_filter_level 99.9 \ -mode INDEL \
JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.
In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.
JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:
"QUAL > 30.0"
QUAL is a key: the name of the annotation we want to look at30.0 is a value: the threshold that we want to use to evaluate variant quality against> is an operator: it determines which "side" of the threshold we want to selectThe complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:
"MY_STRING_KEY == 'foo'"
You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:
"QUAL / DP < 10.0"
You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):
"QUAL > 30.0 && DP == 10"
where && is the logical "AND".
Or if you want to select variants that have at least one of several conditions fulfilled:
"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
where || is the logical "OR".
It is very important to note that the JEXL evaluation subprogram cannot correctly handle cases where the annotations requested by the JEXL expression are missing for some variants in a VCF record. It will throw an exception (i.e. fail with an error) when it encounters this scenario. The default behavior of the GATK is to handle this by having the entire expression evaluate to FALSE in such cases (although some tools provide options to change this behavior). This is extremely important especially when constructing complex expressions, because it affects how you should interpret the result.
For example, looking again at that last expression:
"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
When run against a VCF record with INFO field
QD=10.0;FS=300.0;ReadPosRankSum=-10.0
it will evaluate to TRUE because the FS value is greater than 200.0.
But when run against a VCF record with INFO field
QD=10.0;FS=300.0
it will evaluate to FALSE because there is no ReadPosRankSum value defined at all and JEXL fails to evaluate it.
This means that when you're trying to filter out records with VariantFiltration, for example, the previous record would be marked as PASSing, even though it contains a bad FS value.
For this reason, we highly recommend that complex expressions involving OR operations be split up into separate expressions whenever possible. For example, the previous example would have 3 distinct expressions: "QD < 2.0", "ReadPosRankSum < -20.0", and "FS > 200.0". This way, although the ReadPosRankSum expression evaluates to FALSE when the annotation is missing, the record can still get filtered (again using the example of VariantFiltration) when the FS value is greater than 200.0.
Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.
The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).
Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.
If you are familiar with the VariantContext, Genotype and its associated classes and methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the vc variable.
For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'
Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:
! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263
The classic way of evaluating a boolean goes like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'
But you can also use the VariantContext object like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'
Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
VariantEval accepts two types of modules: stratification and evaluation modules.
CpG is a three-state stratification:
A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.
EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.
Sample is an N-state stratification, where N is the number of samples in the eval files.
Filter is a three-state stratification:
FunctionalClass is a four-state stratification:
CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.
Degeneracy is a six-state stratification:
See the [http://en.wikipedia.org/wiki/Genetic_code#Degeneracy Wikipedia page on degeneracy] for more information.
JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See [[Using JEXL expressions]]
Novelty is a three-state stratification:
CountVariants is an evaluation module that computes the following metrics:
| Metric | Definition |
|---|---|
| nProcessedLoci | Number of processed loci |
| nCalledLoci | Number of called loci |
| nRefLoci | Number of reference loci |
| nVariantLoci | Number of variant loci |
| variantRate | Variants per loci rate |
| variantRatePerBp | Number of variants per base |
| nSNPs | Number of snp loci |
| nInsertions | Number of insertion |
| nDeletions | Number of deletions |
| nComplex | Number of complex loci |
| nNoCalls | Number of no calls loci |
| nHets | Number of het loci |
| nHomRef | Number of hom ref loci |
| nHomVar | Number of hom var loci |
| nSingletons | Number of singletons |
| heterozygosity | heterozygosity per locus rate |
| heterozygosityPerBp | heterozygosity per base pair |
| hetHomRatio | heterozygosity to homozygosity ratio |
| indelRate | indel rate (insertion count + deletion count) |
| indelRatePerBp | indel rate per base pair |
| deletionInsertionRatio | deletion to insertion ratio |
CompOverlap is an evaluation module that computes the following metrics:
| Metric | Definition |
|---|---|
| nEvalSNPs | number of eval SNP sites |
| nCompSNPs | number of comp SNP sites |
| novelSites | number of eval sites outside of comp sites |
| nVariantsAtComp | number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same) |
| compRate | percentage of eval sites at comp sites |
| nConcordant | number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele) |
| concordantRate | the concordance rate |
A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):
$ grep -v '##' dbsnp.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10327 rs112750067 T C . . ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132
Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:
$ grep -v '##' eval_correct_allele.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6
1 10327 . T C 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059
Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:
$ grep -v '##' eval_incorrect_allele.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6
1 10327 . T A 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059
Running VariantEval with just the CompOverlap module:
$ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \
-R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
-L 1:10327 \
-B:dbsnp,VCF dbsnp.vcf \
-B:eval_correct_allele,VCF eval_correct_allele.vcf \
-B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \
-noEV \
-EV CompOverlap \
-o eval.table
We find that the eval.table file contains the following:
$ grep -v '##' eval.table | column -t
CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants nCompVariants novelSites nVariantsAtComp compRate nConcordant concordantRate
CompOverlap dbsnp eval_correct_allele none all 1 1 0 1 100.00000000 1 100.00000000
CompOverlap dbsnp eval_correct_allele none known 1 1 0 1 100.00000000 1 100.00000000
CompOverlap dbsnp eval_correct_allele none novel 0 0 0 0 0.00000000 0 0.00000000
CompOverlap dbsnp eval_incorrect_allele none all 1 1 0 1 100.00000000 0 0.00000000
CompOverlap dbsnp eval_incorrect_allele none known 1 1 0 1 100.00000000 0 0.00000000
CompOverlap dbsnp eval_incorrect_allele none novel 0 0 0 0 0.00000000 0 0.00000000
As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.
TiTvVariantEvaluator is an evaluation module that computes the following metrics:
| Metric | Definition |
|---|---|
| nTi | number of transition loci |
| nTv | number of transversion loci |
| tiTvRatio | the transition to transversion ratio |
| nTiInComp | number of comp transition sites |
| nTvInComp | number of comp transversion sites |
| TiTvRatioStandard | the transition to transversion ratio for comp sites |
The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. Only the standard ACGT bases are accepted; no non-standard bases (W, for example) are tolerated. Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see this article for more information on preparing FASTA reference sequences for use with the GATK.
If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The contig ordering in the reference you used must exactly match that of one of the official references canonical orderings. These are defined by historical karotyping of largest to smallest chromosomes, followed by the X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last. The GATK will detect misordered contigs (for example, lexicographically sorted) and throw an error. This draconian approach, though unnecessary technically, ensures that all supplementary data provided with the GATK works correctly. You can use ReorderSam to fix a BAM file aligned to a missorted reference sequence.
Our Best Practice recommendation is that you use a standard GATK reference from the GATK resource bundle.
The only input format for NGS reads that the GATK supports is the [Sequence Alignment/Map (SAM)] format. See [SAM/BAM] for more details on the SAM/BAM format as well as Samtools and Picard, two complementary sets of utilities for working with SAM/BAM files.
In addition to being in SAM format, we require the following additional constraints in order to use your file with the GATK:
.bam file extension).Below is an example well-formed SAM field header and fields from the 1000 Genomes Project:
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
@SQ SN:4 LN:191154276 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:23dccd106897542ad87d2765d28a19a1
@SQ SN:5 LN:180915260 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:0740173db9ffd264d728f32784845cd7
@SQ SN:6 LN:171115067 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1d3a93a248d92a729ee764823acbbc6b
@SQ SN:7 LN:159138663 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:618366e953d6aaad97dbe4777c29375e
@SQ SN:8 LN:146364022 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:96f514a9929e410c6651697bded59aec
@SQ SN:9 LN:141213431 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:3e273117f15e0a400f01055d9f393768
@SQ SN:10 LN:135534747 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:988c28e000e84c26d552359af1ea2e1d
@SQ SN:11 LN:135006516 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:98c59049a2df285c76ffb1c6db8f8b96
@SQ SN:12 LN:133851895 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:51851ac0e1a115847ad36449b0015864
@SQ SN:13 LN:115169878 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:283f8d7892baa81b510a015719ca7b0b
@SQ SN:14 LN:107349540 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:98f3cae32b2a2e9524bc19813927542e
@SQ SN:15 LN:102531392 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:e5645a794a8238215b2cd77acb95a078
@SQ SN:16 LN:90354753 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:fc9b1a7b42b97a864f56b348b06095e6
@SQ SN:17 LN:81195210 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:351f64d4f4f9ddd45b35336ad97aa6de
@SQ SN:18 LN:78077248 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c
@SQ SN:19 LN:59128983 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1aacd71f30db8e561810913e0b72636d
@SQ SN:20 LN:63025520 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:0dec9660ec1efaaf33281c0d5ea2560f
@SQ SN:21 LN:48129895 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:2979a6085bfe28e3ad6f552f361ed74d
@SQ SN:22 LN:51304566 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:a718acaa6135fdca8357d5bfe94211dd
@SQ SN:X LN:155270560 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:7e0e2e580297b7764e31dbc80c2540dd
@SQ SN:Y LN:59373566 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:1fa3474750af0948bdf97d5a0ee52e51
@SQ SN:MT LN:16569 AS:NCBI37 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:c68f52674c9fb33aef52dcf399755519
@RG ID:ERR000162 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR000252 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001684 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001685 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001686 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001687 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001688 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001689 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR001690 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002307 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002308 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002309 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002310 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002311 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002312 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002313 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@RG ID:ERR002434 PL:ILLUMINA LB:g1k-sc-NA12776-CEU-1 PI:200 DS:SRP000031 SM:NA12776 CN:SC
@PG ID:GATK TableRecalibration VN:v2.2.16 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, DinucCovariate, CycleCovariate], use_original_quals=true, defau
t_read_group=DefaultReadGroup, default_platform=Illumina, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, except on_if_no_tile=false, pQ=5, maxQ=40, smoothing=137 UR:file:/lustre/scratch102/projects/g1k/ref/main_project/human_g1k_v37.fasta M5:b4eb71ee878d3706246b7c1dbef69299
@PG ID:bwa VN:0.5.5
ERR001685.4315085 16 1 9997 25 35M * 0 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT ?8:C7ACAABBCBAAB?CCAABBEBA@ACEBBB@? XT:A:U XN:i:4 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001685 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:>>:>2>>>>>>>>>>>>>>>>>>?>>>>??>???>
ERR001689.1165834 117 1 9997 0 * = 9997 0 CCGATCTAGGGTTAGGGTTAGGGTTAGGGTTAGGG >7AA<@@C?@?B?B??>9?B??>A?B???BAB??@ RG:Z:ERR001689 OQ:Z:>:<<8<<<><<><><<>7<>>>?>>??>???????
ERR001689.1165834 185 1 9997 25 35M = 9997 0 CCGATCTCCCTAACCCTAACCCTAACCCTAACCCT 758A:?>>8?=@@>>?;4<>=??@@==??@?==?8 XT:A:U XN:i:4 SM:i:25 AM:i:0 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 RG:Z:ERR001689 NM:i:6 MD:Z:0N0N0N0N1A0A28 OQ:Z:;74>7><><><>>>>><:<>>>>>>>>>>>>>>>>
ERR001688.2681347 117 1 9998 0 * = 9998 0 CGATCTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG 5@BA@A6B???A?B??>B@B??>B@B??>BAB??? RG:Z:ERR001688 OQ:Z:=>>>><4><<?><??????????????????????
The GATK requires that the BAM file be sorted in the same order as the reference. Unfortunately, many BAM files have headers that are sorted in some other order -- lexicographical order is a common alternative. To resort the BAM file please use ReorderSam.
The GATK accept interval files for processing subsets of the genome in Picard-style interval lists. These files have a .interval_list extension and look like this:
@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 SP:Homo Sapiens
@SQ SN:2 LN:243199373 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a0d9851da00400dec1098a9255ac712e SP:Homo Sapiens
@SQ SN:3 LN:198022430 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:fdfd811849cc2fadebc929bb925902e5 SP:Homo Sapiens
@SQ SN:4 LN:191154276 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:23dccd106897542ad87d2765d28a19a1 SP:Homo Sapiens
@SQ SN:5 LN:180915260 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:0740173db9ffd264d728f32784845cd7 SP:Homo Sapiens
@SQ SN:6 LN:171115067 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1d3a93a248d92a729ee764823acbbc6b SP:Homo Sapiens
@SQ SN:7 LN:159138663 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:618366e953d6aaad97dbe4777c29375e SP:Homo Sapiens
@SQ SN:8 LN:146364022 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:96f514a9929e410c6651697bded59aec SP:Homo Sapiens
@SQ SN:9 LN:141213431 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:3e273117f15e0a400f01055d9f393768 SP:Homo Sapiens
@SQ SN:10 LN:135534747 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:988c28e000e84c26d552359af1ea2e1d SP:Homo Sapiens
@SQ SN:11 LN:135006516 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:98c59049a2df285c76ffb1c6db8f8b96 SP:Homo Sapiens
@SQ SN:12 LN:133851895 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:51851ac0e1a115847ad36449b0015864 SP:Homo Sapiens
@SQ SN:13 LN:115169878 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:283f8d7892baa81b510a015719ca7b0b SP:Homo Sapiens
@SQ SN:14 LN:107349540 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:98f3cae32b2a2e9524bc19813927542e SP:Homo Sapiens
@SQ SN:15 LN:102531392 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:e5645a794a8238215b2cd77acb95a078 SP:Homo Sapiens
@SQ SN:16 LN:90354753 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:fc9b1a7b42b97a864f56b348b06095e6 SP:Homo Sapiens
@SQ SN:17 LN:81195210 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:351f64d4f4f9ddd45b35336ad97aa6de SP:Homo Sapiens
@SQ SN:18 LN:78077248 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c SP:Homo Sapiens
@SQ SN:19 LN:59128983 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1aacd71f30db8e561810913e0b72636d SP:Homo Sapiens
@SQ SN:20 LN:63025520 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:0dec9660ec1efaaf33281c0d5ea2560f SP:Homo Sapiens
@SQ SN:21 LN:48129895 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:2979a6085bfe28e3ad6f552f361ed74d SP:Homo Sapiens
@SQ SN:22 LN:51304566 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:a718acaa6135fdca8357d5bfe94211dd SP:Homo Sapiens
@SQ SN:X LN:155270560 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:7e0e2e580297b7764e31dbc80c2540dd SP:Homo Sapiens
@SQ SN:Y LN:59373566 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:1fa3474750af0948bdf97d5a0ee52e51 SP:Homo Sapiens
@SQ SN:MT LN:16569 AS:GRCh37 UR:http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta M5:c68f52674c9fb33aef52dcf399755519 SP:Homo Sapiens
1 30366 30503 + target_1
1 69089 70010 + target_2
1 367657 368599 + target_3
1 621094 622036 + target_4
1 861320 861395 + target_5
1 865533 865718 + target_6
...
consisting of a SAM-file-like sequence dictionary (the header), and targets in the form of .dict extension) and your intervals into one file.
You can also specify a list of intervals in a .interval_list file formatted as
Finally, we also accept BED style interval lists. Warning: this file format is 0-based for the start coordinates, so coordinates taken from 1-based formats should be offset by 1.
The GATK can associate arbitrary reference ordered data (ROD) files with named tracks for all tools. Some tools require specific ROD data files for processing, and developers are free to write tools that access arbitrary data sets using the ROD interface. The general ROD system has the following syntax:
-argumentName:name,type file
Where name is the name in the GATK tool (like "eval" in VariantEval), type is the type of the file, such as VCF or dbSNP, and file is the path to the file containing the ROD data.
The GATK supports several common file formats for reading ROD data:
Note that we no longer support the PED format. See here for converting .ped files to VCF.
Since September, 2010, the GATK has had a "phone-home" feature that sends us information about each GATK run via the Broad filesystem (within the Broad) and Amazon's S3 cloud storage service (outside the Broad). This feature is enabled by default.
The information provided by the phone-home feature is critical in driving improvements to the GATK
Below are two example GATK Run Reports showing exactly what information is sent to us each time the GATK phones home.
<GATK-run-report>
<id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
<start-time>2012/03/10 20.21.19</start-time>
<end-time>2012/03/10 20.21.19</end-time>
<run-time>0</run-time>
<walker-name>CountReads</walker-name>
<svn-version>1.4-483-g63ecdb2</svn-version>
<total-memory>85000192</total-memory>
<max-memory>129957888</max-memory>
<user-name>depristo</user-name>
<host-name>10.0.1.10</host-name>
<java>Apple Inc.-1.6.0_26</java>
<machine>Mac OS X-x86_64</machine>
<iterations>105</iterations>
</GATK-run-report>
<GATK-run-report>
<id>yX3AnltsqIlXH9kAQqTWHQUd8CQ5bikz</id>
<exception>
<message>Failed to parse Genome Location string: 20:10,000,000-10,000,001x</message>
<stacktrace class="java.util.ArrayList">
<string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:377)</string>
<string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
<string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
<string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
<string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
</stacktrace>
<cause>
<message>Position: '10,000,001x' contains invalid chars.</message>
<stacktrace class="java.util.ArrayList">
<string>org.broadinstitute.sting.utils.GenomeLocParser.parsePosition(GenomeLocParser.java:411)</string>
<string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:374)</string>
<string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
<string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
<string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
<string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
<string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
<string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
</stacktrace>
<is-user-exception>false</is-user-exception>
</cause>
<is-user-exception>true</is-user-exception>
</exception>
<start-time>2012/03/10 20.19.52</start-time>
<end-time>2012/03/10 20.19.52</end-time>
<run-time>0</run-time>
<walker-name>CountReads</walker-name>
<svn-version>1.4-483-g63ecdb2</svn-version>
<total-memory>85000192</total-memory>
<max-memory>129957888</max-memory>
<user-name>depristo</user-name>
<host-name>10.0.1.10</host-name>
<java>Apple Inc.-1.6.0_26</java>
<machine>Mac OS X-x86_64</machine>
<iterations>0</iterations>
</GATK-run-report>
Note that as of GATK 1.5 we no longer collect information about the command-line executed, the working directory, or tmp directory.
The GATK is currently in the process of evolving to require interaction with Amazon S3 as a normal part of each run. For this reason, and because the information contained in the GATK run reports is so critical in driving improvements to the GATK, we strongly discourage our users from disabling the phone-home feature.
At the same time, we recognize that some of our users do have legitimate reasons for needing to run the GATK with phone-home disabled, and we don't wish to make it impossible for these users to run the GATK.
Technical reasons: Your local network might have restrictions in place that don't allow the GATK to access external resources, or you might need to run the GATK in a network-less environment.
Organizational reasons: Your organization's policies might forbid the dissemination of one or more pieces of information contained in the GATK run report.
For such users we have provided an -et NO_ET option in the GATK to disable the phone-home feature. To use this option in GATK 1.5 and later, you need to contact us to request a key. Instructions for doing so are below.
To obtain a GATK key, please fill out the request form.
Running the GATK with a key is simple: you just need to append a -K your.key argument to your customary command line, where your.key is the path to the key file you obtained from us:
java -jar dist/GenomeAnalysisTK.jar \
-T PrintReads \
-I public/testdata/exampleBAM.bam \
-R public/testdata/exampleFASTA.fasta \
-et NO_ET \
-K your.key
The -K argument is only necessary when running the GATK with the NO_ET option.
If you get an error message from the GATK saying that your key is corrupt, unreadable, or has been revoked, please email '''gsahelp@broadinstitute.org''' to ask for a replacement key.
If you get an error message stating that the GATK public key could not be located or read, then something is likely wrong with your build of the GATK. If you're running the binary release, try downloading it again. If you're compiling from source, try doing an ant clean and re-compiling. If all else fails, please ask for help on our community forum.
We use the phone home data for three main purposes. First, we monitor the input logs for errors that occur in the GATK, and proactively fix them in the codebase. Second, we monitor the usage rates of the GATK in general and specific versions of the GATK to explain how widely used the GATK is to funding agencies and other potential supporters. Finally, we monitor adoption rates of specific GATK tools to understand how quickly new tools reach our users. Many of these analyses require us to aggregate the data by unique user, which is why we still collect the username of the individual who ran the GATK (as you can see in the plots). Examples of all three uses are shown in the Tableau graphs below, which update each night and are sent to the GATK members each morning for review.
You probably know by now that GATK-Lite is a free-for-everyone and completely open-source version of the GATK (licensed under the original MIT license).
But what's in the box? What can GATK-Lite do -- or rather, what can it not do that the full version (let's call it GATK-Full) can? And what does that mean exactly, in terms of functionality, reliability and power?
To really understand the differences between GATK-Lite and GATK-Full, you need some more information on how the GATK works, and how we work to develop and improve it.
As explained here, the engine handles all the common work that's related to data access, conversion and traversal, as well as high-performance computing features. The engine is supported by an infrastructure of software libraries. If the GATK was a car, that would be the engine and chassis. What we call the **tools* are attached on top of that, and they provide the various analytical and processing functionalities like variant calling and base or variant recalibration. On your car, that would be headlights, airbags and so on.

We do all our development work on a single codebase. This means that everything --the engine and all tools-- is on one common workbench. There are not different versions that we work on in parallel -- that would be crazy to manage! That's why the version numbers of GATK-Lite and GATK-Full always match: if the latest GATK-Full version is numbered 2.1-13, then the latest GATK-Lite is also numbered 2.1-13.
The most important consequence of this setup is that when we make improvements to the infrastructure and engine, the same improvements will end up in GATK Lite and in GATK Full. So for the purposes of power, speed and robustness of the GATK that is determined by the engine, there is no difference between them.
For the tools, it's a little more complicated -- but not much. When we "build" the GATK binaries (the .jar files), we put everything from the workbench into the Full build, but we only put a subset into the Lite build. Note that this Lite subset is pretty big -- it contains all the tools that were previously available in GATK 1.x versions, and always will. We also reserve the right to add previews or not-fully-featured versions of the new tools that are in Full, at our discretion, to the Lite build.
We have a new tool that performs a brand new function (which wasn't available in GATK 1.x), and we only include it in the Full build.
We have a tool that has some new add-on capabilities (which weren't possible in GATK 1.x); we put the tool in both the Lite and the Full build, but the add-ons are only available in the Full build.

Reprising the car analogy, GATK-Lite and GATK-Full are like two versions of the same car -- the basic version and the fully-equipped one. They both have the exact same engine, and most of the equipment (tools) is the same -- for example, they both have the same airbag system, and they both have headlights. But there are a few important differences:
The GATK-Full car comes with a GPS (sat-nav for our UK friends), for which the Lite car has no equivalent. You could buy a portable GPS unit from a third-party store for your Lite car, but it might not be as good, and certainly not as convenient, as the Full car's built-in one.
Both cars have windows of course, but the Full car has power windows, while the Lite car doesn't. The Lite windows can open and close, but you have to operate them by hand, which is much slower.
The underlying engine is exactly the same in both GATK-Lite and GATK-Full. Most functionalities are available in both builds, performed by the same tools. Some functionalities are available in both builds, but they are performed by different tools, and the tool in the Full build is better. New, cutting-edge functionalities are only available in the Full build, and there is no equivalent in the Lite build.
We hope this clears up some of the confusion surrounding GATK-Lite. If not, please leave a comment and we'll do our best to clarify further!
One of the key challenges of working with next-gen sequence data is that input files are usually very large. We can’t just make the program open the files, load all the data into memory and perform whatever analysis is needed on all of it in one go. It’s just too much work, even for supercomputers.
Instead, we make the program cut the job into smaller tasks that the computer can easily process separately. Then we have it combine the results of each step into the final result.
Map/Reduce is the technique we use to achieve this. It consists of three steps formally called filter, map and reduce. Let’s apply it to an example case where we want to find out what is the average depth of coverage in our dataset for a certain region of the genome.
filter determines what subset of the data needs to be processed in each task. In our example, the program lists all the reference positions in our region of interest.
map applies the function, i.e. performs the analysis on each subset of data. In our example, for each position in the list, the program looks into the BAM file, pulls out the pileup of bases and outputs the depth of coverage at that position.
reduce combines the elements in the list of results output by the map function. In our example, the program takes the coverage numbers that were calculated separately for all the reference positions and calculates their average, which is the final result we want.
This may seem trivial for such a simple example, but it is a very powerful method with many advantages. Among other things, it makes it relatively easy to parallelize operations, which makes the tools run much faster on large datasets.
All the tools in the GATK are built from the ground up to take advantage of this method. That’s why we call them walkers: because they “walk” across the genome, getting things done.
Note that even though it’s not included in the Map/Reduce technique’s name, the filter step is very important. It determines what data get presented to the tool for analysis, selecting only the appropriate data for each task and discarding anything that’s not relevant. This is a key part of the Map/Reduce technique, because that’s what makes each task “bite-sized” enough for the computer to handle easily.
Each tool has filters that are tailored specifically for the type of analysis it performs. The filters rely on traversal engines, which are little programs that are designed to “traverse” the data (i.e. walk through the data) in specific ways.
There are three major types of traversal: Locus Traversal, Read Traversal and Active Region Traversal. In our interval coverage example, the tool’s filter uses the Locus Traversal engine, which walks through the data by locus, i.e. by position along the reference genome. Because of that, the tool is classified as a Locus Walker. Similarly, the Read Traversal engine is used, you’ve guessed it, by Read Walkers.
The GATK engine comes packed with many other ways to walk through the genome and get the job done seamlessly, but those are the ones you’ll encounter most often.
A primer on parallelism with the GATK How can I use parallelism to make GATK tools run faster?
A GATKReport is simply a text document that contains well-formatted, easy to read representation of some tabular data. Many GATK tools output their results as GATKReports, so it's important to understand how they are formatted and how you can use them in further analyses.
Here's a simple example:
#:GATKReport.v1.0:2
#:GATKTable:true:2:9:%.18E:%.15f:;
#:GATKTable:ErrorRatePerCycle:The error rate per sequenced position in the reads
cycle errorrate.61PA8.7 qualavg.61PA8.7
0 7.451835696110506E-3 25.474613284804366
1 2.362777171937477E-3 29.844949954504095
2 9.087604507451836E-4 32.875909752547310
3 5.452562704471102E-4 34.498999090081895
4 9.087604507451836E-4 35.148316651501370
5 5.452562704471102E-4 36.072234352256190
6 5.452562704471102E-4 36.121724890829700
7 5.452562704471102E-4 36.191048034934500
8 5.452562704471102E-4 36.003457059679770
#:GATKTable:false:2:3:%s:%c:;
#:GATKTable:TableName:Description
key column
1:1000 T
1:1001 A
1:1002 C
This report contains two individual GATK report tables. Every table begins with a header for its metadata and then a header for its name and description. The next row contains the column names followed by the data.
We provide an R library called gsalib that allows you to load GATKReport files into R for further analysis. Here are the five simple steps to getting gsalib, installing it and loading a report.
Please visit the Downloads page for instructions.
gsalib library$ ant gsalib
Buildfile: build.xml
gsalib:
[exec] * installing *source* package ?gsalib? ...
[exec] ** R
[exec] ** data
[exec] ** preparing package for lazy loading
[exec] ** help
[exec] *** installing help indices
[exec] ** building package indices ...
[exec] ** testing if installed package can be loaded
[exec]
[exec] * DONE (gsalib)
BUILD SUCCESSFUL
$ cat .Rprofile
.libPaths("/path/to/Sting/R/")
$ R
R version 2.11.0 (2010-04-22)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(gsalib)
> d = gsa.read.gatkreport("/path/to/my.gatkreport")
> summary(d)
Length Class Mode
CountVariants 27 data.frame list
CompOverlap 13 data.frame list
Inside of the Broad, the latest bundle will always be available in:
/humgen/gsa-hpprojects/GATK/bundle/current
with a subdirectory containing for each reference sequence and associated data files.
External users can download these files (or corresponding .gz versions) from the GSA FTP Server in the directory bundle. Gzipped files should be unzipped before attempting to use them. Note that there is no "current link" on the FTP; users should download the highest numbered directory under current (this is the most recent data set).
Additionally, these files all have supplementary indices, statistics, and other QC data available.
Includes the UCSC-style hg18 reference along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome-build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience that this might cause.
Also includes a chain file to lift over to b37.
Includes the 1000 Genomes pilot b36 formated reference sequence (human_b36_both.fasta) along with all lifted over VCF files. The refGene track and BAM files are not available. We only provide data files for this genome-build that can be lifted over "easily" from our master b37 repository. Sorry for whatever inconvenience that this might cause.
Also includes a chain file to lift over to b37.
Includes the UCSC-style hg19 reference along with all lifted over VCF files.
The following links should be help as a review or an introduction to concepts and terminology related to next-generation sequencing:
A basic review of the sequencing process.
Sequencing technologies, the next generation, (M. Metzker, Nature Reviews - Genetics)
An excellent, detailed overview of the myriad next-gen sequencing methdologies.
Next-generation sequencing: adjusting to data overload (M. Baker, Nature Methods)
A nice piece explaining the problems inherent in trying to analyze terabytes of data. The GATK addresses this issue by requiring all datasets be in reference order, so only small chunks of the genome need to be in memory at once, as explained here.
Primer on NGS analysis, from Broad Institute Primers in Medical Genetics
We have sequenced at the Broad Institute and released to the 1000 Genomes Project the following datasets for the three members of the CEU trio (NA12878, NA12891 and NA12892):
This is better data to work with than the original DePristo et al. BAMs files, so we recommend you download and analyze these files if you are looking for complete, large-scale data sets to evaluate the GATK or other tools.
Here's the rough library properties of the BAMs:

These data files can be downloaded from the 1000 Genomes DCC
Here are the datasets we used in the GATK paper cited below.
DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D and Daly, M (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 43:491-498.
Some of the BAM and VCF files are currently hosted by the NCBI: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/
FILTER field status-- targets used in the analysis of the exome capture dataPlease note that we have not collected the indel calls for the paper, as these are only used for filtering SNPs near indels. If you want to call accurate indels, please use the new GATK indel caller in the Unified Genotyper.
Both the GATK and the sequencing technologies have improved significantly since the analyses performed in this paper.
If you are conducting a review today, we would recommend that the newest version of the GATK, which performs much better than the version described in the paper. Moreover, we would also recommend one use the newest version of Crossbow as well, in case they have improved things. The GATK calls for NA12878 from the paper (above) will give one a good idea what a good call set looks like whole-genome or whole-exome.
The data sets used in the paper are no longer state-of-the-art. The WEx BAM is GAII data aligned with MAQ on hg18, but a state-of-the-art data set would use HiSeq and BWA on hg19. Even the 64x HiSeq WG data set is already more than one year old. For a better assessment, we would recommend you use a newer data set for these samples, if you have the capacity to generate it. This applies less to the WG NA12878 data, which is pretty good, but the NA12878 WEx from the paper is nearly 2 years old now and notably worse than our most recent data sets.
Obviously, this was an annoyance for us as well, as it would have been nice to use a state-of-the-art data set for the WEx. But we decided to freeze the data used for analysis to actually finish this paper.
If you want the raw, machine output for the data analyzed in the GATK framework paper, obtain the raw BAM files above and convert them from SAM to FASTQ using the Picard tool SamToFastq.
Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs), so before you post this issue in our support forum you will first need to do a little investigation on your own.
To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.
Here is a checklist of questions you should ask yourself:
The genotyper ignores sites if there are too many overlapping deletions. This value can be set using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument), but be aware that increasing it could affect the reliability of your results.
Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.
Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.
A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!
Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
By default the UG will only consider a certain number of alternate alleles. This value can be set using the --max_alternate_alleles argument (see the UG's documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.
SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site? If so, you are probably seeing false positives.
Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which can dramatically affect the sensitivity and reliability of the results.
In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.
If you're working on human genomes, you're in luck. We provide sets of known sites in the human genome as part of our resource bundle, and we can give you specific Best Practices recommendations on which sets to use for each tool in the variant calling pipeline. See the next section for details.
If you're working on genomes of other organisms, things may be a little harder -- but don't panic, we'll try to help as much as we can. We've started a community discussion in the forum on What are the standard resources for non-human genomes? in which we hope people with non-human genomics experience will share their knowledge.
And if it turns out that there is as yet no suitable set of known sites for your organisms, here's how to make your own for the purposes of BaseRecalibration: First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence. Good luck!
Some experimentation will be required to figure out the best way to find the highest confidence SNPs for use here. Perhaps one could call variants with several different calling algorithms and take the set intersection. Or perhaps one could do a very strict round of filtering and take only those variants which pass the test.
| Tool | dbSNP 129 - | - dbSNP >132 - | - Mills indels - | - 1KG indels - | - HapMap - | - Omni |
|---|---|---|---|---|---|---|
| RealignerTargetCreator | X | X | ||||
| IndelRealigner | X | X | ||||
| BaseRecalibrator | X | X | X | |||
| (UnifiedGenotyper/ HaplotypeCaller) | X | |||||
| VariantRecalibrator | X | X | X | X | ||
| VariantEval | X |
These tools require known indels passed with the -known argument to function properly. We use both the following files:
This tool requires known SNPs and indels passed with the -knownSites argument to function properly. We use all the following files:
These tools do NOT require known sites, but if SNPs are provided with the -dbsnp argument they will use them for variant annotation. We use this file:
This tool requires known SNPs and indels passed with the -resource argument to function properly. We use all the following files:
For best results, these resources should be passed with these parameters:
-resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
-resource:mills,VCF,known=false,training=true,truth=true,prior=12.0 gold.standard.indel.b37.vcf
This tool requires known SNPs passed with the -dbsnp argument to function properly. We use the following file:
As featured in this forum question.
Two main things account for these kinds of differences, both linked to default behaviors of the tools:
In both cases, you can end up looking at different sets or numbers of reads, which causes some of the annotation values to be different. It's usually not a cause for alarm. Remember that many of these annotations should be interpreted relatively, not absolutely.