Our current best practice for making SNP and indel calls is divided into four sequential steps: initial mapping, refinement of the initial reads, multi-sample indel and SNP calling, and finally variant quality score recalibration. These steps are the same for targeted resequencing, whole exomes, deep whole genomes, and low-pass whole genomes.

Example commands for each tool are available on the individual tool's wiki entry. There is also a list of which resource files to use with which tool.
Note that due to the specific attributes of a project the specific values used in each of the commands may need to be selected/modified by the analyst. Care should be taken by the analyst running our tools to understand what each parameter does and to evaluate which value best fits the data and project design.
There are four major organizational units for next-generation DNA sequencing processes that used throughout this documentation:
This document describes how to call variation within a single analysis cohort, comprised for one or many samples, each of one or many libraries that were sequenced on at least one lane of an NGS machine.
Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost by running these commands across all of your data simultaneously.
In order to help individuals get up to speed, evaluate their command lines, and generally become familiar with the GATK tools we recommend you download the raw and realigned, recalibrated NA12878 test data from the GATK resource bundle. It should be possible to apply all of the approaches outlined below to get excellent results for realignment, recalibration, SNP calling, indel calling, filtering, and variant quality score recalibration using this data.
In our GATK 2.0 slide archive.
The GATK data processing pipeline assumes that one of the many NGS read aligners (see this review) has been applied to your raw FASTQ files. For Illumina data we recommend BWA because it is accurate, fast, well-supported, open-source, and emits SAM files natively (which are then easy to convert to BAM).
The three key processes used here are:
There are several options here from the easy and fast basic protocol to the more comprehensive but computationally expensive pipeline. For example, there are two types of realignment which constitute a vastly different amount of processing power required:
This protocol uses lane-level local realignment around known indels (very fast, as there's no sample level processing) to clean up lane-level alignments. This results in better quality scores, as they are less biased for indel alignment artefacts.
for each lane.bam
dedup.bam <- MarkDuplicate(lane.bam)
realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
recal.bam <- recal(realigned.bam)
Here we are essentially just merging the recalibrated lane.bams for a sample, dedupping the reads, and calling it quite. It doesn't perform indel realignment across lanes, so it leaves in some indels artifacts. For humans, which now have an extensive list of indels (get them from the GATK bundle!) the lane-level realignment around known indels is going to make up for the lack of cross-lane realignment. This protocol is appropriate if you are going to use callers like the HaplotypeCaller, UnifiedGenotyper with BAQ, or samtools with BAQ that are less sensitive to the initial alignment of reads, or if your project has limited coverage per sample (< 8x) where per-sample indel realignment isn't more empowered than per-lane realignment. For other situations or for organisms with limited database of segregating indels, it's better to use the advanced protocol if you have deep enough data per sample.
for each sample
recals.bam <- merged lane-level recal.bams for sample
dedup.bam <- MarkDuplicates(recals.bam)
sample.bam <- dedup.bam
As with the basic protocol, this protocol assumes the per-lane processing has been already completed. This protocol is essentially the basic protocol but with per-sample indel realignment.
for each sample
recals.bam <- merged lane-level recal.bams for sample
dedup.bam <- MarkDuplicates(recals.bam)
realigned.bam <- realign(dedup.bam) [with known sites included if available]
sample.bam <- realigned.bam
This is the protocol we use at the Broad in our fully automated pipeline because it gives an optimal balance of performance, accuracy and convenience.
Rather than doing the lane level cleaning and recalibration, this process aggregates all of the reads for each sample and then does a full dedupping, realign, and recalibration, yielding the best single-sample results. The big change here is sample-level cleaning followed by recalibration, giving you the most accurate quality scores possible for a single sample.
for each sample
lanes.bam <- merged lane.bams for sample
dedup.bam <- MarkDuplicates(lanes.bam)
realigned.bam <- realign(dedup.bam) [with known sites included if available]
recal.bam <- recal(realigned.bam)
sample.bam <- recal.bam
This protocol can be hard to implement in practice unless you can afford to wait until all of the data is available to do data processing for your samples.
ReduceReads is a novel (perhaps even breakthrough?) GATK 2.0 data compression algorithm. The purpose of ReducedReads is to take a BAM file with NGS data and reduce it down to just the information necessary to make accurate SNP and indel calls, as well as genotype reference sites (hard to achieve) using GATK tools like UnifiedGenotyper or HaplotypeCaller. ReduceReads accepts as an input a BAM file and produces a valid BAM file (it works in IGV!) but with a few extra tags that the GATK can use to make accurate calls.
You can find more information about reduced reads in some of our presentations in the archive.
ReduceReads works well for exomes or high-coverage (at least 20x average coverage) whole genome BAM files. In this case we highly recommend using ReduceReads to minimize the file sizes. Note that ReduceReads performs a lossy compression of the sequencing data that works well with the downstream GATK tools, but may not be supported by external tools. Also, we recommend that you archive your original BAM file, or at least a copy of your original FASTQs, as ReduceReads is highly lossy and doesn't quality as an archive data compression format.
Using ReduceReads on your BAM files will cut down the sizes to approximately 1/100 of their original sizes, allowing the GATK to process tens of thousands of samples simultaneously without excessive IO and processing burdens. Even for single samples ReduceReads cuts the memory requirements, IO burden, and CPU costs of downstream tools significantly (10x or more) and so we recommend you preprocess analysis-ready BAM files with ReducedReads.
for each sample
sample.reduced.bam <- ReduceReads(sample.bam)
After the raw data processing step, the GATK variant detection process assumes that you have aligned, duplicate marked, and recalibrated BAM files for all of the samples in your cohort. Because the GATK can dynamically merge BAM files, it isn't critical to have merged files by lane into sample bams, or even samples bams into cohort bams. In general we try to create sample level bams for deep data sets (deep WG or exomes) and merged cohort files by chromosome for WG low-pass.
For this part of the this document, I'm going to assume that you have a single realigned, recalibrated, dedupped BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort. Note that some of the data processing steps, such as multiple sample local realignment, will merge BAMS for many samples into a single BAM. If you've gone down this route, you just need to modify the GATK commands as necessary to take not multiple BAMs, one for each sample, but a single BAM for all samples.
The next step in the standard GATK data processing pipeline, whole genome or targeted, deep or shallow, is to apply the Haplotype Caller or Unified Genotyper to identify sites among the cohort samples that are statistically non-reference. This will produce a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort. It's in this stage that we use the meta-data in the BAM files extensively -- read groups for reads, with samples, platforms, etc -- to enable us to do the multi-sample merging and genotyping correctly. It was a pain for data processing, yes, but now life is easy for downstream calling and analysis.
A common question is the confidence score threshold to use for variant detection. We recommend:
raw.vcf <- HaplotypeCaller(sample1.bam, sample2.bam, ..., sampleN.bam)
raw.vcf <- UnifiedGenotyper(sample1.bam, sample2.bam, ..., sampleN.bam)
-nt option. However you can use Queue to parallelize execution of HaplotypeCaller.This raw VCF file should be as sensitive to variation as you'll get without imputation. At this stage, you can assess things like sensitivity to known variant sites or genotype chip concordance. The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference. All of the subsequent steps are designed to separate out the false positive machine artifacts from the true positive genetic variants.
The process used here is the Variant quality score recalibrator which builds an adaptive error model using known variant sites and then applies this model to estimate the probability that each variant in the callset is a true genetic variant or a machine/alignment artifact. All filtering criteria are learned from the data itself.
Take a look at our FAQ page for specific recommendations on which training sets and command-line arguments to use with various project designs. It is expected that analysis will be required by the user when determining the best way to use this tool for different projects. These recommendations are only meant as jumping off points!
We've found that the best results are obtained with the VQSR when it is used to build separate adaptive error models for SNPs versus INDELs:
snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP)
indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL)
recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP)
analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples. Also, for experiments that employ targeted resequencing of a small region (for example, a few hundred genes), VQSR may not be empowered regardless of the number of samples in the experiment. For users with experiments containing fewer exome samples or with a small target region there are several options to explore (listed in priority order of what we think will give the best results):
--maxGaussians 4 --percentBad 0.05 to your command line. Note that this is very dependent on your dataset, and you may need to try some very different settings. It may even not work at all. Unfortunately we cannot give you any specific advice, so please do not post questions on the forum asking for help finding the right parameters. These recommended arguments for VariantFiltration are only to used when ALL other options are not available:
You will need to compose filter expressions (see here, here and here for details) to filter on the following annotations and values:
For SNPs:
QD < 2.0MQ < 40.0FS > 60.0HaplotypeScore > 13.0 MQRankSum < -12.5ReadPosRankSum < -8.0For indels:
QD < 2.0ReadPosRankSum < -20.0InbreedingCoeff < -0.8FS > 200.0Note that the InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
For shallow-coverage (<10x): you cannot use filtering to reliably separate true positives from false positives. You must use the protocol involving variant quality score recalibration.
The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
That said, all of the caveats about determining the right parameters, etc, are annoying and are largely eliminated by variant quality score recalibration.
This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

The Data Processing Pipeline is a Queue script that performs all raw data processing described in this page following our most current stable recommendations for best practices.
Our current best practice for making SNP and indel calls is divided into 4 sequential steps: initial mapping, refinement of the initial reads, multi-sample indel and SNP calling, and finally variant quality score recalibration. These steps are the same for targeted resequencing, whole exomes, deep whole genomes, and low-pass whole genomes. The exact commands for each tool are available on the individual tool's wiki entry. See here for a visual representation of the workflow.
Note that, due to the specific attributes of a project, that the the specific values used in each of the commands may need to be selected by the analyst. Care should be taken by the analyst running our tools to understand what each parameter does and to evaluate which value best fits his/her data.
There are four major organizational units for next-generation DNA sequencing processes:
This document describes how to call variation within a single analysis cohort, comprised for one or many samples, each of one or many libraries that were sequenced on at least one lane of an NGS machine.
Note that many GATK commands can be run at the lane level, but will give better results seeing all of the data for a single sample, or even all of the data for all samples. Unfortunately, there's a trade-off in computational cost by running these commands across all of your data simultaneously.
In order to help individuals get up to speed, evaluate their command lines, and generally become familiar with the GATK tools we recommend you download the raw and realigned, recalibrated NA12878 test data. It should be possible to apply all of the approaches outlined below to get excellent results for realignment, recalibration, SNP calling, indel calling, filtering, and variant quality score recalibration using this data.
The GATK data processing pipeline assumes that one of the many NGS read aligners (see [1] for a review) has been applied to your raw FASTQ files. For Illumina data we recommend BWA because it is accurate, fast, well-supported, open-source, and emits BAM files natively.
The three key tools here are Base quality score recalibration, Local realignment around indels, and MarkDuplicates. Although ideally one follows the recommended workflow, in practice MarkDuplicates can be run before local realignment, in order to handle cases where duplicates overlapping indels get marginally different alignments (unlikely but possible) and so will not be considered as potential duplicates because MarkDuplicates looks only at read pair start and stop positions. There are several options here, from the easy and fast basic protocol to the more.
There are two types of realignment:
This is the protocol we used at the Broad Institute for the last year.
for each lane.bam
dedup.bam <- MarkDuplicate(lane.bam)
recal.bam <- recal(dedup.bam)
for each sample
lanes.bam <- merged recal.bam's for sample
dedup.bam <- MarkDuplicates(lanes.bam)
realigned.bam <- realign(dedup.bam) [with known sites if possible]
This protocol adds lane-level local realignment around known indels, making it very fast (no sample level processing) and gives better results for human samples than the previous recommendation:
for each lane.bam
realigned.bam <- realign(lane.bam) [at only known sites]
dedup.bam <- MarkDuplicate(realigned.bam)
recal.bam <- recal(dedup.bam)
for each sample
recals.bam <- merged lane-level recal.bam's for sample
dedup.bam <- MarkDuplicates(recals.bam)
This protocol adds sample-level realignment after library / sample level dedupping, so that novel indels in each sample can be discovered and realigned around.
for each lane.bam
realigned.bam <- realign(lane.bam) [at only known sites]
dedup.bam <- MarkDuplicate(realigned.bam)
recal.bam <- recal(dedup.bam)
for each sample
recals.bam <- merged lane-level recal.bam's for sample
dedup.bam <- MarkDuplicates(recals.bam)
realigned.bam <- realign(dedup.bam) [with known sites included if available]
Rather than doing the lane level cleaning and recalibration, this process aggregates all of the reads for each sample and then does a full dedupping, realign, and recalibration, yielding the best single-sample results. The big change here is sample-level cleaning followed by recalibration, giving you the most accurate quality scores possible for a single sample.
for each sample
lanes.bam <- merged lane.bam's for sample
dedup.bam <- MarkDuplicates(lanes.bam)
realigned.bam <- realign(dedup.bam) [with known sites included if available]
recal.bam <- recal(realigned.bam)
Finally, if you really want to get the absolute best results, whatever the computational cost, then we recommend doing multiple sample realignment so that novel indels in one sample help to realign reads in other samples. Although not generally necessary for deep sequencing data, this is important for low-coverage multi-sample SNP calling projects like the 1000 Genomes Project. Note that the computational cost here is so extreme that we only do this analysis in special circumstances, such as large-scale data freeze for the project.
Note that for contrastive calling projects -- such as cancer tumor/normals -- that we recommend cleaning both the tumor and the normal together in general to avoid slight alignment differences between the two tissue types.
for each sample
lanes.bam <- merged lane.bam's for sample
dedup.bam <- MarkDuplicates(lanes.bam)
samples.bam <- merged dedup.bam's for all samples
realigned.bam <- realign(samples.bam)
recal.bam <- recal(realigned.bam)
After the raw data processing step, the GATK variant detection process assumes that you have aligned, duplicate marked, and recalibrated BAM files for all of the samples in your cohort. Because the GATK can dynamically merge BAM files, it isn't critical to have merged files by lane into sample bams, or even samples bams into cohort bams. In general we try to create sample level bams for deep data sets (deep WG or exomes) and merged cohort files by chromosome for WG low-pass. A nice size for BAMs is 10-300 Gb, just for organizing on disk.
For this part of the this document, I'm going to assume that you have a single realigned, recalibrated, dedupped BAM per sample, called sampleX.bam, for X from 1 to N samples in your cohort. Note that some of the data processing steps, such as multiple sample local realignment, will merge BAMS for many samples into a single BAM. If you've gone down this route, you just need to modify the GATK commands as necessary to take not multiple BAMs, one for each sample, but a single BAM for all samples.
The next step in the standard GATK data processing pipeline, whole genome or targeted, deep or shallow, is to apply the Unified genotyper to identify sites among the cohort samples that are statistically non-reference. This will produce a multi-sample VCF file, with sites discovered across samples and genotypes assigned to each sample in the cohort. It's in this stage that we use the meta-data in the BAM files extensively -- read groups for reads, with samples, platforms, etc -- to enable us to do the multi-sample merging and genotyping correctly. It was a pain for data processing, yes, but now life is easy for downstream calling and analysis.
Note that by default the Unified Genotyper calls SNPs only. To enable the indel calling capabilities (in addition to SNPs) instead use the -glm BOTH argument.
A common question is the confidence score threshold to use for variant detection. We recommend:
raw.vcf <- unifiedGenotyper(sample1.bam, sample2.bam, ..., sampleN.bam)
This raw VCF file should be as sensitive to variation as you'll get without imputation. At this stage, you can assess things like sensitivity to known variant sites or genotype chip concordance. The problem is that the raw VCF will have many sites that aren't really genetic variants but are machine artifacts that make the site statistically non-reference. All of the subsequent steps are designed to separate out the FP machine artifacts from the TP genetic variants.
The tool used here is the Variant quality score recalibration which builds an adaptive error model using known variant sites and then apply this model to estimate the probability that each variant is a true genetic variant or a machine artifact. The tool builds a separate model for SNPs and indels and must be run twice in succession in order to recalibrate both classes of variants. One major improvement from previous recommended protocols is that hand filters do not need to be applied at any point in the process now. All filtering criteria are learned from the data itself.
snp.model <- BuildErrorModelWithVQSR(raw.vcf, SNP) indel.model <- BuildErrorModelWithVQSR(raw.vcf, INDEL) recalibratedSNPs.rawIndels.vcf <- ApplyRecalibration(raw.vcf, snp.model, SNP) analysisReady.vcf <- ApplyRecalibration(recalibratedSNPs.rawIndels.vcf, indel.model, INDEL)
Common VariantRecalibrator command
Please review the Variant quality score recalibration wiki page for details of how one specifies truth and training sets.
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R path/to/reference/human_g1k_v37.fasta \ -input,VCF raw.vcf \ [SPECIFY TRUTH AND TRAINING SETS] -recalFile path/to/output.recal \ -tranchesFile path/to/output.tranches \ -rscriptFile path/to/output.plots.R [SPECIFY WHICH ANNOTATIONS TO USE IN MODELING]
For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. These datasets are available in the GATK resource bundle.
Arguments for VariantRecalibrator command:
-resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \ -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \ -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \ -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff -an DP \ -mode SNP \
Note that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, MQ, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.
Also, note that some of these annotations might not be the best for your particular dataset. For example, InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated.
Using the provided sites-only truth data files is important here as parsing the genotypes for VCF files with many samples increases the runtime of the tool significantly.
When modeling indels with the VQSR we use a training dataset that was created at the Broad by strictly curating the (Mills, Devine, Genome Research, 2011) dataset as as well as adding in very high confidence indels from the 1000 Genomes Project. This dataset is available in the GATK resource bundle.
Arguments for VariantReacalibrator:
-resource:mills,VCF,known=true,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.sites.vcf \ -an QD -an FS -an HaplotypeScore -an ReadPosRankSum -an InbreedingCoeff \ -mode INDEL \
Note that indels use a different set of annotations than SNPs. The annotations related to mapping quality have been removed since there is a conflation with the length of an indel in a read and the degradation in mapping quality that is assigned to the read by the aligner.
Common ApplyRecalibration command
The Variant quality score recalibration wiki page has an example command for the ApplyRecalibration step which works for both SNPs and indels. One would just need to add -mode SNP (the default) or -mode INDEL to the command line.
For exome SNPs, the tool used here, as in the whole genome case, is the Variant quality score recalibration which builds an adaptive error model using known variant sites and then apply this model to estimate the probability that each variant is a true genetic variant or a machine artifact. In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP callset with at least 30 samples. For users with experiments containing fewer exome samples there are several options to explore:
For exome indels there generally isn't enough data to build a robust error model and so we recommend applying hand filters to the exome indels. The protocol is to first use SelectVariants to separate out SNPs and indels. Then recalibrate the SNPs and hand filter the indels in parallel. Finally, use CombineVariants to combine the high quality variants back into one VCF file.
raw.snps.vcf <- Select(raw.vcf, SNP) raw.indels.vcf <- Select(raw.vcf, INDEL) snp.model <- BuildErrorModelWithVQSR(raw.snps.vcf) recalibratedSNPs.vcf <- ApplyRecalibration(raw.snps.vcf, snp.model) filteredIndels.vcf <- Filter(raw.indels.vcf) analysisReady.vcf <- CombineTogether(recalibratedSNPs.vcf, filteredIndels.vcf)
For SNPs we use both HapMap v3.3 and the Omni chip array from the 1000 Genomes Project as training data. These datasets are available in the GATK resource bundle.
Arguments for VariantRecalibrator command:
--maxGaussians 6 \ -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \ -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \ -resource:dbsnp,VCF,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \ -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \ -mode SNP \
Note that, for the above to work, the input vcf needs to be annotated with the corresponding values (QD, FS, MQ, etc.). If any of these values are somehow missing, then VariantAnnotator needs to be run first so that VariantRecalibration can run properly.
Also, note that some of these annotations might not be the best for your particular dataset. For example, InbreedingCoeff is a population level statistic that requires at least 10 samples in order to be calculated. Additionally, notice that DP was removed when working with hybrid capture datasets since there is extreme variation in the depth to which targets are captured. In whole genome experiments this variation is indicative of error but that is not the case in capture experiments.
Common ApplyRecalibration command
The Variant quality score recalibration wiki page has an example command for the ApplyRecalibration step.
Arguments for VariantFiltrationWalker:
--filterExpression "QD < 2.0" \ --filterExpression "ReadPosRankSum < -20.0" \ --filterExpression "InbreedingCoeff < -0.8" \ --filterExpression "FS > 200.0" \ --filterName QDFilter \ --filterName ReadPosFilter \ --filterName InbreedingFilter \ --filterName FSFilter \
Note the InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
Variant quality score recalibration requires a reasonable amount of data to work properly so with targeted resequencing of a small region (for example, a few hundred genes) VQSR may be inappropriate leaving hand filtering as the only option.
Arguments for VariantFiltrationWalker:
--filterExpression filter1 --filterName filterName1 --filterExpression filter2 --filterName filterName2 . . .
where DATA_TYPE_SPECIFIC_FILTERS has project-specific filtering strings selected below:
Note that the InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.
The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that For exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.
That said, all of the caveats about determining the right parameters, etc, are annoying and are largely eliminated by Variant quality score recalibration.
All of the following tables were generated using VariantEval. You can generate your own data points for comparing with that tool and the correct input / comparison VCF files.
The associated table (see attachment) provides some expectations for running deep whole genomes, single whole exomes, and multi-sample low-pass (from the 1000 genomes) in terms of sensitivity, specificity, and Ti/Tv ratios for known and novel calls. Obviously individual data sets will be different but this demonstrates that running the pipeline described here ( realigned -> recalibration -> calling -> filtering -> variant quality score recalibration) can produce excellent results for a variety of data types. Note that the deep data sets are single sample; in our hands multi-sample deep data results in much better calls overall.
We provide two useful data points (see attachment) for evaluating the quality of SNP calls whole genome or in the targeted whole exome (Agilent). You can follow a similar methodology to establish the Ti/Tv expectations for your region of interest, if captured separately, by running VariantEval on the 1000 Genomes Trios or Complete Genomes or other highly reliable data sets given the targeted interval.
This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

These are videos of the presentations given at the GATK workshop on 4-5 Dec 2012 on "Best Practices for variant calling with the GATK".
http://www.broadinstitute.org/gatk/guide/events?id=2038
Hello,
I am new at using GATK (v 2.1-3). I do exome sequencing by sample using the following steps: Alignment with BWA (0.6.2) GATK :Local realignment around INDELs PICARD (1.67): FixMateInformation GATK: Recalibration (BaseRecalibrator + PrintReads -BQSR) Samtools for calling variants
cd Sample_09
+ samtools mpileup -BE -ug -q 20 -Q 20 -D -f human_g1k_v37.fasta realigned_fixed_recal.bam -C50
+ bcftools view -bvcg -
[mpileup] 1 samples in 1 input files
I have seen that some groups use after realignment Picard:AddOrReplaceReadGroups and I wonder if I should use before calling the variant with samtools.
Thanks in advance for any advice you can give me.
Chris
Hello,
I asked this question in a comment under BestPractices but never got a response. Hopefully I will here. Here goes:
I have been running GATK v1.6.2 on several samples. It seems the way I had initially had run GATK for indel-realignment and quality re-calibration steps are reversed. For example, in order of processing, I ran:
What are the consequences and effect to SNP and INDEL calling if the GATK steps are ran as above?. I'm aware that this is not according to the best-practices document (thank you for the thorough GATK documentation), but I wanted to know if it is essential for me to re-analyze the samples. The variant caller I'm using looks at BaseQuality scores when making calls.
Any help on this would be appreciated.
Mike
Hi:
for the best practice V4, for the step: 2. Raw reads to analysis-ready reads at phase I, I only have lane-level sample data (each sample in one lane, no sample in multi-lanes), so I only did the Fast: lane-level realignment (at known sites only) and lane-level recalibration, which is
for each lane.bam
dedup.bam <- MarkDuplicate(lane.bam)
realigned.bam <- realign(dedup.bam) [at only known sites, if possible, otherwise skip]
recal.bam <- recal(realigned.bam)
Since I do not have multi-lane samples, I do not have to run Fast + per-sample processing, Better: per-sample realignment with known indels and recalibration, or Best: multi-sample realignment with known sites and recalibration, right? It seems that all of these schemes, Better, Best etc, are only for situation when some samples run at multiple lanes, is my understanding correct?
at the paragraph for Fast + per-sample processing, it mentioned: cross-lane realignment, not sure what it means exactly. if every sample of mine are in a single lane, it does not need, right?
Thanks
Mike
Hi,
I wonder how well GATK works with Ion Torrent data. Is there any recommended practice to handle Ion Torrent data, especially SNP and indel calling, with GATK?
Thanks, XZ
Hi all!
I'm currently working on high-coverage non-human data (mammals).
After mapping via BWA, soritng and merging, my pipeline goes like this:
--consensusDeterminationModel USE_SW flagI currently want to begin the recalibration step before doing the actual variant calls via UnifiedGenotyper.
Since I'm working on non-human data, there is no thorough database I can trust as an input vcf file for the recalibration step.
What is your recommendation for this for non-human data?
Thank you very much!
Sagi
Hello,
I'm sorry if I'm being dense (I'm new to all this and it is making me feel very dense indeed!), but having read the section on 'Selecting an appropriate quality score threshold' on the 'Best Practice Variant Detection' page, I am still unclear as to whether you mean I should be looking for a QUAL score of at least 30 in a deep coverage data set and should filter out any suggested SNPs that don't meet this, or a GQ score of 30 in each individual sample genotyped at the SNP in question and I only need to filter out individual samples that don't meet this threshold.
Please can you clarify?
I have pasted the bit of text I read below, just to make it clear to which bit I am referring.
Many thanks!
A common question is the confidence score threshold to use for variant detection. We recommend:
Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.
Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.
Hi, I am working on SOLiD 5500xl data and used SHRiMP2.2.3 for performing mapping. The library type is paired-end. I have read some discussions regarding SOLiD problem but I still have some doubts regarding some steps in best practices
So after mapping, I simply used GATK UnifiedGenotyper to call SNPs and InDels under default values. I end up getting around 40 million variants. Is there any way I can get a more refined variant calling? Do you still consider me applying the above pre-processing steps or do you recommend me applying some variant filteratiion on the called variants? If yes for the previous, then could you explain how my above concerns are taken care of? I was trying to look at some general recommended filter values on INFO fields in VCF format such as BQ, MQ, MQ0, DP, SB etc. Do you recommend some generally used values of these fields on which I can filter and hence refine my variant data?
I may have posted a subset of the above question, which I am not sure was posted successfully since at that time I just created an account. If you have already answered this question then I apologize for that. Could you then provide me the link where you answered it?
Thanks in advance
We are attempting to see if using ReducedReads will help with the overwhelming file sizes for the SNP calling we are doing on whole genome BAM files. We have been using a protocol similar to the one described in best practices document: Best: multi-sample realignment with known sites and recalibration. My question is what is the best point in the pipeline to use ReducedReads?
Hello,
I've just made a long needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before but now that I have a great deal more exomes at my disposal I'd like to fully implement it in a meaningful way.
The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?
Is there any difference or reason to choose one of the following two workflows over the other?
Input 30+ exomes in the "-I" argument of either the UnifiedGenotyper or HaplotypeCaller and then with my multi-sample VCF perform the variant recalibration procedure and then split the individual call sets out of the multi-sample vcf with SelectVariants?
Take 30+ individual vcf files, merge them together, and then perform variant recalibration on the merged vcf and then split the individual call sets out of the multi-sample vcf with SelectVariants?
Or some third option I'm missing
Any help is appreciated.
Thanks
Hi all, I have a 4 exome experiment (high coverage). According to best practices I can't apply VQSR, so I must filter on SNP properties. One thing is not clear: should all the option apply? In other words should I select using boolean "and" (i.e. QD < 2.0 & MQ < 40.0 & FS > 60.0 & HaplotypeScore > 13.0 & MQRankSum < -12.5 & ReadPosRankSum < -8.0 for SNP) or "or" (i.e. QD < 2.0 | MQ < 40.0 | FS > 60.0 | HaplotypeScore > 13.0 | MQRankSum < -12.5 | ReadPosRankSum < -8.0 for SNP)?
Hi, I am estimating SNP number for a genome of a speceis.
I found that the number of SNP in the fastq going through GATK is 10 times more than the first fastq. Interestingly, if I use picard to do duplicats-removomg again to the GATK bam and used samtools to convert the bam to fastq file. The SNP jumps back to 10 time fewer.
What can be the reason that the SNP number can be 10 time different between the two methods? Actually, I expect the GATK output file has fewer SNP given the effect of recaliraiton or relingement. But the result is opposite.
Chih-Ming