VariantRecalibrator

Create a Gaussian mixture model by looking at the annotation values over a high-quality subset of the input call set and then evaluate all input variants.

Category Variant Discovery Tools

Traversal LocusWalker

PartitionBy NONE


Overview

This walker is the first pass of a two-stage process and is designed to be used in conjunction with the ApplyRecalibration walker.

The purpose of the variant recalibrator is to assign a well-calibrated probability to each variant call in a call set. One can then create highly accurate call sets by filtering based on this single estimate for the accuracy of each call. The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, MQ, HaplotypeScore, and ReadPosRankSum, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input, typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array. This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
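
As a rough illustration of the log-odds idea described above, the following Python sketch evaluates one annotation value under two small Gaussian mixtures, one standing in for the model of true variants and one for the model of artifacts, and reports the log of the density ratio. Everything in it is hypothetical: the single QD-like annotation, the component weights, means and variances, the choice of base-10 logarithm, and the function names are assumptions made for illustration. It is not GATK's implementation, which fits multivariate mixtures with a variational Bayes procedure, and the priors attached to resources are left out.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, components):
    """Density of a mixture given (weight, mean, variance) triples."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def log_odds(x, good_model, bad_model):
    """log10 of p(x | good model) / p(x | bad model). Base 10 is an assumption."""
    return math.log10(mixture_pdf(x, good_model)) - math.log10(mixture_pdf(x, bad_model))


# Hypothetical mixtures over a single QD-like annotation (made-up numbers).
GOOD_MODEL = [(0.7, 20.0, 16.0), (0.3, 12.0, 9.0)]
BAD_MODEL = [(1.0, 3.0, 4.0)]

if __name__ == "__main__":
    for qd in (2.0, 8.0, 18.0):
        print(f"QD={qd:5.1f}  log-odds={log_odds(qd, GOOD_MODEL, BAD_MODEL):8.2f}")
```

With these made-up parameters, values near the bad cluster receive a negative score and values near the good clusters a positive one, which is the property that makes a single threshold on the score usable as a filter.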

NOTE: To create the model reporting plots, Rscript needs to be in your environment PATH (this is the scripting version of R, not the interactive version). See http://www.r-project.org for more information on how to download and install R.

Input

The input raw variants to be recalibrated.

Known, truth, and training sets to be used by the algorithm. How these various sets are used is described below.

Output

A recalibration table file in VCF format that is used by the ApplyRecalibration walker.

A tranches file that shows various metrics of the recalibration call set as a function of making several slices through the data.

Example

 java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
   -R reference/human_g1k_v37.fasta \
   -input NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
   -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
   -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
   -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
   -mode SNP \
   -recalFile path/to/output.recal \
   -tranchesFile path/to/output.tranches \
   -rscriptFile path/to/output.plots.R
 

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by VariantRecalibrator.

Parallelism options

This tool can be run in multi-threaded mode using this option.


Command-line Arguments

Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

VariantRecalibrator specific arguments

This table summarizes the command-line arguments that are specific to this tool. For details, see the list further down below the table.

Name | Type | Default value | Summary
Required
--input | List[RodBinding[VariantContext]] | NA | The raw input variants to be recalibrated
--recal_file | VariantContextWriter | NA | The output recal file used by ApplyRecalibration
--resource | List[RodBinding[VariantContext]] | [] | A list of sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run)
--tranches_file | File | NA | The output tranches file used by ApplyRecalibration
--use_annotation | String[] | NA | The names of the annotations which should be used for calculations
Optional
--dirichlet | double | 0.0010 | The dirichlet parameter in the variational Bayes algorithm.
--ignore_filter | String[] | NA | If specified, the variant recalibrator will use variants even if the specified filter name is marked in the input VCF file
--maxGaussians | int | 10 | The maximum number of Gaussians to try during the variational Bayes algorithm
--maxIterations | int | 100 | The maximum number of VBEM iterations to be performed in the variational Bayes algorithm. The procedure will normally end when convergence is detected.
--minNumBadVariants | int | 2500 | The minimum number of worst-scoring variants to use when building the Gaussian mixture model of bad variants. Will override the -percentBad argument if necessary.
--mode | Mode | SNP | Recalibration mode to employ: 1.) SNP for recalibrating only SNPs (emitting indels untouched in the output VCF); 2.) INDEL for indels; and 3.) BOTH for recalibrating both SNPs and indels simultaneously.
--numKMeans | int | 30 | The number of k-means iterations to perform in order to initialize the means of the Gaussians in the Gaussian mixture model.
--percentBadVariants | double | 0.03 | What percentage of the worst-scoring variants to use when building the Gaussian mixture model of bad variants, expressed as a fraction: 0.07 means the bottom 7 percent.
--priorCounts | double | 20.0 | The number of prior counts to use in the variational Bayes algorithm.
--qualThreshold | double | 80.0 | If a known variant has a raw QUAL value less than -qual, then don't use it for building the Gaussian mixture model.
--rscript_file | File | NA | The output rscript file generated by the VQSR to aid in visualization of the input data and learned model
--shrinkage | double | 1.0 | The shrinkage parameter in the variational Bayes algorithm.
--stdThreshold | double | 14.0 | If a variant has annotations more than -std standard deviations away from the mean, then don't use it for building the Gaussian mixture model.
--target_titv | double | 2.15 | The expected novel Ti/Tv ratio to use when calculating FDR tranches and for display on the optimization curve output figures (approximately 2.15 for whole-genome experiments). ONLY USED FOR PLOTTING PURPOSES!
--ts_filter_level | double | 99.0 | The truth sensitivity level at which to start filtering, used here to indicate filtered variants in the model reporting plots
--TStranche | double[] | [100.0, 99.9, 99.0, 90.0] | The levels of truth sensitivity at which to slice the data (in percent, that is 1.0 for 1 percent)
Advanced
--trustAllPolymorphic | Boolean | false | Trust that all the input training sets' unfiltered records contain only polymorphic sites to drastically speed up the computation.

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

--dirichlet / -dirichlet ( double with default value 0.0010 )

The dirichlet parameter in the variational Bayes algorithm.

--ignore_filter / -ignoreFilter ( String[] )

If specified, the variant recalibrator will use variants even if the specified filter name is marked in the input VCF file.

--input / -input ( required List[RodBinding[VariantContext]] )

The raw input variants to be recalibrated. These calls should be unfiltered and annotated with the error covariates that are intended to be used for modeling. --input binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3.

--maxGaussians / -mG ( int with default value 10 )

The maximum number of Gaussians to try during the variational Bayes algorithm.

--maxIterations / -mI ( int with default value 100 )

The maximum number of VBEM iterations to be performed in the variational Bayes algorithm. The procedure will normally end when convergence is detected.

--minNumBadVariants / -minNumBad ( int with default value 2500 )

The minimum number of worst-scoring variants to use when building the Gaussian mixture model of bad variants. Will override the -percentBad argument if necessary.

--mode / -mode ( Mode with default value SNP )

Recalibration mode to employ: 1.) SNP for recalibrating only SNPs (emitting indels untouched in the output VCF); 2.) INDEL for indels; and 3.) BOTH for recalibrating both SNPs and indels simultaneously.
The --mode argument is an enumerated type (Mode), which can have one of the following values:

SNP
INDEL
BOTH

--numKMeans / -nKM ( int with default value 30 )

The number of k-means iterations to perform in order to initialize the means of the Gaussians in the Gaussian mixture model.

--percentBadVariants / -percentBad ( double with default value 0.03 )

What percentage of the worst-scoring variants to use when building the Gaussian mixture model of bad variants, expressed as a fraction: 0.07 means the bottom 7 percent.
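
The interplay between --percentBadVariants and --minNumBadVariants can be pictured with the short Python sketch below: take the requested fraction of the lowest-scoring variants, but never fewer than the minimum count. The function name, the rounding rule, and the cap at the size of the call set are assumptions made for illustration; the tool's own selection logic may differ in such details.

```python
def select_worst_variants(scores, percent_bad=0.03, min_num_bad=2500):
    """Return the lowest scores that would seed a model of bad variants.

    percent_bad is a fraction (0.03 means the bottom 3 percent).
    min_num_bad takes precedence when the fraction yields fewer variants.
    Rounding to the nearest integer and capping at len(scores) are assumptions.
    """
    n_by_fraction = int(round(len(scores) * percent_bad))
    n_bad = min(len(scores), max(n_by_fraction, min_num_bad))
    return sorted(scores)[:n_bad]


if __name__ == "__main__":
    scores = [float(i) for i in range(1000)]  # made-up scores, lower is worse
    print(len(select_worst_variants(scores, percent_bad=0.1, min_num_bad=50)))
    print(len(select_worst_variants(scores, percent_bad=0.1, min_num_bad=250)))
```

In the second call the minimum count wins over the fraction, which mirrors the "will override" wording in the --minNumBadVariants description.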

--priorCounts / -priorCounts ( double with default value 20.0 )

The number of prior counts to use in the variational Bayes algorithm.

--qualThreshold / -qual ( double with default value 80.0 )

If a known variant has a raw QUAL value less than -qual, then don't use it for building the Gaussian mixture model.

--recal_file / -recalFile ( required VariantContextWriter )

The output recal file used by ApplyRecalibration.

--resource / -resource ( required List[RodBinding[VariantContext]] with default value [] )

A list of sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run). Any set of VCF files to use as lists of training, truth, or known sites. Training - Input variants which are found to overlap with these training sites are used to build the Gaussian mixture model. Truth - When deciding where to set the cutoff in VQSLOD, sensitivity to these truth sites is used. Known - The known / novel status of a variant isn't used by the algorithm itself and is only used for reporting / display purposes. Bad - In addition to using the worst 3% of variants as compared to the Gaussian mixture model, we can also supplement the list with a database of known bad variants. --resource binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3.
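
To make the tag syntax from the Example section concrete, the Python sketch below splits a tag string such as hapmap,known=false,training=true,truth=true,prior=15.0 into its name and its flags. The parser, its default values, and the helper that reads the prior as a Phred-scaled quality are all assumptions made for illustration; none of this is GATK code, and the Phred reading in particular should be checked against the GATK documentation before relying on it.

```python
def parse_resource_tag(tag):
    """Split 'name,key=value,...' into a dictionary (hypothetical helper).

    Missing flags default to false and a missing prior to 0.0; these
    defaults are assumptions, not documented tool behavior.
    """
    name, *pairs = tag.split(",")
    fields = dict(pair.split("=", 1) for pair in pairs)
    return {
        "name": name,
        "known": fields.get("known", "false") == "true",
        "training": fields.get("training", "false") == "true",
        "truth": fields.get("truth", "false") == "true",
        "prior": float(fields.get("prior", "0.0")),
    }


def phred_to_probability(q):
    """Probability of being correct, assuming q is a Phred-scaled quality."""
    return 1.0 - 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    resource = parse_resource_tag("hapmap,known=false,training=true,truth=true,prior=15.0")
    print(resource)
    print(round(phred_to_probability(resource["prior"]), 4))
```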

--rscript_file / -rscriptFile ( File )

The output rscript file generated by the VQSR to aid in visualization of the input data and learned model.

--shrinkage / -shrinkage ( double with default value 1.0 )

The shrinkage parameter in the variational Bayes algorithm.

--stdThreshold / -std ( double with default value 14.0 )

If a variant has annotations more than -std standard deviations away from the mean, then don't use it for building the Gaussian mixture model.
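
One way to read the -std rule is sketched below in Python: compute a mean and standard deviation per annotation, then drop any variant in which at least one annotation lies further from its mean than the threshold. The function name, the use of the population standard deviation, the statistics being computed over the same rows that are filtered, and the "any annotation" interpretation are assumptions made for illustration, not a description of the tool's internals.

```python
import statistics


def within_std_threshold(rows, std_threshold=14.0):
    """Keep rows whose annotations all lie within std_threshold deviations.

    rows is a list of equal-length tuples, one value per annotation.
    Annotations with zero variance never exclude a row (an assumption).
    """
    if not rows:
        return []
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    kept = []
    for row in rows:
        z_scores = [
            abs(value - mean) / sd if sd > 0 else 0.0
            for value, mean, sd in zip(row, means, stdevs)
        ]
        if max(z_scores) <= std_threshold:
            kept.append(row)
    return kept


if __name__ == "__main__":
    # Made-up (QD, MQ) pairs plus one extreme MQ value.
    rows = [(10.0, 50.0 + (i % 3)) for i in range(20)] + [(10.0, 500.0)]
    print(len(within_std_threshold(rows, std_threshold=3.0)))
```

The demonstration uses a threshold of 3.0 rather than the default of 14.0 because, with a population standard deviation, no point among n values can sit more than sqrt(n - 1) deviations from the mean, so a tiny made-up data set can never exceed 14.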

--target_titv / -titv ( double with default value 2.15 )

The expected novel Ti/Tv ratio to use when calculating FDR tranches and for display on the optimization curve output figures (approximately 2.15 for whole-genome experiments). ONLY USED FOR PLOTTING PURPOSES! The expected transition / transversion ratio of true novel variants in your targeted region (whole genome, exome, specific genes), which varies greatly by the CpG and GC content of the region. See the expected Ti/Tv ratios section of the GATK best practices documentation (http://www.broadinstitute.org/gatk/guide/topic?name=best-practices) for more information. Normal values are 2.15 for whole genome and 3.2 for whole exome. Note that this parameter is used for display purposes only and isn't used anywhere in the algorithm!
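
For readers unfamiliar with the quantity, the Python sketch below computes a transition / transversion ratio from a list of (ref, alt) base pairs. Transitions are the purine-to-purine and pyrimidine-to-pyrimidine changes (A<->G and C<->T); every other single-base change is a transversion. The function name and the input representation are hypothetical, and the input is assumed to contain only biallelic SNPs with differing ref and alt bases.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps):
    """Return transitions / transversions for (ref, alt) base pairs."""
    ti = sum(1 for ref, alt in snps if (ref.upper(), alt.upper()) in TRANSITIONS)
    tv = len(snps) - ti
    if tv == 0:
        raise ValueError("no transversions: ratio is undefined")
    return ti / tv


if __name__ == "__main__":
    # Made-up SNPs: two transitions and one transversion.
    print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```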

--tranches_file / -tranchesFile ( required File )

The output tranches file used by ApplyRecalibration.

--trustAllPolymorphic / -allPoly ( Boolean with default value false )

Trust that all the input training sets' unfiltered records contain only polymorphic sites to drastically speed up the computation.

--ts_filter_level / -ts_filter_level ( double with default value 99.0 )

The truth sensitivity level at which to start filtering, used here to indicate filtered variants in the model reporting plots.

--TStranche / -tranche ( double[] with default value [100.0, 99.9, 99.0, 90.0] )

The levels of truth sensitivity at which to slice the data (in percent, that is 1.0 for 1 percent). Add truth sensitivity slices through the call set at the given values. The default values are 100.0, 99.9, 99.0, and 90.0, which will result in 4 estimated tranches in the final call set: the full set of calls (100% sensitivity at the accessible sites in the truth set), a 99.9% truth sensitivity tranche, along with progressively smaller tranches at 99% and 90%.
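
The notion of a truth sensitivity slice can be sketched in Python as follows: given the scores of the called variants that overlap the truth set, find the lowest score cutoff that still retains the requested percentage of them. The function name, the ceiling-based rounding, and the handling of ties are assumptions made for illustration; the tranches file written by the tool is the authoritative source for the actual cutoffs.

```python
import math


def vqslod_cutoff(truth_lods, sensitivity_percent):
    """Lowest score that keeps at least sensitivity_percent of truth sites.

    truth_lods holds the scores of called variants found in the truth set.
    Keeping every variant with a score >= the returned value retains the
    requested share of those truth sites.
    """
    if not truth_lods:
        raise ValueError("no truth-site scores supplied")
    ordered = sorted(truth_lods, reverse=True)
    n_keep = math.ceil(len(ordered) * sensitivity_percent / 100.0)
    n_keep = max(1, min(len(ordered), n_keep))
    return ordered[n_keep - 1]


if __name__ == "__main__":
    lods = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]  # made-up scores
    for level in (100.0, 90.0, 50.0):
        print(level, vqslod_cutoff(lods, level))
```

Lower sensitivity levels give higher cutoffs and therefore smaller, more stringent slices, which matches the "progressively smaller tranches" described above.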

--use_annotation / -an ( required String[] )

The names of the annotations which should be used for calculations. See the input VCF file's INFO field for a list of all available annotations.



GATK version 2.5-2-gdb4546e built at 2013/05/01 09:32:36.