# VariantRecalibrator

Create a Gaussian mixture model by looking at the annotation values over a high-quality subset of the input call set, and then evaluate all input variants.

## Overview

This walker is the first pass in a two-stage processing step and is designed to be used in conjunction with the ApplyRecalibration walker.

The purpose of the variant recalibrator is to assign a well-calibrated probability to each variant call in a call set. You can then create highly accurate call sets by filtering based on this single estimate for the accuracy of each call. The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, MQ, HaplotypeScore, and ReadPosRankSum, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact.

This model is determined adaptively based on "true sites" provided as input, typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array. This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real.

The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
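To make the log odds definition concrete, here is a minimal sketch. It is not the tool's implementation: it assumes a single annotation, one Gaussian each for the "good" and "bad" models, made-up parameters, and a base-10 logarithm. The function names are hypothetical. The real model is a multi-dimensional mixture trained on the sites described above.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Natural-log density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def toy_log_odds(x, good=(20.0, 5.0), bad=(5.0, 5.0)):
    """log10 odds that x came from the 'good' model rather than the 'bad' one.

    good and bad are (mean, std) pairs; the values are illustrative only.
    """
    diff = gaussian_log_pdf(x, *good) - gaussian_log_pdf(x, *bad)
    return diff / math.log(10.0)  # convert natural log to log10 (assumed base)
```

In this toy setup, an annotation value near the "good" mean gives a positive score and a value near the "bad" mean gives a negative one, which is the sense in which a higher VQSLOD indicates a more trustworthy call.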

### Inputs

The input raw variants to be recalibrated.

Known, truth, and training sets to be used by the algorithm. How these various sets are used is described below.

### Output

A recalibration table file in VCF format that is used by the ApplyRecalibration walker.

A tranches file that reports various metrics of the recalibrated call set for several slices through the data.

### Example

    java -Xmx4g -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference/human_g1k_v37.fasta \
    -input NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
    -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
    -mode SNP \
    -recalFile path/to/output.recal \
    -tranchesFile path/to/output.tranches \
    -rscriptFile path/to/output.plots.R


### Caveat

• The values used in the example above are only meant to show how the command lines are composed. They are not meant to be taken as specific recommendations of values to use in your own work, and they may be different from the values cited elsewhere in our documentation. For the latest recommendations on how to set parameter values for your own analyses, please read the Best Practices section of the documentation.
• To create the model reporting plots, Rscript needs to be in your environment PATH (this is the scripting version of R, not the interactive version). See http://www.r-project.org for more info on how to download and install R.

These Read Filters are automatically applied to the data by the Engine before processing by VariantRecalibrator.

### Parallelism options

This tool can be run in multi-threaded mode using this option.

### Downsampling settings

This tool applies the following downsampling settings by default.

• Mode: BY_SAMPLE
• To coverage: 1,000

## Command-line Arguments

### Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

### VariantRecalibrator specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.

| Argument name(s) | Default value | Summary |
|---|---|---|
| **Required Inputs** | | |
| --input | NA | The raw input variants to be recalibrated |
| --resource | [] | A list of sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run) |
| **Required Outputs** | | |
| --recal_file / -recalFile | NA | The output recal file used by ApplyRecalibration |
| --tranches_file / -tranchesFile | NA | The output tranches file used by ApplyRecalibration |
| **Required Parameters** | | |
| --mode | SNP | Recalibration mode to employ |
| --use_annotation / -an | NA | The names of the annotations which should be used for calculations |
| **Optional Inputs** | | |
| --aggregate | NA | Additional raw input variants to be used in building the model |
| **Optional Outputs** | | |
| --rscript_file / -rscriptFile | NA | The output rscript file generated by the VQSR to aid in visualization of the input data and learned model |
| **Optional Parameters** | | |
| --ignore_filter / -ignoreFilter | NA | If specified, the variant recalibrator will also use variants marked as filtered by the specified filter name in the input VCF file |
| --target_titv / -titv | 2.15 | The expected novel Ti/Tv ratio to use when calculating FDR tranches and for display on the optimization curve output figures. (approx 2.15 for whole genome experiments). ONLY USED FOR PLOTTING PURPOSES! |
| --TStranche / -tranche | [100.0, 99.9, 99.0, 90.0] | The levels of novel false discovery rate (FDR, implied by ti/tv) at which to slice the data. (in percent, that is 1.0 for 1 percent) |
| | -5.0 | LOD score cutoff for selecting bad variants |
| --dirichlet | 0.001 | The dirichlet parameter in the variational Bayes algorithm. |
| --maxGaussians / -mG | 8 | Max number of Gaussians for the positive model |
| --maxIterations / -mI | 150 | Maximum number of VBEM iterations |
| --maxNegativeGaussians / -mNG | 2 | Max number of Gaussians for the negative model |
| --maxNumTrainingData | 2500000 | Maximum number of training data |
| | 1000 | Minimum number of bad variants |
| --numKMeans / -nKM | 100 | Number of k-means iterations |
| --priorCounts | 20.0 | The number of prior counts to use in the variational Bayes algorithm. |
| --shrinkage | 1.0 | The shrinkage parameter in the variational Bayes algorithm. |
| --stdThreshold / -std | 10.0 | Annotation value divergence threshold (number of standard deviations from the means) |
| --trustAllPolymorphic / -allPoly | false | Trust that all the input training sets' unfiltered records contain only polymorphic sites to drastically speed up the computation. |

### Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

### --aggregate / -aggregate

Additional raw input variants to be used in building the model
These additional calls should be unfiltered and annotated with the error covariates that are intended to be used for modeling.

--aggregate binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

List[RodBinding[VariantContext]]

LOD score cutoff for selecting bad variants
Variants scoring lower than this threshold will be used to build the Gaussian model of bad variants.

double  -5.0  [ [ -∞  ∞ ] ]

### --dirichlet / -dirichlet

The dirichlet parameter in the variational Bayes algorithm.

double  0.001  [ [ -∞  ∞ ] ]

### --ignore_filter / -ignoreFilter

If specified, the variant recalibrator will also use variants marked as filtered by the specified filter name in the input VCF file
For this to work properly, the -ignoreFilter argument should also be applied to the ApplyRecalibration command.

String[]

### --input / -input

The raw input variants to be recalibrated
These calls should be unfiltered and annotated with the error covariates that are intended to be used for modeling.

R List[RodBindingCollection[VariantContext]]

### --maxGaussians / -mG

Max number of Gaussians for the positive model
This parameter determines the maximum number of Gaussians that should be used when building a positive model using the variational Bayes algorithm.

int  8  [ [ -∞  ∞ ] ]

### --maxIterations / -mI

Maximum number of VBEM iterations
This parameter determines the maximum number of VBEM iterations to be performed in the variational Bayes algorithm. The procedure will normally end when convergence is detected.

int  150  [ [ -∞  ∞ ] ]

### --maxNegativeGaussians / -mNG

Max number of Gaussians for the negative model
This parameter determines the maximum number of Gaussians that should be used when building a negative model using the variational Bayes algorithm. The actual maximum used is the smaller of the -mG and -mNG values, meaning that if -mG is smaller than -mNG, -mG will be used for both. Note that this number should be small (e.g. 4) to achieve the best results.

int  2  [ [ -∞  ∞ ] ]

### --maxNumTrainingData / -maxNumTrainingData

Maximum number of training data
The number of variants to use in building the Gaussian mixture model. Training sets larger than this will be randomly downsampled.

int  2500000  [ [ -∞  ∞ ] ]
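The downsampling behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the function name and the fixed seed are hypothetical, and the variants are represented as a plain list.

```python
import random

def downsample_training(variants, max_num_training_data=2500000, seed=0):
    """Return the variants unchanged if there are few enough of them,
    otherwise a random subset of exactly max_num_training_data items."""
    if len(variants) <= max_num_training_data:
        return list(variants)
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    return rng.sample(variants, max_num_training_data)
```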

This parameter determines the minimum number of variants that will be selected from the list of worst scoring variants to use for building the Gaussian mixture model of bad variants.

int  1000  [ [ -∞  ∞ ] ]
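The two settings above (the LOD cutoff and the minimum number of bad variants) could interact as in the following sketch. This is only one plausible reading, since the text does not spell out how they are combined; the function and parameter names are hypothetical.

```python
def select_bad_variants(lods, bad_lod_cutoff=-5.0, min_num_bad=1000):
    """Pick the LOD scores used to train the 'bad' model.

    Assumed rule: take every score below the cutoff, but never fewer than
    min_num_bad of the worst-scoring variants (if that many exist).
    """
    ordered = sorted(lods)  # worst (lowest) scores first
    below = [lod for lod in ordered if lod < bad_lod_cutoff]
    if len(below) >= min_num_bad:
        return below
    return ordered[:min_num_bad]
```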

### --mode / -mode

Recalibration mode to employ
Use either SNP for recalibrating only SNPs (emitting indels untouched in the output VCF) or INDEL for indels (emitting SNPs untouched in the output VCF). There is also a BOTH option for recalibrating both SNPs and indels simultaneously, but this is meant for testing purposes only and should not be used in actual analyses.

The --mode argument is an enumerated type (Mode), which can have one of the following values:

SNP
INDEL
BOTH

R Mode  SNP

### --numKMeans / -nKM

Number of k-means iterations
This parameter determines the number of k-means iterations to perform in order to initialize the means of the Gaussians in the Gaussian mixture model.

int  100  [ [ -∞  ∞ ] ]
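As a rough illustration of what k-means initialization does, the sketch below clusters a single annotation and returns the cluster centers, which would then serve as starting means for the Gaussians. It is a simplified assumption-laden example: the real model works on several annotations at once, and the function name and seed are hypothetical.

```python
import random

def kmeans_1d(values, k, iterations=100, seed=0):
    """Plain k-means on 1-D data; returns the k cluster centers, sorted."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # start from k randomly chosen data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest current center.
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged before using all iterations
        centers = new_centers
    return sorted(centers)
```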

### --priorCounts / -priorCounts

The number of prior counts to use in the variational Bayes algorithm.

double  20.0  [ [ -∞  ∞ ] ]

### --recal_file / -recalFile

The output recal file used by ApplyRecalibration

R VariantContextWriter

### --resource / -resource

A list of sites for which to apply a prior probability of being correct but which aren't used by the algorithm (training and truth sets are required to run)
Any set of VCF files to use as lists of training, truth, or known sites.

• Training - The program builds the Gaussian mixture model using input variants that overlap with these training sites.
• Truth - The program uses these truth sites to determine where to set the cutoff in VQSLOD sensitivity.
• Known - The program only uses known sites for reporting purposes (to indicate whether variants are already known or novel). They are not used in any calculations by the algorithm itself.
• Bad - A database of known bad variants can be used to supplement the set of worst ranked variants (compared to the Gaussian mixture model) that the program selects from the data to model "bad" variants.

--resource binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

R List[RodBinding[VariantContext]]  []
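The tagged form of this argument, as used in the Example section (for instance -resource:hapmap,known=false,training=true,truth=true,prior=15.0), can be read as a name followed by key=value settings. The sketch below parses that string. The helper names are hypothetical, and the Phred-scale interpretation of prior is an assumption that is not stated on this page.

```python
def parse_resource_tag(arg):
    """Split '-resource:name,key=value,...' into a name and a dict of settings."""
    prefix, _, tags = arg.partition(":")
    if prefix != "-resource" or not tags:
        raise ValueError("expected -resource:name,key=value,...")
    name, *pairs = tags.split(",")
    settings = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key == "prior":
            settings[key] = float(value)
        else:
            settings[key] = (value == "true")  # known / training / truth flags
    return name, settings

def phred_to_probability(q):
    """Assumption: the prior is a Phred-scaled probability of being correct."""
    return 1.0 - 10.0 ** (-q / 10.0)
```

Under that assumption, a prior of 15.0 would correspond to a probability of roughly 0.968 that a site in the resource is correct.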

### --rscript_file / -rscriptFile

The output rscript file generated by the VQSR to aid in visualization of the input data and learned model

File

### --shrinkage / -shrinkage

The shrinkage parameter in the variational Bayes algorithm.

double  1.0  [ [ -∞  ∞ ] ]

### --stdThreshold / -std

Annotation value divergence threshold (number of standard deviations from the means)
If a variant has annotations more than -std standard deviations away from the mean, it won't be used for building the Gaussian mixture model.

double  10.0  [ [ -∞  ∞ ] ]
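A minimal sketch of this kind of outlier exclusion is shown below. It assumes each variant is a tuple of numeric annotation values and that the mean and standard deviation are computed per annotation over the same set of variants; the function name is hypothetical and the tool's own bookkeeping may differ.

```python
import statistics

def within_std_threshold(rows, std_threshold=10.0):
    """Keep rows whose every annotation lies within std_threshold standard
    deviations of that annotation's mean across all rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    kept = []
    for row in rows:
        ok = all(
            std == 0 or abs(value - mean) <= std_threshold * std
            for value, mean, std in zip(row, means, stds)
        )
        if ok:
            kept.append(row)
    return kept
```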

### --target_titv / -titv

The expected novel Ti/Tv ratio to use when calculating FDR tranches and for display on the optimization curve output figures. (approx 2.15 for whole genome experiments). ONLY USED FOR PLOTTING PURPOSES!
The expected transition / transversion ratio of true novel variants in your targeted region (whole genome, exome, specific genes), which varies greatly by the CpG and GC content of the region. See the expected Ti/Tv ratios section of the GATK best practices documentation (http://www.broadinstitute.org/gatk/guide/best-practices) for more information. Typical values are 2.15 for human whole genomes and 3.2 for human whole exomes. Note that this parameter is used for display purposes only and isn't used anywhere in the algorithm!

double  2.15  [ [ -∞  ∞ ] ]
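For readers unfamiliar with the ratio itself, the sketch below computes Ti/Tv from a list of single-base substitutions. It is a standalone illustration with hypothetical function names, and it assumes each SNP is given as a (ref, alt) pair of uppercase bases.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """A<->G and C<->T substitutions are transitions; all others are transversions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def titv_ratio(snps):
    """snps is an iterable of (ref, alt) single-base pairs."""
    transitions = transversions = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else float("inf")
```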

### --tranches_file / -tranchesFile

The output tranches file used by ApplyRecalibration

R File

### --trustAllPolymorphic / -allPoly

Trust that all the input training sets' unfiltered records contain only polymorphic sites to drastically speed up the computation.

Boolean  false

### --TStranche / -tranche

The levels of novel false discovery rate (FDR, implied by ti/tv) at which to slice the data. (in percent, that is 1.0 for 1 percent)
Add truth sensitivity slices through the call set at the given values. The default values are 100.0, 99.9, 99.0, and 90.0 which will result in 4 estimated tranches in the final call set: the full set of calls (100% sensitivity at the accessible sites in the truth set), a 99.9% truth sensitivity tranche, along with progressively smaller tranches at 99% and 90%.

double[]  [100.0, 99.9, 99.0, 90.0]
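The idea of a truth sensitivity slice can be sketched as follows: given the VQSLOD scores of the call set's variants that fall on truth sites, find the score cutoff that retains a target percentage of them. This is a simplified illustration with a hypothetical function name; the tool's actual tranche calculation may differ in its details.

```python
import math

def vqslod_cutoff(truth_site_lods, target_sensitivity_percent):
    """Lowest cutoff such that at least the target percentage of truth-site
    variants have a score greater than or equal to it."""
    if not truth_site_lods:
        raise ValueError("need at least one truth-site score")
    ordered = sorted(truth_site_lods, reverse=True)  # best scores first
    needed = math.ceil(len(ordered) * target_sensitivity_percent / 100.0)
    needed = max(1, min(needed, len(ordered)))
    return ordered[needed - 1]
```

With this reading, a 100.0 tranche keeps every truth-site variant, while lower values give stricter cutoffs and therefore smaller, higher-confidence call sets.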

### --use_annotation / -an

The names of the annotations which should be used for calculations
See the input VCF file's INFO field for a list of all available annotations.

R String[]

GATK version 3.2-2-gec30cee built at 2014/09/12 22:29:29. GTD: NA