# GenotypeAndValidate

Genotypes a dataset and validates the calls of another dataset using the Unified Genotyper.

## Overview

#### Note that this is an old tool that makes use of the UnifiedGenotyper, which has since been deprecated in favor of the HaplotypeCaller.

Genotype and Validate is a tool to evaluate the quality of a dataset for calling SNPs and Indels given a secondary (validation) data source. The data sources are BAM or VCF files. You can use them interchangeably (i.e. a BAM to validate calls in a VCF or a VCF to validate calls on a BAM).

The simplest scenario is when you have a VCF of hand-annotated SNPs and Indels, and you want to know how well a particular technology performs at calling these SNPs. With a dataset (BAM file) generated by the technology under test, and the hand-annotated VCF, you can run GenotypeAndValidate to assess the accuracy of the calls made from the new technology's dataset.

Another option is to validate the calls on a VCF file, using a deep coverage BAM file that you trust the calls on. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare to the calls in the VCF file and produce a truth table.

### Input

A BAM file to make calls on and a VCF file to use as the truth (validation) dataset. You can also invert the roles of the two files using the command-line options listed below.

### Output

GenotypeAndValidate has two outputs: the truth table and an optional VCF file. The truth table is a 2x2 table that correlates what was called in the dataset with the truth of the call (for example, whether a call is a true positive or a false positive). The table should look like this:

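|            | ALT                 | REF                 | Predictive Value |
|------------|---------------------|---------------------|------------------|
| called alt | True Positive (TP)  | False Positive (FP) | Positive PV      |
| called ref | False Negative (FN) | True Negative (TN)  | Negative PV      |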

The positive predictive value (PPV) is the proportion of positive results that are correct. Here, it is the fraction of sites called alt that are truly alt: TP / (TP + FP).

The negative predictive value (NPV) is the proportion of negative results that are correct. Here, it is the fraction of sites called ref that are truly ref: TN / (TN + FN).
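
As a rough illustration of how the two predictive values follow from the truth table, the sketch below computes them from the four cell counts. The function name and the example counts are hypothetical and are not produced by GenotypeAndValidate itself; an empty row is reported as `None` rather than dividing by zero, which is a choice made for this sketch.

```python
def predictive_values(tp, fp, fn, tn):
    """Return (PPV, NPV) for a 2x2 truth table.

    PPV uses the 'called alt' row (TP, FP); NPV uses the 'called ref' row (FN, TN).
    A row with no calls yields None instead of raising ZeroDivisionError.
    """
    called_alt = tp + fp
    called_ref = fn + tn
    ppv = tp / called_alt if called_alt else None
    npv = tn / called_ref if called_ref else None
    return ppv, npv


# Hypothetical counts, chosen only to illustrate the arithmetic.
ppv, npv = predictive_values(tp=90, fp=10, fn=5, tn=95)
print(f"PPV={ppv:.2f} NPV={npv:.2f}")
```

When comparing two technologies, the same function can be applied to each truth table, provided both were computed over the same set of sites.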

The VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters. This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).

Here is an example of an annotated VCF file (info field clipped for clarity)

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
    1   20568807    .   C   T   0    HapMapHet        AC=1;AF=0.50;AN=2;DP=0;GV=T  GT  0/1
    1   22359922    .   T   C   282  WG-CG-HiSeq      AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ  1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99    ./.
    13  102391461   .   G   A   341  Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ  ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99   ./.
    1   175516757   .   C   G   655  SnpCluster,WG    AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ  ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99   ./.


• You should always use -L on your VCF track, so that the GATK only looks at the sites on the VCF file. This speeds up the process a lot.
• The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per new insertion or deletion (e.g. ACTG/- counts as 4 genotyper calls, but it is only one line in the VCF).

### Examples

1. Genotype a BAM file from a new technology using the VCF as a truth dataset:

        java \
          -jar /GenomeAnalysisTK.jar \
          -T GenotypeAndValidate \
          -R human_g1k_v37.fasta \
          -alleles handAnnotatedVCF.vcf \
          -L handAnnotatedVCF.vcf

2. Use a BAM file as the truth dataset:

        java \
          -jar /GenomeAnalysisTK.jar \
          -T GenotypeAndValidate \
          -R human_g1k_v37.fasta \
          -I myTruthDataset.bam \
          -alleles callsToValidate.vcf \
          -L callsToValidate.vcf \
          -bt \
          -o gav.vcf
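
If these runs are scripted, the command line can be assembled programmatically. The sketch below is only an illustration: the helper `build_gav_command` and its parameter names are hypothetical, and it uses only the flags that appear in the examples above (`-T`, `-R`, `-I`, `-alleles`, `-L`, `-bt`, `-o`). It follows the earlier advice of passing the same VCF to both `-alleles` and `-L`.

```python
import shlex


def build_gav_command(reference, alleles_vcf, bam=None, bam_truth=False,
                      out_vcf=None, jar="/GenomeAnalysisTK.jar"):
    """Assemble a GenotypeAndValidate argument list (hypothetical helper)."""
    cmd = ["java", "-jar", jar, "-T", "GenotypeAndValidate", "-R", reference]
    if bam is not None:
        cmd += ["-I", bam]            # reads to genotype
    # Restrict processing to the sites listed in the VCF itself.
    cmd += ["-alleles", alleles_vcf, "-L", alleles_vcf]
    if bam_truth:
        cmd.append("-bt")             # treat calls on the BAM as truth
    if out_vcf is not None:
        cmd += ["-o", out_vcf]        # optional annotated VCF output
    return cmd


cmd = build_gav_command(
    "human_g1k_v37.fasta",
    "callsToValidate.vcf",
    bam="myTruthDataset.bam",
    bam_truth=True,
    out_vcf="gav.vcf",
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell-quoting problems with file paths.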


These Read Filters are automatically applied to the data by the Engine before processing by GenotypeAndValidate.

### Parallelism options

This tool can be run in multi-threaded mode using this option.

### Downsampling settings

This tool applies the following downsampling settings by default.

• Mode: BY_SAMPLE
• To coverage: 1,000

### Window size

This tool uses a sliding window on the reference.

• Window start: 200 bp before the locus
• Window stop: 200 bp after the locus
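
To make the window arithmetic concrete, the sketch below computes the reference span around a locus from the offsets listed above. The function name is hypothetical, and two details are assumptions made for illustration rather than documented engine behavior: coordinates are treated as 1-based inclusive, and the window is clamped to the contig boundaries.

```python
WINDOW_START = -200  # offset of the window start relative to the locus
WINDOW_STOP = 200    # offset of the window stop relative to the locus


def reference_window(pos, contig_length=None):
    """Return (start, stop) of the window around a 1-based locus position."""
    start = max(1, pos + WINDOW_START)      # assumed clamp at the contig start
    stop = pos + WINDOW_STOP
    if contig_length is not None:
        stop = min(stop, contig_length)     # assumed clamp at the contig end
    return start, stop


print(reference_window(20568807))
```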

## Command-line Arguments

### Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

### GenotypeAndValidate specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.

| Argument name(s) | Default value | Summary |
|---|---|---|
| Required Inputs | | |
| --alleles | NA | The set of alleles at which to genotype |
| Optional Outputs | | |
| --out, -o | NA | Output VCF file with annotated variants |
| Optional Parameters | | |
| --condition_on_depth, -depth | -1 | Condition validation on a minimum depth of coverage by the reads |
| --maximum_deletion_fraction, -deletions | -1.0 | Maximum deletion fraction for calling a genotype |
| --minimum_base_quality_score, -mbq | -1 | Minimum base quality score for calling a genotype |
| --standard_min_confidence_threshold_for_calling, -stand_call_conf | -1.0 | The minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls |
| --standard_min_confidence_threshold_for_emitting, -stand_emit_conf | -1.0 | The minimum phred-scaled Qscore threshold to emit low confidence calls |
| Optional Flags | | |
| --set_bam_truth, -bt | NA | Use the calls on the reads (bam file) as the truth dataset and validate the calls on the VCF |

### Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

### --alleles / -alleles

The set of alleles at which to genotype
The callset to be used as truth (default) or validated (if BAM file is set to truth).

--alleles binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

R RodBinding[VariantContext]

### --condition_on_depth / -depth

Condition validation on a minimum depth of coverage by the reads
Only validate sites that have at least a given depth

int  [ [ -∞  ∞ ] ]

### --maximum_deletion_fraction / -deletions

Maximum deletion fraction for calling a genotype
The maximum deletion fraction allowed in a site for calling a genotype. This argument is passed to the Unified Genotyper.

double  [ [ -∞  ∞ ] ]

### --minimum_base_quality_score / -mbq

Minimum base quality score for calling a genotype
The minimum base quality score necessary for a base to be considered when calling a genotype. This argument is passed to the Unified Genotyper.

int  [ [ -∞  ∞ ] ]

### --out / -o

Output VCF file with annotated variants
The optional output file that will have all the variants used in the Genotype and Validation assay. The new annotation callStatus will carry the value called in the validation VCF or BAM file.

VariantContextWriter

### --set_bam_truth / -bt

Use the calls on the reads (bam file) as the truth dataset and validate the calls on the VCF
Makes the Unified Genotyper calls on the BAM file the truth dataset and validates the callset bound to the alleles ROD binding.

boolean

### --standard_min_confidence_threshold_for_calling / -stand_call_conf

The minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls
The minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls. This argument is passed to the Unified Genotyper.

double  [ [ -∞  ∞ ] ]

### --standard_min_confidence_threshold_for_emitting / -stand_emit_conf

The minimum phred-scaled Qscore threshold to emit low confidence calls
The minimum phred-scaled Qscore threshold to emit low confidence calls. This argument is passed to the Unified Genotyper.

double  [ [ -∞  ∞ ] ]