Genotypes a dataset and validates the calls of another dataset using the Unified Genotyper.
Genotype and Validate is a tool to evaluate the quality of a dataset for calling SNPs and Indels given a secondary (validation) data source. The data sources are BAM or VCF files. You can use them interchangeably (i.e. a BAM to validate calls in a VCF or a VCF to validate calls on a BAM).
The simplest scenario is when you have a VCF of hand-annotated SNPs and indels, and you want to know how well a particular technology performs at calling these variants. With a dataset (BAM file) generated by the technology under test and the hand-annotated VCF, you can run GenotypeAndValidate to assess the accuracy of the calls made from the new technology's dataset.
Another option is to validate the calls in a VCF file using a deep-coverage BAM file whose calls you trust. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare them to the calls in the VCF file and produce a truth table.
The tool takes as input a BAM file to make calls on and a VCF file to use as the truth validation dataset. You can also invert the roles of the two files using the command-line options listed below.
GenotypeAndValidate has two outputs: the truth table and an optional VCF file. The truth table is a 2x2 table that relates what was called in the dataset to the truth of the call (for example, whether a site called alt is a true positive or a false positive). The table should look like this:
| ||ALT (truth)||REF (truth)||Predictive Value|
|called alt||True Positive (TP)||False Positive (FP)||Positive PV|
|called ref||False Negative (FN)||True Negative (TN)||Negative PV|
The positive predictive value (PPV) is the proportion of sites called alt that are truly alt: PPV = TP / (TP + FP).
The negative predictive value (NPV) is the proportion of sites called ref that are truly ref: NPV = TN / (TN + FN).
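The two definitions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the walker's own code: the function names `truth_table` and `predictive_values` are hypothetical, and the example sites are made up.

```python
from collections import Counter


def truth_table(sites):
    """Tally (called_alt, truth_alt) boolean pairs into TP/FP/FN/TN counts."""
    counts = Counter()
    for called_alt, truth_alt in sites:
        if called_alt:
            counts["TP" if truth_alt else "FP"] += 1
        else:
            counts["FN" if truth_alt else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


def predictive_values(table):
    """Return (PPV, NPV); a value is None when its denominator is zero."""
    called_alt = table["TP"] + table["FP"]
    called_ref = table["TN"] + table["FN"]
    ppv = table["TP"] / called_alt if called_alt else None
    npv = table["TN"] / called_ref if called_ref else None
    return ppv, npv


# Made-up sites, purely for illustration: (called alt?, truly alt?)
example_sites = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False), (False, False),
]
table = truth_table(example_sites)
ppv, npv = predictive_values(table)
print(table, ppv, npv)
```

Returning None for an empty row avoids a division by zero when no site was called alt (or ref); how the tool itself reports that case is not covered here.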
The VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters. This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).
Here is an example of an annotated VCF file (info field clipped for clarity):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1
1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./.
13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./.
1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./.
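To inspect the annotations in records like these, the fixed columns and the INFO field can be pulled apart with plain string handling. The sketch below is an assumption-laden illustration: `parse_info` and `parse_record` are hypothetical helpers, the record is split on any whitespace because that is how the example is shown here (real VCF files are tab-delimited), and no meaning is assigned to individual keys such as GV.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    if info == ".":  # "." marks an empty INFO field
        return {}
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_record(line):
    """Parse the first eight (fixed) columns of a single VCF data line."""
    columns = line.split()  # assumption: whitespace-separated, as displayed above
    chrom, pos, _id, ref, alt, qual, filt, info = columns[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "filter": filt,
        "info": parse_info(info),
    }


record = parse_record(
    "13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 "
    "GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./."
)
print(record["chrom"], record["pos"], record["info"])
```

INFO values are kept as strings so that nothing is silently converted; cast them (for example `int(record["info"]["DP"])`) only where the type is known.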
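To genotype a BAM file from the new technology, using the hand-annotated VCF as the truth dataset:
java -jar /GenomeAnalysisTK.jar -T GenotypeAndValidate -R human_g1k_v37.fasta -I myNewTechReads.bam -alleles handAnnotatedVCF.vcf -L handAnnotatedVCF.vcf
To use the calls made on a trusted BAM file as the truth dataset (-bt), validate the calls in a VCF against them, and write the variants considered to gav.vcf:
java -jar /GenomeAnalysisTK.jar -T GenotypeAndValidate -R human_g1k_v37.fasta -I myTruthDataset.bam -alleles callsToValidate.vcf -L callsToValidate.vcf -bt -o gav.vcf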
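When the tool is driven from a script, a command line like the ones above can be assembled as an argument list. The helper below is a hypothetical convenience wrapper, not part of GATK; it uses only the flags that appear in the example commands, and the jar path is copied as written there.

```python
def build_gav_command(jar, reference, bam, alleles, bam_truth=False, out=None):
    """Assemble a GenotypeAndValidate command line as an argv-style list."""
    cmd = [
        "java", "-jar", jar,
        "-T", "GenotypeAndValidate",
        "-R", reference,
        "-I", bam,
        "-alleles", alleles,
        "-L", alleles,  # restrict processing to the sites in the alleles VCF
    ]
    if bam_truth:
        cmd.append("-bt")  # treat the calls made on the BAM as truth
    if out is not None:
        cmd += ["-o", out]  # optional VCF of the variants considered
    return cmd


cmd = build_gav_command(
    "/GenomeAnalysisTK.jar",
    "human_g1k_v37.fasta",
    "myTruthDataset.bam",
    "callsToValidate.vcf",
    bam_truth=True,
    out="gav.vcf",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems with file names.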
These Read Filters are automatically applied to the data by the Engine before processing by GenotypeAndValidate.
This tool can be run in multi-threaded mode using this option.
This tool applies the following downsampling settings by default.
This tool uses a sliding window on the reference.
The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).
This table summarizes the command-line arguments that are specific to this tool. For details, see the list below the table.
|Name||Type||Default value||Summary|
|--alleles||RodBinding[VariantContext]||NA||The set of alleles at which to genotype|
|--out||VariantContextWriter||stdout||Generate a VCF file with the variants considered by the walker, with a new annotation "callStatus" which will carry the value called in the validation VCF or BAM file|
|--condition_on_depth||int||-1||Condition validation on a minimum depth of coverage by the reads|
|--maximum_deletion_fraction||double||-1.0||Maximum deletion fraction for calling a genotype|
|--minimum_base_quality_score||int||-1||Minimum base quality score for calling a genotype|
|-stand_call_conf||double||-1.0||The minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls|
|-stand_emit_conf||double||-1.0||The minimum phred-scaled Qscore threshold to emit low confidence calls|
|--set_bam_truth||boolean||false||Use the calls on the reads (bam file) as the truth dataset and validate the calls on the VCF|
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--alleles: The set of alleles at which to genotype. The callset to be used as truth (default) or validated (if the BAM file is set to truth). --alleles binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3.
--condition_on_depth: Condition validation on a minimum depth of coverage by the reads. Only sites that have at least the given depth are validated.
--maximum_deletion_fraction: Maximum deletion fraction for calling a genotype. The maximum deletion fraction allowed at a site for calling a genotype. This argument is passed to the Unified Genotyper.
--minimum_base_quality_score: Minimum base quality score for calling a genotype. The minimum base quality score necessary for a base to be considered when calling a genotype. This argument is passed to the Unified Genotyper.
--out: Generate a VCF file with the variants considered by the walker, with a new annotation "callStatus" which will carry the value called in the validation VCF or BAM file. This optional output file contains all the variants used in the GenotypeAndValidate assay.
--set_bam_truth: Use the calls on the reads (bam file) as the truth dataset and validate the calls on the VCF. Makes the Unified Genotyper calls on the BAM file the truth dataset and validates the callset bound to the alleles ROD binding.
-stand_call_conf: The minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls. This argument is passed to the Unified Genotyper.
-stand_emit_conf: The minimum phred-scaled Qscore threshold to emit low confidence calls. This argument is passed to the Unified Genotyper.
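As a rough illustration of how the two phred-scaled thresholds relate, the sketch below converts a phred score Q to an error probability (P = 10^(-Q/10), the standard phred definition) and sorts a call into one of three bins. The function names and bin labels are hypothetical, the threshold values 30 and 10 are chosen only for the example and are not the tool's defaults, and the three-way split is an assumption about how the Unified Genotyper uses the thresholds rather than a description of its code.

```python
def phred_to_error_probability(q):
    """Convert a phred-scaled score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)


def classify_call(qual, stand_call_conf, stand_emit_conf):
    """Bin a call by its phred-scaled quality against the two thresholds."""
    if qual >= stand_call_conf:
        return "high_confidence"
    if qual >= stand_emit_conf:
        return "low_confidence"  # emitted, but below the calling threshold
    return "not_emitted"


# Illustrative thresholds only; not the tool's defaults.
for qual in (50, 20, 5):
    print(qual, phred_to_error_probability(qual), classify_call(qual, 30, 10))
```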
GATK version 2.8-1-g2a26ec9 built at 2013/12/06 16:54:02.