Genotype and Validate is a tool to asses the quality of a technology dataset for calling SNPs and Indels given a secondary (validation) datasource.
The simplest scenario is when you have a VCF of hand annotated SNPs and Indels, and you want to know how well a particular technology performs calling these snps. With a dataset (BAM file) generated by the technology in test, and the hand annotated VCF, you can run GenotypeAndValidate to asses the accuracy of the calls with the new technology's dataset.
Another option is to validate the calls on a VCF file, using a deep coverage BAM file that you trust the calls on. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare to the calls in the VCF file and produce a truth table.
Usage of GenotypeAndValidate and its command line arguments are described here.
The annotations can be either true positive (T) or false positive (F). 'T' means it is known to be a true SNP/Indel, while a 'F' means it is known not to be a SNP/Indel but the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F.
GenotypeAndValidate has two outputs. The truth table and the optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this:
|called alt||True Positive (TP)||False Positive (FP)||Positive PV|
|called ref||False Negative (FN)||True Negative (TN)||Negative PV|
The positive predictive value (PPV) is the proportion of subjects with positive test results who are correctly diagnose.
The negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.
The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).
You should always use -BTI alleles, so that the GATK only looks at the sites on the VCF file, speeds up the process a lot. (this will soon be added as a default gatk engine mode)
The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per new insertion or deletion. (i.e. ACTG/- will count as 4 genotyper calls, but it's only one line in the VCF).
Genotypes BAM file from new technology using the VCF as a truth dataset:
java \ -jar /GenomeAnalysisTK.jar \ -T GenotypeAndValidate \ -R human_g1k_v37.fasta \ -I myNewTechReads.bam \ -alleles handAnnotatedVCF.vcf \ -BTI alleles \ -o gav.vcf
An annotated VCF example (info field clipped for clarity)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1 1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./. 13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./. 1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./.
Using a BAM file as the truth dataset:
java \ -jar /GenomeAnalysisTK.jar \ -T GenotypeAndValidate \ -R human_g1k_v37.fasta \ -I myTruthDataset.bam \ -alleles callsToValidate.vcf \ -BTI alleles \ -bt \ -o gav.vcf
Example truth table of PacBio reads (BAM) to validate HiSeq annotated dataset (VCF) using the GenotypeAndValidate walker:
Note that earlier versions of the GATK used a different tool.
For a complete, detailed argument reference, refer to the GATK document page here
This tool generates amplicon sequences for use with the Sequenom primer design tool. The output of this tool is fasta-formatted, where the characters [A/B] specify the allele to be probed (see Validation Amplicons Output further below). It can mask nearby variation (either by 'N' or by lower-casing characters), and can try to restrict sequenom design to regions of the amplicon likely to generate a highly specific primer. This tool will also flag sites with properties that could shift the mass-spec peak from its expected value, such as indels in the amplicon sequence, SNPs within 4 bases of the variant attempting to be probed, or multiple variants selected for validation falling into the same amplicon.
Ns in the amplicon sequence instructs primer design software (such as Sequenom) not to use that base in the primer: any primer will fall entirely before, or entirely after, that base. Lower-case letters instruct the design software to try to avoid using the base (presumably by applying a penalty for doing so), but will not prevent it from doing so if a good primer (i.e. a primer with suitable melting temperature and low probability of hairpin formation) is found.
ValidationAmplicons relies on the GATK Sting BWA/C bindings to assess the specificity of potential primers. The wiki page for Sting BWA/C bindings contains required information about how to download the appropriate version of BWA, how to create a BWT reference, and how to set your classpath appropriately to run this tool. If you have not followed the directions to set up the BWA/C bindings, you will not be able to create validation amplicon sequences using the GATK. There is an argument (see below) to disable the use of BWA, and lower repeats within the amplicon only. Use of this argument is not recommended.
Validation Amplicons requires three input files: a VCF of alleles you want to validate, a VCF of variants you want to mask, and a Table of intervals around the variants describing the size of the amplicons. For instance:
Alleles to Validate
##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO 20 207414 . G A 85.09 PASS . // SNP to validate 20 792122 . TCCC T 22.24 PASS . // DEL to validate 20 994145 . G GAAG 48.21 PASS . // INS to validate 20 1074230 . C T 2.29 QD . // SNP to validate (but filtered) 20 1084330 . AC GT 42.21 PASS . // MNP to validate
HEADERpos name 20:207334-207494 20_207414 20:792042-792202 20_792122 20:994065-994225 20_994145 20:1074150-1074310 20_1074230 20:1084250-1084410 20_1084330
Alleles to Mask
##fileformat=VCFv4.1 #CHROM POS ID REF ALT QUAL FILTER INFO 20 207414 . G A 77.12 PASS . 20 207416 . A AGGC 49422.34 PASS . 20 792076 . A G 2637.15 HaplotypeScore . 20 792080 . T G 161.83 PASS . 20 792087 . CGGT C 179.84 ReadPosRankSum . 20 792106 . C G 32.59 PASS . 20 792140 . C G 409.75 PASS . 20 1084319 . T A,C 22.24 PASS . 20 1084348 . TACCACCCCACACA T 482.84 PASS .
The output from Validation Amplicons is a fasta-formatted file, with a small adaptation to represent the site being probed. Using the test files above, the output of the command
java -jar $GATK/dist/GenomeAnalysisTK.jar \ -T ValidationAmplicons \ -R /humgen/1kg/reference/human_g1k_v37.fasta \ -BTI ProbeIntervals \ --ProbeIntervals:table interval_table.table \ --ValidateAlleles:vcf sites_to_validate.vcf \ --MaskAlleles:vcf mask_sites.vcf \ --virtualPrimerSize 30 \ -o probes.fasta \ -l WARN
>20:207414 INSERTION=1,VARIANT_TOO_NEAR_PROBE=1, 20_207414 CCAACGTTAAGAAAGAGACATGCGACTGGGTgcggtggctcatgcctggaaccccagcactttgggaggccaaggtgggc[A/G*]gNNcacttgaggtcaggagtttgagaccagcctggccaacatggtgaaaccccgtctctactgaaaatacaaaagttagC >20:792122 Valid 20_792122 TTTTTTTTTagatggagtctcgctcttatcgcccaggcNggagtgggtggtgtgatcttggctNactgcaacttctgcct[-/CCC*]cccaggttcaagtgattNtcctgcctcagccacctgagtagctgggattacaggcatccgccaccatgcctggctaatTT >20:994145 Valid 20_994145 TCCATGGCCTCCCCCTGGCCCACGAAGTCCTCAGCCACCTCCTTCCTGGAGGGCTCAGCCAAAATCAGACTGAGGAAGAAG[AAG/-*]TGGTGGGCACCCACCTTCTGGCCTTCCTCAGCCCCTTATTCCTAGGACCAGTCCCCATCTAGGGGTCCTCACTGCCTCCC >20:1074230 SITE_IS_FILTERED=1, 20_1074230 ACCTGATTACCATCAATCAGAACTCATTTCTGTTCCTATCTTCCACCCACAATTGTAATGCCTTTTCCATTTTAACCAAG[T/C*]ACTTATTATAtactatggccataacttttgcagtttgaggtatgacagcaaaaTTAGCATACATTTCATTTTCCTTCTTC >20:1084330 DELETION=1, 20_1084330 CACGTTCGGcttgtgcagagcctcaaggtcatccagaggtgatAGTTTAGGGCCCTCTCAAGTCTTTCCNGTGCGCATGG[GT/AC*]CAGCCCTGGGCACCTGTNNNNNNNNNNNNNTGCTCATGGCCTTCTAGATTCCCAGGAAATGTCAGAGCTTTTCAAAGCCC
Note that SNPs have been masked with 'N's, filtered 'mask' variants do not appear, the insertion has been flanked by Ns, the unfiltered deletion has been replaced by Ns, and the filtered site in the validation VCF is not marked as valid. In addition, bases that fall inside at least one non-unique 30-mer (meaning no multiple MQ0 alignments using BWA) are lower-cased. The identifier for each sequence is the position of the allele to be probed, a 'validation status' (defined below), and a string representing the amplicon. Validation status values are:
Valid // amplicon is valid SITE_IS_FILTERED=1 // validation site is not marked 'PASS' or '.' in its filter field ("you are trying to validate a filtered variant") VARIANT_TOO_NEAR_PROBE=1 // there is a variant too near to the variant to be validated, potentially shifting the mass-spec peak MULTIPLE_PROBES=1, // multiple variants to be validated found inside the same amplicon DELETION=6,INSERTION=5, // 6 deletions and 5 insertions found inside the amplicon region (from the "mask" VCF), will be potentially difficult to validate DELETION=1, // deletion found inside the amplicon region, could shift mass-spec peak START_TOO_CLOSE, // variant is too close to the start of the amplicon region to give sequenom a good chance to find a suitable primer END_TOO_CLOSE, // variant is too close to the end of the amplicon region to give sequenom a good chance to find a suitable primer NO_VARIANTS_FOUND, // no variants found within the amplicon region INDEL_OVERLAPS_VALIDATION_SITE, // an insertion or deletion interferes directly with the site to be validated (i.e. insertion directly preceding or postceding, or a deletion that spans the site itself)
The files provided to Validation Amplicons should be such that all generated amplicons are valid. That means:
There are no variants within 4bp of the site to be validated There are no indels in the amplicon region Amplicon windows do not include other sites to be probed Amplicon windows are not too short, and the variant therein is not within 50bp of either edge All amplicon windows contain a variant to be validated Variants to be validated are unfiltered or pass filters
The tool will warn you each time any of these conditions are not met.
ValidationSiteSelectorWalker is intended for use in experiments where we sample data randomly from a set of variants, for example in order to choose sites for a follow-up validation study. Sites are selected randomly but within certain restrictions. There are two main sources of restrictions: Sample restrictions and Frequency restrictions. Sample restrictions alter the polymorphic/monomorphic status of sites by restricting the sample set to a given number of samples. Frequency restrictions bias the site sampling method to sample either uniformly, or in accordance with the allele frequency spectrum of the input VCF.
For example command lines and a full list of arguments, please see the GATK documentation for this tool at Validation Site Selector.
The -sampleMode argument controls the mode of sample-based site consideration. The options are:
Note that Poly_based_on_gl uses the exact allele frequency calculation model to estimate P[site is nonref]. The site is considered for validation if P[site is nonref] > [this argument]. So if you want to validate sites that are >95% confidently nonref (based on the likelihoods), you would set -sampleMode POLY_BASED_ON_GL -samplePNonref 0.95
The -frequencySelectionMode argument controls the mode of frequency matching for site selection. The options are: