Genotype and Validate is a tool to assess the quality of a technology dataset for calling SNPs and indels, given a secondary (validation) data source.
The simplest scenario is when you have a VCF of hand-annotated SNPs and indels and you want to know how well a particular technology performs at calling them. With a dataset (BAM file) generated by the technology under test and the hand-annotated VCF, you can run GenotypeAndValidate to assess the accuracy of the calls made with the new technology's dataset.
Another option is to validate the calls in a VCF file using a deep-coverage BAM file whose calls you trust. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare them to the calls in the VCF file and produce a truth table.
Usage of GenotypeAndValidate and its command-line arguments is described here.
The annotations can be either true positive (T) or false positive (F). 'T' means the site is known to be a true SNP/indel, while 'F' means it is known not to be a SNP/indel even though the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F.
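As a minimal sketch of that annotation step (plain Python, not part of the GATK; `add_gv` is a hypothetical helper assuming a tab-delimited VCF body line):

```python
def add_gv(vcf_line, status):
    """Append a GV=T or GV=F tag to the INFO field (column 8) of a VCF body line."""
    assert status in ("T", "F")
    fields = vcf_line.rstrip("\n").split("\t")
    fields[7] = fields[7] + ";GV=" + status  # INFO is the 8th column
    return "\t".join(fields)

line = "1\t20568807\t.\tC\tT\t0\tHapMapHet\tAC=1;AF=0.50;AN=2;DP=0\tGT\t0/1"
print(add_gv(line, "T"))  # same line with ;GV=T appended to the INFO column
```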
GenotypeAndValidate has two outputs: the truth table and an optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table looks like this:
|            | true site           | false site          |             |
| called alt | True Positive (TP)  | False Positive (FP) | Positive PV |
| called ref | False Negative (FN) | True Negative (TN)  | Negative PV |
The positive predictive value (PPV) is the proportion of subjects with positive test results who are correctly diagnosed.
The negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.
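The two definitions above reduce to simple ratios over the four cells of the truth table. A minimal sketch (illustrative counts only, not output of the tool):

```python
def predictive_values(tp, fp, fn, tn):
    """Compute PPV and NPV from the 2x2 truth table counts."""
    ppv = tp / (tp + fp)  # fraction of 'called alt' sites that are truly variant
    npv = tn / (tn + fn)  # fraction of 'called ref' sites that are truly non-variant
    return ppv, npv

# Hypothetical counts, for illustration only:
ppv, npv = predictive_values(tp=90, fp=10, fn=5, tn=95)
print(f"PPV={ppv:.2f} NPV={npv:.2f}")  # PPV=0.90 NPV=0.95
```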
The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).
You should always use -BTI alleles so that the GATK only looks at the sites in the VCF file; this speeds up the process a lot. (This will soon be added as a default GATK engine mode.)
The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per inserted or deleted base (e.g. ACTG/- will count as 4 genotyper calls, but it's only one line in the VCF).
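The counting above can be sketched as follows (`visited_calls` is a hypothetical helper, not a GATK function, using the ACTG/- deletion notation from the text):

```python
def visited_calls(ref, alt):
    """Rough sketch: an indel of length N triggers N genotyper calls,
    so visited bases can exceed the VCF line count."""
    if alt == "-":   # pure deletion in the ACTG/- notation above
        return len(ref)
    if ref == "-":   # pure insertion
        return len(alt)
    return 1         # SNP: one call, one line

print(visited_calls("ACTG", "-"))  # 4 calls from a single VCF line
```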
Genotypes BAM file from new technology using the VCF as a truth dataset:
java \
    -jar /GenomeAnalysisTK.jar \
    -T GenotypeAndValidate \
    -R human_g1k_v37.fasta \
    -I myNewTechReads.bam \
    -alleles handAnnotatedVCF.vcf \
    -BTI alleles \
    -o gav.vcf
An annotated VCF example (info field clipped for clarity)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1
1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./.
13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./.
1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./.
Using a BAM file as the truth dataset:
java \
    -jar /GenomeAnalysisTK.jar \
    -T GenotypeAndValidate \
    -R human_g1k_v37.fasta \
    -I myTruthDataset.bam \
    -alleles callsToValidate.vcf \
    -BTI alleles \
    -bt \
    -o gav.vcf
Example truth table of PacBio reads (BAM) to validate HiSeq annotated dataset (VCF) using the GenotypeAndValidate walker:
I have an inbred mouse strain that I am sequencing and there should be little to NO heterozygosity. Yet with the default settings of UGT -heterozygosity (which is 0.001) many homs are being called as hets. When 230/250 reads are alternate and 20/250 are reference, it calls a het, even though it should be homozygous alternate.
What are your recommendations for this setting for inbred animals?
thanks, GATK is great!
Hi. I have converted the tsv file from Complete Genomics (CGA tools) to VCF format, but when I run programs they report an error in my VCF. I carefully examined the VCF file and found that in some SNVs the GT is 1/. or 1|. (a half genotype). Is this a valid VCF file, or is there a problem with it? The programs report the error on exactly those lines in which the GT is a half call, i.e. half genotype information. The line looks like this:
1 55164 . C A . . NS=1;AN=1;AC=1;CGA_XR=dbsnp.103|rs3091274;CGA_RPT=L2|L2|49.7;CGA_SDO=2 GT:PS:FT:HQ:EHQ:CGA_CEHQ:GL:CGA_CEGL:DP:AD:CGA_RDP 1/.:.:VQLOW:30,.:30,.:8,.:-30,0,0:-8,0,0:18:18,.:0
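A quick way to locate such half-calls before handing the file to downstream tools (a standalone sketch, not a full VCF-spec validator; `has_half_call` is a hypothetical helper assuming a tab-delimited body line):

```python
import re

def has_half_call(vcf_line):
    """True if any sample's GT contains exactly one missing allele,
    e.g. '1/.' or '1|.' as produced by some Complete Genomics conversions."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")
    gt_index = fmt.index("GT")
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]
        alleles = re.split(r"[/|]", gt)
        if len(alleles) > 1 and alleles.count(".") == 1:
            return True
    return False
```

Note that a fully missing genotype (./.) is not flagged; only the mixed case with one called allele and one missing allele is.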
While running VariantEval, I'm trying to stratify by a JexlExpression using
-ST Sample -ST JexlExpression -select "GQ>20"
This fails with a "variable does not exist" error despite GQ existing in all genotypes in the VCF. Looking at the code, it seems that the pathway that loads the JexlExpression in the VariantEval class specifically does not provide the genotype as context (only the VariantContext), so the JEXL context does not include the genotype fields and the error is produced.
My question is: is this a feature or a bug? It seems possible to add the genotype to the context (when the VC only has one), or to loop over the genotypes and either OR or AND the results (perhaps controlled by another input similar to -isr?), but perhaps I'm missing something subtle?
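As a sketch of the OR-over-genotypes behavior the question proposes (plain Python outside the GATK, assuming a tab-delimited VCF body line; `any_sample_passes` is a hypothetical helper, not a GATK or JEXL API):

```python
def any_sample_passes(vcf_line, min_gq=20):
    """OR across samples: True if at least one genotype has GQ > min_gq."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt = fields[8].split(":")
    if "GQ" not in fmt:
        return False  # corresponds to the 'variable does not exist' case
    gq_index = fmt.index("GQ")
    for sample in fields[9:]:
        parts = sample.split(":")
        if len(parts) > gq_index and parts[gq_index] not in (".", ""):
            if float(parts[gq_index]) > min_gq:
                return True
    return False
```

Swapping `any` for `all` over the samples would give the AND variant mentioned above.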
Would you like this behavior added, or are you happy with the current operation of JexlExpression?
I am observing the following scenario at one particular SNP (C/G) using two different enrichment technologies:
(I am using IGV syntax: ALLELE|number of reads w/ allele|%of total reads|+strand reads|- strand reads)
technology1:
C: 15 47% 15+,0-
G: 17 53% 17+,0-

technology2:
C: 17 37% 13+,4-
G: 29 63% 26+,3-

As you can see, both technologies have good coverage of the SNP and also good representation of each allele. The SNP (C/G) does not get called in technology1.
My questions are:

1. Does the GATK algorithm have some sort of constraint on the proportion of reads coming from only one strand (as with technology1), in order to try to predict or discard duplicates?
2. I know that the base call of a particular base is bounded by the mapping quality of its read. If my --stand_call_conf is 30 and one of the bases at this SNP position has MQ<30, does this prevent the position from getting called? Or is it more that avg(MQ) has to be >30 (meaning more than one read at this position is taken into account)?
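For background on question 1: strand bias of this kind is commonly scored with a Fisher's exact test on the allele-by-strand 2x2 table (GATK exposes such a value as the FS annotation, a phred-scaled p-value); whether it explains the missed call here is for the team to confirm. A self-contained sketch of that test, applied to the technology2 counts above (in practice a library routine such as scipy.stats.fisher_exact would be used):

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]]
    (here: rows = alleles, columns = + / - strand read counts)."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def table_p(x):
        # hypergeometric probability of a table whose top-left cell is x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = table_p(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # sum the probabilities of all tables at least as extreme as the observed one
    return sum(table_p(x) for x in range(lo, hi + 1) if table_p(x) <= p_obs + 1e-12)

# technology2 counts from the post: C 13+/4-, G 26+/3-
print(fisher_exact(13, 4, 26, 3))
```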
Thanks for any clarification, Gene
I'm sorry if I'm being dense (I'm new to all this, and it is making me feel very dense indeed!), but having read the section on 'Selecting an appropriate quality score threshold' on the 'Best Practice Variant Detection' page, I am still unclear which you mean: should I be looking for a QUAL score of at least 30 in a deep-coverage data set and filter out any suggested SNPs that don't meet this, or for a GQ score of 30 in each individual sample genotyped at the SNP in question, so that I only need to filter out the individual samples that don't meet the threshold?
Please can you clarify?
I have pasted the bit of text I read below, just to make it clear to which bit I am referring.
A common question is the confidence score threshold to use for variant detection. We recommend:
Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.
Shallow (< 10x coverage per sample) data: because variants by necessity have lower quality with shallower coverage, we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer, and Q10 otherwise.
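If the site-level reading of the recommendation is intended (thresholding the QUAL column rather than per-sample GQ), filtering a deep-coverage call set at Q30 would look like this minimal sketch (`pass_qual` is a hypothetical helper, assuming tab-delimited VCF body lines):

```python
def pass_qual(vcf_line, min_qual=30.0):
    """Keep a record only if its site-level QUAL (VCF column 6) meets the threshold."""
    qual = vcf_line.split("\t")[5]
    return qual != "." and float(qual) >= min_qual

records = [
    "1\t22359922\t.\tT\tC\t282\tPASS\tDP=42",   # QUAL 282: kept at Q30
    "1\t20568807\t.\tC\tT\t0\tPASS\tDP=0",      # QUAL 0: filtered out
]
kept = [r for r in records if pass_qual(r)]
print(len(kept))  # 1
```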
I was running UnifiedGenotyper on a set of 26 bam files. There was one particular position where I was comparing calls to the actual pileup and I noticed a major discrepancy. There was a no-call ("./.") for that position for one of the bam files while most other samples had calls. That non-called sample, though, had a very convincing variant in the pileup, with lots of high quality coverage at that position.
I then tried running just that bam file alone through UnifiedGenotyper, or that bam file along with two others. In both cases, the 1/1 variant is called properly with the following genotype field:
This seems to me to be a serious bug. Is this anything that's been noted before?
I am running GATKLite version 2.1-3-ge1dbcc8
Dear all, I'm trying to run UnifiedGenotyper only at specific sites (a small subset of dbSNP) but I only get empty output. My command is
java -jar GATK.jar \
    -R hg19.fa -I file1.bam -I file2.bam \
    -T UnifiedGenotyper \
    -nct 8 -stand_call_conf 20.0 \
    -glm BOTH -o output.vcf \
    -U ALLOW_SEQ_DICT_INCOMPATIBILITY \
    -L dbsnpsubset.vcf \
    -out_mode EMIT_ALL_SITES
How should I run GATK to call (alternate or reference) on all given sites?
I've tried to get genotypes for all sites provided in an interval file using HaplotypeCaller. With UnifiedGenotyper, I can get the result by setting "output_mode EMIT_ALL_SITES", but HaplotypeCaller doesn't report as expected with "output_mode EMIT_ALL_SITES". Even though I set "genotypeFullActiveRegion" or "fullHaplotype", HaplotypeCaller doesn't seem to emit genotypes at all sites. How can I get the desired result using HaplotypeCaller? Thanks!
Hello the team,
For some genotypes, the calls seem wrong. I know the caller is model based, and base quality, mapping quality, etc. are considered in the model. I also read this link: http://gatkforums.broadinstitute.org/discussion/1235/why-didnt-the-unified-genotyper-call-my-snp-i-can-see-it-right-there-in-igv#latest But my cases are special. The format below is (ref allele count)/(alternative allele count) followed by the genotype call:

22/24 0/0
109/125 0/0
85/109 0/0
26/32 0/0
40/161 0/0
195/6 1/1
239/5 1/1
83/6 1/1
46/28 1/1
In one case, the two variants are adjacent to each other. In some cases, they are one-base indels.