Genotype and Validate is a tool to asses the quality of a technology dataset for calling SNPs and Indels given a secondary (validation) datasource.
The simplest scenario is when you have a VCF of hand annotated SNPs and Indels, and you want to know how well a particular technology performs calling these snps. With a dataset (BAM file) generated by the technology in test, and the hand annotated VCF, you can run GenotypeAndValidate to asses the accuracy of the calls with the new technology's dataset.
Another option is to validate the calls on a VCF file, using a deep coverage BAM file that you trust the calls on. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare to the calls in the VCF file and produce a truth table.
Usage of GenotypeAndValidate and its command line arguments are described here.
The annotations can be either true positive (T) or false positive (F). 'T' means it is known to be a true SNP/Indel, while a 'F' means it is known not to be a SNP/Indel but the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F.
GenotypeAndValidate has two outputs. The truth table and the optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this:
|called alt||True Positive (TP)||False Positive (FP)||Positive PV|
|called ref||False Negative (FN)||True Negative (TN)||Negative PV|
The positive predictive value (PPV) is the proportion of subjects with positive test results who are correctly diagnose.
The negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.
The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).
You should always use -BTI alleles, so that the GATK only looks at the sites on the VCF file, speeds up the process a lot. (this will soon be added as a default gatk engine mode)
The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per new insertion or deletion. (i.e. ACTG/- will count as 4 genotyper calls, but it's only one line in the VCF).
Genotypes BAM file from new technology using the VCF as a truth dataset:
java \ -jar /GenomeAnalysisTK.jar \ -T GenotypeAndValidate \ -R human_g1k_v37.fasta \ -I myNewTechReads.bam \ -alleles handAnnotatedVCF.vcf \ -BTI alleles \ -o gav.vcf
An annotated VCF example (info field clipped for clarity)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1 1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./. 13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./. 1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./.
Using a BAM file as the truth dataset:
java \ -jar /GenomeAnalysisTK.jar \ -T GenotypeAndValidate \ -R human_g1k_v37.fasta \ -I myTruthDataset.bam \ -alleles callsToValidate.vcf \ -BTI alleles \ -bt \ -o gav.vcf
Example truth table of PacBio reads (BAM) to validate HiSeq annotated dataset (VCF) using the GenotypeAndValidate walker:
generated with gatk 2.8-1-g932cd3a
Although it is rare I see Genotype Fields that are inconsistent with the AD values (Read as table):
CHROM POS ID REF ALT FILTER QUAL ABHet ABHom AC AF AN BaseCounts BaseQRankSum DP Dels FS GC HRun HaplotypeScore LowMQ MLEAC MLEAF MQ MQ0 MQRankSum MeanDP MinDP OND PercentNBaseSolid QD ReadPosRankSum Samples Somatic VariantType cosmic.ID 1.AB 1.AD 1.DP 1.F 1.GQ 1.GT 1.MQ0 1.PL 1.Z 2.AB 2.AD 2.DP 2.F 2.GQ 2.GT 2.MQ0 2.PL 2.Z 3.AB 3.AD 3.DP 3.F 3.GQ 3.GT 3.MQ0 3.PL 3.Z 4.AB 4.AD 4.DP 4.F 4.GQ 4.GT 4.MQ0 4.PL 4.Z 5.AB 5.AD 5.DP 5.F 5.GQ 5.GT 5.MQ0 5.PL 5.Z 11 92616485 0 A C PASS 63.71 0.333 0.698 1 0.1 10 89,54,0,0 -5.631 143 0 49.552 71.29 2 4.4154 0.0000,0.0000,143 1 0.1 50.27 0 -1.645 28.6 16 0.242 0 2.36 2.125 R5_A3_1 NA SNP COSM467570 NA 24,9 33 0.2727272727 54 A/A 0 0,54,537 -1.3055824197 0.33 9,18 27 0.6666666667 96 A/C 0 96,0,178 0.8660254038 NA 21,11 32 0.34375 21 A/A 0 0,21,466 -0.8838834765 NA 12,4 16 0.25 27 A/A 0 0,27,272 -1 NA 23,12 35 0.3428571429 42 A/A 0 0,42,537 -0.9296696802
This shows that for example sample 5 has a AD value of '23,12' and a GT of 'A/A' aka homyzougous reference allele. I've included a screenshot wich shows low base quality and complete strand bias (Which I suspect to mis variants). So whats the prob? and how can i recalculate the GT's based on AD? because i cannot filter based on genotypes when they are buggy....
Dear Team, I was looking at a VCF file produced with UnifiedGenotyper (2.4.9). It is a multisample call and, for a limited number of calls, I have genotypes that are telling the exact opposite of AD field, as in this case
I have ten reads supporting the reference allele, 1 read supporting the alternate and the genotype is 1/1. This is happening in ~200 sites per sample in my dataset. I've checked the other way around and I found <100 sites in which the genotype is called 0/0 and the AD suggests 1/1 or (more frequently) 0/1. This seems to happen in sites in which the number of variant samples is low (no more than 3 samples in a set of ~50 samples) and it is puzzling me a lot. Can you give me a comment on why this is happening? Thanks
I'm analyzing seven trio exomes right now with the latest GATK (version 2.7-4-g6f46d11), and was surprised to find a large number of mendelian violations reported by PhaseByTransmission, even after eliminating low/no coverage events. Tracking down the problem, it seems that CombineVariants occasionally propagates the PL field to the new vcf file incorrectly, sometimes in a way which causes GT not to correspond to the lowest PL.
Here's an example, showing just the GT, AD, and PL columns for a few positions in one trio. For each position, the first line contains the genotypes from the original vcf file, and the second shows the genotypes from the merged file.
#CHROM POS ID REF ALT 100403001-1 100403001-1A 100403001-1B 1 5933530 rs905469 A G 0/0:37,0:0,99,1192 0/0:35,0:0,90,1101 0/0:44,0:0,117,1412 1 5933530 rs905469 A G 0/0:37,0:189,15,1192 0/0:35,0:0,90,1101 0/0:44,0:0,117,1412 1 10412636 rs4846215 A T 0/0:119,0:0,358,4297 0/0:113,0:0,337,4060 0/0:102,0:0,304,3622 1 10412636 rs4846215 A T 0/0:119,0:110,9,0 0/0:113,0:0,337,4060 0/0:102,0:0,304,3622 1 11729035 rs79974326 G C 0/0:50,0:0,141,1709 0/0:53,0:0,150,1788 0/0:71,0:0,187,2246 1 11729035 rs79974326 G C 0/0:50,0:1930,0,3851 0/0:53,0:0,150,1788 0/0:71,0:0,187,2246 1 16735764 rs182873855 G A 0/0:54,0:0,138,1691 0/0:57,0:0,153,1841 0/0:47,0:0,120,1441 1 16735764 rs182873855 G A 0/0:54,0:174,0,1691 0/0:57,0:0,153,1841 0/0:47,0:0,120,1441 1 17316577 rs77880760 G T 0/0:42,0:0,123,1470 0/0:38,0:0,111,1317 0/0:53,0:0,153,1817 1 17316577 rs77880760 G T 0/0:42,0:233,17,1470 0/0:38,0:0,111,1317 0/0:225,25:0,153,181 1 28116000 rs2294229 A G 0/0:37,0:0,105,1291 0/0:37,0:0,111,1379 0/0:30,0:0,87,1066 1 28116000 rs2294229 A G 0/0:37,0:0,105,1291 0/0:37,0:0,111,1379 0/0:30,0:1844,159,0 1 31740706 rs3753373 A G 0/0:123,0:0,349,4173 0/0:110,0:0,319,3793 0/0:111,0:0,328,3885 1 31740706 rs3753373 A G 0/0:123,0:117,6,0 0/0:110,0:0,319,3793 0/0:111,0:0,328,3885
Most genotypes are propagated correctly, and in fact, which a propagated incorrectly changes from run to run.
In my case, I'm merging files from disjoint regions, so I can work around the problem, but it would be nice if this were fixed.
Could you tell me how to encourage GATK to annotate my genotype columns (i.e. add annotations to the FORMAT and PANC_R columns in the following file):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PANC_R chrX 259221 . GA G 136.74 . AC=2;AF=1.00;AN=2;DP=15;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=8.82;MQ0=1;QD=3.04 GT:AD:GQ:PL 1/1:0,2:6:164,6,0
The file was generated with HaplotypeCaller. I used a command line similar to this one to no effect:
java -jar $GATKROOT/GenomeAnalysisTK.jarT VariantAnnotator -R hg19_random.fa -I chr7_recalibrated.bam -V chr7.vcf --dbsnpdbSNP135_chr.vcf -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -o chr7_annotated-again.vcf
Does anyone have any suggestions? Thanks in advance!
I am doing a WGS project on a family with seven siblings. We have data on the mother but the father passed many years ago. I tried splitting variant recalibrated vcf file and ped file into "trios" with just the mother and a sibling (seven times) then running PhaseByTransmission on the combined vcf. The job was successfully completed but nothing appears phased (all "/," and no "|") in the output vcf. I also tried the variant recalibrated vcf file separately with ReadBackedPhasing. The job was successfully completed as well but again nothing appears phased (all "/" and no "|" or assigned "PQ" scores). The ProduceBeagleInput walker (to use Beagle for genotype refinement) appears to only support unrelated individuals and my set involves related individuals. Do you have any other suggestions for phasing incomplete "trios?" Thanks in advance!
Hi. I have converted the tsv file from Complete genomics(CGA tool) to vcf format. but, when I run programs, it says some error in my vcf format. I carefully examined the vcf file and found, in some SNVs the GT is 1|.(dot). Is it a valid vcf file or is there any problem in vcf file? The program showed the error in those lines in which GT is 1|.(dot) i.e half genotype information. The line looks like this:
1 55164 . C A . . NS=1;AN=1;AC=1;CGA_XR=dbsnp.103|rs3091274;CGA_RPT=L2|L2|49.7;CGA_SDO=2 GT:PS:FT:HQ:EHQ:CGA_CEHQ:GL:CGA_CEGL:DP:AD:CGA_RDP 1/.:.:VQLOW:30,.:30,.:8,.:-30,0,0:-8,0,0:18:18,.:0
While running VariantEval, I'm trying to stratify by a JexlExpression by setting using
-ST Sample -ST JexlExpression -select "GQ>20"
This fails with a "variable does not exist" error despite the GQ existing in all genotypes in the vcf. Looking at the code it seems that the pathway that loads the JexlExpression in the VariantEval class specifically does not provide the genotype as context (only the VariantContext) and thus, the context for the Jexl does not include GT and the error is produced.
My question is: Is this a feature or a bug? It seems possible to add the genotype (when the VC only has one, or loop over the genotypes and either OR or AND the results (perhaps another input similar to -isr?), but perhaps I'm missing something subtle?
Would you like this behavior or are you happy with the current operation of jexlExpression?
I'm sorry if I'm being dense (I'm new to all this and it is making me feel very dense indeed!), but having read the section on 'Selecting an appropriate quality score threshold' on the 'Best Practice Variant Detection' page, I am still unclear as to whether you mean I should be looking for a QUAL score of at least 30 in a deep coverage data set and should filter out any suggested SNPs that don't meet this, or a GQ score of 30 in each individual sample genotyped at the SNP in question and I only need to filter out individual samples that don't meet this threshold.
Please can you clarify?
I have pasted the bit of text I read below, just to make it clear to which bit I am referring.
A common question is the confidence score threshold to use for variant detection. We recommend:
Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.
Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.
Hello the team,
For some genotypes, it seems are wrong, I know it's model based, and base q, map q, etc are considered in the model. I also read this link: http://gatkforums.broadinstitute.org/discussion/1235/why-didnt-the-unified-genotyper-call-my-snp-i-can-see-it-right-there-in-igv#latest But my case are special, the format is (ref allele count)/(alternative allele count) genotype call: 22/24 0/0 109/125 0/0 85/109 0/0 26/32 0/0 40/161 0/0 195/6 1/1 239/5 1/1 83/6 1/1 46/28 1/1
In one case, the two variants are adjacent to each other. In some case, they are one base indels.
I've tried to get genotype for all sites provided in interval file using haplotypeCaller. If using unifiedGenotyper, I can get the result by setting "output_mode EMIT_ALL_SITES". But haplotypeCaller doesn't report as expected by "output_mode EMIT_ALL_SITES". Even though I set "genotypeFullActiveRegion" or "fullHaplotype", haplotypeCaller doesn't seem to emit genotype at all sites. How to get desirable result using haplotypeCaller? Thanks!
I have an inbred mouse strain that I am sequencing and there should be little to NO heterozygosity. Yet with the default settings of UGT -heterozygosity (which is 0.001) many homs are being called as hets. When 230/250 reads are alternate and 20/250 are reference, it calls a het, even though it should be homozygous alternate.
What do you recommendations for this setting for inbred animals?
thanks, GATK is great!
I was running UnifiedGenotyper on a set of 26 bam files. There was one particular position where I was comparing calls to the actual pileup and I noticed a major discrepancy. There was a no-call ("./.") for that position for one of the bam files while most other samples had calls. That non-called sample, though, had a very convincing variant in the pileup, with lots of high quality coverage at that position.
I then tried running just that bam file alone through UnifiedGenotyper, or that bam file along with two others. In both cases, the 1/1 variant is called properly with the following genotype field:
This seems to me to be a serious bug. Is this anything that's been noted before?
I am running GATKLite version 2.1-3-ge1dbcc8
I am observing the following scenario at one particular SNP (C/G) using two different enrichment technologies:
(I am using IGV syntax: ALLELE|number of reads w/ allele|%of total reads|+strand reads|- strand reads)
C: 15 47% 15+,0- G: 17 53% 17+,0- - technology2:
C: 17 37% 13+,4- G: 29 63% 26+,3- As you can see both technologies have good coverage of the SNP and also good representation of each allele. SNP(C/G) does not get called in technology1.
My questions are: 1- Does the GATK algorithm have some sort of constraint on the proportion of reads coming from only one strand (as with technology1) in order to try to predict or discard duplicates? 2- I know that the base call of a particular base is bounded by the mapping quality of its read. If my --stand_call_conf is 30 and one of the bases at this SNP position has MQ<30 does this avoid this position getting called? Or is it more like the avg(MQ) has to be >30 (meaning more than one read at this position is taken into account)?
Thanks for any clarification, Gene