For a complete, detailed argument reference, refer to the technical documentation page.
You can find detailed information about the various modules here.
Note that the GenotypeConcordance module has been rewritten as a separate walker tool (see its Technical Documentation page).
We in GSA often find ourselves performing an analysis of 2 different call sets. For SNPs, we often show the overlap of the sets (their "venn") and the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have 2 filtered VCF callsets named 'foo.vcf' and 'bar.vcf', there are 2 quick steps.
java -jar GenomeAnalysisTK.jar \ -R ref.fasta \ -T CombineVariants \ -V:FOO foo.vcf \ -V:BAR bar.vcf \ -priority FOO,BAR \ -o merged.vcf
java -jar GenomeAnalysisTK.jar \ -T VariantEval \ -R ref.fasta \ -D dbsnp.vcf \ -select 'set=="Intersection"' -selectName Intersection \ -select 'set=="FOO"' -selectName FOO \ -select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \ -select 'set=="BAR"' -selectName BAR \ -select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \ -select 'set=="FilteredInAll"' -selectName FilteredInAll \ -o merged.eval.gatkreport \ -eval merged.vcf \ -l INFO
It is wise to check the actual values for the set names present in your file before writing complex VariantEval commands. An easy way to do this is to extract the value of the set fields and then reduce that to the unique entries, like so:
java -jar GenomeAnalysisTK.jar -T VariantsToTable -R ref.fasta -V merged.vcf -F set -o fields.txt grep -v 'set' fields.txt | sort | uniq -c
This will provide you with a list of all of the possible values for 'set' in your VCF so that you can be sure to supply the correct select statements to VariantEval.
The VariantEval output is formatted as a GATKReport.
The VariantEval genotype concordance module emits information the relationship between the eval calls and genotypes and the comp calls and genotypes. The following three slides provide some insight into three key metrics to assess call sensitivity and concordance between genotypes.
##:GATKReport.v0.1 GenotypeConcordance.sampleSummaryStats : the concordance statistics summary for each sample GenotypeConcordance.sampleSummaryStats CompRod CpG EvalRod JexlExpression Novelty percent_comp_ref_called_var percent_comp_het_called_het percent_comp_het_called_var percent_comp_hom_called_hom percent_comp_hom_called_var percent_non-reference_sensitivity percent_overall_genotype_concordance percent_non-reference_discrepancy_rate GenotypeConcordance.sampleSummaryStats compOMNI all eval none all 0.78 97.65 98.39 99.13 99.44 98.80 99.09 3.60
The key outputs:
All defined below.
VariantEval accepts two types of modules: stratification and evaluation modules.
CpG is a three-state stratification:
A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.
EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.
Sample is an N-state stratification, where N is the number of samples in the eval files.
Filter is a three-state stratification:
FunctionalClass is a four-state stratification:
CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.
Degeneracy is a six-state stratification:
See the [http://en.wikipedia.org/wiki/Genetic_code#Degeneracy Wikipedia page on degeneracy] for more information.
JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See [[Using JEXL expressions]]
Novelty is a three-state stratification:
CountVariants is an evaluation module that computes the following metrics:
|nProcessedLoci||Number of processed loci|
|nCalledLoci||Number of called loci|
|nRefLoci||Number of reference loci|
|nVariantLoci||Number of variant loci|
|variantRate||Variants per loci rate|
|variantRatePerBp||Number of variants per base|
|nSNPs||Number of snp loci|
|nInsertions||Number of insertion|
|nDeletions||Number of deletions|
|nComplex||Number of complex loci|
|nNoCalls||Number of no calls loci|
|nHets||Number of het loci|
|nHomRef||Number of hom ref loci|
|nHomVar||Number of hom var loci|
|nSingletons||Number of singletons|
|heterozygosity||heterozygosity per locus rate|
|heterozygosityPerBp||heterozygosity per base pair|
|hetHomRatio||heterozygosity to homozygosity ratio|
|indelRate||indel rate (insertion count + deletion count)|
|indelRatePerBp||indel rate per base pair|
|deletionInsertionRatio||deletion to insertion ratio|
CompOverlap is an evaluation module that computes the following metrics:
|nEvalSNPs||number of eval SNP sites|
|nCompSNPs||number of comp SNP sites|
|novelSites||number of eval sites outside of comp sites|
|nVariantsAtComp||number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same)|
|compRate||percentage of eval sites at comp sites|
|nConcordant||number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele)|
|concordantRate||the concordance rate|
A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):
$ grep -v '##' dbsnp.vcf #CHROM POS ID REF ALT QUAL FILTER INFO 1 10327 rs112750067 T C . . ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132
Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:
$ grep -v '##' eval_correct_allele.vcf #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6 1 10327 . T C 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059
Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:
$ grep -v '##' eval_incorrect_allele.vcf #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6 1 10327 . T A 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059
Running VariantEval with just the CompOverlap module:
$ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \ -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \ -L 1:10327 \ -B:dbsnp,VCF dbsnp.vcf \ -B:eval_correct_allele,VCF eval_correct_allele.vcf \ -B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \ -noEV \ -EV CompOverlap \ -o eval.table
We find that the eval.table file contains the following:
$ grep -v '##' eval.table | column -t CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants nCompVariants novelSites nVariantsAtComp compRate nConcordant concordantRate CompOverlap dbsnp eval_correct_allele none all 1 1 0 1 100.00000000 1 100.00000000 CompOverlap dbsnp eval_correct_allele none known 1 1 0 1 100.00000000 1 100.00000000 CompOverlap dbsnp eval_correct_allele none novel 0 0 0 0 0.00000000 0 0.00000000 CompOverlap dbsnp eval_incorrect_allele none all 1 1 0 1 100.00000000 0 0.00000000 CompOverlap dbsnp eval_incorrect_allele none known 1 1 0 1 100.00000000 0 0.00000000 CompOverlap dbsnp eval_incorrect_allele none novel 0 0 0 0 0.00000000 0 0.00000000
As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.
TiTvVariantEvaluator is an evaluation module that computes the following metrics:
|nTi||number of transition loci|
|nTv||number of transversion loci|
|tiTvRatio||the transition to transversion ratio|
|nTiInComp||number of comp transition sites|
|nTvInComp||number of comp transversion sites|
|TiTvRatioStandard||the transition to transversion ratio for comp sites|
Hi, I'm running 1.6-512-gafa4399 (unstable).
When running VariantEval, I got an error: "Couldn't find state for 88 at node org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratNode"
##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace org.broadinstitute.sting.utils.exceptions.ReviewedStingException: Couldn't find state for 88 at node org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratNode@1a7df60a at org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratNode.find(StratNode.java:117) at org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratificationManager.getKeys(StratificationManager.java:208) at org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.manager.StratificationManager.values(StratificationManager.java:260) at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEvalWalker.getEvaluationContexts(VariantEvalWalker.java:480) at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEvalWalker.map(VariantEvalWalker.java:406) at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEvalWalker.map(VariantEvalWalker.java:89) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:63) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:257) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)
This happened on running VariantEval like this:
-T VariantEval -L /xchip/cga/reference/hg19/whole_exome_agilent_1.1_refseq_plus_3_boosters_plus_10bp_padding_minus_mito.Homo_sapiens_assembly19.targets.interval_list -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta -nt 1 -o myFile.byAC.eval -eval myFile.vcf -D dbsnp_132_b37.leftAligned.vcf -gold /humgen/gsa-hpprojects/GATK/bundle/current/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf -ST EvalRod -ST CompRod -ST Novelty -ST FunctionalClass -ST AlleleCount -noST -EV TiTvVariantEvaluator -EV CountVariants -EV CompOverlap -EV IndelSummary -noEV
Do you know what may be causing this error? If it helps, I can distill a small failing vcf. ./adam
I'm on unstable v2.1-61-gde29ed5, Compiled 2012/08/23 16:03:06
I run VariantEval on the file below and get this:
java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String at org.broadinstitute.sting.utils.variantcontext.CommonInfo.getAttributeAsInt(CommonInfo.java:212) at org.broadinstitute.sting.utils.variantcontext.VariantContext.getAttributeAsInt(VariantContext.java:600) at org.broadinstitute.sting.gatk.walkers.varianteval.stratifications.AlleleCount.getRelevantStates(AlleleCount.java:50) at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.getEvaluationContexts(VariantEval.java:481) at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:409) at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:91) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65) at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:72) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:268) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)`
The file is this (after header)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A3N A4N A5N A6N A8N A9N H10N H15N H17N H26N H27N H32N H52N H53N H54N H55N H56N H57N H58N H61N H62N H6N H7N 1 901922 rs62639980 G A,C 1332.23 PASS AC=1,2;AF=6.849e-03,0.014;AN=146;BaseQRankSum=-3.705;DB;DP=2675;DS;Dels=0.00;FS=3.860;HaplotypeScore=1.6813;InbreedingCoeff=-0.0001;MLEAC=2,2;MLEAF=0.014,0.014;MQ=58.62;MQ0=1;MQRankSum=2.225;QD=7.84;ReadPosRankSum=3.761;SB=-5.136e+02;VQSLOD=3.9184;culprit=HaplotypeScore GT:AD:DP:GQ:PL 0/0:1,0,0:59:99:0,178,2360,178,2360,2360 0/0:60,0,0:60:99:0,156,2055,156,2055,2055 0/0:60,0,0:60:99:0,160,2191,160,2191,2191 0/0:1,0,0:64:99:0,193,2516,193,2516,2516 0/0:1,0,0:47:99:0,141,1776,141,1776,1776 0/0:1,0,0:62:99:0,187,2480,187,2480,2480 0/0:59,1,0:60:99:0,119,1965,147,1968,1996 0/0:56,0,0:56:99:0,135,1857,135,1857,1857 0/0:1,0,0:57:99:0,172,2280,172,2280,2280 0/0:1,0,0:75:99:0,226,3000,226,3000,3000 0/0:8,0,0:71:99:0,214,2852,214,2852,2852 0/0:60,0,0:60:99:0,147,1942,147,1942,1942 0/0:47,0,0:47:99:0,123,1654,123,1654,1654 0/0:1,0,0:56:99:0,169,2240,169,2240,2240 0/0:1,0,0:55:99:0,166,2122,166,2122,2122 0/2:37,0,23:60:99:647,734,1844,0,1110,1050 0/0:1,0,0:71:99:0,214,2840,214,2840,2840 0/2:44,0,16:60:99:377,483,1801,0,1319,1277 0/0:57,0,0:57:99:0,141,1914,141,1914,1914 0/0:1,0,0:53:99:0,160,2154,160,2154,2154 0/0:1,0,0:74:99:0,223,3008,223,3008,3008 0/0:1,0,0:49:99:0,147,1960,147,1960,1960 0/0:1,0,0:76:99:0,229,2988,229,2988,2988`
The file was created by first calling my ~20 samples with 50 random samples from 1kg and then subsetting to my samples using SelectVariants.
I'm currently running variantEval to count up variants per individual stratified by a variety of annotations.
My GATK call looks like:
java -jar /humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar \ -T VariantEval \ -R Homo_sapiens_assembly19.fasta \ -o output.txt \ -L input.vcf \ -eval input.vcf \ -ST Sample -noST \ -noEV -EV CountVariants \ -ST JexlExpression --select_names "nonsynon" --select_exps "resource.VAT_CDS == 'nonsynonymous' && resource.FOUNDERS_FRQ > 0.05" \ -ST JexlExpression --select_names "synon" --select_exps "resource.VAT_CDS == 'synonymous' && resource.FOUNDERS_FRQ > 0.05" ...
where the VAT_CDS section of the INFO field in the VCF has a functional annotation or is set to "na" if an annotation is unavailable. I'm getting the following error:
but weirdly the error is not consistent (my data is stratified by chromosome and most chromosomes will run without throwing the error while one or two chromosomes do exhibit the error. Do you have any ideas what's causing this behavior?
I am using VariantEval --evalModule GenotypeConcordance in order to establish concordance and sensitivity metrics against a HapMap reference. In the resulting GATK report I obtain the following fields for a given SNP category (example with HETs):
GenotypeConcordance CompRod EvalRod JexlExpression Novelty variable value--- GenotypeConcordance comp eval none all n_true_HET_called_HET 6220 GenotypeConcordance comp eval none all n_true_HET_called_HOM_REF 0 GenotypeConcordance comp eval none all n_true_HET_called_HOM_VAR 20 GenotypeConcordance comp eval none all n_true_HET_called_MIXED 0 GenotypeConcordance comp eval none all n_true_HET_called_NO_CALL 318 GenotypeConcordance comp eval none all n_true_HET_called_UNAVAILABLE 0
What is the meaning of the _MIXED and _UNAVAILABLE fields?
While running VariantEval, I'm trying to stratify by a JexlExpression by setting using
-ST Sample -ST JexlExpression -select "GQ>20"
This fails with a "variable does not exist" error despite the GQ existing in all genotypes in the vcf. Looking at the code it seems that the pathway that loads the JexlExpression in the VariantEval class specifically does not provide the genotype as context (only the VariantContext) and thus, the context for the Jexl does not include GT and the error is produced.
My question is: Is this a feature or a bug? It seems possible to add the genotype (when the VC only has one, or loop over the genotypes and either OR or AND the results (perhaps another input similar to -isr?), but perhaps I'm missing something subtle?
Would you like this behavior or are you happy with the current operation of jexlExpression?
I've a couple of essentially documentation-centric questions...
Firstly, the SelectVariants documentation describes selecting 1000 random variants from a vcf using '-number 1000', however when I try that (with the command "
java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta --variant variants.vcf -o 1000.vcf -number 1000") it produces the error
'Argument with the name number isn't defined'. Trying --number instead doesn't make any difference, while the --help output does not list this argument (GATK 2.2.2). It this option no longer available?
Secondly, the 'gatkforums.broadinstitute.org/discussion/48/using-varianteval#Understanding_Genotype_Concordance_values_from_Variant_Eval' section of the 'Using VariantEval' page has a series of images explaining the concordance values, however the images are missing. Please could these be restored?
Many thanks, James
I have processed 10 whole-exome samples using the GATK best practices workflow (GATK v2.4-3-g2a7af43). I am currently evaluating my variant call set (generated from HaplotypeCaller) with OMNI 2.5 SNP array (comparison set) and dbSNP 137.
I have included 2 rows from the Ti/Tv Variant Evaluator table:
CompRod EvalRod Novelty Sample nTi nTv tiTvRatio nTiInComp nTvInComp TiTvRatioStandard
OMNI MyCalls all all 79945 30322 2.64 993588 274219 3.62
dbsnp MyCalls all all 79945 30322 2.64 30214009 15253850 1.98
According to literature survey, the Ti/Tv ratio should be approximately 2.1 for whole genome sequencing and 2.8 for whole exome sequencing. Since I am getting Ti/Tv of 2.64 for exome, does this indicate false positives in the data? Also, what could be the rationale for getting such high TiTvRatioStandard for the OMNI whole genome data?
Hello, I've just started using VariantEval. It seems very useful and provides a good deal of output. However, it's not obvious to me what all of the fields mean. Do you know when there will be documentation of the output?
For example, some things I'd like to know:
What is the --dbsnp parameter used for? Is this redundant with --comp?
Many thanks for your help,
I found the materials of the BroadE Workshop very helpful, especially the slide on analyzing variant calls using VariantEval, because there is not much documentation for it on GATK site. As an example 62 whole genome sequencing samples from north Europe were evaluated together with 1000G FIN samples, and also the polymorphic and monomorphic sites on the 1000G genotype chip were used as comparator. I would like very much to do the same for our whole exome data, the question is: is there good quality whole exome data that I can use to evaluate our own exome data?
I have checked the NHLBI ESP project Exome Variant Server site, the vcf files can be downloaded doesn't have the genotype data.
Thanks in advance!
I want to know what's the best way to use VariantEval to get statistics for each sample in a multisample VCF file. If I call it like this:
java -jar GenomeAnalysisTK.jar \
-R ucsc.hg19.fasta \
-T VariantEval \
-o multisample.eval.gatkreport \
--eval annotated.combined.vcf.gz \
where annotated.combined.vcf.gz is a VCF file that contains ~1Mio variants for ~800 samples I get statistics for all samples combined, e.g.
#:GATKTable:CompOverlap:The overlap between eval and comp sites
CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants ...
CompOverlap dbsnp eval none all 471704 191147
CompOverlap dbsnp eval none known 280557 0 CompOverlap dbsnp eval none novel 191147 191147
But I would like to get one such entry per sample. Is there an easy way to do this?