VariantEval
From GSA
Contents |
Introduction
General purpose variation metrics like percent in dbSNP, genotype concordance, Ts/Tv ratios. Can handle both population-level allele evaluations (like HWE rates) as well as genotype-specific analyses like Genotype Concordance calculations.
VariantEval is still under continuous and active development -- results are likely to change, improve, or bugs might be introduced unexpectedly. Also, for now, VariantEval ignores sites with more than just one alternate allele.
Basic arguments/command
- One or more evaluation RODs
- The file(s) containing variations for evaluation. The name bound to the ROD must begin with "eval". Can be in any number of formats, including VCF, GeliText, and GFF.
- Zero or more comparison RODs
- The file(s) containing variations for comparison against each of the eval RODs. The name bound to the ROD must begin with "comp". Can be in any number of formats, including VCF, dbSNP, and HapMap.
- -D dbSNP rod
- The (optional) dbSNP track for db_coverage statistic calculations. Other databases of known variation sites can be specified with the -known argument.
- --evalModule, -E
- By default, the "standard" (stable) modules are run with VariantEval. If you would like to add some of the more experimental (and less supported) modules add one or more of them with this argument (use the --list option to see a list of available modules).
- --doNotUseAllStandardModules, -noStandard
- Instructs the tool not to use the standard evaluation modules by default (although they can be added back individually with the -E argument).
- --list, -ls
- Instructs the tool to list the available evaluation modules and exit; standard modules will be highlighted.
- -o
- The output file; if not specified, the tool prints to standard out.
- -reportType
- The template output type. Possible options: Table (the default), Grep, CSV, R.
java -Xmx2048m -jar GenomeAnalysisTK.jar \
-T VariantEval -R human_b36_both.fasta \
-l INFO \
-B:eval,VCF example1.vcf \
-B:comp,VCF example2.vcf \
-D /humgen/gsa-scr1/GATK_Data/dbsnp_129_b36.rod \
-E CountVariants
-noStandard
-BTI eval
Analysis Name: Count Variants Analysis Description: Counts different classes of variants in the sample evaluation_name comparison_name jexl_expression filter_name novelty_name nProcessedLoci nCalledLoci nRefLoci nVariantLoci variantRate variantRatePerBp nSNPs nInsertions nDeletions nComplex nNoCalls nHets nHomRef nHomVar heterozygosity heterozygosityPerBp hetHomRatio indelRate indelRatePerBp deletionInsertionRatio ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- eval comp none called all 10 8 0 8 0.8000 1.0000 8 0 0 0 0 5 0 3 0.5000 2.0000 1.6667 0.0000 0.0000 0.0000 eval comp none called known 10 4 0 4 0.4000 2.0000 4 0 0 0 0 2 0 2 0.2000 5.0000 1.0000 0.0000 0.0000 0.0000 eval comp none called novel 10 4 0 4 0.4000 2.0000 4 0 0 0 0 3 0 1 0.3000 3.0000 3.0000 0.0000 0.0000 0.0000 eval comp none filtered all 10 2 0 2 0.2000 5.0000 2 0 0 0 0 2 0 0 0.2000 5.0000 2.0000 0.0000 0.0000 0.0000 eval comp none filtered known 10 1 0 1 0.1000 10.0000 1 0 0 0 0 1 0 0 0.1000 10.0000 1.0000 0.0000 0.0000 0.0000 eval comp none filtered novel 10 1 0 1 0.1000 10.0000 1 0 0 0 0 1 0 0 0.1000 10.0000 1.0000 0.0000 0.0000 0.0000 eval comp none raw all 10 10 0 10 1.0000 1.0000 10 0 0 0 0 7 0 3 0.7000 1.0000 2.3333 0.0000 0.0000 0.0000 eval comp none raw known 10 5 0 5 0.5000 2.0000 5 0 0 0 0 3 0 2 0.3000 3.0000 1.5000 0.0000 0.0000 0.0000 eval comp none raw novel 10 5 0 5 0.5000 2.0000 5 0 0 0 0 4 0 1 0.4000 2.0000 4.0000 0.0000 0.0000 0.0000 eval dbsnp none called all 10 8 0 8 0.8000 1.0000 8 0 0 0 0 5 0 3 0.5000 2.0000 1.6667 0.0000 0.0000 0.0000 eval dbsnp none called known 10 4 0 4 0.4000 2.0000 4 0 0 0 0 2 0 2 0.2000 5.0000 1.0000 0.0000 0.0000 0.0000 eval dbsnp none called novel 10 4 0 4 0.4000 2.0000 4 0 0 0 0 3 0 1 0.3000 3.0000 3.0000 0.0000 0.0000 0.0000 eval dbsnp none filtered all 10 2 0 2 0.2000 5.0000 2 0 0 0 0 2 0 0 0.2000 5.0000 2.0000 0.0000 0.0000 0.0000 eval dbsnp none filtered known 10 1 0 1 0.1000 10.0000 1 0 0 0 0 1 0 0 0.1000 10.0000 1.0000 0.0000 0.0000 0.0000 eval dbsnp none filtered novel 10 1 0 1 0.1000 10.0000 1 0 0 0 0 1 0 0 0.1000 10.0000 1.0000 0.0000 0.0000 0.0000 eval dbsnp none raw all 10 10 0 10 1.0000 1.0000 10 0 0 0 0 7 0 3 0.7000 1.0000 2.3333 0.0000 0.0000 0.0000 eval dbsnp none raw known 10 5 0 5 0.5000 2.0000 5 0 0 0 0 3 0 2 0.3000 3.0000 1.5000 0.0000 0.0000 0.0000 eval dbsnp none raw novel 10 5 0 5 0.5000 2.0000 5 0 0 0 0 4 0 1 0.4000 2.0000 4.0000 0.0000 0.0000 0.0000
Here we see a report on the SNP counts for each unique eval, comp, jexl_expression ("select"), filter (called/filtered/raw), and novelty (known/novel/all) set.
Advanced arguments/command
- -known name
- Zero or more names of the ROD track(s) that should be treated as known variation sites when splitting eval rods into known and novel subsets. By default this is set to "dbsnp" only.
- -select
- Zero or more JEXL expressions representing other stratifications to use when evaluating the data. See here for documentation on using JEXL expressions.
- -selectName
- Names to use for the list of stratifications (must be a 1-to-1 mapping).
- -sample
- If provided, the eval and comp contexts will be derived using genotypes only from these samples, when genotypes are available in the original context.
- -family
- If provided, genotypes will be examined for mendelian violations: this argument is a string formatted as dad+mom=child where these parameters determine which sample names are examined.
- --MendelianViolationQualThreshold, -MVQ
- The minimum genotype QUAL score for each trio member required to accept a site as a violation; used with the -MVQ command [default:50].
- --InterestingSitesVCF, -outputVCF
- If provided, interesting sites will be written to this VCF with the INFO field annotated as to why they are interesting.
- --minPhredConfidenceScore, -Q
- Minimum confidence (QUAL) score to consider an evaluation variant; by default, there is no minimum.
- --minPhredConfidenceScoreForComp, -Qcomp
- Minimum confidence (QUAL) score to consider a comp variant; by default, there is no minimum.
- -rsID
- An optional list of rsIDs and build numbers for capping known variants by their build date.
- -maxRsIDBuild
- If provided, only variants with rsIDs <= maxRsIDBuild will be included in the set of known variants.
java -Xmx2048m -jar GenomeAnalysisTK.jar \
-T VariantEval -R human_b36_both.fasta \
-l INFO \
-B:eval1,VCF NA12891.vcf \
-B:eval2,VCF NA12892.vcf \
-B:comp,VCF CEU.vcf \
-D /humgen/gsa-scr1/GATK_Data/dbsnp_129_b36.rod \
-all \
-sample NA12878 -sample NA12891 -sample NA12892 \
-Q 30.0 \
-known dbsnp -known comp \
-select "DP > 100" -select "QUAL < 100" \
-selectName highDepth -selectName lowQual
A useful analysis using VariantEval
We in GSA often find ourselves performing an analysis of 2 different call sets. For SNPs, we often show the overlap of the sets (their "venn") and the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have 2 filtered VCF callsets (e.g. made by running the [[Unified genotyper]Unified Genotyper] and then [[VariantFiltrationWalker]Variant Filtration]) named 'foo.vcf' and 'bar.vcf', there are 2 quick steps.
1) Combine the VCFs
java -jar GenomeAnalysisTK.jar \ -R ref.fasta \ -T CombineVariants \ -variantMergeOptions UNION \ -B:foo,VCF foo.vcf \ -B:bar,VCF bar.vcf \ -priority foo,bar \ -o merged.vcf
2) Run VariantEval
java -jar GenomeAnalysisTK.jar \ -R ref.fasta \ -T VariantEval \ -reportType Grep \ -B:eval,VCF merged.vcf \ -D dbsnp_129_hg18.rod \ -select 'set=="Intersection"' -selectName Intersection \ -select 'set=="foo-filteredInOther"' -selectName InFoo-FilteredInBar \ -select 'set=="foo"' -selectName foo-only \ -select 'set=="bar-filteredInOther"' -selectName InBar-FilteredInFoo \ -select 'set=="bar"' -selectName bar-only \ -Q 50 \ -E TiTvVariantEvaluator \ -E CompOverlap \ -E CountVariants \ -o myData.eval
One can optionally provide another 'comparison' track by using e.g. -B:comp_baz,VCF baz.vcf
Understanding Genotype Concordance values from Variant Eval
The VariantEval genotype concordance module emits information the relationship between the eval calls and genotypes and the comp calls and genotypes. The following three slides provide some insight into three key metrics to assess call sensitivity and concordance between genotypes.
These data are available
#Table Name : Sample Summary Statistics evaluation_name,comparison_name,jexl_expression,filter_name,novelty_name,row,percent_comp_het_called_het,percent_comp_het_called_var,percent_comp_hom_called_hom,percent_comp_hom_called_var,percent_non_ref_sensitivity,percent_overall_genotype_concordance,percent_non_ref_discrepancy_rate eval,comp,none,called,all,NA12878.GOOD,100.00,100.00,100.00,100.00,100.00,100.00,0.00 eval,comp,none,called,all,NA12878.BAD,25.00,50.00,25.00,50.00,50.00,33.33,75.00 eval,comp,none,called,all,allSamples,62.50,75.00,62.50,75.00,75.00,71.43,37.50 eval,comp,none,called,novel,NA12878.GOOD,100.00,100.00,100.00,100.00,100.00,100.00,0.00 eval,comp,none,called,novel,NA12878.BAD,25.00,50.00,25.00,50.00,50.00,33.33,75.00 eval,comp,none,called,novel,allSamples,62.50,75.00,62.50,75.00,75.00,71.43,37.50 eval,comp,none,raw,all,NA12878.GOOD,100.00,100.00,100.00,100.00,100.00,100.00,0.00 eval,comp,none,raw,all,NA12878.BAD,25.00,50.00,25.00,50.00,50.00,33.33,75.00 eval,comp,none,raw,all,allSamples,62.50,75.00,62.50,75.00,75.00,71.43,37.50 eval,comp,none,raw,novel,NA12878.GOOD,100.00,100.00,100.00,100.00,100.00,100.00,0.00 eval,comp,none,raw,novel,NA12878.BAD,25.00,50.00,25.00,50.00,50.00,33.33,75.00 eval,comp,none,raw,novel,allSamples,62.50,75.00,62.50,75.00,75.00,71.43,37.50
for two simple example samples with bad and good concordance to another set of test data. The key outputs:
- percent_overall_genotype_concordance
- percent_overall_genotype_concordance
- percent_non_ref_discrepancy_rate
All defined below. In fact NA12878.GOOD is the example in the slides.
