VariantEval
For a complete, detailed argument reference, see the VariantEval page in the GATK documentation.
Modules
Stratification modules
- AlleleFrequency
- AlleleCount
- CompRod
- Contig
- CpG
- Degeneracy
- EvalRod
- Filter
- FunctionalClass
- JexlExpression -- Allows arbitrary selection of subsets of the VCF using JEXL expressions. Pay particular attention to the section on "Working with complex expressions".
- Novelty
- Sample
Evaluation modules
- ACTransitionTable
- AlleleFrequencyComparison
- AminoAcidTransition
- CompOverlap
- CountVariants
- GenotypeConcordance
- GenotypePhasingEvaluator
- IndelMetricsByAC
- IndelStatistics
- MendelianViolationEvaluator
- PrintMissingComp
- PrivatePermutations
- SimpleMetricsByAC
- ThetaVariantEvaluator
- TiTvVariantEvaluator
- VariantQualityScore
A useful analysis using VariantEval
We in GSA often need to compare two call sets. For SNPs, we typically show the overlap of the sets (their "Venn" diagram) along with the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have two filtered VCF callsets (e.g. made by running the Unified Genotyper and then Variant Filtration) named 'foo.vcf' and 'bar.vcf', there are two quick steps.
1. Combine the VCFs
java -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T CombineVariants \
-V:FOO foo.vcf \
-V:BAR bar.vcf \
-priority FOO,BAR \
-o merged.vcf
2. Run VariantEval
java -jar GenomeAnalysisTK.jar \
-T VariantEval \
-R ref.fasta \
-D dbsnp.vcf \
-select 'set=="Intersection"' -selectName Intersection \
-select 'set=="FOO"' -selectName FOO \
-select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \
-select 'set=="BAR"' -selectName BAR \
-select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \
-select 'set=="FilteredInAll"' -selectName FilteredInAll \
-o merged.eval.gatkreport \
-eval merged.vcf \
-l INFO
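The -select/-selectName pairs in the command above follow a regular pattern, so they can be generated from a list of 'set' values instead of being typed by hand. The following is a minimal sketch, not part of the GATK itself: the helper name make_selects and the sample input file are hypothetical, and it reuses each set value as its own selectName (the command above instead picks friendlier names for the filtered subsets).

```shell
# Hypothetical input: one 'set' value per line, as found in merged.vcf.
sets=$(mktemp)
printf 'Intersection\nFOO\nBAR\n' > "$sets"

# Print one -select/-selectName pair per set value. Each line ends with a
# backslash so the output can be pasted into a multi-line command.
# Assumption: the set value is also used as the selectName.
make_selects() {
  while IFS= read -r name; do
    printf "%s\n" "-select 'set==\"$name\"' -selectName $name \\"
  done < "$1"
}

make_selects "$sets"
```

Paste the printed lines between the -D and -o arguments of the VariantEval command, and rename any selectName you want to appear differently in the report.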
Checking the possible values of 'set'
It is wise to check the actual values for the set names present in your file before writing complex VariantEval commands. An easy way to do this is to extract the value of the set fields and then reduce that to the unique entries, like so:
java -jar GenomeAnalysisTK.jar -T VariantsToTable -R ref.fasta -V merged.vcf -F set -o fields.txt
grep -v 'set' fields.txt | sort | uniq -c
This will provide you with a list of all of the possible values for 'set' in your VCF so that you can be sure to supply the correct select statements to VariantEval.
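For a quick look without running the GATK, the same tally can be approximated with standard Unix tools. This is only a sketch: it assumes 'set' appears as a key=value entry in the INFO column (column 8) of an uncompressed VCF, and both the helper name count_sets and the miniature VCF below are made up for illustration.

```shell
# Hypothetical miniature merged.vcf, for illustration only;
# real input would come from CombineVariants.
vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$vcf"
printf '1\t100\t.\tA\tG\t50\tPASS\tAC=2;set=Intersection\n' >> "$vcf"
printf '1\t200\t.\tC\tT\t50\tPASS\tset=FOO;AC=1\n' >> "$vcf"
printf '1\t300\t.\tG\tA\t50\tPASS\tAC=1;set=Intersection\n' >> "$vcf"

# Count each distinct value of the 'set' INFO key:
# drop header lines, keep the INFO column, split it into key=value
# entries, keep only the 'set' entries, then tally the values.
count_sets() {
  grep -v '^#' "$1" | cut -f8 | tr ';' '\n' | grep '^set=' \
    | sed 's/^set=//' | sort | uniq -c
}

count_sets "$vcf"
```

As with the VariantsToTable approach, the values printed are the ones to use in the -select expressions.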
Reading the VariantEval output file
The VariantEval output is formatted as a GATKReport.
Understanding Genotype Concordance values from VariantEval
The VariantEval genotype concordance module emits information about the relationship between the eval calls and genotypes and the comp calls and genotypes. The following three slides provide some insight into three key metrics to assess call sensitivity and concordance between genotypes.
##:GATKReport.v0.1 GenotypeConcordance.sampleSummaryStats : the concordance statistics summary for each sample
GenotypeConcordance.sampleSummaryStats  CompRod   CpG  EvalRod  JexlExpression  Novelty  percent_comp_ref_called_var  percent_comp_het_called_het  percent_comp_het_called_var  percent_comp_hom_called_hom  percent_comp_hom_called_var  percent_non-reference_sensitivity  percent_overall_genotype_concordance  percent_non-reference_discrepancy_rate
GenotypeConcordance.sampleSummaryStats  compOMNI  all  eval     none            all      0.78                         97.65                        98.39                        99.13                        99.44                        98.80                              99.09                                 3.60
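A single metric can be pulled out of a report like the one above with awk. This is a rough sketch rather than an official parser: it assumes the v0.1 layout shown in the example (every header and data row starts with the table name, columns are separated by whitespace, and no value contains a space), and the helper name report_column is hypothetical.

```shell
# Sample report holding the single table from the example above.
report=$(mktemp)
cat > "$report" <<'EOF'
##:GATKReport.v0.1 GenotypeConcordance.sampleSummaryStats : the concordance statistics summary for each sample
GenotypeConcordance.sampleSummaryStats CompRod CpG EvalRod JexlExpression Novelty percent_comp_ref_called_var percent_comp_het_called_het percent_comp_het_called_var percent_comp_hom_called_hom percent_comp_hom_called_var percent_non-reference_sensitivity percent_overall_genotype_concordance percent_non-reference_discrepancy_rate
GenotypeConcordance.sampleSummaryStats compOMNI all eval none all 0.78 97.65 98.39 99.13 99.44 98.80 99.09 3.60
EOF

# Usage: report_column FILE TABLE COLUMN
# Prints the value of COLUMN for every data row of TABLE.
report_column() {
  awk -v table="$2" -v col="$3" '
    $1 != table { next }   # skip "##:" lines and rows of other tables
    !idx {                 # first row of the table is its column header
      for (i = 2; i <= NF; i++) if ($i == col) idx = i
      next
    }
    { print $idx }         # data rows: print the requested column
  ' "$report_file_unused" "$1"
}

# The extra empty filename above would confuse awk, so define the helper
# cleanly here; only this definition is used.
report_column() {
  awk -v table="$2" -v col="$3" '
    $1 != table { next }
    !idx { for (i = 2; i <= NF; i++) if ($i == col) idx = i; next }
    { print $idx }
  ' "$1"
}

report_column "$report" GenotypeConcordance.sampleSummaryStats \
  percent_overall_genotype_concordance
```

If the report is stratified (for example by Sample or Novelty), the function prints one value per row, so add further conditions on the stratification columns to pick out a single row.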
The key outputs:
- percent_overall_genotype_concordance
- percent_non-reference_sensitivity
- percent_non-reference_discrepancy_rate
All defined below.
