VariantEval

From GSA

Jump to: navigation, search

Contents

Introduction

General purpose variation metrics like percent in dbSNP, genotype concordance, Ts/Tv ratios. Can handle both population-level allele evaluations (like HWE rates) as well as genotype-specific analyses like Genotype Concordance calculations.

VariantEval is still under continuous and active development -- results are likely to change, improve, or bugs might be introduced unexpectedly. Also, for now, VariantEval ignores sites with more than just one alternate allele.

Basic arguments/command

One or more evaluation RODs
The file(s) containing variations for evaluation. The name bound to the ROD must begin with "eval". Can be in any number of formats, including VCF, GeliText, and GFF.
Zero or more comparison RODs
The file(s) containing variations for comparison against each of the eval RODs. The name bound to the ROD must begin with "comp". Can be in any number of formats, including VCF, dbSNP, and HapMap.
-D dbSNP rod
The (optional) dbSNP track for db_coverage statistic calculations. Other databases of known variation sites can be specified with the -known argument.
--evalModule, -E
By default, the "standard" (stable) modules are run with VariantEval. If you would like to add some of the more experimental (and less supported) modules add one or more of them with this argument (use the --list option to see a list of available modules).
--doNotUseAllStandardModules, -noStandard
Instructs the tool not to use the standard evaluation modules by default (although they can be added back individually with the -E argument).
--list, -ls
Instructs the tool to list the available evaluation modules and exit; standard modules will be highlighted.
-o
The output file; if not specified, the tool prints to standard out.
-reportType
The template output type. Possible options: Table (the default), Grep, CSV, R.


java -Xmx2048m -jar GenomeAnalysisTK.jar \
   -T VariantEval -R human_b36_both.fasta \
    -l INFO \
    -B:eval,VCF example1.vcf \
    -B:comp,VCF example2.vcf \
    -D /humgen/gsa-scr1/GATK_Data/dbsnp_129_b36.rod \
    -E CountVariants
    -noStandard
    -BTI eval
Analysis Name:         	Count Variants
Analysis Description:  	Counts different classes of variants in the sample

evaluation_name          comparison_name          jexl_expression          filter_name              novelty_name             nProcessedLoci           nCalledLoci              nRefLoci                 nVariantLoci             variantRate              variantRatePerBp         nSNPs                    nInsertions              nDeletions               nComplex                 nNoCalls                 nHets                    nHomRef                  nHomVar                  heterozygosity           heterozygosityPerBp      hetHomRatio              indelRate                indelRatePerBp           deletionInsertionRatio  
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
eval                     comp                     none                     called                   all                      10                       8                        0                        8                        0.8000                   1.0000                   8                        0                        0                        0                        0                        5                        0                        3                        0.5000                   2.0000                   1.6667                   0.0000                   0.0000                   0.0000                  
eval                     comp                     none                     called                   known                    10                       4                        0                        4                        0.4000                   2.0000                   4                        0                        0                        0                        0                        2                        0                        2                        0.2000                   5.0000                   1.0000                   0.0000                   0.0000                   0.0000                  
eval                     comp                     none                     called                   novel                    10                       4                        0                        4                        0.4000                   2.0000                   4                        0                        0                        0                        0                        3                        0                        1                        0.3000                   3.0000                   3.0000                   0.0000                   0.0000                   0.0000                  
eval                     comp                     none                     filtered                 all                      10                       2                        0                        2                        0.2000                   5.0000                   2                        0                        0                        0                        0                        2                        0                        0                        0.2000                   5.0000                   2.0000                   0.0000                   0.0000                   0.0000                  
eval                     comp                     none                     filtered                 known                    10                       1                        0                        1                        0.1000                   10.0000                  1                        0                        0                        0                        0                        1                        0                        0                        0.1000                   10.0000                  1.0000                   0.0000                   0.0000                   0.0000                  
eval                     comp                     none                     filtered                 novel                    10                       1                        0                        1                        0.1000                   10.0000                  1                        0                        0                        0                        0                        1                        0                        0                        0.1000                   10.0000                  1.0000                   0.0000                   0.0000                   0.0000                  
eval                     comp                     none                     raw                      all                      10                       10                       0                        10                       1.0000                   1.0000                   10                       0                        0                        0                        0                        7                        0                        3                        0.7000                   1.0000                   2.3333                   0.0000                   0.0000                   0.0000                  
eval                     comp                     none                     raw                      known                    10                       5                        0                        5                        0.5000                   2.0000                   5                        0                        0                        0                        0                        3                        0                        2                        0.3000                   3.0000                   1.5000                   0.0000                   0.0000                   0.0000                  
eval                     comp                     none                     raw                      novel                    10                       5                        0                        5                        0.5000                   2.0000                   5                        0                        0                        0                        0                        4                        0                        1                        0.4000                   2.0000                   4.0000                   0.0000                   0.0000                   0.0000                  
eval                     dbsnp                    none                     called                   all                      10                       8                        0                        8                        0.8000                   1.0000                   8                        0                        0                        0                        0                        5                        0                        3                        0.5000                   2.0000                   1.6667                   0.0000                   0.0000                   0.0000                  
eval                     dbsnp                    none                     called                   known                    10                       4                        0                        4                        0.4000                   2.0000                   4                        0                        0                        0                        0                        2                        0                        2                        0.2000                   5.0000                   1.0000                   0.0000                   0.0000                   0.0000                  
eval                     dbsnp                    none                     called                   novel                    10                       4                        0                        4                        0.4000                   2.0000                   4                        0                        0                        0                        0                        3                        0                        1                        0.3000                   3.0000                   3.0000                   0.0000                   0.0000                   0.0000                  
eval                     dbsnp                    none                     filtered                 all                      10                       2                        0                        2                        0.2000                   5.0000                   2                        0                        0                        0                        0                        2                        0                        0                        0.2000                   5.0000                   2.0000                   0.0000                   0.0000                   0.0000                  
eval                     dbsnp                    none                     filtered                 known                    10                       1                        0                        1                        0.1000                   10.0000                  1                        0                        0                        0                        0                        1                        0                        0                        0.1000                   10.0000                  1.0000                   0.0000                   0.0000                   0.0000                  
eval                     dbsnp                    none                     filtered                 novel                    10                       1                        0                        1                        0.1000                   10.0000                  1                        0                        0                        0                        0                        1                        0                        0                        0.1000                   10.0000                  1.0000                   0.0000                   0.0000                   0.0000                  
eval                     dbsnp                    none                     raw                      all                      10                       10                       0                        10                       1.0000                   1.0000                   10                       0                        0                        0                        0                        7                        0                        3                        0.7000                   1.0000                   2.3333                   0.0000                   0.0000                   0.0000                  
eval                     dbsnp                    none                     raw                      known                    10                       5                        0                        5                        0.5000                   2.0000                   5                        0                        0                        0                        0                        3                        0                        2                        0.3000                   3.0000                   1.5000                   0.0000                   0.0000                   0.0000                  
eval                     dbsnp                    none                     raw                      novel                    10                       5                        0                        5                        0.5000                   2.0000                   5                        0                        0                        0                        0                        4                        0                        1                        0.4000                   2.0000                   4.0000                   0.0000                   0.0000                   0.0000                  

Here we see a report on the SNP counts for each unique eval, comp, jexl_expression ("select"), filter (called/filtered/raw), and novelty (known/novel/all) set.

Advanced arguments/command

-known name
Zero or more names of the ROD track(s) that should be treated as known variation sites when splitting eval rods into known and novel subsets. By default this is set to "dbsnp" only.
-select
Zero or more JEXL expressions representing other stratifications to use when evaluating the data. See here for documentation on using JEXL expressions.
-selectName
Names to use for the list of stratifications (must be a 1-to-1 mapping).
-sample
If provided, the eval and comp contexts will be derived using genotypes only from these samples, when genotypes are available in the original context.
-family
If provided, genotypes will be examined for mendelian violations: this argument is a string formatted as dad+mom=child where these parameters determine which sample names are examined.
--MendelianViolationQualThreshold, -MVQ
The minimum genotype QUAL score for each trio member required to accept a site as a violation; used with the -MVQ command [default:50].
--InterestingSitesVCF, -outputVCF
If provided, interesting sites will be written to this VCF with the INFO field annotated as to why they are interesting.
--minPhredConfidenceScore, -Q
Minimum confidence (QUAL) score to consider an evaluation variant; by default, there is no minimum.
--minPhredConfidenceScoreForComp, -Qcomp
Minimum confidence (QUAL) score to consider a comp variant; by default, there is no minimum.
-rsID
An optional list of rsIDs and build numbers for capping known variants by their build date.
-maxRsIDBuild
If provided, only variants with rsIDs <= maxRsIDBuild will be included in the set of known variants.
java -Xmx2048m -jar GenomeAnalysisTK.jar \
   -T VariantEval -R human_b36_both.fasta \
    -l INFO \
    -B:eval1,VCF NA12891.vcf \
    -B:eval2,VCF NA12892.vcf \
    -B:comp,VCF CEU.vcf \
    -D /humgen/gsa-scr1/GATK_Data/dbsnp_129_b36.rod \
    -all \
    -sample NA12878 -sample NA12891 -sample NA12892 \
    -Q 30.0 \
    -known dbsnp -known comp \
    -select "DP > 100" -select "QUAL < 100" \
    -selectName highDepth -selectName lowQual

A useful analysis using VariantEval

A useful Venn diagram

We in GSA often find ourselves performing an analysis of 2 different call sets. For SNPs, we often show the overlap of the sets (their "venn") and the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have 2 filtered VCF callsets (e.g. made by running the [[Unified genotyper]Unified Genotyper] and then [[VariantFiltrationWalker]Variant Filtration]) named 'foo.vcf' and 'bar.vcf', there are 2 quick steps.

1) Combine the VCFs

java -jar GenomeAnalysisTK.jar \
  -R ref.fasta \
  -T CombineVariants \
  -variantMergeOptions UNION \
  -B:foo,VCF foo.vcf \
  -B:bar,VCF bar.vcf \
  -priority foo,bar \
  -o merged.vcf

2) Run VariantEval

java -jar GenomeAnalysisTK.jar \
  -R ref.fasta \
  -T VariantEval \
  -reportType Grep \
  -B:eval,VCF merged.vcf \
  -D dbsnp_129_hg18.rod \
  -select 'set=="Intersection"' -selectName Intersection \
  -select 'set=="foo-filteredInOther"' -selectName InFoo-FilteredInBar \
  -select 'set=="foo"' -selectName foo-only \
  -select 'set=="bar-filteredInOther"' -selectName InBar-FilteredInFoo \
  -select 'set=="bar"' -selectName bar-only \
  -Q 50 \
  -E TiTvVariantEvaluator \
  -E CompOverlap \
  -E CountVariants \
  -o myData.eval

One can optionally provide another 'comparison' track by using e.g. -B:comp_baz,VCF baz.vcf

Understanding Genotype Concordance values from Variant Eval

The VariantEval genotype concordance module emits information the relationship between the eval calls and genotypes and the comp calls and genotypes. The following three slides provide some insight into three key metrics to assess call sensitivity and concordance between genotypes.

These data are available

#Table Name : Sample Summary Statistics
evaluation_name,comparison_name,jexl_expression,filter_name,novelty_name,row,percent_comp_het_called_het,percent_comp_het_called_var,percent_comp_hom_called_hom,percent_comp_hom_called_var,percent_non_ref_sensitivity,percent_overall_genotype_concordance,percent_non_ref_discrepancy_rate
eval,comp,none,called,all,NA12878.GOOD,100.00,100.00,100.00,100.00,100.00,100.00,0.00
eval,comp,none,called,all,NA12878.BAD,25.00,50.00,25.00,50.00,50.00,33.33,75.00
eval,comp,none,called,all,allSamples,62.50,75.00,62.50,75.00,75.00,71.43,37.50
eval,comp,none,called,novel,NA12878.GOOD,100.00,100.00,100.00,100.00,100.00,100.00,0.00
eval,comp,none,called,novel,NA12878.BAD,25.00,50.00,25.00,50.00,50.00,33.33,75.00
eval,comp,none,called,novel,allSamples,62.50,75.00,62.50,75.00,75.00,71.43,37.50
eval,comp,none,raw,all,NA12878.GOOD,100.00,100.00,100.00,100.00,100.00,100.00,0.00
eval,comp,none,raw,all,NA12878.BAD,25.00,50.00,25.00,50.00,50.00,33.33,75.00
eval,comp,none,raw,all,allSamples,62.50,75.00,62.50,75.00,75.00,71.43,37.50
eval,comp,none,raw,novel,NA12878.GOOD,100.00,100.00,100.00,100.00,100.00,100.00,0.00
eval,comp,none,raw,novel,NA12878.BAD,25.00,50.00,25.00,50.00,50.00,33.33,75.00
eval,comp,none,raw,novel,allSamples,62.50,75.00,62.50,75.00,75.00,71.43,37.50

for two simple example samples with bad and good concordance to another set of test data. The key outputs:

  • percent_overall_genotype_concordance
  • percent_overall_genotype_concordance
  • percent_non_ref_discrepancy_rate

All defined below. In fact NA12878.GOOD is the example in the slides.

Non-reference sensitivity
Non-reference discrepancy rate
Overall genotype concordance
Example calculation
Personal tools