VariantEval accepts two types of modules: stratification and evaluation modules.
CpG is a three-state stratification:
A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.
EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.
Sample is an N-state stratification, where N is the number of samples in the eval files.
Filter is a three-state stratification:
FunctionalClass is a four-state stratification:
CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.
Degeneracy is a six-state stratification:
See the [http://en.wikipedia.org/wiki/Genetic_code#Degeneracy Wikipedia page on degeneracy] for more information.
JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See [[Using JEXL expressions]]
Novelty is a three-state stratification:
CountVariants is an evaluation module that computes the following metrics:
|nProcessedLoci||Number of processed loci|
|nCalledLoci||Number of called loci|
|nRefLoci||Number of reference loci|
|nVariantLoci||Number of variant loci|
|variantRate||Variants per loci rate|
|variantRatePerBp||Number of variants per base|
|nSNPs||Number of snp loci|
|nInsertions||Number of insertion|
|nDeletions||Number of deletions|
|nComplex||Number of complex loci|
|nNoCalls||Number of no calls loci|
|nHets||Number of het loci|
|nHomRef||Number of hom ref loci|
|nHomVar||Number of hom var loci|
|nSingletons||Number of singletons|
|heterozygosity||heterozygosity per locus rate|
|heterozygosityPerBp||heterozygosity per base pair|
|hetHomRatio||heterozygosity to homozygosity ratio|
|indelRate||indel rate (insertion count + deletion count)|
|indelRatePerBp||indel rate per base pair|
|deletionInsertionRatio||deletion to insertion ratio|
CompOverlap is an evaluation module that computes the following metrics:
|nEvalSNPs||number of eval SNP sites|
|nCompSNPs||number of comp SNP sites|
|novelSites||number of eval sites outside of comp sites|
|nVariantsAtComp||number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same)|
|compRate||percentage of eval sites at comp sites|
|nConcordant||number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele)|
|concordantRate||the concordance rate|
A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):
$ grep -v '##' dbsnp.vcf #CHROM POS ID REF ALT QUAL FILTER INFO 1 10327 rs112750067 T C . . ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132
Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:
$ grep -v '##' eval_correct_allele.vcf #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6 1 10327 . T C 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059
Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:
$ grep -v '##' eval_incorrect_allele.vcf #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001-6 1 10327 . T A 5168.52 PASS ... GT:AD:DP:GQ:PL 0/1:357,238:373:99:3959,0,4059
Running VariantEval with just the CompOverlap module:
$ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \ -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \ -L 1:10327 \ -B:dbsnp,VCF dbsnp.vcf \ -B:eval_correct_allele,VCF eval_correct_allele.vcf \ -B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \ -noEV \ -EV CompOverlap \ -o eval.table
We find that the eval.table file contains the following:
$ grep -v '##' eval.table | column -t CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants nCompVariants novelSites nVariantsAtComp compRate nConcordant concordantRate CompOverlap dbsnp eval_correct_allele none all 1 1 0 1 100.00000000 1 100.00000000 CompOverlap dbsnp eval_correct_allele none known 1 1 0 1 100.00000000 1 100.00000000 CompOverlap dbsnp eval_correct_allele none novel 0 0 0 0 0.00000000 0 0.00000000 CompOverlap dbsnp eval_incorrect_allele none all 1 1 0 1 100.00000000 0 0.00000000 CompOverlap dbsnp eval_incorrect_allele none known 1 1 0 1 100.00000000 0 0.00000000 CompOverlap dbsnp eval_incorrect_allele none novel 0 0 0 0 0.00000000 0 0.00000000
As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.
TiTvVariantEvaluator is an evaluation module that computes the following metrics:
|nTi||number of transition loci|
|nTv||number of transversion loci|
|tiTvRatio||the transition to transversion ratio|
|nTiInComp||number of comp transition sites|
|nTvInComp||number of comp transversion sites|
|TiTvRatioStandard||the transition to transversion ratio for comp sites|
Our testing has shown that not all combinations of snpEff/database versions produce high-quality results. Be sure to read this document completely to familiarize yourself with our recommended best practices BEFORE running snpEff.
Until recently we were using an in-house annotation tool for genomic annotation, but the burden of keeping the database current and our lack of ability to annotate indels has led us to employ the use of a third-party tool instead. After reviewing many external tools (including annoVar, VAT, and Oncotator), we decided that SnpEff best meets our needs as it accepts VCF files as input, can annotate a full exome callset (including indels) in seconds, and provides continually-updated transcript databases. We have implemented support in the GATK for parsing the output from the SnpEff tool and annotating VCFs with the information provided in it.
Download the SnpEff core program. If you want to be able to run VariantAnnotator on the SnpEff output, you'll need to download a version of SnpEff that VariantAnnotator supports from this page (currently supported versions are listed below). If you just want the most recent version of SnpEff and don't plan to run VariantAnnotator on its output, you can get it from here.
After unzipping the core program, open the file snpEff.config in a text editor, and change the "database_repository" line to the following:
database_repository = http://sourceforge.net/projects/snpeff/files/databases/
Then, download one or more databases using SnpEff's built-in download command:
java -jar snpEff.jar download GRCh37.64
You can find a list of available databases here. The human genome databases have GRCh or hg in their names. You can also download the databases directly from the SnpEff website, if you prefer.
The download command by default puts the databases into a subdirectory called data within the directory containing the SnpEff jar file. If you want the databases in a different directory, you'll need to edit the
data_dir entry in the file
snpEff.config to point to the correct directory.
Run SnpEff on the file containing your variants, and redirect its output to a file. SnpEff supports many input file formats including VCF 4.1, BED, and SAM pileup. Full details and command-line options can be found on the SnpEff home page.
If you want to take advantage of SnpEff integration in the GATK, you'll need to run SnpEff version **2.0.5*. Note: newer versions are currently unsupported by the GATK, as we haven't yet had the reources to test it.
These best practices are based on our analysis of various snpEff/database versions as described in detail in the Analysis of SnpEff Annotations Across Versions section below.
We recommend using only the GRCh37.64 database with SnpEff 2.0.5. The more recent GRCh37.65 database produces many false-positive Missense annotations due to a regression in the ENSEMBL Release 65 GTF file used to build the database. This regression has been acknowledged by ENSEMBL and is supposedly fixed as of 1-30-2012; however as we have not yet tested the fixed version of the database we continue to recommend using only GRCh37.64 for now.
We recommend always running with
-onlyCoding true with human databases (eg., the GRCh37.* databases). Setting
-onlyCoding false causes snpEff to report all transcripts as if they were coding (even if they're not), which can lead to nonsensical results. The
-onlyCoding false option should only be used with databases that lack protein coding information.
Do not trust annotations from versions of snpEff prior to 2.0.4. Older versions of snpEff (such as 2.0.2) produced many incorrect annotations due to the presence of a certain number of nonsensical transcripts in the underlying ENSEMBL databases. Newer versions of snpEff filter out such transcripts.
See our analysis of the SNP annotations produced by snpEff across various snpEff/database versions here.
Both snpEff 2.0.2 + GRCh37.63 and snpEff 2.0.5 + GRCh37.65 produce an abnormally high Missense:Silent ratio, with elevated levels of Missense mutations across the entire spectrum of allele counts. They also have a relatively low (~70%) level of concordance with the 1000G Gencode annotations when it comes to Silent mutations. This suggests that these combinations of snpEff/database versions incorrectly annotate many Silent mutations as Missense.
snpEff 2.0.4 RC3 + GRCh37.64 and snpEff 2.0.5 + GRCh37.64 produce a Missense:Silent ratio in line with expectations, and have a very high (~97%-99%) level of concordance with the 1000G Gencode annotations across all categories.
See our comparison of SNP annotations produced using the GRCh37.64 and GRCh37.65 databases with snpEff 2.0.5 here
The GRCh37.64 database gives good results on the condition that you run snpEff with the
-onlyCoding true option. The
-onlyCoding false option causes snpEff to mark all transcripts as coding, and so produces many false-positive Missense annotations.
The GRCh37.65 database gives results that are as poor as those you get with the
-onlyCoding false option on the GRCh37.64 database. This is due to a regression in the ENSEMBL release 65 GTF file used to build snpEff's GRCh37.65 database. The regression has been acknowledged by ENSEMBL and is due to be fixed shortly.
See our analysis of the INDEL annotations produced by snpEff across snpEff/database versions here
Below is an example of how to run SnpEff version 2.0.5 with a VCF input file and have it write its output in VCF format as well. Notice that you need to explicitly specify the database you want to use (in this case, GRCh37.64). This database must be present in a directory of the same name within the
data_dir as defined in
java -Xmx4G -jar snpEff.jar eff -v -onlyCoding true -i vcf -o vcf GRCh37.64 1000G.exomes.vcf > snpEff_output.vcf
In this mode, SnpEff aggregates all effects associated with each variant record together into a single INFO field annotation with the key EFF. The general format is:
EFF=Effect1(Information about Effect1),Effect2(Information about Effect2),etc.
And here is the precise layout with all the subfields:
It's also possible to get SnpEff to output in a (non-VCF) text format with one Effect per line. See the SnpEff home page for full details.
Once you have a SnpEff output VCF file, you can use the VariantAnnotator walker to add SnpEff annotations based on that output to the input file you ran SnpEff on.
There are two different options for doing this:
NOTE: This option works only with supported SnpEff versions as explained above. VariantAnnotator run as described below will refuse to parse SnpEff output files produced by other versions of the tool, or which lack a SnpEff version number in their header.
The default behavior when you run VariantAnnotator on a SnpEff output file is to parse the complete set of effects resulting from the current variant, select the most biologically-significant effect, and add annotations for just that effect to the INFO field of the VCF record for the current variant. This is the mode we plan to use in our Production Data-Processing Pipeline.
When selecting the most biologically-significant effect associated with the current variant, VariantAnnotator does the following:
Prioritizes the effects according to the categories (in order of decreasing precedence) "High-Impact", "Moderate-Impact", "Low-Impact", and "Modifier", and always selects one of the effects from the highest-priority category. For example, if there are three moderate-impact effects and two high-impact effects resulting from the current variant, the annotator will choose one of the high-impact effects and add annotations based on it. See below for a full list of the effects arranged by category.
Within each category, ties are broken using the functional class of each effect (in order of precedence: NONSENSE, MISSENSE, SILENT, or NONE). For example, if there is both a NON_SYNONYMOUS_CODING (MODERATE-impact, MISSENSE) and a CODON_CHANGE (MODERATE-impact, NONE) effect associated with the current variant, the annotator will select the NON_SYNONYMOUS_CODING effect. This is to allow for more accurate counts of the total number of sites with NONSENSE/MISSENSE/SILENT mutations. See below for a description of the functional classes SnpEff associates with the various effects.
Effects that are within a non-coding region are always considered lower-impact than effects that are within a coding region.
java -jar dist/GenomeAnalysisTK.jar \ -T VariantAnnotator \ -R /humgen/1kg/reference/human_g1k_v37.fasta \ -A SnpEff \ --variant 1000G.exomes.vcf \ (file to annotate) --snpEffFile snpEff_output.vcf \ (SnpEff VCF output file generated by running SnpEff on the file to annotate) -L 1000G.exomes.vcf \ -o out.vcf
VariantAnnotator adds some or all of the following INFO field annotations to each variant record:
SNPEFF_EFFECT- The highest-impact effect resulting from the current variant (or one of the highest-impact effects, if there is a tie)
SNPEFF_IMPACT- Impact of the highest-impact effect resulting from the current variant (
SNPEFF_FUNCTIONAL_CLASS- Functional class of the highest-impact effect resulting from the current variant (
SNPEFF_CODON_CHANGE- Old/New codon for the highest-impact effect resulting from the current variant
SNPEFF_AMINO_ACID_CHANGE- Old/New amino acid for the highest-impact effect resulting from the current variant
SNPEFF_GENE_NAME- Gene name for the highest-impact effect resulting from the current variant
SNPEFF_GENE_BIOTYPE- Gene biotype for the highest-impact effect resulting from the current variant
SNPEFF_TRANSCRIPT_ID- Transcript ID for the highest-impact effect resulting from the current variant
SNPEFF_EXON_ID- Exon ID for the highest-impact effect resulting from the current variant
Example VCF records annotated using SnpEff and VariantAnnotator:
1 874779 . C T 279.94 . AC=1;AF=0.0032;AN=310;BaseQRankSum=-1.800;DP=3371;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.4493;InbreedingCoeff=-0.0045; MQ=54.49;MQ0=10;MQRankSum=0.982;QD=13.33;ReadPosRankSum=-0.060;SB=-120.09;SNPEFF_AMINO_ACID_CHANGE=G215;SNPEFF_CODON_CHANGE=ggC/ggT; SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_874655_874840;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11; SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENST00000342066 1 874816 . C CT 2527.52 . AC=15;AF=0.0484;AN=310;BaseQRankSum=-11.876;DP=4718;FS=48.575;HRun=1;HaplotypeScore=91.9147;InbreedingCoeff=-0.0520; MQ=53.37;MQ0=6;MQRankSum=-1.388;QD=5.92;ReadPosRankSum=-1.932;SB=-741.06;SNPEFF_EFFECT=FRAME_SHIFT;SNPEFF_EXON_ID=exon_1_874655_874840; SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENST00000342066
VariantAnnotator also has the ability to take the EFF field from the SnpEff VCF output file containing all the effects aggregated together and copy it verbatim into the VCF to annotate.
Here's an example of how to do this:
java -jar dist/GenomeAnalysisTK.jar \ -T VariantAnnotator \ -R /humgen/1kg/reference/human_g1k_v37.fasta \ -E resource.EFF \ --variant 1000G.exomes.vcf \ (file to annotate) --resource snpEff_output.vcf \ (SnpEff VCF output file generated by running SnpEff on the file to annotate) -L 1000G.exomes.vcf \ -o out.vcf
Of course, in this case you can also use the VCF output by SnpEff directly, but if you are using VariantAnnotator for other purposes anyway the above might be useful.
Below are the possible genomic effects recognized by SnpEff, grouped by biological impact. Full descriptions of each effect are available on this page.
SnpEff assigns a functional class to certain effects, in addition to an impact:
NONSENSE: assigned to point mutations that result in the creation of a new stop codon
MISSENSE: assigned to point mutations that result in an amino acid change, but not a new stop codon
SILENT: assigned to point mutations that result in a codon change, but not an amino acid change or new stop codon
NONE: assigned to all effects that don't fall into any of the above categories (including all events larger than a point mutation)
The GATK prioritizes effects with functional classes over effects of equal impact that lack a functional class when selecting the most significant effect in VariantAnnotator. This is to enable accurate counts of NONSENSE/MISSENSE/SILENT sites.
For a complete, detailed argument reference, refer to the technical documentation page.
You can find detailed information about the various modules here.
Note that the GenotypeConcordance module has been rewritten as a separate walker tool (see its Technical Documentation page).
We in GSA often find ourselves performing an analysis of 2 different call sets. For SNPs, we often show the overlap of the sets (their "venn") and the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have 2 filtered VCF callsets named 'foo.vcf' and 'bar.vcf', there are 2 quick steps.
java -jar GenomeAnalysisTK.jar \ -R ref.fasta \ -T CombineVariants \ -V:FOO foo.vcf \ -V:BAR bar.vcf \ -priority FOO,BAR \ -o merged.vcf
java -jar GenomeAnalysisTK.jar \ -T VariantEval \ -R ref.fasta \ -D dbsnp.vcf \ -select 'set=="Intersection"' -selectName Intersection \ -select 'set=="FOO"' -selectName FOO \ -select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \ -select 'set=="BAR"' -selectName BAR \ -select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \ -select 'set=="FilteredInAll"' -selectName FilteredInAll \ -o merged.eval.gatkreport \ -eval merged.vcf \ -l INFO
It is wise to check the actual values for the set names present in your file before writing complex VariantEval commands. An easy way to do this is to extract the value of the set fields and then reduce that to the unique entries, like so:
java -jar GenomeAnalysisTK.jar -T VariantsToTable -R ref.fasta -V merged.vcf -F set -o fields.txt grep -v 'set' fields.txt | sort | uniq -c
This will provide you with a list of all of the possible values for 'set' in your VCF so that you can be sure to supply the correct select statements to VariantEval.
The VariantEval output is formatted as a GATKReport.
The VariantEval genotype concordance module emits information the relationship between the eval calls and genotypes and the comp calls and genotypes. The following three slides provide some insight into three key metrics to assess call sensitivity and concordance between genotypes.
##:GATKReport.v0.1 GenotypeConcordance.sampleSummaryStats : the concordance statistics summary for each sample GenotypeConcordance.sampleSummaryStats CompRod CpG EvalRod JexlExpression Novelty percent_comp_ref_called_var percent_comp_het_called_het percent_comp_het_called_var percent_comp_hom_called_hom percent_comp_hom_called_var percent_non-reference_sensitivity percent_overall_genotype_concordance percent_non-reference_discrepancy_rate GenotypeConcordance.sampleSummaryStats compOMNI all eval none all 0.78 97.65 98.39 99.13 99.44 98.80 99.09 3.60
The key outputs:
All defined below.