# Tagged with #annotation 2 documentation articles | 0 announcements | 14 forum discussions

Created 2012-08-06 17:28:04 | Updated 2015-05-13 15:20:03 | Tags: official basic analyst intro vcf annotation

This document describes "regular" (variants-only) VCF files. For information on the gVCF format produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document.

### 1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from NGS data, such as the UnifiedGenotyper and the HaplotypeCaller, is especially complex. This document describes some specific features and annotations used in the VCF files output by the GATK tools.

### 2. Basic structure of a VCF file

The following text is a valid VCF file describing the first few SNPs found by the UG in a deep whole genome data set from our favorite test sample, NA12878:

##fileformat=VCFv4.0
##FILTER=<ID=LowQual,Description="QUAL < 50.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with two (and only two) segregating haplotypes">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="log10-scaled probability of variant being true under the trained gaussian mixture model">
##UnifiedGenotyperV2="analysis_type=UnifiedGenotyperV2 input_file=[TEXT CLIPPED FOR CLARITY]"
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
chr1    873762  .       T   G   5231.78 PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255


It seems a bit complex, but the structure of the file is actually quite simple:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
chr1    873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255


After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that here everything is a SNP, but some could be indels or CNVs.

### 3. How variation is represented

The first 6 columns of the VCF, which represent the observed variation, are easy to understand because they have a single, well-defined meaning.

• CHROM and POS : The CHROM and POS gives the contig on which the variant occurs. For indels this is actually the base preceding the event, due to how indels are represented in a VCF.

• ID: The dbSNP rs identifier of the SNP, based on the contig and position of the call and whether a record exists at this site in dbSNP.

• REF and ALT: The reference base and alternative base that vary in the samples, or in the population in general. Note that REF and ALT are always given on the forward strand. For indels the REF and ALT bases always include at least one base each (the base before the event).

• QUAL: The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance. These values can grow very large when a large amount of NGS data is used for variant calling.

• FILTER: In a perfect world, the QUAL field would be based on a complete model for all error modes present in the data used to call. Unfortunately, we are still far from this ideal, and we have to use orthogonal approaches to determine which called sites, independent of QUAL, are machine errors and which are real SNPs. Whatever approach is used to filter the SNPs, the VCFs produced by the GATK carry both the PASSing filter records (the ones that are good have PASS in their FILTER field) as well as those that fail (the filter field is anything but PASS or a dot). If the FILTER field is a ".", then no filtering has been applied to the records, meaning that all of the records will be used for analysis but without explicitly saying that any PASS. You should avoid such a situation by always filtering raw variant calls before analysis.

In the excerpt shown above, here is how we interpret the line corresponding to each variant:

• chr1:873762 is a novel T/G polymorphism, found with very high confidence (QUAL = 5231.78)
• chr1:877664 is a known A/G SNP (named rs3828047), found with very high confidence (QUAL = 3931.66)
• chr1:899282 is a known C/T SNP (named rs28548431), but has a relative low confidence (QUAL = 71.77)
• chr1:974165 is a known T/C SNP but we have so little evidence for this variant in our data that although we write out a record for it (for book keeping, really) our statistical evidence is so low that we filter the record out as a bad site, as indicated by the "LowQual" annotation.

### 4. How genotypes are represented

The genotype fields of the VCF look more complicated but they're actually not that hard to interpret once you understand that they're just sets of tags and values. Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

chr1    873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26


Looking at that last column, here is what the tags mean:

• GT : The genotype of this sample. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

• 0/0 - the sample is homozygous reference
• 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
• 1/1 - the sample is homozygous alternate In the three examples above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
• GQ: The Genotype Quality, or Phred-scaled confidence that the true genotype is the one provided in GT. In the diploid case, if GT is 0/1, then GQ is really L(0/1) / (L(0/0) + L(0/1) + L(1/1)), where L is the likelihood that the sample is 0/0, 0/1/, or 1/1 under the model built for the NGS dataset. The GQ is simply the second most likely PL - the most likely PL. Because the most likely PL is always 0, GQ = second highest PL - 0. If the second most likely PL is greater than 99, we still assign a GQ of 99, so the highest value of GQ is 99.

• AD and DP: These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site. See the Technical Documentation for details on AD (DepthPerAlleleBySample) and DP (Coverage).

• PL: This field provides the likelihoods of the given genotypes (here, 0/0, 0/1, and 1/1). These are normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1, without priors. To be concrete, for the heterozygous case, this is L(data given that the true genotype is 0/1). The most likely genotype (given in the GT field) is scaled so that it's P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype.

With that out of the way, let's interpret the genotypes for NA12878 at chr1:899282.

chr1    899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:25.92:103,0,26


At this site, the called genotype is GT = 0/1, which is C/T. The confidence indicated by GQ = 25.92 isn't so good, largely because there were only a total of 4 reads at this site (DP =4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0). There's a chance that the subject is "hom-var" (=homozygous with the variant allele) since PL(1/1) = 26, which corresponds to 10^(-2.6), or 0.0025, but either way, it's clear that the subject is definitely not "hom-ref" (=homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.

### 5. Understanding annotations

Finally, variants in a VCF can be annotated with a variety of additional tags, either by the built-in tools or with others that you add yourself. The way they're formatted is similar to what we saw in the Genotype fields, except instead of being in two separate fields (tags and values, respectively) the annotation tags and values are grouped together, so tag-value pairs are written one after another.

chr1    873762  [CLIPPED] AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473
chr1    877664  [CLIPPED] AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185
chr1    899282  [CLIPPED] AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148


Here are some commonly used built-in annotations and what they mean:

Annotation tag in VCF Meaning
AC,AF,AN See the Technical Documentation for Chromosome Counts.
DB If present, then the variant is in dbSNP.
DP See the Technical Documentation for Coverage.
DS Were any of the samples downsampled because of too much coverage?
Dels See the Technical Documentation for SpanningDeletions.
MQ and MQ0 See the Technical Documentation for RMS Mapping Quality and Mapping Quality Zero.
BaseQualityRankSumTest See the Technical Documentation for Base Quality Rank Sum Test.
MappingQualityRankSumTest See the Technical Documentation for Mapping Quality Rank Sum Test.
ReadPosRankSumTest See the Technical Documentation for Read Position Rank Sum Test.
HRun See the Technical Documentation for Homopolymer Run.
HaplotypeScore See the Technical Documentation for Haplotype Score.
QD See the Technical Documentation for Qual By Depth.
VQSLOD Only present when using Variant quality score recalibration. Log odds ratio of being a true variant versus being false under the trained gaussian mixture model.
FS See the Technical Documentation for Fisher Strand
SB How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls).

Created 2012-07-23 17:36:42 | Updated 2013-06-10 18:24:01 | Tags: official snpeff variantannotator vcf tooltips callset annotation

### IMPORTANT NOTE: This document is out of date and will be replaced soon. In the meantime, you can find accurate information on how to run SnpEff in a compatible way with GATK in the SnpEff documentation, and instructions on what steps are necessary in the presentation on Functional Annotation linked in the comments below.

Our testing has shown that not all combinations of snpEff/database versions produce high-quality results. Be sure to read this document completely to familiarize yourself with our recommended best practices BEFORE running snpEff.

### Introduction

Until recently we were using an in-house annotation tool for genomic annotation, but the burden of keeping the database current and our lack of ability to annotate indels has led us to employ the use of a third-party tool instead. After reviewing many external tools (including annoVar, VAT, and Oncotator), we decided that SnpEff best meets our needs as it accepts VCF files as input, can annotate a full exome callset (including indels) in seconds, and provides continually-updated transcript databases. We have implemented support in the GATK for parsing the output from the SnpEff tool and annotating VCFs with the information provided in it.

### SnpEff Setup and Usage

Download the SnpEff core program. If you want to be able to run VariantAnnotator on the SnpEff output, you'll need to download a version of SnpEff that VariantAnnotator supports from this page (currently supported versions are listed below). If you just want the most recent version of SnpEff and don't plan to run VariantAnnotator on its output, you can get it from here.

After unzipping the core program, open the file snpEff.config in a text editor, and change the "database_repository" line to the following:

database_repository = http://sourceforge.net/projects/snpeff/files/databases/


java -jar snpEff.jar download GRCh37.64


You can find a list of available databases here. The human genome databases have GRCh or hg in their names. You can also download the databases directly from the SnpEff website, if you prefer.

The download command by default puts the databases into a subdirectory called data within the directory containing the SnpEff jar file. If you want the databases in a different directory, you'll need to edit the data_dir entry in the file snpEff.config to point to the correct directory.

Run SnpEff on the file containing your variants, and redirect its output to a file. SnpEff supports many input file formats including VCF 4.1, BED, and SAM pileup. Full details and command-line options can be found on the SnpEff home page.

### Supported SnpEff Versions

If you want to take advantage of SnpEff integration in the GATK, you'll need to run SnpEff version **2.0.5*. Note: newer versions are currently unsupported by the GATK, as we haven't yet had the reources to test it.

### Current Recommended Best Practices When Running SnpEff

These best practices are based on our analysis of various snpEff/database versions as described in detail in the Analysis of SnpEff Annotations Across Versions section below.

• We recommend using only the GRCh37.64 database with SnpEff 2.0.5. The more recent GRCh37.65 database produces many false-positive Missense annotations due to a regression in the ENSEMBL Release 65 GTF file used to build the database. This regression has been acknowledged by ENSEMBL and is supposedly fixed as of 1-30-2012; however as we have not yet tested the fixed version of the database we continue to recommend using only GRCh37.64 for now.

• We recommend always running with -onlyCoding true with human databases (eg., the GRCh37.* databases). Setting -onlyCoding false causes snpEff to report all transcripts as if they were coding (even if they're not), which can lead to nonsensical results. The -onlyCoding false option should only be used with databases that lack protein coding information.

• Do not trust annotations from versions of snpEff prior to 2.0.4. Older versions of snpEff (such as 2.0.2) produced many incorrect annotations due to the presence of a certain number of nonsensical transcripts in the underlying ENSEMBL databases. Newer versions of snpEff filter out such transcripts.

### Analyses of SnpEff Annotations Across Versions

See our analysis of the SNP annotations produced by snpEff across various snpEff/database versions here.

• Both snpEff 2.0.2 + GRCh37.63 and snpEff 2.0.5 + GRCh37.65 produce an abnormally high Missense:Silent ratio, with elevated levels of Missense mutations across the entire spectrum of allele counts. They also have a relatively low (~70%) level of concordance with the 1000G Gencode annotations when it comes to Silent mutations. This suggests that these combinations of snpEff/database versions incorrectly annotate many Silent mutations as Missense.

• snpEff 2.0.4 RC3 + GRCh37.64 and snpEff 2.0.5 + GRCh37.64 produce a Missense:Silent ratio in line with expectations, and have a very high (~97%-99%) level of concordance with the 1000G Gencode annotations across all categories.

See our comparison of SNP annotations produced using the GRCh37.64 and GRCh37.65 databases with snpEff 2.0.5 here

• The GRCh37.64 database gives good results on the condition that you run snpEff with the -onlyCoding true option. The -onlyCoding false option causes snpEff to mark all transcripts as coding, and so produces many false-positive Missense annotations.

• The GRCh37.65 database gives results that are as poor as those you get with the -onlyCoding false option on the GRCh37.64 database. This is due to a regression in the ENSEMBL release 65 GTF file used to build snpEff's GRCh37.65 database. The regression has been acknowledged by ENSEMBL and is due to be fixed shortly.

See our analysis of the INDEL annotations produced by snpEff across snpEff/database versions here

• snpEff's indel annotations are highly concordant with those of a high-quality set of genomic annotations from the 1000 Genomes project. This is true across all snpEff/database versions tested.

### Example SnpEff Usage with a VCF Input File

Below is an example of how to run SnpEff version 2.0.5 with a VCF input file and have it write its output in VCF format as well. Notice that you need to explicitly specify the database you want to use (in this case, GRCh37.64). This database must be present in a directory of the same name within the data_dir as defined in snpEff.config.

java -Xmx4G -jar snpEff.jar eff -v -onlyCoding true -i vcf -o vcf GRCh37.64 1000G.exomes.vcf > snpEff_output.vcf


In this mode, SnpEff aggregates all effects associated with each variant record together into a single INFO field annotation with the key EFF. The general format is:

EFF=Effect1(Information about Effect1),Effect2(Information about Effect2),etc.


And here is the precise layout with all the subfields:

EFF=Effect1(Effect_Impact|Effect_Functional_Class|Codon_Change|Amino_Acid_Change|Gene_Name|Gene_BioType|Coding|Transcript_ID|Exon_ID),Effect2(etc...


It's also possible to get SnpEff to output in a (non-VCF) text format with one Effect per line. See the SnpEff home page for full details.

### Adding SnpEff Annotations using VariantAnnotator

Once you have a SnpEff output VCF file, you can use the VariantAnnotator walker to add SnpEff annotations based on that output to the input file you ran SnpEff on.

There are two different options for doing this:

#### Option 1: Annotate with only the highest-impact effect for each variant

NOTE: This option works only with supported SnpEff versions as explained above. VariantAnnotator run as described below will refuse to parse SnpEff output files produced by other versions of the tool, or which lack a SnpEff version number in their header.

The default behavior when you run VariantAnnotator on a SnpEff output file is to parse the complete set of effects resulting from the current variant, select the most biologically-significant effect, and add annotations for just that effect to the INFO field of the VCF record for the current variant. This is the mode we plan to use in our Production Data-Processing Pipeline.

When selecting the most biologically-significant effect associated with the current variant, VariantAnnotator does the following:

• Prioritizes the effects according to the categories (in order of decreasing precedence) "High-Impact", "Moderate-Impact", "Low-Impact", and "Modifier", and always selects one of the effects from the highest-priority category. For example, if there are three moderate-impact effects and two high-impact effects resulting from the current variant, the annotator will choose one of the high-impact effects and add annotations based on it. See below for a full list of the effects arranged by category.

• Within each category, ties are broken using the functional class of each effect (in order of precedence: NONSENSE, MISSENSE, SILENT, or NONE). For example, if there is both a NON_SYNONYMOUS_CODING (MODERATE-impact, MISSENSE) and a CODON_CHANGE (MODERATE-impact, NONE) effect associated with the current variant, the annotator will select the NON_SYNONYMOUS_CODING effect. This is to allow for more accurate counts of the total number of sites with NONSENSE/MISSENSE/SILENT mutations. See below for a description of the functional classes SnpEff associates with the various effects.

• Effects that are within a non-coding region are always considered lower-impact than effects that are within a coding region.

Example Usage:

java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-A SnpEff \
--variant 1000G.exomes.vcf \        (file to annotate)
--snpEffFile snpEff_output.vcf \    (SnpEff VCF output file generated by running SnpEff on the file to annotate)
-L 1000G.exomes.vcf \
-o out.vcf


VariantAnnotator adds some or all of the following INFO field annotations to each variant record:

• SNPEFF_EFFECT - The highest-impact effect resulting from the current variant (or one of the highest-impact effects, if there is a tie)
• SNPEFF_IMPACT - Impact of the highest-impact effect resulting from the current variant (HIGH, MODERATE, LOW, or MODIFIER)
• SNPEFF_FUNCTIONAL_CLASS - Functional class of the highest-impact effect resulting from the current variant (NONE, SILENT, MISSENSE, or NONSENSE)
• SNPEFF_CODON_CHANGE - Old/New codon for the highest-impact effect resulting from the current variant
• SNPEFF_AMINO_ACID_CHANGE - Old/New amino acid for the highest-impact effect resulting from the current variant
• SNPEFF_GENE_NAME - Gene name for the highest-impact effect resulting from the current variant
• SNPEFF_GENE_BIOTYPE - Gene biotype for the highest-impact effect resulting from the current variant
• SNPEFF_TRANSCRIPT_ID - Transcript ID for the highest-impact effect resulting from the current variant
• SNPEFF_EXON_ID - Exon ID for the highest-impact effect resulting from the current variant

Example VCF records annotated using SnpEff and VariantAnnotator:

1   874779  .   C   T   279.94  . AC=1;AF=0.0032;AN=310;BaseQRankSum=-1.800;DP=3371;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.4493;InbreedingCoeff=-0.0045;
SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_874655_874840;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;
SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENST00000342066

1   874816  .   C   CT  2527.52 .   AC=15;AF=0.0484;AN=310;BaseQRankSum=-11.876;DP=4718;FS=48.575;HRun=1;HaplotypeScore=91.9147;InbreedingCoeff=-0.0520;
SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENST00000342066


#### Option 2: Annotate with all effects for each variant

VariantAnnotator also has the ability to take the EFF field from the SnpEff VCF output file containing all the effects aggregated together and copy it verbatim into the VCF to annotate.

Here's an example of how to do this:

java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-E resource.EFF \
--variant 1000G.exomes.vcf \      (file to annotate)
--resource snpEff_output.vcf \    (SnpEff VCF output file generated by running SnpEff on the file to annotate)
-L 1000G.exomes.vcf \
-o out.vcf


Of course, in this case you can also use the VCF output by SnpEff directly, but if you are using VariantAnnotator for other purposes anyway the above might be useful.

### List of Genomic Effects

Below are the possible genomic effects recognized by SnpEff, grouped by biological impact. Full descriptions of each effect are available on this page.

#### High-Impact Effects

• SPLICE_SITE_ACCEPTOR
• SPLICE_SITE_DONOR
• START_LOST
• EXON_DELETED
• FRAME_SHIFT
• STOP_GAINED
• STOP_LOST

#### Moderate-Impact Effects

• NON_SYNONYMOUS_CODING
• CODON_CHANGE (note: this effect is used by SnpEff only for MNPs, not SNPs)
• CODON_INSERTION
• CODON_CHANGE_PLUS_CODON_INSERTION
• CODON_DELETION
• CODON_CHANGE_PLUS_CODON_DELETION
• UTR_5_DELETED
• UTR_3_DELETED

#### Low-Impact Effects

• SYNONYMOUS_START
• NON_SYNONYMOUS_START
• START_GAINED
• SYNONYMOUS_CODING
• SYNONYMOUS_STOP
• NON_SYNONYMOUS_STOP

#### Modifiers

• NONE
• CHROMOSOME
• CUSTOM
• CDS
• GENE
• TRANSCRIPT
• EXON
• INTRON_CONSERVED
• UTR_5_PRIME
• UTR_3_PRIME
• DOWNSTREAM
• INTRAGENIC
• INTERGENIC
• INTERGENIC_CONSERVED
• UPSTREAM
• REGULATION
• INTRON

### Functional Classes

SnpEff assigns a functional class to certain effects, in addition to an impact:

• NONSENSE: assigned to point mutations that result in the creation of a new stop codon
• MISSENSE: assigned to point mutations that result in an amino acid change, but not a new stop codon
• SILENT: assigned to point mutations that result in a codon change, but not an amino acid change or new stop codon
• NONE: assigned to all effects that don't fall into any of the above categories (including all events larger than a point mutation)

The GATK prioritizes effects with functional classes over effects of equal impact that lack a functional class when selecting the most significant effect in VariantAnnotator. This is to enable accurate counts of NONSENSE/MISSENSE/SILENT sites.

No posts found with the requested search criteria.

Created 2015-02-05 20:47:57 | Updated | Tags: variantannotator annotation clinvar

Hi,

I'm using VariantAnnotator to add annotations to variants from a bunch of sources. One issue that I have is that for some variants, there are multiple annotations in a supplied resource. In the docs, I read

"Note that if there are multiple records in the resource file that overlap the given position, one is chosen randomly."

Can this behaviour be altered? I need to output all annotations for a record, either on a single line, or on multiple.

In the case i'm working on, one line has the annotation "CLNSIG=5" (i.e. a known pathogenic variant) and the other (likely older record) is "CLNSIG=1" i.e. a variant of unknown significance. I need to output both so I can filter downstream (using SelectVariants) to select those where "CLNSIG=5".

cheers

Created 2015-01-10 08:13:41 | Updated | Tags: unifiedgenotyper variantrecalibrator vqsr haplotypescore annotation

The documentation on the HaplotypeScore annotation reads:

HaplotypeCaller does not output this annotation because it already evaluates haplotype segregation internally. This annotation is only informative (and available) for variants called by Unified Genotyper.

The annotation used to be part of the best practices:

http://gatkforums.broadinstitute.org/discussion/15/best-practice-variant-detection-with-the-gatk-v1-x-retired


I will include it in the VQSR model for UG calls from low coverage data. Is this an unwise decision? I guess this is for myself to evaluate. I thought I would ask, in case I have missed something obvious.

Created 2014-11-21 10:00:37 | Updated | Tags: annotation genotypesummaries

Hi Sheila and Geraldine,

Would someone be kind to add some info regarding the GenotypeSummaries annotation?

Regards, Gerrit

Created 2013-08-20 15:43:47 | Updated | Tags: annotation interpretation

I have been trying to find documentation for understanding the annotations and numbers in the VCF file. Something like below, can someone guide me how to understand/interpret the numbers? How is the quality of the variant calling for this particular indel?

AC=2; AF=0.100; AN=20; BaseQRankSum=6.161; ClippingRankSum=-2.034; DP=313; FS=5.381; InbreedingCoeff=-0.1180; MLEAC=2; MLEAF=0.100; MQ=58.49; MQ0=0; MQRankSum=-0.456; QD=1.46; ReadPosRankSum=-4.442; VQSLOD=0.348; topculprit=ReadPosRankSum

Created 2013-07-23 19:44:06 | Updated | Tags: variantannotator vcf genotype annotation

Hi,

Could you tell me how to encourage GATK to annotate my genotype columns (i.e. add annotations to the FORMAT and PANC_R columns in the following file):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PANC_R
chrX 259221 . GA G 136.74 . AC=2;AF=1.00;AN=2;DP=15;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=8.82;MQ0=1;QD=3.04 GT:AD:GQ:PL 1/1:0,2:6:164,6,0


The file was generated with HaplotypeCaller. I used a command line similar to this one to no effect:

java -jar GATKROOT/GenomeAnalysisTK.jarT VariantAnnotator -R hg19_random.fa -I chr7_recalibrated.bam -V chr7.vcf --dbsnpdbSNP135_chr.vcf -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -o chr7_annotated-again.vcf  Does anyone have any suggestions? Thanks in advance! Created 2013-04-22 18:17:32 | Updated | Tags: annotation haplotype Hi Everyone, I had a few questions about the haplotype score. In the technical documentation it states that "Higher scores are indicative of regions with bad alignments, often leading to artifactual SNP and indel calls. Note that the Haplotype Score is only calculated for sites with read coverage." How is the haplotype group for each variant site determined? e.g. Does it take the closest two variants to the query site and then treat the query variant + closest two variants as the haplotype group? Also, in the case of multiallelic SNPs (>2 SNPs), haplotype score is inappropriate since it only looks at whether a site can be explained by the segregation of two and only two haplotypes, correct? So multiallelic snps will be assigned poor haplotype scores OR will these sites not be annotated at all? If we have a case where there is a truly biallelic SNP and a couple of samples have some reads that are erroneously calling a third allele, this variant site will be assigned a poor haplotype score overall, correct? Thanks, MC Created 2013-03-11 11:12:22 | Updated 2013-03-11 12:09:43 | Tags: unifiedgenotyper depthofcoverage annotation Just in the process of updating our pipeline from v2.3-4-gb8f1308 Lite to v2.4-7-g5e89f01 Academic and have run into a small issue. The command line: -T UnifiedGenotyper -glm SNP -R /lustre/scratch111/resources/ref/Homo_sapiens/1000Genomes_hs37d5/hs37d5.fa -I /lustre/scratch111/projects/helic/vcf-newpipe/lists/chr1-pooled.list --alleles /lustre/scratch111/projects/helic/vcf-newpipe/pooled/1/1:1-1000000.snps.vcf.gz -L 1:1-1000000 -U LENIENT_VCF_PROCESSING -baq CALCULATE_AS_NECESSARY -gt_mode GENOTYPE_GIVEN_ALLELES -out_mode EMIT_ALL_SITES --standard_min_confidence_threshold_for_calling 4.0 --standard_min_confidence_threshold_for_emitting 4.0 -l INFO -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A DepthOfCoverage -o /lustre/scratch111/projects/helic/vcf-newpipe/pooled/1/1:1-1000000.asnps.vcf.gz.part.vcf.gz  This worked in 2.3.4. But now gives: Invalid command line: Argument annotation has a bad value: Class DepthOfCoverage is not found; please check that you have specified the class name correctly  I've looked at the release notes but it's not giving me a clue as to what has changed. Has the DepthOfCoverage annotation now been dropped? I've checked and I can reproduce this on the latest nightly (nightly-2013-03-11-g184e5ac) Created 2013-02-28 10:10:09 | Updated | Tags: unifiedgenotyper annotation dbsnp Hi the GATK team; I use the UnifiedGenotyper the following way: java -jar GenomeAnalysisTK-2.1-13-g1706365/GenomeAnalysisTK.jar \ -R /human_g1k_v37.fasta \ -T UnifiedGenotyper \ -glm BOTH \ -S SILENT \ -L ../align/capture.bed \ -I myl.bam \ --dbsnp broadinstitute.org/bundle/1.5/b37/dbsnp_135.b37.vcf.gz \ -o output.vcf  When I look at the generated VCF , the variation 18:55997929 (CTTCT/C) is said to be rs4149608 18 55997929 rs4149608 CTTCT C (...)  but in the dbsnp_135.b37.vcf.gz, you can see that the right rs## should be rs144384654  gunzip -c broadinstitute.org/bundle/1.5/b37/dbsnp_135.b37.vcf.gz |grep -E -w '(rs4149608|rs144384654)'
18 55997929 rs4149608 CT C,CTTCT (...)
18 55997929 rs144384654 CTTCT C (...)


does UnifiedGenotyper uses the first rs## it finds at a given position ? Or should I use another method/tool to get the 'right' rs## ?

Thank you,

Pierre

Created 2012-12-27 23:26:34 | Updated 2013-01-07 19:16:48 | Tags: vcf annotation

Please look at lines 1 and 2 taken from a vcf file, which have same Chromosome and Position and one of the Alt allele is same in both lines, different allele count and have different rsID.

1 1229111 rs70949568 A ACGCCCCTGCCCTGGAGGCCCCGCCCCTGCCCTGGAGGCCC,C 2629.32 TruthSensitivityTranche99.50to99.90;TruthSensitivityTranche99.30to99.50 AC=80,31;AF=0.1273;AN=284;BaseQRankSum=1.124;DB;DP=426;Dels=0.00;FS=4.620;HRun=1;HaplotypeScore=0.2101;InbreedingCoeff=-0.0029;MQ0=0;MQ=58.46;MQRankSum=1.211;QD=5.26;ReadPosRankSum=-5.748;SB=-36.94;SF=0f,1f;SNPEFF_EFFECT=DOWNSTREAM;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ACAP3;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000379037;VQSLOD=-2.3894;culprit=MQ GT:DP:GQ:AD:PL

1 1229111 . A C 89.94 TruthSensitivityTranche99.00to99.30 AC=7;AF=0.0614;AN=114;BaseQRankSum=0.801;DP=175;Dels=0.00;FS=1.668;HRun=1;HaplotypeScore=0.2276;InbreedingCoeff=-0.0538;MQ0=0;MQ=57.90;MQRankSum=0.501;QD=4.28;ReadPosRankSum=-4.531;SB=-15.19;SF=0f;SNPEFF_EFFECT=DOWNSTREAM;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=ACAP3;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000379037;VQSLOD=-1.4433;culprit=MQ GT:DP:GQ:AD:PL

Created 2012-10-12 10:48:39 | Updated 2012-10-18 00:57:09 | Tags: variantannotator variants annotation

Hi,

Could you tell me when we can use new version of SnpEff with GATK?

I have some bugs :

##### ERROR MESSAGE: Could not create module ActiveRegionBasedAnnotation because BUG: Cannot find expected constructor for class caused by exception org.broadinstitute.sting.gatk.walkers

.annotator.interfaces.ActiveRegionBasedAnnotation.()

##### ERROR MESSAGE: Could not create module ExperimentalAnnotation because BUG: Cannot find expected constructor for class

caused by exception org.broadinstitute.sting.gatk.walkers.annotator.interfaces.ExperimentalAnnotation.()

##### ERROR ------------------------------------------------------------------------------------------

I don't know if I forget some other options linked these annotations. These options are important for me. So I deleted them but if somebody want to use them ...

Regards,

Tiphaine

Created 2012-10-10 16:18:42 | Updated 2013-01-07 20:29:25 | Tags: vqsr inbreedingcoeff annotation exome community

I'm curious about the experience of the community at large with VQSR, and specifically with which sets of annotations people have found to work well. The GATK team's recommendations are valuable, but my impression is that they have fairly homogenous data types - I'd like to know if anyone has found it useful to deviate from their recommendations.

For instance, I no longer include InbreedingCoefficient with my exome runs. This was spurred by a case where previously validated variants were getting discarded by VQSR. It turned out that these particular variants were homozygous alternate in the diseased samples and homozygous reference in the controls, yielding an InbreedingCoefficient very close to 1. We decided that the all-homozygous case was far more likely to be genuinely interesting than a sequencing/variant calling artifact, so we removed the annotation from VQSR. In order to catch the all-heterozygous case (which is more likely to be an error), we add a VariantFiltration pass for 'InbreedingCoefficient < -0.8' following ApplyRecalibration.

In my case, I think InbreedingCoefficient isn't as useful because my UG/VQSR cohorts tend to be smaller and less diverse than what the GATK team typically runs (and to be honest, I'm still not sure we're doing the best thing). Has anyone else found it useful to modify these annotations? It would be helpful if we could build a more complete picture of these metrics in a diverse set of experiments.

Created 2012-10-01 16:21:17 | Updated 2012-10-01 17:11:46 | Tags: vcf annotation

Hello,

For the “ABHom” annotations, the VCF header gives the following formula : (A/(A+O)). What does the 'O' stand for?

Thanks, Mika

Created 2012-09-19 19:15:06 | Updated 2013-01-07 19:47:43 | Tags: snpeff vcf annotation

Broad recommends using snpEff to add annotations to VCF files created by GATK. This gives annotations about the effect of a given variant: is it in a coding region? Does it cause a frameshift? What transcripts are impacted? etc. However, snpEff does not provide other annotations you might want, such as 1000 genomes minor allele frequency, SIFT scores, phyloP conservation scores, and so on. I've previously used annovar to get those sorts of things, and that worked well enough, though I did not find it to be especially user-friendly.

So my question is, what other ways have users found of getting this sort of annotation information? I'm interested specifically in human exomes, but I am sure other users reading this Ask the Community post will be interested in answers for other organisms as well. I'm looking for recommendations on what's quick, simple, easy to use, and has been used successfully with VCFs produced by GATK. I'm open to answers in the form of other software tools or sources of raw data that I can easily manipulate on my own.

Created 2012-09-13 16:51:57 | Updated 2013-01-07 20:40:24 | Tags: unifiedgenotyper annotation

I am doing human exome sequencing with hg19 as a reference, and I want UnifiedGenotyper to give me whatever annotations are available and I will worry later about which ones are useful and which are not.

I am confused about the behavior of the --annotation option in UnifiedGenotyper. The default value is listed as [], implying that unless we explicitly list what annotations we want, we get no annotations at all? Is that correct? Then in order to get a list of available annotations, we are directed to the VariantAnnotator --list option but it appears that it is not possible to just run:

java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T VariantAnnotator \
--list


In order to get a list of annotations. Instead, one not only needs to include a --variants flag, but the vcf file you point to actually has to be well-formatted, etc., otherwise you get errors like this

##### ERROR MESSAGE: Argument with name '--variant' (-V) is missing.


or this:

##### ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be
determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:


So, that having failed, is anyone able to just provide me with a list of possible arguments to the UnifiedGenotyper --annotation option?