IMPORTANT NOTE: Our testing has shown that not all combinations of snpEff/database versions produce high-quality results. Be sure to read this document completely to familiarize yourself with our recommended best practices BEFORE running snpEff.
Until recently we were using an in-house annotation tool for genomic annotation, but the burden of keeping the database current and our lack of ability to annotate indels has led us to employ the use of a third-party tool instead. After reviewing many external tools (including annoVar, VAT, and Oncotator), we decided that SnpEff best meets our needs as it accepts VCF files as input, can annotate a full exome callset (including indels) in seconds, and provides continually-updated transcript databases. We have implemented support in the GATK for parsing the output from the SnpEff tool and annotating VCFs with the information provided in it.
Download the SnpEff core program. If you want to be able to run VariantAnnotator on the SnpEff output, you'll need to download a version of SnpEff that VariantAnnotator supports from this page (currently supported versions are listed below). If you just want the most recent version of SnpEff and don't plan to run VariantAnnotator on its output, you can get it from here.
After unzipping the core program, open the file snpEff.config in a text editor, and change the "database_repository" line to the following:
database_repository = http://sourceforge.net/projects/snpeff/files/databases/
Then, download one or more databases using SnpEff's built-in download command:
java -jar snpEff.jar download GRCh37.64
You can find a list of available databases here. The human genome databases have GRCh or hg in their names. You can also download the databases directly from the SnpEff website, if you prefer.
The download command by default puts the databases into a subdirectory called data within the directory containing the SnpEff jar file. If you want the databases in a different directory, you'll need to edit the data_dir entry in the file snpEff.config to point to the correct directory.
Run SnpEff on the file containing your variants, and redirect its output to a file. SnpEff supports many input file formats including VCF 4.1, BED, and SAM pileup. Full details and command-line options can be found on the SnpEff home page.
If you want to take advantage of SnpEff integration in the GATK, you'll need to run SnpEff version **2.0.5*. Note: newer versions are currently unsupported by the GATK, as we haven't yet had the reources to test it.
These best practices are based on our analysis of various snpEff/database versions as described in detail in the Analysis of SnpEff Annotations Across Versions section below.
We recommend using only the GRCh37.64 database with SnpEff 2.0.5. The more recent GRCh37.65 database produces many false-positive Missense annotations due to a regression in the ENSEMBL Release 65 GTF file used to build the database. This regression has been acknowledged by ENSEMBL and is supposedly fixed as of 1-30-2012; however as we have not yet tested the fixed version of the database we continue to recommend using only GRCh37.64 for now.
We recommend always running with -onlyCoding true with human databases (eg., the GRCh37.* databases). Setting -onlyCoding false causes snpEff to report all transcripts as if they were coding (even if they're not), which can lead to nonsensical results. The -onlyCoding false option should only be used with databases that lack protein coding information.
Do not trust annotations from versions of snpEff prior to 2.0.4. Older versions of snpEff (such as 2.0.2) produced many incorrect annotations due to the presence of a certain number of nonsensical transcripts in the underlying ENSEMBL databases. Newer versions of snpEff filter out such transcripts.
See our analysis of the SNP annotations produced by snpEff across various snpEff/database versions here.
Both snpEff 2.0.2 + GRCh37.63 and snpEff 2.0.5 + GRCh37.65 produce an abnormally high Missense:Silent ratio, with elevated levels of Missense mutations across the entire spectrum of allele counts. They also have a relatively low (~70%) level of concordance with the 1000G Gencode annotations when it comes to Silent mutations. This suggests that these combinations of snpEff/database versions incorrectly annotate many Silent mutations as Missense.
snpEff 2.0.4 RC3 + GRCh37.64 and snpEff 2.0.5 + GRCh37.64 produce a Missense:Silent ratio in line with expectations, and have a very high (~97%-99%) level of concordance with the 1000G Gencode annotations across all categories.
See our comparison of SNP annotations produced using the GRCh37.64 and GRCh37.65 databases with snpEff 2.0.5 here
The GRCh37.64 database gives good results on the condition that you run snpEff with the -onlyCoding true option. The -onlyCoding false option causes snpEff to mark all transcripts as coding, and so produces many false-positive Missense annotations.
The GRCh37.65 database gives results that are as poor as those you get with the -onlyCoding false option on the GRCh37.64 database. This is due to a regression in the ENSEMBL release 65 GTF file used to build snpEff's GRCh37.65 database. The regression has been acknowledged by ENSEMBL and is due to be fixed shortly.
See our analysis of the INDEL annotations produced by snpEff across snpEff/database versions here
Below is an example of how to run SnpEff version 2.0.5 with a VCF input file and have it write its output in VCF format as well. Notice that you need to explicitly specify the database you want to use (in this case, GRCh37.64). This database must be present in a directory of the same name within the data_dir as defined in snpEff.config.
java -Xmx4G -jar snpEff.jar eff -v -onlyCoding true -i vcf -o vcf GRCh37.64 1000G.exomes.vcf > snpEff_output.vcf
In this mode, SnpEff aggregates all effects associated with each variant record together into a single INFO field annotation with the key EFF. The general format is:
EFF=Effect1(Information about Effect1),Effect2(Information about Effect2),etc.
And here is the precise layout with all the subfields:
EFF=Effect1(Effect_Impact|Effect_Functional_Class|Codon_Change|Amino_Acid_Change|Gene_Name|Gene_BioType|Coding|Transcript_ID|Exon_ID),Effect2(etc...
It's also possible to get SnpEff to output in a (non-VCF) text format with one Effect per line. See the SnpEff home page for full details.
Once you have a SnpEff output VCF file, you can use the VariantAnnotator walker to add SnpEff annotations based on that output to the input file you ran SnpEff on.
There are two different options for doing this:
NOTE: This option works only with supported SnpEff versions as explained above. VariantAnnotator run as described below will refuse to parse SnpEff output files produced by other versions of the tool, or which lack a SnpEff version number in their header.
The default behavior when you run VariantAnnotator on a SnpEff output file is to parse the complete set of effects resulting from the current variant, select the most biologically-significant effect, and add annotations for just that effect to the INFO field of the VCF record for the current variant. This is the mode we plan to use in our Production Data-Processing Pipeline.
When selecting the most biologically-significant effect associated with the current variant, VariantAnnotator does the following:
Prioritizes the effects according to the categories (in order of decreasing precedence) "High-Impact", "Moderate-Impact", "Low-Impact", and "Modifier", and always selects one of the effects from the highest-priority category. For example, if there are three moderate-impact effects and two high-impact effects resulting from the current variant, the annotator will choose one of the high-impact effects and add annotations based on it. See below for a full list of the effects arranged by category.
Within each category, ties are broken using the functional class of each effect (in order of precedence: NONSENSE, MISSENSE, SILENT, or NONE). For example, if there is both a NON_SYNONYMOUS_CODING (MODERATE-impact, MISSENSE) and a CODON_CHANGE (MODERATE-impact, NONE) effect associated with the current variant, the annotator will select the NON_SYNONYMOUS_CODING effect. This is to allow for more accurate counts of the total number of sites with NONSENSE/MISSENSE/SILENT mutations. See below for a description of the functional classes SnpEff associates with the various effects.
Effects that are within a non-coding region are always considered lower-impact than effects that are within a coding region.
Example Usage:
java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-A SnpEff \
--variant 1000G.exomes.vcf \ (file to annotate)
--snpEffFile snpEff_output.vcf \ (SnpEff VCF output file generated by running SnpEff on the file to annotate)
-L 1000G.exomes.vcf \
-o out.vcf
VariantAnnotator adds some or all of the following INFO field annotations to each variant record:
SNPEFF_EFFECT - The highest-impact effect resulting from the current variant (or one of the highest-impact effects, if there is a tie)SNPEFF_IMPACT - Impact of the highest-impact effect resulting from the current variant (HIGH, MODERATE, LOW, or MODIFIER)SNPEFF_FUNCTIONAL_CLASS - Functional class of the highest-impact effect resulting from the current variant (NONE, SILENT, MISSENSE, or NONSENSE)SNPEFF_CODON_CHANGE - Old/New codon for the highest-impact effect resulting from the current variantSNPEFF_AMINO_ACID_CHANGE - Old/New amino acid for the highest-impact effect resulting from the current variantSNPEFF_GENE_NAME - Gene name for the highest-impact effect resulting from the current variantSNPEFF_GENE_BIOTYPE - Gene biotype for the highest-impact effect resulting from the current variantSNPEFF_TRANSCRIPT_ID - Transcript ID for the highest-impact effect resulting from the current variantSNPEFF_EXON_ID - Exon ID for the highest-impact effect resulting from the current variantExample VCF records annotated using SnpEff and VariantAnnotator:
1 874779 . C T 279.94 . AC=1;AF=0.0032;AN=310;BaseQRankSum=-1.800;DP=3371;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.4493;InbreedingCoeff=-0.0045;
MQ=54.49;MQ0=10;MQRankSum=0.982;QD=13.33;ReadPosRankSum=-0.060;SB=-120.09;SNPEFF_AMINO_ACID_CHANGE=G215;SNPEFF_CODON_CHANGE=ggC/ggT;
SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_874655_874840;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;
SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENST00000342066
1 874816 . C CT 2527.52 . AC=15;AF=0.0484;AN=310;BaseQRankSum=-11.876;DP=4718;FS=48.575;HRun=1;HaplotypeScore=91.9147;InbreedingCoeff=-0.0520;
MQ=53.37;MQ0=6;MQRankSum=-1.388;QD=5.92;ReadPosRankSum=-1.932;SB=-741.06;SNPEFF_EFFECT=FRAME_SHIFT;SNPEFF_EXON_ID=exon_1_874655_874840;
SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENST00000342066
VariantAnnotator also has the ability to take the EFF field from the SnpEff VCF output file containing all the effects aggregated together and copy it verbatim into the VCF to annotate.
Here's an example of how to do this:
java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-E resource.EFF \
--variant 1000G.exomes.vcf \ (file to annotate)
--resource snpEff_output.vcf \ (SnpEff VCF output file generated by running SnpEff on the file to annotate)
-L 1000G.exomes.vcf \
-o out.vcf
Of course, in this case you can also use the VCF output by SnpEff directly, but if you are using VariantAnnotator for other purposes anyway the above might be useful.
Below are the possible genomic effects recognized by SnpEff, grouped by biological impact. Full descriptions of each effect are available on this page.
SnpEff assigns a functional class to certain effects, in addition to an impact:
NONSENSE: assigned to point mutations that result in the creation of a new stop codonMISSENSE: assigned to point mutations that result in an amino acid change, but not a new stop codonSILENT: assigned to point mutations that result in a codon change, but not an amino acid change or new stop codonNONE: assigned to all effects that don't fall into any of the above categories (including all events larger than a point mutation)The GATK prioritizes effects with functional classes over effects of equal impact that lack a functional class when selecting the most significant effect in VariantAnnotator. This is to enable accurate counts of NONSENSE/MISSENSE/SILENT sites.
2 SNPs with significant strand bias

Several SNPs with excessive coverage

For a complete, detailed argument reference, refer to the GATK document page here.
In addition to true variation, variant callers emit a number of false-positives. Some of these false-positives can be detected and rejected by various statistical tests. VariantAnnotator provides a way of annotating variant calls as preparation for executing these tests.
Description of the haplotype score annotation

The list below is not comprehensive. Please use the --list argument to get a list of all possible annotations available. Also, see the FAQ article on understanding the Unified Genotyper's VCF files for a description of some of the more standard annotations.
Note that technically the VariantAnnotator does not require reads (from a BAM file) to run; if no reads are provided, only those Annotations which don't use reads (e.g. Chromosome Counts) will be added. But most Annotations do require reads. When running the tool we recommend that you add the -L argument with the variant rod to your command line for efficiency and speed.
Hi,
Could you tell me when we can use new version of SnpEff with GATK?
I have some bugs :
.annotator.interfaces.ActiveRegionBasedAnnotation.
caused by exception org.broadinstitute.sting.gatk.walkers.annotator.interfaces.ExperimentalAnnotation.
I don't know if I forget some other options linked these annotations. These options are important for me. So I deleted them but if somebody want to use them ...
Regards,
Tiphaine
Hello,
I am trying to filter some of my high-coverage samples based on a minimum depth and have found that the value stored in the DP INFO field and the AD genotype tag changes depending on whether or not I have run VariantAnnotator. The call I have used for VariantAnnotator is:
java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R ucsc.hg19.fasta -I example.bam --variant example.raw.vcf --out example.annotated.vcf -G StandardAnnotation -L example.raw.vcf -rf BadCigar -dcov 15000
Here are the differences for some test cases with HaplotypeCaller:
No MarkDuplicates, did IndelRealigner & BQSR, nightly build 12/04/2013
Annotated: DP=2745, AD=4,2729 Raw: DP=957, AD=1,907
MarkDuplicates, IndelRealigner and BQSR, nightly build 12/04/2013
Annotated: DP=20, AD=0,20 Raw: DP=10, AD=0,8
Raw BAM, nightly build 12/04/2013
Annotated: DP=2745,AD=4,2729 Raw: DP=868, AD=1,864
Raw BAM, version 2.4-9
Annotated: DP=2745, AD=4,2729 Raw: DP=616, AD=1,611
I suspect what is happening here is that VariantAnnotator is taking the depth information from the provided BAM and replacing the depth information reported by the variant caller. Anyway, just wondering- which value is a better reflection of the depth used to make a given variant call? (i.e. which could I use in hard filtering?)
Thanks for your help!
I was trying to annotate a vcf file already run on snpEff on VariantAnnotator. Used snpEff v2.05 and GRCh37.64 to annotate the vcf as mentioned in the document. Was using this document for it http://gatkforums.broadinstitute.org/discussion/50/adding-genomic-annotations-using-snpeff-and-variantannotator
When trying to run it through VariantAnnotator, it flags the error "Invalid command line: Failed to load reference dictionary " along with few warnings Warnings : GenomeAnalysisTK-2.3-9-ge5ebf34/human_g1k_v37.dict: No locks available.
Can somebody help me on this
Thanks
I ran the HaplotypeCaller, VariantAnnotator, and Variant Validatoor on chr3 locations from a human tumor sample.
The HaplotypeCaller command line is:
gatk="/usr/local/gatk/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar"
#Fasta from the gz in the resource bundle
indx="/home/ref/ucsc.hg19.fasta"
dbsnp="/fdb/GATK_resource_bundle/hg19-1.5/dbsnp_135.hg19.vcf"
java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T HaplotypeCaller \
-I chrom_bams/286T.chr3.bam \
-o hapc_vcfs/286T.chr3.raw.vcf
The VariantAnnotator command line is:
java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T VariantAnnotator \
--dbsnp $dbsnp --alwaysAppendDbsnpId \
-A BaseQualityRankSumTest -A DepthOfCoverage \
-A FisherStrand -A HaplotypeScore -A InbreedingCoeff \
-A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth \
-A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions \
-A TandemRepeatAnnotator \
--variant:vcf hapc_vcfs/286T.chr3.raw.vcf \
--out varanno_vcfs/286T.chr3.va.vcf
This all works nicely, but I go back and use ValidateVariants just to be sure:
java -Xms1g -Xmx2g -jar $gatk -R ${indx} -T ValidateVariants \
--dbsnp ${dbsnp} \
--variant:vcf varanno_vcfs/286T.chr3.va.vcf \
1> report/ValidateVariants/286T.chr3.va.valid.out \
2> report/ValidateVariants/286T.chr3.va.valid.err &
An issue arises with a rsID that is flagged as not being present in dbSNP.
...fails strict validation: the rsID rs67850374 for the record at position chr3:123022685 is not in dbSNP
I realize this is an error message that generally would not generally qualify as an issue to post to these forums, however it is an error that seems to be generated by the Haplotype caller, illuminated by VariantAnnotator, and caught by the ValidateVariants.
The first 7 fields of the offending line in the 286T.chr3.va.vcf can be found using: cat 286T.chr3.va.vcf | grep rs67850374
chr3 123022685 rs67850374;rs72184829 AAAGAGAAGAGAAGAG A 1865.98 .
There is a corresponding entry in the dbsnp_135.hg19.vcf file: cat $dbsnp | grep rs67850374
chr3 123022685 rs67850374;rs72184829 AA A,AAAGAGAAGAG,AAAGAGAAGAGAAGAGAAGAG . PASS
My initial guess is that this is caused by a disagreement in the reference and variant fields between the two annotations. From what I can gather the call to the variantcontext function validateRSIDs() has a call to validateAlternateAlleles(). I assume this is what throws the error that is then caught and reported as "...fails strict validation..."
The UCSC genome browser for hg19 does show the specified position to be AA. It seems as thought the HaplotypeCaller simply used a different reference than dbsnp in this case.
The reference file supplied to HaplotypeCaller was the same as to VariantAnnotator and ValidateVariants. I did not supply the dbsnp argument to the HaplotypeCaller as I planned on doing all annotations after the initial variant calling, and the documentation states that the information is not utilized in the calculations. It seems as though this is a difference in between the reference assembly for dbSNP and the the reference supplied by the resource bundle.
My questions are:
As it stands, I am simply going to discard the offending lines manually. There are less than twenty in the entire exome sequencing of this particular tumor-normal sequencing. However, it seems like this issue will likely arise again. I will check the dbSNP VCF for places where the reference differs from the sequence in hg19. At least that should give me an estimate of the number of times this will arise and the locations to exclude from the variant calls.
-- Colin
Hi,
I just run HyplotypeCaller on a dataset. For the same dataset, I have run through Unified genotyper and then directly subjected the raw vcf from UG to VQSR step without the help of VariantAnnotator before and get through VQSR without any problem. However, when I try to subject the raw callset derived from HyplotypeCaller directly to VQSR step, the VQSR module complained about it and error message is below:
...
http://gatkforums.broadinstitute.org/discussion/49/using-variant-annotator
So after HyplotypeCaller, the derived vcf file needs to run though VariantAnnotator? Since Unified genotyper derived callset does not need the help of VariantAnnotator (all annotations needed for VQSR are included after UG), it seems not the case for HyplotypeCaller? I can run through VariantAnnotator for HyplotypeCaller derived vcf file, just want to make sure if my understanding is correct?
Thanks and best
Mike
Hi there, I'm running into an issue with my Queue pipeline, with variants called with HaplotypeCaller. Once I get the raw VCF file, I use VariantAnnotator before VQSR: however, no HaplotypeScore annotation is being produced, resulting in a failure of the VariantRecalibrator step where 'HaplotypeScore' was indicated as an annotation.
I tried to correct the issue by indicating to VariantAnnotator to use all annotations
class AnnotationArguments (t: Target) extends VariantAnnotator with UNIVERSAL_GATK_ARGS {
this.reference_sequence = qscript.referenceFile
// Set the memory limit to 7 gigabytes on each command.
this.memoryLimit = 7
this.input_file :+= qscript.bamFile
this.useAllAnnotations
this.D = qscript.dbSNP_b37
}
But I still can't get any output in the annotated VCF of that parameter. Here an example of a variant
AC=5;AF=0.078;AN=64;ActiveRegionSize=179;ClippingRankSum=-0.568;DB;DP=2025;EVENTLENGTH=0;FS=4.139;InbreedingCoeff=-0.0847;MLEAC=5;MLEAF=0.078;MQ=69.98;MQRankSum=-1.428;NVH=1;NumHapAssembly=15;NumHapEval=13;QD=17.20;QDE=17.20;ReadPosRankSum=-1.762;TYPE=SNP;extType=SNP
Any suggestions on what I might be doing wrong?
thanks very much for your help, Francesco
I had annotated raw indel file (given by UnifiedGenotyper), 1000G_omni2.5.b37.sites.vcf and hapmap_3.3.b37.sites.vcf with all possible annotations including QD (QualByDepth) using VariantAnnotator. However, i got an error when i tried to run VariantRecalibrator. It was complaing that QD has not been found in training variant. Is QD important annotation for indel filtering. Can it be ignored ?
P.S. - i did not use sample bam file while annotating training data set.
.
.
.
INFO 15:11:55,999 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK.vcf
INFO 15:12:21,650 TraversalEngine - chr1:128346793 1.98e+07 30.0 s 1.5 s 4.1% 12.1 m 11.6 m
INFO 15:12:51,650 TraversalEngine - chr9:130658800 5.26e+07 60.0 s 1.1 s 53.9% 111.2 s 51.2 s
INFO 15:13:13,618 VariantDataManager - QD: mean = NaN standard deviation = NaN
INFO 15:13:16,417 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.1-13-g1706365):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator
##### ERROR ------------------------------------------------------------------------------------------
I would like to have certain annotations applied only if the variants in the resource and the target have the same genomic change. Ex: In Cosmic the following variants are present.
12 25398284 . C G . PASS Genename=KRAS;MutationAA=p.G12A;MutationCDS=c.35G>C
12 25398284 . C A . PASS Genename=KRAS;MutationAA=p.G12V;MutationCDS=c.35G>T
12 25398284 . C T . PASS Genename=KRAS;MutationAA=p.G12D;MutationCDS=c.35G>A
However, in my VCF file I get the G12A annotation for all variants at this site even if the actual change is G12V or G12D.
Is this possible?
I am trying to understand how Variant Annotator functions. I took the vcf file from the output of UnifiedGenotyper and ran Variant Annotator with the same .bam and .bed files I used for running UnifiedGenotyper. I expected that all the calculations in the INFO field will remain the same, since I am using the same input files. However, I find that some fields have different values. Here is an example: UnifiedGenotyper output:
chr22 30094366 . G A 172.01 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.182;DP=244;DS;Dels=0.00;FS=0.000;HaplotypeScore=118.5897;MLEAC=1;MLEAF=0.500;MQ=43.29;MQ0=0;MQRankSum=-0.049;QD=0.70;ReadPosRankSum=1.428;SB=-6.201e+01 GT:AD:DP:GQ:PL 0/1:220,19:244:99:202,0,2693
VariantAnnotator output:
chr22 30094366 . G A 172.01 . ABHet=0.923;AC=1;AF=0.500;AN=2;BaseQRankSum=3.182;DP=993;DS;Dels=0.00;FS=0.000;HaplotypeScore=454.8386;MLEAC=1;MLEAF=0.500;MQ=43.29;MQ0=0;MQ0Fraction=0.0000;MQRankSum=-0.378;OND=0.034;QD=0.17;ReadPosRankSum=-4.859;SB=-6.201e+01 GT:AD:DP:GQ:PL 0/1:220,19:244:99:202,0,2693
I am running GATKLite 2.1. Notice the DP in the info field has a different value. HaplotypeScore, QD, MQRankSum, etc have different values as well. Am I doing anything wrong? Shouldn't these values be the same when I recalculate these fields using VariantAnnotator?
Hi,
I'm attempting to use Variant Annotator to annotate some VCFs produced by samtools so I can run VQSR on them. Unfortunately I've gottent stuck and I'm trying to figure out why Variant Annotator wouldn't be annotating INDELs with MappingQualityRankSumTest and ReadPosRankSumTest, it seems to annotate SNPs fine. There are both Homs and het's called on the sample. Could it be I need to left align the indels to get enough coverage? What would you suggest is the best way to debug this? Is there a way to make GATK behave more verbosely about why it's refusing an annotation?
Thanks Martin
I have a set of VCFs with identical positions in them:
VCF1: 1 10097 . T . 26 . AN=196;DP=1622;MQ=20.06;MQ0=456 GT:DP
VCF2: 1 10097 . T . 21.34 . AN=198;DP=2338;MQ=19.53;MQ0=633 GT:DP
VCF3: 1 10097 . T . 11.70 . AN=240;DP=3957;MQ=19.74;MQ0=1085 GT:DP
VCF4: 1 10097 . T . 15.56 . AN=134;DP=1348;MQ=18.22;MQ0=442 GT:DP
If I use all of them as input for VariantRecalibrator, which annotations will VariantRecalibrator use? Should I instead merge the VCFs with CombineVariants and run VariantAnnotator, before I run VariantRecalibrator?
I'm not sure if the forum is for asking technical questions only or you are allowed to ask for best practices as well. Feel free to delete my question, if it doesn't belong here. Thank you.
As featured in this forum question.
Two main things account for these kinds of differences, both linked to default behaviors of the tools:
In both cases, you can end up looking at different sets or numbers of reads, which causes some of the annotation values to be different. It's usually not a cause for alarm. Remember that many of these annotations should be interpreted relatively, not absolutely.