# Tagged with #variantannotator 2 documentation articles | 1 announcement | 45 forum discussions

Created 2015-08-18 19:35:06 | Updated 2016-04-13 12:02:58 | Tags: haplotypecaller variantannotator annotation mutect2

### The problem

You specified -A <some annotation> in a command line invoking one of the annotation-capable tools (HaplotypeCaller, MuTect2, UnifiedGenotyper and VariantAnnotator), but that annotation did not show up in your output VCF.

Keep in mind that all annotations that are necessary to run our Best Practices are annotated by default, so you should generally not need to request annotations unless you're doing something a bit special.

### Why this happens & solutions

There can be several reasons why this happens, depending on the tool, the annotation, and you data. These are the four we see most often; if you encounter another that is not listed here, let us know in the comments.

1. #### You requested an annotation that cannot be calculated by the tool

For example, you're running MuTect2 but requested an annotation that is specific to HaplotypeCaller. There should be an error message to that effect in the output log. It's not possible to override this; but if you believe the annotation should be available to the tool, let us know in the forum and we'll consider putting in a feature request.

2. #### You requested an annotation that can only be calculated if an optional input is provided

For example, you're running HaplotypeCaller and you want InbreedingCoefficient, but you didn't specify a pedigree file. There should be an error message to that effect in the output log. The solution is simply to provide the missing input file. Another example: you're running VariantAnnotator and you want to annotate Coverage, but you didn't specify a BAM file. The tool needs to see the read data in order to calculate the annotation, so again, you simply need to provide the BAM file.

3. #### You requested an annotation that has requirements which are not met by some or all sites

For example, you're looking at RankSumTest annotations, which require heterozygous sites in order to perform the necessary calculations, but you're running on haploid data so you don't have any het sites. There is no workaround; the annotation is not applicable to your data. Another example: you requested InbreedingCoefficient, but your population includes fewer than 10 founder samples, which are required for the annotation calculation. There is no workaround; the annotation is not applicable to your data.

4. #### You requested an annotation that is already applied by default by the tool you are running

For example, you requested Coverage from HaplotypeCaller, which already annotates this by default. There is currently a bug that causes some default annotations to be dropped from the list if specified on the command line. This will be addressed in an upcoming version. For now the workaround is to check what annotations are applied by default and NOT request them with -A.

Created 2012-09-19 18:45:35 | Updated 2013-08-23 22:00:11 | Tags: unifiedgenotyper variantannotator intermediate

As featured in this forum question.

Two main things account for these kinds of differences, both linked to default behaviors of the tools:

• The tools downsample to different depths of coverage

• The tools apply different read filters

In both cases, you can end up looking at different sets or numbers of reads, which causes some of the annotation values to be different. It's usually not a cause for alarm. Remember that many of these annotations should be interpreted relatively, not absolutely.

Created 2014-07-15 03:54:06 | Updated 2014-10-23 17:58:36 | Tags: variantrecalibrator haplotypecaller selectvariants variantannotator release-notes catvariants genotypegvcfs

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.

## Haplotype Caller

• Various improvements were made to the assembly engine and likelihood calculation, which leads to more accurate genotype likelihoods (and hence better genotypes).
• Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
• The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
• Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
• Fixed small bug where indel discovery was inconsistent based on the active-region size.
• Removed scary warning messages for "VectorPairHMM".
• Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
• When we subset PLs because alleles are removed during genotyping we now also subset the AD.
• Fixed bug where reference sample depth was dropped in the DP annotation.

## Variant Recalibrator

• The -mode argument is now required.
• The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

## AnalyzeCovariates

• The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

## Variant Annotator

• SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

## Genotype GVCFs

• Added missing arguments so that now it models more closely what's available in the Haplotype Caller.
• Fixed recurring error about missing PLs.
• No longer pulls the headers from all input rods including dbSNP, rather just from the input variants.
• --includeNonVariantSites should now be working.

## Select Variants

• The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

## Indel Realigner

• Now throws a user error when it encounters reads with I operators greater than the number of read bases.
• Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.

## CalculateGenotypePosteriors

• Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
• Now computes genotype posteriors using likelihoods from all members of the trio.
• Added annotations for calling potential de novo mutations.
• Now uses PP tag instead of GP tag because posteriors are Phred-scaled.

## Cat Variants

• Can now process .list files with -V.
• Can now handle BCF and Block-Compressed VCF files.

## Validate Variants

• Now works with gVCF files.
• By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.

## FastaAlternateReferenceMaker

• Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.

## Miscellaneous

• Refactored maven directories and java packages replacing "sting" with "gatk".
• Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
• Added a new read transformer that refactors NDN cigar elements to one N element.
• Now a Tabix index is created for block-compressed output formats.
• Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
• Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
• We now check that output files are specified in a writeable location.
• We now allow blank lines in a (non-BAM) list file.
• Added legibility improvements to the Progress Meter.
• Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
• Made IntervalSharder respect the IntervalMergingRule specified on the command line.
• Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.

Created 2016-05-09 17:28:33 | Updated | Tags: variantannotator variantfiltration bug

Hi,

I am planning to annotate VCF files with external frequency information such as 1KGP. So I used VariantAnnotator walker to do that and got a VCF file with OneKGP.AF in the INFO tag which is good. However, when I try to VariantFiltration this new VCF file, GATK actually filter/flag my variants using frequency information from "AF" instead of "OneKGP.AF" due to the dot in the name of "OneKGP.AF".

-- It is a great function for users to be able to annotate variant with external information and filter variants. I wonder if there is a way to get around of the problem I encounter here. Thanks

The commandline i used to annotate the original VCF file is

${PathToJava}/java -jar${PatToGATK}/GenomeAnalysisTK.jar \ -R $Ref \ -T VariantAnnotator \ -V CONTROLS_PLUS_CGC_PedTest4.VQSR.ANNOTATED.VARIANT_ONLY.PASS.vcf \ -o CGC_PedTest4.VQSR.PASS.AnnoFreq.vcf \ --resource:OneKGP ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf \ -E OneKGP.AF The commandline i used to Filter/Flag variants ## Filter VCF file based on 1KGP.AF information:${PathToJava}/java -jar ${PatToGATK}/GenomeAnalysisTK.jar \ -R$Ref \ -T VariantFiltration \ -o CGC_PedTest4.VQSR.PASS.AnnoFreqAll.Filter.vcf \ --variant CGC_PedTest4.VQSR.PASS.AnnoFreqAll.vcf \ --filterExpression "OneKGP.AF>0.1" \ --filterName "OneKGP.AF_Pct10"

As the issue coming from the dot in the name of "OneKPG.AF", I tried to replace " -E OneKGP.AF" with "-E OneKGP_AF" in the annotation step and apparently GATK doesn't allow me to do that as shown below.

##### ERROR ------------------------------------------------------------------------------------------

Created 2016-03-14 16:28:57 | Updated | Tags: variantannotator non-model-organisms-snpeff

Hi All,

I am working on a non-model organism and have finished the variant calling process using GATK. I have a VCF file containing variants and my next task is to see the effect of those mutations on the genome. I have been looking at several threads for the same and could not quite grasp what was being written. How should I proceed? @Geraldine_VdAuwera @Sheila

Created 2016-03-07 22:53:40 | Updated 2016-03-07 23:01:14 | Tags: homopolymerrun variantannotator

Hi,

I have one deletion that causes VariantAnnotator to throw an exception when trying to add the HomopolymerRun annotation. Here is the VCF from HaplotypeCaller that I am using:

##fileformat=VCFv4.1
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample1 sample2 sample3
chr18   44383010    .   AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAGCC A   609.61  .   AC=2;AF=0.333;AN=6;BaseQRankSum=-1.949;DP=55;FS=3.699;MLEAC=2;MLEAF=0.333;MQ=58.29;MQ0=0;MQRankSum=-2.245;QD=14.87;ReadPosRankSum=-0.492;SOR=0.180  GT:AD:GQ:PL 0/1:3,12:89:493,0,89    0/0:4,0:11:0,11,175 0/1:4,4:99:157,0,155

And here is the Error:

/usr/local/biotools/java/jdk1.7.0_67/bin/java -Xmx4g -Xms2g -Djava.io.tmpdir=./ -jar /data5/bsi/bictools/alignment/gatk/3.5/GenomeAnalysisTK.jar -T VariantAnnotator -R /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa -V problem.vcf -o variantshrun.vcf -A HomopolymerRun -L region.bed
INFO  16:41:57,116 HelpFormatter - --------------------------------------------------------------------------------
INFO  16:41:57,118 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO  16:41:57,118 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  16:41:57,122 HelpFormatter - Program Args: -T VariantAnnotator -R /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa -V problem.vcf -o variantshrun.vcf -A HomopolymerRun -L region.bed
INFO  16:41:57,128 HelpFormatter - Executing as m088341@franklin09.mayo.edu on Linux 2.6.32-573.18.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_67-b01.
INFO  16:41:57,128 HelpFormatter - Date/Time: 2016/03/07 16:41:57
INFO  16:41:57,129 HelpFormatter - --------------------------------------------------------------------------------
INFO  16:41:57,129 HelpFormatter - --------------------------------------------------------------------------------
INFO  16:41:57,590 GenomeAnalysisEngine - Strictness is SILENT
INFO  16:41:57,691 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
WARN  16:41:57,735 RMDTrackBuilder - Index file problem.vcf.idx is out of date (index older than input file), deleting and updating the index file
INFO  16:41:57,903 RMDTrackBuilder - Writing Tribble index to disk for file problem.vcf.idx
INFO  16:41:57,922 IntervalUtils - Processing 5000 bp from intervals
INFO  16:41:57,982 GenomeAnalysisEngine - Preparing for traversal
INFO  16:41:57,982 GenomeAnalysisEngine - Done preparing for traversal
INFO  16:41:57,983 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  16:41:57,983 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  16:41:57,983 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  16:41:59,062 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.ArrayIndexOutOfBoundsException: 101
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR
##### ERROR MESSAGE: 101
##### ERROR ------------------------------------------------------------------------------------------

Has anyone encountered this error before? Or maybe this is a bug?

Thanks!

Created 2016-03-03 18:33:18 | Updated 2016-03-04 00:32:49 | Tags: variantannotator

Hi, I generated vcf through samtools and bcftools with parsing some attributes. Input vcf is below:

##fileformat=VCFv4.1
##samtoolsVersion=0.1.19-44428cd
##reference=file://./resources/allchr.hg19.fa
##contig=<ID=chr1,length=249250621,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=1b22b98cdeb4a9304cb5d48026a85128>
##contig=<ID=chr2,length=243199373,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=a0d9851da00400dec1098a9255ac712e>
##contig=<ID=chr3,length=198022430,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=641e4338fa8d52a5b781bd2a2c08d3c3>
##contig=<ID=chr5,length=180915260,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=0740173db9ffd264d728f32784845cd7>
##contig=<ID=chr6,length=171115067,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=1d3a93a248d92a729ee764823acbbc6b>
##contig=<ID=chr8,length=146364022,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=96f514a9929e410c6651697bded59aec>
##contig=<ID=chr9,length=141213431,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=3e273117f15e0a400f01055d9f393768>
##contig=<ID=chr10,length=135534747,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=988c28e000e84c26d552359af1ea2e1d>
##contig=<ID=chr11,length=135006516,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=98c59049a2df285c76ffb1c6db8f8b96>
##contig=<ID=chr13,length=115169878,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=283f8d7892baa81b510a015719ca7b0b>
##contig=<ID=chr14,length=107349540,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=98f3cae32b2a2e9524bc19813927542e>
##contig=<ID=chr15,length=102531392,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=e5645a794a8238215b2cd77acb95a078>
##contig=<ID=chr16,length=90354753,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=fc9b1a7b42b97a864f56b348b06095e6>
##contig=<ID=chr18,length=78077248,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=b15d4b2d29dde9d3e4f93d1d0f2cbc9c>
##contig=<ID=chr19,length=59128983,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=1aacd71f30db8e561810913e0b72636d>
##contig=<ID=chr20,length=63025520,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=0dec9660ec1efaaf33281c0d5ea2560f>
##contig=<ID=chr22,length=51304566,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=a718acaa6135fdca8357d5bfe94211dd>
##contig=<ID=chrX,length=155270560,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=7e0e2e580297b7764e31dbc80c2540dd>
##contig=<ID=chrY,length=59373566,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=1e86411d73e6f00a10590f976be01623>
##contig=<ID=chrM,length=16571,URL=file:/home/chihhsu/bin/eSNV-Detect_1.0/resources/./allchr.hg19.fa,md5=d2ed829b8a1628d16cbeee88e88e39eb>
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=IS,Number=2,Type=Float,Description="Maximum number of reads supporting an indel and fraction of indel reads">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=MDV,Number=1,Type=Integer,Description="Maximum number of high-quality nonRef reads in samples">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. Note: this version may be broken.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality non-reference bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Reference Allele Counts">
##FORMAT=<ID=MA,Number=1,Type=Integer,Description="Multiple Alternative Alleles">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA07347
==

I ran VariantAnnotator of GATK 3.5 with the command and got errors below.
==
/usr/bin/java -Xmx10g -Xms5g -Djava.io.tmpdir=.//temp -jar /home/chihhsu/bin/eSNV-Detect_1.0/bin/gatk/GenomeAnalysisTK.jar -R /home/chihhsu/bin/eSNV-Detect_1.0/resources/allchr.hg19.fa -T VariantAnnotator -I output_test_GATK3/bwa/variant/NA07347.bam -V ./NA07347.vcf --dbsnp /home/chihhsu/bin/eSNV-Detect_1.0/resources/dbsnp.vcf.gz -L ./NA07347.vcf --out ./NA07347.vcf.temp -A QualByDepth -A ReadPosRankSumTest
INFO  11:56:24,489 HelpFormatter - --------------------------------------------------------------------------------
INFO  11:56:24,491 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO  11:56:24,492 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  11:56:24,496 HelpFormatter - Program Args: -R /home/chihhsu/bin/eSNV-Detect_1.0/resources/allchr.hg19.fa -T VariantAnnotator -I output_test_GATK3/bwa/variant/NA07347.bam -V ./NA07347.vcf --dbsnp /home/chihhsu/bin/eSNV-Detect_1.0/resources/dbsnp.vcf.gz -L ./NA07347.vcf --out ./NA07347.vcf.temp -A QualByDepth -A ReadPosRankSumTest
INFO  11:56:24,505 HelpFormatter - Executing as chihhsu@torta.bcmt.bcm.edu on Linux 3.10.0-327.10.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_71-b15.
INFO  11:56:24,506 HelpFormatter - Date/Time: 2016/03/03 11:56:24
INFO  11:56:24,506 HelpFormatter - --------------------------------------------------------------------------------
INFO  11:56:24,506 HelpFormatter - --------------------------------------------------------------------------------
INFO  11:56:24,582 GenomeAnalysisEngine - Strictness is SILENT
INFO  11:56:24,691 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO  11:56:24,699 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 11:56:24,723 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO  11:56:25,007 IntervalUtils - Processing 1497 bp from intervals
WARN  11:56:25,010 IndexDictionaryUtils - Track dbsnp doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  11:56:25,079 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  11:56:25,113 GenomeAnalysisEngine - Done preparing for traversal
INFO  11:56:25,113 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  11:56:25,114 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  11:56:25,114 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  11:56:26,072 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.valueOf(Integer.java:766)
at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:717)
at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:128) at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158) at htsjdk.variant.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:148) at htsjdk.variant.variantcontext.GenotypesContext.iterator(GenotypesContext.java:465) at htsjdk.variant.variantcontext.VariantContext.validateGenotypes(VariantContext.java:1357) at htsjdk.variant.variantcontext.VariantContext.validate(VariantContext.java:1299) at htsjdk.variant.variantcontext.VariantContext.<init>(VariantContext.java:410) at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:496) at htsjdk.variant.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:490) at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:213) at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:203) at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator.map(VariantAnnotator.java:357) at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator.map(VariantAnnotator.java:114) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4): ##### ERROR ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem. ##### ERROR If not, please post the error message, with stack trace, to the GATK forum. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: For input string: "DP=6;RD=1;AD=5;RO=0.833;MA=0" ##### ERROR ------------------------------------------------------------------------------------------ I also ran ValidateVariants and there is no error found.  java -jar ./bin/gatk/GenomeAnalysisTK.jar \ -T ValidateVariants \ -R ./resources/allchr.hg19.fa \ -V ./NA07347.vcf \ --validationTypeToExclude ALL Successfully validated the input file. Checked 1497 records with no failures. Any suggestions? Thank you. Created 2016-01-14 15:00:51 | Updated | Tags: allelebalance variantannotator coverage alleledepth Some calls I don't see allele depth (AD), for some I don't see depth (DP), and for some I see neither. Are there scenarios where GATK cannot annotate these? Created 2015-11-25 09:07:33 | Updated | Tags: combinevariants fisherstrand qualbydepth variantannotator I am doing variant calling of multiple RNAseq datasets using GATK/3.4.46. For limitation of computational resources, I ran HaplotypeCaller on each dataset separately. Then I ran CombineVaraints to merge all output VCF files using this command java -Xmx10g -jar GenomeAnalysisTK.jar \ -T CombineVariants \ -R$gatk_ref \ --variant set1.vcf \ --variant set2.vcf \ --variant set3.vcf \ -o combine_output.vcf \ -genotypeMergeOptions UNIQUIFY

Then I tried to run VariantFiltration using thic command

java -Xmx2g -jar $GATK/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R$gatk_ref \ -V combine_output.vcf \ -window 35 -cluster 3 \ -filterName FS -filter "FS > 30.0" \ -filterName QD -filter "QD < 2.0" \ -o $output Several thousands variants though warning for absence of FS and QD. According to @Sheila advise in http://gatkforums.broadinstitute.org/discussion/2334/undefined-variable-variantfiltration, I ran VariantAnnotator to add these annotations using this command java -Xmx45g -jar GenomeAnalysisTK.jar \ -R$gatk_ref \ -T VariantAnnotator \ -I input1.bam \ -I input2.bam \ . . -I input57.bam \ -V combine_output.vcf \ -A Coverage \ -A FisherStrand \ -A QualByDepth \ -nt 7 \ -o combine_output_ann.vcf

Then I repeated the VariantFiltration but I have 2 problems: 1) about 2000 variants are still not annotated for FS. All of them are indels and many of them are not homozygous for the ALT allele). Also ~ 40 variants are still not annotated for QD. All of them have multiple ALT alleles 2) The combined VCF record has the QUAL of the first VCF record with a non-MISSING QUAL value. According to my manual calculations, I think VariantAnnotator calculates the QD value by dividing this QUAL value by the AD of samples with a non hom-ref genotype call. This cause many variants to fail the QD filter.

Thank you

Created 2015-11-06 11:07:22 | Updated | Tags: selectvariants variantannotator variantfiltration dbsnp

Hi team,

Prior to submission to NCBI dbSNP a vcf generated by e.g HaplotypeCaller requires several modifications:

1. Addition of in-house identifiers. .................................................... done
2. Exclude if alternate allele is "*" i.e. they are in a deletion. .................................................... I'm assuming this can be done with SelectVariants or FilterVariants.
3. Exclude if ref or alt allele is greater than 50bp .................................................... Perhaps with SelectVariants or FilterVariants --maxIndelSize 50
4. Exclude if ref and alt alleles do not have a common leading base. .................................................... Not sure ... removing larger indels won't exclude all of these.
5. Add VRT (variant type) to Info field .....................................................e.g VRT=1 (for an SNV), VRT=2 for an indel etc. SNPeff doesn't seem to work for this but I could be wrong.

Knowing how to effectively format vcfs between GATK output and NCBI input might be quite useful for many people, and save rather a lot of time.

It would be really useful if the three exclusion criteria could be done using GATK. Is this possible and using what commands ?

I feel as though need to use the GATK variantAnnotator command as well. I'm looking into all of this today, and will post if I get any solutions.

Sincerely,

William Gilks

Created 2015-10-30 16:22:16 | Updated | Tags: variantannotator bug

Hi, I am having a problem with VariantAnnotator.

I am running the command: java1.7 -jar /usr/local/packages/GATK3/GenomeAnalysisTK.jar \ -R /projects4/ruth/Burkholderia/cutadapt/cenocepacia/B_cenocepacia_J2315.fasta \ -T VariantAnnotator \ -I N501.C8967_R1.fastq_to_B_cenocepacia_J2315.sorted.RG.bam \ -o output2.vcf \ -V N501.C8967_R1.fastq_to_B_cenocepacia_J2315.sorted_GATK.vcf \ -A AlleleBalance \ -A BaseCounts \ -A Coverage \ -A FisherStrand \ -A GenotypeSummaries \ -A LowMQ \ -A RMSMappingQuality \ -A AlleleBalanceBySample

Where the .vcf file was made using GATK HaplotypeCaller.

The error I get is:

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException: 0 at org.broadinstitute.gatk.tools.walkers.annotator.AlleleBalanceBySample.annotateWithPileup(AlleleBalanceBySample.java:127) at org.broadinstitute.gatk.tools.walkers.annotator.AlleleBalanceBySample.annotate(AlleleBalanceBySample.java:113) at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateGenotypes(VariantAnnotatorEngine.java:420) at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:216) at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:192) at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator.map(VariantAnnotator.java:312) at org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotator.map(VariantAnnotator.java:85) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

##### ERROR ------------------------------------------------------------------------------------------

Interestingly the error does not occur when I run exactly the same command with a vcf file created using samtools (and the sam bam file).

I have installed the most up-to-date version of GATK (3.4-46) in case that was causing the error, but that does not seem to be the answer.

Any suggestions about what is causing this error would be greatly appreciated.

Thanks,

Ruth

Created 2015-10-04 03:36:09 | Updated | Tags: variantannotator

Hi I am going through the GATK Best Practices pipeline for the first time. I have non-human data and am at the variant discovery step in the pipeline. I don't have well-vetted 'known' variant sites but have pieced together a vcf with high-likelihood snps that are in common among several data sets in our lab over the past year. The vcf has only four basic annotations and it looks like the software that produced it doesn't provide for generating others so I am trying to add a few with VariantAnnotator. My command is java -jar GenomeAnalysisTK.jar \ -R Oncorhynchus_mykiss_chr.fa \ -T VariantAnnotator \ -I Om2013all-rg.bam \ -A StrandOddsRatio \ -A MappingQualityRankSumTest \ -A ReadPosRankSumTest \ -o Om2013bElp.vcf \ -V Om2013bEl.vcf When I run it I get an error:

##### ERROR ------------------------------------------------------------------------------------------

The original command was:

$java -Xmx$jmem -jar $gatk_dir/GenomeAnalysisTK.jar -T VariantAnnotator -l DEBUG \ -R$genome \
-A SnpEff \
--variant $resultsDir/$hq_vcf \
--snpEffFile $resultsDir/${hq_vcf/.vcf/.snpEff.vcf} \
-L $resultsDir/${hq_vcf/.vcf/.snpEff.vcf} \
-o $resultsDir/${hq_vcf/.vcf/.variantCalls.snpEff.va.vcf} \
-rf BadCigar

Where I added already the -rf BadCigar argument with no effect. I rewrote the indexes for the snpEff vcf files used as input, with igvtools - no effect. $jmem was set to '11G'. I can open all vcf files with an text editor - so it seems that there is no file corruption. Is there a different way to combine reads other than with bedtools? Your help is very appreciated, Philipp Created 2014-12-11 15:57:03 | Updated | Tags: haplotypecaller variantannotator Hello I am using GATK 3.3-0 HaplotypeCaller for variant calling. When I run HaplotypeCaller with the command java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R all_chr_reordered.fa -I 30_S30.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 50 -o 30_S30_control1.vcf -L brca.intervals I get all the variants I want, however, I also want to get the number of forward and reverse reads that support REF and ALT alleles. Therefore I use StrandBiasBySample annotation when running HaplotypeCaller with the command: java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R all_chr_reordered.fa -I 30_S30.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 50 -o 30_S30_control2.vcf -L brca.intervals -A StrandBiasBySample The SB field is added, but a variant that was in 30_S30_control1.vcf is absent in 30_S30_control2.vcf. All the remaining variants are there. The only difference between two variant calls was adding -A StrandBiasBySample. What I'm wondering about is that why this one variant is absent. the missing variant: 17 41276152 . CAT C 615.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.147;ClippingRankSum=0.564;DP=639;FS=15.426;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-1.054;QD=0.96;ReadPosRankSum=0.698;SOR=2.181 GT:AD:DP:GQ:PL 0/1:565,70:635:99:653,0,18310 So I decided to run HaplotypeCaller without -A StrandBiasBySample and later add the annotations with VariantAnnotator. Here is the command: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R all_chr_reordered.fa -I 30_S30.bam --variant 30_S30_control1.vcf -L 30_S30_control1.vcf -o 30_S30_control1_SBBS.vcf -A StrandBiasBySample However, the output vcf file 30_S30_control1_SBBS.vcf was not different from the input variant file 30_S30_control1.vcf except for the header, SB field wasn't added. Why was the SB field not added? Is there any other way to get the number of forward and reverse reads? Please find 30_S30_control1.vcf, 30_S30_control2.vcf and 30_S30_control1_SBBS.vcf in attachment Thanks Created 2014-11-14 03:30:08 | Updated | Tags: variantannotator genotypegvcfs I have constructed a VCF using GenotypeGVCFs. Due to downsampling and filtering imposed by HaplotypeCaller/GenotypeGVCFs, four fields of interest within the VCF do not reflect the data that went into building the file. I would like to update a these fields to reflect all read data used in the pipeline, but I have been able to do so for just 3 of the 4. I have been able to update INFO MQ0, INFO DP, and sample-level (FORMAT) AD, but I have not been able to update sample-level (FORMAT) DP. I have run VariantAnnotator as follows. inVCFfn=../vcf/raw.vcf outVCFfn=../vcf/raw.annotated.vcf gatk=/srv/gs1/software/gatk/gatk-3.3.0/GenomeAnalysisTK.jar ref=/srv/gs1/projects/scg/Resources/GATK/b37/human_g1k_v37_decoy.fasta dbsnp=/srv/gs1/projects/scg/Resources/GATK/b37/dbsnp_137.b37.vcf java -Xmx10g \ -jar$gatk \
-T VariantAnnotator \
-R $ref \ -L$inVCFfn \
--variant $inVCFfn \ --dbsnp$dbsnp \
--out $outVCFfn \ --annotation Coverage \ --annotation DepthPerAlleleBySample \ --annotation MappingQualityZero \ [-I fileName.1.bam … -I fileName.810.bam] The documentation for --annotation Coverage indicates that it should update both the INFO DP and the sample-level (FORMAT) DP, but the sample-level update seems not to be occurring. For example: $ awk '($2==7157834) { print$9, "|", $53 }' raw.vcf GT:AD:DP:GQ:PL | 0:4,0:4:99:0,135$ awk '($2==7157834) { print$9, "|", $53 }' raw.annotated.vcf GT:AD:DP:GQ:PL | 0:131,0:4:99:0,135 (Aside: I'm working with a haploid chromosome, hence the single-value GT and two-value AD and PL fields). Although AD was updated from "4,0" to "131,0", the sample-level "DP" remained at "4". Of course I could approximate DP as the sum of AD values, but since AD is pre-filter and sample-level DP is post-filter, this wouldn't quite be correct. I should note that when I run HaplotypeCaller/GenotypeGVCFs on a subset of the data, the sample-level DP values are very close to the sum of AD's, so it's not a matter of most reads being filtered out, but rather that they have been downsampled. Am I missing something? Thanks! David P.S. The INFO field DP and MQ0 are indeed updated as expected: $ awk '($2==7157834) { print$8 }' raw.vcf
AC=1;AF=6.993e-03;AN=143;DP=609;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=0;NCC=0;QD=30.16;SOR=1.259

$awk '($2==7157834) { print $8 }' raw.annotated.vcf AC=1;AF=6.993e-03;AN=143;DP=10750;FS=0.000;GQ_MEAN=124.92;GQ_STDDEV=113.68;MLEAC=1;MLEAF=6.993e-03;MQ=60.00;MQ0=2;NCC=0;QD=30.16;SOR=1.259 Created 2014-10-01 19:58:24 | Updated | Tags: snpeff variantannotator Hello, does anyone know where to download snpEff version 2.0.5? I could only find recent version in my search, which are not compatible with gatk. Thanks, Ricky Created 2014-09-02 02:32:47 | Updated | Tags: snpeff variantannotator Hi, I first ran snpEff using the command: java -jar snpEff.jar eff -v -i vcf -o gatk phased.vcf > snpEff_output.vcf which gave me the snpEff_output.vcf file that looks something like this: Chr1 7050 . T A 22.76 LowQual AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=11.38 G T:AD:DP:GQ:PL 1/1:0,2:2:6:50,6,0 Chr1 7328 . A G 21.77 LowQual AC=2;AF=1.00;AN=2;DP=3;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=21.77;EFF= synonymous_variant(LOW|SILENT|tcA/tcG|p.Ser188Ser/c.564A>G|702|LOC_Os01g01010|protein_coding|CODING|LOC_Os01g01010.1|5|1),synonymous_va riant(LOW|SILENT|tcA/tcG|p.Ser188Ser/c.564A>G|616|LOC_Os01g01010|protein_coding|CODING|LOC_Os01g01010.2|5|1) GT:AD:DP:GQ:PL 1/1:0,3 :3:6:49,6,0 Chr1 9055 . T A 23.76 LowQual AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=23.76 G T:AD:DP:GQ:PL 1/1:0,2:2:6:51,6,0 Chr1 9233 . T C 45.28 LowQual AC=2;AF=1.00;AN=2;DP=3;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=22.64;EFF= synonymous_variant(LOW|SILENT|aaT/aaC|p.Asn539Asn/c.1617T>C|616|LOC_Os01g01010|protein_coding|CODING|LOC_Os01g01010.2|9|1),synonymous_v ariant(LOW|SILENT|aaT/aaC|p.Asn539Asn/c.1617T>C|702|LOC_Os01g01010|protein_coding|CODING|LOC_Os01g01010.1|9|1) GT:AD:DP:GQ:PL 1/1:0,3 :3:9:73,9,0 etc. Then I ran VariantAnnotator using the command: java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R genome.fasta -A SnpEff --variant phased.vcf --snpEffFile snpEff_output.vcf -o annotated.vcf but it gave me a warning for every effect added by snpEff that look like this (the annotated.vcf file was non-empty though): WARN 20:00:30,645 SnpEff - Skipping malformed SnpEff effect field at ChrSy:525610. Error was: "synonymous_variant is not a recognized effect type". Field was: "synonymous_variant(LOW|SILENT|caG/caA|p.Gln171Gln/c.513G>A|ChrSy.fgenesh.gene.77|protein_coding|CODING|ChrSy.fgenesh.mRNA.77|2|WARNING_TRANSCRIPT_INCOMPLETE)" WARN 20:00:30,649 SnpEff - Skipping malformed SnpEff effect field at ChrSy:537868. Error was: "missense_variant is not a recognized effect type". Field was: "missense_variant(LOW|MISSENSE|Gat/Aat|p.Asp298Asn/c.892G>A|ChrSy.fgenesh.gene.79|protein_coding|CODING|ChrSy.fgenesh.mRNA.79|5|WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS)" WARN 20:00:30,650 SnpEff - Skipping malformed SnpEff effect field at ChrSy:538754. Error was: "stop_gained is not a recognized effect type". Field was: "stop_gained(LOW|NONSENSE|Cag/Tag|p.Gln137*/c.409C>T|ChrSy.fgenesh.gene.79|protein_coding|CODING|ChrSy.fgenesh.mRNA.79|3|WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS)" WARN 20:00:30,650 SnpEff - Skipping malformed SnpEff effect field at ChrSy:538769. Error was: "missense_variant is not a recognized effect type". Field was: "missense_variant(LOW|MISSENSE|Gac/Aac|p.Asp132Asn/c.394G>A|ChrSy.fgenesh.gene.79|protein_coding|CODING|ChrSy.fgenesh.mRNA.79|3|WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS)" What am In doing wrong? Clearly, snpEff outputs the effects in terms/words that GATK does not accept. Please help. Best, nb Created 2014-06-10 20:57:36 | Updated | Tags: variantannotator best-practices vcf developer performance walkers java gatk gatk-walkers Hi, I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct? Created 2014-04-28 15:11:17 | Updated | Tags: allelebalance allelebalancebysample haplotypecaller variantannotator Version 3.1.1. Human normal samples. I couldnt find AlleleBalance and AlleleBalanceBySample tags in my vcf outputs. Tags are not found even for single variant I tried HaplotypeCaller with -all or directly with -A AlleleBalance or -A AlleleBalanceBySample. Also I tried Variantannotator with -all or -A AlleleBalance or -A AlleleBalanceBySample. Any help will be apreciated Created 2014-03-21 20:00:23 | Updated 2014-03-21 20:22:23 | Tags: variantannotator combinegvcfs haplotypcaller GATK Team, First of all, I'm very excited about v3.x GATK, you guys and gals on the GATK Team are doing awesome work! I have a question: How do I get CombineGVCFs to forward the sample-level annotations calculated by HC with the --emitRefConfidence GVCF option enabled? I tried passing the same annotation flags to CombineGVCFs that I passed to HC, but none of the annotations beyond the standard set were incorporated. Annotations such as StrandBiasBySample require access to the underlying reads and forwarding these annotations to the CombineGVCFs output would be less prohibitive than running VariantAnnotator on my 20 WGS datasets... Created 2014-02-10 19:57:41 | Updated 2014-02-10 20:24:46 | Tags: variantannotator variantstovcf variantfiltration vcf variant-calling minor-allele-frequency Hi, I have annotated my vcf file of 20 samples from Unified genotyper using the following steps. Unified genotyper->Variantrecalibration->Applyrecalibration->VariantAnnotator My question is how should I proceed if I have to select rare variants (MAF<1%) for the candidate genes that I have,for each of these 20 samples? Created 2013-09-13 04:59:08 | Updated | Tags: variantannotator Hi, I have a tiny, 5000-line VCF file that I want to add dbSNP annotations to. I'm surprised to see VariantAnnotator iterating along the millions of records in the dbSNP file, rather than the 5000 variants in the input VCF file. This will take 42mins on dbsnp 137... Am I misunderstanding how this tool works, or just using it wrongly? thanks, Mark java -Xmx2g -jar /share/ClusterShare/software/contrib/gi/gatk/2.5/dist/GenomeAnalysisTK.jar -T VariantAnnotator -R /share/ClusterShare/biodata/contrib/gi/gatk-resource-bundle/2.5/hg19/ucsc.hg19.fasta --variant PG0000864-BLD_PGx_cleaned.vcf --dbsnp /share/ClusterShare/biodata/contrib/gi/gatk-resource-bundle/2.5/hg19/dbsnp_137.hg19.vcf --out PG0000864-BLD_PGx,GATK.vcf --validation_strictness SILENT Created 2013-09-09 17:01:27 | Updated | Tags: variantannotator I am getting a strange error when running the VariantAnnonator - does anyone know what this is about? I am running it as java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R ucsc.hg19.fasta --variant result.vcf -comp:COSMIC CosmicMutants.vcf -resource CosmicMutants.vcf -E resource.ID -alwaysAppendDbsnpId -o annotated.vcf and the error is WARN 17:22:19,963 AbstractVCFCodec - Allele detected with length 1133370 exceeding max size 1048576 at approximately line 78827, likely resulting in degraded VCF processing performance WARN 17:22:20,037 AbstractVCFCodec - Allele detected with length 1133370 exceeding max size 1048576 at approximately line 78828, likely resulting in degraded VCF processing performance WARN 17:22:20,111 AbstractVCFCodec - Allele detected with length 1133370 exceeding max size 1048576 at approximately line 78827, likely resulting in degraded VCF processing performance .... My vcf file is not that long...but I have looked in CosmicMutants.vcf, and it seems normal to me. Created 2013-08-26 08:07:15 | Updated | Tags: snpeff variantannotator Hi, We use GATK with snpEff to annotate variants of our WES experiments. We use VariantAnnotator to keep only the effect with highest impact. We've realized that for some experiments it could be useful for us to customize priorities assignment (i.e. if a variant is categorized both as DOWNSTREAM for a gene and INTRON for the adjacent gene, the effect picked by default is DOWNSTREAM while we'd like to keep INTRON, etc.) Is there a way to modify VariantAnnotator default behaviour? Created 2013-07-23 19:44:06 | Updated | Tags: variantannotator vcf genotype annotation Hi, Could you tell me how to encourage GATK to annotate my genotype columns (i.e. add annotations to the FORMAT and PANC_R columns in the following file): #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PANC_R chrX 259221 . GA G 136.74 . AC=2;AF=1.00;AN=2;DP=15;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=8.82;MQ0=1;QD=3.04 GT:AD:GQ:PL 1/1:0,2:6:164,6,0 The file was generated with HaplotypeCaller. I used a command line similar to this one to no effect: java -jar$GATKROOT/GenomeAnalysisTK.jarT VariantAnnotator -R hg19_random.fa -I chr7_recalibrated.bam -V chr7.vcf --dbsnpdbSNP135_chr.vcf -A Coverage -A QualByDepth -A FisherStrand -A MappingQualityRankSumTest -A ReadPosRankSumTest -o chr7_annotated-again.vcf

Does anyone have any suggestions? Thanks in advance!

Created 2013-07-03 23:33:40 | Updated | Tags: variantannotator

It used to be possible to annotate a VCF files using multiple VCF track files. Basically the --comp command could be used multiple times. Now only the last track gets annotated. That means, if I use "-T VariantAnnotator --comp:TR1 tr1.vcf --comp:TR2 tr2.vcf, only the TR2 flag is used for annotation, and the TR1 flag is not. This is inconsistent and must be a recent regression.

It works fine with GenomeAnalysisTK-2.5-2 but it doesn't work anymore with GenomeAnalysisTK-2.6-2.

Created 2013-05-29 08:38:09 | Updated | Tags: variantannotator

Hi,

I am writing a pipeline to analyse exome sequencing data from families and would like some advice on using the VariantAnnotator walker. It seems I need to use VariantAnnotator prior to VariantRecalibrator as the latter relies on annotations generated by VariantAnnotator. Is this correct? Also, I would like to annotate my VCF using SNPeff and SNPsift. Can you suggest the best order in which to perform these steps. For example, my plan at the moment is to run steps in the following order: HaplotypeCaller, VariantAnnotator, VariantRecalibrator, ApplyRecalibration, PhaseByTransmission, ReadBackedPhasing, SNPeff, SNPsift.

Any help would be much appreciated,

Kath

Created 2013-05-28 09:31:57 | Updated | Tags: variantannotator vcf multi-sample

Hi,

I have a vcf containing multiple samples. I would like to put the bam files also as input for the Variant Annotator but how does the variant annotator know which bam is for wich column in the vcf? Does the order of the args of the bam files need to correspond to the order of the samples columns in the vcf?

Thanks,

Robin

Created 2013-05-06 15:19:46 | Updated | Tags: vqsr variantannotator

Hi,

I just run HyplotypeCaller on a dataset. For the same dataset, I have run through Unified genotyper and then directly subjected the raw vcf from UG to VQSR step without the help of VariantAnnotator before and get through VQSR without any problem. However, when I try to subject the raw callset derived from HyplotypeCaller directly to VQSR step, the VQSR module complained about it and error message is below:

...

#### ERROR MESSAGE: Bad input: Values for HaplotypeScore annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See

So after HyplotypeCaller, the derived vcf file needs to run though VariantAnnotator? Since Unified genotyper derived callset does not need the help of VariantAnnotator (all annotations needed for VQSR are included after UG), it seems not the case for HyplotypeCaller? I can run through VariantAnnotator for HyplotypeCaller derived vcf file, just want to make sure if my understanding is correct?

Thanks and best

Mike

Created 2013-04-30 12:45:05 | Updated 2013-04-30 12:45:25 | Tags: variantrecalibrator combinevariants variantannotator annotations

I have a set of VCFs with identical positions in them:

VCF1: 1 10097 . T . 26 . AN=196;DP=1622;MQ=20.06;MQ0=456 GT:DP

VCF2: 1 10097 . T . 21.34 . AN=198;DP=2338;MQ=19.53;MQ0=633 GT:DP

VCF3: 1 10097 . T . 11.70 . AN=240;DP=3957;MQ=19.74;MQ0=1085 GT:DP

VCF4: 1 10097 . T . 15.56 . AN=134;DP=1348;MQ=18.22;MQ0=442 GT:DP

If I use all of them as input for VariantRecalibrator, which annotations will VariantRecalibrator use? Should I instead merge the VCFs with CombineVariants and run VariantAnnotator, before I run VariantRecalibrator?

I'm not sure if the forum is for asking technical questions only or you are allowed to ask for best practices as well. Feel free to delete my question, if it doesn't belong here. Thank you.

Created 2013-04-25 20:34:16 | Updated | Tags: variantannotator vcf alleledepth

Hello,

I am trying to filter some of my high-coverage samples based on a minimum depth and have found that the value stored in the DP INFO field and the AD genotype tag changes depending on whether or not I have run VariantAnnotator. The call I have used for VariantAnnotator is:

java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R ucsc.hg19.fasta -I example.bam --variant example.raw.vcf --out example.annotated.vcf -G StandardAnnotation -L example.raw.vcf -rf BadCigar -dcov 15000

Here are the differences for some test cases with HaplotypeCaller:

No MarkDuplicates, did IndelRealigner & BQSR, nightly build 12/04/2013

MarkDuplicates, IndelRealigner and BQSR, nightly build 12/04/2013

Raw BAM, nightly build 12/04/2013

Raw BAM, version 2.4-9

I suspect what is happening here is that VariantAnnotator is taking the depth information from the provided BAM and replacing the depth information reported by the variant caller. Anyway, just wondering- which value is a better reflection of the depth used to make a given variant call? (i.e. which could I use in hard filtering?)

Created 2013-01-25 00:41:19 | Updated | Tags: variantannotator

I was trying to annotate a vcf file already run on snpEff on VariantAnnotator. Used snpEff v2.05 and GRCh37.64 to annotate the vcf as mentioned in the document. Was using this document for it http://gatkforums.broadinstitute.org/discussion/50/adding-genomic-annotations-using-snpeff-and-variantannotator

When trying to run it through VariantAnnotator, it flags the error "Invalid command line: Failed to load reference dictionary " along with few warnings Warnings : GenomeAnalysisTK-2.3-9-ge5ebf34/human_g1k_v37.dict: No locks available.

Can somebody help me on this

Thanks

Created 2012-12-16 22:27:37 | Updated 2012-12-16 22:28:40 | Tags: haplotypecaller validatevariants variantannotator fastareference vcf dbsnp hg19

I ran the HaplotypeCaller, VariantAnnotator, and Variant Validatoor on chr3 locations from a human tumor sample.

The HaplotypeCaller command line is:

gatk="/usr/local/gatk/GenomeAnalysisTK-2.2-8-gec077cd/GenomeAnalysisTK.jar"
#Fasta from the gz in the resource bundle
indx="/home/ref/ucsc.hg19.fasta"
dbsnp="/fdb/GATK_resource_bundle/hg19-1.5/dbsnp_135.hg19.vcf"

java -Xms1g -Xmx2g -jar $gatk -R${indx} -T HaplotypeCaller \
-I chrom_bams/286T.chr3.bam \
-o hapc_vcfs/286T.chr3.raw.vcf


The VariantAnnotator command line is:

java -Xms1g -Xmx2g -jar $gatk -R${indx} -T VariantAnnotator \
--dbsnp $dbsnp --alwaysAppendDbsnpId \ -A BaseQualityRankSumTest -A DepthOfCoverage \ -A FisherStrand -A HaplotypeScore -A InbreedingCoeff \ -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth \ -A RMSMappingQuality -A ReadPosRankSumTest -A SpanningDeletions \ -A TandemRepeatAnnotator \ --variant:vcf hapc_vcfs/286T.chr3.raw.vcf \ --out varanno_vcfs/286T.chr3.va.vcf  This all works nicely, but I go back and use ValidateVariants just to be sure: java -Xms1g -Xmx2g -jar$gatk -R ${indx} -T ValidateVariants \ --dbsnp${dbsnp} \
--variant:vcf varanno_vcfs/286T.chr3.va.vcf \
1> report/ValidateVariants/286T.chr3.va.valid.out \
2> report/ValidateVariants/286T.chr3.va.valid.err &


An issue arises with a rsID that is flagged as not being present in dbSNP.

...fails strict validation: the rsID rs67850374 for the record at position chr3:123022685 is not in dbSNP


I realize this is an error message that generally would not generally qualify as an issue to post to these forums, however it is an error that seems to be generated by the Haplotype caller, illuminated by VariantAnnotator, and caught by the ValidateVariants.

The first 7 fields of the offending line in the 286T.chr3.va.vcf can be found using: cat 286T.chr3.va.vcf | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AAAGAGAAGAGAAGAG        A       1865.98 .


There is a corresponding entry in the dbsnp_135.hg19.vcf file: cat \$dbsnp | grep rs67850374

chr3    123022685       rs67850374;rs72184829   AA      A,AAAGAGAAGAG,AAAGAGAAGAGAAGAGAAGAG     .  PASS


My initial guess is that this is caused by a disagreement in the reference and variant fields between the two annotations. From what I can gather the call to the variantcontext function validateRSIDs() has a call to validateAlternateAlleles(). I assume this is what throws the error that is then caught and reported as "...fails strict validation..."

The UCSC genome browser for hg19 does show the specified position to be AA. It seems as thought the HaplotypeCaller simply used a different reference than dbsnp in this case.

The reference file supplied to HaplotypeCaller was the same as to VariantAnnotator and ValidateVariants. I did not supply the dbsnp argument to the HaplotypeCaller as I planned on doing all annotations after the initial variant calling, and the documentation states that the information is not utilized in the calculations. It seems as though this is a difference in between the reference assembly for dbSNP and the the reference supplied by the resource bundle.

My questions are:

1. Is this really a problem that arises from slightly different reference assemblies?
2. Is the hg19-1.5 reference fasta different from any other hg19 reference fasta?
3. Is there at tool that I have missed that would have prevented this error and allowed the pipeline to continue without error?"
4. Will this strict validation failure cause problems for the VariantRecalibrator?

As it stands, I am simply going to discard the offending lines manually. There are less than twenty in the entire exome sequencing of this particular tumor-normal sequencing. However, it seems like this issue will likely arise again. I will check the dbSNP VCF for places where the reference differs from the sequence in hg19. At least that should give me an estimate of the number of times this will arise and the locations to exclude from the variant calls.

-- Colin

Created 2012-12-10 23:46:19 | Updated 2012-12-11 02:31:16 | Tags: variantannotator resource

I would like to have certain annotations applied only if the variants in the resource and the target have the same genomic change. Ex: In Cosmic the following variants are present.

12      25398284        .       C       G       .       PASS    Genename=KRAS;MutationAA=p.G12A;MutationCDS=c.35G>C
12      25398284        .       C       A       .       PASS    Genename=KRAS;MutationAA=p.G12V;MutationCDS=c.35G>T
12      25398284        .       C       T       .       PASS    Genename=KRAS;MutationAA=p.G12D;MutationCDS=c.35G>A

However, in my VCF file I get the G12A annotation for all variants at this site even if the actual change is G12V or G12D.

Is this possible?

Created 2012-11-19 02:07:53 | Updated | Tags: mappingqualityranksumtest readposranksumtest variantannotator

Hi,

I'm attempting to use Variant Annotator to annotate some VCFs produced by samtools so I can run VQSR on them. Unfortunately I've gottent stuck and I'm trying to figure out why Variant Annotator wouldn't be annotating INDELs with MappingQualityRankSumTest and ReadPosRankSumTest, it seems to annotate SNPs fine. There are both Homs and het's called on the sample. Could it be I need to left align the indels to get enough coverage? What would you suggest is the best way to debug this? Is there a way to make GATK behave more verbosely about why it's refusing an annotation?

Thanks Martin

Created 2012-11-05 08:12:48 | Updated | Tags: unifiedgenotyper qualbydepth variantannotator error

I had annotated raw indel file (given by UnifiedGenotyper), 1000G_omni2.5.b37.sites.vcf and hapmap_3.3.b37.sites.vcf with all possible annotations including QD (QualByDepth) using VariantAnnotator. However, i got an error when i tried to run VariantRecalibrator. It was complaing that QD has not been found in training variant. Is QD important annotation for indel filtering. Can it be ignored ?

P.S. - i did not use sample bam file while annotating training data set.

.
.
.
INFO  15:11:55,999 RMDTrackBuilder - Loading Tribble index from disk for file NCBI_dbsnp_for_GATK.vcf
INFO  15:12:21,650 TraversalEngine -  chr1:128346793        1.98e+07   30.0 s        1.5 s      4.1%        12.1 m    11.6 m
INFO  15:12:51,650 TraversalEngine -  chr9:130658800        5.26e+07   60.0 s        1.1 s     53.9%       111.2 s    51.2 s
INFO  15:13:13,618 VariantDataManager - QD:      mean = NaN      standard deviation = NaN
INFO  15:13:16,417 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.1-13-g1706365):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR
##### ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations. See http://www.broadinstitute.org/gsa/wiki/index.php/VariantAnnotator
##### ERROR ------------------------------------------------------------------------------------------

Created 2012-10-23 10:00:08 | Updated 2012-10-23 10:00:08 | Tags: haplotypecaller variantannotator

Hi there, I'm running into an issue with my Queue pipeline, with variants called with HaplotypeCaller. Once I get the raw VCF file, I use VariantAnnotator before VQSR: however, no HaplotypeScore annotation is being produced, resulting in a failure of the VariantRecalibrator step where 'HaplotypeScore' was indicated as an annotation.

I tried to correct the issue by indicating to VariantAnnotator to use all annotations

  class AnnotationArguments (t: Target) extends VariantAnnotator with UNIVERSAL_GATK_ARGS {
this.reference_sequence = qscript.referenceFile
// Set the memory limit to 7 gigabytes on each command.
this.memoryLimit = 7
this.input_file :+= qscript.bamFile
this.useAllAnnotations
this.D = qscript.dbSNP_b37
}

But I still can't get any output in the annotated VCF of that parameter. Here an example of a variant

 AC=5;AF=0.078;AN=64;ActiveRegionSize=179;ClippingRankSum=-0.568;DB;DP=2025;EVENTLENGTH=0;FS=4.139;InbreedingCoeff=-0.0847;MLEAC=5;MLEAF=0.078;MQ=69.98;MQRankSum=-1.428;NVH=1;NumHapAssembly=15;NumHapEval=13;QD=17.20;QDE=17.20;ReadPosRankSum=-1.762;TYPE=SNP;extType=SNP

Any suggestions on what I might be doing wrong?

thanks very much for your help, Francesco

Created 2012-10-12 10:48:39 | Updated 2012-10-18 00:57:09 | Tags: variantannotator variants annotation

Hi,

Could you tell me when we can use new version of SnpEff with GATK?

I have some bugs :

##### ERROR MESSAGE: Could not create module ActiveRegionBasedAnnotation because BUG: Cannot find expected constructor for class caused by exception org.broadinstitute.sting.gatk.walkers

.annotator.interfaces.ActiveRegionBasedAnnotation.()

##### ERROR ------------------------------------------------------------------------------------------

I don't know if I forget some other options linked these annotations. These options are important for me. So I deleted them but if somebody want to use them ...

Regards,

Tiphaine

Created 2012-09-19 15:45:44 | Updated 2013-01-07 20:39:33 | Tags: unifiedgenotyper variantannotator downsampling

I am trying to understand how Variant Annotator functions. I took the vcf file from the output of UnifiedGenotyper and ran Variant Annotator with the same .bam and .bed files I used for running UnifiedGenotyper. I expected that all the calculations in the INFO field will remain the same, since I am using the same input files. However, I find that some fields have different values. Here is an example: UnifiedGenotyper output:

chr22   30094366        .       G       A       172.01  .       AC=1;AF=0.500;AN=2;BaseQRankSum=3.182;DP=244;DS;Dels=0.00;FS=0.000;HaplotypeScore=118.5897;MLEAC=1;MLEAF=0.500;MQ=43.29;MQ0=0;MQRankSum=-0.049;QD=0.70;ReadPosRankSum=1.428;SB=-6.201e+01        GT:AD:DP:GQ:PL  0/1:220,19:244:99:202,0,2693 

VariantAnnotator output:

chr22   30094366        .       G       A       172.01  .       ABHet=0.923;AC=1;AF=0.500;AN=2;BaseQRankSum=3.182;DP=993;DS;Dels=0.00;FS=0.000;HaplotypeScore=454.8386;MLEAC=1;MLEAF=0.500;MQ=43.29;MQ0=0;MQ0Fraction=0.0000;MQRankSum=-0.378;OND=0.034;QD=0.17;ReadPosRankSum=-4.859;SB=-6.201e+01      GT:AD:DP:GQ:PL  0/1:220,19:244:99:202,0,2693

I am running GATKLite 2.1. Notice the DP in the info field has a different value. HaplotypeScore, QD, MQRankSum, etc have different values as well. Am I doing anything wrong? Shouldn't these values be the same when I recalculate these fields using VariantAnnotator?