Bug UnifiedGenotyper for INDELs - last line from the indels vcf file is ignored
Posted in Ask the GATK team | Last updated on 2013-10-29 15:02:32

Comments (7)

I think there is a bug somewhere, I am running GATK - UnifiedGenotyper with the following parameters: java GenomeAnalysisTK.jar -T UnifiedGenotyper -glm INDEL -R ../path/to/hs37d5.fa -I /chr17-pooled.list --alleles /path/17:81000001-81195210.indels.vcf.gz -L 17:81000001-81195210 -U LENIENT_VCF_PROCESSING -baq CALCULATE_AS_NECESSARY -gt_mode GENOTYPE_GIVEN_ALLELES -out_mode EMIT_ALL_SITES --standard_min_confidence_threshold_for_calling 4.0 --standard_min_confidence_threshold_for_emitting 4.0 -l INFO -A QualByDepth -A HaplotypeScore -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A InbreedingCoeff -A Coverage -o /path/17:81000001-81195210.aindels.vcf.gz.part.vcf.gz

I am getting the following error at the next step, when I try to apply the recalibration:

java -jar /nfs/users/nfs_m/mercury/src/GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar -T ApplyRecalibration -U LENIENT_VCF_PROCESSING --ts_filter_level 90.00 --mode INDEL -R /path/to/hs37d5.fa --input /path/to/somevcf.gz -recalFile /path/to/vqsr.sites.indels.vcf.gz -tranchesFile /path/to/vqsr.sites.indels.tranches -o /path/to/17.vqsr.vcf.gz.part.vcf.gz

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.7-2-g6bda569):
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR MESSAGE: Encountered input variant which isn't found in the input recal file. Please make sure VariantRecalibrator and ApplyRecalibration were run on the same set of input variants. First seen at: [VC input @ 17:81195149-81195151 Q51.40 of type=INDEL alleles=[GTT*, GTTT] attr={AC=1, AN=14, DP=27, DP4=[6, 12, 0, 3], HWE=1.000000e+00, INDEL=true, IS=[3, 1.000000], MDV=3, MQ=28, PV4=[0.53, 1, 1, 0.0027], QBD=2.447886e+00, VDB=1.224300e-05} GT=[[EGAN00001096474 ./. GQ 0 DP 0 PL 0,0,0 {DV=0, SP=0}],[EGAN00001096477 ./.

Though the actual problem seems to be a step before this, when running the unified genotyper (as stated above) - and I actually managed to reproduce it:

The very last lines in the (--allele) file /path/17:81000001-81195210.indels.vcf.gz are:

17 81195128 . AGGG AGGGG 58.6 . AN=12;AC=2;INDEL;IS=3,1;DP=57;VDB=0.188545;PV4=0.025,1,1,0.037;DP4=3,12,3,0;MQ=25;QBD=3.25359;MDV=3;HWE=0.0909091 17 81195149 . GTT GTTT 51.4 . AN=14;AC=1;INDEL;IS=3,1;DP=27;VDB=1.2243e-05;PV4=0.53,1,1,0.0027;DP4=6,12,0,3;MQ=28;QBD=2.44789;MDV=3;HWE=1

The very last lines of the output file (/path/17:81000001-81195210.aindels.vcf.gz.part.vcf.gz) are:

17 81195128 . AGGG AGGGG 53.96 . AC=2;AF=0.091;AN=22;BaseQRankSum=0.532;DP=96;FS=24.358;InbreedingCoeff=0.5200;MLEAC=1;MLEAF=0.045;MQ=18.64;MQ0=0;MQRankSum=2.599;QD=17.99;RPA=3,4;RU=G;ReadPosRankSum=-1.910;STR GT:AD:DP:GQ:PL ./. ./. ./. 0/0:3,0:3:9:0,9,102 ./. ./. ./. 0/0:3,0:3:9:0,9,105 ./. ./. ./. ./. ./. ./. 0/0:3,0:3:9:0,9,101 ./. ./. .............

Therefore the last location within this region on chromosome 17 is ignored, hence my error. This happened to me only on this chromosome, only in this region. I am running the UnifiedGenotyper for the whole genome in the same way, so I assume this might be a bug in GATK somewhere, or is there something that I am missing/doing wrong?

Thank you in advance!

Return to top Comment on this article in the forum