Tagged with #strelka
0 documentation articles | 0 announcements | 2 forum discussions


Comments (7)

I'm running into a problem with VCFs that contain single-ended breakends. (These are produced by an old version of Strelka.) Tribble doesn't accept "." as part of an alternate allele.

Single breakends are valid in the VCF standard, and the files validate with VCFtools.

Others have run into this problem as well: https://groups.google.com/forum/#!searchin/strelka-discuss/gatk/strelka-discuss/gJfsyjZNZXA/ExDXZiVWW_kJ

Example error:

##### ERROR stack trace
org.broad.tribble.TribbleException: The provided VCF file is malformed at approximately line number 1807: Unparsable vcf record with allele .CCCAGGAGGACTCACTGCCGCTGTCACCTCTGCTGCCACCACTGTTGCCAC, for input source: /cga/tcga-gsc/benchmark/Indels/strelkaPON/NA18606.mapped.ILLUMINA.bwa.CHB.exome.20111114.bam-NA18608.mapped.ILLUMINA.bwa.CHB.exome.20111114.bam/final.indels.vcf
at org.broadinstitute.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:715)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:527)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseSingleAltAllele(AbstractVCFCodec.java:553)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:494)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:291)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:234)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:213)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:45)
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:73)
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:35)
at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:284)
at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:264)
at org.broad.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:225)
at org.broadinstitute.sting.tools.CatVariants.execute(CatVariants.java:239)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.tools.CatVariants.main(CatVariants.java:258)
##### ERROR ------------------------------------------------------------------------------------------

Example VCF line:

19  36002413    .   C   .CCCAGGAGGACTCACTGCCGCTGTCACCTCTGCTGCCACCACTGTTGCCAC    .   PASS    IHP=1;NT=ref;QSI=82;QSI_NT=82;SGT=ref->hom;SOMATIC;SVTYPE=BND;TQSI=1;TQSI_NT=1  DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50   49:49:42,44:0,0:7,6:43.72:0.85:0.00 11:11:0,0:6,6:5,5:14.61:0.48:0.0

A full vcf is available at: /humgen/gsa-scr1/pub/incoming/BreakendBug/breakend.vcf
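
As a stopgap until the parser handles these records, the single breakend lines could be set aside before the file is handed to GATK. The following is only a minimal sketch, not a GATK or Strelka feature: the function names are made up for illustration, and it assumes a plain-text, tab-separated VCF with ALT in the fifth column.

```python
def is_single_breakend(alt: str) -> bool:
    """True for ALT alleles written as '.SEQ' or 'SEQ.' (single breakends)."""
    if alt == ".":
        # A lone '.' means "no alternate allele", not a breakend.
        return False
    return alt.startswith(".") or alt.endswith(".")


def split_breakend_records(lines):
    """Split VCF lines into (kept, breakends).

    Header lines and blank lines always go to `kept`, so that list is
    still a complete VCF once the breakend records are taken out.
    """
    kept, breakends = [], []
    for line in lines:
        if line.startswith("#") or not line.strip():
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # ALT is the fifth column and may hold several comma-separated alleles.
        alts = fields[4].split(",") if len(fields) > 4 else []
        if any(is_single_breakend(a) for a in alts):
            breakends.append(line)
        else:
            kept.append(line)
    return kept, breakends
```

Writing `kept` to a new file and running CatVariants on that file would skip the records that trigger the exception, at the cost of dropping those calls from the merged output.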

Comments (10)

Strelka produces VCF files that GATK has issues with. The files pass vcftools validation, which according to the docs is the official validation, but they do not pass ValidateVariants. VariantEval can't read them either. I'm unsure where the bug lives.

The VCF file looks like this:

##fileformat=VCFv4.1
##fileDate=20130801
##source=strelka
##source_version=2.0.8
##startTime=Thu Aug  1 15:23:54 2013
##reference=file:///xchip/cga_home/louisb/reference/human_g1k_v37_decoy.fasta
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##contig=<ID=NC_007605,length=171823>
##contig=<ID=hs37d5,length=35477943>
##content=strelka somatic indel calls
##germlineIndelTheta=0.0001
##priorSomaticIndelRate=1e-06
##INFO=<ID=QSI,Number=1,Type=Integer,Description="Quality score for any somatic variant, ie. for the ALT haplotype to be present at a significantly different frequency in the tumor and normal">
##INFO=<ID=TQSI,Number=1,Type=Integer,Description="Data tier used to compute QSI">
##INFO=<ID=NT,Number=1,Type=String,Description="Genotype of the normal in all data tiers, as used to classify somatic variants. One of {ref,het,hom,conflict}.">
##INFO=<ID=QSI_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT">
##INFO=<ID=TQSI_NT,Number=1,Type=Integer,Description="Data tier used to compute QSI_NT">
##INFO=<ID=SGT,Number=1,Type=String,Description="Most likely somatic genotype excluding normal noise states">
##INFO=<ID=RU,Number=1,Type=String,Description="Smallest repeating sequence unit in inserted or deleted sequence">
##INFO=<ID=RC,Number=1,Type=Integer,Description="Number of times RU repeats in the reference allele">
##INFO=<ID=IC,Number=1,Type=Integer,Description="Number of times RU repeats in the indel allele">
##INFO=<ID=IHP,Number=1,Type=Integer,Description="Largest reference interupted homopolymer length intersecting with the indel">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
##INFO=<ID=OVERLAP,Number=0,Type=Flag,Description="Somatic indel possibly overlaps a second indel.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth for tier1">
##FORMAT=<ID=DP2,Number=1,Type=Integer,Description="Read depth for tier2">
##FORMAT=<ID=TAR,Number=2,Type=Integer,Description="Reads strongly supporting alternate allele for tiers 1,2">
##FORMAT=<ID=TIR,Number=2,Type=Integer,Description="Reads strongly supporting indel allele for tiers 1,2">
##FORMAT=<ID=TOR,Number=2,Type=Integer,Description="Other reads (weak support or insufficient indel breakpoint overlap) for tiers 1,2">
##FORMAT=<ID=DP50,Number=1,Type=Float,Description="Average tier1 read depth within 50 bases">
##FORMAT=<ID=FDP50,Number=1,Type=Float,Description="Average tier1 number of basecalls filtered from original read depth within 50 bases">
##FORMAT=<ID=SUBDP50,Number=1,Type=Float,Description="Average number of reads below tier1 mapping quality threshold aligned across sites within 50 bases">
##FILTER=<ID=Repeat,Description="Sequence repeat of more than 8x in the reference sequence">
##FILTER=<ID=iHpol,Description="Indel overlaps an interupted homopolymer longer than 14x in the reference sequence">
##FILTER=<ID=BCNoise,Description="Average fraction of filtered basecalls within 50 bases of the indel exceeds 0.3">
##FILTER=<ID=QSI_ref,Description="Normal sample is not homozygous ref or sindel Q-score < 30, ie calls with NT!=ref or QSI_NT < 30">
##cmdline=/xchip/cga_home/louisb/Strelka/strelka_workflow_1.0.7/libexec/consolidateResults.pl --config=/xchip/cga/benchmark/testing/full-run/somatic-benchmark/spiked/Strelka_NDEFGHI_T12345678_0.8/config/run.config.ini
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
1   797126  .   GTAAT   G   .   PASS    IC=1;IHP=2;NT=ref;QSI=56;QSI_NT=56;RC=2;RU=TAAT;SGT=ref->het;SOMATIC;TQSI=1;TQSI_NT=1   DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50   47:47:48,49:0,0:3,3:48.72:0.00:0.00 62:62:36,39:17,19:9,9:42.49:0.21:0.00

The output I get from ValidateVariants is:

java -jar ~/Workspace/gatk-protected/dist/GenomeAnalysisTK.jar -T ValidateVariants --variant strelka1.vcf -R ~/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:19:45,289 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:19:45,291 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/22 11:08:15
INFO  17:19:45,291 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  17:19:45,291 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:19:45,295 HelpFormatter - Program Args: -T ValidateVariants --variant strelka1.vcf -R /Users/louisb/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:19:45,295 HelpFormatter - Date/Time: 2013/08/28 17:19:45
INFO  17:19:45,295 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:19:45,295 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:19:45,300 ArgumentTypeDescriptor - Dynamically determined type of strelka1.vcf to be VCF
INFO  17:19:45,412 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:19:45,513 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:19:45,533 RMDTrackBuilder - Loading Tribble index from disk for file strelka1.vcf
INFO  17:19:45,615 GenomeAnalysisEngine - Preparing for traversal
INFO  17:19:45,627 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:19:45,627 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:19:45,627 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO  17:19:46,216 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.7-1-g42d771f):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: File /Users/louisb/Workspace/strelkaVcfDebug/strelka1.vcf fails strict validation: one or more of the ALT allele(s) for the record at position 1:797126 are not observed at all in the sample genotypes
##### ERROR ------------------------------------------------------------------------------------------

The output from VariantEval is:

java -jar ~/Workspace/gatk-protected/dist/GenomeAnalysisTK.jar -T VariantEval --eval strelka1.vcf -R ~/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:15:44,333 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,335 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/22 11:08:15
INFO  17:15:44,335 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  17:15:44,335 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:15:44,339 HelpFormatter - Program Args: -T VariantEval --eval strelka1.vcf -R /Users/louisb/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:15:44,339 HelpFormatter - Date/Time: 2013/08/28 17:15:44
INFO  17:15:44,339 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,339 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,349 ArgumentTypeDescriptor - Dynamically determined type of strelka1.vcf to be VCF
INFO  17:15:44,476 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:15:44,603 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:15:44,623 RMDTrackBuilder - Loading Tribble index from disk for file strelka1.vcf
INFO  17:15:44,710 GenomeAnalysisEngine - Preparing for traversal
INFO  17:15:44,722 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:15:44,722 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:15:44,723 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  17:15:44,831 VariantEval - Creating 3 combinatorial stratification states
INFO  17:15:45,382 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: BUG: Unexpected genotype type: [NORMAL NA DP 47 {DP2=47, DP50=48.72, FDP50=0.00, SUBDP50=0.00, TAR=48,49, TIR=0,0, TOR=3,3}]
    at org.broadinstitute.sting.gatk.walkers.varianteval.evaluators.CountVariants.update1(CountVariants.java:201)
    at org.broadinstitute.sting.gatk.walkers.varianteval.util.EvaluationContext.apply(EvaluationContext.java:88)
    at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:455)
    at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:124)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-1-g42d771f):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: BUG: Unexpected genotype type: [NORMAL NA DP 47 {DP2=47, DP50=48.72, FDP50=0.00, SUBDP50=0.00, TAR=48,49, TIR=0,0, TOR=3,3}]
##### ERROR ------------------------------------------------------------------------------------------
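
One possible reading of both errors, offered as a guess rather than a diagnosis, is that the records carry no GT key: the FORMAT column in the example is DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50, and both messages complain about the sample genotypes. A minimal sketch for checking a file for this follows. The function name is hypothetical, and it assumes a plain-text, tab-separated VCF with FORMAT in the ninth column.

```python
def records_without_gt(lines):
    """Yield (chrom, pos) for data lines whose FORMAT column lacks a GT key."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # sites-only record: there is no FORMAT column to inspect
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            yield fields[0], int(fields[1])
```

If every record is reported, the genotype-dependent checks in ValidateVariants and VariantEval have no called genotypes to work with, which would be consistent with the "not observed at all in the sample genotypes" and "Unexpected genotype type" messages.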