Tagged with #vcf format
0 documentation articles | 0 announcements | 8 forum discussions


No posts found with the requested search criteria.
No posts found with the requested search criteria.

Created 2014-01-23 16:32:26 | Updated | Tags: merge gatk
Comments (2)

I have in a database 11 vcf and bam files for individuals we've sequenced. I have been trying to merge the 11 individual vcf files into one combined vcf file using CombineVariants in GATK. While it does combine the vcf files, it does something odd that I'm sure has been solved by other users and I am looking for input on.

A singleton SNP in individual 1 will be given "./." in all other 10 individuals instead of "0/0". Is there a way to fix this--the genotypes are not missing, they are reference.That said, some of them will be missing and are rightly called "./.", but I don't know how to incorporate this information into a merged VCF file.

Your help is most appreciated and apologies if this has been asked before--I couldn't find this exact topic.


Created 2013-12-03 11:22:27 | Updated | Tags: variantstovcf dbsnp138
Comments (1)

I have used the following command $ java -jar GenomeAnalysisTK.jar -T VariantsToVCF -V:OLDDBSNP snp138.txt -R genome.fa -o snp138.vcf to convert human snp138.txt file downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/ to .vcf format. As output the snp138.vcf file has been generated which has the size as 0 and the other 2 files have been generated as snp138.txt.idx and snp138.vcf.idx. How can i get the snp138.vcf?


Created 2013-09-11 16:54:10 | Updated | Tags: vcf error
Comments (1)

I am trying to liftover from NIST b37 to hg19. I have all the files I need and I can kick off the liftover just fine, but I keep running into problems because the NIST vcf has tags in the variant line INFO field that are not in the header. ##### ERROR MESSAGE: Key PLHSWG found in VariantContext field INFO at chr1:52238 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default

I identified about 90 tags that are not properly documented in the header. Is there a way to ignore all of these INFO header lapses?


Created 2013-08-28 21:25:28 | Updated 2013-08-28 21:41:49 | Tags: validatevariants varianteval vcf strelka
Comments (10)

Strelka produces vcf files that GATK has issues with. The files pass vcftools validation, which according to the docs is the official validation, they do not pass ValidateVariants. VariantEval can't read them either. I'm unsure where the bug lives.

vcf file looks like this:

##fileformat=VCFv4.1
##fileDate=20130801
##source=strelka
##source_version=2.0.8
##startTime=Thu Aug  1 15:23:54 2013
##reference=file:///xchip/cga_home/louisb/reference/human_g1k_v37_decoy.fasta
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
##contig=<ID=GL000207.1,length=4262>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=GL000229.1,length=19913>
##contig=<ID=GL000231.1,length=27386>
##contig=<ID=GL000210.1,length=27682>
##contig=<ID=GL000239.1,length=33824>
##contig=<ID=GL000235.1,length=34474>
##contig=<ID=GL000201.1,length=36148>
##contig=<ID=GL000247.1,length=36422>
##contig=<ID=GL000245.1,length=36651>
##contig=<ID=GL000197.1,length=37175>
##contig=<ID=GL000203.1,length=37498>
##contig=<ID=GL000246.1,length=38154>
##contig=<ID=GL000249.1,length=38502>
##contig=<ID=GL000196.1,length=38914>
##contig=<ID=GL000248.1,length=39786>
##contig=<ID=GL000244.1,length=39929>
##contig=<ID=GL000238.1,length=39939>
##contig=<ID=GL000202.1,length=40103>
##contig=<ID=GL000234.1,length=40531>
##contig=<ID=GL000232.1,length=40652>
##contig=<ID=GL000206.1,length=41001>
##contig=<ID=GL000240.1,length=41933>
##contig=<ID=GL000236.1,length=41934>
##contig=<ID=GL000241.1,length=42152>
##contig=<ID=GL000243.1,length=43341>
##contig=<ID=GL000242.1,length=43523>
##contig=<ID=GL000230.1,length=43691>
##contig=<ID=GL000237.1,length=45867>
##contig=<ID=GL000233.1,length=45941>
##contig=<ID=GL000204.1,length=81310>
##contig=<ID=GL000198.1,length=90085>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=GL000191.1,length=106433>
##contig=<ID=GL000227.1,length=128374>
##contig=<ID=GL000228.1,length=129120>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=GL000209.1,length=159169>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000211.1,length=166566>
##contig=<ID=GL000199.1,length=169874>
##contig=<ID=GL000217.1,length=172149>
##contig=<ID=GL000216.1,length=172294>
##contig=<ID=GL000215.1,length=172545>
##contig=<ID=GL000205.1,length=174588>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000223.1,length=180455>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=GL000212.1,length=186858>
##contig=<ID=GL000222.1,length=186861>
##contig=<ID=GL000200.1,length=187035>
##contig=<ID=GL000193.1,length=189789>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=GL000192.1,length=547496>
##contig=<ID=NC_007605,length=171823>
##contig=<ID=hs37d5,length=35477943>
##content=strelka somatic indel calls
##germlineIndelTheta=0.0001
##priorSomaticIndelRate=1e-06
##INFO=<ID=QSI,Number=1,Type=Integer,Description="Quality score for any somatic variant, ie. for the ALT haplotype to be present at a significantly different frequency in the tumor and normal">
##INFO=<ID=TQSI,Number=1,Type=Integer,Description="Data tier used to compute QSI">
##INFO=<ID=NT,Number=1,Type=String,Description="Genotype of the normal in all data tiers, as used to classify somatic variants. One of {ref,het,hom,conflict}.">
##INFO=<ID=QSI_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT">
##INFO=<ID=TQSI_NT,Number=1,Type=Integer,Description="Data tier used to compute QSI_NT">
##INFO=<ID=SGT,Number=1,Type=String,Description="Most likely somatic genotype excluding normal noise states">
##INFO=<ID=RU,Number=1,Type=String,Description="Smallest repeating sequence unit in inserted or deleted sequence">
##INFO=<ID=RC,Number=1,Type=Integer,Description="Number of times RU repeats in the reference allele">
##INFO=<ID=IC,Number=1,Type=Integer,Description="Number of times RU repeats in the indel allele">
##INFO=<ID=IHP,Number=1,Type=Integer,Description="Largest reference interupted homopolymer length intersecting with the indel">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
##INFO=<ID=OVERLAP,Number=0,Type=Flag,Description="Somatic indel possibly overlaps a second indel.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth for tier1">
##FORMAT=<ID=DP2,Number=1,Type=Integer,Description="Read depth for tier2">
##FORMAT=<ID=TAR,Number=2,Type=Integer,Description="Reads strongly supporting alternate allele for tiers 1,2">
##FORMAT=<ID=TIR,Number=2,Type=Integer,Description="Reads strongly supporting indel allele for tiers 1,2">
##FORMAT=<ID=TOR,Number=2,Type=Integer,Description="Other reads (weak support or insufficient indel breakpoint overlap) for tiers 1,2">
##FORMAT=<ID=DP50,Number=1,Type=Float,Description="Average tier1 read depth within 50 bases">
##FORMAT=<ID=FDP50,Number=1,Type=Float,Description="Average tier1 number of basecalls filtered from original read depth within 50 bases">
##FORMAT=<ID=SUBDP50,Number=1,Type=Float,Description="Average number of reads below tier1 mapping quality threshold aligned across sites within 50 bases">
##FILTER=<ID=Repeat,Description="Sequence repeat of more than 8x in the reference sequence">
##FILTER=<ID=iHpol,Description="Indel overlaps an interupted homopolymer longer than 14x in the reference sequence">
##FILTER=<ID=BCNoise,Description="Average fraction of filtered basecalls within 50 bases of the indel exceeds 0.3">
##FILTER=<ID=QSI_ref,Description="Normal sample is not homozygous ref or sindel Q-score < 30, ie calls with NT!=ref or QSI_NT < 30">
##cmdline=/xchip/cga_home/louisb/Strelka/strelka_workflow_1.0.7/libexec/consolidateResults.pl --config=/xchip/cga/benchmark/testing/full-run/somatic-benchmark/spiked/Strelka_NDEFGHI_T12345678_0.8/config/run.config.ini
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NORMAL  TUMOR
1   797126  .   GTAAT   G   .   PASS    IC=1;IHP=2;NT=ref;QSI=56;QSI_NT=56;RC=2;RU=TAAT;SGT=ref-        >het;SOMATIC;TQSI=1;TQSI_NT=1   DP:DP2:TAR:TIR:TOR:DP50:FDP50:SUBDP50   47:47:48,49:0,0:3,3:48.72:0.00:0.00 62:62:36,39:17,19:9,9:42.49:0.21:0.00``

The output I get from ValidateVariants is java -jar ~/Workspace/gatk-protected/dist/GenomeAnalysisTK.jar -T ValidateVariants --variant strelka1.vcf -R ~/cga_home/reference/human_g1k_v37_decoy.fasta INFO 17:19:45,289 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:19:45,291 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/22 11:08:15 INFO 17:19:45,291 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:19:45,291 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 17:19:45,295 HelpFormatter - Program Args: -T ValidateVariants --variant strelka1.vcf -R /Users/louisb/cga_home/reference/human_g1k_v37_decoy.fasta INFO 17:19:45,295 HelpFormatter - Date/Time: 2013/08/28 17:19:45 INFO 17:19:45,295 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:19:45,295 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:19:45,300 ArgumentTypeDescriptor - Dynamically determined type of strelka1.vcf to be VCF INFO 17:19:45,412 GenomeAnalysisEngine - Strictness is SILENT INFO 17:19:45,513 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 17:19:45,533 RMDTrackBuilder - Loading Tribble index from disk for file strelka1.vcf INFO 17:19:45,615 GenomeAnalysisEngine - Preparing for traversal INFO 17:19:45,627 GenomeAnalysisEngine - Done preparing for traversal INFO 17:19:45,627 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:19:45,627 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 17:19:46,216 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 2.7-1-g42d771f): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: File /Users/louisb/Workspace/strelkaVcfDebug/strelka1.vcf fails strict validation: one or more of the ALT allele(s) for the record at position 1:797126 are not observed at all in the sample genotypes ##### ERROR ------------------------------------------------------------------------------------------

output from VariantEval is:

java -jar ~/Workspace/gatk-protected/dist/GenomeAnalysisTK.jar -T VariantEval --eval strelka1.vcf -R ~/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:15:44,333 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,335 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-1-g42d771f, Compiled 2013/08/22 11:08:15
INFO  17:15:44,335 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  17:15:44,335 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:15:44,339 HelpFormatter - Program Args: -T VariantEval --eval strelka1.vcf -R /Users/louisb/cga_home/reference/human_g1k_v37_decoy.fasta
INFO  17:15:44,339 HelpFormatter - Date/Time: 2013/08/28 17:15:44
INFO  17:15:44,339 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,339 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:15:44,349 ArgumentTypeDescriptor - Dynamically determined type of strelka1.vcf to be VCF
INFO  17:15:44,476 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:15:44,603 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:15:44,623 RMDTrackBuilder - Loading Tribble index from disk for file strelka1.vcf
INFO  17:15:44,710 GenomeAnalysisEngine - Preparing for traversal
INFO  17:15:44,722 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:15:44,722 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:15:44,723 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining
INFO  17:15:44,831 VariantEval - Creating 3 combinatorial stratification states
INFO  17:15:45,382 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.ReviewedStingException: BUG: Unexpected genotype type: [NORMAL NA DP 47 {DP2=47, DP50=48.72, FDP50=0.00, SUBDP50=0.00, TAR=48,49, TIR=0,0, TOR=3,3}]
    at org.broadinstitute.sting.gatk.walkers.varianteval.evaluators.CountVariants.update1(CountVariants.java:201)
    at org.broadinstitute.sting.gatk.walkers.varianteval.util.EvaluationContext.apply(EvaluationContext.java:88)
    at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:455)
    at org.broadinstitute.sting.gatk.walkers.varianteval.VariantEval.map(VariantEval.java:124)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.7-1-g42d771f):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: BUG: Unexpected genotype type: [NORMAL NA DP 47 {DP2=47, DP50=48.72, FDP50=0.00, SUBDP50=0.00, TAR=48,49, TIR=0,0, TOR=3,3}]
##### ERROR ------------------------------------------------------------------------------------------

Created 2013-07-05 15:39:36 | Updated | Tags: fastaalternatereferencemaker
Comments (1)

Hi, I want to create out of my bam file a consensus file by using FastaAlternatereferneceMaker and my .vcf file. How should I care about positions where no reads mapped to. By default FastaAlternateReferenceMaker uses the REF base, but isnt it better to use here N insted? THANKS :=)


Created 2013-06-11 18:28:02 | Updated 2013-06-11 18:40:43 | Tags:
Comments (2)

I ran gatk UnifiedGenotyper with 8 samples using a bamlist in this order:

/path/to/DB12-011a1.bam
/path/to/DB12-011a2.bam
/path/to/DB12-011m.bam
/path/to/DB12-011f.bam
/path/to/DB13-009a1.bam
/path/to/DB13-009r1.bam
/path/to/DB13-009m.bam
/path/to/DB13-009f.bam

I expected the vcf to display the samples in this order from left to right. Instead my output lists the samples in alphabetical order according to sample name:

DB12-011a1 DB12-011a2 DB12-011f DB12-011m DB13-009a1 DB13-009f DB13-009m DB13-009r1

The order is not a big deal to me, I just want to make sure that I am looking at the correct samples for each row. Is this behavior normal. This is the first time I have run gatk with more than three samples so it is my first time running into this

Thank you!


Created 2013-03-25 03:49:53 | Updated | Tags:
Comments (5)

GATK writes out consScoreGERP=NA and scorePhastCons=NA while they are declared to be type float. The fields should be set to '.' for empty values instead of NA (as per VCF standard)

This causes problems downstream when using VCF parsers that expect either a float or ".". A few other fields use 'NA' (eg granthamScore) but this doesn't break downstream parsers as they are String fields. I think it would be good to change them, though.


Created 2013-02-22 07:11:41 | Updated | Tags: unifiedgenotyper dbsnp
Comments (6)

I've spent some time looking at the various documentation and have a few lingering questions for understanding my data a little better. I was hoping you could help to clarify a bit and make sure I am making correct assumptions.

I have a set of 96 samples that we ran targeted sequencing on. I have followed the best practices for bam file processing, etc, and ran Unified Genotyper on all 96 bam files at once using -EMIT ALL SITES. I have snp chip data on all of my samples as well so I have some "truth" to compare to. It looks like around 25% of the SNPs I expected to get (are present in dbSNP and I have chip data for them) are not present in my vcf file. Where did they go - is it normal to not get all of the dbSNP sites back? Is this because the variant site did not have good enough quality of bases to make it a variant? Does Unified Genotyper not emit sites for which all samples are homozygous reference even though it knows it's there because of the dbSNP file (like CASAVA)? I looked at the coverage at these positions and in some cases it was very high (over 100X) so I'm not sure that they could all be bad. I am somewhat new to this sort of data analysis - what are the best tools to use to look at the quality of the bases at a single site to see if this is the culprit? Any other ideas as to what could be causing this?

What are the internal things in Unified Genotyper that make a site not be called?

Would I be able to use the GENOTYPE GIVEN ALLELES option to make Unified Genotyper call all dbSNPs regardless of base quality or Phred Score? If so, do I use the same reference db snp .vcf file that I use for the rest of the steps?

For the variants I do have: If I am interested in the coverage at a site, the AD is the unfiltered depth and the DP is the filtered depth (the sum of the allele depths is the total depth), correct?

FInally, I have not run VSQR on my output. We did targeted sequencing and I'm not sure if the number of bases we did is enough got VSQR to be helpful. What is the minimum number of bases that VSQR is recommended with? I have 96 samples and I know that is enough. We are mostly interested in dbSNP values anyhow. In your opinion, is it reasonable to stop after Unified Genotyper since we are only interested in dbSNP calls or should we do the VSQR step anyhow?

Thank you for all your help and insights. I really appreciate the feedback allowed for in this forum.

Lisa