Tagged with #genotypeandvalidate
1 documentation article | 0 announcements | 2 forum discussions

Created 2012-07-23 23:55:48 | Updated 2012-07-23 23:55:48 | Tags: gatkdocs genotypeandvalidate
Comments (0)

A new tool has been released!

Check out the documentation at GenotypeAndValidate.

No posts found with the requested search criteria.

Created 2014-04-05 19:29:11 | Updated | Tags: haplotypecaller genotypeandvalidate
Comments (6)


My use case is quite straightforward, but has been surprisingly hard to achieve:
For each sample, I have both Omni 2.5M SNP genotype data and RNA-seq variant call data (done with GATK3). Now I want to see how well the RNA-seq variant calling is performing, using the SNP genotypes as reference. To do this, I need not only the variant calls in the RNA-seq data (as HC is outputting normally), but all genotypes for a given set of positions. Ideally, I would like to keep all the normal info fields from the RNA VCF, to allow calculation of some concordance metrics based on depth of coverage and other quality parameters later.

I've tried the following:
1. GenotypeAndValidate. With SNP VCF as "truth" and BAM to evaluate. The command:

java -Xmx32g -jar ${GATK} \  
-T GenotypeAndValidate \  
-R ${REF} \  
-I ${BAM} \  
-alleles ${SNPVCF} \  
-L ${SNPVCF} \  
-o $SAMPLEID.rnasnp.vcf \  
-nt 4  

The results (running only chr 1, with ~185k SNPs):

(empty) ALT REF No Status
called alt 0 0 4096
called ref 0 0 12995
not called 0 0 153034

sensitivity: NaN%
specificity: 100.000000%
not confident: 3678
not covered: 149356

This runs surprisingly fast - which makes me think I'm not inputting the files as expected.

2. Haplotype Caller in GGA mode. Giving it the SNP VCF as the --alleles file. The command, adjusted for RNA-seq data:

java -Xmx32g -jar ${GATK} \
-T HaplotypeCaller \
-R ${REF} \
--dbsnp ${DBSNP} \
-I ${BAM} \
-L ${SNPVCF} \
-alleles ${SNPVCF} \
--interval_padding 150 \
-recoverDanglingHeads \
-dontUseSoftClippedBases \
-stand_call_conf 0.0 \
-stand_emit_conf 0.0 \
-o $SAMPLEID.rnasnp.vcf \
-nct 16

This almost results in what I want, in that HC starts outputting also 0/0 and ./. calls for reference and non-covered bases.
But, it does so only for SNP-positions with non-reference alleles in the SNP VCF. Again, I want all positions called - including those that are homozygous reference in the SNP VCF.

I am using these tools wrong? Or should I be doing this differently?

Thanks in advance, Vasilios

Created 2014-03-23 02:34:20 | Updated | Tags: genotypeandvalidate
Comments (10)


I am trying to run GenotypeAndValidate on 4 of my bam files. I get the following error. I am sure I do not have 'callStatus' in the info field of my 'alleles' vcf file. Thanks.

INFO 22:21:20,191 HelpFormatter - -------------------------------------------------------------------------------- INFO 22:21:20,194 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.0-0-g6bad1c6, Compiled 2014/03/06 06:30:35 INFO 22:21:20,194 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 22:21:20,195 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 22:21:20,201 HelpFormatter - Program Args: -R hs37d5.fasta -T GenotypeAndValidate -alleles variantCallsHC_filtered.vcf -I p1.clean.dedup.recal.bam -I p2.clean.dedup.recal.bam -I F.bam -I M.bam -o genotypeNvalidate.vcf -L variantCallsHC_filtered.vcf INFO 22:21:20,205 HelpFormatter - Executing as * on Linux 2.6.32-220.13.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15. INFO 22:21:20,206 HelpFormatter - Date/Time: 2014/03/22 22:21:20 INFO 22:21:20,206 HelpFormatter - -------------------------------------------------------------------------------- INFO 22:21:20,207 HelpFormatter - -------------------------------------------------------------------------------- INFO 22:21:21,135 GenomeAnalysisEngine - Strictness is SILENT INFO 22:21:21,329 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 22:21:21,341 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 22:21:21,407 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 INFO 22:21:42,545 IntervalUtils - Processing 135051 bp from intervals INFO 22:21:42,659 GenomeAnalysisEngine - Preparing for traversal over 4 BAM files INFO 22:21:43,598 GenomeAnalysisEngine - Done preparing for traversal INFO 22:21:43,598 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 22:21:43,599 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalStateException: Key callStatus found in VariantContext field INFO at 1:762273 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default. at org.broadinstitute.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:173) at org.broadinstitute.variant.vcf.VCFEncoder.encode(VCFEncoder.java:112) at org.broadinstitute.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:214) at org.broadinstitute.sting.gatk.io.storage.VariantContextWriterStorage.add(VariantContextWriterStorage.java:183) at org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub.add(VariantContextWriterStub.java:254) at org.broadinstitute.sting.gatk.walkers.validation.GenotypeAndValidate.map(GenotypeAndValidate.java:493) at org.broadinstitute.sting.gatk.walkers.validation.GenotypeAndValidate.map(GenotypeAndValidate.java:216) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.0-0-g6bad1c6):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Key callStatus found in VariantContext field INFO at 1:762273 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
ERROR ------------------------------------------------------------------------------------------