I've a problem running GATK2.8's VariantRecalibrator, I get Code Exception (java.lang.NullPointerException). Below is the error trace:
INFO 17:05:37,528 GenomeAnalysisEngine - Preparing for traversal INFO 17:14:27,706 VariantDataManager - MQRankSum: mean = -0.09 standard deviation = 0.99 INFO 17:14:27,742 VariantDataManager - ReadPosRankSum: mean = 0.18 standard deviation = 0.96 INFO 17:14:27,820 VariantDataManager - Annotations are now ordered by their information content: [QD, ReadPosRankSum, MQRankSum] INFO 17:14:27,823 VariantDataManager - Training with 2739 variants after standard deviation thresholding. INFO 17:14:27,829 GaussianMixtureModel - Initializing model with 100 k-means iterations... INFO 17:14:28,389 VariantRecalibratorEngine - Finished iteration 0. INFO 17:14:28,534 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.20522 INFO 17:14:28,601 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.46290 INFO 17:14:28,675 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 9.95604 INFO 17:14:28,759 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 9.00456 INFO 17:14:28,861 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.01190 INFO 17:14:28,951 VariantRecalibratorEngine - Finished iteration 30. Current change in mixture coefficients = 0.00796 INFO 17:14:29,041 VariantRecalibratorEngine - Finished iteration 35. Current change in mixture coefficients = 0.00546 INFO 17:14:29,131 VariantRecalibratorEngine - Finished iteration 40. Current change in mixture coefficients = 0.00360 INFO 17:14:29,221 VariantRecalibratorEngine - Finished iteration 45. Current change in mixture coefficients = 0.00218 INFO 17:14:29,239 VariantRecalibratorEngine - Convergence after 46 iterations! INFO 17:14:29,276 VariantRecalibratorEngine - Evaluating full set of 13495 variants... INFO 17:14:29,280 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000. INFO 17:14:32,024 GATKRunReport - Uploaded run statistics report to AWS S3
java.lang.NullPointerException at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:359) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:139) at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
And this is the command i used:
java -XX:+UseParallelGC -XX:ParallelGCThreads=8 -Xmx4g -jar ~/genomekey-data/tools/gatk2.jar -T VariantRecalibrator -R ~/genomekey-data/bwa_references/human_g1k_v37.fasta -input ~/simplex/ngs_test/snp_calling/aln.haplotyper.raw.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 ~/genomekey-data/bundle/current/dbsnp_137.b37.vcf -resource:hapmap,VCF,known=false,training=true,truth=true,prior=15.0 ~/genomekey-data/bundle/current/hapmap_3.3.b37.vcf -resource:omni,VCF,known=false,training=true,truth=false,prior=12.0 ~/genomekey-data/bundle/current/1000G_omni2.5.b37.vcf -an QD -an MQRankSum -an ReadPosRankSum -recalFile ~/simplex/ngs_test/snp_recal/aln.snp.recal -tranche 100 -tranche 99.9 -tranche 99.0 -tranche 90 -tranchesFile ~/simplex/ngs_test/snp_recal/aln.snp.tranches -rscriptFile ~/simplex/ngs_test/snp_recal/aln.snp.plots.R -mode SNP
Please note that i got the same error running GATK 2.6. Any help or suggestions will be appreciated.
Would it matter a lot to do recalibration before usage of hc? If yes, it's worth to do realignment before recalibration to reduce no of snv's, I guess. I'm searching for variants in one of not too much known genomes. What I've got is 30x coverage and haploids I work on. I did realignment and recalibration but I'm thinking if it helps me in anything while I don't know any obvious snp sites. And if it's not better to use haplotypecaller stright away after mapping and then do the recalibration with a bunch of best quality variants?
Please, help in that matter.
We are running an exome sequencing project where we have between 30 and 50 exomes in total. For optimal variant quality score recalibration we should use as much data as possible in the VariantRecalibrator step. However, for downstream analysis purposes, we want individual exome VCFs, and UnifiedGenotyper has been run individually for each sample. Our plan was to feed all data into VariantRecalibrator, and then run ApplyRecalibration on the individual raw VCFs. But VariantRecalibrator takes only one VCF as input, right? So what would be the best workflow for this scenario? Could we run UnifiedGenotyper to create a common VCF for Recalibration purposes only, and then apply this to the individual VCFs? Or would this somehow create an invalid input for the RECAL-file? Is it better to run variant calling and recalibration both on multi-sample VCFs and split the VCFs sample-wise later?
I had a few questions about the haplotype score.
In the technical documentation it states that "Higher scores are indicative of regions with bad alignments, often leading to artifactual SNP and indel calls. Note that the Haplotype Score is only calculated for sites with read coverage."
How is the haplotype group for each variant site determined? e.g. Does it take the closest two variants to the query site and then treat the query variant + closest two variants as the haplotype group?
Also, in the case of multiallelic SNPs (>2 SNPs), haplotype score is inappropriate since it only looks at whether a site can be explained by the segregation of two and only two haplotypes, correct? So multiallelic snps will be assigned poor haplotype scores OR will these sites not be annotated at all? If we have a case where there is a truly biallelic SNP and a couple of samples have some reads that are erroneously calling a third allele, this variant site will be assigned a poor haplotype score overall, correct?
Hi, I'm encountering this error running VariantRecalibrator with data from 3 samples (I'm testing): Maybe is the problem due to small sample size?
##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR stack trace java.lang.NullPointerException at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantDataManager.selectWorstVariants(VariantDataManager.java:179) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:306) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:107) at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:97) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:281) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:94) ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-16-g9f648cb): ##### ERROR ##### ERROR Please visit the wiki to see if this is a known problem ##### ERROR If not, please post the error, with stack trace, to the GATK forum ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Code exception (see stack trace for error itself) ##### ERROR ------------------------------------------------------------------------------------------
I've just made a long needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before but now that I have a great deal more exomes at my disposal I'd like to fully implement it in a meaningful way.
The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?
Is there any difference or reason to choose one of the following two workflows over the other?
Input 30+ exomes in the "-I" argument of either the UnifiedGenotyper or HaplotypeCaller and then with my multi-sample VCF perform the variant recalibration procedure and then split the individual call sets out of the multi-sample vcf with SelectVariants?
Take 30+ individual vcf files, merge them together, and then perform variant recalibration on the merged vcf and then split the individual call sets out of the multi-sample vcf with SelectVariants?
Or some third option I'm missing
Any help is appreciated.
We have data from target sequencing genes (only targeted two genes). We analyzed the data by GATK pipeline. Since the data set is too small, we tried hard filtration on both SNP and indels. At the same time, we sequenced the same sample by whole exome sequencing and filter SNP by VQSR. The quality of VQSR results is much better than hard filtration results. For economic reason, we need to develop analysis pipeline for target sequencing, is it ok to incorporate the target sequencing data into an exome sequencing data (merge the VCF files), do VQSR? I just worried the true sites in target sequencing data have different features compared to true sites in whole exome sequencing data.