Hi, I am running GATK on rice libraries and I get the following error with VariantRecalibrator. The command I use is:
java -jar GenomeAnalysisTK-2.8-1-g932cd3a/GenomeAnalysisTK.jar -T VariantRecalibrator -R genome.fasta -input chr5.vcf -mode SNP -recalFile chr5.raw.snps.recal -tranchesFile chr5.raw.snps.tranches -rscriptFile chr5.recal.plots.R -resource:dbSNP,known=true,training=true,truth=true,prior=6.0 all_chromosomes.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ
This is how the program terminates:
INFO 21:42:35,099 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 21:42:35,102 TrainingSet - Found dbSNP track: Known = true Training = true Truth = true Prior = Q6.0
INFO 21:43:05,104 ProgressMeter - Chr11:11654383 9.83e+06 30.0 s 3.0 s 87.7% 34.0 s 4.0 s INFO 21:43:07,413 VariantDataManager - QD: mean = 27.02 standard deviation = 7.70 INFO 21:43:07,440 VariantDataManager - MQRankSum: mean = -0.62 standard deviation = 3.35 INFO 21:43:07,479 VariantDataManager - ReadPosRankSum: mean = 0.25 standard deviation = 1.63 INFO 21:43:07,510 VariantDataManager - FS: mean = 2.11 standard deviation = 9.67 INFO 21:43:07,525 VariantDataManager - MQ: mean = 47.09 standard deviation = 11.34 INFO 21:43:07,660 VariantDataManager - Annotations are now ordered by their information content: [MQ, QD, FS, MQRankSum, ReadPosRankSum] INFO 21:43:07,680 VariantDataManager - Training with 7845 variants after standard deviation thresholding.
INFO 21:43:09,833 VariantRecalibratorEngine - Convergence after 93 iterations! INFO 21:43:09,872 VariantRecalibratorEngine - Evaluating full set of 272372 variants... INFO 21:43:09,884 VariantDataManager - Training with worst 0 scoring variants --> variants with LOD <= -5.0000. INFO 21:43:10,854 GATKRunReport - Uploaded run statistics report to AWS S3
java.lang.NullPointerException at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorEngine.generateModel(VariantRecalibratorEngine.java:83) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:359) at org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibrator.onTraversalDone(VariantRecalibrator.java:139) at org.broadinstitute.sting.gatk.executive.Accumulator$StandardAccumulator.finishTraversal(Accumulator.java:129) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:116) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
I have read similar problems on GATK forums and based on those, it seems to me that the training set of VCFs is too small for my data. Is that so? If so, can you please tell me how can I fix it? This is for rice and I only have 1 set of known VCFs from dbSNP.
Can you also please confirm if my settings for 'known', 'training' and 'truth' are correct.
Thanks a lot in advance.
I'm trying to run GATK with the rice genome and I'm having trouble finding a known variant list of rice? Does anyone have a link or a more general list of known variant resources for other reference genomes?
Specifically, I found two rice genomes in dbSNP, but I can't find any documentation about which is which, and once I'm in the FTP site, which file I should be using?
Maybe take all the VCF files for each chromosome and cat them into one big file? Clearly, I'm a little clueless here :)