I've having an issue with VariantRecalibrator which I've faced in the past.
This is the Error which I've seen before:
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --minNumBadVariants 5000, for example).
And I understand it's because the total number of variants is to low, but I'm unable to see how that could be an issue for my data set.
wc *_genotyped.vcf = 350678 93598833 1851184230 Which contains 200 background individual + 3 target exomes.
And my command:
java -jar -T VariantRecalibrator -R /data/srynearson/gatk_reference/human_g1k_v37_decoy.fasta --minNumBadVariants 5000 --num_threads 30 -resource:mills,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf -resource:1000G,known=false,training=true,truth=true,prior=10.0 /data/GATK_Bundle/1000G_phase1.indels.b37.vcf -an MQRankSum -an ReadPosRankSum -an FS -input /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_genotyped.vcf -recalFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_recal -tranchesFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_tranches -rscriptFile /data2/srynearson/10956R/UGP_Pipeline_Results/Capture_20_indel_plots.R -mode INDEL
And my verison: The Genome Analysis Toolkit (GATK) vnightly-2014-06-17-gc231c21, Compiled 2014/06/17 00:01:17
So you see my minNumBadVariants is already set to 5000.
Is the error due to the total number of Indels in the VCF file, because otherwise the error of "to few variants" seem wrong.
GATK 1.6 claims to apply a fixed Q20 threshold to clip ends of the reads for the indel caller:
3. Indel Calling with the Unified Genotyper
[...] while many of the parameters are common between indel and SNP calling, some parameters have different meaning or operate differently. For example, --min_base_quality_score has a fixed, well defined operation for SNPs (bases at a particular location with base quality lower than this threshold are ignored). However, indel calling is by definition delocalized and haplotype-based, so this parameter does not make sense. Instead, the indel caller will clip both ends of the reads if their quality is below a certain threshold (Q20), up to the point where there is a base in the read exceeding this threshold.
--min_base_quality_score / -mbq ( int with default value 17 )
Minimum base quality required to consider a base for calling. The minimum confidence needed in a given base for it to be used in variant calling. Note that the base quality of a base is capped by the mapping quality so that bases on reads with low mapping quality may get filtered out depending on this value. Note too that this argument is ignored in indel calling. In indel calling, low-quality ends of reads are clipped off (with fixed threshold of Q20).
Can this "fixed threshold of Q20" be changed to another value when running the Unified Genotyper?
We find the haplotypecaller is an excellent SNP caller. But recently we got confused for the indel results. We did the target sequencing (total 6 samples with 3 case vs. 3 control). We followed the best practice suggestion except that the VariantRecalibrator (the snp number was around 600 and seems too little for the recalibration). Haploptypecaller detected correctly a SNP but the neighbor deletion was a little strange. From the samtools tview, there is no clear sign for the deletion. We wonder if it came from the de novo assembly by haplotypecaller and is it creditable? Thanks.
The command line:
java -Xmx4g -jar ~/GenomeAnalysisTK-2.1-9-gb90951c/GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I sample1.clean.dedup.recal.bam -I sample2.clean.dedup.recal.bam -I sample3.clean.dedup.recal.bam -I sample4.clean.dedup.recal.bam -I sample5.clean.dedup.recal.bam -I sample6.clean.dedup.recal.bam --dbsnp dbsnp_135.hg19.vcf -L target.interval_list -stand_call_conf 50.0 -stand_emit_conf 10.0 -o samples_new.raw.snps.indels.vcf
SNP: 19448410 . T G 2126.64 . AC=6;AF=0.500;AN=12;ActiveRegionSize=135;ClippingRankSum=18.283;DP=976;EVENTLENGTH=0;FS=1076.837;MLEAC=6;MLEAF=0.500;MQ=58.70;MQRankSum=-2.107;NVH=3;NumHapAssembly=17;NumHapEval=13;QD=2.18;QDE=0.73;ReadPosRankSum=-17.858;TYPE=SNP; GT:GQ:PL 0/1:99:195,0,2945 0/1:99:936,0,6037 0/1:99:354,0,3059 0/1:99:301,0,4595 0/1:99:187,0,2191 0/1:99:203,0,2617 Indel: 19448411 . GTGGCTCC G 274.85 . AC=3;AF=0.250;AN=12;ActiveRegionSize=135;ClippingRankSum=9.296;DP=1019;EVENTLENGTH=-7;FS=328.629;MLEAC=3;MLEAF=0.250;MQ=58.74;MQRankSum=-0.451;NVH=3;NumHapAssembly=17;NumHapEval=13;QD=0.46;QDE=0.15;ReadPosRankSum=-11.624;TYPE=INDEL; GT:GQ:PL 0/0:99:0,106,14388 0/1:99:200,0,28094 0/0:45:0,45,15261 0/1:50:50,0,22048 0/1:74:74,0,10244 0/0:42:0,42,12913