We find the haplotypecaller is an excellent SNP caller. But recently we got confused for the indel results. We did the target sequencing (total 6 samples with 3 case vs. 3 control). We followed the best practice suggestion except that the VariantRecalibrator (the snp number was around 600 and seems too little for the recalibration). Haploptypecaller detected correctly a SNP but the neighbor deletion was a little strange. From the samtools tview, there is no clear sign for the deletion. We wonder if it came from the de novo assembly by haplotypecaller and is it creditable? Thanks.
The command line:
java -Xmx4g -jar ~/GenomeAnalysisTK-2.1-9-gb90951c/GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I sample1.clean.dedup.recal.bam -I sample2.clean.dedup.recal.bam -I sample3.clean.dedup.recal.bam -I sample4.clean.dedup.recal.bam -I sample5.clean.dedup.recal.bam -I sample6.clean.dedup.recal.bam --dbsnp dbsnp_135.hg19.vcf -L target.interval_list -stand_call_conf 50.0 -stand_emit_conf 10.0 -o samples_new.raw.snps.indels.vcf
SNP: 19448410 . T G 2126.64 . AC=6;AF=0.500;AN=12;ActiveRegionSize=135;ClippingRankSum=18.283;DP=976;EVENTLENGTH=0;FS=1076.837;MLEAC=6;MLEAF=0.500;MQ=58.70;MQRankSum=-2.107;NVH=3;NumHapAssembly=17;NumHapEval=13;QD=2.18;QDE=0.73;ReadPosRankSum=-17.858;TYPE=SNP; GT:GQ:PL 0/1:99:195,0,2945 0/1:99:936,0,6037 0/1:99:354,0,3059 0/1:99:301,0,4595 0/1:99:187,0,2191 0/1:99:203,0,2617 Indel: 19448411 . GTGGCTCC G 274.85 . AC=3;AF=0.250;AN=12;ActiveRegionSize=135;ClippingRankSum=9.296;DP=1019;EVENTLENGTH=-7;FS=328.629;MLEAC=3;MLEAF=0.250;MQ=58.74;MQRankSum=-0.451;NVH=3;NumHapAssembly=17;NumHapEval=13;QD=0.46;QDE=0.15;ReadPosRankSum=-11.624;TYPE=INDEL; GT:GQ:PL 0/0:99:0,106,14388 0/1:99:200,0,28094 0/0:45:0,45,15261 0/1:50:50,0,22048 0/1:74:74,0,10244 0/0:42:0,42,12913
GATK 1.6 claims to apply a fixed Q20 threshold to clip ends of the reads for the indel caller:
3. Indel Calling with the Unified Genotyper
[...] while many of the parameters are common between indel and SNP calling, some parameters have different meaning or operate differently. For example, --min_base_quality_score has a fixed, well defined operation for SNPs (bases at a particular location with base quality lower than this threshold are ignored). However, indel calling is by definition delocalized and haplotype-based, so this parameter does not make sense. Instead, the indel caller will clip both ends of the reads if their quality is below a certain threshold (Q20), up to the point where there is a base in the read exceeding this threshold.
--min_base_quality_score / -mbq ( int with default value 17 )
Minimum base quality required to consider a base for calling. The minimum confidence needed in a given base for it to be used in variant calling. Note that the base quality of a base is capped by the mapping quality so that bases on reads with low mapping quality may get filtered out depending on this value. Note too that this argument is ignored in indel calling. In indel calling, low-quality ends of reads are clipped off (with fixed threshold of Q20).
Can this "fixed threshold of Q20" be changed to another value when running the Unified Genotyper?