I'm trying to understand the haplotype scoring algorithm in GATK 1.6.5. Fortunately I have a printed page with a simple diagram that explains the algorithm, but I can't find it anymore on the new website. The problem is that the formula for calculating the haplotype score in this diagram has a variable whose meaning I can't work out. This is the formula as it's written:
P(read | haplotype_j) = sum_bi (bi == hi ? ei : 1 - ei / 3) - sum_bi (ei)
I guess bi stands for the base at position i in the current read and hi stands for the base at position i in haplotype_j, which makes sense to me. But what is ei? Maybe I'm missing something; it looks like it should be a probability in the range (0, 1) for the haplotype score to make sense.
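To make the question concrete, here is a minimal Python sketch that evaluates the formula exactly as printed, under the assumption that ei is the per-base error probability derived from the Phred base quality (e = 10^(-Q/10)). That assumption is only the guess raised above, not something the diagram confirms, and the function names are hypothetical rather than part of GATK.

```python
def phred_to_error_prob(q):
    """Standard Phred definition: Q = -10 * log10(e), so e = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def score_as_written(read, haplotype, quals):
    """Evaluate sum_i (b_i == h_i ? e_i : 1 - e_i / 3) - sum_i e_i.

    ASSUMPTION: e_i is the base error probability implied by the Phred
    quality at position i. The read and haplotype are taken to be already
    aligned and of equal length.
    """
    errs = [phred_to_error_prob(q) for q in quals]
    # Per-base term: e_i on a match, 1 - e_i / 3 on a mismatch.
    weighted = sum(
        e if b == h else 1.0 - e / 3.0
        for b, h, e in zip(read, haplotype, errs)
    )
    # Subtract the sum of e_i, as in the second term of the printed formula.
    return weighted - sum(errs)


if __name__ == "__main__":
    print(score_as_written("ACGT", "ACGT", [30, 30, 30, 30]))
    print(score_as_written("ACGT", "ACCT", [30, 30, 20, 30]))
```

Under this reading a read that matches the haplotype at every position scores 0, and each mismatch adds roughly 1, slightly less for low-quality bases.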
Thanks in advance! Pablo.
I am observing the following scenario at one particular SNP (C/G) using two different enrichment technologies:
(I am using IGV syntax: ALLELE|number of reads w/ allele|%of total reads|+strand reads|- strand reads)
- technology1:
C: 15 47% 15+,0-
G: 17 53% 17+,0-
- technology2:
C: 17 37% 13+,4-
G: 29 63% 26+,3-
As you can see, both technologies have good coverage of the SNP and good representation of each allele. Yet the SNP (C/G) does not get called in technology1.
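For reference, the counts above can be tabulated with a small Python sketch that recomputes the depth, the allele fractions, and the share of reads on the + strand for each technology. The dictionary layout and the function name are hypothetical, the first pair of counts is assumed to belong to technology1, and nothing here reproduces a GATK filter; it only restates the numbers.

```python
# (forward, reverse) read counts per allele, copied from the figures above.
# ASSUMPTION: the first pair of counts belongs to technology1.
counts = {
    "technology1": {"C": (15, 0), "G": (17, 0)},
    "technology2": {"C": (13, 4), "G": (26, 3)},
}


def summarize(alleles):
    """Return depth, forward-strand fraction and per-allele fraction."""
    depth = sum(fwd + rev for fwd, rev in alleles.values())
    forward = sum(fwd for fwd, _ in alleles.values())
    return {
        "depth": depth,
        "forward_fraction": forward / depth,
        "allele_fraction": {
            allele: (fwd + rev) / depth
            for allele, (fwd, rev) in alleles.items()
        },
    }


if __name__ == "__main__":
    for name, alleles in counts.items():
        print(name, summarize(alleles))
```

All 32 reads in technology1 come from the + strand, whereas technology2 has 39 of its 46 reads on the + strand.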
My questions are:
1- Does the GATK algorithm have some sort of constraint on the proportion of reads coming from only one strand (as with technology1), in order to try to predict or discard duplicates?
2- I know that the base call of a particular base is bounded by the mapping quality of its read. If my --stand_call_conf is 30 and one of the bases at this SNP position has MQ<30, does that prevent this position from being called? Or is it more that avg(MQ) has to be >30 (meaning more than one read at this position is taken into account)?
Thanks for any clarification, Gene