I have a question regarding the bayesian methods employed by the UnifiedGenotyper. I recently upgraded to v2.4-9-g532efad after running into a bug in an old version.
The context is sixty whole exome samples (human), and my pipeline performs mark-duplicate, then bq recalibration, then indel realigning, then finally UnifiedGenotyper. Each sample has some number of variants called (combined VCF output of SNP and INDEL for about 200K lines per sample), without reference calls. Later when looking at individual variants we need to know which samples were homozygous reference versus no-coverage, which is information not contained in my initial VCFs. To get the coverage at each variant I'm running a new big UnifiedGenotyper at the alleles specified by a vcf-merge of each file. This alleles file is about 4 million entries, where I have asked UG to call SNPs and reference bases too.
Now we have more complete data, and some strange things are found. Here's an excerpt from the ouput:
1 17538 . C A 710.37 .
0/1:5,1:6:27:27,0,96 0/0:5,1:6:12:0,12,156 0/1:5,1:6:26:26,0,113
The concern is that the same five reads reference, one read alternate is sometimes called Hom and sometimes Het.
At another site I've got
0/0:4,1:5:12:0,12,147 which shows a 6/1 being called hom and a 4/1 being called het!
Apologies for my lack of a well-posed question, I'm trying to get my head wrapped around this so that the next time I tell my lab a variant is discovered, we have more confidence.