I'm sequencing the genome of an organism which is a cross between the reference line (with no SNPs) and an individual from an outbred population (with many SNPs). Therefore all of the SNPs in my target organism will be heterozygous. So far I have sequenced three individuals which are crosses and one individual from our reference line.
I understand that the UnifiedGenotyper uses population genetic principles to call genotypes, but I can't find more information about how this is done. My main worry is that heterozygotes with strongly asymmetric allele counts in the reads will be called as homozygotes in order to fit, say, Hardy-Weinberg equilibrium.
Is there any chance you could enlighten me on this (or direct me to more detailed information on the UG mechanism and settings)?
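To make the worry concrete, here is a toy calculation. It is not the UnifiedGenotyper's actual model (which is exactly what I can't find documented), just a simple binomial genotype likelihood combined with a genotype prior. The error rate, the prior values, and the function names are all assumptions of mine for illustration.

```python
from math import comb

def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Binomial likelihood of the read counts under each diploid genotype."""
    n = ref_reads + alt_reads
    # Assumed probability that a single read shows the alt allele.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1.0 - error_rate}
    return {
        g: comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for g, p in p_alt.items()
    }

def call_genotype(ref_reads, alt_reads, priors, error_rate=0.01):
    """Return the genotype with the highest unnormalised posterior."""
    likes = genotype_likelihoods(ref_reads, alt_reads, error_rate)
    posterior = {g: likes[g] * priors[g] for g in likes}
    return max(posterior, key=posterior.get)

# Flat prior versus a hypothetical prior that strongly favours hom_ref.
flat = {"hom_ref": 1 / 3, "het": 1 / 3, "hom_alt": 1 / 3}
skewed = {"hom_ref": 0.998, "het": 0.001, "hom_alt": 0.001}

# 30X site with an asymmetric 25 ref / 5 alt split.
print(call_genotype(25, 5, flat))
print(call_genotype(25, 5, skewed))
```

In this toy model the 25/5 site is called "het" under the flat prior but "hom_ref" under the skewed one, which is the kind of prior-driven flip I am concerned about for my crosses.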
Just to let you know the background, my study organism is Drosophila melanogaster. The whole genome of 164Mb is paired-end sequenced on an Illumina instrument. I have so far sequenced one individual from our in-house reference line, and three individuals which are crosses of the reference line with a diverse, outbred population. Average coverage is 30X. The 'crosses' are hemiclones in which recombination between the parental chromosomes is suppressed. I plan on sequencing 200 hemiclone individuals in which one haplotype will be shared among them (the reference genome) and the other haplotype will be diverse and unique to each line. As expected, I have identified a limited number of mutations in our in-house laboratory reference line relative to the assembly.
Any advice on how to best call genotypes in this unorthodox sample would be most appreciated.
I'm aware HLACaller is no longer technically supported, but I have a question about how the HLACaller algorithm behaves on whole genome sequencing data. As noted in the readme, the developers suggest using the -minFreq option to keep rare HLA haplotypes from being spuriously called.
While that is entirely sensible, I was hoping someone could offer some insight or suggestions, or point me to references that would clarify which rare HLA alleles tend to show up frequently as false positives. The reason I ask is that I'm working on a large project with cohorts of African ancestry, so I am reluctant to exclude "rare" alleles entirely (they are likely rare in Europeans but not necessarily in Africans). I am currently planning to call the alleles with the -minFreq option in a first round, then scan for individuals with potential calling errors and redo them as a batch without the option in place.