I'm sequencing the genome of an organism which is a cross between the reference line (with no SNPs) and an individual from an outbred population (with many SNPs). Therefore all of the SNPs in my target organism will be heterozygous. So far I have sequenced three individuals which are crosses and one individual from our reference line.
I understand that the UnifiedGenotyper uses population genetic principles to ascertain genotype but I can't find more information about how this is performed. Thus, I am primarily worried that heterozygotes with strongly asymmetric allele counts in the reads will be called as homozygotes in order to fit in with, say Hard-Wienberg equilibrium.
Is there any chance you could enlighten me on this ? (or direct me to more detailed information on UG mechanism and settings).
Just to let you know the background, my study organism is Drosophila melanogaster. The whole genome of 164Mb is paired-end sequenced on an Illumina. I have so far sequenced one individual from our in-house reference line, and three individuals which are crosses of the reference line with a diverse, out-bred population. Average coverage is 30X. The 'crosses' are hemiclones in which recombination between the parental chromosomes is suppressed. I plan on sequencing 200 hemiclone individuals in which one haplotype will be shared between them (the reference gene) and the other haplotype will be diverse and unique to each line. As expected, I have identified a limited number of mutations in our in-house laboratory reference line compared to that of the assembly.
Any advice on how to best call genotypes in this unorthodox sample would be most appreciated.
I have a VCF containing 7.4m SNPs over 70 individuals from an F2 intercross, called by the UnifiedGenotyper v2.3.6. I am trying to set appropriate thresholds for filtering these SNPs. The attached plots summarise the individual calls from this data set, with depth on the x-axis, genotype quality on the y-axis and frequency of particular DP+GQ combinations shown in greyscale. The first plot shows 0/1 (heterozygote) calls, the second shows 0/0 (homozyote) calls (the 1/1 plot looks similar to the 0/0 plot).
The homozygote plot shows a clear relationship between minimum depth and maximum GQ; it is impossible to get high GQs at low depth. However, this is not the case for heterozygotes. This makes intuitive sense to me - at low depth, one cannot be sure that a call really is homozygote; perhaps the other allele simply hasn't been sequenced. But we can have more confidence in a low depth heterozygote, as both alleles have been seen.
However, I am wondering what your recommendations for best practice are here; do you recommend using the same GQ thresholds for homozygote and heterozygote calls, or different thresholds? If the same thresholds, it seems like there will be a bias at low coverage; many (quite possibly real) homozygote calls will be excluded, which will make it appear that there is an excess of heterozygosity in low coverage individuals.
Also, there seems to be a periodicity in the homozygote (but not the heterozygote) GQ values; GQ values divisible by three have a different distribution to other GQ values. I assume this doesn't affect the results too much (after all, the scale is fairly arbitrary in the first place) but I'd be interested to know what causes this, if it is known.
Thanks for your help,
I've been experiencing some apparent errors with HaplotypeCaller that I think could be related to how it chooses candidate haplotypes when performing multi-sample calling. Please see the example files I've uploaded to the server (cooketho_20130103.tar.gz). For instance if you look at position 3511 in sample 2, there are 14 non-reference reads and 0 reference reads. When HaplotypeCaller is run with just this sample, it calls this locus homozygous non-reference, which seems to me to be the correct behavior. But when run with all 14 samples, it doesn't call a SNP at this locus. Repeating the run in debug mode shows that the (immediate) cause is that there were 11 candidate haplotypes found, and not a single one of them had the non-reference allele at position 3511. Why?
I came across an earlier post that suggested in some cases increasing the
--minPruning value can be of use, but I tried this to no avail.
My organism is a plant, and is is considerably more heterozygous than human, but changing the
--heterozygosity value did not appear to help either. Double check me on this if you like.
Can you please suggest a fix, or perhaps release some documentation on how HaplotypeCaller selects candidate haplotypes?
P.S. Any idea of when the source will be released to the public, or when a more comprehensive manual will be released? Would be very helpful for figuring out what is going on in cases like this.