The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, and its ability to call indels is far superior. We recommend using HaplotypeCaller in all cases, with only a few exceptions:
In those cases, we recommend using UnifiedGenotyper instead of HaplotypeCaller.
In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.
Since version 2.0, the UnifiedGenotyper has been able to deal with ploidies other than two. Three use cases are currently supported:
To enable this feature, set the -ploidy argument to the desired number of chromosomes per organism. For pooled sequencing experiments, set this argument to the number of chromosomes per barcoded sample, i.e. (Ploidy per individual) * (Individuals in pool).
Note that all other UnifiedGenotyper arguments work in the same way.
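As a small illustration of the pooled case, the value passed to -ploidy can be computed from the two quantities named above. This is only a sketch: the helper name pool_ploidy and the example pool of 10 diploid individuals are assumptions made for illustration, not part of GATK.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK): total chromosome count
# for one barcoded pool = ploidy per individual * individuals in pool.
pool_ploidy() {
    local ploidy_per_individual=$1
    local individuals_in_pool=$2
    echo $(( ploidy_per_individual * individuals_in_pool ))
}

# Assumed example: 10 diploid individuals sequenced under one barcode.
PLOIDY=$(pool_ploidy 2 10)
echo "-ploidy ${PLOIDY}"   # prints "-ploidy 20"
```

The printed value is what you would pass to the -ploidy argument for that barcoded sample.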
A full minimal command line would look like this, for example:
    java -jar GenomeAnalysisTK.jar \
        -R reference.fasta \
        -I myReads.bam \
        -T UnifiedGenotyper \
        -ploidy 4
The -glm argument works in the same way as in the diploid case: set it to [INDEL|SNP|BOTH] to specify which variants to discover and/or genotype.
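The two arguments can be combined in a single invocation. The sketch below only assembles and prints the command so it can be reviewed before running. The function name build_ug_command is hypothetical, and the file names are the placeholders used in the example command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble a UnifiedGenotyper command line for a
# given ploidy and -glm value, rejecting values other than SNP, INDEL, BOTH.
build_ug_command() {
    local ploidy=$1
    local glm=$2
    case "$glm" in
        SNP|INDEL|BOTH) ;;
        *) echo "unsupported -glm value: $glm" >&2; return 1 ;;
    esac
    echo "java -jar GenomeAnalysisTK.jar -R reference.fasta -I myReads.bam -T UnifiedGenotyper -ploidy $ploidy -glm $glm"
}

# Print the command for a tetraploid sample, calling both SNPs and indels.
build_ug_command 4 BOTH
```

Printing rather than executing keeps the sketch safe to try without a GATK installation; paste the printed line into a shell once the paths point at real files.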
Many of these limitations will be gradually removed over time, but for now please keep these in mind.
Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.
Some annotations do not work in non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following:
AF, and Genotype annotations such as
The HaplotypeCaller and ReduceReads currently do not support non-diploid data.
In theory you can use VQSR to filter non-diploid calls, but we currently have no experience with this and therefore cannot offer support or best-practice recommendations for it.
For indels, at most 4 alleles can be genotyped. This is not relevant for the SNP case, but discovering or genotyping more than this number of indel alleles will not work, and an arbitrary set of 4 alleles will be chosen at a site.
You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
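To make this concrete, here is a rough back-of-the-envelope sketch. It assumes an idealized case in which a variant carried on exactly one chromosome copy shows up in 1/ploidy of the reads, ignoring sampling noise and any bias. The helper name singleton_fraction and the ploidy values are chosen for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expected read fraction for a variant present on
# exactly one chromosome copy, under the idealized 1/ploidy assumption.
singleton_fraction() {
    awk -v ploidy="$1" 'BEGIN { printf "%.4f\n", 1 / ploidy }'
}

# Compare a diploid, a tetraploid, and two assumed pool sizes.
for ploidy in 2 4 20 100; do
    echo "ploidy=${ploidy} expected_fraction=$(singleton_fraction "$ploidy")"
done
```

At a total ploidy of 100 the expected fraction is 0.0100, i.e. one supporting read in a hundred, which is the regime where a true variant and a sequencing error become hard to tell apart.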
Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!
This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.
It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.
Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.