In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.
As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the
-ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.
For earlier versions (all the way to 2.0) the fallback option is UnifiedGenotyper, which also accepts the
For normal organism ploidy, you just set the
-ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e.
(Ploidy per individual) * (Individuals in pool).
Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following:
AF, and Genotype annotations such as
You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!
This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.
It's available in the latest nightly builds; just use the
-ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.
Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.
We have added omniploidy support to the GVCF handling tools, with the following limitations:
When running, you need to indicate the sample ploidy that was used to generate the GVCFs with
-ploidy. As usual 2 is the default ploidy.
The system does not support mixed ploidy across samples nor positions. An error message will be thrown if you attempt to genotype GVCFs that have a mixture, or that have some genotype whose ploidy does not match the
As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.
I have been trying to use CombinegVCFs on gVCF file produce by HaplotypeCaller in GVCF mode. The output VCF file doesn't seem to have any data in the genotype field: (just a dot)
chr1 95849 . T
This is the command I used:
java -Xmx45g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/gatk/GenomeAnalysisTK.jar -R /home/genetics/strep_equi/strep_equi.fasta -T CombineGVCFs -V 17-1-2-5_NL_S12_L001_R1_001.gVCF -V 17-1-2-6_NL_S9_L001_R1_001.gVCF -o combined_1a.vcf -S STRICT
Just wanted to confirm.. I have a data from 4 spores of a yeast (haploid) tetrad.. If I want to call out variants using all 4 spores (4 bam files), do I need to set -ploidy as 1 or as 4 (
Number of samples in each pool * Sample Ploidy) ??
This is the code I am using:
java -d64 -Xms1g -Xmx4g -jar GenomeAnalysisTK.jar -glm SNP -nt 52 -R genome.fasta -T UnifiedGenotyper -I $basename"_A.realigned.bam" -I $basename"_B.realigned.bam" -I $basename"_C.realigned.bam" -I $basename"_D.realigned.bam" -ploidy 4 -o $basename.snps.vcf -stand_call_conf 25.0 -stand_emit_conf 10.0
I'm working with SNP calling in a bacterium - I don't have a set of known SNPs, so prior to recalibration, so generate a mask file from all data. My question is, what should I put for the following two options:
because the bacterium is haploid (and I specify --ploidy 1) it seems like these options should be set to zero, as there are no heterozygous loci, but I worry that if I set them to zero, that it won't work as expected. Any advice? I was just setting them to 0.001
Does the GATK team have any recommendations for filtering SNP data for haploid genomes? Our team works with microbial eukaryotes, both haploid and diploid and we have used the GATK v3 best practices for filtering for the latter. [VQSR was not possible, since we do not have access to a truth/high confidence SNP set.]
Dear GATK team,
I know that in the past GATK was not suitable for haploid genomes. I wanted to ask if this possibly changed since then - and whether it is possible to use GATK for haploid genomes.
Thanks a lot, Gilgi