Although we generally recommend using the HaplotypeCaller for calling variants, in some cases it is not possible to do so, as explained in the FAQs. In those cases (i.e. when you're processing a high number of samples together, working with non-diploid organisms or with pooled samples) you should use the UnifiedGenotyper instead.
The Unified Genotyper calls SNPs and indels separately by considering each variant locus independently. The model it uses to do so has been generalized to work with data from organisms of any ploidy.
The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, and its ability to call indels is far superior. We recommend using HaplotypeCaller in all cases, with only a few exceptions:
In those cases, we recommend using UnifiedGenotyper instead of HaplotypeCaller.
In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.
Since version 2.0, the UnifiedGenotyper has been able to deal with ploidies other than two. Three use cases are currently supported:
In order to enable this feature, you need to set the
-ploidy argument to desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e.
(Ploidy per individual) * (Individuals in pool).
Note that all other UnifiedGenotyper arguments work in the same way.
A full minimal command line would look for example like
java -jar GenomeAnalysisTK.jar \ -R reference.fasta \ -I myReads.bam \ -T UnifiedGenotyper \ -ploidy 4
glm argument works in the same way as in the diploid case - set to
[INDEL|SNP|BOTH] to specify which variants to discover and/or genotype.
Many of these limitations will be gradually removed over time, but for now please keep these in mind.
Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.
Some annotations do not work in non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following:
AF, and Genotype annotations such as
The HaplotypeCaller and ReduceReads currently do not support non-diploid data.
In theory you can use VQSR to filter non-diploid calls, but we currently have no experience with this and therefore cannot offer any support nor best practices on how to do this.
For indels, only a maximum of 4 alleles can be genotyped. This is not relevant for the SNP case, but discovering or genotyping more than this number of indel alleles will not work and an arbitrary set of 4 alleles will be chosen at a site.
You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
I using the UnifiedGenotyper to call SNPs in a pooled sample of 30 diploid individuals (i.e., I am setting the ploidy to 60). Does this mean that if the coverage is < 60 at a given variant site, the vcf file will read "./." for all alleles at that site? In other words, does it require the coverage to be >= the ploidy or it won't produce a called variant at that site? I'm just trying to make sure I am interpreting the vcf file correctly, in that: if there is a called genotype at a given variant site that I can interpret that as the estimated allele frequency in that pool.
Thanks in advance for any advice.
Hi all, I am running a scala script, and would like to the include "-ploidy 1". Any advice on how can I do this?
Some information that may or may not be relevant (idk!):
I am attempting this using UnifiedGenotyper (v2.7-4). At the top of the script I have: package org.broadinstitute.sting.queue.qscripts.examples import org.broadinstitute.sting.queue.QScript import org.broadinstitute.sting.queue.extensions.gatk._
I got the glm both to work using: genotyper.glm = org.broadinstitute.sting.gatk.walkers.genotyper.GenotypeLikelihoodsCalculationModel.Model.BOTH
I was hoping for something similar for ploidy.
Thank you for GATK, which I am learning to use and already considering a great tool. I am trying to analyze a whole genome shotgun dataset from Plasmodium species and trying to decide what ploidy I should use. My dataset comes from a human blood sample. Plasmodia have a brief sexual stage in the mosquito (diploid phase), but then reproduce asexually in the blood of the mammal host (haploid phase). Of course, if the diploid organisms was heterozygous for a certain trait, we are going to have a heterozygous population in the mammalian host. Also, multiclonal infections are possible, it means that more than 2 alleles for the same trait may be present in the blood.
I am sorry for the novice question, what would you suggest me regarding the ploidy parameter?
Thanks for your time and attention, Kind Regards,
Just wanted to confirm.. I have a data from 4 spores of a yeast (haploid) tetrad.. If I want to call out variants using all 4 spores (4 bam files), do I need to set -ploidy as 1 or as 4 (
Number of samples in each pool * Sample Ploidy) ??
This is the code I am using:
java -d64 -Xms1g -Xmx4g -jar GenomeAnalysisTK.jar -glm SNP -nt 52 -R genome.fasta -T UnifiedGenotyper -I $basename"_A.realigned.bam" -I $basename"_B.realigned.bam" -I $basename"_C.realigned.bam" -I $basename"_D.realigned.bam" -ploidy 4 -o $basename.snps.vcf -stand_call_conf 25.0 -stand_emit_conf 10.0
I have three pools of hiseq data where N=20,20, & 40 individuals per pool. I sequenced to ~10x depth for each pool. I would like to use the ploidy setting to estimate probable genotypes from each pool, but I'm torn because it doesn't seem correct to estimate 20 or 40 genotypes from pools that have only been sequenced to 10x depth (only 10 chromosomes could have been sampled). So in such a case, would it be more advisable to set the ploidy level to the depth level of 10 and estimate 5 genotypes per pool?
Thank you very much!
Thank you for developing a great set of tools and adding in the ploidy option to the UnifiedGenotyper! My question is in regards to the ploidy argument when calling multiple pooled samples together. If I have pooled samples from the same population at different time points and want to call SNPs with multiple sample awareness, do I use the ploidy of one sample or that of all samples. For example, if I have pooled samples of approximately 25 haploid genomes each from three different time points from the same population should I use 25 or 75 as my ploidy?
I just want to know what you mean by "-ploidy" in UnifiedGenotyper.
--sample_ploidy / -ploidy ( int with default value 2 ) Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
I am working on an inbred (homozygous) strain of mice. I have only one sample. Should I use number 2 here that represents the diploid nature of the genome. I am confused as it says Ploidy (number of chromosomes) per sample.
I am experiencing a problem with the ploidy 1 option. Having used GATK2 unified genotyper with the params --sample_ploidy 1 --genotype_likelihoods_model BOTH -rf BadCigar I get the following line in a vcf file (see sample in bold)
Staphylococcus 1553115 . A G 24454.01 . AC=13;AF=0.813;AN=16;BaseQRankSum=1.072;DP=1040;Dels=0.00;FS=32.822;HaplotypeScore=3.3463;MLEAC=13;MLEAF=0.813;MQ=40.20;MQ0=47;MQRankSum=-10.543;QD=32.13;ReadPosRankSum=-1.148;SB=-9.076e+03 GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1:0,29:29:99:1:1.00:1015,0 1:0,62:62:99:1:1.00:2053,0 1:0,106:106:99:1:1.00:3210,0 1:0,102:102:99:1:1.00:3305,0 1:0,88:88:99:1:1.00:2750,0 1:0,41:41:99:1:1.00:1324,0 1:0,76:76:99:1:1.00:2448,0 1:0,39:39:99:1:1.00:1303,0 0:64,40:104:99:0:0.00:0,1334 1:0,41:41:99:1:1.00:1373,0 1:0,49:49:99:1:1.00:1668,0 0:72,50:122:99:0:0.00:0,1258 1:0,59:59:99:1:1.00:1852,0 1:0,38:38:99:1:1.00:1192,0 1:0,31:31:99:1:1.00:961,0 0:53,0:53:99:0:0.00:0,1633
The sample in bold is called as WT (genotype 0) with a high GQ despite there being 72 reads of genotype 0 and 50 of genotype 1. Examining the bam file suggests that this is a mapping error in a repetitive phage region
If I set ploidy to be 2 the equivalent line in the resulting vcf file is
Staphylococcus 1553115 . A G 24788.02 . AC=28;AF=0.875;AN=32;BaseQRankSum=0.947;DP=1040;Dels=0.00;FS=32.822;HaplotypeScore=3.3463;InbreedingCoeff=0.4286;MLEAC=28;MLEAF=0.875;MQ=40.20;MQ0=47;MQRankSum=-10.096;QD=25.11;ReadPosRankSum=-1.177;SB=-9.871e+03 GT:AD:DP:GQ:PL 1/1:0,29:29:81:986,81,0 1/1:0,62:62:99:1895,156,0 1/1:0,106:106:99:2992,247,0 1/1:0,102:102:99:3169,268,0 1/1:0,88:88:99:2452,193,0 1/1:0,41:41:99:1243,99,0 1/1:0,76:76:99:2283,193,0 1/1:0,39:39:99:1233,105,0 0/1:64,40:104:99:886,0,1706 1/1:0,41:41:99:1298,108,0 1/1:0,49:49:99:1581,129,0 0/1:72,50:122:99:1235,0,2126 1/1:0,59:59:99:1649,132,0 1/1:0,38:38:87:1065,87,0 1/1:0,31:31:69:821,69,0 0/0:53,0:53:99:0,138,1588
As can be seen from the bold text, the same position is called as heterozygote which based on the number of the reads mapping would be likley except for the fact this is a bacterial haploid genome. Previously I would have discarded this since the heterozygous call indicates mis-mapping as the bam file confirms. I had been hoping to use the sample_polidy option set to 1 for bacterial genomes but this result concerns me. I could obviously filter based on AD but the wonder why the sample was given a high GQ when the polidy is set to 1 and the AD suggests the call of genotype 0 should be doubted.
Any suggestions on what is going on here?? Many thanks
Does the GATK team have any recommendations for filtering SNP data for haploid genomes? Our team works with microbial eukaryotes, both haploid and diploid and we have used the GATK v3 best practices for filtering for the latter. [VQSR was not possible, since we do not have access to a truth/high confidence SNP set.]
Hello, I am running the variant caller to identify SNPs and Reference Calls in a bacterial genome, which means I am running with -ploidy 1, -glm POOLSNP and -prnm POOL as suggested in other regions of this forum. The tool does an excellent job when just looking for Variants, but when I attempt to EMIT_ALL_CONFIDENT_SITES, it just spits out the SNPs and not the reference calls. When I remove the arguments stating that it is ploidy of 1, it works fine but calls SNPs that shouldn't be there since it's assuming diploid. Thus, I would really like to be able to emit all sites in ploidy=1 mode. Any reason why this is not possible? Thanks for you help! John
Leishmania has 36 chromosomes but their copy number is unpredictable for each strain and chromosome copy number can change very quickly. So what is an optimal ploidy setting for organisms with extensive aneuploidy? So far we use just diploid setting. Some samples have consistently more heterozygous SNPs in higher copy chromosomes but this relationship does not hold in many other samples: there is no strong correlation between chromosome copy number and abundance of heterozygous SNPs.