New in GATK 2.0 is the capability of UnifiedGenotyper to natively call non-diploid organisms. Three use cases are currently supported:
In order to enable this feature, users need to set the
-ploidy argument to desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e.
(Ploidy per individual) * (Individuals in pool).
Note that all other UnifiedGenotyper arguments work in the same way.
A full minimal command line would look for example like
java -jar GenomeAnalysisTK.jar \ -R reference.fasta \ -I myReads.bam \ -T UnifiedGenotyper \ -ploidy 3
glm argument works in the same way as in the diploid case - set to
[INDEL|SNP|BOTH] to specify which variants to discover and/or genotype.
Many of these limitations will be gradually removed in the following weeks as we iron out details and fix issues in the GATK 2.0 beta.
Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.
Some annotations do not work in non-diploid cases. In particular, current InbreedingCoeff is omitted. Annotations which do work and are supported in non-diploid use cases are the following:
AF and Genotype annotations such as
The interaction between non-diploid calling and other experimental tools like HaplotypeCaller or ReduceReads is currently not supported.
Whereas it's entirely possible to use VQSR to filter non-diploid calls, we currently have no experience with this and can hence offer no support nor best practices for this.
Only a maximum of 4 alleles can be genotyped. This is not relevant for the SNP case, but discovering or genotyping more than this number of indel alleles will not work and an arbitrary set of 4 alleles will be chosen at a site.
Users should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
Leishmania has 36 chromosomes but their copy number is unpredictable for each strain and chromosome copy number can change very quickly. So what is an optimal ploidy setting for organisms with extensive aneuploidy? So far we use just diploid setting. Some samples have consistently more heterozygous SNPs in higher copy chromosomes but this relationship does not hold in many other samples: there is no strong correlation between chromosome copy number and abundance of heterozygous SNPs.
Hello, I am running the variant caller to identify SNPs and Reference Calls in a bacterial genome, which means I am running with -ploidy 1, -glm POOLSNP and -prnm POOL as suggested in other regions of this forum. The tool does an excellent job when just looking for Variants, but when I attempt to EMIT_ALL_CONFIDENT_SITES, it just spits out the SNPs and not the reference calls. When I remove the arguments stating that it is ploidy of 1, it works fine but calls SNPs that shouldn't be there since it's assuming diploid. Thus, I would really like to be able to emit all sites in ploidy=1 mode. Any reason why this is not possible? Thanks for you help! John
I'm trying to run HaplotypeCaller on a haploid organism. Is this possible? What argument should I use for this? My first attempt produced a diploid calls.
Sorry for the silly question
I just want to know what you mean by "-ploidy" in UnifiedGenotyper.
--sample_ploidy / -ploidy ( int with default value 2 ) Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
I am working on an inbred (homozygous) strain of mice. I have only one sample. Should I use number 2 here that represents the diploid nature of the genome. I am confused as it says Ploidy (number of chromosomes) per sample.
Does the GATK team have any recommendations for filtering SNP data for haploid genomes? Our team works with microbial eukaryotes, both haploid and diploid and we have used the GATK v3 best practices for filtering for the latter. [VQSR was not possible, since we do not have access to a truth/high confidence SNP set.]
I am experiencing a problem with the ploidy 1 option. Having used GATK2 unified genotyper with the params --sample_ploidy 1 --genotype_likelihoods_model BOTH -rf BadCigar I get the following line in a vcf file (see sample in bold)
Staphylococcus 1553115 . A G 24454.01 . AC=13;AF=0.813;AN=16;BaseQRankSum=1.072;DP=1040;Dels=0.00;FS=32.822;HaplotypeScore=3.3463;MLEAC=13;MLEAF=0.813;MQ=40.20;MQ0=47;MQRankSum=-10.543;QD=32.13;ReadPosRankSum=-1.148;SB=-9.076e+03 GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1:0,29:29:99:1:1.00:1015,0 1:0,62:62:99:1:1.00:2053,0 1:0,106:106:99:1:1.00:3210,0 1:0,102:102:99:1:1.00:3305,0 1:0,88:88:99:1:1.00:2750,0 1:0,41:41:99:1:1.00:1324,0 1:0,76:76:99:1:1.00:2448,0 1:0,39:39:99:1:1.00:1303,0 0:64,40:104:99:0:0.00:0,1334 1:0,41:41:99:1:1.00:1373,0 1:0,49:49:99:1:1.00:1668,0 0:72,50:122:99:0:0.00:0,1258 1:0,59:59:99:1:1.00:1852,0 1:0,38:38:99:1:1.00:1192,0 1:0,31:31:99:1:1.00:961,0 0:53,0:53:99:0:0.00:0,1633
The sample in bold is called as WT (genotype 0) with a high GQ despite there being 72 reads of genotype 0 and 50 of genotype 1. Examining the bam file suggests that this is a mapping error in a repetitive phage region
If I set ploidy to be 2 the equivalent line in the resulting vcf file is
Staphylococcus 1553115 . A G 24788.02 . AC=28;AF=0.875;AN=32;BaseQRankSum=0.947;DP=1040;Dels=0.00;FS=32.822;HaplotypeScore=3.3463;InbreedingCoeff=0.4286;MLEAC=28;MLEAF=0.875;MQ=40.20;MQ0=47;MQRankSum=-10.096;QD=25.11;ReadPosRankSum=-1.177;SB=-9.871e+03 GT:AD:DP:GQ:PL 1/1:0,29:29:81:986,81,0 1/1:0,62:62:99:1895,156,0 1/1:0,106:106:99:2992,247,0 1/1:0,102:102:99:3169,268,0 1/1:0,88:88:99:2452,193,0 1/1:0,41:41:99:1243,99,0 1/1:0,76:76:99:2283,193,0 1/1:0,39:39:99:1233,105,0 0/1:64,40:104:99:886,0,1706 1/1:0,41:41:99:1298,108,0 1/1:0,49:49:99:1581,129,0 0/1:72,50:122:99:1235,0,2126 1/1:0,59:59:99:1649,132,0 1/1:0,38:38:87:1065,87,0 1/1:0,31:31:69:821,69,0 0/0:53,0:53:99:0,138,1588
As can be seen from the bold text, the same position is called as heterozygote which based on the number of the reads mapping would be likley except for the fact this is a bacterial haploid genome. Previously I would have discarded this since the heterozygous call indicates mis-mapping as the bam file confirms. I had been hoping to use the sample_polidy option set to 1 for bacterial genomes but this result concerns me. I could obviously filter based on AD but the wonder why the sample was given a high GQ when the polidy is set to 1 and the AD suggests the call of genotype 0 should be doubted.
Any suggestions on what is going on here?? Many thanks
Thank you for developing a great set of tools and adding in the ploidy option to the UnifiedGenotyper! My question is in regards to the ploidy argument when calling multiple pooled samples together. If I have pooled samples from the same population at different time points and want to call SNPs with multiple sample awareness, do I use the ploidy of one sample or that of all samples. For example, if I have pooled samples of approximately 25 haploid genomes each from three different time points from the same population should I use 25 or 75 as my ploidy?