The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also comprises several unique functionalities such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.
As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.
Caveats for older versions
If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:
In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.
As of version 3.3, the HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the
-ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.
For earlier versions (all the way to 2.0) the fallback option is UnifiedGenotyper, which also accepts the
For normal organism ploidy, you just set the
-ploidy argument to the desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e.
(Ploidy per individual) * (Individuals in pool).
Several variant annotations are not appropriate for use with non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following:
AF, and Genotype annotations such as
You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
GATK 3.3 was released on October 23, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.
--sample_nameargument. This is a shortcut for people who have multi-sample BAMs but would like to use
-ERC GVCFmode with a particular one of those samples.
--ignore_all_filtersoption. If specified, the variant recalibrator will ignore all input filters and treat sites as unfiltered.
--keepOriginalAC functionalityin SelectVariants to work for sites that lose alleles in the selection.
read_grouparguments no longer appear in the header.
--bcffor VCF files, and
--generate_md5for BAM files moved to the engine level.
Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!
This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.
It's available in the latest nightly builds; just use the
-ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.
Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.
We have added omniploidy support to the GVCF handling tools, with the following limitations:
When running, you need to indicate the sample ploidy that was used to generate the GVCFs with
-ploidy. As usual 2 is the default ploidy.
The system does not support mixed ploidy across samples nor positions. An error message will be thrown if you attempt to genotype GVCFs that have a mixture, or that have some genotype whose ploidy does not match the
As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.
Hi, I'm running on the Unified Genotyper (Version=3.3-0-g37228af) on a pooled sample. The ploidy is set to 32. I'm trying to get allele frequency information. I'm trying to filter sites based on depth of coverage but the DP flag is not consistent within a loci
groupIV 756 . C T 141.24 . AC=11;AF=0.344;AN=32;BaseQRankSum=-2.927;DP=37;Dels=0.65;FS=0.000;HaplotypeScore=166.3715;MLEAC=11;MLEAF=0.344;MQ=86.28;MQ0=0;MQRankSum=-0.696;QD=3.82;ReadPosRankSum=-2.927;SOR=1.022 GT:AD:DP:GQ:PL 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1/1/1/1/1/1/1:8,5:13:0:155,39,25,17,12,9,6,4,2,1,1,0,0,0,0,1,2,2,4,5,7,9,11,14,17,21,25,31,38,47,60,2147483647,2147483647
The DP at the start says there is a depth of 37, but at the end it says its 13. Judging from neighbouring sites, the actual depth seems to be 37, but I'm confused why it seems to only be using 13 of them. At other sites the scores match.
I have a number of samples that consist of multiple individuals from the same population pooled together, and have been truing to use HaplotypeCaller to call the variants. I have set the (ploidy to 2 * number of individuals) but keep getting the same or similar error message, after running for several hours or days:
and I'm not sure what I can do to rectify it... Obviously I can't limit the ploidy, it is what it is, and I thought that HC only allows a maximum of six alleles anyway?
My code is below, and any help would be appreciated.
java -Xmx24g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 6 \ -R ~/my_ref_sequence \ --intervals ~/my_intervals_file \ -ploidy 180 \ -log my_log_file \ -I ~/my_input_bam \ -o ~/my_output_vcf
Hello. I've run Haplotype caller for 19 samples in
-ploidy 19 (19 is as high as I can get without getting an error for that, but that is a discussion opened elsewhere). Then, I've tried GenotypeGVCFs and get an ERROR (both command and ERROR below).
Just to add, I've run both HaplotypeCaller and GenotypeGVCFs with and without
-maxAltAlleles 4 to try to limit the referred combination; however, the output is exactly the same ERROR message at the end of GenotypeGVCFs. Do you think there might be a bug or am I doing something wrong?
java -Djava.io.tmpdir=$mytmp -Xmx28g -jar GenomeAnalysisTK.jar -R $ref -T GenotypeGVCFs -o $gVCF.ploidy.vcf -V ploidy.list -ploidy 19 -maxAltAlleles 4
ERROR MESSAGE: the combination of ploidy (19) and number of alleles (21) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyz e this locus
Dear GATK team, We want to know your recommendation as to what variant caller to use if the number of individuals (i.e. ploidy) is unknown, it can be one or many. Our bam files have human amplicons. Any help would be appreciated. Thanks.
I am trying to use the HaplotypeCaller with a ploidy setting of 1 to genotype sex chromosomes in a single individual using the following command:
java -Xmx4g -jar ~/applications/GenomeAnalysisTK-3.2-2.jar -T HaplotypeCaller -R /data/genome_references/Hmel1-1_primaryScaffolds_mtDNA.fasta -L Z_chromosome.intervals -ploidy 1 -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -I realigned09-201_Hel_par_ser_HEL_4_CGATGT_L005.bam -hets 0.01 -indelHeterozygosity 0.01 -out_mode EMIT_ALL_CONFIDENT_SITES -maxAltAlleles 1 -o realigned09-201_Hel_par_ser_HEL_4_CGATGT_L005.HC_sex.vcf
This is failing. Error message at the bottom.
Setting ploidy to 2 works fine.
The following command with the UnifiedGeonotyper is also fine:
java -Xmx4g -jar ~/applications/GenomeAnalysisTK-3.2-2.jar -T UnifiedGenotyper -R /data/genome_references/Hmel1-1_primaryScaffolds_mtDNA.fasta -ploidy 1 -L Z_chromosome.intervals -I realigned09-201_Hel_par_ser_HEL_4_CGATGT_L005.bam -glm BOTH -hets 0.01 -indelHeterozygosity 0.01 -out_mode EMIT_ALL_CONFIDENT_SITES -o realigned09-201_Hel_par_ser_HEL_4_CGATGT_L005.HC_sex.vcf
Any ideas what the problem is?
INFO 12:42:20,180 AFCalcFactory - Requested ploidy 1 maxAltAlleles 6 not supported by requested model EXACT_INDEPENDENT looking for an option INFO 12:42:20,181 AFCalcFactory - Selecting model EXACT_GENERAL_PLOIDY INFO 12:42:21,550 GATKRunReport - Uploaded run statistics report to AWS S3
java.lang.NullPointerException at org.broadinstitute.gatk.tools.walkers.genotyper.afcalc.GeneralPloidyExactAFCalc.assignGenotype(GeneralPloidyExactAFCalc.java:559) at org.broadinstitute.gatk.tools.walkers.genotyper.afcalc.GeneralPloidyExactAFCalc.subsetAlleles(GeneralPloidyExactAFCalc.java:527) at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:286) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:335) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:320) at org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine.calculateGenotypes(UnifiedGenotypingEngine.java:310) at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.isActive(HaplotypeCaller.java:844) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.addIsActiveResult(TraverseActiveRegions.java:618) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.access$800(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$ActiveRegionIterator.hasNext(TraverseActiveRegions.java:378) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:268) at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:273) at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
Is GATK Unified genotyper able to call multi-allelic positions in a single pooled sample? Case is a pool of 13 samples, we use UG with ploidy set to 26. If I understand the supplementaries of the original publications correct, UG will never be able to call three alleles at a single position. in single sample calling. Or does this not hold for high ploidy analysis?
If needed, we can call multiple pools together, but this becomes computationally intensive.
In summary, we would like to call a 14xG,6xA,6xT call for example.
Also, how does UG take noise into account when genotyping (sequencing errors), when for example 3% of reads is aberrant at a position, this could correspond to ~ 1/26.
Thanks for any guidelines,
Hello! I have RNAseq data from 40 insects with a haplodiploid sex determination system. While for pupae and adults I know the sex of the individuals, I would like to determine the sex of the younger life stages. I believe there should be a way to use SNP variants to determine whether a particular library is from a haploid (male) or diploid (female) individual, but I don't know how the GATK tools work and what would be most appropriate. I would appreciate any advice!
I am working on population samples from fungi. I am aware that the Haplotype Caller is specific for diploid genomes and the UnifiedGenotyper is preferred for population samples. According to the best practices, I would what like to know what -ploidy should be specified while calling variants of population samples?
I using the UnifiedGenotyper to call SNPs in a pooled sample of 30 diploid individuals (i.e., I am setting the ploidy to 60). Does this mean that if the coverage is < 60 at a given variant site, the vcf file will read "./." for all alleles at that site? In other words, does it require the coverage to be >= the ploidy or it won't produce a called variant at that site? I'm just trying to make sure I am interpreting the vcf file correctly, in that: if there is a called genotype at a given variant site that I can interpret that as the estimated allele frequency in that pool.
Thanks in advance for any advice.
Hi all, I am running a scala script, and would like to the include "-ploidy 1". Any advice on how can I do this?
Some information that may or may not be relevant (idk!):
I am attempting this using UnifiedGenotyper (v2.7-4). At the top of the script I have: package org.broadinstitute.sting.queue.qscripts.examples import org.broadinstitute.sting.queue.QScript import org.broadinstitute.sting.queue.extensions.gatk._
I got the glm both to work using: genotyper.glm = org.broadinstitute.sting.gatk.walkers.genotyper.GenotypeLikelihoodsCalculationModel.Model.BOTH
I was hoping for something similar for ploidy.
Thank you for GATK, which I am learning to use and already considering a great tool. I am trying to analyze a whole genome shotgun dataset from Plasmodium species and trying to decide what ploidy I should use. My dataset comes from a human blood sample. Plasmodia have a brief sexual stage in the mosquito (diploid phase), but then reproduce asexually in the blood of the mammal host (haploid phase). Of course, if the diploid organisms was heterozygous for a certain trait, we are going to have a heterozygous population in the mammalian host. Also, multiclonal infections are possible, it means that more than 2 alleles for the same trait may be present in the blood.
I am sorry for the novice question, what would you suggest me regarding the ploidy parameter?
Thanks for your time and attention, Kind Regards,
Just wanted to confirm.. I have a data from 4 spores of a yeast (haploid) tetrad.. If I want to call out variants using all 4 spores (4 bam files), do I need to set -ploidy as 1 or as 4 (
Number of samples in each pool * Sample Ploidy) ??
This is the code I am using:
java -d64 -Xms1g -Xmx4g -jar GenomeAnalysisTK.jar -glm SNP -nt 52 -R genome.fasta -T UnifiedGenotyper -I $basename"_A.realigned.bam" -I $basename"_B.realigned.bam" -I $basename"_C.realigned.bam" -I $basename"_D.realigned.bam" -ploidy 4 -o $basename.snps.vcf -stand_call_conf 25.0 -stand_emit_conf 10.0
I have three pools of hiseq data where N=20,20, & 40 individuals per pool. I sequenced to ~10x depth for each pool. I would like to use the ploidy setting to estimate probable genotypes from each pool, but I'm torn because it doesn't seem correct to estimate 20 or 40 genotypes from pools that have only been sequenced to 10x depth (only 10 chromosomes could have been sampled). So in such a case, would it be more advisable to set the ploidy level to the depth level of 10 and estimate 5 genotypes per pool?
Thank you very much!
Thank you for developing a great set of tools and adding in the ploidy option to the UnifiedGenotyper! My question is in regards to the ploidy argument when calling multiple pooled samples together. If I have pooled samples from the same population at different time points and want to call SNPs with multiple sample awareness, do I use the ploidy of one sample or that of all samples. For example, if I have pooled samples of approximately 25 haploid genomes each from three different time points from the same population should I use 25 or 75 as my ploidy?
I just want to know what you mean by "-ploidy" in UnifiedGenotyper.
--sample_ploidy / -ploidy ( int with default value 2 ) Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
I am working on an inbred (homozygous) strain of mice. I have only one sample. Should I use number 2 here that represents the diploid nature of the genome. I am confused as it says Ploidy (number of chromosomes) per sample.
I am experiencing a problem with the ploidy 1 option. Having used GATK2 unified genotyper with the params --sample_ploidy 1 --genotype_likelihoods_model BOTH -rf BadCigar I get the following line in a vcf file (see sample in bold)
Staphylococcus 1553115 . A G 24454.01 . AC=13;AF=0.813;AN=16;BaseQRankSum=1.072;DP=1040;Dels=0.00;FS=32.822;HaplotypeScore=3.3463;MLEAC=13;MLEAF=0.813;MQ=40.20;MQ0=47;MQRankSum=-10.543;QD=32.13;ReadPosRankSum=-1.148;SB=-9.076e+03 GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1:0,29:29:99:1:1.00:1015,0 1:0,62:62:99:1:1.00:2053,0 1:0,106:106:99:1:1.00:3210,0 1:0,102:102:99:1:1.00:3305,0 1:0,88:88:99:1:1.00:2750,0 1:0,41:41:99:1:1.00:1324,0 1:0,76:76:99:1:1.00:2448,0 1:0,39:39:99:1:1.00:1303,0 0:64,40:104:99:0:0.00:0,1334 1:0,41:41:99:1:1.00:1373,0 1:0,49:49:99:1:1.00:1668,0 0:72,50:122:99:0:0.00:0,1258 1:0,59:59:99:1:1.00:1852,0 1:0,38:38:99:1:1.00:1192,0 1:0,31:31:99:1:1.00:961,0 0:53,0:53:99:0:0.00:0,1633
The sample in bold is called as WT (genotype 0) with a high GQ despite there being 72 reads of genotype 0 and 50 of genotype 1. Examining the bam file suggests that this is a mapping error in a repetitive phage region
If I set ploidy to be 2 the equivalent line in the resulting vcf file is
Staphylococcus 1553115 . A G 24788.02 . AC=28;AF=0.875;AN=32;BaseQRankSum=0.947;DP=1040;Dels=0.00;FS=32.822;HaplotypeScore=3.3463;InbreedingCoeff=0.4286;MLEAC=28;MLEAF=0.875;MQ=40.20;MQ0=47;MQRankSum=-10.096;QD=25.11;ReadPosRankSum=-1.177;SB=-9.871e+03 GT:AD:DP:GQ:PL 1/1:0,29:29:81:986,81,0 1/1:0,62:62:99:1895,156,0 1/1:0,106:106:99:2992,247,0 1/1:0,102:102:99:3169,268,0 1/1:0,88:88:99:2452,193,0 1/1:0,41:41:99:1243,99,0 1/1:0,76:76:99:2283,193,0 1/1:0,39:39:99:1233,105,0 0/1:64,40:104:99:886,0,1706 1/1:0,41:41:99:1298,108,0 1/1:0,49:49:99:1581,129,0 0/1:72,50:122:99:1235,0,2126 1/1:0,59:59:99:1649,132,0 1/1:0,38:38:87:1065,87,0 1/1:0,31:31:69:821,69,0 0/0:53,0:53:99:0,138,1588
As can be seen from the bold text, the same position is called as heterozygote which based on the number of the reads mapping would be likley except for the fact this is a bacterial haploid genome. Previously I would have discarded this since the heterozygous call indicates mis-mapping as the bam file confirms. I had been hoping to use the sample_polidy option set to 1 for bacterial genomes but this result concerns me. I could obviously filter based on AD but the wonder why the sample was given a high GQ when the polidy is set to 1 and the AD suggests the call of genotype 0 should be doubted.
Any suggestions on what is going on here?? Many thanks
Does the GATK team have any recommendations for filtering SNP data for haploid genomes? Our team works with microbial eukaryotes, both haploid and diploid and we have used the GATK v3 best practices for filtering for the latter. [VQSR was not possible, since we do not have access to a truth/high confidence SNP set.]
Hello, I am running the variant caller to identify SNPs and Reference Calls in a bacterial genome, which means I am running with -ploidy 1, -glm POOLSNP and -prnm POOL as suggested in other regions of this forum. The tool does an excellent job when just looking for Variants, but when I attempt to EMIT_ALL_CONFIDENT_SITES, it just spits out the SNPs and not the reference calls. When I remove the arguments stating that it is ploidy of 1, it works fine but calls SNPs that shouldn't be there since it's assuming diploid. Thus, I would really like to be able to emit all sites in ploidy=1 mode. Any reason why this is not possible? Thanks for you help! John
Leishmania has 36 chromosomes but their copy number is unpredictable for each strain and chromosome copy number can change very quickly. So what is an optimal ploidy setting for organisms with extensive aneuploidy? So far we use just diploid setting. Some samples have consistently more heterozygous SNPs in higher copy chromosomes but this relationship does not hold in many other samples: there is no strong correlation between chromosome copy number and abundance of heterozygous SNPs.