Call variants on a haploid genome with the UnifiedGenotyper, producing a raw (unfiltered) VCF.
If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.
In its basic use, this is the ploidy (number of chromosomes) per sample. By default it is set to 2, to process diploid organisms' genomes, but it can be set to any other desired value, starting at 1 for haploid organisms, and up for polyploids. This argument can also be used to handle pooled data. For that purpose, you'll need to set
Number of samples in each pool * Sample Ploidy. There is no fixed upper limit, but keep in mind that high-level ploidy will increase processing times since the calculations involved are more complex. For full details on how to process pooled data, see Eran et al. (in preparation).
This is the model that the program will use to calculate the genotype likelihoods. By default, it is set to
SNP, but it can also be set to
BOTH. If set to
BOTH, both SNPs and Indels will be called in the same run and be output to the same variants file.
This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.
This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.
The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R haploid_reference.fa \ -I haploid_reads.bam \ -L 20 \ -ploidy 1 --glm BOTH \ --stand\_call\_conf 30 \ --stand\_emit\_conf 10 \ -o raw_haploid_variants.vcf
This creates a VCF file called
raw_haploid_variants.vcf, containing all the sites that the UnifiedGenotyper evaluated to be potentially variant.
Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.
In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.
Since version 2.0, the UnifiedGenotyper has been able to deal with ploidies other than two. Three use cases are currently supported:
In order to enable this feature, you need to set the
-ploidy argument to desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e.
(Ploidy per individual) * (Individuals in pool).
Note that all other UnifiedGenotyper arguments work in the same way.
A full minimal command line would look for example like
java -jar GenomeAnalysisTK.jar \ -R reference.fasta \ -I myReads.bam \ -T UnifiedGenotyper \ -ploidy 4
glm argument works in the same way as in the diploid case - set to
[INDEL|SNP|BOTH] to specify which variants to discover and/or genotype.
Many of these limitations will be gradually removed over time, but for now please keep these in mind.
Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.
Some annotations do not work in non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following:
AF, and Genotype annotations such as
The HaplotypeCaller and ReduceReads currently do not support non-diploid data.
In theory you can use VQSR to filter non-diploid calls, but we currently have no experience with this and therefore cannot offer any support nor best practices on how to do this.
For indels, only a maximum of 4 alleles can be genotyped. This is not relevant for the SNP case, but discovering or genotyping more than this number of indel alleles will not work and an arbitrary set of 4 alleles will be chosen at a site.
You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
Just wanted to confirm.. I have a data from 4 spores of a yeast (haploid) tetrad.. If I want to call out variants using all 4 spores (4 bam files), do I need to set -ploidy as 1 or as 4 (
Number of samples in each pool * Sample Ploidy) ??
This is the code I am using:
java -d64 -Xms1g -Xmx4g -jar GenomeAnalysisTK.jar -glm SNP -nt 52 -R genome.fasta -T UnifiedGenotyper -I $basename"_A.realigned.bam" -I $basename"_B.realigned.bam" -I $basename"_C.realigned.bam" -I $basename"_D.realigned.bam" -ploidy 4 -o $basename.snps.vcf -stand_call_conf 25.0 -stand_emit_conf 10.0
I'm working with SNP calling in a bacterium - I don't have a set of known SNPs, so prior to recalibration, so generate a mask file from all data. My question is, what should I put for the following two options:
because the bacterium is haploid (and I specify --ploidy 1) it seems like these options should be set to zero, as there are no heterozygous loci, but I worry that if I set them to zero, that it won't work as expected. Any advice? I was just setting them to 0.001
Does the GATK team have any recommendations for filtering SNP data for haploid genomes? Our team works with microbial eukaryotes, both haploid and diploid and we have used the GATK v3 best practices for filtering for the latter. [VQSR was not possible, since we do not have access to a truth/high confidence SNP set.]