The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, and its ability to call indels is far superior. We recommend using HaplotypeCaller in all cases, with only a few exceptions:
In those cases, we recommend using UnifiedGenotyper instead of HaplotypeCaller.
Call variants on a haploid genome with the UnifiedGenotyper, producing a raw (unfiltered) VCF.
If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.
In its basic use, this is the ploidy (number of chromosomes) per sample. By default it is set to 2, to process diploid organisms' genomes, but it can be set to any other desired value, starting at 1 for haploid organisms, and up for polyploids. This argument can also be used to handle pooled data. For that purpose, you'll need to set
Number of samples in each pool * Sample Ploidy. There is no fixed upper limit, but keep in mind that high-level ploidy will increase processing times since the calculations involved are more complex. For full details on how to process pooled data, see Eran et al. (in preparation).
This is the model that the program will use to calculate the genotype likelihoods. By default, it is set to
SNP, but it can also be set to
BOTH. If set to
BOTH, both SNPs and Indels will be called in the same run and be output to the same variants file.
This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.
This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.
The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.
Run the following GATK command:
java -jar GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R haploid_reference.fa \ -I haploid_reads.bam \ -L 20 \ -ploidy 1 --glm BOTH \ --stand\_call\_conf 30 \ --stand\_emit\_conf 10 \ -o raw_haploid_variants.vcf
This creates a VCF file called
raw_haploid_variants.vcf, containing all the sites that the UnifiedGenotyper evaluated to be potentially variant.
Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.
For a complete, detailed argument reference, refer to the technical documentation page.
The GATK requires the reference sequence in a single reference sequence in FASTA format, with all contigs in the same file. The GATK requires strict adherence to the FASTA standard. Only the standard ACGT bases are accepted; no non-standard bases (W, for example) are tolerated. Gzipped fasta files will not work with the GATK, so please make sure to unzip them first. Please see [Preparing the essential GATK input files: the reference genome] for more information on preparing FASTA reference sequences for use with the GATK.
The Unified Genotyper now makes multi-allelic variant calls!
The Unified Genotyper calls SNPs via a two-stage inference, first from the reads to the sequenced fragments, and then from these inferred fragments to the chromosomal sequence of the organism. This two-stage system properly handles the correlation of errors between read pairs when the sequenced fragments contains errors itself. See Fragment-based calling PDF for more details and analysis.
The allele frequency calculation model used by the Unified Genotyper computes a mathematically precise estimation of the allele frequency at a site given the read data. The mathematical derivation is similar to the one used by Samtools'
mpileup tool. Heng Li has graciously allowed us to post the mathematical calculations backing the EXACT model here. Note that the calculations in the provided document assume just a single alternate allele for simplicity, whereas the Unified Genotyper has been extended to handle genotyping multi-allelic events. A slide showing the mathematical details for multi-allelic calling is available here.
While the indel calling capabilities of the Unified Genotyper are still under active development, they are now in a stable state and are supported for use by external users. Please note that, as with SNPs, the Unified Genotyper is fairly aggressive in making a call and, consequently, the false positive rate will be high in the raw call set. We expect users to properly filter these results as per our best practices (which will be changing continually).
Note also that it is critical for the correct operation of the indel calling that the BAM file to be called is previously indel-realigned (see the IndelRealigner section on details). We strongly recommend doing joint Smith-Waterman alignment and not only per-lane or per-sample alignment at known sites. This is important because the caller is only empowered to genotype indels which are already present in reads.
Finally, while many of the parameters are common between indel and SNP calling, some parameters have different meaning or operate differently. For example,
--min_base_quality_score has a fixed, well defined operation for SNPs (bases at a particular location with base quality lower than this threshold are ignored). However, indel calling is by definition delocalized and haplotype-based, so this parameter does not make sense. Instead, the indel caller will clip both ends of the reads if their quality is below a certain threshold (Q20), up to the point where there is a base in the read exceeding this threshold.
Note that the Unified Genotyper will not call indels in 454 data!
It's common to want to operate only over a part of the genome and to output SNP calls to standard output, rather than a file. The
-L option lets you specify the region to process. If you set
/dev/stdout (or leave it out completely), output will be sent to the standard output of the console.
You can turn off logging completely by setting
-l OFF so that the GATK operates in silent mode.
By default the Unified Genotyper downsamples each sample's coverage to no more than 250x (so there will be at most 250 * number_of_samples reads at a site). Unless there is a good reason for wanting to change this value, we suggest using this default value especially for exome processing; allowing too much coverage will require a lot more memory to run. When running on projects with many samples at low coverage (e.g. 1000 Genomes with 4x coverage per sample) we usually lower this value to about 10 times the average coverage:
The Unified Genotyper does not use reads with a mapping quality of 255 ("unknown quality" according to the SAM specification). This filtering is enforced because the genotyper caps a base's quality by the mapping quality of its read (since the probability of the base's being correct depends on both qualities). We rely on sensible values for the mapping quality and therefore using reads with a 255 mapping quality is dangerous.
That being said, if you are working with a data type where alignment quality cannot be determined, there is a (completely unsupported) workaround: the ReassignMappingQuality filter enables you to reassign the mapping quality of all reads on the fly. For example, adding
-rf ReassignMappingQuality -DMQ 60 to your command-line would change all mapping qualities in your bam to 60.
Or, if you are working with data from a program like TopHat which uses MAPQ 255 to convey meaningful information, you can use the ReassignOneMappingQuality filter (new in 2.4) to assign a different MAPQ value to those reads so they won't be ignored by GATK tools. For example, adding
-rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 would change the mapping qualities of reads with MAPQ 255 in your bam to MAPQ 60.
At the end of a GATK UG run, you should see if you have -l INFO enabled a report that looks like:
INFO 00:23:29,795 UnifiedGenotyper - Visited bases 247249719 INFO 00:23:29,796 UnifiedGenotyper - Callable bases 219998386 INFO 00:23:29,796 UnifiedGenotyper - Confidently called bases 219936125 INFO 00:23:29,796 UnifiedGenotyper - % callable bases of all loci 88.978 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - Actual calls made 303126
This is what these lines mean:
This the total number of reference bases that were visited.
Visited bases minus reference Ns and places with no coverage, which we never try to call.
Callable bases that exceed the emit confidence threshold, either for being non-reference or reference. That is, if
T is the min confidence, this is the count of bases where
QUAL > T for the site being reference in all samples and/or
QUAL > T for the site being non-reference in at least one sample.
Note a subtle implication of the last statement, with all samples vs. any sample: calling multiple samples tends to reduce the percentage of confidently callable bases, as in order to be confidently reference one has to be able to establish that all samples are reference, which is hard because of the stochastic coverage drops in each sample.
Note also that confidently called bases will rise with additional data per sample, so if you don't dedup your reads, include lots of poorly mapped reads, the numbers will increase. Of course, just because you confidently call the site doesn't mean that the data processing resulted in high-quality output, just that we had sufficient statistical evident based on your input data to called ref / non-ref.
The GATK can be used to call the sex (X and Y) chromosomes, without explicit knowledge of the gender of the samples. In an ideal world, with perfect upfront data processing, we would get perfect genotypes on the sex chromosomes without knowledge of who is diploid on X and has no Y, and who is hemizygous on both. However, misalignment and mismapping contributes especially to these chromosomes, as their reference sequence is clearly of lower quality than the autosomal regions of the genome. Nevertheless, it is possible to get reasonably good SNP calls, even with simple data processing and basic filtering. Results with proper, full data processing as per the best practices in the GATK should lead to very good calls. You can view a presentation "The GATK Unified Genotyper on chrX and chrY" in the GSA Public Drop Box.
Our general approach to calling on X and Y is to treat them just as we do the autosomes and then applying a gender-aware tools to correct the genotypes afterwards. It makes sense to filter out sites across all samples (outside PAR) that appear as confidently het in males, as well as sites on Y that appear confidently non-reference in females. Finally, it's possible to simply truncate the genotype likelihoods for males and females as appropriate from their diploid likelihoods -- AA, AB, and BB -- to their haploid equivalents -- AA and BB -- and adjust the genotype calls to reflect only these two options. We applied this approach in 1000G, but we only did it as the data went into imputation, so there's no simple tool to do this, unfortunately. The GATK team is quite interested in a general sex correction tool (analogous to the PhaseByTransmission tool for trios), so please do contact us if you are interested in contributing such a tool to the GATK codebase.
Just because something looks like a SNP in IGV doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs) and in the HaplotypeCaller (for all variants including indels). So, before you post this issue in our support forum, please do a little bit of investigation on your own, as follows.
To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here.
Here is a checklist of questions you should ask yourself:
The genotyper ignores sites if there are too many overlapping deletions. This value can be set using the
--max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument), but be aware that increasing it could affect the reliability of your results.
Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base. If your would-be SNP is only supported by low-confidence bases, it is probably a false positive.
Keep in mind that the depth reported in the VCF is the unfiltered depth. You may think you have good coverage at that site, but the Unified Genotyper ignores bases if they don't look good, so actual coverage seen by the UG may be lower than you think.
A base's quality is capped by the mapping quality of its read. The reason for this is that low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- you may be looking at the sequence of some other locus in the genome!
Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
By default the UG will only consider a certain number of alternate alleles. This value can be set using the
--max_alternate_alleles argument (see the UG's documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter.
SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site? If so, you are probably seeing false positives.
In addition to the reasons above, Haplotype Caller has another reason why some variants do not get called when it seems obvious in the original bam file.
Haplotype Caller performs a local reassembly of the reads in the active region. Please refer here for more details: http://www.broadinstitute.org/gatk/guide/article?id=4148
This reassembly is important, because when mapping reads to the whole genome, it is easy to make an error. When reassembling reads in a much smaller region, such as the active region, the alignment is more likely to be accurate.
The reads you see in the alignment of the original bam file may suggest a variant should be called. However, due to the realignment, the reads may no longer support the variant. In order to see the new alignment of reads, you can use -bamout argument. You can then compare the aligned reads from the original bam file to the newly aligned reads in the -bamout file.
In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present, however, after reassembly, we find there is really a large deletion!
In general most GATK tools don't care about ploidy. The major exception is, of course, at the variant calling step: the variant callers need to know what ploidy is assumed for a given sample in order to perform the appropriate calculations.
Since version 2.0, the UnifiedGenotyper has been able to deal with ploidies other than two. Three use cases are currently supported:
In order to enable this feature, you need to set the
-ploidy argument to desired number of chromosomes per organism. In the case of pooled sequencing experiments, this argument should be set to the number of chromosomes per barcoded sample, i.e.
(Ploidy per individual) * (Individuals in pool).
Note that all other UnifiedGenotyper arguments work in the same way.
A full minimal command line would look for example like
java -jar GenomeAnalysisTK.jar \ -R reference.fasta \ -I myReads.bam \ -T UnifiedGenotyper \ -ploidy 4
glm argument works in the same way as in the diploid case - set to
[INDEL|SNP|BOTH] to specify which variants to discover and/or genotype.
Many of these limitations will be gradually removed over time, but for now please keep these in mind.
Fragment-aware calling like the one provided by default for diploid organisms is not present for the non-diploid case.
Some annotations do not work in non-diploid cases. In particular, InbreedingCoeff will not be annotated on non-diploid calls. Annotations that do work and are supported in non-diploid use cases are the following:
AF, and Genotype annotations such as
The HaplotypeCaller and ReduceReads currently do not support non-diploid data.
In theory you can use VQSR to filter non-diploid calls, but we currently have no experience with this and therefore cannot offer any support nor best practices on how to do this.
For indels, only a maximum of 4 alleles can be genotyped. This is not relevant for the SNP case, but discovering or genotyping more than this number of indel alleles will not work and an arbitrary set of 4 alleles will be chosen at a site.
You should also be aware of the fundamental accuracy limitations of high ploidy calling. Calling low-frequency variants in a pool or in an organism with high ploidy is hard because these rare variants become almost indistinguishable from sequencing errors.
GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history
Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.
As reported here:
If you encounter this bug too, please don't post a new question about it. Feel free to comment in this thread to let us know you have also had the same problem. Tell us what version of the GATK you were using and post your command line.
Thank you for your patience while we work to fix this issue.