Unified genotyper
The Unified Genotyper
For a complete, detailed argument reference, refer to the GATK document page here.
Relatively Recent Changes
The Unified Genotyper now makes multi-allelic variant calls!
Fragment-based calling
The Unified Genotyper calls SNPs via a two-stage inference, first from the reads to the sequenced fragments, and then from these inferred fragments to the chromosomal sequence of the organism. This two-stage system properly handles the correlation of errors between read pairs when the sequenced fragments contains errors itself. See Fragment-based calling PDF for more details and analysis.
The Allele Frequency Calculation
The allele frequency calculation model used by the Unified Genotyper computes a mathematically precise estimation of the allele frequency at a site given the read data. The mathematical derivation is similar to the one used by Samtools' mpileup tool. Heng Li has graciously allowed us to post the mathematical calculations backing the EXACT model here; more info on Samtools is available here: [1]. Note that the calculations in the provided document assume just a single alternate allele for simplicity, whereas the Unified Genotyper has been extended to handle genotyping multi-allelic events. A slide showing the mathematical details for multi-allelic calling is available here.
Explanation of the VCF Output
See Understanding the Unified Genotyper's VCF files.
The Unified Genotyper didn't call my SNP! What happened?
It is important to point out that just because something looks like a SNP in IGV, it doesn't mean that it is of high quality. We are extremely confident in the genotype likelihoods calculations in the Unified Genotyper (especially for SNPs), so before you post this issue to our support forum you will first need to do a little investigation on your own. To diagnose what is happening, you should take a look at the pileup of bases at the position in question. It is very important for you to look at the underlying data here. Here are some pointers to get you started in the right direction:
- What do the base qualities look like for the non-reference bases? Remember that there is a minimum base quality threshold and that low base qualities mean that the sequencer assigned a low confidence to that base.
- What do the mapping qualities look like for the reads with the non-reference bases? A base's quality is capped by the mapping quality of its read as low mapping qualities mean that the aligner had little confidence that the read is mapped to the correct location in the genome. Also, reads with mapping quality 255 ("unknown") are ignored.
- Are you working with SOLiD data? SOLiD alignments tend to have reference bias and it can be severe in some cases. Do the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the the site?
- Remember that the depth reported in the VCF is the unfiltered depth - the Unified Genotyper ignores bases if they don't look good. Also, keep in mind that currently the Unified Genotyper will only call a single alternate allele (for SNPs) at a position (based on the most likely one).
How do I make tumor vs. normal calls with the Unified Genotyper?
See Cancer-specific Variant Discovery Tools.
Indel Calling with the Unified Genotyper
While the indel calling capabilities of the Unified Genotyper are still under active development, they are now in a stable state and are supported for use by external users. Please note that, as with SNPs, the Unified Genotyper is fairly aggressive in making a call and, consequently, the false positive rate will be high in the raw call set. We expect users to properly filter these results as per our best practices (which will be changing continually).
Note that it is critical for the correct operation of the indel calling that the BAM file to be called is previously indel-realigned (see the IndelRealigner section on details). We strongly recommend doing joint Smith-Waterman alignment and not only per-lane or per-sample alignment at known sites. This is important because the caller is only empowered to genotype indels which are already present in reads.
Miscellaneous notes
Note that the Unified Genotyper will not call indels in 454 data!
It's common to want to operate only over a part of the genome and to output SNP calls to standard output, rather than a file. The -L option lets you specify the region to process. If you set -o to /dev/stdout (or leave it out completely), output will be sent to the standard output of the console. Turn off logging completely (-l OFF) so that the GATK operates in silent mode.
By default the Unified Genotyper downsamples each sample's coverage to no more than 250x (so there will be at most 250 * number_of_samples reads at a site). Unless there is a good reason for wanting to change this value, we suggest using this default value especially for exome processing; allowing too much coverage will require a lot more memory to run. When running on projects with many samples at low coverage (e.g. 1000 Genomes with 4x coverage per sample) we usually lower this value to about 10 times the average coverage: '-dcov 40'.
The Unified Genotyper does not use reads with a mapping quality of 255 ("unknown quality" according to the SAM specification). This filtering is enforced because the genotyper caps a base's quality by the mapping quality of its read (since the probability of the base's being correct depends on both qualities). We rely on sensible values for the mapping quality and therefore using reads with a 255 mapping quality is dangerous. That being said, if you are working with a data type where alignment quality cannot be determined, there is a (completely unsupported) workaround: the ReassignMappingQuality filter enables you to reassign the mapping quality of all reads on-the-fly. For example, adding "-rf ReassignMappingQuality -DMQ 60" to your command-line would change all mapping qualities in your bam to 60.
Callable bases
At the end of a GATK UG run, you should see if you have -l INFO enabled a report that looks like:
INFO 00:23:29,795 UnifiedGenotyper - Visited bases 247249719 INFO 00:23:29,796 UnifiedGenotyper - Callable bases 219998386 INFO 00:23:29,796 UnifiedGenotyper - Confidently called bases 219936125 INFO 00:23:29,796 UnifiedGenotyper - % callable bases of all loci 88.978 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of all loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - % confidently called bases of callable loci 88.953 INFO 00:23:29,797 UnifiedGenotyper - Actual calls made 303126
Here are the meaning of the lines:
- The visited bases is the total number of reference bases that were visited.
- Callable bases are visited bases minus reference Ns and places with no coverage, which we never try to call.
- Confidently called bases are callable bases that exceed the emit confidence threshold, either for being non-reference or reference. That is, if T is the min confidence, this is the count of bases where QUAL > T for the site being reference in all samples and or QUAL > T for the site being non-reference in at least one sample
Note a subtle implication of the last statement, with all samples vs. any sample: calling multiple samples tends to reduce the % of confidently callable bases, as in order to be confidently reference one has to be able to establish that all samples are reference, which is hard because of the stochastic coverage drops in each sample.
Note that confidently called bases will rise with additional data per sample, so if you don't dedup your reads, include lots of poorly mapped reads, the numbers will increase. Of course, just because you confidently call the site doesn't mean that the data processing resulted in high-quality output, just that we had sufficient statistical evident based on your input data to called ref / non-ref.
Calling sex chromosomes
The GATK can be used to call the sex (X and Y) chromosomes, without explicit knowledge of the gender of the samples. In an ideal world, with perfect upfront data processing, we would get perfect genotypes on the sex chromosomes without knowledge of who is diploid on X and has no Y, and who is hemizygous on both. However, misalignment and mismapping contributes especially to these chromosomes, as their reference sequence is clearly of lower quality than the autosomal regions of the genome. Nevertheless, it is possible to get reasonably good SNP calls, even with simple data processing and basic filtering. Results with proper, full data processing as per the best practices in the GATK should lead to very good calls. You can view a presentation "The GATK Unified Genotyper on chrX and chrY" in GSA Public Drop Box
Our general approach to calling on X and Y is to treat them just as we do the autosomes and then applying a gender-aware tools to correct the genotypes afterwards. It makes sense to filter out sites across all samples (outside PAR) that appear as confidently het in males, as well as sites on Y that appear confidently non-reference in females. Finally, it's possible to simply truncate the genotype likelihoods for males and females as appropriate from their diploid likelihoods -- AA, AB, and BB -- to their haploid equivalents -- AA and BB -- and adjust the genotype calls to reflect only these two options. We applied this approach in 1000G, but we only did it as the data went into imputation, so there's no simple tool to do this, unfortunately. The GATK team is quite interested in a general sex correction tool (analogous to the PhaseByTransmission tool for trios), so please do contact us if you are interested in contributing such a tool to the GATK codebase.