Unified genotyper
From GSA
Contents |
The Unified Genotyper
The GATK Unified Genotyper is a multiple-sample, technology-aware SNP caller. It uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples, emitting an accurate posterior probability of there being a segregating variant allele at each locus as well as for the genotype of each sample. The system can either emit just the variant sites or complete genotypes (which includes homozygous reference calls) satisfying some phred-scaled confidence value. The genotyper can make accurate calls on both single sample data and multi-sample data.
For those users who have run older (and now deprecated) instantiations of our genotypers and can no longer find these once familiar tools in the GATK, the Unified Genotyper encompasses the functionality of both the Single Sample Genotyper and the Multi Sample Caller.
Recent Changes
- We no longer support outputting calls in a format other than VCF, as the overhead for doing so was too high.
- The argument for specifying the output VCF file is now -o instead of -varout.
Inputs
Genotype Calculation Model
The -gm,--genotype_model option controls the model used to calculate the genotype of the data at a given locus.
- JOINT_ESTIMATE is the default model. It performs well in all cases (i.e. even marginal cases) and achieves an accurate allele frequency estimate.
Base Mismatching Mode
The -bm,--base_model option controls the error model used for calculating P(base | genotype):
- ONE_STATE in which the probability of a mismatching base to a genotype is the error probability according to the Q score, e
- THREE_STATE in which the prob. is e / 3, reflecting that a miscalled base can be any of the three other bases, but is anti-conservative in practice
- EMPIRICAL uses experimental mismatch probabilities derived from machine error rates themselves. Significantly improves performance but (1) requires BAM read groups by read annotated with platform or (2) manual assignment of the machine type with the --platform argument. The EMPIRICAL mode supports Solexa, solid, and 454 data, but is not well tested on solid or 454.
Heterozygosity
You can use the --heterozygosity argument to change the expected heterozygosity value used to compute prior likelihoods for any locus. The default priors are:
- het = 1e-3
- P(hom-ref genotype) = 1 - 3 * het / 2
- P(het genotype) = het
- P(hom-var genotype) = het / 2
Minimum Confidence Threshold
- -stand_call_conf,--standard_min_confidence_threshold_for_calling is the minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls (that aren't at 'trigger' sites). Only genotypes with confidence >= this threshold are emitted as called sites. A reasonable threshold is 30 (this is the default).
- -stand_emit_conf,--standard_min_confidence_threshold_for_emitting is the minimum phred-scaled Qscore threshold to emit low confidence calls (that aren't at 'trigger' sites). Genotypes with confidence >= this but less than the calling threshold are emitted but marked as filtered. The default value is 30.
- -trig_call_conf,--trigger_min_confidence_threshold_for_calling is the minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls that are at 'trigger' sites. Only genotypes with confidence >= this threshold are emitted as called sites. A reasonable threshold is 30 (this is the default).
- -trig_emit_conf,--trigger_min_confidence_threshold_for_emitting is the minimum phred-scaled Qscore threshold to emit low confidence calls that are at 'trigger' sites. Genotypes with confidence >= this but less than the calling threshold are emitted but marked as filtered. The default value is 30.
Arguments to deal with malformed input bams
- -single_sample,--assume_single_sample_reads specifies which sample the genotyper should associate with reads which don't have read groups. This argument should only be provided in conjunction with single-sample data (e.g. "-single_sample NA12878"). Defaults to null, indicating that the system will throw a runtime exception when such reads are detected.
- -pl,--platform <platform> Causes the genotyper to assume that reads without the PL header TAG are this platform. Defaults to null, indicating that the system will throw a runtime exception when such reads are detected.
Arguments Affecting Outputs
- -o specifies the output file, e.g. -o my.calls.vcf
- -genotype,--genotype enables genotyping mode, whereby the confidence in the genotype itself is used for the confidence threshold test rather than the confidence in a non-reference genotype.
- -all_bases,--output_all_callable_bases instructs the genotyper to emit calls at all bases with coverage, regardless of the confidence or genotype at the locus.
- -verbose,--verbose_mode specifies the output file used to print debugging information.
- -metrics,--metrics_file specifies the output file used to print various metrics about callability.
- -nsl,--noSLOD instructs the Genotyper not to calculate the SLOD.
- -A,--annotations tells the Genotyper to apply one or more annotations to variant calls (e.g. '-A DepthOfCoverage -A AlleleBalance' would include those 2 annotations); see the VariantAnnotator for more details.
- -G,--groups instructs the Genotyper to apply one or more annotation interfaces/groups to variant calls (e.g. '-G Standard' would apply all annotations which implement the StandardAnnotation interface); see the VariantAnnotator for more details.
- -deletions,--max_deletion_fraction specifies the maximum fraction of reads with deletions spanning a locus tolerated by the genotyper (the genotyper won't make calls at loci with too many deletions). Default is 0.05.
- -mbq,--min_base_quality_score specifies the minimum base quality required to consider a base for calling. Default is 10.
- -mmq,--min_mapping_quality_score specifies the minimum read mapping quality required to consider a read for calling (MQ0 reads are ignored regardless). Default is 10.
- -mm40,--max_mismatches_in_40bp_window specifies the maximum number of mismatches within a 40 bp window (20bp on either side) around the target position for a read to be used for calling (i.e. reads with too many mismatches won't be used by the genotyper). Default is 3. To disable, set this value to 40 or greater.
- -bad_mates,--use_reads_with_bad_mates instructs the genotyper to use reads whose mates are mapped excessively far away for calling. Default is not to use them.
- -cap_base_qual,--cap_base_quality_by_mapping_quality instructs the genotyper to cap the base quality of any given base by its read's mapping quality.
Miscellaneous
- -mrl,--max_reads_at_locus specifies the maximum coverage at a locus. This is actually a GATK-wide argument and is used to skip loci that have too much coverage. If the Unified Genotyper is running abnormally slow, it is most likely due to excessive coverage in your input.
- One can optionally provide one or more 'trigger' ROD tracks that instruct the Unified Genotyper to emit calls at sites which overlap the track(s) even if they are reference sites. This is an alternative to the -all_bases mode which can potentially result in massive files (especially when calling genome-wide). If any track whose name begins with 'trigger' is provided, the genotyper will emit calls at variant sites plus sites that overlap the track (e.g. -B:trigger,VCF file.vcf).
How the annotations in the INFO field of the output VCF are calculated
The Strand Bias calculation is described here (in the figure to the right). Other annotations are described on the page for the VariantAnnotator. Also, the QUAL field of the VCF is the phred-scaled probability that the call being made is incorrect (i.e. -10*log_10 p[no variant]).
Example commands
1000 genomes pilot 2
Example command using a data file from the 1000 genomes archive with the empirical calling table for the first 10K of chr1 starting at 11Mb.
java -jar /path/to/GenomeAnalysisTK.jar \ -l INFO \ -L 1:11,000,000-11,010,000 \ -R /broad/1KG/reference/human_b36_both.fasta \ -T UnifiedGenotyper \ -I /broad/1KG/DCC/ftp/pilot_data/NA12878/alignment/NA12878.chrom1.SLX.SRP000032.2009_07.bam \ -o my.vcf \ -stand_call_conf 30.0 \ -stand_emit_conf 10.0
Generating calls at all sites
The following example command generates calls for the low coverage 1000 genomes pilot 1 bam file using the empirical confusion matrix at all sites.
java -jar /path/to/GenomeAnalysisTK.jar \ -l INFO \ -R /broad/1KG/reference/human_b36_both.fasta \ -T UnifiedGenotyper \ -I /broad/1KG/DCC/ftp/pilot_data/data/NA12878/alignment/NA12878.SLX.maq.SRP000031.2009_08.bam \ -o my.vcf \ -all_bases
It's common to want to operate only over a part of the genome and to output SNP calls to standard output, rather than a file. The -L option lets you specify the region to process. If you set -o to /dev/stdout (or leave it out completely), output will be sent to the standard output of the console. Turn off logging completely (-l OFF) so that the GATK operates in silent mode.
Caveats
- The system is under active and continuous development. All outputs, the underlying likelihood model, arguments, and file formats are liable to change.
- The system can be very aggressive in calling variants. In the 1000 genomes project for pilot 2 (deep coverage of ~35x) we expect the raw Qscore > 50 variants to contain at least ~10% FP calls. We use extensive post-calling filters to eliminate most of these FPs. VariantFilteration is a tool to perform this filtering.
- We only handle diploid genotypes
