# UnifiedGenotyper

A variant caller which unifies the approaches of several disparate callers -- Works for single-sample and multi-sample data.

## Overview

The GATK Unified Genotyper is a multiple-sample, technology-aware SNP and indel caller. It uses a Bayesian genotype likelihood model to estimate simultaneously the most likely genotypes and allele frequency in a population of N samples, emitting an accurate posterior probability of there being a segregating variant allele at each locus as well as for the genotype of each sample. The system can either emit just the variant sites or complete genotypes (which includes homozygous reference calls) satisfying some phred-scaled confidence value. The genotyper can make accurate calls on both single sample data and multi-sample data.

### Input

The read data from which to make variant calls.

### Output

A raw, unfiltered, highly sensitive callset in VCF format.

### Example generic command for multi-sample SNP calling

    java -jar GenomeAnalysisTK.jar \
        -R resources/Homo_sapiens_assembly18.fasta \
        -T UnifiedGenotyper \
        -I sample1.bam [-I sample2.bam ...] \
        --dbsnp dbSNP.vcf \
        -o snps.raw.vcf \
        -stand_call_conf [50.0] \
        -stand_emit_conf 10.0 \
        -dcov [50 for 4x, 200 for >30x WGS or Whole exome] \
        [-L targets.interval_list]


The above command will call all of the samples in your provided BAM files [-I arguments] together and produce a VCF file with sites and genotypes for all samples. The easiest way to get the dbSNP file is from the GATK resource bundle (see Guide FAQs for details). Several arguments have parameters that should be chosen based on the average coverage per sample in your data. See the detailed argument descriptions below.
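As a minimal sketch, the generic command can be assembled programmatically when the list of BAM files varies. The helper name `build_ug_command` and its defaults (taken from the example values shown in this section) are illustrative assumptions, not part of the GATK; only flags that appear in the example command are used.

```python
def build_ug_command(bams, reference, out_vcf, dbsnp=None,
                     call_conf=50.0, emit_conf=10.0, dcov=200, intervals=None):
    """Assemble the argument list for the multi-sample SNP calling example.

    Hypothetical helper: it only strings together flags shown in the
    example command and does not run or validate anything.
    """
    cmd = ["java", "-jar", "GenomeAnalysisTK.jar",
           "-R", reference, "-T", "UnifiedGenotyper"]
    for bam in bams:
        cmd += ["-I", bam]              # one -I argument per BAM file
    if dbsnp is not None:
        cmd += ["--dbsnp", dbsnp]
    cmd += ["-o", out_vcf,
            "-stand_call_conf", str(call_conf),
            "-stand_emit_conf", str(emit_conf),
            "-dcov", str(dcov)]
    if intervals is not None:
        cmd += ["-L", intervals]        # optional interval restriction
    return cmd


cmd = build_ug_command(
    ["sample1.bam", "sample2.bam"],
    "resources/Homo_sapiens_assembly18.fasta",
    "snps.raw.vcf",
    dbsnp="dbSNP.vcf",
)
print(" ".join(cmd))
```

The resulting list can be passed to a process launcher of your choice; choose `dcov` according to the average per-sample coverage as described above.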

### Example command for generating calls at all sites

    java -jar /path/to/GenomeAnalysisTK.jar \
        -l INFO \
        -R resources/Homo_sapiens_assembly18.fasta \
        -T UnifiedGenotyper \
        -I /DCC/ftp/pilot_data/data/NA12878/alignment/NA12878.SLX.maq.SRP000031.2009_08.bam \
        -o my.vcf \
        --output_mode EMIT_ALL_SITES


### Caveats

• The system is under active and continuous development. All outputs, the underlying likelihood model, arguments, and file formats are likely to change.
• The system can be very aggressive in calling variants. In the 1000 genomes project for pilot 2 (deep coverage of ~35x) we expect the raw Qscore > 50 variants to contain at least ~10% FP calls. We use extensive post-calling filters to eliminate most of these FPs. Variant Quality Score Recalibration is a tool to perform this filtering.
• The generalized ploidy model can be used to handle non-diploid or pooled samples (see the -ploidy argument in the table below).

These Read Filters are automatically applied to the data by the Engine before processing by UnifiedGenotyper.

### Parallelism options

This tool can be run in multi-threaded mode using these options.

### Downsampling settings

This tool applies the following downsampling settings by default.

• Mode: BY_SAMPLE
• To coverage: 250

### Window size

This tool uses a sliding window on the reference.

• Window start: 200 bp before the locus
• Window stop: 200 bp after the locus

## Command-line Arguments

### Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

### UnifiedGenotyper specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

| Argument name(s) | Default value | Summary |
|---|---|---|
| **Optional Inputs** | | |
| --alleles | none | The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES |
| --comp | [] | Comparison VCF file |
| --dbsnp, -D | none | dbSNP file |
| **Optional Outputs** | | |
| --out, -o | stdout | File to which variants should be written |
| **Optional Parameters** | | |
| --annotation, -A | [] | One or more specific annotations to apply to variant calls |
| --contamination_fraction_to_filter, -contamination | 0.0 | Fraction of contamination in sequencing data (for all samples) to aggressively remove |
| --excludeAnnotation, -XA | [] | One or more specific annotations to exclude |
| --genotype_likelihoods_model, -glm | SNP | Genotype likelihoods calculation model to employ -- SNP is the default option, while INDEL is also available for calling indels and BOTH is available for calling both together |
| --genotyping_mode, -gt_mode | DISCOVERY | Specifies how to determine the alternate alleles to use for genotyping |
| --group, -G | [Standard] | One or more classes/groups of annotations to apply to variant calls |
| --heterozygosity, -hets | 0.001 | Heterozygosity value used to compute prior likelihoods for any locus. See the GATKDocs for full details on the meaning of this population genetics concept |
| --indel_heterozygosity, -indelHeterozygosity | 1.25E-4 | Heterozygosity for indel calling. See the GATKDocs for heterozygosity for full details on the meaning of this population genetics concept |
| --max_deletion_fraction, -deletions | 0.05 | Maximum fraction of reads with deletions spanning this locus for it to be callable |
| --min_base_quality_score, -mbq | 17 | Minimum base quality required to consider a base for calling |
| --min_indel_count_for_genotyping, -minIndelCnt | 5 | Minimum number of consensus indels required to trigger genotyping run |
| --min_indel_fraction_per_sample, -minIndelFrac | 0.25 | Minimum fraction of all reads at a locus that must contain an indel (of any allele) for that sample to contribute to the indel count for alleles |
| --output_mode, -out_mode | EMIT_VARIANTS_ONLY | Specifies which type of calls we should output |
| --pair_hmm_implementation, -pairHMM | LOGLESS_CACHING | The PairHMM implementation to use for -glm INDEL genotype likelihood calculations |
| --pcr_error_rate, -pcr_error | 1.0E-4 | The PCR error rate to be used for computing fragment-based likelihoods |
| --sample_ploidy, -ploidy | 2 | Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy). |
| --standard_min_confidence_threshold_for_calling, -stand_call_conf | 30.0 | The minimum phred-scaled confidence threshold at which variants should be called |
| --standard_min_confidence_threshold_for_emitting, -stand_emit_conf | 30.0 | The minimum phred-scaled confidence threshold at which variants should be emitted (and filtered with LowQual if less than the calling threshold) |
| **Optional Flags** | | |
| --annotateNDA, -nda | false | If provided, we will annotate records with the number of alternate alleles that were discovered (but not necessarily genotyped) at a given site |
| --computeSLOD, -slod | false | If provided, we will calculate the SLOD (SB annotation) |
| --contamination_fraction_per_sample_file, -contaminationFile | NA | Tab-separated file containing the fraction of contamination in sequencing data (per sample) to aggressively remove. Each line should contain a sample name and its contamination fraction (a double), separated by a tab; no header. |
| --indelGapContinuationPenalty, -indelGCP | 10 | Indel gap continuation penalty, as Phred-scaled probability. I.e., 30 => 10^(-30/10) |
| --indelGapOpenPenalty, -indelGOP | 45 | Indel gap open penalty, as Phred-scaled probability. I.e., 30 => 10^(-30/10) |
| --input_prior, -inputPrior | [] | Input prior for calls |
| --max_alternate_alleles, -maxAltAlleles | 6 | Maximum number of alternate alleles to genotype |
| --onlyEmitSamples | [] | If provided, only these samples will be emitted into the VCF, regardless of which samples are present in the BAM file |
| --allSitePLs | false | Annotate all sites with PLs |

### Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

### --alleles / -alleles

The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES
When the UnifiedGenotyper is put into GENOTYPE_GIVEN_ALLELES mode it will genotype the samples using only the alleles provided in this rod binding.

--alleles binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

RodBinding[VariantContext]  none

### --allSitePLs / -allSitePLs

Annotate all sites with PLs
Advanced, experimental argument. If the SNP likelihood model is specified and the EMIT_ALL_SITES output mode is set, setting this argument also emits PLs at all sites. This gives a measure of reference confidence and a measure of which alt alleles are more plausible (if any). WARNINGS:

• This feature will inflate VCF file size considerably.
• All SNP ALT alleles will be emitted with corresponding 10 PL values.
• An error will be emitted if EMIT_ALL_SITES is not set, or if anything other than the diploid SNP model is used.

boolean  false
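The figure of 10 PL values is consistent with one PL per unordered diploid genotype over four alleles (the reference base plus three possible alternate bases). The sketch below counts genotypes with the standard multiset formula; the function name `num_genotypes` is a hypothetical helper, not a GATK API.

```python
from math import comb


def num_genotypes(num_alleles, ploidy=2):
    """Number of unordered genotypes, i.e. multisets of size `ploidy`
    drawn from `num_alleles` alleles: C(num_alleles + ploidy - 1, ploidy)."""
    return comb(num_alleles + ploidy - 1, ploidy)


# REF plus three alternate SNP alleles, diploid sample
n_pls = num_genotypes(4)
print(n_pls)
```

A biallelic diploid site gives three genotypes (hom-ref, het, hom-alt) by the same formula, which is why file size grows quickly once every ALT allele is emitted at every site.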

### --annotateNDA / -nda

If provided, we will annotate records with the number of alternate alleles that were discovered (but not necessarily genotyped) at a given site
Depending on the value of the --max_alternate_alleles argument, we may genotype only a fraction of the alleles being sent on for genotyping. Using this argument instructs the genotyper to annotate (in the INFO field) the number of alternate alleles that were originally discovered at the site.

boolean  false

### --annotation / -A

One or more specific annotations to apply to variant calls
Which annotations to add to the output VCF file. See the VariantAnnotator -list argument to view available annotations.

List[String]  []

### --comp / -comp

Comparison VCF file
If a call overlaps with a record from the provided comp track, the INFO field will be annotated as such in the output with the track name (e.g. -comp:FOO will have 'FOO' in the INFO field). Records that are filtered in the comp track will be ignored. Note that 'dbSNP' has been special-cased (see the --dbsnp argument).

--comp binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

List[RodBinding[VariantContext]]  []

### --computeSLOD / -slod

If provided, we will calculate the SLOD (SB annotation)
Note that calculating the SLOD increases the runtime by an appreciable amount.

boolean  false

### --contamination_fraction_per_sample_file / -contaminationFile

Tab-separated file containing the fraction of contamination in sequencing data (per sample) to aggressively remove. Each line should contain a sample name and its contamination fraction (a double), separated by a tab; no header.
This argument specifies a file with two columns, "sample" and "contamination", giving the contamination level for those samples. Samples that do not appear in this file will be processed with the value of --contamination_fraction_to_filter.

File
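A small sketch of producing such a file's contents, assuming the two-column, tab-separated, header-free layout described above. The helper name and the sample name `sampleB` are hypothetical; the fractions are made-up illustration values.

```python
def contamination_file_text(fractions):
    """Render {sample_name: contamination_fraction} as tab-separated lines
    with no header, one sample per line."""
    return "".join(f"{sample}\t{fraction}\n"
                   for sample, fraction in fractions.items())


# Illustrative values only
text = contamination_file_text({"NA12878": 0.02, "sampleB": 0.05})
print(text, end="")
```

The text can then be written to a file and passed with -contaminationFile.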

### --contamination_fraction_to_filter / -contamination

Fraction of contamination in sequencing data (for all samples) to aggressively remove
If this fraction is greater than zero, the caller will aggressively attempt to remove contamination through biased down-sampling of reads. Basically, it will ignore the contamination fraction of reads for each alternate allele. So if the pileup contains N total bases, then we will try to remove (N * contamination fraction) bases for each alternate allele.

double  0.0  [ [ -∞  ∞ ] ]
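To make the arithmetic concrete, the sketch below computes the target number of bases to remove per alternate allele, N * contamination fraction. It only illustrates the quantity described above; how the caller selects or rounds the removed bases is not specified here, and the function name is hypothetical.

```python
def target_bases_to_remove(pileup_depth, contamination_fraction):
    """Target number of bases to remove for EACH alternate allele:
    N * contamination fraction (left unrounded on purpose)."""
    if contamination_fraction < 0:
        raise ValueError("contamination fraction must be non-negative")
    return pileup_depth * contamination_fraction


# A pileup of 200 bases with 2% contamination
target = target_bases_to_remove(200, 0.02)
print(target)
```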

### --dbsnp / -D

dbSNP file
rsIDs from this file are used to populate the ID column of the output. Also, the DB INFO flag will be set when appropriate. dbSNP is not used in any way for the calculations themselves.

--dbsnp binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

RodBinding[VariantContext]  none

### --excludeAnnotation / -XA

One or more specific annotations to exclude
Which annotations to exclude from output in the VCF file. Note that this argument has higher priority than the -A or -G arguments, so annotations will be excluded even if they are explicitly included with the other options.

List[String]  []

### --genotype_likelihoods_model / -glm

Genotype likelihoods calculation model to employ -- SNP is the default option, while INDEL is also available for calling indels and BOTH is available for calling both together

The --genotype_likelihoods_model argument is an enumerated type (Model), which can have one of the following values:

SNP
INDEL
GENERALPLOIDYSNP
GENERALPLOIDYINDEL
BOTH

Model  SNP

### --genotyping_mode / -gt_mode

Specifies how to determine the alternate alleles to use for genotyping

The --genotyping_mode argument is an enumerated type (GenotypingOutputMode), which can have one of the following values:

DISCOVERY
The genotyper will choose the most likely alternate allele
GENOTYPE_GIVEN_ALLELES
Only the alleles passed by the user should be considered.

GenotypingOutputMode  DISCOVERY

### --group / -G

One or more classes/groups of annotations to apply to variant calls
If specified, all available annotations in the group will be applied. See the VariantAnnotator -list argument to view available groups. Keep in mind that RODRequiringAnnotations are not intended to be used as a group, because they require specific ROD inputs.

String[]  [Standard]

### --heterozygosity / -hets

Heterozygosity value used to compute prior likelihoods for any locus. See the GATKDocs for full details on the meaning of this population genetics concept
The expected heterozygosity value used to compute the prior probability that a locus is non-reference. The default priors are provided for humans: het = 1e-3, which means that the probability of N samples being hom-ref at a site is $1 - \sum_{i=1}^{2N} (\text{het} / i)$.

Note that heterozygosity as used here is the population genetics concept: http://en.wikipedia.org/wiki/Zygosity#Heterozygosity_in_population_genetics That is, a hets value of 0.01 implies that two randomly chosen chromosomes from the population of organisms would differ from each other (one being A and the other B) at a rate of 1 in 100 bp.

Note that this quantity has nothing to do with the likelihood of any given sample having a heterozygous genotype, which in the GATK is purely determined by the probability of the observed data P(D | AB) under the model that there may be an AB het genotype. The posterior probability of this AB genotype would use the het prior, but the GATK only uses this posterior probability in determining the probability that a site is polymorphic. So changing the het parameters only affects the chance that a site will be called non-reference across all samples, but doesn't actually change the output genotype likelihoods at all, as these aren't posterior probabilities. The quantity that changes whether the GATK considers the possibility of a het genotype at all is the ploidy, which determines how many chromosomes each individual in the species carries.

Double  0.001  [ [ -∞  ∞ ] ]
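The hom-ref prior described above can be evaluated directly. This is a minimal sketch of the stated formula for diploid samples, with the sum running over allele counts 1 to 2N; the function name is a hypothetical helper, not GATK code.

```python
def prior_all_hom_ref(num_samples, het=0.001):
    """Prior probability that all N diploid samples are hom-ref:
    1 - sum_{i=1}^{2N} het / i."""
    num_chromosomes = 2 * num_samples
    return 1.0 - sum(het / i for i in range(1, num_chromosomes + 1))


# One diploid sample: 1 - (0.001/1 + 0.001/2)
single = prior_all_hom_ref(1)
# The prior of a non-reference site grows slowly with cohort size
for n in (1, 10, 100):
    print(n, prior_all_hom_ref(n))
```

Because the sum is harmonic, adding samples raises the prior probability of a variant site only logarithmically.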

### --indel_heterozygosity / -indelHeterozygosity

Heterozygosity for indel calling. See the GATKDocs for heterozygosity for full details on the meaning of this population genetics concept
This argument informs the prior probability of having an indel at a site.

double  1.25E-4  [ [ -∞  ∞ ] ]

### --indelGapContinuationPenalty / -indelGCP

Indel gap continuation penalty, as Phred-scaled probability. I.e., 30 => 10^(-30/10)

byte  10  [ [ -∞  ∞ ] ]

### --indelGapOpenPenalty / -indelGOP

Indel gap open penalty, as Phred-scaled probability. I.e., 30 => 10^(-30/10)

byte  45  [ [ -∞  ∞ ] ]
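The Phred scaling used by both gap penalties maps a quality Q to the probability 10^(-Q/10). A minimal sketch of the conversion, applied to the two default values (the function name is illustrative):

```python
def phred_to_prob(q):
    """Convert a Phred-scaled value Q to a probability: 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


gap_open = phred_to_prob(45)          # default -indelGOP
gap_continuation = phred_to_prob(10)  # default -indelGCP
print(gap_open, gap_continuation)
```

With the defaults, opening a gap is far less probable than extending one that is already open.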

### --input_prior / -inputPrior

Input prior for calls
By default, the prior specified with the argument --heterozygosity/-hets is used for variant discovery at a particular locus, using an infinite sites model; see e.g. Watterson (1975) or Tajima (1996). This model asserts that the probability of having a population of k variant sites in N chromosomes is proportional to theta/k, for k = 1:N.

There are instances where using this prior might not be desirable, e.g. for population studies where the prior might not be appropriate, as for example when the ancestral status of the reference allele is not known. With this argument, the user can manually specify the priors to be used for calling as a vector of doubles, with the following restrictions:

• The user must specify 2N values, where N is the number of samples.
• Only diploid calls are supported.
• Probability values are specified in double format, in linear space.
• No negative values are allowed.
• The values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one.
• If the user-defined values add to more than one, an error will be produced.

If the user wants completely flat priors, then the user should specify the same value (=1/(2*N+1)) 2*N times, e.g. -inputPrior 0.33 -inputPrior 0.33 for the single-sample diploid case.

List[Double]  []
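The flat-prior recipe can be sketched as follows: 2N copies of 1/(2N+1), leaving the same mass for Pr(AC=0). Both helper names are hypothetical; the only GATK-specific string used is the -inputPrior flag documented here.

```python
def flat_input_priors(num_samples):
    """Return 2N equal prior values of 1 / (2N + 1) for N diploid samples."""
    value = 1.0 / (2 * num_samples + 1)
    return [value] * (2 * num_samples)


def as_arguments(priors):
    """Repeat the -inputPrior flag once per value, in linear space."""
    args = []
    for p in priors:
        if p < 0:
            raise ValueError("negative priors are not allowed")
        args += ["-inputPrior", str(p)]
    return args


priors = flat_input_priors(1)      # single-sample diploid case
prob_ac_zero = 1.0 - sum(priors)   # remaining mass assigned to AC=0
print(as_arguments(priors), prob_ac_zero)
```

For a single sample this yields two values of about 0.33, matching the example above, with the remaining third assigned to AC=0.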

### --max_alternate_alleles / -maxAltAlleles

Maximum number of alternate alleles to genotype
If there are more than this number of alternate alleles presented to the genotyper (either through discovery or GENOTYPE_GIVEN_ALLELES), then only this many alleles will be used. Note that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter. As of GATK 2.2 the genotyper can handle a very large number of events, so the default maximum has been increased to 6.

int  6  [ [ -∞  ∞ ] ]

### --max_deletion_fraction / -deletions

Maximum fraction of reads with deletions spanning this locus for it to be callable
If the fraction of reads with deletions spanning a locus is greater than this value, the site will not be considered callable and will be skipped. To disable the use of this parameter, set its value to >1.

Double  0.05  [ [ -∞  ∞ ] ]

### --min_base_quality_score / -mbq

Minimum base quality required to consider a base for calling
The minimum confidence needed in a given base for it to be used in variant calling. Note that the base quality of a base is capped by the mapping quality so that bases on reads with low mapping quality may get filtered out depending on this value. Note too that this argument is ignored in indel calling. In indel calling, low-quality ends of reads are clipped off (with fixed threshold of Q20).

int  17  [ [ -∞  ∞ ] ]
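The interaction between base quality, mapping quality, and this threshold can be illustrated with a short sketch. It assumes the cap is a simple minimum and that the comparison against the threshold is inclusive; both are assumptions made for illustration, and the function name is hypothetical.

```python
def base_is_usable(base_quality, mapping_quality, min_base_quality=17):
    """Return True if a base passes the -mbq threshold for SNP calling.

    Assumption: the effective quality is min(base quality, mapping quality)
    and a base exactly at the threshold is kept.
    """
    effective_quality = min(base_quality, mapping_quality)
    return effective_quality >= min_base_quality


print(base_is_usable(30, 60))  # high-quality base on a well-mapped read
print(base_is_usable(30, 10))  # same base on a poorly mapped read
```

The second case shows how a high-quality base can still be filtered out when its read has low mapping quality.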

### --min_indel_count_for_genotyping / -minIndelCnt

Minimum number of consensus indels required to trigger genotyping run
A candidate indel is genotyped (and potentially called) if there are this number of reads with a consensus indel at a site. Decreasing this value will increase sensitivity but at the cost of larger calling time and a larger number of false positives.

int  5  [ [ -∞  ∞ ] ]

### --min_indel_fraction_per_sample / -minIndelFrac

Minimum fraction of all reads at a locus that must contain an indel (of any allele) for that sample to contribute to the indel count for alleles
Complementary argument to minIndelCnt. Only samples with at least this fraction of indel-containing reads will contribute to counting and overcoming the threshold minIndelCnt. This parameter ensures that in deep data you don't end up summing lots of super rare errors up to overcome the 5 read default threshold. It should work equally well for low-coverage and high-coverage samples, as low-coverage samples with any indel-containing reads should easily overcome this threshold.

double  0.25  [ [ -∞  ∞ ] ]
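How the two indel thresholds combine can be sketched as below. This is a simplified illustration of the behaviour described for -minIndelFrac and -minIndelCnt, not the caller's actual consensus logic; the function name, the input layout, and the inclusive comparisons are assumptions.

```python
def should_genotype_indel(per_sample_counts,
                          min_indel_count=5, min_indel_fraction=0.25):
    """Decide whether a candidate indel site triggers genotyping.

    per_sample_counts: list of (indel_reads, total_reads), one per sample.
    Only samples whose indel fraction reaches min_indel_fraction add their
    indel reads to the running count.
    """
    contributing = 0
    for indel_reads, total_reads in per_sample_counts:
        if total_reads > 0 and indel_reads / total_reads >= min_indel_fraction:
            contributing += indel_reads
    return contributing >= min_indel_count


# Deep samples with rare indel reads: 9 indel reads in total, none contribute
deep = should_genotype_indel([(3, 100), (3, 100), (3, 100)])
# Shallow samples where the indel is well supported
shallow = should_genotype_indel([(3, 8), (3, 10)])
print(deep, shallow)
```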

### --onlyEmitSamples / -onlyEmitSamples

If provided, only these samples will be emitted into the VCF, regardless of which samples are present in the BAM file

Set[String]  []

### --out / -o

File to which variants should be written
A raw, unfiltered, highly sensitive callset in VCF format.

VariantContextWriter  stdout

### --output_mode / -out_mode

Specifies which type of calls we should output

The --output_mode argument is an enumerated type (OutputMode), which can have one of the following values:

EMIT_VARIANTS_ONLY
produces calls only at variant sites
EMIT_ALL_CONFIDENT_SITES
produces calls at variant sites and confident reference sites
EMIT_ALL_SITES
produces calls at any callable site regardless of confidence; this argument is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode

OutputMode  EMIT_VARIANTS_ONLY

### --pair_hmm_implementation / -pairHMM

The PairHMM implementation to use for -glm INDEL genotype likelihood calculations
The various implementations balance a tradeoff of accuracy and runtime.

The --pair_hmm_implementation argument is an enumerated type (HMM_IMPLEMENTATION), which can have one of the following values:

EXACT
ORIGINAL
LOGLESS_CACHING
VECTOR_LOGLESS_CACHING
DEBUG_VECTOR_LOGLESS_CACHING
ARRAY_LOGLESS

HMM_IMPLEMENTATION  LOGLESS_CACHING

### --pcr_error_rate / -pcr_error

The PCR error rate to be used for computing fragment-based likelihoods
The PCR error rate is independent of the sequencing error rate, which is necessary because we cannot necessarily distinguish between PCR errors vs. sequencing errors. The practical implication for this value is that it effectively acts as a cap on the base qualities.

Double  1.0E-4  [ [ -∞  ∞ ] ]
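One way to read the "cap on the base qualities" remark: an error rate of 1.0E-4 corresponds to a Phred score of 40, so a base cannot be trusted much beyond Q40 regardless of its reported quality. The sketch below models that as a hard minimum, which is a simplifying assumption rather than the caller's exact likelihood computation; the function names are hypothetical.

```python
import math


def prob_to_phred(p):
    """Convert an error probability to a Phred-scaled quality."""
    return -10 * math.log10(p)


def capped_quality(base_quality, pcr_error_rate=1.0e-4):
    """Approximate the effective base quality as min(Q, Phred(pcr error))."""
    return min(base_quality, prob_to_phred(pcr_error_rate))


cap = prob_to_phred(1.0e-4)
print(cap, capped_quality(50), capped_quality(30))
```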

### --sample_ploidy / -ploidy

Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
Sample ploidy - equivalent to number of chromosomes per pool. In pooled experiments this should be = # of samples in pool * individual sample ploidy

int  2  [ [ -∞  ∞ ] ]
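For pooled experiments, the value to pass is the product stated above. A tiny sketch, with a hypothetical helper name and an illustrative pool size:

```python
def pooled_ploidy(samples_per_pool, sample_ploidy=2):
    """Value for -ploidy: number of samples in each pool times the ploidy
    of an individual sample."""
    if samples_per_pool < 1 or sample_ploidy < 1:
        raise ValueError("pool size and ploidy must be positive")
    return samples_per_pool * sample_ploidy


# Pools of 10 diploid individuals
ploidy_args = ["-ploidy", str(pooled_ploidy(10))]
print(ploidy_args)
```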

### --standard_min_confidence_threshold_for_calling / -stand_call_conf

The minimum phred-scaled confidence threshold at which variants should be called
The minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls. Only genotypes with confidence >= this threshold are emitted as called sites. A reasonable threshold is 30 for high-pass calling (this is the default).

double  30.0  [ [ -∞  ∞ ] ]

### --standard_min_confidence_threshold_for_emitting / -stand_emit_conf

The minimum phred-scaled confidence threshold at which variants should be emitted (and filtered with LowQual if less than the calling threshold)
This argument allows you to emit low quality calls as filtered records.

double  30.0  [ [ -∞  ∞ ] ]
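The relationship between the calling and emitting thresholds can be sketched as a three-way classification. The labels and the function name are illustrative, and treating the emit threshold as inclusive is an assumption; the LowQual filter name comes from the argument summary above.

```python
def classify_site(qual, call_conf=30.0, emit_conf=30.0):
    """Classify a variant site by its phred-scaled confidence.

    - qual >= call_conf: emitted as a called site
    - emit_conf <= qual < call_conf: emitted but filtered as LowQual
    - qual < emit_conf: not written to the output
    """
    if qual >= call_conf:
        return "CALLED"
    if qual >= emit_conf:
        return "LowQual"
    return "NOT_EMITTED"


# Thresholds from the multi-sample example: -stand_call_conf 50, -stand_emit_conf 10
for q in (60.0, 25.0, 5.0):
    print(q, classify_site(q, call_conf=50.0, emit_conf=10.0))
```

With the defaults both thresholds are 30, so no LowQual records are produced unless the emit threshold is lowered.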

GATK version 3.2-2-gec30cee built at 2014/09/12 22:29:29. GTD: NA