
CalculateGenotypePosteriors

Calculates genotype posterior likelihoods given panel data

Category Variant Evaluation and Manipulation Tools

Traversal LocusWalker

PartitionBy LOCUS


Overview

Given a VCF with genotype likelihoods from the HaplotypeCaller, UnifiedGenotyper, or another source that provides unbiased GLs, calculate the posterior genotype state and likelihood given allele frequency information from both the samples themselves and input VCFs describing allele frequencies in related populations.

VCFs used to inform the genotype likelihoods (e.g. a population-specific VCF from 1000 Genomes) should have at least one of:

  • AC field and AN field
  • MLEAC field and AN field
  • genotypes

The AF field is not used in this calculation because it provides no way to estimate the confidence interval or uncertainty around the allele frequency, whereas AN provides this necessary information. This uncertainty is modeled by a Dirichlet distribution: that is, the frequency is known up to a Dirichlet distribution with parameters AC1+q, AC2+q, ..., (AN-AC1-AC2-...)+q, where "q" is the global frequency prior (typically q << 1). The genotype priors applied then follow a Dirichlet-Multinomial distribution, where 2 alleles per sample are drawn independently. This assumption of independent draws is the assumption of Hardy-Weinberg Equilibrium (HWE). Thus, HWE is imposed on the likelihoods as a result of CalculateGenotypePosteriors.
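The Dirichlet-Multinomial prior described above can be illustrated for a biallelic site with a short Python sketch. This is not the tool's actual implementation; the function name `genotype_priors` and the restriction to two alleles are assumptions made for illustration. The Dirichlet parameters are (AN-AC)+q for the reference allele and AC+q for the alternate allele, and two alleles are drawn per sample.

```python
def genotype_priors(ac, an, q=0.001):
    """Dirichlet-Multinomial priors (hom-ref, het, hom-alt) for a biallelic site.

    ac: alternate allele count, an: total allele number,
    q: global frequency prior added to every allele's count.
    """
    alpha_alt = ac + q            # Dirichlet parameter for the alternate allele
    alpha_ref = (an - ac) + q     # Dirichlet parameter for the reference allele
    total = alpha_ref + alpha_alt
    denom = total * (total + 1.0)
    # Probability of each unordered pair of alleles when two alleles are drawn
    # from a frequency that is itself Dirichlet-distributed.
    hom_ref = alpha_ref * (alpha_ref + 1.0) / denom
    het = 2.0 * alpha_ref * alpha_alt / denom
    hom_alt = alpha_alt * (alpha_alt + 1.0) / denom
    return hom_ref, het, hom_alt


# Example: a panel reporting AC=10, AN=100 for the alternate allele.
print(genotype_priors(ac=10, an=100))
```

The three values always sum to 1. As AN grows, they approach the familiar Hardy-Weinberg proportions p^2, 2pq, q^2 for the observed allele frequency; with small AN the prior stays broader, which is why AN rather than AF is needed.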

Input

  • A VCF with genotype likelihoods, and optionally genotypes, AC/AN fields, or MLEAC/AN fields
  • (Optional) A PED pedigree file describing the relationships between the individuals.

A collection of VCFs to use for informing allele frequency priors. Each VCF must have one of:

  • AC field and AN field
  • MLEAC field and AN field
  • genotypes

Output

A new VCF with:

  1) Genotype posteriors added to the genotype fields ("PP")
  2) Genotypes and GQ assigned according to these posteriors
  3) Per-site genotype priors added to the INFO field ("PG")
  4) (Optional) Per-site, per-trio transmission probabilities given as Phred-scaled probability of all genotypes in the trio being correct, added to the genotype fields ("TP")
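To make the relationship between likelihoods, priors, and the "PP" and GQ outputs concrete, here is a minimal Python sketch. It assumes Phred-scaled input likelihoods (PL), strictly positive priors, posteriors normalized so the best genotype has PP 0, and GQ taken as the second-smallest PP capped at 99. These conventions and the function name `posteriors_from_pl` are assumptions for illustration, not a description of the tool's internals.

```python
import math


def posteriors_from_pl(pl, priors, max_gq=99):
    """Combine Phred-scaled likelihoods with genotype priors.

    Returns (pp, call, gq): Phred-scaled posteriors, the index of the
    most probable genotype, and a genotype quality.
    Assumes every prior is strictly positive.
    """
    likelihoods = [10.0 ** (-p / 10.0) for p in pl]
    unnormalized = [lk * pr for lk, pr in zip(likelihoods, priors)]
    raw = [-10.0 * math.log10(u) for u in unnormalized]
    best = min(raw)
    # Shift so that the most probable genotype has PP = 0.
    pp = [int(round(r - best)) for r in raw]
    call = pp.index(0)
    gq = min(max_gq, sorted(pp)[1])
    return pp, call, gq


# A weakly supported het (PL = 10,0,100) at a site where the panel says
# the alternate allele is rare: the prior moves the call to hom-ref.
pp, call, gq = posteriors_from_pl([10, 0, 100], [0.98, 0.0199, 0.0001])
print(pp, call, gq)
```

In this example the likelihoods alone favor the heterozygous genotype, but the posterior favors homozygous reference, which is the kind of refinement the tool is designed to make.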

Notes

Currently, priors will only be applied for SNP sites in the input callset (and only those that have a SNP at the matching site in the priors VCF unless the --calculateMissingPriors flag is used). If the site is not called in the priors, flat priors will be applied. Flat priors are also applied for any non-SNP sites in the input callset.
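The rules in this note can be summarized as a small decision function. The sketch below is only a restatement of the note in Python; the function name and the returned labels are hypothetical and do not correspond to anything in the tool.

```python
def prior_source(input_is_snp, panel_has_snp, calculate_missing_priors=False):
    """Return which kind of prior the note above says applies to a site."""
    if not input_is_snp:
        # Non-SNP sites in the input callset always get flat priors.
        return "flat"
    if panel_has_snp:
        # SNP present at the matching site in the priors VCF.
        return "panel"
    # SNP in the input callset but not called in the priors VCF.
    return "callset" if calculate_missing_priors else "flat"


print(prior_source(True, False))                                  # flat
print(prior_source(True, False, calculate_missing_priors=True))   # callset
```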

Examples

 Inform the genotype assignment of NA12878 using the 1000G Euro panel
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T CalculateGenotypePosteriors \
   -V NA12878.wgs.HC.vcf \
   -supporting 1000G_EUR.genotypes.combined.vcf \
   -o NA12878.wgs.HC.posteriors.vcf

 Refine the genotypes of a large panel based on the discovered allele frequency
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T CalculateGenotypePosteriors \
   -V input.vcf \
   -o output.withPosteriors.vcf

 Apply frequency and HWE-based priors to the genotypes of a family without including the family allele counts
 in the allele frequency estimates
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T CalculateGenotypePosteriors \
   -V input.vcf \
   -o output.withPosteriors.vcf \
   --ignoreInputSamples

 Calculate the posterior genotypes of a callset, and impose that a variant *not seen* in the external panel
 is tantamount to being AC=0, AN=100 within that panel
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T CalculateGenotypePosteriors \
   -supporting external.panel.vcf \
   -V input.vcf \
   -o output.withPosteriors.vcf \
   --numRefSamplesIfNoCall 100
   
 Apply only family priors to a callset
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T CalculateGenotypePosteriors \
   -V input.vcf \
   --skipPopulationPriors \
   -ped family.ped \
   -o output.withPosteriors.vcf

 

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by CalculateGenotypePosteriors.

Downsampling settings

This tool applies the following downsampling settings by default.

  • Mode: BY_SAMPLE
  • To coverage: 1,000

Command-line Arguments

Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

CalculateGenotypePosteriors specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s)                       Default value   Summary

Required Inputs
--variant, -V                          NA              Input VCF file

Optional Inputs
--supporting                           []              Other callsets to use in generating genotype posteriors

Optional Outputs
--out, -o                              stdout          File to which variants should be written

Optional Parameters
--deNovoPrior, -DNP                    1.0E-6          The de novo mutation prior
--globalPrior, -G                      0.001           The global Dirichlet prior parameters for the allele frequency
--numRefSamplesIfNoCall, -nrs          0               Number of homozygous-reference samples to infer as having been seen at a position where an "other callset" contains no site or genotype information

Optional Flags
--calculateMissingPriors, -calcMissing false           Use discovered allele frequency in the callset for variants that do not appear in the external callset
--defaultToAC, -useAC                  false           Use the AC field as opposed to MLEAC. Does nothing if VCF lacks MLEAC field
--ignoreInputSamples, -ext             false           Use external information only; do not inform genotype priors by the discovered allele frequency in the callset whose posteriors are being calculated. Useful for callsets containing related individuals.
--skipFamilyPriors, -skipFam           false           Skip application of family-based priors
--skipPopulationPriors, -skipPop       false           Skip application of population-based priors

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--calculateMissingPriors / -calcMissing

Use discovered allele frequency in the callset for variants that do not appear in the external callset
Calculate priors for missing external variants from sample data -- default behavior is to apply flat priors

boolean  false


--defaultToAC / -useAC

Use the AC field as opposed to MLEAC. Does nothing if VCF lacks MLEAC field
Rather than looking for the MLEAC field first, and then falling back to AC; first look for the AC field and then fall back to MLEAC or raw genotypes

boolean  false
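The lookup order that --defaultToAC changes can be sketched as follows. This is an illustrative Python fragment operating on a plain dictionary of INFO fields; the function name `pick_allele_count` and the dictionary representation are assumptions, and the final fallback to raw genotypes is only indicated by the `None` return.

```python
def pick_allele_count(info, use_ac=False):
    """Choose which INFO field supplies the allele count.

    Default order: MLEAC, then AC. With use_ac=True: AC, then MLEAC.
    Returns (field_name, value), or None if neither field is present,
    in which case a caller would fall back to counting raw genotypes.
    """
    order = ("AC", "MLEAC") if use_ac else ("MLEAC", "AC")
    for key in order:
        if key in info:
            return key, info[key]
    return None


site = {"AC": 12, "MLEAC": 11, "AN": 200}
print(pick_allele_count(site))               # prefers MLEAC
print(pick_allele_count(site, use_ac=True))  # prefers AC
```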


--deNovoPrior / -DNP

The de novo mutation prior
The mutation prior -- i.e. the probability that a new mutation occurs. Sensitivity analysis on known de novo mutations suggests a default value of 10^-6.

double  1.0E-6  [ [ -∞  ∞ ] ]


--globalPrior / -G

The global Dirichlet prior parameters for the allele frequency
The global prior of a variant site -- i.e. the expected allele frequency distribution knowing only that N alleles exist, and having observed none of them. This is the "typical" 1/x trend, modeled here as not varying across alleles. The calculation for this parameter is (Effective population size) * (steady state mutation rate)

double  0.001  [ [ -∞  ∞ ] ]


--ignoreInputSamples / -ext

Use external information only; do not inform genotype priors by the discovered allele frequency in the callset whose posteriors are being calculated. Useful for callsets containing related individuals.
Do not use the [MLE] allele count from the input samples (the ones for which you're calculating posteriors) in the site frequency distribution; only use the AC and AN calculated from external sources.

boolean  false


--numRefSamplesIfNoCall / -nrs

Number of homozygous-reference samples to infer as having been seen at a position where an "other callset" contains no site or genotype information
When a variant is not seen in a panel, whether to infer (and with what effective strength) that only reference alleles were ascertained at that site. E.g. "If not seen in 1000Genomes, treat it as AC=0, AN=2000". This is applied across all external panels, so if --numRefSamplesIfNoCall is 10 and the variant is absent in two panels, this confers evidence of AC=0, AN=20.

int  0  [ [ -∞  ∞ ] ]
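The arithmetic in the description above (a value of 10 and two panels lacking the variant giving AC=0, AN=20) can be written out as a small Python sketch. The function and parameter names are hypothetical; the sketch only restates the documented example and is not taken from the tool's source.

```python
def add_missing_panel_evidence(ac, an, panels_missing, num_ref_if_no_call):
    """Add reference-only evidence for panels in which the variant is absent.

    Each panel lacking the site contributes num_ref_if_no_call reference
    observations, so AN grows while AC is unchanged.
    """
    extra_ref = panels_missing * num_ref_if_no_call
    return ac, an + extra_ref


# Variant absent from two panels, with --numRefSamplesIfNoCall 10.
print(add_missing_panel_evidence(ac=0, an=0, panels_missing=2, num_ref_if_no_call=10))
```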


--out / -o

File to which variants should be written

VariantContextWriter  stdout


--skipFamilyPriors / -skipFam

Skip application of family-based priors
Skip application of family-based priors. Note: if pedigree file is absent, family-based priors will be skipped.

boolean  false


--skipPopulationPriors / -skipPop

Skip application of population-based priors

boolean  false


--supporting / -supporting

Other callsets to use in generating genotype posteriors
Supporting external panels. Allele counts from these panels (taken from AC,AN or MLEAC,AN or raw genotypes) will be used to inform the frequency distribution underlying the genotype priors.

--supporting binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

List[RodBinding[VariantContext]]  []


--variant / -V

Input VCF file
Variants from this VCF file are used by this tool as input. The file must at least contain the standard VCF header lines, but can be empty (i.e., no variants are contained in the file).

--variant binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

R RodBinding[VariantContext]



GATK version 3.2-2-gec30cee built at 2014/07/17 17:54:48. GTD: NA