GenotypeGVCFs

Genotypes any number of gVCF files that were produced by the HaplotypeCaller into a single joint VCF file.

Category Variant Discovery Tools

Traversal LocusWalker

PartitionBy LOCUS


Overview

GenotypeGVCFs merges gVCF records that were produced as part of the reference model-based variant discovery pipeline (see documentation for more details) using the '-ERC GVCF' or '-ERC BP_RESOLUTION' mode of the HaplotypeCaller. This tool performs the multi-sample joint aggregation step: at every position of the target, it combines all spanning records, produces correct genotype likelihoods, re-genotypes the newly merged record, and then re-annotates it. Note that this tool cannot work with just any gVCF files; they must have been produced with the HaplotypeCaller, which uses a reference model to produce accurate genotype likelihoods for every position in the target.

Input

One or more HaplotypeCaller gVCFs to genotype.

Output

A combined, genotyped VCF.

Examples

 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T GenotypeGVCFs \
   --variant gvcf1.vcf \
   --variant gvcf2.vcf \
   -o output.vcf
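
When many gVCFs are genotyped together, writing one --variant line per file by hand becomes error-prone. The following is a minimal Python sketch that assembles the same command line as the example above from a list of files. The helper name build_genotype_gvcfs_command and its parameters are illustrative assumptions and are not part of GATK; only the arguments shown in the example are used.

```python
import shlex


def build_genotype_gvcfs_command(reference, gvcfs, output,
                                 jar="GenomeAnalysisTK.jar", java_mem="2g"):
    """Return the GenotypeGVCFs command as an argument list (hypothetical helper)."""
    if not gvcfs:
        raise ValueError("at least one gVCF is required (--variant)")
    cmd = ["java", f"-Xmx{java_mem}", "-jar", jar,
           "-R", reference,
           "-T", "GenotypeGVCFs"]
    for gvcf in gvcfs:
        # One --variant argument per input gVCF
        cmd += ["--variant", gvcf]
    cmd += ["-o", output]
    return cmd


cmd = build_genotype_gvcfs_command(
    "ref.fasta", ["gvcf1.vcf", "gvcf2.vcf"], "output.vcf")
# Print a shell-safe version of the command for inspection
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to execute the command, assuming Java and the GATK jar are available at the given paths.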
 

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by GenotypeGVCFs.

Parallelism options

This tool can be run in multi-threaded mode using this option.

Downsampling settings

This tool applies the following downsampling settings by default.

  • Mode: BY_SAMPLE
  • To coverage: 1,000

Window size

This tool uses a sliding window on the reference.

  • Window start: -10 bp before the locus
  • Window stop: 10 bp after the locus

Command-line Arguments

Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

GenotypeGVCFs specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s) | Default value | Summary
Required Inputs
--variant, -V | NA | One or more input gVCF files
Optional Inputs
--dbsnp, -D | none | dbSNP file
Optional Outputs
--out, -o | stdout | File to which variants should be written
Optional Parameters
--heterozygosity, -hets | 0.001 | Heterozygosity value used to compute prior likelihoods for any locus. See the GATKDocs for full details on the meaning of this population genetics concept
--indel_heterozygosity, -indelHeterozygosity | 1.25E-4 | Heterozygosity for indel calling. See the GATKDocs for heterozygosity for full details on the meaning of this population genetics concept
--sample_ploidy, -ploidy | 2 | Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
--standard_min_confidence_threshold_for_calling, -stand_call_conf | 30.0 | The minimum phred-scaled confidence threshold at which variants should be called
--standard_min_confidence_threshold_for_emitting, -stand_emit_conf | 30.0 | The minimum phred-scaled confidence threshold at which variants should be emitted (and filtered with LowQual if less than the calling threshold)
Optional Flags
--annotateNDA, -nda | false | If provided, we will annotate records with the number of alternate alleles that were discovered (but not necessarily genotyped) at a given site
--includeNonVariantSites, -allSites | false | Include loci found to be non-variant after genotyping
Advanced Parameters
--annotation, -A | [InbreedingCoeff, FisherStrand, QualByDepth, ChromosomeCounts, GenotypeSummaries] | One or more specific annotations to recompute
--input_prior, -inputPrior | [] | Input prior for calls
--max_alternate_alleles, -maxAltAlleles | 6 | Maximum number of alternate alleles to genotype

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--annotateNDA / -nda

If provided, we will annotate records with the number of alternate alleles that were discovered (but not necessarily genotyped) at a given site
Depending on the value of the --max_alternate_alleles argument, we may genotype only a fraction of the alleles being sent on for genotyping. Using this argument instructs the genotyper to annotate (in the INFO field) the number of alternate alleles that were originally discovered at the site.

boolean  false


--annotation / -A

One or more specific annotations to recompute
Which annotations to recompute for the combined output VCF file.

List[String]  [InbreedingCoeff, FisherStrand, QualByDepth, ChromosomeCounts, GenotypeSummaries]


--dbsnp / -D

dbSNP file
The rsIDs from this file are used to populate the ID column of the output. Also, the DB INFO flag will be set when appropriate. Note that dbSNP is not used in any way for the calculations themselves.

--dbsnp binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

RodBinding[VariantContext]  none


--heterozygosity / -hets

Heterozygosity value used to compute prior likelihoods for any locus. See the GATKDocs for full details on the meaning of this population genetics concept
The expected heterozygosity value used to compute the prior probability that a locus is non-reference. The default priors are provided for humans: het = 1e-3, which means that the probability of N samples being hom-ref at a site is 1 - sum_{i=1..2N} (het / i).
Note that heterozygosity as used here is the population genetics concept (http://en.wikipedia.org/wiki/Zygosity#Heterozygosity_in_population_genetics). That is, a hets value of 0.01 implies that two randomly chosen chromosomes from the population of organisms would differ from each other (one being A and the other B) at a rate of 1 in 100 bp.
Note that this quantity has nothing to do with the likelihood of any given sample having a heterozygous genotype, which in the GATK is purely determined by the probability of the observed data P(D | AB) under the model that there may be an AB het genotype. The posterior probability of this AB genotype would use the het prior, but the GATK only uses this posterior probability in determining the probability that a site is polymorphic. So changing the het parameter only affects the chance that a site will be called non-reference across all samples; it does not change the output genotype likelihoods at all, as these are not posterior probabilities.
The quantity that changes whether the GATK considers the possibility of a het genotype at all is the ploidy, which determines how many chromosomes each individual in the species carries.

Double  0.001  [ [ -∞  ∞ ] ]
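
To make the hom-ref prior above concrete, here is a small Python sketch that evaluates 1 - sum over i = 1..2N of (het / i) for diploid samples. The function name is an illustrative assumption, and the sketch only evaluates the formula as stated in the text; it is not taken from the GATK source code.

```python
def prob_all_hom_ref(n_samples, het=0.001):
    """Prior probability that N diploid samples are all hom-ref at a site.

    Evaluates 1 - sum_{i=1..2N} (het / i), the expression given in the text.
    """
    n_chromosomes = 2 * n_samples  # diploid: two chromosomes per sample
    return 1.0 - sum(het / i for i in range(1, n_chromosomes + 1))


# The prior probability of a site being non-variant shrinks slowly
# as more samples (chromosomes) are genotyped together.
for n in (1, 10, 100):
    print(n, prob_all_hom_ref(n))
```

For a single sample with the default het of 0.001, the sum is 0.001 + 0.0005, so the hom-ref prior is 0.9985.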


--includeNonVariantSites / -allSites

Include loci found to be non-variant after genotyping

boolean  false


--indel_heterozygosity / -indelHeterozygosity

Heterozygosity for indel calling. See the GATKDocs for heterozygosity for full details on the meaning of this population genetics concept
This argument informs the prior probability of having an indel at a site.

double  1.25E-4  [ [ -∞  ∞ ] ]


--input_prior / -inputPrior

Input prior for calls
By default, the prior specified with the argument --heterozygosity/-hets is used for variant discovery at a particular locus, using an infinite sites model; see e.g. Watterson (1975) or Tajima (1996). This model asserts that the probability of having a population of k variant sites in N chromosomes is proportional to theta/k, for k=1:N.
There are instances where using this prior might not be desirable, e.g. for population studies where this prior might not be appropriate, for example when the ancestral status of the reference allele is not known. With this argument, the user can manually specify the priors to be used for calling as a vector of doubles, with the following restrictions:
a) User must specify 2N values, where N is the number of samples.
b) Only diploid calls supported.
c) Probability values are specified in double format, in linear space.
d) No negative values allowed.
e) Values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one.
f) If user-defined values add to more than one, an error will be produced.
If the user wants completely flat priors, the same value (=1/(2*N+1)) should be specified 2*N times, e.g. -inputPrior 0.33 -inputPrior 0.33 for the single-sample diploid case.

List[Double]  []
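
The restrictions above can be checked before launching a run. The sketch below, in Python, computes flat priors for N diploid samples, applies restrictions (a), (d) and (f), and formats the values as repeated -inputPrior arguments. All function names are illustrative assumptions; the checks mirror the text and are not GATK's own validation code.

```python
def flat_input_priors(n_samples):
    """Flat prior: the value 1/(2N+1), repeated 2N times (diploid only)."""
    n_values = 2 * n_samples
    return [1.0 / (n_values + 1)] * n_values


def check_input_priors(priors, n_samples):
    """Apply restrictions (a), (d), (f) and return the implied Pr(AC=0)."""
    if len(priors) != 2 * n_samples:
        raise ValueError("expected 2N values, one per possible allele count")
    if any(p < 0 for p in priors):
        raise ValueError("negative values are not allowed")
    total = sum(priors)
    if total > 1.0:
        raise ValueError("values add up to more than one")
    return 1.0 - total  # restriction (e): Pr(AC=0) is 1 - sum


def as_arguments(priors):
    """Format the priors as repeated -inputPrior arguments."""
    args = []
    for p in priors:
        # Six significant digits; the example in the text rounds to 0.33
        args += ["-inputPrior", f"{p:.6g}"]
    return args


priors = flat_input_priors(1)
print(check_input_priors(priors, 1))
print(as_arguments(priors))
```

For the single-sample case this yields two values of one third each, leaving one third for Pr(AC=0), which matches the flat-prior example in the text.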


--max_alternate_alleles / -maxAltAlleles

Maximum number of alternate alleles to genotype
If there are more than this number of alternate alleles presented to the genotyper (either through discovery or GENOTYPE_GIVEN_ALLELES), then only this many alleles will be used. Note that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter. As of GATK 2.2 the genotyper can handle a very large number of events, so the default maximum has been increased to 6.

int  6  [ [ -∞  ∞ ] ]
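
One way to get a feel for why additional alternate alleles are costly is to count the candidate genotypes at a site. The Python sketch below uses the standard combinatorial count of unordered genotypes for a given ploidy and allele number. It is an illustration only and does not describe GATK's internal cost model; the function name is an assumption.

```python
from math import comb


def genotype_count(n_alt_alleles, ploidy=2):
    """Number of distinct unordered genotypes at a site.

    Counts multisets of size `ploidy` drawn from the reference allele
    plus `n_alt_alleles` alternate alleles.
    """
    n_alleles = n_alt_alleles + 1  # include the reference allele
    return comb(n_alleles + ploidy - 1, ploidy)


for n_alt in (1, 2, 6):
    print(n_alt, genotype_count(n_alt), genotype_count(n_alt, ploidy=10))
```

With the default maximum of 6 alternate alleles, a diploid sample has 28 candidate genotypes, while a pool with ploidy 10 has 8008.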


--out / -o

File to which variants should be written

VariantContextWriter  stdout


--sample_ploidy / -ploidy

Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
Sample ploidy, equivalent to the number of chromosomes per pool. In pooled experiments this should be set to (number of samples in the pool) * (individual sample ploidy).

int  2  [ [ -∞  ∞ ] ]
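
As a worked example of the pooled-data rule, the Python sketch below computes the value to pass to -ploidy for a pool. The helper name and the pool size of 10 are illustrative assumptions.

```python
def pooled_ploidy(samples_per_pool, sample_ploidy=2):
    """Ploidy for a pool: number of samples in the pool * individual ploidy."""
    if samples_per_pool < 1 or sample_ploidy < 1:
        raise ValueError("pool size and sample ploidy must be positive")
    return samples_per_pool * sample_ploidy


# A pool of 10 diploid individuals carries 20 chromosomes
ploidy_args = ["-ploidy", str(pooled_ploidy(10))]
print(ploidy_args)
```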


--standard_min_confidence_threshold_for_calling / -stand_call_conf

The minimum phred-scaled confidence threshold at which variants should be called
The minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls. Only genotypes with confidence >= this threshold are emitted as called sites. A reasonable threshold is 30 for high-pass calling (this is the default).

double  30.0  [ [ -∞  ∞ ] ]
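
For readers less familiar with the phred scale, the Python sketch below converts between a phred-scaled quality Q and the corresponding error probability using the standard definition Q = -10 * log10(P). The function names are illustrative assumptions.

```python
import math


def phred_to_error_probability(q):
    """Convert a phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a phred-scaled quality."""
    return -10 * math.log10(p)


print(phred_to_error_probability(30.0))
print(error_probability_to_phred(0.001))
```

Under this definition, the default threshold of 30 corresponds to an error probability of about 1 in 1,000.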


--standard_min_confidence_threshold_for_emitting / -stand_emit_conf

The minimum phred-scaled confidence threshold at which variants should be emitted (and filtered with LowQual if less than the calling threshold)
This argument allows you to emit low quality calls as filtered records.

double  30.0  [ [ -∞  ∞ ] ]
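
The interaction between the calling and emitting thresholds can be sketched as a small decision function. The Python example below is one reading of the two argument descriptions, not GATK's implementation: the function name and return labels are assumptions, and treating the emit threshold as inclusive (>=) is also an assumption.

```python
def classify_site(qual, call_conf=30.0, emit_conf=30.0):
    """Classify a site by its phred-scaled confidence (hypothetical helper)."""
    if qual >= call_conf:
        return "called"        # confident call, emitted unfiltered
    if qual >= emit_conf:
        return "LowQual"       # emitted, but filtered as LowQual
    return "not emitted"


print(classify_site(45.0))
print(classify_site(20.0, emit_conf=10.0))
print(classify_site(20.0))
```

With both defaults at 30.0, no site falls between the two thresholds, so LowQual records only appear when the emit threshold is set below the calling threshold.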


--variant / -V

One or more input gVCF files
The gVCF files to merge together

R List[RodBindingCollection[VariantContext]]



GATK version 3.2-2-gec30cee built at 2014/07/17 17:54:48. GTD: NA