Randomly selects VCF records according to specified options.
ValidationSiteSelectorWalker is intended for experiments that sample variants at random from a set, for example to choose sites for a follow-up validation study. Sites are selected randomly, but within certain restrictions. There are two main sources of restrictions:
a) Sample restrictions. A user can specify a set of samples, and only sites that are polymorphic within that sample subset are considered. The samples can be given as individual sample names, as a text file (one sample name per line), or as a regular expression. The user can also specify whether a sample counts as polymorphic based on its genotype (a non-reference genotype means the sample is polymorphic at that variant, so the variant is considered for inclusion in the set) or based on its PLs.
b) Allele frequency restrictions. A user can additionally specify a sampling method based on allele frequency. Two sampling methods are currently supported:
1. Uniform sampling samples uniformly from the variants that are polymorphic in the selected samples.
2. Sampling based on the allele frequency spectrum ensures that the output sites have the same AF distribution as the input set.
The user can additionally restrict output to a particular type of variant (SNP, indel, etc.).
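The difference between the two frequency-based sampling methods can be illustrated with a minimal Python sketch. This is not the GATK implementation: the function names, the choice of ten equal-width AF bins, and the representation of a site as a (site_id, allele_frequency) pair are all assumptions made for illustration. Uniform mode draws every candidate with equal probability, while the spectrum-preserving mode bins candidates by AF and draws from each bin in proportion to its share of the input.

```python
import random
from collections import defaultdict

def sample_uniform(sites, n, rng):
    """Draw n sites with equal probability (or all sites if fewer than n)."""
    return rng.sample(sites, min(n, len(sites)))

def sample_keep_af_spectrum(sites, n, rng, num_bins=10):
    """Draw about n sites so that each AF bin keeps its share of the input.

    sites is a list of (site_id, allele_frequency) pairs with AF in [0, 1].
    Because quotas are rounded per bin, the total can differ slightly from n.
    """
    bins = defaultdict(list)
    for site in sites:
        af = site[1]
        # AF == 1.0 goes into the last bin instead of overflowing.
        index = min(int(af * num_bins), num_bins - 1)
        bins[index].append(site)

    selected = []
    for index in sorted(bins):
        members = bins[index]
        # Quota proportional to the bin's fraction of all candidate sites.
        quota = round(n * len(members) / len(sites))
        selected.extend(rng.sample(members, min(quota, len(members))))
    return selected

rng = random.Random(0)
# Hypothetical input: 80 rare sites (AF 0.05) and 20 common sites (AF 0.55).
candidates = [(f"rare{i}", 0.05) for i in range(80)]
candidates += [(f"common{i}", 0.55) for i in range(20)]
picked = sample_keep_af_spectrum(candidates, 10, rng)
```

With this input the spectrum-preserving draw keeps the 80:20 ratio of rare to common sites, whereas a uniform draw of the same size only matches that ratio on average.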
One or more variant sets to choose from.
A sites-only VCF with the desired number of randomly selected sites.
java -Xmx2g -jar GenomeAnalysisTK.jar \
  -R ref.fasta \
  -T ValidationSiteSelectorWalker \
  --variant input1.vcf \
  --variant input2.vcf \
  -sn NA12878 \
  -o output.vcf \
  --numValidationSites 200 \
  -sampleMode POLY_BASED_ON_GT \
  -freqMode KEEP_AF_SPECTRUM

java -Xmx2g -jar GenomeAnalysisTK.jar \
  -R ref.fasta \
  -T ValidationSiteSelectorWalker \
  --variant:foo input1.vcf \
  --variant:bar input2.vcf \
  --numValidationSites 200 \
  -sf samples.txt \
  -o output.vcf \
  -sampleMode POLY_BASED_ON_GT \
  -freqMode UNIFORM \
  -selectType INDEL
These Read Filters are automatically applied to the data by the Engine before processing by ValidationSiteSelector.
The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).
This table summarizes the command-line arguments that are specific to this tool. For details, see the list further down below the table.
|Argument name||Type||Default value||Summary|
|--numValidationSites||int||0||Number of output validation sites|
|--variant||List[RodBinding[VariantContext]]||NA||Input VCF file, can be specified multiple times|
|--frequencySelectionMode||AF_COMPUTATION_MODE||KEEP_AF_SPECTRUM||Allele Frequency selection mode|
|--ignoreGenotypes||boolean||false||If true, will ignore genotypes in VCF, will take AC,AF from annotations and will make no sample selection|
|--ignorePolymorphicStatus||boolean||false||If true, will ignore polymorphic status in VCF, and will take VCF record directly without pre-selection|
|--includeFilteredSites||boolean||false||If true, will include filtered sites in set to choose variants from|
|--out||VariantContextWriter||stdout||File to which variants should be written|
|--sample_expressions||Set[String]||NA||Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times|
|--sample_file||Set[File]||NA||File containing a list of samples (one per line) to include. Can be specified multiple times|
|--sample_name||Set[String]||||Include genotypes from this sample. Can be specified multiple times|
|--sampleMode||SAMPLE_SELECTION_MODE||NONE||Sample selection mode|
|--samplePNonref||double||0.99||GL-based selection mode only: the probability that a site is non-reference in the samples for which to include the site|
|--selectTypeToInclude||List[Type]||||Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times|
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
Allele Frequency selection mode. This argument selects the allele frequency selection mode. See the wiki for more information.
The --frequencySelectionMode argument is an enumerated type (AF_COMPUTATION_MODE), which can have one of the following values: KEEP_AF_SPECTRUM (the default) or UNIFORM.
If true, will ignore genotypes in VCF, will take AC,AF from annotations and will make no sample selection. Argument for the frequency selection mode: AC, AF and AN are taken from the VCF INFO field, not recalculated. Typically specified for sites-only VCFs that still have AC/AF/AN information.
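As an illustration of taking AC, AF and AN from the annotations rather than recalculating them, here is a small Python sketch that reads the three values from an INFO string. The helper name and the example INFO value are hypothetical; the only thing assumed about the input is the standard VCF layout of semicolon-separated key=value entries.

```python
def read_allele_annotations(info):
    """Return (AC, AF, AN) from a VCF INFO string, without recalculating them."""
    fields = {}
    for entry in info.split(";"):
        # Flag entries such as DB have no value and are skipped here.
        if "=" in entry:
            key, value = entry.split("=", 1)
            fields[key] = value
    # AC and AF hold one value per alternate allele; AN is a single integer.
    ac = [int(x) for x in fields["AC"].split(",")] if "AC" in fields else None
    af = [float(x) for x in fields["AF"].split(",")] if "AF" in fields else None
    an = int(fields["AN"]) if "AN" in fields else None
    return ac, af, an

# Hypothetical INFO value from a sites-only VCF line.
ac, af, an = read_allele_annotations("AC=3;AF=0.25;AN=12;DB")
```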
If true, will ignore polymorphic status in VCF, and will take VCF record directly without pre-selection. Argument for the frequency selection mode. Allows reference (non-polymorphic) sites to be included in the validation set.
If true, will include filtered sites in set to choose variants from. Do not exclude filtered sites (e.g. not PASS or .) from consideration for validation.
Number of output validation sites. The number of sites in your validation set.
File to which variants should be written. The output VCF file.
Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times. Sample regexps to subset the input VCF to, prior to selecting variants. For example, --sample_expressions '^NA12.*' subsets to all samples with prefix NA12 (note that the shell-style pattern NA12* is not a prefix match in regular expression syntax).
File containing a list of samples (one per line) to include. Can be specified multiple times. File containing a list of sample names to subset the input VCF to. Equivalent to specifying the contents of the file separately with -sn.
Include genotypes from this sample. Can be specified multiple times. Sample name(s) to subset the input VCF to, prior to selecting variants. -sn A -sn B subsets to samples A and B.
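The three sample arguments above can be combined, and the following Python sketch shows one plausible way the final sample subset could be resolved. It assumes the result is the union of the explicit names, the file contents, and the regular-expression matches, restricted to samples present in the input, and it assumes expressions are applied with a search-style match; the function and parameter names are hypothetical and the tool's actual matching rules may differ.

```python
import re

def resolve_samples(available, names=(), file_lines=(), expressions=()):
    """Combine explicit names, file contents, and regex matches into one sample list."""
    selected = set(names)
    # One sample name per line; blank lines are skipped.
    selected.update(line.strip() for line in file_lines if line.strip())
    for expression in expressions:
        pattern = re.compile(expression)
        selected.update(s for s in available if pattern.search(s))
    # Keep only samples that are actually present in the input.
    return sorted(selected & set(available))

# Hypothetical sample names found in the input VCF header.
vcf_samples = ["NA12878", "NA12891", "NA19240", "HG00096"]
chosen = resolve_samples(
    vcf_samples,
    names=["HG00096"],
    file_lines=["NA19240\n", "\n"],
    expressions=["^NA12.*"],
)
```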
Sample selection mode. A mode for selecting sites based on sample-level data. See the wiki documentation for more information.
The --sampleMode argument is an enumerated type (SAMPLE_SELECTION_MODE), which can have one of the following values: NONE (the default), POLY_BASED_ON_GT, or POLY_BASED_ON_GL.
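For the genotype-based mode, the rule described in the overview is that a non-reference genotype in a selected sample makes the site eligible. A minimal Python sketch of that check follows; the function names and the dictionary of GT strings are assumptions for illustration, not the tool's internal representation.

```python
import re

def is_non_reference(gt):
    """True if a GT string such as 0/1 or 1|1 carries at least one ALT allele."""
    alleles = re.split(r"[/|]", gt)
    # "0" is the reference allele and "." is a missing call.
    return any(allele not in ("0", ".") for allele in alleles)

def site_is_polymorphic(genotypes, selected_samples):
    """True if any selected sample has a non-reference genotype at this site."""
    return any(
        is_non_reference(genotypes[sample])
        for sample in selected_samples
        if sample in genotypes
    )

# Hypothetical genotypes at one site, keyed by sample name.
site_genotypes = {"NA12878": "0/1", "NA12891": "0/0", "NA19240": "./."}
```

Here the site would be eligible when NA12878 is among the selected samples, and ineligible when only NA12891 and NA19240 are selected.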
GL-based selection mode only: the probability that a site is non-reference in the samples for which to include the site. A P[nonref] threshold for SAMPLE_SELECTION_MODE=POLY_BASED_ON_GL. See the wiki documentation for more information.
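To give a sense of what a P[nonref] threshold means, the Python sketch below turns Phred-scaled genotype likelihoods (PLs) into a probability that at least one selected sample is non-reference. It is only one possible calculation and is not taken from the GATK source: it assumes a flat prior over genotypes, independence between samples, and PLs ordered with the hom-ref genotype first. All function names are hypothetical.

```python
def p_hom_ref(pls):
    """Posterior probability of the hom-ref genotype under a flat genotype prior.

    pls is the list of Phred-scaled genotype likelihoods, hom-ref first.
    """
    likelihoods = [10 ** (-pl / 10) for pl in pls]
    return likelihoods[0] / sum(likelihoods)

def p_site_nonref(sample_pls):
    """Probability that at least one sample is non-reference, assuming independence."""
    p_all_ref = 1.0
    for pls in sample_pls:
        p_all_ref *= p_hom_ref(pls)
    return 1.0 - p_all_ref

def passes_threshold(sample_pls, threshold=0.99):
    """Apply a P[nonref] cutoff; 0.99 mirrors the documented default."""
    return p_site_nonref(sample_pls) >= threshold

# Hypothetical PLs (hom-ref, het, hom-var) for two samples at two sites.
confident_het = [[60, 0, 80], [0, 30, 300]]
weak_evidence = [[0, 3, 40], [0, 10, 90]]
```

Under these assumptions the first site passes the default threshold because one sample strongly supports a heterozygous call, while the second site does not.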
Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times. This argument selects particular kinds of variants (e.g. SNP, INDEL) out of a list. If left unspecified, all types are considered.
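The type names listed above can be made concrete with a simplified Python sketch that classifies a record from its REF and ALT alleles and then applies the type filter. The classification rules here are an approximation written for illustration and should not be read as the exact logic the tool uses; the function names are hypothetical.

```python
def allele_type(ref, alt):
    """Classify one REF/ALT pair (a simplified approximation)."""
    if alt.startswith("<") and alt.endswith(">"):
        return "SYMBOLIC"
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "INDEL"

def variant_type(ref, alts):
    """Classify a record; records whose ALT alleles disagree are MIXED."""
    if not alts:
        return "NO_VARIATION"
    types = {allele_type(ref, alt) for alt in alts}
    return types.pop() if len(types) == 1 else "MIXED"

def keep_record(ref, alts, select_types=()):
    """With no types requested, every record is kept."""
    return not select_types or variant_type(ref, alts) in select_types
```

For example, a record with REF AT and ALT A is classified as INDEL and is kept when INDEL is among the requested types, while a record with REF A and ALT alleles G and AT is classified as MIXED.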
GATK version 2.5-2-gdb4546e built at 2013/05/01 09:32:36.