ValidationSiteSelector

Randomly selects VCF records according to specified options.

Category Validation Utilities

Traversal LocusWalker

PartitionBy LOCUS


Overview

ValidationSiteSelector is intended for experiments in which sites are sampled randomly from a set of variants, for example to choose sites for a follow-up validation study. Sites are selected randomly, but within certain restrictions. There are two main sources of restrictions:

a) Sample restrictions. A user can specify a set of samples, and only sites that are polymorphic within that sample subset are considered. Samples can be given as individual sample names, as a text file (one sample name per line), or as a regular expression. A user can additionally specify whether polymorphic status is determined from genotypes (a sample with a non-reference genotype is polymorphic at that variant, so the variant is considered for inclusion in the set) or from PLs.

b) Allele frequency sampling method. Two sampling methods are currently supported:

1. Uniform sampling samples uniformly from the variants that are polymorphic in the selected samples.
2. Sampling based on the allele frequency (AF) spectrum ensures that the output sites have the same AF distribution as the input set.

The output can additionally be restricted to a particular type of variant (SNP, indel, etc.).
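
The two frequency-based sampling methods can be sketched as follows. This is a minimal illustration in Python, not the tool's implementation: the function names, the dict-based variant records with an "af" key, and the ten equal-width AF bins are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_uniform(variants, n, rng):
    """Pick up to n variants uniformly at random, without replacement."""
    return rng.sample(variants, min(n, len(variants)))


def sample_keep_af_spectrum(variants, n, rng, num_bins=10):
    """Pick about n variants so that each AF bin keeps its share of the input.

    Assumption: each variant is a dict with an "af" value in [0, 1].
    Because quotas are rounded per bin, the total can differ slightly from n.
    """
    bins = defaultdict(list)
    for v in variants:
        # An AF of exactly 1.0 is placed in the last bin.
        idx = min(int(v["af"] * num_bins), num_bins - 1)
        bins[idx].append(v)

    picked = []
    for idx in sorted(bins):
        members = bins[idx]
        # Each bin contributes in proportion to its size in the input set.
        quota = round(n * len(members) / len(variants))
        picked.extend(rng.sample(members, min(quota, len(members))))
    return picked
```

With an input in which 80% of the variants are rare and 20% are common, the spectrum-preserving function returns roughly the same 80/20 split, whereas uniform sampling only matches it on average.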

Input

One or more variant sets to choose from.

Output

A sites-only VCF with the desired number of randomly selected sites.

Examples

 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T ValidationSiteSelector \
   --variant input1.vcf \
   --variant input2.vcf \
   -sn NA12878 \
   -o output.vcf \
   --numValidationSites 200   \
   -sampleMode  POLY_BASED_ON_GT \
   -freqMode KEEP_AF_SPECTRUM

 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T ValidationSiteSelector \
   --variant:foo input1.vcf \
   --variant:bar input2.vcf \
   --numValidationSites 200 \
   -sf samples.txt \
   -o output.vcf \
   -sampleMode  POLY_BASED_ON_GT \
   -freqMode UNIFORM \
   -selectType INDEL
 

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by ValidationSiteSelector.


Command-line Arguments

Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

ValidationSiteSelector specific arguments

This table summarizes the command-line arguments that are specific to this tool. For details, see the argument details below the table.

Name | Type | Default value | Summary
Required
--numValidationSites | int | 0 | Number of output validation sites
--variant | List[RodBinding[VariantContext]] | NA | Input VCF file, can be specified multiple times
Optional
--frequencySelectionMode | AF_COMPUTATION_MODE | KEEP_AF_SPECTRUM | Allele Frequency selection mode
--ignoreGenotypes | boolean | false | If true, will ignore genotypes in VCF, will take AC,AF from annotations and will make no sample selection
--ignorePolymorphicStatus | boolean | false | If true, will ignore polymorphic status in VCF, and will take VCF record directly without pre-selection
--includeFilteredSites | boolean | false | If true, will include filtered sites in set to choose variants from
--out | VariantContextWriter | stdout | File to which variants should be written
--sample_expressions | Set[String] | NA | Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times
--sample_file | Set[File] | NA | File containing a list of samples (one per line) to include. Can be specified multiple times
--sample_name | Set[String] | [] | Include genotypes from this sample. Can be specified multiple times
--sampleMode | SAMPLE_SELECTION_MODE | NONE | Sample selection mode
--samplePNonref | double | 0.99 | GL-based selection mode only: the probability that a site is non-reference in the samples for which to include the site
--selectTypeToInclude | List[Type] | [] | Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

--frequencySelectionMode / -freqMode ( AF_COMPUTATION_MODE with default value KEEP_AF_SPECTRUM )

Allele Frequency selection mode. This argument selects how sites are sampled with respect to allele frequency. See the wiki for more information.
The --frequencySelectionMode argument is an enumerated type (AF_COMPUTATION_MODE), which can have one of the following values:

KEEP_AF_SPECTRUM
UNIFORM

--ignoreGenotypes / -ignoreGenotypes ( boolean with default value false )

If true, genotypes in the VCF are ignored, AC and AF are taken from the annotations, and no sample selection is made. This argument applies to the frequency selection mode: AC, AF, and AN are taken from the VCF INFO field rather than recalculated. Typically specified for sites-only VCFs that still carry AC/AF/AN information.
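
Reading allele frequencies from annotations instead of genotypes can be sketched as below. This is a hedged illustration, not the tool's code: the function names are hypothetical, and falling back to AC/AN when AF is absent is a choice made for this example only.

```python
def parse_info(info):
    """Parse a VCF INFO column into a dict; flag-only keys map to True."""
    fields = {}
    if info == ".":
        return fields
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def allele_frequencies(info):
    """Return one AF per ALT allele, read from the INFO annotations.

    AF is used when present; otherwise AC/AN is used (an assumption of
    this sketch). Genotypes are never consulted.
    """
    fields = parse_info(info)
    if "AF" in fields:
        return [float(x) for x in fields["AF"].split(",")]
    if "AC" in fields and "AN" in fields and int(fields["AN"]) > 0:
        an = int(fields["AN"])
        return [int(x) / an for x in fields["AC"].split(",")]
    return []
```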

--ignorePolymorphicStatus / -ignorePolymorphicStatus ( boolean with default value false )

If true, will ignore polymorphic status in VCF, and will take VCF record directly without pre-selection. Argument for the frequency selection mode. Allows reference (non-polymorphic) sites to be included in the validation set.

--includeFilteredSites / -ifs ( boolean with default value false )

If true, filtered sites are included in the set from which variants are chosen. Filtered sites (e.g. those whose FILTER is not PASS or .) are then not excluded from consideration for validation.
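
As a small sketch of this rule, assuming the FILTER column is available as a string (the function name and signature are hypothetical, not part of GATK):

```python
def is_candidate(filter_field, include_filtered=False):
    """Return True if a site may be considered for the validation set.

    A site counts as unfiltered when its FILTER column is PASS or ".";
    any other value marks it as filtered.
    """
    return include_filtered or filter_field in ("PASS", ".")
```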

--numValidationSites / -numSites ( required int with default value 0 )

Number of output validation sites. The number of sites in your validation set.

--out / -o ( VariantContextWriter with default value stdout )

File to which variants should be written. The output VCF file.

--sample_expressions / -se ( Set[String] )

Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times. Sample regular expressions to subset the input VCF to, prior to selecting variants. For example, -se 'NA12.*' subsets to all samples with prefix NA12.

--sample_file / -sf ( Set[File] )

File containing a list of samples (one per line) to include. Can be specified multiple times. File containing a list of sample names to subset the input VCF to. Equivalent to specifying the contents of the file separately with -sn.

--sample_name / -sn ( Set[String] with default value [] )

Include genotypes from this sample. Can be specified multiple times. Sample name(s) to subset the input VCF to, prior to selecting variants. -sn A -sn B subsets to samples A and B.
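
The three ways of naming samples (-se, -sf, -sn) can be thought of as contributing to one combined sample set. The sketch below illustrates that idea in Python; the function and parameter names are hypothetical, and requiring a regular expression to match the whole sample name is an assumption of this example, not documented tool behavior.

```python
import re


def select_samples(vcf_samples, names=(), file_lines=(), expressions=()):
    """Return the samples from vcf_samples chosen by name, file, or regex.

    names       -- sample names given individually (as with -sn)
    file_lines  -- lines of a sample file, one name per line (as with -sf)
    expressions -- regular expressions over sample names (as with -se)
    """
    wanted = set(names)
    wanted.update(line.strip() for line in file_lines if line.strip())
    patterns = [re.compile(expr) for expr in expressions]

    selected = []
    for sample in vcf_samples:
        if sample in wanted or any(p.fullmatch(sample) for p in patterns):
            selected.append(sample)
    return selected
```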

--sampleMode / -sampleMode ( SAMPLE_SELECTION_MODE with default value NONE )

Sample selection mode. A mode for selecting sites based on sample-level data. See the wiki documentation for more information.
The --sampleMode argument is an enumerated type (SAMPLE_SELECTION_MODE), which can have one of the following values:

NONE
POLY_BASED_ON_GT
POLY_BASED_ON_GL
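
For intuition, genotype-based selection (POLY_BASED_ON_GT) amounts to asking whether any selected sample carries a non-reference allele. The following is a minimal sketch under the assumption that genotypes are available as VCF GT strings keyed by sample name; the function names are hypothetical and the tool's own logic may differ in detail.

```python
def is_non_reference(gt):
    """True if a GT string such as '0/1' or '1|1' has a non-reference allele.

    Missing alleles ('.') are not counted as non-reference.
    """
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def site_is_polymorphic(genotypes, selected_samples):
    """True if at least one selected sample is non-reference at this site."""
    return any(
        is_non_reference(genotypes[s]) for s in selected_samples if s in genotypes
    )
```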

--samplePNonref / -samplePNonref ( double with default value 0.99 )

GL-based selection mode only: the probability that a site is non-reference in the samples for which to include the site. A P[nonref] threshold for SAMPLE_SELECTION_MODE=POLY_BASED_ON_GL. See the wiki documentation for more information.
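
One way to picture a P[nonref] threshold is sketched below. It assumes PL values are Phred-scaled genotype likelihoods with the homozygous-reference genotype listed first, a flat prior over genotypes, and independence between samples; these assumptions and the function names belong to this example and are not a description of how the tool computes the probability.

```python
def p_nonref_sample(pls):
    """Probability that one sample is non-reference, from its PL values.

    pls[0] is taken to be the PL of the homozygous-reference genotype.
    """
    likelihoods = [10 ** (-pl / 10) for pl in pls]
    return 1.0 - likelihoods[0] / sum(likelihoods)


def p_nonref_site(sample_pls):
    """Probability that at least one sample is non-reference at the site."""
    p_all_ref = 1.0
    for pls in sample_pls:
        p_all_ref *= 1.0 - p_nonref_sample(pls)
    return 1.0 - p_all_ref


def passes_threshold(sample_pls, threshold=0.99):
    """Compare the site-level probability with a samplePNonref-style cutoff."""
    return p_nonref_site(sample_pls) >= threshold
```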

--selectTypeToInclude / -selectType ( List[Type] with default value [] )

Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times. This argument selects particular kinds of variants (e.g. SNP, INDEL) out of a list. If left unspecified, all types are considered.
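
The type filter can be illustrated with a simplified classifier based only on REF and ALT allele strings. This is a rough sketch: the function names are hypothetical, and the classification rules are deliberately simpler than those used by the GATK when it assigns a variant type.

```python
def allele_type(ref, alt):
    """Classify a single ALT allele relative to REF (simplified rules)."""
    if alt.startswith("<"):
        return "SYMBOLIC"
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "INDEL"


def variant_type(ref, alts):
    """Classify a record; records whose ALT alleles disagree are MIXED."""
    alts = [a for a in alts if a != "."]
    if not alts:
        return "NO_VARIATION"
    types = {allele_type(ref, a) for a in alts}
    return types.pop() if len(types) == 1 else "MIXED"


def keep_variant(ref, alts, select_types=()):
    """Keep every record when no type is requested, else only matching ones."""
    return not select_types or variant_type(ref, alts) in select_types
```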

--variant / -V ( required List[RodBinding[VariantContext]] )

Input VCF file, can be specified multiple times. The input VCF file. --variant binds reference-ordered data (ROD). This argument supports ROD files of the following types: BCF2, VCF, VCF3



GATK version 2.5-2-gdb4546e built at 2013/05/01 09:32:36.