# ValidationSiteSelector

Randomly selects VCF records according to specified options.

## Overview

ValidationSiteSelectorWalker is intended for experiments in which data are sampled randomly from a set of variants, for example to choose sites for a follow-up validation study. Sites are selected randomly, but within certain restrictions. There are two main sources of restrictions:

a) Sample restrictions. A user can specify a set of samples, and only sites that are polymorphic within that sample subset are considered. Sample restrictions can be given as a set of individual sample names, a text file (one sample name per line), or a regular expression. A user can additionally specify whether samples are evaluated based on their genotypes (a sample with a non-reference genotype is polymorphic at that variant, so the variant is considered for inclusion in the set) or based on their PLs.

b) Sampling method based on allele frequency. Two sampling methods are currently supported:

1. Uniform sampling samples uniformly from the variants that are polymorphic in the selected samples.
2. Sampling based on the allele frequency (AF) spectrum ensures that the output sites have the same AF distribution as the input set.

The user can additionally restrict output to a particular type of variant (SNP, indel, etc.).
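As a rough illustration of the uniform method described above, the following Python sketch draws a fixed number of sites without replacement from a list of candidate sites. It is not the tool's implementation: the function name `select_uniform`, the `(chrom, pos)` tuple representation, and the `seed` parameter are assumptions made for this example only.

```python
import random


def select_uniform(sites, num_sites, seed=None):
    """Pick up to num_sites entries uniformly at random, without replacement.

    Hypothetical helper for illustration; `sites` is any list of candidate
    sites that already passed the sample restrictions.
    """
    rng = random.Random(seed)
    k = min(num_sites, len(sites))
    # Sample positions in the list, then sort so output keeps input order.
    picked = sorted(rng.sample(range(len(sites)), k))
    return [sites[i] for i in picked]


# Ten made-up candidate sites on chromosome 1.
candidates = [("1", pos) for pos in range(100, 110)]
validation_set = select_uniform(candidates, 3, seed=7)
```

Sampling indices rather than records keeps the selected sites in their original (coordinate) order, which is convenient when the result is written back out as a VCF.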

### Input

One or more variant sets to choose from.

### Output

A sites-only VCF with the desired number of randomly selected sites.

### Examples

java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T ValidationSiteSelectorWalker \
--variant input1.vcf \
--variant input2.vcf \
-sn NA12878 \
-o output.vcf \
--numValidationSites 200 \
-sampleMode POLY_BASED_ON_GT \
-freqMode KEEP_AF_SPECTRUM

java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T ValidationSiteSelectorWalker \
--variant:foo input1.vcf \
--variant:bar input2.vcf \
--numValidationSites 200 \
-sf samples.txt \
-o output.vcf \
-sampleMode POLY_BASED_ON_GT \
-freqMode UNIFORM \
-selectType INDEL


These Read Filters are automatically applied to the data by the Engine before processing by ValidationSiteSelector.

### Downsampling settings

This tool applies the following downsampling settings by default.

• Mode: BY_SAMPLE
• To coverage: 1,000

## Command-line Arguments

### Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

### ValidationSiteSelector specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

| Argument name(s) | Default value | Summary |
|---|---|---|
| Required Inputs | | |
| --variant, -V | NA | Input VCF file, can be specified multiple times |
| Required Parameters | | |
| --numValidationSites, -numSites | 0 | Number of output validation sites |
| Optional Inputs | | |
| --sample_file, -sf | NA | File containing a list of samples (one per line) to include. Can be specified multiple times |
| Optional Outputs | | |
| --out, -o | NA | File to which variants should be written |
| Optional Parameters | | |
| --frequencySelectionMode, -freqMode | NA | Allele Frequency selection mode |
| --sample_expressions, -se | NA | Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times |
| --sample_name, -sn | NA | Include genotypes from this sample. Can be specified multiple times |
| --sampleMode | NA | Sample selection mode |
| --samplePNonref | 0.99 | GL-based selection mode only: the probability that a site is non-reference in the samples for which to include the site |
| --selectTypeToInclude, -selectType | NA | Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times |
| Optional Flags | | |
| --ignoreGenotypes | NA | If true, will ignore genotypes in VCF, will take AC,AF from annotations and will make no sample selection |
| --ignorePolymorphicStatus | NA | If true, will ignore polymorphic status in VCF, and will take VCF record directly without pre-selection |
| --includeFilteredSites, -ifs | NA | If true, will include filtered sites in set to choose variants from |

### Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

### --frequencySelectionMode / -freqMode

Allele Frequency selection mode
This argument sets the allele frequency selection mode. See the wiki for more information.

The --frequencySelectionMode argument is an enumerated type (AF_COMPUTATION_MODE), which can have one of the following values:

KEEP_AF_SPECTRUM
UNIFORM

AF_COMPUTATION_MODE
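To make the difference between the two modes concrete, here is a minimal Python sketch of one way a KEEP_AF_SPECTRUM-style selection could work: sites are binned by allele frequency, and each bin contributes a share of the output proportional to its share of the input. The binning scheme, the `num_bins` parameter, the rounding rule, and the function name are all assumptions for illustration; the tool's actual procedure is not described on this page.

```python
import random
from collections import defaultdict


def select_keep_af_spectrum(sites, num_sites, num_bins=10, seed=None):
    """Sample sites so each AF bin keeps its proportion of the input.

    Hypothetical sketch. `sites` is a list of (site_id, allele_frequency)
    tuples with 0 <= allele_frequency <= 1.
    """
    rng = random.Random(seed)
    total = len(sites)
    k = min(num_sites, total)
    if k == 0:
        return []

    # Group sites into equal-width AF bins; AF == 1.0 falls in the last bin.
    bins = defaultdict(list)
    for site in sites:
        index = min(int(site[1] * num_bins), num_bins - 1)
        bins[index].append(site)

    # Fractional share per bin, rounded with the largest-remainder rule
    # so the quotas add up to exactly k.
    shares = {b: k * len(members) / total for b, members in bins.items()}
    quota = {b: int(share) for b, share in shares.items()}
    leftover = k - sum(quota.values())
    by_remainder = sorted(shares, key=lambda b: shares[b] - quota[b], reverse=True)
    for b in by_remainder[:leftover]:
        quota[b] += 1

    # Uniform sampling inside each bin.
    selected = []
    for b in sorted(bins):
        selected.extend(rng.sample(bins[b], quota[b]))
    return selected
```

With UNIFORM, by contrast, every candidate site would have the same chance of selection regardless of its bin, so the output spectrum matches the input only on average.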

### --ignoreGenotypes / -ignoreGenotypes

If true, will ignore genotypes in VCF, will take AC,AF from annotations and will make no sample selection
Argument for the frequency selection mode. AC, AF, and AN are taken from the VCF INFO field rather than recalculated. Typically specified for sites-only VCFs that still have AC/AF/AN information.

boolean
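For readers unfamiliar with these annotations, the sketch below shows how AC, AF, and AN can be read from the INFO column of a VCF record. The function name `parse_info_frequencies` is hypothetical, and the code is a simplified illustration rather than the parser the tool uses; it relies only on the VCF convention that AC and AF hold one value per alternate allele and AN holds a single integer.

```python
def parse_info_frequencies(info):
    """Extract (AC, AF, AN) from a VCF INFO string.

    Hypothetical helper. Returns lists for AC and AF (one value per ALT
    allele), an int for AN, and None for any annotation that is absent.
    """
    fields = {}
    for entry in info.split(";"):
        if "=" in entry:  # flag entries such as "DB" carry no value
            key, value = entry.split("=", 1)
            fields[key] = value

    ac = [int(x) for x in fields["AC"].split(",")] if "AC" in fields else None
    af = [float(x) for x in fields["AF"].split(",")] if "AF" in fields else None
    an = int(fields["AN"]) if "AN" in fields else None
    return ac, af, an


ac, af, an = parse_info_frequencies("AC=2,1;AF=0.5,0.25;AN=4;DB")
```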

### --ignorePolymorphicStatus / -ignorePolymorphicStatus

If true, will ignore polymorphic status in VCF, and will take VCF record directly without pre-selection
Argument for the frequency selection mode. Allows reference (non-polymorphic) sites to be included in the validation set.

boolean

### --includeFilteredSites / -ifs

If true, will include filtered sites in set to choose variants from
Do not exclude filtered sites (e.g. not PASS or .) from consideration for validation

boolean
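The effect of this flag can be pictured with a small Python sketch. It treats a record as unfiltered when its FILTER column is PASS or `.`, as stated above, and drops other records unless filtered sites are explicitly included. The dictionary-based record layout and both function names are assumptions for this example.

```python
def is_filtered(filter_field):
    """True if the FILTER column marks the record as failing a filter."""
    return filter_field not in ("PASS", ".")


def candidate_sites(records, include_filtered=False):
    """Keep unfiltered records, or all records if include_filtered is set.

    Hypothetical helper; each record is a dict with a "FILTER" key.
    """
    return [r for r in records if include_filtered or not is_filtered(r["FILTER"])]


records = [
    {"POS": 100, "FILTER": "PASS"},
    {"POS": 200, "FILTER": "LowQual"},
    {"POS": 300, "FILTER": "."},
]
default_pool = candidate_sites(records)
full_pool = candidate_sites(records, include_filtered=True)
```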

### --numValidationSites / -numSites

Number of output validation sites
The number of sites in your validation set

int (required), range [-∞, ∞]

### --out / -o

File to which variants should be written
The output VCF file

VariantContextWriter

### --sample_expressions / -se

Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times
Sample regular expressions to subset the input VCF to, prior to selecting variants. For example, -se 'NA12.*' subsets to all samples with prefix NA12.

Set[String]
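The following Python sketch illustrates regular-expression sample selection on a list of sample names. It is an assumption-laden illustration: the function name `match_samples` is hypothetical, and the sketch anchors each pattern to the whole sample name with `fullmatch`, which may differ from how the tool applies its expressions.

```python
import re


def match_samples(all_samples, expressions):
    """Return samples whose full name matches any of the expressions.

    Hypothetical helper; anchoring with fullmatch is a choice made for
    this sketch, not documented behaviour of the tool.
    """
    patterns = [re.compile(expression) for expression in expressions]
    return [s for s in all_samples if any(p.fullmatch(s) for p in patterns)]


samples = ["NA12878", "NA12891", "NA19240", "HG00096"]
na12_family = match_samples(samples, ["NA12.*"])
```

Note that a shell-style glob such as `NA12*` means something different as a regular expression (the `*` applies only to the preceding `2`), which is why the pattern here is written as `NA12.*`.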

### --sample_file / -sf

File containing a list of samples (one per line) to include. Can be specified multiple times
File containing a list of sample names to subset the input VCF to. Equivalent to specifying the contents of the file separately with -sn.

Set[File]
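As a small illustration of the one-name-per-line format, this Python sketch reads sample names from any iterable of lines, such as an open file. The function name is hypothetical, and skipping blank lines and duplicate names is an assumption of the sketch, not documented behaviour.

```python
import io


def read_sample_names(lines):
    """Collect sample names, one per line, preserving first-seen order.

    Hypothetical helper; blank lines and repeated names are ignored.
    """
    names = []
    for line in lines:
        name = line.strip()
        if name and name not in names:
            names.append(name)
    return names


# An in-memory stand-in for a file such as samples.txt.
sample_file = io.StringIO("NA12878\nNA12891\n\nNA12878\n")
names = read_sample_names(sample_file)
```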

### --sample_name / -sn

Include genotypes from this sample. Can be specified multiple times
Sample name(s) to subset the input VCF to, prior to selecting variants. -sn A -sn B subsets to samples A and B.

Set[String]

### --sampleMode / -sampleMode

Sample selection mode
A mode for selecting sites based on sample-level data. See the wiki documentation for more information.

The --sampleMode argument is an enumerated type (SAMPLE_SELECTION_MODE), which can have one of the following values:

NONE
POLY_BASED_ON_GT
POLY_BASED_ON_GL

SAMPLE_SELECTION_MODE
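A genotype-based check in the spirit of POLY_BASED_ON_GT can be sketched as follows: a site counts as polymorphic if at least one selected sample carries a non-reference allele in its GT field. The function names and the dictionary of per-sample genotype strings are assumptions for illustration, and treating missing alleles (`.`) as uninformative is a choice of this sketch.

```python
import re


def is_nonref_genotype(gt):
    """True if a GT string such as "0/1" or "1|1" has a non-reference allele."""
    alleles = re.split(r"[/|]", gt)
    # "0" is the reference allele and "." is a missing call.
    return any(allele not in ("0", ".") for allele in alleles)


def polymorphic_by_gt(genotypes, selected_samples):
    """True if any selected sample is non-reference at this site.

    Hypothetical helper; `genotypes` maps sample name to its GT string.
    """
    return any(
        is_nonref_genotype(genotypes[sample])
        for sample in selected_samples
        if sample in genotypes
    )


site_genotypes = {"NA12878": "0/1", "NA12891": "0/0", "NA12892": "./."}
keep_site = polymorphic_by_gt(site_genotypes, ["NA12878"])
```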

### --samplePNonref / -samplePNonref

GL-based selection mode only: the probability that a site is non-reference in the samples for which to include the site
A P[nonref] threshold for SAMPLE_SELECTION_MODE=POLY_BASED_ON_GL. See the wiki documentation for more information.

double, range [-∞, ∞]
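One plausible reading of this threshold can be sketched in Python: convert each sample's PL values into normalized genotype probabilities, take one minus the hom-ref probability as that sample's P[nonref], and combine samples as if they were independent. This page does not specify the tool's actual calculation, so the flat prior, the independence assumption, and all function names below are assumptions of the sketch. The only facts relied on are that PLs are Phred-scaled likelihoods and that the first PL corresponds to the homozygous-reference genotype.

```python
def p_nonref_sample(pls):
    """P(sample is not hom-ref) from Phred-scaled PLs, assuming a flat prior."""
    likelihoods = [10 ** (-pl / 10) for pl in pls]
    return 1.0 - likelihoods[0] / sum(likelihoods)


def p_nonref_site(per_sample_pls):
    """P(at least one sample is non-reference), treating samples as independent."""
    p_all_hom_ref = 1.0
    for pls in per_sample_pls:
        p_all_hom_ref *= 1.0 - p_nonref_sample(pls)
    return 1.0 - p_all_hom_ref


def passes_threshold(per_sample_pls, threshold=0.99):
    """Compare the site-level probability against a samplePNonref-style cutoff."""
    return p_nonref_site(per_sample_pls) >= threshold


confident_het = [[60, 0, 30]]   # PLs strongly favour the heterozygous genotype
confident_ref = [[0, 30, 60]]   # PLs strongly favour hom-ref
```

Under these assumptions, `confident_het` clears the default 0.99 threshold and `confident_ref` does not.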

### --selectTypeToInclude / -selectType

Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times
This argument selects particular kinds of variants (e.g. SNP, INDEL) out of a list. If left unspecified, all types are considered.

List[Type]
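To show how type-based selection might behave, the Python sketch below assigns one of the listed type names to a record from its REF and ALT alleles and then applies an optional type filter. The classification rules are deliberately simplified and are an assumption of this example; the tool's own type assignment may differ in edge cases. All function names are hypothetical.

```python
def allele_type(ref, alt):
    """Classify a single ALT allele against REF (simplified rules)."""
    if alt.startswith("<") and alt.endswith(">"):
        return "SYMBOLIC"
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "INDEL"


def variant_type(ref, alts):
    """Classify a record; differing ALT types are reported as MIXED."""
    real_alts = [alt for alt in alts if alt != "."]
    if not real_alts:
        return "NO_VARIATION"
    types = {allele_type(ref, alt) for alt in real_alts}
    return types.pop() if len(types) == 1 else "MIXED"


def keep_record(ref, alts, select_types=()):
    """With no types requested, every record is kept, as described above."""
    return not select_types or variant_type(ref, alts) in select_types


indel_only = keep_record("AT", ["A"], select_types=("INDEL",))
```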

### --variant / -V

Input VCF file, can be specified multiple times
The input VCF file

--variant binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

List[RodBinding[VariantContext]] (required)