SelectVariants

Selects variants from a VCF source.

Category Variant Evaluation and Manipulation Tools

Traversal LocusWalker

PartitionBy LOCUS


Overview

Often, a VCF containing many samples and/or variants needs to be subset to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls, extracting variant or non-variant loci that meet certain requirements, or displaying just a few samples in a browser like IGV). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, e.g. "DP > 1000" (depth of coverage greater than 1000x) or "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented in the Using JEXL expressions section (http://www.broadinstitute.org/gatk/guide/article?id=1255). One can optionally include concordance or discordance tracks for use in selecting overlapping variants.

Input

A variant set to select from.

Output

A selected VCF.

Examples

 Select two samples out of a VCF with many samples:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -sn SAMPLE_A_PARC \
   -sn SAMPLE_B_ACTG

 Select two samples and any sample that matches a regular expression:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -sn SAMPLE_1_PARC \
   -sn SAMPLE_1_ACTG \
   -se 'SAMPLE.+PARC'

 Select any sample that matches a regular expression and sites where the QD annotation is more than 10:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -se 'SAMPLE.+PARC' \
   -select "QD > 10.0"

 Select a sample and exclude non-variant loci and filtered loci:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -sn SAMPLE_1_ACTG \
   -env \
   -ef

 Select a sample and restrict the output vcf to a set of intervals:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -L /path/to/my.interval_list \
   -sn SAMPLE_1_ACTG

 Select all calls missed in my vcf, but present in HapMap (useful to take a look at why these variants weren't called by this dataset):
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant hapmap.vcf \
   --discordance myCalls.vcf \
   -o output.vcf \
   -sn mySample

 Select all calls made by both myCalls and hisCalls (useful to take a look at what is consistent between the two callers):
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant myCalls.vcf \
   --concordance hisCalls.vcf \
   -o output.vcf \
   -sn mySample

 Generating a VCF of all the variants that are mendelian violations:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -ped family.ped \
   -mv -mvq 50 \
   -o violations.vcf

 Creating a set with 50% of the total number of variants in the variant VCF:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -fraction 0.5

 Select only indels from a VCF:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -selectType INDEL

 Select only multi-allelic SNPs and MNPs from a VCF (i.e. SNPs with more than one allele listed in the ALT column):
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -selectType SNP -selectType MNP \
   -restrictAllelesTo MULTIALLELIC

 

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by SelectVariants.

Parallelism options

This tool can be run in multi-threaded mode using this option.

Downsampling settings

This tool applies the following downsampling settings by default.

  • Mode: BY_SAMPLE
  • To coverage: 1,000

Command-line Arguments

Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

SelectVariants specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.

Argument name(s)                                  Default value Summary

Required Inputs
--variant, -V                                     NA            Input VCF file

Optional Inputs
--concordance, -conc                              none          Output variants that were also called in this comparison track
--discordance, -disc                              none          Output variants that were not called in this comparison track
--exclude_sample_file, -xl_sf                     []            File containing a list of samples (one per line) to exclude. Can be specified multiple times
--sample_file, -sf                                NA            File containing a list of samples (one per line) to include. Can be specified multiple times

Optional Outputs
--out, -o                                         stdout        File to which variants should be written

Optional Parameters
--exclude_sample_name, -xl_sn                     []            Exclude genotypes from this sample. Can be specified multiple times
--keepIDs, -IDs                                   NA            Only emit sites whose ID is found in this file (one ID per line)
--maxIndelSize                                    2147483647    Maximum size of indels to select
--mendelianViolationQualThreshold, -mvq           0.0           Minimum genotype QUAL score for each trio member required to accept a site as a violation
--remove_fraction_genotypes, -fractionGenotypes   0.0           Selects a fraction (a number between 0 and 1) of the total genotypes at random from the variant track and sets them to nocall
--restrictAllelesTo                               ALL           Select only variants of a particular allelicity. Valid options are ALL (default), MULTIALLELIC or BIALLELIC
--sample_expressions, -se                         NA            Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times
--sample_name, -sn                                []            Include genotypes from this sample. Can be specified multiple times
--select_expressions, -select                     []            One or more criteria to use when selecting the data
--select_random_fraction, -fraction               0.0           Selects a fraction (a number between 0 and 1) of the total variants at random from the variant track
--selectTypeToInclude, -selectType                []            Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times

Optional Flags
--ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES       false         Allow samples other than those in the VCF to be specified on the command line. These samples will be ignored.
--excludeFiltered, -ef                            false         Don't include filtered loci in the analysis
--excludeNonVariants, -env                        false         Don't include loci found to be non-variant after the subsetting procedure
--keepOriginalAC                                  false         Store the original AC, AF, and AN values in the INFO field after selecting (using keys AC_Orig, AF_Orig, and AN_Orig)
--mendelianViolation, -mv                         false         Output Mendelian violation sites only

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES

Allow samples other than those in the VCF to be specified on the command line. These samples will be ignored.

boolean  false


--concordance / -conc

Output variants that were also called in this comparison track
A site is considered concordant if (1) we are not looking for specific samples and there is a variant called in both the variant and concordance tracks or (2) every sample present in the variant track is present in the concordance track and they have the same genotype call.

--concordance binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

RodBinding[VariantContext]  none
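
The two concordance conditions above can be sketched in Python roughly as follows. This only illustrates the rule as worded here and is not the tool's implementation; the function name is_concordant and the representation of a site (a dict from sample name to genotype string, or None when the site is absent from a track) are assumptions made for the example.

```python
def is_concordant(variant_gts, concordance_gts, samples=None):
    """Hypothetical sketch of the concordance rule described above.

    variant_gts / concordance_gts: dict of sample -> genotype string for one
    site, or None if the site is absent from that track (assumed layout).
    samples: sample names requested on the command line, or None.
    """
    if variant_gts is None or concordance_gts is None:
        return False
    if not samples:
        # Condition (1): no specific samples, site called in both tracks.
        return True
    # Condition (2): each requested sample found in the variant track must
    # be in the concordance track with the same genotype call.
    for sample in samples:
        if sample not in variant_gts:
            continue
        if concordance_gts.get(sample) != variant_gts[sample]:
            return False
    return True
```

With this sketch, a sample called 0/1 in both tracks is concordant, while the same sample called 0/1 in one track and 1/1 in the other is not.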


--discordance / -disc

Output variants that were not called in this comparison track
A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype and either the site isn't present in this track, the sample isn't present in this track, or the sample is called reference in this track.

--discordance binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

RodBinding[VariantContext]  none
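
As an illustration of the discordance rule above, here is a minimal Python sketch. The names has_alt_allele and is_discordant, the dict-based site layout, and the choice to treat a no-call in the comparison track like a reference call are all assumptions for the example, not behavior confirmed by this page.

```python
import re

def has_alt_allele(gt):
    """True if a genotype string such as '0/1' carries a non-reference allele."""
    return any(a not in ("0", ".") for a in re.split(r"[/|]", gt))

def is_discordant(variant_gts, discordance_gts):
    """Hypothetical sketch of the discordance rule described above.

    variant_gts: dict of sample -> genotype string at this site.
    discordance_gts: same layout, or None if the site is absent from the
    discordance track (assumed representation).
    """
    for sample, gt in variant_gts.items():
        if not has_alt_allele(gt):
            continue  # only non-reference genotypes can make a site discordant
        if discordance_gts is None:
            return True  # site is not present in the discordance track
        if sample not in discordance_gts:
            return True  # sample is not present in the discordance track
        if not has_alt_allele(discordance_gts[sample]):
            return True  # sample is called reference there (assumption: no-call too)
    return False
```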


--exclude_sample_file / -xl_sf

File containing a list of samples (one per line) to exclude. Can be specified multiple times
Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded.

Set[File]  []


--exclude_sample_name / -xl_sn

Exclude genotypes from this sample. Can be specified multiple times
Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded.

Set[String]  []
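
The precedence of exclusion over inclusion can be pictured with a small Python sketch. The helper name resolve_samples is hypothetical, and the sketch assumes that all samples in the VCF are candidates when no inclusion list is given.

```python
def resolve_samples(vcf_samples, include=(), exclude=()):
    """Hypothetical helper: exclusion takes precedence over inclusion."""
    # Assumption: with no inclusion list, every sample in the VCF is a candidate.
    selected = set(include) if include else set(vcf_samples)
    selected &= set(vcf_samples)   # ignore names that are not in the VCF
    selected -= set(exclude)       # a sample in both lists is excluded
    return [s for s in vcf_samples if s in selected]  # keep header order

kept = resolve_samples(["A", "B", "C"], include=["A", "B"], exclude=["B"])
```

Here sample B appears in both lists, so only A is kept.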


--excludeFiltered / -ef

Don't include filtered loci in the analysis

boolean  false


--excludeNonVariants / -env

Don't include loci found to be non-variant after the subsetting procedure

boolean  false


--keepIDs / -IDs

Only emit sites whose ID is found in this file (one ID per line)
If provided, we will only include variants whose ID field is present in this list of ids. The matching is exact string matching. The file format is just one ID per line

File
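
A minimal Python sketch of the exact-match behavior described above. The function names and the example IDs are made up for illustration; only the line terminator is stripped, so that matching stays an exact, case-sensitive string comparison.

```python
import io

def load_ids(handle):
    """Read one ID per line; only the line terminator is removed."""
    return {line.rstrip("\r\n") for line in handle if line.rstrip("\r\n")}

def keep_site(record_id, ids):
    """Exact, case-sensitive string match against the record's ID column."""
    return record_id in ids

# Stand-in for an ID file passed with --keepIDs (contents are invented).
ids = load_ids(io.StringIO("rs123\nrs456\n"))
```

Under this sketch "rs123" is kept, while "rs12" and "RS123" are not, since neither is an exact match.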


--keepOriginalAC / -keepOriginalAC

Store the original AC, AF, and AN values in the INFO field after selecting (using keys AC_Orig, AF_Orig, and AN_Orig)

boolean  false


--maxIndelSize

Maximum size of indels to select

int  2147483647  [ [ -∞  ∞ ] ]


--mendelianViolation / -mv

Output Mendelian violation sites only
This activates the Mendelian violation module, which selects all variants that correspond to a Mendelian violation according to the rules implied by the family structure.

Boolean  false
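
For readers unfamiliar with the term, the following Python sketch shows what a Mendelian violation means for a single diploid trio: the child must carry one allele found in the mother and one found in the father. This is a generic illustration, not the tool's module; the function names are hypothetical, no-calls are simply skipped, and the quality threshold set with -mvq and the pedigree file are not modeled.

```python
import re

def alleles(gt):
    """Split a genotype string such as '0/1' or '0|1' into its alleles."""
    return re.split(r"[/|]", gt)

def is_mendelian_violation(mother_gt, father_gt, child_gt):
    """Diploid trio check; sites with any no-call are not reported (assumption)."""
    m, f, c = alleles(mother_gt), alleles(father_gt), alleles(child_gt)
    if "." in m + f + c:
        return False
    a, b = c
    # Consistent if one child allele can come from each parent.
    consistent = (a in m and b in f) or (b in m and a in f)
    return not consistent
```

For example, a 0/1 child of two 0/0 parents is a violation, while a 0/1 child of a 0/1 mother and a 0/0 father is not.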


--mendelianViolationQualThreshold / -mvq

Minimum genotype QUAL score for each trio member required to accept a site as a violation

double  0.0  [ [ -∞  ∞ ] ]


--out / -o

File to which variants should be written

VariantContextWriter  stdout


--remove_fraction_genotypes / -fractionGenotypes

Selects a fraction (a number between 0 and 1) of the total genotypes at random from the variant track and sets them to nocall

double  0.0  [ [ -∞  ∞ ] ]


--restrictAllelesTo / -restrictAllelesTo

Select only variants of a particular allelicity. Valid options are ALL (default), MULTIALLELIC or BIALLELIC
When this argument is used, we can choose to include only multiallelic or biallelic sites, depending on how many alleles are listed in the ALT column of a VCF. For example, a multiallelic record such as: 1 100 . A AAA,AAAAA will be excluded if "-restrictAllelesTo BIALLELIC" is included, because there are two alternate alleles, whereas a record such as: 1 100 . A T will be included in that case, but would be excluded if "-restrictAllelesTo MULTIALLELIC" were used instead.

The --restrictAllelesTo argument is an enumerated type (NumberAlleleRestriction), which can have one of the following values:

ALL
BIALLELIC
MULTIALLELIC

NumberAlleleRestriction  ALL
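
The allelicity check described above amounts to counting the entries in the ALT column. A small Python sketch, using the two example records from the text; the function name is hypothetical, and sites with no alternate allele (ALT of ".") are not handled here.

```python
def passes_allele_restriction(alt_field, restriction="ALL"):
    """alt_field is the raw ALT column of a record, e.g. 'T' or 'AAA,AAAAA'."""
    n_alt = len(alt_field.split(","))
    if restriction == "ALL":
        return True
    if restriction == "BIALLELIC":
        return n_alt == 1   # exactly one alternate allele
    if restriction == "MULTIALLELIC":
        return n_alt > 1    # two or more alternate alleles
    raise ValueError("unknown restriction: " + restriction)
```

With "AAA,AAAAA" the record passes MULTIALLELIC and fails BIALLELIC; with "T" it is the other way round.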


--sample_expressions / -se

Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times

Set[String]
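
To preview which samples an expression such as 'SAMPLE.+PARC' would pick up, something like the following Python sketch can help. It assumes the expression may match anywhere in the sample name (search semantics), which this page does not state, and Java and Python regular expressions differ in some details; the sample names are made up.

```python
import re

def match_samples(vcf_samples, expressions):
    """Return the samples matched by at least one regular expression."""
    patterns = [re.compile(e) for e in expressions]
    # Assumption: a pattern may match anywhere in the name (re.search).
    return [s for s in vcf_samples if any(p.search(s) for p in patterns)]

samples = ["SAMPLE_1_PARC", "SAMPLE_1_ACTG", "SAMPLE_2_PARC", "CONTROL_1"]
matched = match_samples(samples, ["SAMPLE.+PARC"])
```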


--sample_file / -sf

File containing a list of samples (one per line) to include. Can be specified multiple times

Set[File]


--sample_name / -sn

Include genotypes from this sample. Can be specified multiple times

Set[String]  []


--select_expressions / -select

One or more criteria to use when selecting the data
Note that these expressions are evaluated *after* the specified samples are extracted and the INFO field annotations are updated.

ArrayList[String]  []
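
Because expressions are evaluated after sample extraction, an annotation such as AF can differ from the value in the input file. The Python sketch below illustrates this with invented genotypes; recompute_info is a hypothetical helper that only counts the first ALT allele and is not how the tool updates its annotations.

```python
import re

def recompute_info(genotypes):
    """Recompute AN, AC and AF for the first ALT allele from genotype strings."""
    called = [a for gt in genotypes for a in re.split(r"[/|]", gt) if a != "."]
    an = len(called)          # number of called alleles
    ac = called.count("1")    # copies of the first ALT allele
    af = ac / an if an else 0.0
    return {"AN": an, "AC": ac, "AF": af}

# One site, four samples (invented data).
site = {"S1": "1/1", "S2": "0/0", "S3": "0/0", "S4": "0/0"}
before = recompute_info(site.values())              # all four samples
after = recompute_info([site["S1"], site["S2"]])    # after keeping S1 and S2
```

In this example AF is 0.25 across all four samples but 0.5 once only S1 and S2 are kept, so an expression like "AF < 0.3" would be tested against the second value.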


--select_random_fraction / -fraction

Selects a fraction (a number between 0 and 1) of the total variants at random from the variant track
This routine is based on probability, so the final result is not guaranteed to carry the exact fraction. Can be used for large fractions.

double  0.0  [ [ -∞  ∞ ] ]
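
The reason the final fraction is not exact can be seen in a sketch of probability-based selection, where each record is kept independently with the given probability. This is a generic Python illustration under that assumption; the function name and the seed parameter are hypothetical and do not correspond to tool arguments.

```python
import random

def select_random_fraction(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

kept = select_random_fraction(range(10000), 0.5, seed=1)
```

The number of records kept will be close to, but usually not exactly, half of the input.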


--selectTypeToInclude / -selectType

Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times
This argument selects particular kinds of variants out of a list. If left empty, there is no type selection and all variant types are considered for other selection criteria. When specified one or more times, only variants of the specified type(s) are selected.

List[Type]  []


--variant / -V

Input VCF file
Variants from this VCF file are used by this tool as input. The file must at least contain the standard VCF header lines, but can be empty (i.e., no variants are contained in the file).

--variant binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

R RodBinding[VariantContext]



GATK version 3.1-1-g07a4bf8 built at 2014/03/18 07:00:36. GTD: NA