SelectVariants

Selects variants from a VCF source.

Category Variant Evaluation and Manipulation Tools

Traversal LocusWalker

PartitionBy LOCUS


Overview

Often, a VCF containing many samples and/or variants needs to be subset to facilitate certain analyses (e.g. comparing cases vs. controls, extracting variant or non-variant loci that meet certain requirements, or displaying just a few samples in a browser like IGV). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, e.g. "DP > 1000" (depth of coverage greater than 1000x) or "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented in the Using JEXL expressions section (http://www.broadinstitute.org/gatk/guide/article?id=1255). One can optionally include concordance or discordance tracks for use in selecting overlapping variants.

Input

A variant set to select from.

Output

A selected VCF.

Examples

 Select two samples out of a VCF with many samples:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -sn SAMPLE_A_PARC \
   -sn SAMPLE_B_ACTG

 Select two samples and any sample that matches a regular expression:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -sn SAMPLE_1_PARC \
   -sn SAMPLE_1_ACTG \
   -se 'SAMPLE.+PARC'

 Select any sample that matches a regular expression and sites where the QD annotation is more than 10:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -se 'SAMPLE.+PARC' \
   -select "QD > 10.0"

 Select a sample and exclude non-variant loci and filtered loci:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -sn SAMPLE_1_ACTG \
   -env \
   -ef

 Select a sample and restrict the output vcf to a set of intervals:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -L /path/to/my.interval_list \
   -sn SAMPLE_1_ACTG

 Select all calls missed in my vcf, but present in HapMap (useful to take a look at why these variants weren't called by this dataset):
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant hapmap.vcf \
   --discordance myCalls.vcf \
   -o output.vcf \
   -sn mySample

 Select all calls made by both myCalls and hisCalls (useful to take a look at what is consistent between the two callers):
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant myCalls.vcf \
   --concordance hisCalls.vcf \
   -o output.vcf \
   -sn mySample

 Generating a VCF of all the variants that are mendelian violations:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -ped family.ped \
   -mv -mvq 50 \
   -o violations.vcf

 Creating a set with 50% of the total number of variants in the variant VCF:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -fraction 0.5

 Select only indels from a VCF:
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -selectType INDEL

 Select only multi-allelic SNPs and MNPs from a VCF (i.e. SNPs with more than one allele listed in the ALT column):
 java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -selectType SNP -selectType MNP \
   -restrictAllelesTo MULTIALLELIC

 

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by SelectVariants.

Parallelism options

This tool can be run in multi-threaded mode using this option.


Command-line Arguments

Inherited arguments

The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).

SelectVariants specific arguments

This table summarizes the command-line arguments that are specific to this tool. For details, see the list below the table.

Name | Type | Default value | Summary
Required
--variant | RodBinding[VariantContext] | NA | Input VCF file
Optional
--ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES | boolean | false | Allow samples other than those in the VCF to be specified on the command line. These samples will be ignored.
--concordance | RodBinding[VariantContext] | none | Output variants that were also called in this comparison track
--discordance | RodBinding[VariantContext] | none | Output variants that were not called in this comparison track
--exclude_sample_file | Set[File] | [] | File containing a list of samples (one per line) to exclude. Can be specified multiple times
--exclude_sample_name | Set[String] | [] | Exclude genotypes from this sample. Can be specified multiple times
--excludeFiltered | boolean | false | Don't include filtered loci in the analysis
--excludeNonVariants | boolean | false | Don't include loci found to be non-variant after the subsetting procedure
--keepIDs | File | NA | Only emit sites whose ID is found in this file (one ID per line)
--keepOriginalAC | boolean | false | Store the original AC, AF, and AN values in the INFO field after selecting (using keys AC_Orig, AF_Orig, and AN_Orig)
--maxIndelSize | int | 2147483647 | Maximum indel size to select
--mendelianViolation | Boolean | false | Output mendelian violation sites only
--mendelianViolationQualThreshold | double | 0.0 | Minimum genotype QUAL score for each trio member required to accept a site as a violation
--out | VariantContextWriter | stdout | File to which variants should be written
--remove_fraction_genotypes | double | 0.0 | Selects a fraction (a number between 0 and 1) of the total genotypes at random from the variant track and sets them to nocall
--restrictAllelesTo | NumberAlleleRestriction | ALL | Select only variants of a particular allelicity. Valid options are ALL (default), MULTIALLELIC or BIALLELIC
--sample_expressions | Set[String] | NA | Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times
--sample_file | Set[File] | NA | File containing a list of samples (one per line) to include. Can be specified multiple times
--sample_name | Set[String] | [] | Include genotypes from this sample. Can be specified multiple times
--select_expressions | ArrayList[String] | [] | One or more criteria to use when selecting the data
--select_random_fraction | double | 0.0 | Selects a fraction (a number between 0 and 1) of the total variants at random from the variant track
--selectTypeToInclude | List[Type] | [] | Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

--ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES ( boolean with default value false )

Allow samples other than those in the VCF to be specified on the command line. These samples will be ignored.

--concordance / -conc ( RodBinding[VariantContext] with default value none )

Output variants that were also called in this comparison track. A site is considered concordant if (1) we are not looking for specific samples and there is a variant called in both the variant and concordance tracks or (2) every sample present in the variant track is present in the concordance track and they have the same genotype call. --concordance binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3.

--discordance / -disc ( RodBinding[VariantContext] with default value none )

Output variants that were not called in this comparison track. A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype and either the site isn't present in this track, the sample isn't present in this track, or the sample is called reference in this track. --discordance binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3.
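
The site-level part of the two comparison modes can be pictured with plain shell tools. The sketch below is only an illustration, not the tool's implementation: it works on hypothetical two-column site lists (CHROM and POS, tab-separated) instead of real VCF tracks, and it ignores the per-sample genotype comparisons described above.

```shell
# Hypothetical site lists (CHROM<TAB>POS) standing in for the two tracks
printf '1\t100\n1\t200\n1\t300\n' > variant_sites.tsv
printf '1\t200\n1\t300\n1\t400\n' > comp_sites.tsv

# Concordant sites: present in both the variant and the comparison track
awk -F'\t' 'NR==FNR { comp[$1 FS $2]; next } (($1 FS $2) in comp)' \
  comp_sites.tsv variant_sites.tsv > concordant.tsv

# Discordant sites: present in the variant track, absent from the comparison track
awk -F'\t' 'NR==FNR { comp[$1 FS $2]; next } !(($1 FS $2) in comp)' \
  comp_sites.tsv variant_sites.tsv > discordant.tsv

cat concordant.tsv
cat discordant.tsv
```

In this toy input, positions 200 and 300 end up in concordant.tsv and position 100 in discordant.tsv; position 400 exists only in the comparison track and is never emitted.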

--exclude_sample_file / -xl_sf ( Set[File] with default value [] )

File containing a list of samples (one per line) to exclude. Can be specified multiple times. Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded.

--exclude_sample_name / -xl_sn ( Set[String] with default value [] )

Exclude genotypes from this sample. Can be specified multiple times. Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded.
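
The precedence rule amounts to a set difference: the final sample set is every included sample that is not also excluded. A minimal sketch with hypothetical sample names and file names (an illustration of the rule, not the tool's code):

```shell
# Hypothetical include and exclude lists, one sample per line
printf '%s\n' SAMPLE_A SAMPLE_B SAMPLE_C > include.list
printf '%s\n' SAMPLE_B > exclude.list

# Keep every included sample that is not also excluded:
# -v inverts the match, -x matches whole lines, -F treats patterns as fixed strings
grep -vxFf exclude.list include.list > final.list
cat final.list
```

SAMPLE_B appears in both lists and is dropped. Files in this one-sample-per-line format are what -sf and -xl_sf expect.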

--excludeFiltered / -ef ( boolean with default value false )

Don't include filtered loci in the analysis.

--excludeNonVariants / -env ( boolean with default value false )

Don't include loci found to be non-variant after the subsetting procedure.

--keepIDs / -IDs ( File )

Only emit sites whose ID is found in this file (one ID per line). If provided, only variants whose ID field is present in this list of IDs are included. Matching is exact string matching.
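
To make the exact-match behavior concrete, here is a rough sketch using awk on a hypothetical tab-separated site list (CHROM, POS, ID, REF, ALT). The IDs and file names are made up, and the snippet mimics only the string comparison, not the tool itself.

```shell
# Hypothetical ID list in the expected format: one ID per line
printf 'rs123\nrs789\n' > keep_ids.list

# Hypothetical records: CHROM, POS, ID, REF, ALT
printf '1\t100\trs123\tA\tT\n1\t200\trs1234\tC\tG\n1\t300\t.\tG\tA\n' > sites.tsv

# Load the IDs from the first file, then print records whose ID column is in the list
awk -F'\t' 'NR==FNR { keep[$1]; next } ($3 in keep)' keep_ids.list sites.tsv > kept.tsv
cat kept.tsv
```

Only the rs123 record is kept: rs1234 merely contains a listed ID as a prefix, and the record with ID "." has no match. A list in this format would be passed to the tool with -IDs keep_ids.list.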

--keepOriginalAC / -keepOriginalAC ( boolean with default value false )

Store the original AC, AF, and AN values in the INFO field after selecting (using keys AC_Orig, AF_Orig, and AN_Orig).

--maxIndelSize ( int with default value 2147483647 )

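Maximum indel size to select. Judging by the argument name and its default, indels longer than this value are excluded; the default (2147483647, the largest int value) imposes no practical limit.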

--mendelianViolation / -mv ( Boolean with default value false )

Output mendelian violation sites only. This activates the mendelian violation module, which selects all variants that correspond to a mendelian violation following the rules given by the family structure.
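
For a single trio, the basic check behind a mendelian violation can be sketched as follows. This is an illustration on hypothetical unphased diploid genotypes (columns: mother, father, child), not the tool's implementation; it ignores the -mvq quality threshold and does not handle no-call genotypes.

```shell
# Hypothetical trio genotypes: mother, father, child
printf '0/0 0/0 0/1\n0/1 0/0 0/1\n1/1 0/0 0/1\n0/0 1/1 1/1\n' > trio.txt

awk '
# has(gt, a): true if genotype gt (e.g. "0/1") carries allele a
function has(gt, a,   p) { split(gt, p, "/"); return (p[1] == a || p[2] == a) }
{
  split($3, c, "/")
  # The child must receive one allele from the mother and the other from the father
  ok = (has($1, c[1]) && has($2, c[2])) || (has($1, c[2]) && has($2, c[1]))
  print $0, (ok ? "consistent" : "violation")
}' trio.txt > trio_checked.txt
cat trio_checked.txt
```

The first and last rows are flagged as violations: in each, one of the child's alleles cannot have come from the parent that would have to supply it.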

-mvq / --mendelianViolationQualThreshold ( double with default value 0.0 )

Minimum genotype QUAL score for each trio member required to accept a site as a violation.

--out / -o ( VariantContextWriter with default value stdout )

File to which variants should be written.

--remove_fraction_genotypes / -fractionGenotypes ( double with default value 0.0 )

Selects a fraction (a number between 0 and 1) of the total genotypes at random from the variant track and sets them to nocall.

--restrictAllelesTo / -restrictAllelesTo ( NumberAlleleRestriction with default value ALL )

Select only variants of a particular allelicity. Valid options are ALL (default), MULTIALLELIC or BIALLELIC. When this argument is used, we can choose to include only multiallelic or biallelic sites, depending on how many alleles are listed in the ALT column of a vcf. For example, a multiallelic record such as: 1 100 . A AAA,AAAAA will be excluded if "-restrictAllelesTo BIALLELIC" is included, because there are two alternate alleles, whereas a record such as: 1 100 . A T will be included in that case, but would be excluded if "-restrictAllelesTo MULTIALLELIC" is included.
The --restrictAllelesTo argument is an enumerated type (NumberAlleleRestriction), which can have one of the following values:

ALL
BIALLELIC
MULTIALLELIC
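
The allelicity of a record follows from the number of comma-separated entries in its ALT column. The sketch below classifies the two example records quoted above with awk. It is a simplified illustration rather than the tool's logic: the file name is hypothetical, and a record whose ALT is "." would be counted as having one allele.

```shell
# The two example records: CHROM, POS, ID, REF, ALT
printf '1\t100\t.\tA\tAAA,AAAAA\n1\t100\t.\tA\tT\n' > records.tsv

# split() returns the number of comma-separated ALT alleles
awk -F'\t' '{ n = split($5, alts, ","); print $5, (n > 1 ? "MULTIALLELIC" : "BIALLELIC") }' \
  records.tsv > allelicity.txt
cat allelicity.txt
```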

--sample_expressions / -se ( Set[String] )

Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times.

--sample_file / -sf ( Set[File] )

File containing a list of samples (one per line) to include. Can be specified multiple times.

--sample_name / -sn ( Set[String] with default value [] )

Include genotypes from this sample. Can be specified multiple times.

--select_expressions / -select ( ArrayList[String] with default value [] )

One or more criteria to use when selecting the data. Note that these expressions are evaluated *after* the specified samples are extracted and the INFO field annotations are updated.

--select_random_fraction / -fraction ( double with default value 0.0 )

Selects a fraction (a number between 0 and 1) of the total variants at random from the variant track. This routine is based on probability, so the final result is not guaranteed to carry the exact fraction. Can be used for large fractions.
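
A short simulation shows why the final count is not exact. It assumes each record is kept independently with probability equal to the requested fraction, which is one reading of "based on probability" and is not confirmed by this page; the seed and record count are arbitrary.

```shell
# Keep each of 1000 hypothetical records with probability 0.5 and count the survivors
kept=$(seq 1 1000 | awk 'BEGIN { srand(42) } rand() < 0.5 { n++ } END { print n + 0 }')
echo "kept $kept of 1000 records"
```

The count lands near 500 but rarely on it, and the exact value depends on the awk implementation and the seed.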

--selectTypeToInclude / -selectType ( List[Type] with default value [] )

Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times. This argument selects particular kinds of variants out of a list. If left empty, there is no type selection and all variant types are considered for other selection criteria. When specified one or more times, only variants of the specified types are selected.

--variant / -V ( required RodBinding[VariantContext] )

Input VCF file. Variants from this VCF file are used by this tool as input. The file must at least contain the standard VCF header lines, but can be empty (i.e., no variants are contained in the file). --variant binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3.



GATK version 2.5-2-gdb4546e built at 2013/05/01 09:32:36.