Selects variants from a VCF source.
Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls, extracting variant or non-variant loci that meet certain requirements, or displaying just a few samples in a browser like IGV). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, e.g. "DP > 1000" (depth of coverage greater than 1000x) or "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented in the Using JEXL expressions section (http://www.broadinstitute.org/gatk/guide/article?id=1255). One can optionally include concordance or discordance tracks for use in selecting overlapping variants.
A variant set to select from.
A selected VCF.
Select two samples out of a VCF with many samples:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant input.vcf \
      -o output.vcf \
      -sn SAMPLE_A_PARC \
      -sn SAMPLE_B_ACTG

Select two samples and any sample that matches a regular expression:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant input.vcf \
      -o output.vcf \
      -sn SAMPLE_1_PARC \
      -sn SAMPLE_1_ACTG \
      -se 'SAMPLE.+PARC'

Select any sample that matches a regular expression and sites where the QD annotation is more than 10:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant input.vcf \
      -o output.vcf \
      -se 'SAMPLE.+PARC' \
      -select "QD > 10.0"

Select a sample and exclude non-variant loci and filtered loci:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant input.vcf \
      -o output.vcf \
      -sn SAMPLE_1_ACTG \
      -env \
      -ef

Select a sample and restrict the output VCF to a set of intervals:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant input.vcf \
      -o output.vcf \
      -L /path/to/my.interval_list \
      -sn SAMPLE_1_ACTG

Select all calls missed in my VCF but present in HapMap (useful for investigating why these variants weren't called in this dataset):

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant hapmap.vcf \
      --discordance myCalls.vcf \
      -o output.vcf \
      -sn mySample

Select all calls made by both myCalls and hisCalls (useful for seeing what is consistent between the two callers):

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant myCalls.vcf \
      --concordance hisCalls.vcf \
      -o output.vcf \
      -sn mySample

Generate a VCF of all the variants that are Mendelian violations:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant input.vcf \
      -ped family.ped \
      -mvq 50 \
      -o violations.vcf

Create a set with 50% of the total number of variants in the variant VCF:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant input.vcf \
      -o output.vcf \
      -fraction 0.5

Select only indels from a VCF:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant input.vcf \
      -o output.vcf \
      -selectType INDEL

Select only multi-allelic SNPs and MNPs from a VCF (i.e. SNPs with more than one allele listed in the ALT column):

    java -Xmx2g -jar GenomeAnalysisTK.jar \
      -R ref.fasta \
      -T SelectVariants \
      --variant input.vcf \
      -o output.vcf \
      -selectType SNP \
      -selectType MNP \
      -restrictAllelesTo MULTIALLELIC
These Read Filters are automatically applied to the data by the Engine before processing by SelectVariants.
This tool can be run in multi-threaded mode using this option.
The arguments described in the entries below can be supplied to this tool to modify its behavior. For example, the -L argument directs the GATK engine to restrict processing to specific genomic intervals (this is an Engine capability and is therefore available to all GATK walkers).
This table summarizes the command-line arguments that are specific to this tool. For details, see the detailed list below the table.
| Name | Type | Default value | Summary |
|---|---|---|---|
| Required | |||
| --variant | RodBinding[VariantContext] | NA | Input VCF file |
| Optional | |||
| --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES | boolean | false | Allow samples other than those in the VCF to be specified on the command line. These samples will be ignored. |
| --concordance | RodBinding[VariantContext] | none | Output variants that were also called in this comparison track |
| --discordance | RodBinding[VariantContext] | none | Output variants that were not called in this comparison track |
| --exclude_sample_file | Set[File] | [] | File containing a list of samples (one per line) to exclude. Can be specified multiple times |
| --exclude_sample_name | Set[String] | [] | Exclude genotypes from this sample. Can be specified multiple times |
| --excludeFiltered | boolean | false | Don't include filtered loci in the analysis |
| --excludeNonVariants | boolean | false | Don't include loci found to be non-variant after the subsetting procedure |
| --keepIDs | File | NA | Only emit sites whose ID is found in this file (one ID per line) |
| --keepOriginalAC | boolean | false | Store the original AC, AF, and AN values in the INFO field after selecting (using keys AC_Orig, AF_Orig, and AN_Orig) |
| --maxIndelSize | int | 2147483647 | Maximum indel size to select |
| --mendelianViolation | Boolean | false | Output Mendelian violation sites only |
| -mvq | double | 0.0 | Minimum genotype QUAL score for each trio member required to accept a site as a violation |
| --out | VariantContextWriter | stdout | File to which variants should be written |
| --remove_fraction_genotypes | double | 0.0 | Selects a fraction (a number between 0 and 1) of the total genotypes at random from the variant track and sets them to nocall |
| --restrictAllelesTo | NumberAlleleRestriction | ALL | Select only variants of a particular allelicity. Valid options are ALL (default), MULTIALLELIC or BIALLELIC |
| --sample_expressions | Set[String] | NA | Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times |
| --sample_file | Set[File] | NA | File containing a list of samples (one per line) to include. Can be specified multiple times |
| --sample_name | Set[String] | [] | Include genotypes from this sample. Can be specified multiple times |
| --select_expressions | ArrayList[String] | [] | One or more criteria to use when selecting the data |
| --select_random_fraction | double | 0.0 | Selects a fraction (a number between 0 and 1) of the total variants at random from the variant track |
| --selectTypeToInclude | List[Type] | [] | Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times |
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
Allow samples other than those in the VCF to be specified on the command line. These samples will be ignored.
Output variants that were also called in this comparison track. A site is considered concordant if (1) we are not looking for specific samples and there is a variant called in both the variant and concordance tracks or (2) every sample present in the variant track is present in the concordance track and they have the same genotype call. --concordance binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
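The two concordance rules can be sketched in a few lines of Python. This is only an illustration of the rule as worded above, not the tool's implementation; the `Site` representation and the function name `is_concordant` are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical representation: a site maps sample name -> genotype string
# (e.g. "0/1"); None means the track has no record at this position.
Site = Optional[Dict[str, str]]


def is_concordant(variant: Site, comp: Site, selecting_samples: bool) -> bool:
    """Apply the two concordance rules described above to one position."""
    if variant is None or comp is None:
        return False
    if not selecting_samples:
        # Rule 1: no specific samples requested, record present in both tracks.
        return True
    # Rule 2: every sample in the variant track is also in the concordance
    # track and carries the same genotype call there.
    return all(comp.get(sample) == gt for sample, gt in variant.items())
```

Under this sketch, a site where sample A is "0/1" in both tracks is concordant, while a site where the comparison track calls A as "1/1" is not.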
Output variants that were not called in this comparison track. A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype and either the site isn't present in this track, the sample isn't present in this track, or the sample is called reference in this track. --discordance binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
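As a rough illustration of the discordance rule, the Python sketch below checks one position. It assumes genotypes are plain VCF-style strings and treats a comparison genotype as "called reference" only when every allele is 0; the names `is_discordant`, `carries_alt` and `is_hom_ref` are hypothetical, and the tool's own logic may differ in detail.

```python
from typing import Dict, List, Optional

Site = Optional[Dict[str, str]]  # sample name -> genotype; None = no record


def _alleles(gt: str) -> List[str]:
    return gt.replace("|", "/").split("/")


def carries_alt(gt: str) -> bool:
    """True if the genotype has at least one called non-reference allele."""
    return any(a not in ("0", ".") for a in _alleles(gt))


def is_hom_ref(gt: str) -> bool:
    """True if every allele is a called reference allele."""
    return all(a == "0" for a in _alleles(gt))


def is_discordant(variant: Site, comp: Site) -> bool:
    """Some non-reference sample is missing or reference in the comparison track."""
    if variant is None:
        return False
    for sample, gt in variant.items():
        if not carries_alt(gt):
            continue
        if comp is None or sample not in comp or is_hom_ref(comp[sample]):
            return True
    return False
```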
File containing a list of samples (one per line) to exclude. Can be specified multiple times. Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded.
Exclude genotypes from this sample. Can be specified multiple times. Note that sample exclusion takes precedence over inclusion, so that if a sample is in both lists it will be excluded.
Don't include filtered loci in the analysis.
Don't include loci found to be non-variant after the subsetting procedure.
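The effect of these two flags can be pictured with a small Python sketch. It assumes a site counts as filtered when its FILTER column is anything other than PASS or ".", and as non-variant when no remaining sample carries a non-reference allele; `keep_site` and its parameters are hypothetical names used only for illustration.

```python
from typing import Iterable


def carries_alt(gt: str) -> bool:
    """True if the genotype has at least one called non-reference allele."""
    return any(a not in ("0", ".") for a in gt.replace("|", "/").split("/"))


def keep_site(filter_field: str, genotypes: Iterable[str],
              exclude_filtered: bool = False,
              exclude_non_variants: bool = False) -> bool:
    """Decide whether a site survives the two exclusion flags.

    `genotypes` holds the genotype strings of the samples that remain
    after subsetting.
    """
    if exclude_filtered and filter_field not in ("PASS", "."):
        return False
    if exclude_non_variants and not any(carries_alt(gt) for gt in genotypes):
        return False
    return True
```

A site that was variant in the full cohort can therefore be dropped once the only carriers have been removed by sample selection.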
Only emit sites whose ID is found in this file (one ID per line). If provided, only variants whose ID field is present in this list are included. Matching is exact string matching.
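A minimal Python sketch of exact ID matching follows. The helper names `load_ids` and `keep_by_id` are hypothetical, and the sketch compares the whole ID column as a single string, which is an assumption about how multi-ID records would be handled.

```python
from typing import Iterable, Set


def load_ids(lines: Iterable[str]) -> Set[str]:
    """Read one ID per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def keep_by_id(record_id: str, wanted: Set[str]) -> bool:
    """Exact, case-sensitive match on the full ID string."""
    return record_id in wanted
```

With a list containing rs123, a record whose ID is rs12 or RS123 would not be emitted under this sketch, because neither is an exact match.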
Store the original AC, AF, and AN values in the INFO field after selecting (using keys AC_Orig, AF_Orig, and AN_Orig).
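To make the bookkeeping concrete, here is a hedged Python sketch for a single biallelic site. It recomputes AC, AN and AF from the genotypes that remain after subsetting and, when requested, copies the previous values to the `_Orig` keys named above. The function name `update_counts` and the dictionary layout are assumptions for illustration only.

```python
from typing import Dict, Iterable


def update_counts(info: Dict[str, float], genotypes: Iterable[str],
                  keep_original: bool = False) -> Dict[str, float]:
    """Recompute AC/AN/AF for one biallelic site from the kept genotypes."""
    updated = dict(info)
    if keep_original:
        # Preserve the pre-subsetting values under AC_Orig, AF_Orig, AN_Orig.
        for key in ("AC", "AF", "AN"):
            if key in info:
                updated[key + "_Orig"] = info[key]
    # Collect called alleles only; "." (no-call) does not count toward AN.
    alleles = [a for gt in genotypes
               for a in gt.replace("|", "/").split("/") if a != "."]
    an = len(alleles)
    ac = sum(1 for a in alleles if a == "1")
    updated["AN"] = an
    updated["AC"] = ac
    if an > 0:
        updated["AF"] = ac / an
    else:
        updated.pop("AF", None)
    return updated
```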
Maximum indel size to select. Indels larger than this size are excluded from the output.
Output Mendelian violation sites only. This activates the Mendelian violation module, which selects all variants that correspond to a Mendelian violation according to the rules given by the family structure.
Minimum genotype QUAL score for each trio member required to accept a site as a violation.
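The idea of a quality-gated trio check can be sketched in Python as follows. This is a simplified illustration, assuming a diploid autosomal site, one child with two parents, and a threshold treated as an inclusive minimum; the function `is_violation` is hypothetical and is not the tool's Mendelian violation module.

```python
from itertools import product
from typing import List, Tuple

Call = Tuple[str, float]  # (genotype string such as "0/1", genotype quality)


def _alleles(gt: str) -> List[str]:
    return gt.replace("|", "/").split("/")


def is_violation(child: Call, mother: Call, father: Call,
                 min_qual: float = 0.0) -> bool:
    """True if the child's genotype cannot be inherited from the parents."""
    trio = (child, mother, father)
    if any("." in gt for gt, _ in trio):
        return False  # a no-call in the trio cannot be judged
    if any(qual < min_qual for _, qual in trio):
        return False  # every member must reach the quality threshold
    child_alleles = sorted(_alleles(child[0]))
    # The child needs one allele from the mother and one from the father.
    for m, f in product(_alleles(mother[0]), _alleles(father[0])):
        if sorted((m, f)) == child_alleles:
            return False
    return True
```

For example, a 1/1 child with a 0/0 mother and a 0/1 father is reported as a violation only if all three genotype qualities reach the threshold.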
File to which variants should be written.
Selects a fraction (a number between 0 and 1) of the total genotypes at random from the variant track and sets them to nocall.
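One way to picture this option is the Python sketch below, which assumes each genotype is blanked independently with the given probability. That sampling scheme and the name `blank_genotypes` are assumptions; the source states only that a random fraction of genotypes is set to no-call.

```python
import random
from typing import Dict


def blank_genotypes(genotypes: Dict[str, str], fraction: float,
                    rng: random.Random) -> Dict[str, str]:
    """Replace each genotype with a no-call ("./.") with probability `fraction`."""
    return {sample: ("./." if rng.random() < fraction else gt)
            for sample, gt in genotypes.items()}
```

Passing a seeded `random.Random` instance makes the selection reproducible between runs of the sketch.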
Select only variants of a particular allelicity. Valid options are ALL (default), MULTIALLELIC or BIALLELIC. When this argument is used, we can choose to include only multiallelic or biallelic sites, depending on how many alleles are listed in the ALT column of a vcf.
For example, a multiallelic record such as:
1 100 . A AAA,AAAAA
will be excluded if "-restrictAllelesTo BIALLELIC" is included, because there are two alternate alleles, whereas a record such as:
1 100 . A T
will be included in that case, but would be excluded if "-restrictAllelesTo MULTIALLELIC" is included.
The --restrictAllelesTo argument is an enumerated type (NumberAlleleRestriction), which can have one of the following values: ALL, BIALLELIC or MULTIALLELIC.
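The allelicity check described above amounts to counting the entries in the ALT column. A small Python sketch, with the hypothetical function name `passes_allele_restriction`:

```python
def passes_allele_restriction(alt_field: str, restriction: str = "ALL") -> bool:
    """Check a record's ALT column against ALL, BIALLELIC or MULTIALLELIC."""
    # "." in the ALT column means no alternate allele is listed.
    n_alt = 0 if alt_field == "." else len(alt_field.split(","))
    if restriction == "ALL":
        return True
    if restriction == "BIALLELIC":
        return n_alt == 1
    if restriction == "MULTIALLELIC":
        return n_alt > 1
    raise ValueError("unknown restriction: " + restriction)
```

Applied to the two records above, the ALT value AAA,AAAAA passes only ALL and MULTIALLELIC, while the ALT value T passes only ALL and BIALLELIC.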
Regular expression to select many samples from the ROD tracks provided. Can be specified multiple times.
File containing a list of samples (one per line) to include. Can be specified multiple times.
Include genotypes from this sample. Can be specified multiple times.
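The interaction between the sample inclusion arguments (names and regular expressions) and the exclusion arguments can be sketched in Python. Two points are assumptions rather than facts from this page: that a regular expression may match anywhere in the sample name, and that all samples are kept when no inclusion argument is given. The function `select_samples` is hypothetical.

```python
import re
from typing import Iterable, List


def select_samples(all_samples: List[str],
                   names: Iterable[str] = (),
                   expressions: Iterable[str] = (),
                   excluded: Iterable[str] = ()) -> List[str]:
    """Resolve the final sample list, preserving the VCF's sample order."""
    names, expressions, excluded = set(names), list(expressions), set(excluded)
    if names or expressions:
        chosen = {s for s in all_samples
                  if s in names or any(re.search(e, s) for e in expressions)}
    else:
        chosen = set(all_samples)  # assumption: no selection keeps everything
    # Exclusion takes precedence over inclusion.
    return [s for s in all_samples if s in chosen and s not in excluded]
```

A sample that appears in both the include and exclude lists is therefore absent from the result.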
One or more criteria to use when selecting the data. Note that these expressions are evaluated *after* the specified samples are extracted and the INFO field annotations are updated.
Selects a fraction (a number between 0 and 1) of the total variants at random from the variant track. This routine is based on probability, so the final result is not guaranteed to carry the exact fraction. Can be used for large fractions.
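The reason the final count is not exact can be seen in a short Python sketch of probability-based selection. It assumes each variant is kept independently with the given probability, which is one plausible reading of "based on probability"; the name `sample_variants` is hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sample_variants(records: Iterable[T], fraction: float,
                    rng: random.Random) -> List[T]:
    """Keep each record independently with probability `fraction`."""
    return [record for record in records if rng.random() < fraction]
```

With a fraction of 0.5 and 10,000 input records, the number kept will be close to 5,000 but will rarely be exactly 5,000.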
Select only a certain type of variants from the input file. Valid types are INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION. Can be specified multiple times. This argument selects particular kinds of variants out of a list. If left empty, there is no type selection and all variant types are considered for other selection criteria. When specified one or more times, only variants of the specified types are selected.
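For orientation, the Python sketch below classifies a record from its REF and ALT alleles and applies a type filter. The classification rules are a deliberate simplification based on allele lengths and are an assumption, not the tool's actual type determination; `variant_type` and `passes_type_filter` are hypothetical names.

```python
from typing import Iterable, List


def variant_type(ref: str, alts: List[str]) -> str:
    """Rough classification of a record from its REF and ALT alleles."""
    def classify(alt: str) -> str:
        if alt.startswith("<"):
            return "SYMBOLIC"
        if len(ref) == len(alt):
            return "SNP" if len(ref) == 1 else "MNP"
        return "INDEL"

    kinds = {classify(alt) for alt in alts}
    if not kinds:
        return "NO_VARIATION"
    # Alternate alleles of differing kinds make the record MIXED.
    return kinds.pop() if len(kinds) == 1 else "MIXED"


def passes_type_filter(ref: str, alts: List[str],
                       select_types: Iterable[str] = ()) -> bool:
    """An empty selection means no type filtering is applied."""
    select_types = set(select_types)
    return not select_types or variant_type(ref, alts) in select_types
```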
Input VCF file. Variants from this VCF file are used by this tool as input. The file must at least contain the standard VCF header lines, but can be empty (i.e., no variants are contained in the file). --variant binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3
GATK version 2.5-2-gdb4546e built at 2013/05/01 09:32:36.