Tagged with #selectvariants
2 documentation articles | 5 announcements | 85 forum discussions

Created 2012-08-01 23:04:18 | Updated 2016-03-16 21:58:01 | Tags: selectvariants variantfiltration jexl

Comments (38)

1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".

4. Filtering on sample/genotype-level properties

You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. Variant Filtration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1).

5. Important caveats

Sensitivity to case and type

You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:

  • Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type

The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.

6. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The bets way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

Examples using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263

Examples using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'

Example using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'

Created 2012-07-23 17:50:07 | Updated 2015-05-15 23:55:20 | Tags: selectvariants jexl vcf callset

Comments (32)

This document describes why you might want to extract a subset of variants from a callset and how you would achieve this.

Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). The GATK tool that we use the most for subsetting calls in various ways is SelectVariants; it enables easy and convenient subsetting of VCF files according to many criteria.

Select Variants operates on VCF files (also sometimes referred to as ROD in our documentation, for Reference Ordered Data) provided at the command line using the GATK's built in --variant option. You can provide multiple VCF files for Select Variants, but at least one must be named 'variant' and this will be the file (or set of files) from which variants will be selected. Other files can be used to modify the selection based on concordance or discordance between the callsets (see --discordance / --concordance arguments in the tool documentation).

There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name, or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency. This is done using JEXL expressions; make sure to read the linked document for details, especially the section on working with complex expressions.

Note that in the output VCF, some annotations such as AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) are recalculated as appropriate to accurately reflect the composition of the subset callset. See further below for an explanation of how that works.

Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.

Subsetting by sample and ALT alleles

SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().

  • isVariant => is there an ALT allele?

  • isPolymorphic => is some sample non-ref in the samples?

For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.

How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB        MARY        LINDA
1/0:20     0/0:30      1/1:50

In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB        MARY
1/0:20     0/0:30

The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB        MARY
1/0:20     ./.:.

with AN=2, AC=1, AF=0.5, and DP=20.

Additional information

For information on how to construct regular expressions for use with this tool, see the method article on variant filtering with JEXL, or "Summary of regular-expression constructs" section here for more hardcore reading.

Created 2014-07-15 03:54:06 | Updated 2014-10-23 17:58:36 | Tags: variantrecalibrator haplotypecaller selectvariants variantannotator release-notes catvariants genotypegvcfs

Comments (13)

GATK 3.2 was released on July 14, 2014. Itemized changes are listed below. For more details, see the user-friendly version highlights.

We also want to take this opportunity to thank super-user Phillip Dexheimer for all of his excellent contributions to the codebase, especially for this release.

Haplotype Caller

  • Various improvements were made to the assembly engine and likelihood calculation, which leads to more accurate genotype likelihoods (and hence better genotypes).
  • Reads are now realigned to the most likely haplotype before being used by the annotations, so AD and DP will now correspond directly to the reads that were used to generate the likelihoods.
  • The caller is now more conservative in low complexity regions, which significantly reduces false positive indels at the expense of a little sensitivity; mostly relevant for whole genome calling.
  • Small performance optimizations to the function to calculate the log of exponentials and to the Smith-Waterman code (thanks to Nigel Delaney).
  • Fixed small bug where indel discovery was inconsistent based on the active-region size.
  • Removed scary warning messages for "VectorPairHMM".
  • Made VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
  • When we subset PLs because alleles are removed during genotyping we now also subset the AD.
  • Fixed bug where reference sample depth was dropped in the DP annotation.

Variant Recalibrator

  • The -mode argument is now required.
  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.


  • The plotting script now uses the theme instead of opt functions to work with recent versions of the ggplot2 R library.

Variant Annotator

  • SB tables are created even if the ref or alt columns have no counts (used in the FS and SOR annotations).

Genotype GVCFs

  • Added missing arguments so that now it models more closely what's available in the Haplotype Caller.
  • Fixed recurring error about missing PLs.
  • No longer pulls the headers from all input rods including dbSNP, rather just from the input variants.
  • --includeNonVariantSites should now be working.

Select Variants

  • The dreaded "Invalid JEXL expression detected" error is now a kinder user error.

Indel Realigner

  • Now throws a user error when it encounters reads with I operators greater than the number of read bases.
  • Fixed bug where reads that are all insertions (e.g. 50I) were causing it to fail.


  • Now computes posterior probabilities only for SNP sites with SNP priors (other sites have flat priors applied).
  • Now computes genotype posteriors using likelihoods from all members of the trio.
  • Added annotations for calling potential de novo mutations.
  • Now uses PP tag instead of GP tag because posteriors are Phred-scaled.

Cat Variants

  • Can now process .list files with -V.
  • Can now handle BCF and Block-Compressed VCF files.

Validate Variants

  • Now works with gVCF files.
  • By default, all strict validations are performed; use --validationTypeToExclude to exclude specific tests.


  • Now use '--use_IUPAC_sample sample_name' to specify which sample's genotypes should be used for the IUPAC encoding with multi-sample VCF files.


  • Refactored maven directories and java packages replacing "sting" with "gatk".
  • Extended on-the-fly sample renaming feature to VCFs with the --sample_rename_mapping_file argument.
  • Added a new read transformer that refactors NDN cigar elements to one N element.
  • Now a Tabix index is created for block-compressed output formats.
  • Switched outputRoot in SplitSamFile to an empty string instead of null (thanks to Carlos Barroto).
  • Enabled the AB annotation in the reference model pipeline (thanks to John Wallace).
  • We now check that output files are specified in a writeable location.
  • We now allow blank lines in a (non-BAM) list file.
  • Added legibility improvements to the Progress Meter.
  • Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming (thanks to Mike McCowan).
  • Made IntervalSharder respect the IntervalMergingRule specified on the command line.
  • Sam, tribble, and variant jars updated to version 1.109.1722; htsjdk updated to version 1.112.1452.

Created 2014-04-11 05:00:50 | Updated | Tags: selectvariants bug multithreading ad bug-fixed nt

Comments (1)

This is not exactly new (it was fixed in GATK 3.0) but it's come to our attention that many people are unaware of this bug, so we want to spread the word since it might have some important impacts on people's results.

Affected versions: 2.x versions up to 2.8 (not sure when it started)

Affected tool: SelectVariants

Trigger conditions: Extracting a subset of samples with SelectVariants while using multi-threading (-nt)

Effects: Genotype-level fields (such as AD) swapped among samples

This bug no longer affects any tools in versions 3.0 and above, but callsets generated with earlier versions may need to be checked for consistency of genotype-level annotations. Our sincere apologies if you have been affected by this bug, and our thanks to the users who reported experiencing this issue.

Comments (2)

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.

Unified Genotyper

  • Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

Haplotype Caller

  • Improved the indexing scheme for gVCF outputs using the reference calculation model.
  • The reference calculation model now works with reduced reads.
  • Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
  • Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

Variant Recalibrator

  • Disable tranche plots in INDEL mode.
  • Various VQSR optimizations in both runtime and accuracy. Some particular details include: for very large whole genome datasets with over 2M variants overlapping the training data randomly downsample the training set that gets used to build; annotations are ordered by the difference in means between known and novel instead of by their standard deviation; removed the training set quality score threshold; now uses 2 gaussians by default for the negative model; numBad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores.

Reduce Reads

  • Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

Diagnose Targets

  • Added calculation for GC content.
  • Added an option to filter the bases based on their quality scores.

Combine Variants

  • Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

Select Variants

  • Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.


  • SplitSamFile now produces an index with the BAM.
  • Length metric updates to QualifyMissingIntervals.
  • Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
  • Picard jar updated to version 1.104.1628.
  • Tribble jar updated to version 1.104.1628.
  • Variant jar updated to version 1.104.1628.
Comments (2)

GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Base Quality Score Recalibration

  • Improved the algorithm around homopolymer runs to use a "delocalized context".
  • Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
  • Fixed bug where the tool failed for reads that begin with insertions.
  • Fixed bug in the scatter-gather functionality.
  • Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).

Unified Genotyper

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • The genotyper no longer emits the Stand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
  • Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
  • Fixed annotations (AD, FS, DP) that were miscalculated when run on a Reduce Reads processed bam.
  • Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
  • Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
  • Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
  • Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
  • The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
  • Fixed annotations (particularly AD) for indel calls; previous versions didn't accurately bin reads into the reference or alternate sets correctly.
  • Generalized ploidy model now handles reference calls correctly.

Haplotype Caller

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
  • Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
  • Now requires at least 10 samples to merge variants into complex events.

Variant Annotator

  • Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.

Reduce Reads

  • Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
  • Fixed bugs where in rare cases N bases were chosen as consensus over legitimate A,C,G, or T bases.
  • Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.

Variant Filtration

  • Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.

Variant Eval

  • AlleleCount stratification now supports records with ploidy other than 2.

Combine Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Now outputs the first non-missing QUAL, not the maximum.

Select Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Removed the -number argument because it gave biased results.

Validate Variants

  • Added option to selectively choose particular strict validation options.
  • Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
  • improved the error message around unused ALT alleles.

Somatic Indel Detector

  • Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).


  • New CPU "nano" parallelization option (-nct) added GATK-wide (see docs for more details about this cool new feature that allows parallelization even for Read Walkers).
  • Fixed raw HapMap file conversion bug in VariantsToVCF.
  • Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
  • Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
  • Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
  • Fixed bug in BCF2 writer for case where all genotypes are missing.
  • Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
  • Fixed bug in Phase By Transmission when there are no likelihoods present.
  • Fixed bug in fasta .fai generation.
  • Updated and improved version of the BadCigar read filter.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar remains at version 110.

Created 2012-08-20 18:52:48 | Updated 2012-08-23 14:11:29 | Tags: unifiedgenotyper baserecalibrator combinevariants haplotypecaller selectvariants varianteval release-notes

Comments (0)

Base Quality Score Recalibration

  • Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
  • Implemented support for SOLiD no call strategies other than throwing an exception.
  • Fixed smoothing in the BQSR bins.
  • Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.

Unified Genotyper

  • Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
  • UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
  • Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
  • In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) then do not try to genotype.
  • Added improvements to indel calling in pooled mode: we compute per-read likelihoods in reference sample to determine whether a read is informative or not.

Haplotype Caller

  • Added LowQual filter to the output when appropriate.
  • Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
  • Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
  • Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
  • Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
  • Fixed bug where non-standard bases from the reference would cause errors.
  • Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.

Reduce Reads

  • Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
  • Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
  • Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.

Variant Eval

  • Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
  • Fixed incorrect allele counting in IndelSummary evaluation.

Combine Variants

  • Now outputs the first non-MISSING QUAL, instead of the maximum.
  • Now supports multi-threaded running (with the -nt argument).

Select Variants

  • Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
  • No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
  • If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).


  • Updated and improved the BadCigar read filter.
  • GATK now generates a proper error when a gzipped FASTA is passed in.
  • Various improvements throughout the BCF2-related code.
  • Removed various parallelism bottlenecks in the GATK.
  • Added support of X and = CIGAR operators to the GATK.
  • Catch NumberFormatExceptions when parsing the VCF POS field.
  • Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
  • Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
  • We now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them.
  • Added support for handling complex events in ValidateVariants.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar remains at version 110.

Created 2016-05-18 10:20:34 | Updated | Tags: selectvariants variantfiltration genotype no-calls

Comments (1)


I have been trying to find a way to mark specific individual genotypes as No Call.

I know that in VariantFiltration it is possible to add the option --setFilteredGTToNocall in order to mark filtered genotypes as no call. However, in my case, there is no available filter corresponding to my criteria. Let me explain. I have some diploid-haploid paired related samples, i.e. I expect the haploid individual to share one of the two alleles from its diploid related. Therefore, for positions where there are discordant genotypes, I would like to mark individual genotypes as as no call in order to not include potentials errors in my data. I couldn't find a way to apply this rational to the pipeline of VariantFiltration or SelectVariants...

Eventually, I manage to manipulate the .vcf file by myself to mark any discordant genotypes as . or ./. However, when later I want to select the positions for which there are maximum of 0.2 ratio of no call (using the option --maxNOCALLfraction of SelectVariants from the nightly build version), there was an error message "there aren't enough comumns for line...". I suppose that it's because I did manipulate the vcf file by myself and there could have been errors (although I'm quite confident with my script).

Therefore, I think it's better to not manipulate the vcf file and try to do things using the pipeline....but how?

Created 2016-04-25 21:22:01 | Updated | Tags: selectvariants homozygosity vcf-filtering

Comments (2)


I am trying to use an or statement to select for homozygous opposite genotypes between two samples in select variants. I have tried some variations but never been able to get what I am looking for in the outcome. Can you please advise what is going wrong here.

Options in use are: -env -noTrim --selectTypeToExclude INDEL -select 'vc.getGenotype("SampleA").isHomVar() || vc.getGenotype("SampleA").isHomRef()' --selectexpressions "AF==0.500" --restrictAllelesTo BIALLELIC

If I can select for either HomVar or HomRef for that sample, then the AF should take care of ensuring the second sample is homozygous for the opposite genotype with the AF.

Thank you, Amanda

Created 2016-03-29 20:12:58 | Updated | Tags: selectvariants variantfiltration

Comments (2)


As many members have suggested I am using a standard method of running through VariantFiltration to change samples with low depth to missing genotype using -G_filter "DP<4" -G_filterName "SampleDepthFilter4" --setFilteredGtToNocall.

Then using SelectVariants to filter by MAF and Missing%s using allele numbers with the resulting file using --selectexpressions "AF>0.1" --selectexpressions "AF<0.9" --selectexpressions "AN>140" --restrictAllelesTo BIALLELIC --selectTypeToExclude INDEL.

However when I'm not going through and looking at the results, I see that in almost all cases the missing percentages are much higher than what I was theoretically filtering for with Allele Number greater than 140, ie 70 individuals. It seems that it is not taking into account the genotypes that have been changed over to missing based on the sample depth filter, even though it is using the resulting file from the VariantFiltration step. Why is filtering not performing as anticipated?

Thank you, Amanda

Created 2016-03-24 16:38:02 | Updated | Tags: selectvariants haploid

Comments (1)

I'm getting an error on the latest version of SelectVariants when I try and remove alternate alleles in a haploid organism. If I remove the 'TrimAlternates' flag it works correctly, but obviously then I'm left with ghost alleles. Can't see any reason why all alleles should need to be diploid for this?



sredmond$ java -jar /humgen/gsa-hpprojects/GATK/bin/GenomeAnalysisTK-3.5-0-g36282e4/GenomeAnalysisTK.jar -R /seq/plasmodium/sredmond//refs/Pf3D7_v3.fasta -T SelectVariants -V 3D7DD2_300100.VQSR.Pass90.vcf -o 3D7DD2_300100.VQSR.Pass90.SM-7LV8U.vcf -sn SM-7LV8U -env -trimAlternates Picked up _JAVA_OPTIONS: -Xmx8g -Xms1g INFO 12:31:30,195 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:31:30,198 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56 ...

ERROR MESSAGE: All samples must be diploid
ERROR ------------------------------------------------------------------------------------------

Created 2016-03-18 22:34:38 | Updated 2016-03-18 22:39:30 | Tags: selectvariants

Comments (3)

Hi GATK team,

I noticed an issue, which might be a bug, relating to removal of sites with only '*' mark in ALT field when subsetting samples.

Assume we have a deletion, and there is one SNP within this deletion interval (as showed below). For SNP site, the ALT field will have an asterisk mark, representing the deletion:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 Sample3 Sample4
1       1       .       AAAAAAAA        T       2000    PASS    AC=1;AF=0.125;AN=8      GT      0/0     0/0     0/0     0/1
1       3       .       A       C,*     2000    PASS    AC=1,1;AF=0.25,0.125;AN=8       GT      0/0     0/1     0/0     0/2

For this SNP, only sample2 has the alternative C allele. If we subset Sample 2, this SNP site should be removed, because the information of deletion is fully represented in the first row (the first row except the header).

However, if we use GATK selectVariants function to remove sample2, it will keep the second record, leaving ALT field an '*' mark only, which seems incorrect to me. Here is the output from GATK:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1 Sample3 Sample4
1       1       .       AAAAAAAA        T       2000    PASS    AC=1;AF=0.167;AN=6      GT      0/0     0/0     0/1
1       3       .       A       *       2000    PASS    AC=1;AF=0.167;AN=6      GT      0/0     0/0     0/1

The command I use to do subset is:

java -Xmx6g -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly19.fasta -T SelectVariants -V gatkSubsetVars.vcf -o gatkSubsetVarsNoSamp2.vcf --excludeNonVariants --removeUnusedAlternates  -xl_sn Sample2

I think we should remove line chr1_3. This should be an easy bug to fix.

Let me know if you want me to upload a snippet to help you fix this bug.


Created 2016-03-15 11:07:50 | Updated | Tags: selectvariants

Comments (2)

Hi all,

I am still working on how to process sex-chromosomes correctly, and as a result I have a VCF of haploid calls for a few hundred male X-chromosomes. I am trying to subset out by sample using SelectVariants, but it throws the following error:

##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
##### ERROR MESSAGE: All samples must be diploid

The relevant part of the command is the following:

java -Xmx8g -Djava.io.tmpdir=$TMPDIR -jar /apps/GATK/3.5/GenomeAnalysisTK.jar \ -T SelectVariants \ --sample_file Final385Samples.219males.list \ --excludeNonVariants \ --removeUnusedAlternates \ -R /project/production/Indexes/samtools/hsapiens.hs37d5.fasta \ -V 312Samples.ChrX.haploid.vcf.gz \ -o 219samples.ChrX.haploid.vcf.gz

I tried running with version 3.4, which doesn't seem to complain about this issue, however it is not happy with the "*" alleles.

Is there any simple workaround for this that will provide a correctly formatted output file, or am I doing something wrong?

Thanks, Steve

Created 2016-03-07 16:21:26 | Updated 2016-03-07 16:31:04 | Tags: selectvariants variantfiltration

Comments (15)

Dear GATK team,

I am trying to use VariantFiltration and SelectVariants to mask individual calls with GQ < 20 or DP < 10 in my VCF; however, as I succeed masking individuals who didn't pass the QC with './.', the AC, AN and AF in the INFO field stayed the same. Just want to make sure that I did everything correctly:

I first use VariantFiltration to mark filtered individuals (I am using GATK version: 3.5-0-g36282e4):

java -jar GenomeAnalysisTK.jar -R hg19.fasta -V test.vcf -o test.vf.vcf -G_filter 'GQ<20' --genotypeFilterName 'lowGQ' -G_filter 'DP<10' --genotypeFilterName 'lowDP' -V test.vcf -o test.out1.vcf

and then use SelectVariants to set those filtered individuals:

java -jar GenomeAnalysisTK.jar -T SelectVariants --removeUnusedAlternates --excludeNonVariants --setFilteredGtToNocall -V test.out1.vcf -o test.out2.vcf

Here is a toy example which might help you debug:

the original input was:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 1 1000 . A T 2000 PASS AC=3;AF=0.375;AN=8 GT:DP:GQ 0/0:20:30 0/1:20:30 0/1:20:30 0/1:5:10

After VariantFiltration: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 1 1000 . A T 2000 PASS AC=3;AF=0.375;AN=8 GT:DP:FT:GQ 0/0:20:PASS:30 0/1:20:PASS:30 0/1:20:PASS:30 0/1:5:lowDP;lowGQ:10

After SelectVariants: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 1 1000 . A T 2000 PASS AC=3;AF=0.375;AN=8 GT:DP:FT:GQ 0/0:20:PASS:30 0/1:20:PASS:30 0/1:20:PASS:30 ./.:5:lowDP;lowGQ:10

It seems to me that selectVariants should update the AC, AF and AN, as after we masking the individual's genotypes. I am not sure if I am doing this with the correct command. I've been reading previous threads about variantFiltration and selectVariants on individual genotypes, but not find cases like this.

Hope to hear from you. Thanks a lot.


Created 2016-03-01 15:20:28 | Updated | Tags: selectvariants

Comments (3)


I have a problem using SelectVariants at multiallelic sites. For a given patient (let's call him P1) I want to keep only the positions which are variant in his genome. I use the following options : --preserveAlleles : I keep the original form of the alleles, as they are called in the original vcf --excludeNonVariants : I do not want 0/0 positions for the patient P1 --removeUnusedAlternates : I want only the alleles which are specific to P1

The last point is the problematic one. Yes, it partially work. For example, let's say I have this variant in the original VCF, with two alleles in my population :

chrZ 375987 . TA T,TAA

In the P1-only-VCF, after extraction, I only have (let's say that P1 is 0/1) : chrZ 375987 . TA T Which is correct.

Nevertheless, even if only the good allele is kept, all the information from the INFO fields is preserved (for all the alleles) . A little sample from the ANN records of the P1-only-file : ANN=T|intron_variant|MODIFIER|GENE|GENE|transcript|NM_TR.1|Coding|1/2|c.62+5446delT||||||, TAA|intron_variant|MODIFIER|GENE|GENE|transcript|NM_TR.1|Coding|1/2|c.62+5445_62+5446insT

I put in bold the information from an allele absent in P1. This is annoying because it disturb the interpretation. If anybody have a suggestion, it will be the very welcomed !

Thanks by advance, BPR

Created 2016-02-29 00:11:59 | Updated | Tags: selectvariants variantfiltration phasing genotypegvcfs

Comments (9)


I'd be grateful for your help in understanding how GATK add phase information (PID field) in GenotypeGVCFs.

Here's an example of two lines from a VCF I'm getting for running GenotypeGVCFs on a set of samples:

chr1 8864083 . T C 462.27 . AC=4;AF=0.500;AN=8;BaseQRankSum=-4.950e-01;ClippingRankSum=0.406;DP=74;FS=0.000;MLEAC=4;MLEAF=0.500;MQ=60.00;MQRankSum=0.306;QD=6.42;ReadPosRankSum=1.54;SOR=0.674 GT:AD:DP:GQ:PL 0/1:22,12:34:99:205,0,518 0/1:10,4:14:60:60,0,220 0/1:3,3:6:71:71,0,72 0/1:10,8:18:99:158,0,238 ./.:2,0:2 chr1 8864426 . C T 5302.72 . AC=5;AF=0.500;AN=10;BaseQRankSum=0.815;ClippingRankSum=-6.270e-01;DP=316;FS=7.664;MLEAC=5;MLEAF=0.500;MQ=60.00;MQRankSum=-6.130e-01;QD=17.05;ReadPosRankSum=-1.240e-01;SOR=0.777 GT:AD:DP:GQ:PGT:PID:PL 0/1:24,66:90:99:0|1:8864426_C_T:1596,0,649 0/1:20,62:82:99:.:.:1502,0,536 0/1:24,37:61:99:0|1:8864426_C_T:921,0,647 0/1:22,41:63:99:0|1:8864426_C_T:1014,0,717 0/1:2,13:15:74:0|1:8864426_C_T:304,0,74

What's the interpretation for the phase information of the last sample in chr1:8864426? in position chr1:8864083 its genotype is not reported but in the proceeding position chr1:8864426 it is phased 0|1 WRT position chr1:8864083.

If I subsequently run SelectVariants for that sample, with --excludeNonVariants this phase information is retained but clearly looses context. So another question is if there is a way to change the PID field when running SelectVariants or VariantFiltration in case the position it is phased with is filtered?

Thanks a lot

Created 2016-02-25 09:24:33 | Updated | Tags: selectvariants

Comments (1)

Hi guys, I was subsetting a VCF file with SelectVariants, based on a sample list which contains some additional samples we didn't sequence.

GATK SelectVariants complained about the samples missing from VCF file and the suggested:

##### ERROR 
##### ERROR To ignore these samples, run with --allowNonOverlappingCommandLineSamples

However, I followed the suggestion and got

  ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
  ##### ERROR
  ##### ERROR MESSAGE: Argument with name 'allowNonOverlappingCommandLineSamples' isn't defined.

I can obviously fix this myself as recommended :) but I thought the option was quite convenient, and seemed right to flag this message up as the output suggests to use an option apparently not in place anymore (version 3.5-0-g36282e4)

cheers, F

Created 2016-02-19 09:53:13 | Updated | Tags: selectvariants rna-seq somatic-variants

Comments (18)

Dear All,

I would like to know whether GATK has any way to get somatic variants by giving both normal and tumor samples on the command line, I mean as Varscan does. ?

Thank you

Created 2016-02-05 16:45:17 | Updated | Tags: selectvariants

Comments (2)

I have consistently failed to SelectVariants from a gvcf containing only one sample with -select "GQ > 60.0" (or other variations, such as -select "GQ > 60"): the output vcf always ends up with a header but no entries, though there are many, many entries that meet this threshold. I've come up with a workaround (VariantsToTable, filtering with awk, then SelectVariands --keepIDs ), but it's memory inefficient and longterm I would like to be able to select sites (especially homozygous reference sites) based on GQ. This article about JEXL queries suggests what I've done should work fine (see section 4), but this forum post shows that others have had this problem a couple years ago.

I have been able to get SelectVariants to pull out homozygous reference calls by including -sn samplename -select 'vc.getGenotype("samplename").isHomRef()', but whenever I include any -select query based on GQ, the resulting file is blank (when this is included alone or in conjunction with other calls, including .isHomRef()).

Is there another trick I should try? Has anyone had success filtering on a sample's GQ? Other suggestions for pulling out high-quality homozygous reference sites from a gvcf containing data from just one sample?

Other information/detail that may or may not be relevant: My gvcf is BP_RESOLUTION, not in block format. I get no error messages from SelectVariants, just blank output. No standard SelectVariants options give me trouble, just genotype-level filters (and not all of them - as I mentioned .isHomRef() works fine).

All thoughts welcome - thanks!

Created 2016-01-21 17:45:48 | Updated | Tags: variantrecalibrator selectvariants s

Comments (18)


I have my final vcf file after variant recalibration and I was wondering where the AF information comes from. I suppose it comes from the training data. Right?

I am asking because I have too many variants with AF=1.00 or AF=0.00. and what I want is to extract the variants that have AF<0.15 according to the 1000 genome.

Can I do this through GATK? is the AF in my final file what I should be based on to do make the filtering?

Look forward to your reply.

Best, Alexandra

Created 2015-12-04 16:53:10 | Updated | Tags: selectvariants variantfiltration

Comments (7)


I apologise if this has been answered already, I haven't managed to find the answer (I promise I tried). I have applied hard-filters to my HC-SNP data set as recommended by GATK as a starting point. However, I also want to filter by depth (I think this is no longer a popular option but variants with good coverage are surely still more reliable than variants with very low coverage?). Ideally, what I want is a filtered vcf file that contains variants which are genotyped in the majority of my samples at a minimum depth. I think this is what the CoveredByNSampleSites Walker used to do, but it seems this walker has been removed?

So instead I worked around this using VariantFiltration at the sample level: --genotypeFilterExpression 'DP < 10' --genotypeFilterName DP-10 --setFilteredGtToNocall, so all samples with a depth of less than 10 for the any given variant were set to ./.. This seemed to work fine.

Now I want to exclude variants for which a large proportion of my samples have no genotype (./.). For this, I used:

SelectVariants --maxFilteredGenotypes 10.

However, when I check how many samples have no genotype using VariantsToTable -F NO-CALL, I still have many variants where 20-40 of my samples have no genotype! The --maxFilteredGenotypes option seems to work fine, as I definitely have far fewer variants with many missing genotypes. I believe this is due to the fact that these missing genotypes weren't 'filtered', but simply have not been genotyped in the HaplotypeCaller. I guess to exclude these, I would need an option along the lines of "--maxNOCALL" instead of the --maxFilteredGenotypes, but as far as I can tell, this does not exist?

I am sure there must be a way to do this and I simply haven't found it?

Many thanks for any pointers :)


Created 2015-11-25 21:38:23 | Updated 2015-11-25 21:40:23 | Tags: selectvariants vcf gatk

Comments (2)


I have multi-sample vcf file and an example variant is shown below: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 03-071 04-051 04-071 06-044 07-085 10-009

chr1 6526093 . T C 197.77 . AC1=1;AC=1;AF1=0.5 GT:GQ:DP:PL:AD 0/1 1/1 0/0 1/1 1/1 0/1

For each variant i would need to retrieve the sample names based on genotype. If the genotype is "0/1" could it be possible to select only the samples for which the genotype is "0/1" and similarly for the genotype "1/1"?

SelectVariants filters variants based on the different quality values provided. Here i would need to subset the vcf by sample for each variant based on the genotype. Are there any tools which can do this?

Created 2015-11-17 14:42:23 | Updated | Tags: selectvariants variantfiltration

Comments (2)


I'm hoping it's possible to mark variants in repetitive elements using variant filtration, so they can be selected out later. I've tried a few command variations, the most recent of which:

GenomeAnalysisTK -R local_reference/dm6.fa \ -T VariantFiltration \ -V my.vcf \ -filter "QD<5.0" -filterName "LowQD" \ -mask ../masking_intervals/dmel6_repMask.gatk.intervals -filterNotInMask -maskName "RepMask" \ -o my_marked.vcf

..fails with "No tribble type provided" for the intervals file.

Is this possible at all to do with VariantFiltration? Note that I'd rather mark the suspect variants for removal later, rather than just remove them altogether.

...actually in writing this, I just thought to create a new vcf of variants only in the intervals, and then use this vcf to screen out the original vcf..... Sound good ?


William Gilks

Created 2015-11-10 15:03:13 | Updated 2015-11-10 15:06:13 | Tags: selectvariants genotypegvcfs

Comments (9)


I have questions about the combination genotypeGVCFs/SelectVariants.

  • I have some loci with the genotype 0/2 for some samples, 0/1 for other.. There is thus two different alleles , but the corresponding ALT is for example « G,*_». If the nucleotide G corresponds to 0/1, to what corresponds exactly the 0/2 ? Said otherly, what does this star mean ?

  • How to correctly extract the variants for each sample from the combined vcf ? For a given sample, my command line is :

java -jar GenomeAnalysisTK.jar -T SelectVariants -R hg19_min_oldM.fa -V Combined.vcf -o Sample.vcf --excludeNonVariants -sn Sample

Nevertheless, it gives a vcf which is not exploitable by SnpSift extractFields because of theses stars.

  • last question : I read that a downsampling is effectuated by SelectVariants, but I do not understand what it means. As you guess, I really want to conserve all the variants that have been called by HaplotypeCaller, so I don't want SelectVariants to perform an other thing that extracting all the variants from each sample....

Sorry for all these questions... but thanks by advance to who will help me !

Cheers, BPR

Created 2015-11-06 11:07:22 | Updated | Tags: selectvariants variantannotator variantfiltration dbsnp

Comments (5)

Hi team,

Prior to submission to NCBI dbSNP a vcf generated by e.g HaplotypeCaller requires several modifications:

  1. Addition of in-house identifiers. .................................................... done
  2. Exclude if alternate allele is "*" i.e. they are in a deletion. .................................................... I'm assuming this can be done with SelectVariants or FilterVariants.
  3. Exclude if ref or alt allele is greater than 50bp .................................................... Perhaps with SelectVariants or FilterVariants --maxIndelSize 50
  4. Exclude if ref and alt alleles do not have a common leading base. .................................................... Not sure ... removing larger indels won't exclude all of these.
  5. Add VRT (variant type) to Info field .....................................................e.g VRT=1 (for an SNV), VRT=2 for an indel etc. SNPeff doesn't seem to work for this but I could be wrong.

Knowing how to effectively format vcfs between GATK output and NCBI input might be quite useful for many people, and save rather a lot of time.

It would be really useful if the three exclusion criteria could be done using GATK. Is this possible and using what commands ?

I feel as though need to use the GATK variantAnnotator command as well. I'm looking into all of this today, and will post if I get any solutions.


William Gilks

Created 2015-11-02 05:36:25 | Updated | Tags: selectvariants

Comments (6)

Hi! I'm using GATK to call SNPs for a F2 mapping population design using GBS. So, my reference is derived from all my GBS lines. Running the GATK using UnifiedGenotyper worked great, generating perfectly valid output and a viewable vcf (in excel and IGV). However, I'm now trying to filter for just biallelic SNPs using SelectVariants, and I keep getting this error:

ERROR MESSAGE: Bad input: We encountered a non-standard non-IUPAC base in the provided reference: '10'

I'm using the same reference files (.fa, .fai, .dict, etc) as for the GATK, and haven't modified them in anyway. They've never been processed by Windows software, and I can't seem to find any non-standard formatting or data (at least in a cursory look). Any thoughts?


Created 2015-10-26 12:17:38 | Updated | Tags: selectvariants genotypegvcfs gvcf

Comments (5)

GATK Admins,

We are currently working on a project, and we have found that some of our samples were contaminated after the gVCF merging phase. Is it possible to remove samples from a merged gVCF (likely using SelectVariants), or would we need to re-merge only the good gVCFs into a new merged gVCF? (Note that we're actually working with a double-merged gVCF file containing ~5,000 samples, so re-merging would be potentially costly).


John Wallace

Created 2015-09-18 16:32:58 | Updated | Tags: selectvariants

Comments (6)

Just FYI,

In the example "Set filtered genotypes to no-call (./.):" in;


The argument for setting flagged GT to null is a typo.

The example has it as


should be


Created 2015-09-14 15:21:42 | Updated | Tags: commandlinegatk selectvariants runtime-error

Comments (3)

Hi, I'm trying to extract a few samples from a large VCF file with many samples using SelectVariants and I keep running into this error.

Command: java -Xmx4g -jar ~/GenomeAnalysisTK.jar -T SelectVariants -R /Volumes/odin/reference/16484/mafa5/mafa5. -V 16557.all.exons.mafa5.ann.vcf.gz -o 16580.mafa5.M1M1.vcf -sn CY0320 -sn CY0321 -sn CY0322 -sn CY0323 -sn CY0324 -sn CY0325

ERROR stack trace

java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) at htsjdk.samtools.reference.IndexedFastaSequenceFile.getSubsequenceAt(IndexedFastaSequenceFile.java:195) at org.broadinstitute.gatk.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:329) at org.broadinstitute.gatk.engine.datasources.providers.LocusReferenceView.initializeReferenceSequence(LocusReferenceView.java:150) at org.broadinstitute.gatk.engine.datasources.providers.LocusReferenceView.(LocusReferenceView.java:126) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:90) at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Any thoughts as to why this might be happening?

Created 2015-09-10 19:21:44 | Updated 2015-09-10 19:22:49 | Tags: selectvariants

Comments (2)

Hello gatk forums,

Novice gatk user here. I have a VCF file with 200 samples. I am currently trying to split/subset the VCF file such that I am only looking at individual samples. Below is my command line input and the error message I am getting.

java -jar /my/path/GenomeAnalysisTK.jar \ -T SelectVariants \ -R chr22.fa \ -V original.vcf \ -o indiv.vcf \ -sn D66001

Bad input: Samples entered on command line (through -sf or -sn) that are not present in the VCF.

I am sure that my sample is present. What am I doing wrong? Is there an easier way of subsetting a VCF file by sample?

Created 2015-09-04 14:18:00 | Updated | Tags: selectvariants

Comments (1)

I can't seem to find this in the documentation (I'm sure it's there!). Can I use SelectVariants to pull out variants that segregate in a particular way with case/control status.

e.g. Variants in which all the cases are homozygous AA and the controls are AB or BB (for instance)

Created 2015-08-04 19:11:47 | Updated | Tags: selectvariants variantstotable

Comments (4)

Hello GATK team,

Given a sample.txt file to be used with -sf flag, and sample names given on each line, is there an option in SelectVariants and/or VariantsToTable that allows output to be ordered as specified?

sample.txt file: Ind1 Ind4 Ind2 Ind3

desired sample fields in .vcf file: Ind1 Ind4 Ind2 Ind3

This can be done with command line tools, but thought I would ask here first. Thanks!

Created 2015-07-24 10:54:41 | Updated | Tags: selectvariants

Comments (4)

I am using BWA/GATK Haplotype caller on core exom sequences. In my pipeline I use an extended (250bp padding around exons for indels/50bp for SNPS) bed interval file to identify variants. Following that, I use SelectVariants to filter out core variations with narrower bed coordinates (50bp padding for indels, 10bp for SNPs).

Everything is fine with SNPs, and insertions, however in case of large deletions SelectVariant is filtering out large deletions that are started outside of the core bed coordinates, even though the end of the deletion overlaps the narrower bed coordinates.

To visualise the problem I am attaching an IGV view, where the extended list contains the large deletion, however in the core list it is absent.

Is there any way to fix this issue? Calling the variants twice with different bed coordinates would require unnecessary computation power.

Created 2015-05-29 13:25:28 | Updated | Tags: selectvariants

Comments (3)

Hi there,

Had a JEXL syntax question that I was hoping to get some thoughts on. Let's say that I have a VCF file with 10 samples in it, S1 through S10. I would like to select the sites where among 5 specific samples, 3 are homozygous variant. So something like:

select ' vc.getHomVarCount(S1, S2, S3, S4, S5) >= 10 '

However this syntax doesn't work unfortunately. Is there a way to specify how to do something like this using vc.GetHomVarCount()?

Thanks as always! Steve

Created 2015-05-01 05:07:41 | Updated | Tags: selectvariants snp gatk fastaalternatereferencemaker

Comments (2)

I have some VCF files, each of which I have merged to contain >300 genotypes. Furthermore, to make them more manageable I have subsetted them to just contain the chromosome regions I am interested in.

Now I wish to generate some genotype specific FASTA sequences using these files and a reference sequence; i.e. a sequence for each genotype which is the same as the reference sequence but with the SNPs specific to each genotype in place of their counterparts in the reference sequence.

Now I know that there is variation in the genotypes. Here is a picture visualizing three exemplar genotypes that I generated by loading the VCF file into Geneious.


I then try to create individual VCF files for each genotype using this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T SelectVariants --variant ~/Path/to/complete\vcf/example.vcf -o ~/Path/to/individual/genotype.vcf -sn genotype

While I can't be sure this had the desired effect as it is difficult to assess a whole VCF file I can say that the header now only contains the relevant genotype so I assume this is the case.

I then try and use this individual VCF file for each genotype like this:

java -jar GenomeAnalysisTK.jar -R ~/Path/to/reference\sequence/ref.fasta -T FastaAlternateReferenceMaker --variant~/Path/to/individual/genotype.vcf -L chrX:XX,XXX,XXX-XX,XXX,XXX -o ~/Path/to/individual/genotype.fasta

Here the Xs represent the location on the reference sequence of the regions of interest.

I did this in a loop and got identical sequences for every genotypes. I then implemented it individually for the 3 exemplar genotypes in the picture above and in both cases I get identical sequences for every genotype. Interestingly they are not the reference sequence.

What am I doing wrong?

I have also posted this on the Biostars forum..

Created 2015-03-11 20:47:24 | Updated | Tags: selectvariants

Comments (2)


I have used VariantFiltration to filter the variants based on quality thresholds. Subsequently, SelectVariants to select the filter passed variants using "--excludeFiltered" flag in the command line. Could it be possible to use SelectVariants to select only the filter failed variants? I could not find it in the documentation of the tool. Is there any such option available?

Thank you

Created 2015-03-07 16:14:34 | Updated | Tags: selectvariants snp

Comments (4)

Hi! I am currently evaluating different methods to select tagSNPs for a gene. Is it possible to identify tagSNPs for a gene with GATK by scanning a given list of SNPs? Thank you very much.

Created 2015-03-03 21:03:41 | Updated | Tags: haplotypecaller selectvariants combinegvcfs

Comments (5)

GATK team,

I currently have many WES gVCFs called with GATK 3.x HaplotypeCaller, and I'm now looking to combine them and run GenotypeGVCFs. Unfortunately, I forgot to add the "-L" argument to HC to reduce the size of the resulting gVCFs, and CombineGVCFs looks like it's taking much longer than I expect it to.

Is there any potential problem with using the "-L" argument to SelectVariants to reduce the size of my gVCFs and then use those smaller gVCFs in the CombineGVCFs stage (and beyond), or do I have to re-call HaplotypeCaller again? Would it be better to extend the boundaries of the target file by a certain amount to avoid recalling HaplotypeCaller?


John Wallace

Created 2015-02-24 16:58:48 | Updated | Tags: selectvariants

Comments (3)

Hi GATK team,

I have a joint VCF that contains three samples, I want to filter the VCF based on the DP in one sample. I noticed that SelectVariants can only select variants based on "INFO" field in the VCF. Any suggestions?

Many thanks, Linda

Created 2015-01-15 23:50:50 | Updated | Tags: selectvariants jexl

Comments (2)


I am trying to use select variants along with the JEXL syntax. In this specific case I wish to select those variants that

(1)Were heterozygous and,

(2)The ALT allele had more than 8 reads and,

(3) The alternate allele frequency exceeds 20% or as I define in a calculation: (ALT_AD)/(ALT_AD+REF_AD) >= 0.20

My JEXL looks like this:

(vc.getGenotype(“SAMP1").isHet()) && (vc.getGenotype(“SAMP1").getAD().1 >= 8) && (((vc.getGenotype(“SAMP1").getAD().1) / ((vc.getGenotype(“SAMP1").getAD().0)+(vc.getGenotype("SAMP1").getAD().1))) >= 0.2)

In this case, I get no results ( although GATK executes succesfully without errors). However when I exclude the third clause I get the expected results namely those variants that are (1)heterozygous and (2) The ALT_AD exceeds 8 reads. Like so:

(vc.getGenotype(“SAMP1").isHet()) && (vc.getGenotype(“SAMP1").getAD().1 >= 8)

To troubleshoot this issue I thought it might be a decimal vs. integer issue and I added a decimal value to my AD calculation, like so:

(vc.getGenotype(“SAMP1").isHet()) && (vc.getGenotype(“SAMP1").getAD().1 >= 8) && (((vc.getGenotype(“SAMP1").getAD().1 + 0.000000001) / ((vc.getGenotype(“SAMP1").getAD().0)+(vc.getGenotype("SAMP1").getAD().1))) >= 0.2)

The result was that select variants proceeded to select calls as if that third clause did not exist. I am not sure what I might be overlooking, but any insight much appreciated.

Created 2015-01-11 23:30:38 | Updated | Tags: selectvariants variantfiltration jexl snps

Comments (2)

I am experimenting with JEXL expressions in order to do some hard filtering on variants. I wonder if there is a method to tell the filter expression to operate on the current sample (assuming here a single sample VCF file) without knowing the sample name a priori e.g.


Works just fine to sample heterozygous positions from a VCF where I know the sample will be called "Sample1", but can I refer to a sample name programmatically, e.g. something like: vc.getGenotype( sample() ).isHet()

Sorry if this is a really dumb question. (BTW I realise you could use a genotype_filter e.g. --genotypeFilterExpression "isHet == 1" to do the same thing, but I want to annotate the VCF in the FORMAT/FILTER field, rather than the INFO field for downstream variant selection operations.

Created 2015-01-10 00:09:06 | Updated 2015-01-10 00:12:37 | Tags: selectvariants

Comments (3)


I am attempting to run the tool SelectVariants and everything seems ok but for one error I receive: ##### ERROR MESSAGE: Invalid argument value '—-variant' at position 4.

##### ERROR Invalid argument value 'input.vcf' at position 5.

My intitial command was:

java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta —-variant input.vcf -select '[my select code]' -o output.vcf

Any help appreciated.

Thank you!

Created 2014-11-20 23:34:28 | Updated | Tags: selectvariants bug

Comments (1)

Hi I'm using select variants to extract out variants with PASS filter, and those that are biallelic SNPs with the command below using gatk 3.2-2.

java -Xmx${MEM} -jar ${gatk_dir}/GenomeAnalysisTK.jar -R ${refdir}/${genome} \ -T SelectVariants \ --variant "1"${VCF_FILE}"excCpGRepeat.vcf" \ -ef \ -L $CHROM \ --restrictAllelesTo BIALLELIC \ -selectType SNP \ -o "2"${VCF_FILE}"_PASSBiSNP_excCpGRepeat.vcf"

I split my vcf using snpsift - so I run it on each chromosome seperately. For 12 of the 38 chromosomes I get the error - the others work fine. I re-extracted one of the chromosomes that failed, to be sure it wasn't a snpsift issue, and it still fails. Any ideas on what the problem might be?

INFO 15:26:17,318 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:26:17,322 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03 INFO 15:26:17,322 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:26:17,323 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:26:17,328 HelpFormatter - Program Args: -R /u/home/c/projectdata/c/canids/reference/canfam31/canfam31_chr1NOTchr01/canfam31_chr1NOTchr01.fa -T SelectVariants --variant 1.chr35_excCpGRepeat.vcf -ef -L chr35 --restrictAllelesTo BIALLELIC -selectType SNP -o 2_chr35_PASSBiSNP_excCpGRepeat.vcf INFO 15:26:17,338 HelpFormatter - Executing as @n263 on Linux 2.6.32-431.20.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_55-mockbuild_2014_04_16_12_11-b00. INFO 15:26:17,338 HelpFormatter - Date/Time: 2014/11/20 15:26:17 INFO 15:26:17,339 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:26:17,339 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:26:17,469 GenomeAnalysisEngine - Strictness is SILENT INFO 15:26:17,715 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:26:19,240 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:189) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.loadFromDisk(RMDTrackBuilder.java:336) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.attemptToLockAndLoadIndexFromDisk(RMDTrackBuilder.java:320) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:279) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:225) at org.broadinstitute.gatk.engine.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:148) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208) at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:990) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:772) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:285) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at htsjdk.tribble.index.IndexFactory.loadIndex(IndexFactory.java:185) ... 14 more Caused by: java.io.EOFException at htsjdk.tribble.util.LittleEndianInputStream.readFully(LittleEndianInputStream.java:138) at htsjdk.tribble.util.LittleEndianInputStream.readLong(LittleEndianInputStream.java:80) at htsjdk.tribble.index.linear.LinearIndex$ChrIndex.read(LinearIndex.java:271) at htsjdk.tribble.index.AbstractIndex.read(AbstractIndex.java:363) at htsjdk.tribble.index.linear.LinearIndex.(LinearIndex.java:101) ... 19 more

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: java.lang.reflect.InvocationTargetException
ERROR ------------------------------------------------------------------------------------------

Created 2014-11-06 16:59:58 | Updated | Tags: selectvariants

Comments (12)

It'd be nice if SelectVariants can update ALT; for example, from ALT A,G to ALT A, or from ALT A to ALT . if the loci in the selected samples are hom-ref

Created 2014-09-25 00:20:55 | Updated | Tags: selectvariants jexl

Comments (2)


Thanks to you, we've got an amazing .vcf file just bursting with SNPs. If my file contains five replicates of a control group and five replicates of a treatment group (so 10 individuals total) and I'd like to select variants which are the same across all members of the control group, the same across all members of the replicate group, but different between the two groups, can I accomplish that with SelectVariants and JEXL? Or via some even easier method?

Thanks for your time. Please feel free to rather curtly request a clarification if I've garbled my question.



Created 2014-09-23 15:08:19 | Updated | Tags: selectvariants gatk tumor normal

Comments (0)

I am wondering if it would be okay to use the "SelectVariants" tool to select mutants specific to the tumor sample in a pair of tumor and normal samples. I tried the following command, and it seems working. But I am suspecting whether it differentiates the genotype differences (e.g. 0/1 in tumor sample and 0/0 in normal sample) or coverage differences (e.g. 0/1 in tumor sample and ./. in normal sample).

Could someone help me understand "SelectVariants" for my question? Thanks a lot!

java -Xmx10g -Djava.io.tmpdir=$tempSpillFolder -jar $CLASSPATH/GenomeAnalysisTK.jar \
-R $GenomeReference \
-T SelectVariants \
--variant $file \
-o $OutputFolder/$filename.selected.vcf \
-select 'set=="tumor"'

Created 2014-08-06 04:25:42 | Updated | Tags: selectvariants haplotype-caller

Comments (1)

Hi, I successfully ran a variant calling pipeline and got the raw.vcf file obtained by haplotype caller which contains exactly 4,484,688 records. And then I processed it further by extracting SNPs and INDELs to seperate VCF files using SelectVariants , and it yielded a vcf for raw SNPs with 85,239 records and a vcf for raw INDELs with 14,501 records. As you can see there is a significant decrease of the total number of records. where almost 4 million records were ignored. Could you please explain me what's the process behind all these and why is this happening ?

Created 2014-07-18 02:49:46 | Updated | Tags: selectvariants

Comments (1)

Hi there,

The documentation for Select Variants (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html) doesn't seem to be working.

Would you be able to fix the link?

Thanks!! :-) Steve

Created 2014-06-11 13:59:49 | Updated | Tags: selectvariants

Comments (1)

Would it be possible to add an option to keep the original read depth (DP) of a variant in the INFO field when subsetting multi-sample VCF files using SelectVariants. This is already implemented for AC, AF, and AN values (--keepOriginalAC).

Created 2014-05-06 23:07:46 | Updated | Tags: selectvariants vcf

Comments (1)

Is there a Walker to select a certain number of random variants from a VCF file? I have checked only the SelectVariants, but the only option similar is --select_random_fraction, that select a fraction of mutations instead of exact number.

Created 2014-04-28 19:36:40 | Updated | Tags: selectvariants

Comments (1)


I'm trying to get specific variants from a VCF file. I'm using SelectVariants' '--concordance'. This option is outputting any variant in the same genome position without regards for the actual REF and ALTs fields. Is there a way to get the variants matching REF exactly and ALTs at least containing the ALT in the concordance file?

Thanks, Carlos

Created 2014-02-13 15:25:44 | Updated | Tags: selectvariants

Comments (0)


I'm analyzing two exome deep sequencing libraries, one from cancer cells and the other from normal cells. I have been through the GATK best practices to end with a recalibrated filtered vcf file (my last step was the Variant Recalibration).

Now I would like to keep the variants (variations) found between normal and cancer cells and keep only those which have been retained according to plots produced during the previous step i.e VQSR.

I'm not sure which parameters I should select in my command line in order to achieve my goal.

I have searched through the documentation on your website, but still not sure...

Any help will be welcome

Created 2014-02-13 12:42:10 | Updated | Tags: selectvariants vcf

Comments (5)

Is there a way to include only variant sites and no-calls in your final vcf. I know during SNP calls you can only emit variants, or only confident sites or all. However is there a way to reduce your vcf in the end to only variant sites (vsqr passed) and places where no calls could be made. So the end vcfs have only variant sites and missing data - and everything that is not listed in the vcf file is reference. I need such a file for merging with other vcf files - so that every position that is not in the vcfs while merging can be called ref.

So far i have called snps with emit-all and done vsqr - I now want to reduce vcfs in size by excluding NO_VARINATION sites (but want to keep information on "missing" sites)

Created 2014-01-21 13:32:12 | Updated | Tags: vqsr selectvariants vcf

Comments (18)

I just wanted to select variants from a VCF with 42 samples. After 3 hours I got the following Error. How to fix this? please advise. Thanks I had the same problem when I used "VQSR". How can I fix this problem?

INFO 20:28:17,247 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:28:17,250 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51 INFO 20:28:17,250 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 20:28:17,251 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 20:28:17,255 HelpFormatter - Program Args: -T SelectVariants -rf BadCigar -R /groups/body/JDM_RNA_Seq-2012/GATK/bundle-2.3/ucsc.hg19/ucsc.hg19.fasta -V /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf -L chr1 -L chr2 -L chr3 -selectType SNP -o /hms/scratch1/mahyar/Danny/data/Filter/extract_SNP_only3chr.vcf INFO 20:28:17,256 HelpFormatter - Date/Time: 2014/01/20 20:28:17 INFO 20:28:17,256 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:28:17,256 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:28:17,305 ArgumentTypeDescriptor - Dynamically determined type of /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf to be VCF INFO 20:28:18,053 GenomeAnalysisEngine - Strictness is SILENT INFO 20:28:18,167 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 20:28:18,188 RMDTrackBuilder - Creating Tribble index in memory for file /hms/scratch1/mahyar/Danny/data/Overal-RGSM-42prebamfiles-allsites.vcf INFO 23:15:08,278 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NegativeArraySizeException at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:97) at org.broad.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:116) at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:84) at org.broad.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:73) at net.sf.samtools.util.AbstractIterator.next(AbstractIterator.java:57) at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:46) at org.broad.tribble.readers.AsciiLineReaderIterator.next(AsciiLineReaderIterator.java:24) at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:73) at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:35) at org.broad.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40) at org.broad.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:428) at org.broad.tribble.index.IndexFactory$FeatureIterator.next(IndexFactory.java:390) at org.broad.tribble.index.IndexFactory.createIndex(IndexFactory.java:288) at org.broad.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:278) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:388) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:274) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:211) at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:140) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208) at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:964) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:758) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:284) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

Created 2013-12-18 20:50:41 | Updated | Tags: randomlysplitvariants selectvariants

Comments (3)

When I use SelectVariants -fraction option more than once, I always get the same variant sites. The same happens for RandomlySplitVariants. Is there a way of randomly taking a fraction of the vcf that is not always the same, but truly random (which means it will pick different sites each time, of course with a certain probability of including common sites)?

Created 2013-12-12 04:27:34 | Updated | Tags: selectvariants

Comments (4)

I hope I have not duplicated the question since I did not find solution.

Suppose I have one variant dataset which just includes variants from ONE sample . If i have another outer datasets (not my test dataset), I can produce variants that are unique in my test call dataset by using --discordance argument like this with no problem:

$ java -Xmx2g -jar GenomeAnalysisTK.jar \ -R ref.fasta \ -T SelectVariants \ --variant myCalls.vcf \ --discordance outerdatasets.vcf \ -o unique_in_my_set.vcf

However, If my sample dataset includes one clinical affected sample and three controls, I want produce all the variants of this affected sample that are unique within this dataset ( discordance of this affected sample comparing with three controls), what tools or commends I can use?

Thank you,

Created 2013-11-26 18:10:04 | Updated | Tags: selectvariants queue qscript walkers java

Comments (12)

I am working on a Queue script that uses the selectVariants walker. Two of the arguments that I am trying to use both use an enumerated type: restrictAllelesTo and selectTypeToInclude. I have tried passing these as strings however I get java type mismatch errors. What is the simplest way to pass these parameters to the selectVariant walker in the qscript?

Created 2013-11-21 16:54:58 | Updated | Tags: selectvariants

Comments (3)

how to remove those variants with "./."?

Thanks a lot!

Created 2013-11-14 16:35:47 | Updated | Tags: selectvariants variantfiltration vcf plink vcftools

Comments (1)


I was wondering if you could use the toolkit to generate a separate VCF file containing only SNPs that are found at a predetermined chromosome and base pair position. I have a plink file which I want to convert back to VCF format and it seems unbelievably hard to do so I thought this may be a good way to get around that problem?

I am aware that vcftools offers this function with the "--positions " option, however for some reason I am getting far more variants than I listed and there is nothing wrong that is obvious with my listed positions/vcf file.

Thanks in advance, Danica

Created 2013-11-12 23:37:26 | Updated 2013-11-13 17:06:24 | Tags: selectvariants

Comments (4)


I am trying to subset a few samples from a VCF using the following command

java -Xmx8g -jar GenomeAnalysisTK-2.6-4-g3e5ff60/GenomeAnalysisTK.jar -T SelectVariants -R Homo_sapiens_assembly19.fasta -V Input.vcf -sf samples.txt-o Out.vcf

Getting the error

ERROR MESSAGE: Key Indel_FS,Indel_QD found in VariantContext field FILTER at 1:985458 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.

I checked the Input.vcf and can find that the following lines exists in the VCF


Not sure why the error ?

Any help much appreciated.. Many Thanks, Tinu

Created 2013-11-12 22:09:51 | Updated 2013-11-13 17:20:48 | Tags: combinevariants selectvariants

Comments (5)

Hi! This is my code:

java -Xmx4g -jar GenomeAnalysisTK-2.7-2-g6bda569/GenomeAnalysisTK.jar \
    -R genome.fa \
    --filter_reads_with_N_cigar \
    -T CombineVariants \
    -V:NORMALSpost RNA-edit_SNPs_NORMALS-post_SNPeff_VA.vfc \
    -V:OMNI 1000G_omni2.5.hg19.vcf \
    -V:db137 dbsnp_137.hg19.vcf \
    -V:Hapmap hapmap_3.3.hg19.vcf \
    -V:ESP1,VCF ESP6500SI-V2-SSA137.updatedRsIds.Allchr.snps_indels_FIX.vcf \
    --out NORMALS-post_TruePositives_raw.vcf &&

java -Xmx4g -jar GenomeAnalysisTK.jar \
    -R genome.fa \
    --filter_reads_with_N_cigar \
    -T SelectVariants \
    -V NORMALS-post_TruePositives_raw.vcf \
    -select "set == 'NORMALSpost'" \
    --out NORMALS-post_TruePositives.vcf

I hope I got this right: I now have SNPs (I have only called SNPs, not indels) that is found in my sample, 1000G, dbSNP and ESP. Right?

Question: Can I easily make a VCF with all SNPs in MY sample only, not found in anything else by tweaking this code? Or must I do it some way else?


Created 2013-11-10 11:18:58 | Updated | Tags: selectvariants

Comments (3)


I am running SelectVariants on a vcf file for removing indels >50bps. I am getting following error:

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.4-7-g5e89f01):
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Line: chr1 2277268 rs140545544 CCACA C,CCACACA 515.33 PASS . GT:GQ:DP:PL:AD 2/2:5:21:.,.,381,.,5,0:1,12 1/1:45:21:723,45,0,.,.,.:0,15
ERROR ------------------------------------------------------------------------------------------

I am reporting error for version 2.4 as it shows actual line with problem.

Error with version 2.7

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR MESSAGE: Error parsing line: org.broad.tribble.readers.LineIteratorImpl@1b6dc6c9, for input source: /home/gaurav/Filter/Arabian_INDELS_NEXOME.vcf
ERROR ------------------------------------------------------------------------------------------

Error Stack trace for version 2.4

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.RuntimeException: Line: chr1 2277268 rs140545544 CCACA C,CCACACA 515.33 PASS . GT:GQ:DP:PL:AD 2/2:5:21:.,.,381,.,5,0:1,12 1/1:45:21:723,45,0,.,.,.:0,15 at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:68) at org.broad.tribble.TribbleIndexedFeatureReader$QueryIterator.readNextRecord(TribbleIndexedFeatureReader.java:342) at org.broad.tribble.TribbleIndexedFeatureReader$QueryIterator.next(TribbleIndexedFeatureReader.java:297) at org.broad.tribble.TribbleIndexedFeatureReader$QueryIterator.next(TribbleIndexedFeatureReader.java:261) at org.broadinstitute.sting.gatk.refdata.utils.FeatureToGATKFeatureIterator.next(FeatureToGATKFeatureIterator.java:60) at org.broadinstitute.sting.gatk.refdata.utils.FeatureToGATKFeatureIterator.next(FeatureToGATKFeatureIterator.java:42) at org.broadinstitute.sting.gatk.iterators.PushbackIterator.next(PushbackIterator.java:65) at org.broadinstitute.sting.gatk.iterators.PushbackIterator.element(PushbackIterator.java:51) at org.broadinstitute.sting.gatk.refdata.SeekableRODIterator.next(SeekableRODIterator.java:223) at org.broadinstitute.sting.gatk.refdata.SeekableRODIterator.next(SeekableRODIterator.java:66) at org.broadinstitute.sting.utils.collections.RODMergingIterator$Element.next(RODMergingIterator.java:72) at org.broadinstitute.sting.utils.collections.RODMergingIterator.next(RODMergingIterator.java:111) at org.broadinstitute.sting.gatk.datasources.providers.RodLocusView.next(RodLocusView.java:122) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$MapDataIterator.next(TraverseLociNano.java:173) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$MapDataIterator.next(TraverseLociNano.java:154) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:271) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:145) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:100) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:283) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91) Caused by: java.lang.NumberFormatException: For input string: "." at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:481) at java.lang.Integer.valueOf(Integer.java:582) at org.broadinstitute.variant.vcf.AbstractVCFCodec.decodeInts(AbstractVCFCodec.java:703) at org.broadinstitute.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:664) at org.broadinstitute.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:114) at org.broadinstitute.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:131) at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:309) at org.broadinstitute.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:241) at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:220) at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:46) at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:65) ... 25 more

No error stack trace with version 2.7

I tried to validate vcf file using ValidateVariants too, but getting same error.

Created 2013-11-05 16:43:51 | Updated | Tags: allelebalancebysample selectvariants jexl

Comments (3)


I would like to filter my variants using the SelectVariants walker but it throws an error when I try to filter on allele balance by sample. The jexl expression I use is:


error is: unknown, ambiguous or inaccessible method getAB

Is there any way of filtering on this parameter?

Best wishes,


Created 2013-10-30 18:51:30 | Updated | Tags: selectvariants vcf filtervariants

Comments (3)

Hi Team, I have a VCF which I'd like to filter by variant frequency. The problem is, my frequencies are percentages rather than decimals. Is there a workaround in JEXL which allows it to parse the '%' operator as a percentage (or ignore it entirely) rather than considering the field a string upon seeing the modulo operator? The VCF also has two columns in the format column (a normal and a tumor). Is it possible to drill down into these using just the genotypeFilterExpression/genotypeFilterName flags or must do something else?

Thanks, Eric T Dawson

Created 2013-10-10 10:03:25 | Updated | Tags: selectvariants

Comments (10)


Is there a way to select ONLY homozygote calls from a vcf file using SelectVariants?

I understand that the GT for homozygotes is 1/1, whereas the genotype for heterozygotes is 0/1. But when I ude the following command, I get an empty vcf file (i.e. only with a header):

$ java -Xmx4g -jar GenomeAnalysisTK.jar -SelectVariants -R reference.fasta --variant input.vcf -select 'GT == "1/1";' -o output.vcf

I'd appreciate your help on this matter.



Created 2013-10-08 11:58:10 | Updated | Tags: selectvariants

Comments (7)


I am using both GATK's UnifiedGenotyper and samtools mpileup as callers.

I've used CombineVariants in order to merge the two sets into a single .vcf file as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T CombineVariants -R reference.fasta --variant:GATK GATK.vcf --variant:samtools samtools.vcf -o GATK_samtools.union.vcf -genotypeMergeOptions PRIORITIZE -priority GATK,samtools --filteredrecordsmergetype KEEP_UNCONDITIONAL

Now, I would like to select all calls that were called by both callers, regardless of whether they've been filtered or not.

From opening the GATK_samtools.union.vcf file, I understand that I need to select for the following expressions:

set=Intersection set=FilteredInAll set=filterInGATK-samtools

(I was also wondering why I don't get an expression like 'filterInsamtools-GATK'? does this have anything to do with the PRIORITIZE command?)

So... I've been trying to run the following with no luck (i.e. the output .vcf file doesn't contain any variants, but rather only the header):

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta --variant GATK_samtools.union.vcf -select 'set == "Intersection"; -select 'set == "FilteredInAll";' -select 'set == "filterInGATK-samtools";' -o GATK_samtools.overlap.vcf

I've also tried the following, but in this case I only get the an output of the 'set=Intersection' variants, without the rest:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta --variant GATK_samtools.union.vcf -select 'set == 'Intersection';'FilteredInAll';'filterInGATK-samtools'" -o GATK_samtools.overlap.vcf

I'd appreciate any help on this.



Created 2013-09-11 23:21:01 | Updated | Tags: selectvariants

Comments (3)

Hello Team,

Is there a best practice for finding compound heterozygotes using GATK? I can easily find recessive pattern variants using the SelectVariants tool, however, I have not been able to find a way to select compound hets. I am sure the Broad has a straightforward way. Is it somehow integrated into the GATK tool set?

I have considered using something like Gemini, but I would prefer to keep tools use to fewer product lines whenever possible.

Thanks for your help! Sean

Created 2013-09-06 17:43:41 | Updated | Tags: selectvariants variantfiltration variantcontext ssc

Comments (16)

Hello all,

First post. Thank you for these amazing tools. I have spent two days pulling my hair out, trying all enumerations, searching the documentation and forums, and in the end I come to you for help. Please forgive me if these topics have been covered elsewhere.

I have several VCFs generated by SomaticSniper that I'd like to filter based on the SomaticScore (SSC in the FORMAT field). I was working with VariantFiltration and SelectVariants, and trying to use different options, as well as regular expressions, to select those calls with a SSC over 40. I have been unable to do so. I also looked into trying to figure out JEXL, and using the last command listed on the documentation page, about using the VariantContext feature to drill into an array. I cannot get it to recognize the SSC column of the FORMAT field and then filter for the TUMOR sample.

Using VariantFiltration I was using -select (but I understand now that this searches the INFO field only). I was then using the --genotypeFilterExpression, but it would not add the FT tag to the FORMAT field as it said it would, it would just apply PASS to everything.

java -Xmx4-jar GenomeAnalysisTK.jar -T VariantFiltration -R ~/Documents/reference/human_g1k_v37.fasta -V '/home/registry3/Desktop/merged/104024sniperRAWSNPS.vcf' --G_filter "SSC < 40.0" --G_filterName "myFilter" -o '/home/registry3/Desktop/merged/104024sniperFILTEREDSNPS.vcf'

Using SelectVariants, I was using -sn to select the TUMOR sample and then using -select_expressions, but I guess this also only works on the INFO field. I had been trying to use --sample_expression which gives the ability to use a regular expression, but then I never had good results; it wouldn't do any filtering, and output the entire input file. Does the regular expression only apply to the sample name, and not the content of each line? Trying to select SSC over 40 from a line like this

1   10177   0   A   C   0   0   AC=1;AF=0.500;AN=2;DP=62    GT:AMQ:BCOUNT:BQ:DP:DP4:GQ:IGT:MQ:SS:SSC:VAQ    0/1:16,15:40,22,0,0:28,25:62:31,9,10,12:37:0/1:16:2:19:37

I used a line such as this, to look at the second to last number in the FORMAT field based on : dividers

java -Xmx4-jar GenomeAnalysisTK.jar -R ~/Documents/reference/human_g1k_v37.fasta -T SelectVariants --variant '/home/registry3/Desktop/merged/104024sniperSNPs.vcf' -o '/home/registry3/Desktop/merged/104024sniperSSC40.vcf' -se ".*:[4,5,6,7,8,9][1,2,3,4,5,6,7,8,9]:[1234567890]{2,3}$"

I am not a coder, as you can probably see, but I'm trying to get this worked out. This output the entire file still, with SSC values above and below 40.

Looking into use the vc.getGenotype array access, I could not find much documentation about VariantContext; I was looking through the files on github, looking through the code and looking for samples, since the .getAD() from the documentation seems to work, but alas, there is no .getSCC() available. Is using vc. the best way to drill into an array (the FORMAT field) and search for what I want?

I didn't post all the code and output, trying to keep this as short as possible. I can post pastebin outputs if that would be helpful. Thank you, David

Created 2013-08-18 12:24:49 | Updated | Tags: selectvariants

Comments (1)

Just noticed that SelectVariants produces "ERROR MESSAGE: Invalid argument value '>' at position 10" when I use the '-select' parameter with the syntax given in your third example [-select "QD > 10.0"]. I'm using GATK 2.6-5, java/1.7.0_25. It worked well without the whitespace in the expression [-select "QD>10.0"]. Hope that's not just me ;-) Also, does the example miss the line continuation character just before the '-select' option?

No worries, many thanks for the great tool!

Created 2013-08-15 14:43:33 | Updated | Tags: selectvariants vcf criteria

Comments (15)

As I said in my last post about splitting my 11 samples from the recalibrated VCF file. I now have a different question which is how to set up a criteria to select variants from this 11-sample-combined VCF. My criteria would be DP >= 20 and # of ALT reads >= 10. I know the AD is the sum of both REF and ALT reads, but I was wondering if there's any way to select by the # of ALT and DP >=20?

Should I use the "-T SelectVariants" or "-T VariantFiltration"? I am using GATK 2.5 on a remote Mac OS X server by the way.

Created 2013-08-12 08:01:33 | Updated | Tags: selectvariants

Comments (4)


I was just wondering if anyone uses GATK's SelectVariants walker to call de novo mutations (Mendelian violations) and, if so, what -mvq cut-off do they use? My data is exome sequencing with a large range of read depths - from mean target coverage of 14X to >50X.



Created 2013-07-29 16:59:10 | Updated | Tags: selectvariants intervals

Comments (2)


I have a vcf file of indels, and an interval file of single-base positions I am interested in. I would like to select all of the variants in the file that overlap the positions I'm looking at. Using select variants with the interval list I have returns nothing, because the indels are not contained within any intervals. I don't want to just add an arbitrary amount of padding to my intervals because I don't want to include other nearby variants that don't actually overlap my sites.

Is there a way to do this easily in GATK?

Created 2013-06-30 14:38:27 | Updated | Tags: selectvariants jexl indels

Comments (1)

How can I select indels with lenght smaller than 10 bp from a vcf file?

I tried

java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fa --variant INDEL.vcf -o INDEL_maxLenght10.vcf -select 'vc.getIndelLengths().0 < 10'

but the output still contains all the Indels, also the ones larger than 10 bp.

Created 2013-05-20 12:44:58 | Updated | Tags: selectvariants gatk

Comments (3)

Greetings GATK team!

I hope I'm not making a duplicate question here, but I couldn't find anything regarding this in the forum.

Basically, what I want to do is to use SelectVariants to filter against another call set, but I do not want to be as strict as using -discordance (i.e. 100% discordance rate between the two call sets). I want to say for example: "filter call set A against variants that occur in >90% of call set B".

Is there a way to do this with JEXL expressions maybe?

Kind regards

Created 2013-05-13 15:35:57 | Updated | Tags: selectvariants

Comments (1)


Can you use SelectVariants with a combined vcf to produce a new vcf containing only variants present in a particular sample eg. you can select out de novo mutations from a combined family vcf?



Created 2013-04-18 21:42:10 | Updated 2013-04-19 15:53:19 | Tags: selectvariants

Comments (1)

My indel calling VCF has the following information:

##INFO=<ID=N_MQ,Number=2,Type=Float,Description="In NORMAL: average mapping quality of consensus indel-supporting reads/reference-supporting reads">

so one example of my indels is:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  exom_aladaz
chr1    54690639        .       G       GCCC    .       .       N_AC=0,0;N_DP=19;N_MM=0.0,0.36842105;N_MQ=0.0,33.894737;N_NQSBQ=0.0,31.80368;N_NQSMM=0.0,0.0;N_SC=0,0,19,0;SOMATIC;T_AC=6,6;T_DP=11;T_MM=0.5,0.2;T_MQ=44.5,29.0;T_NQSBQ=30.85,33.36;T_NQSMM=0.016666668,0.0;T_SC=6,0,5,0  GT:GQ   0/1:0

If I want to filter indel calls with N_MQ <30 for both consensus indel-supporting reads/reference-supporting reads, how should I write the SelectVariants for N_MQ=0.0,33.894737?

Created 2013-04-04 00:42:42 | Updated | Tags: selectvariants vcf

Comments (1)

Hello I would like to subset a VCF file to only save a few specific regions of the whole genome. I know some of your tools allow for an interval list to be used to subset the region analyzed. Do you have a tool or are you aware of a tool that would allow me to quickly do this from an interval list or something similar? I could make a little script myself, but I figure sub setting and printing out a specific genomic region of interest in a VCF file has to be a solved problem by GATK.

Thanks for your help! ~Sean

Created 2013-03-20 14:56:12 | Updated | Tags: unifiedgenotyper selectvariants

Comments (3)


I am using the following command to generate SNPs and Indels from matching tumor normal pair bam files (GenomeAnalysisTK-2.4-9-g532efad)

java -jar GenomeAnalysisTK.jar -glm BOTH -R ucsc.hg19.fasta -T UnifiedGenotyper -I tumor.bam -I normal.bam -D dbsnp_135.hg19.vcf -o raw.snps.indels.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0.

I would like to know the specific command (in SelectVariants) to separate SNPs unique to tumor but not to normal sample

Created 2013-03-14 19:37:04 | Updated 2013-03-14 19:37:44 | Tags: selectvariants jexl

Comments (5)

I'm trying to use JEXL to filter variants but something isn't working and I can't figure it out. I'm hoping someone can point me in the right direction. My VCF file contains an INFO field 1000g2012Apr_ALL. Some of the variants in my VCF have an entry for this field, some don't. I want to filter my VCF file for entries that are below a certain value or are NULL (empty).

Here's what my command looks like:

java -Xmx4G -jar GenomeAnalysisTK.jar -T SelectVariants -R hg19.fa -V my.vcf -o my.1kgfiltered.vcf -select 'vc.getAttribute("1000g2012Apr_ALL") < 0.01' -select '!vc.hasAttribute("1000g2012Apr_ALL")'

The problem is the 2nd select statement. I can't seem to get a JEXL select statement to give me the entries where 1000g2012Apr_ALL are empty. How do I accomplish this?

Created 2013-03-11 14:11:32 | Updated | Tags: selectvariants indel

Comments (3)

Dear All,

Just thought I should report a possible small bug with SelectVariant --maxIndelSize as I think it's only filtering insertions greater than the given value and not deletions. Of course using JEXL expressions I can still get the variants I'm after...

Cheers, Laurent

Created 2013-01-21 13:08:41 | Updated | Tags: selectvariants snp

Comments (7)

I ran in to the situation now a couple of times that I need to extract a set of private SNPs from a multisample VCF file. For example in a forward genetics knockout screen of a large set of samples.

It is possible with vcf-contrast from vcf-tools:

vcf-contrast +sample1 -sample2 -sample3 -n input.vcf > private sample1.vcf

vcf-contrast -sample1 +sample2 -sample3 -n input.vcf > private sample2.vcf

vcf-contrast -sample1 -sample2 +sample3 -n input.vcf > private sample3.vcf

After this I still would have to filter out the private 0/0 calls and doing this for a large multisample VCF means entering this command for all the combinations which is not really nice.

Surely this must be possible with GATK. Does anyone know how to do this with GATK.

Maybe it is somewhere in the SelectVariants? The --discordance option looked promissing but there is something about that the samples should be the same? Or is it possible to write another variant walker or a JEXL expression?



P.S. By accident I also posted this question in the XHMM, an admin could remove it there.

Created 2012-12-19 20:13:49 | Updated | Tags: selectvariants

Comments (1)


I wouldn't classify this as a bug, but thought I would just put this up here in anyways.

I'm creating a vcf file from exome targeted sequencing, using unified genotyper to just call SNVs on one sample, then using SelectVariants instead of VariantFiltration using hard-cutoffs. This is just a intermediate QC file (something that we can use to do a rough TiTv estimate, concordance to GWAS arrays, etc) in order to make the determination as to whether or not this sample was good enough as is to go into the pool of samples to run unified genotyper/haplotyper caller on together and then VQSR.

So when using the expression;

-select 'MQRankSum > -12.5'

any non-reference homozygote that has no reads containing an alternate allele logically won't have this annotation calculated and since this is missing then this record is removed. Nothing major. I plan on redoing it using the VariantFiltration walker instead and the just use the SelectVariants to pull out PASS records, but I though I would put this up in any case. I was trying to think of a way to use a more complex expression to say if GT was 0/1 then MQRankSum, but couldn't wrap around my head for the case were GT was 1/1 and MQRankSum was present and greater than -12.5.

I was using GenomeAnalysisTK-2.2-4-g4a174fb at the time and my command line is below.

java -jar $GATK/GenomeAnalysisTK.jar \ -T SelectVariants \ -R $REF_GENOME \ --variant $CORE_PATH/$OUTPATH/temp/$SM_TAG".QC.raw.OnBait.vcf" \ -select 'QD > 2.0' \ -select 'MQ > 30.0' \ -select 'FS < 40.0' \ -select 'HaplotypeScore < 13.0' \ -select 'MQRankSum > -12.5' \ -select 'ReadPosRankSum > -8.0' \ -select 'DP > 8.0' \ -o $CORE_PATH/$OUTPATH/SNV/QC/Aggregate/filtered_on_bait/$SM_TAG".QC.OnBait.vcf"

Created 2012-11-22 21:09:04 | Updated | Tags: documentation selectvariants varianteval

Comments (3)


I've a couple of essentially documentation-centric questions...

Firstly, the SelectVariants documentation describes selecting 1000 random variants from a vcf using '-number 1000', however when I try that (with the command "java -jar GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37.fasta --variant variants.vcf -o 1000.vcf -number 1000") it produces the error 'Argument with the name number isn't defined'. Trying --number instead doesn't make any difference, while the --help output does not list this argument (GATK 2.2.2). It this option no longer available?

Secondly, the 'gatkforums.broadinstitute.org/discussion/48/using-varianteval#Understanding_Genotype_Concordance_values_from_Variant_Eval' section of the 'Using VariantEval' page has a series of images explaining the concordance values, however the images are missing. Please could these be restored?

Many thanks, James

Created 2012-11-15 15:03:29 | Updated 2012-11-15 22:57:54 | Tags: unifiedgenotyper selectvariants ad bug-fixed

Comments (3)


I used the UnifiedGenotyper (GATK 1.6) on a multi-sample set to call variants, and for some of the positions I get multiple mutated alleles. The genotype entries in the combined VCF file look like (GT:AD:DP:GQ:PL):



so it's three AD values per entry. Running SelectVariants yields the following line for the second example from above:

GT:AD:DP:GQ 0/1:27,0,54:81:99

Although it changed the genotype from 0/2 to 0/1, it did not update the AD field. I checked the forums, but I could not really find anything discussing specifically the update of AD, except for the GATK 2.2 release notes where it says SelectVariants: "Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file."

I was wondering whether you could confirm if cases like the one above would benefit from the bugfix, or if the bug description applies to something else.

Thanks a lot for all your hard work, Markus

Created 2012-11-06 16:13:18 | Updated | Tags: selectvariants gatkdocs genotypeconcordance

Comments (4)

I'm looking to find all the entries that change between two calls to UG on the same data. I would like to find all the entries where the call in the variant tract are different from those in the comparison track. So in effect I want those entries that would not be result from -using -conc in SelectVariants. From the documentation is is unclear if the -disc option does this:

A site is considered discordant if there exists some sample in the variant track that has a non-reference genotype and either the site isn't present in this track, the sample isn't present in this track, or the sample is called reference in this track.

What if the comp is HOM_VAR and the variant track is HET? Or if they are both HET but disagree on the specific allele?


Created 2012-10-31 20:43:20 | Updated 2013-01-07 20:10:14 | Tags: selectvariants filter

Comments (3)

I have completed filtering my SNP data using VariantFiltration, and now I want to use SelectVariants to output all calls marked "PASS" in the FILTER field. I used the following script, but only the header information writes to the output file.

java -Xmx20g -jar GenomeAnalysisTK.jar -T SelectVariants -R HC.fa --variant HC.SNPs.filtered.vcf -select "FILTER == 'PASS'" -o HC.SNPs.passed.vcf

My input file contains many records that should evaluate as true. Any idea why this doesn't this work?

Created 2012-10-30 15:55:37 | Updated 2013-01-07 20:11:03 | Tags: selectvariants variantfiltration vcf filtering

Comments (12)

I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.

While the raw VCF has plenty of QUAL scores that are very high, not a single call has a PASS associated with it in the Filter field- all are "." If I use SelectVaraints to filter the VCF based on high QUAL or DP values, or combination, the Filter field remains "." for the returned variants.

Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?

Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.

Created 2012-10-18 19:42:38 | Updated 2013-01-07 20:12:46 | Tags: selectvariants variantfiltration

Comments (6)

I am trying to filter variant calls which have "GQ>=20.0".

GATK SelectVariants, gives no error but gives only the header in the output file

java -Xmx2g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -R xxx -T SelectVariants --variant xxx.var.flt.vcf -o xxx.vcf -select "GQ >= 20.0"

So, I tried using VariantFiltration followed by SelectVariants. The variant filtration seems to work fine adding FT tag to the format field. And then I am trying to get records having FT tag using the following commands

java -Xmx2g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -R xxx -T VariantFiltration --variant xxx.var.flt.vcf -o xxx_filtered.vcf --genotypeFilterExpression "GQ >= 20.0" --genotypeFilterName "qual_1_filters"

java -Xmx4g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -T SelectVariants -R xxx --variant xxx_filtered.vcf -select 'vc.hasAttribute("FT")' -o xxx_qual20.vcf 

but I only get header in the output vcf file.

I am not sure if this is the right approach. Any help would be appreciated.

Created 2012-09-28 00:58:26 | Updated 2012-09-28 00:59:48 | Tags: selectvariants

Comments (2)

INFO  20:55:27,108 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-10-gdbc86ec, Compiled 2012/09/27 05:17:49 
INFO  20:55:27,109 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  20:55:27,109 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  20:55:27,109 HelpFormatter - Program Args: -R ../generated_consensus/ASW_woSNPsindelsdelsdupsinvsEUR.fa -T SelectVariants --variant:VCF /media/Data/raw_seq/ASW_allsamples_ASW_coords.vcf -o /media/Data/raw_seq/ASW_NA19914_ASW_coords.vcf -sn NA19914 
INFO  20:55:27,110 HelpFormatter - Date/Time: 2012/09/27 20:55:27 
INFO  20:55:27,110 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  20:55:27,110 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  20:55:27,132 GenomeAnalysisEngine - Strictness is SILENT 
INFO  20:55:27,208 RMDTrackBuilder - Creating Tribble index in memory for file /media/Data/raw_seq/ASW_allsamples_ASW_coords.vcf 
INFO  20:55:37,632 GATKRunReport - Uploaded run statistics report to AWS S3 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
java.lang.IllegalArgumentException: Start must be greater than 0!
    at org.broad.tribble.index.interval.MutableInterval.setStart(IntervalIndexCreator.java:134)
    at org.broad.tribble.index.interval.IntervalIndexCreator.addFeature(IntervalIndexCreator.java:72)
    at org.broad.tribble.index.DynamicIndexCreator.addFeature(DynamicIndexCreator.java:155)
    at org.broad.tribble.index.IndexFactory.createIndex(IndexFactory.java:245)
    at org.broad.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:237)
    at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:362)
    at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:259)
    at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:203)
    at org.broadinstitute.sting.gatk.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:132)
    at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedQueryDataPool.<init>(ReferenceOrderedDataSource.java:205)
    at org.broadinstitute.sting.gatk.datasources.rmd.ReferenceOrderedDataSource.<init>(ReferenceOrderedDataSource.java:85)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:852)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:713)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:244)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.1-10-gdbc86ec):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Start must be greater than 0!
##### ERROR ------------------------------------------------------------------------------------------

Created 2012-09-19 21:40:58 | Updated 2013-01-07 20:36:46 | Tags: selectvariants variantfiltration

Comments (4)

Hi, I wanted to double check my methods for some targeted capture data. I ran 96 samples through UG to produce a multisample VCF. I separated snps and indels into separate files using SelectVariants, and applied filters:

For snps "QD < 2.0", "MQ < 40.0", "FS > 60.0", "HaplotypeScore > 13.0", "MQRankSum < -12.5", "ReadPosRankSum < -8.0"

For indels "QD < 2.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", "FS > 200.0"

I then went back through with SelectVariants, pulling out each sample one at a time into their own filtered VCF.

My results are... lets say, wrong. I am wondering if it would be better practice to select each sample first and then apply the filters, or if it does not matter and my errors lie elsewhere. Thank you.

Created 2012-09-18 19:16:05 | Updated 2012-09-18 20:21:25 | Tags: selectvariants variantfiltration jexl

Comments (10)


I have been trying get variants out of a VCF file where the Allele Frequency (AF) is greater than 4%. I have tried both VariantFiltration and SelectVariants but I get different errors with each. Here is my call for SelectVariants:

java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T SelectVariants -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf -select "AF > 0.04" -sn "S05-16209-1C_S4_L001_R1_001"

The error is:

MESSAGE: Invalid command line: Invalid JEXL expression detected for select-0 with message ![0,9]: 'AF > 0.04;' > error

For VariantFiltration the call is:

java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T VariantFiltration -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf --filterExpression 'AF > 0.040' --filterName "3perc"

The error is:

java.lang.ArithmeticException: Double coercion: java.util.ArrayList:([0.010, 0.010])
at org.apache.commons.jexl2.JexlArithmetic.toDouble(JexlArithmetic.java:1023)
at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:699)
at org.apache.commons.jexl2.JexlArithmetic.greaterThan(JexlArithmetic.java:790)
at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:796)
at org.apache.commons.jexl2.parser.ASTGTNode.jjtAccept(ASTGTNode.java:18)
at org.apache.commons.jexl2.Interpreter.interpret(Interpreter.java:232)
at org.apache.commons.jexl2.ExpressionImpl.evaluate(ExpressionImpl.java:65)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.evaluateExpression(VariantJEXLContext.java:267)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:233)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:118)
at org.broadinstitute.sting.utils.variantcontext.VariantContextUtils.match(VariantContextUtils.java:293)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.filter(VariantFiltration.java:331)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:270)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:80)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

For both I have tried variations of double quotes and different sigfigs. Also, it works when I select on parameters other than AF.

Am I missing something?

Created 2012-09-09 02:52:32 | Updated 2012-09-09 02:53:52 | Tags: indelrealigner bqsr selectvariants

Comments (2)

My current workflow for analysing mouse exome-sequencing (based on v4 of Best Practices) can require me to use slightly different VCFs as --knownSites or --known parameters in BQSR, indel realignment etc. Basically, I have a "master" VCF that I subset using SelectVariants. The choice of subset largely depends on the strain of the mice being sequenced but also on other things such as AF'. It'd be great to be able to do this on-the-fly in conjunction with--known' in tools that required knownSites rather than having to create project-specific (or even tool-specific) VCFs.

Is there a way to do this that I've overlooked? Is this a feature that might be added to GATK?