Tagged with #variantfiltration
1 documentation article | 0 announcements | 51 forum discussions

Created 2012-08-01 23:04:18 | Updated 2016-03-16 21:58:01 | Tags: selectvariants variantfiltration jexl

Comments (38)

1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".

4. Filtering on sample/genotype-level properties

You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. Variant Filtration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1).

5. Important caveats

Sensitivity to case and type

You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:

  • Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type

The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.

6. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The bets way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

Examples using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263

Examples using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'

Example using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
No articles to display.

Created 2016-05-18 10:20:34 | Updated | Tags: selectvariants variantfiltration genotype no-calls

Comments (1)


I have been trying to find a way to mark specific individual genotypes as No Call.

I know that in VariantFiltration it is possible to add the option --setFilteredGTToNocall in order to mark filtered genotypes as no call. However, in my case, there is no available filter corresponding to my criteria. Let me explain. I have some diploid-haploid paired related samples, i.e. I expect the haploid individual to share one of the two alleles from its diploid related. Therefore, for positions where there are discordant genotypes, I would like to mark individual genotypes as as no call in order to not include potentials errors in my data. I couldn't find a way to apply this rational to the pipeline of VariantFiltration or SelectVariants...

Eventually, I manage to manipulate the .vcf file by myself to mark any discordant genotypes as . or ./. However, when later I want to select the positions for which there are maximum of 0.2 ratio of no call (using the option --maxNOCALLfraction of SelectVariants from the nightly build version), there was an error message "there aren't enough comumns for line...". I suppose that it's because I did manipulate the vcf file by myself and there could have been errors (although I'm quite confident with my script).

Therefore, I think it's better to not manipulate the vcf file and try to do things using the pipeline....but how?

Created 2016-05-09 17:28:33 | Updated | Tags: variantannotator variantfiltration bug

Comments (5)


I am planning to annotate VCF files with external frequency information such as 1KGP. So I used VariantAnnotator walker to do that and got a VCF file with OneKGP.AF in the INFO tag which is good. However, when I try to VariantFiltration this new VCF file, GATK actually filter/flag my variants using frequency information from "AF" instead of "OneKGP.AF" due to the dot in the name of "OneKGP.AF".

-- It is a great function for users to be able to annotate variant with external information and filter variants. I wonder if there is a way to get around of the problem I encounter here. Thanks

1 866511 rs60722469 C CCCCT 42628.59 PASS AC=42;AF=0.808;AN=52;AS_InbreedingCoeff=0.5047;BaseQRankSum=-1.570e-01;ClippingRankSum=-3.100e-02;DB;DP=1404;ExcessHet=0.0547;FS=1.709;FractionInformativeReads=0.796;GC=68.32;GQ_MEAN=90.42;GQ_STDDEV=16.42;HRun=0;HW=13.2;InbreedingCoeff=0.5047;MLEAC=42;MLEAF=0.808;MQ=58.15;MQRankSum=0.226;NCC=0;NDA=1;OneKGP.AF=0.148962;QD=32.93;RPA=3,4;RU=CCCT;ReadPosRankSum=0.361;SOR=1.037;STR;;VQSLOD=2.08;VariantType=INSERTION.NumRepetitions_3.EventLength_4;culprit=FS GT:AD:DP:GQ:MLPSAC:MLPSAF:PL

The commandline i used to annotate the original VCF file is

Annotate VCF with global AF from 1KGP

${PathToJava}/java -jar ${PatToGATK}/GenomeAnalysisTK.jar \ -R $Ref \ -T VariantAnnotator \ -V CONTROLS_PLUS_CGC_PedTest4.VQSR.ANNOTATED.VARIANT_ONLY.PASS.vcf \ -o CGC_PedTest4.VQSR.PASS.AnnoFreq.vcf \ --resource:OneKGP ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf \ -E OneKGP.AF

The commandline i used to Filter/Flag variants

Filter VCF file based on 1KGP.AF information:

${PathToJava}/java -jar ${PatToGATK}/GenomeAnalysisTK.jar \ -R $Ref \ -T VariantFiltration \ -o CGC_PedTest4.VQSR.PASS.AnnoFreqAll.Filter.vcf \ --variant CGC_PedTest4.VQSR.PASS.AnnoFreqAll.vcf \ --filterExpression "OneKGP.AF>0.1" \ --filterName "OneKGP.AF_Pct10"

As the issue coming from the dot in the name of "OneKPG.AF", I tried to replace " -E OneKGP.AF" with "-E OneKGP_AF" in the annotation step and apparently GATK doesn't allow me to do that as shown below.

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR MESSAGE: Invalid command line: Argument OneKGP_AF has a bad value: it should be in rodname.value format
ERROR ------------------------------------------------------------------------------------------

Created 2016-04-22 10:39:43 | Updated 2016-04-22 10:45:25 | Tags: variantfiltration vcf-format

Comments (2)

Hi there,

I want to report this in GATK's VCF output format after VariantFiltration task.

I was trying to use vcftools to read the output VCF file from that program but an error came out when parsing the header. Specifically I found this is related with the string related to filters used. The following are the filters as reported in the VCF:

##FILTER=<ID=HARD_TO_VALIDATE,Description=""QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0""> ##FILTER= 30.0 && QUAL < 100.0"">

The double quotes are repeated, as you can see in all the three filters and this is causing Vcf.pm perl library to fail with:

_Could not parse header line: FILTER=<ID=HARD_TOVALIDATE,Description=""QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0""> Stopped at [QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0""].

When I go to remove the repeated double-quotes than I solved this. Infact with ##FILTER= 30.0 && QUAL < 100.0"> the error is not there.

Moreover, one more error is there because of VariantFiltration. Few lines later the Command line printed has all the parameters given to VariantFiltration and this filter again:

##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.4-46-gbc02625,Date="Sat Apr 16 12:01:58 CEST 2016",Epoch=1460800918456,CommandLineOptions=.. .. .. ..... filterExpression=["QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0", "QUAL < 30.0", "QUAL > 30.0 && QUAL < 100.0"] filterName=[HARD_TO_VALIDATE, VeryLowQual, LowQual] genotypeFilterExpression=[] genotypeFilterName=[] . .. .... > And Vcf.pm stops with:

Could not parse header line: GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.4-46-gbc02625 .....

I solved in a similar way removing the quotes from the filter elements:

filterExpression=[QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0, QUAL < 30.0, QUAL > 30.0 && QUAL < 100.0]

I do not know if it is a problem in the Vcf.pm library or the VCF format is not respected in VariantFiltration.

Hope this will be useful


Francesco Musacchia

Created 2016-04-02 10:14:11 | Updated | Tags: variantfiltration jexl qd mq hard-filtering

Comments (2)

Hi GATK team again !

I have been trying to do the hard filter on my SNPs. What I want to do is to filter out all snps with depth less than 300 and more than 2000, and also all the SNPs with the following features: QD < 10, MQ < 20, FS >20, SOR > 4, MQRankSum < -5 and ReadPosRankSum < -5. Here is my command:

java -Xmx4g -jar ~/path/to/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R ~/path/to/reference \ -V NK_jointSNP.vcf \ --filterExpression "DP < 300 || DP > 2000 || QD < 10 || MQ < 20 || FS < 20 || MQRankSum < -5 || ReadPosRankSum < -5 || SOR > 4" \ --filterName "snp_filter_1st_round" \ -o filtered_snp.vcf &

Beside the error messages "undefined variable QD", SNPs which should not pass my filter seem to pass the filter as they were tagged "PASS" on the FILTER field....

For example,

PASS AC=3;AF=0.188;AN=16;BaseQRankSum=-2.800e-01;ClippingRankSum=0.742;DP=372;ExcessHet=3.0103;FS=5.152;MLEA C=3;MLEAF=0.188;MQ=18.78;MQRankSum=0.742;QD=33.91;ReadPosRankSum=1.40;SOR=3.321 PASS AC=4,4;AF=0.200,0.200;AN=20;BaseQRankSum=0.536;ClippingRankSum=0.742;DP=364;ExcessHet=3.0103;FS=0.000;M LEAC=4,4;MLEAF=0.200,0.200;MQ=19.80;MQRankSum=0.949;QD=30.96;ReadPosRankSum=0.124;SOR=1.493 PASS AC=1;AF=0.036;AN=28;DP=697;ExcessHet=3.4242;FS=0.000;MLEAC=1;MLEAF=0.036;MQ=12.76;QD=33.28 ;SOR=3.258 PASS AC=2;AF=0.080;AN=25;DP=727;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=0.080;MQ=7.42;QD=30.73; SOR=3.056 PASS AC=2;AF=0.077;AN=26;DP=727;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=0.077;MQ=7.44;QD=24.88; SOR=3.056 PASS AC=1;AF=0.032;AN=31;DP=768;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.032;MQ=11.33;QD=32.28 ;SOR=1.179 PASS AC=1;AF=0.040;AN=25;BaseQRankSum=-1.068e+00;ClippingRankSum=1.46;DP=606;ExcessHet=3.0103;FS=3.074;MLEAC=1;MLEAF=0.040;MQ=46.00;MQRankSum=-2.892e+00;QD=0.57;ReadPosRankSum=-1.824e+00;SOR=0.670 PASS AC=4;AF=0.250;AN=16;BaseQRankSum=0.464;ClippingRankSum=1.96;DP=441;ExcessHet=3.0103;FS=4.223;MLEAC=4;MLEAF=0.250; MQ=34.18;MQRankSum=-3.640e+00;QD=9.86;ReadPosRankSum=2.25;SOR=0.359

As you can see, SNPs with uncorresponding values are there....it seems also that, when this happens, SNPs pass the filter for MQ, they do not pass for QD and vice-versa.... The other annotations do not seem to be affected...

Am I making any mistake with my JEXL expression? I used "&&" as I do want my SNPs to correspond to all the criterial not just at least one and I used "||" between the two expressions on DP as a snp cannot have in the same time <300 and >2000 of depth...

I would like to have your input.

Thank you very much in advance. You guys are doing great job in supporting our community !

Best regards, Noppol.

Created 2016-03-29 20:12:58 | Updated | Tags: selectvariants variantfiltration

Comments (2)


As many members have suggested I am using a standard method of running through VariantFiltration to change samples with low depth to missing genotype using -G_filter "DP<4" -G_filterName "SampleDepthFilter4" --setFilteredGtToNocall.

Then using SelectVariants to filter by MAF and Missing%s using allele numbers with the resulting file using --selectexpressions "AF>0.1" --selectexpressions "AF<0.9" --selectexpressions "AN>140" --restrictAllelesTo BIALLELIC --selectTypeToExclude INDEL.

However when I'm not going through and looking at the results, I see that in almost all cases the missing percentages are much higher than what I was theoretically filtering for with Allele Number greater than 140, ie 70 individuals. It seems that it is not taking into account the genotypes that have been changed over to missing based on the sample depth filter, even though it is using the resulting file from the VariantFiltration step. Why is filtering not performing as anticipated?

Thank you, Amanda

Created 2016-03-07 16:21:26 | Updated 2016-03-07 16:31:04 | Tags: selectvariants variantfiltration

Comments (15)

Dear GATK team,

I am trying to use VariantFiltration and SelectVariants to mask individual calls with GQ < 20 or DP < 10 in my VCF; however, as I succeed masking individuals who didn't pass the QC with './.', the AC, AN and AF in the INFO field stayed the same. Just want to make sure that I did everything correctly:

I first use VariantFiltration to mark filtered individuals (I am using GATK version: 3.5-0-g36282e4):

java -jar GenomeAnalysisTK.jar -R hg19.fasta -V test.vcf -o test.vf.vcf -G_filter 'GQ<20' --genotypeFilterName 'lowGQ' -G_filter 'DP<10' --genotypeFilterName 'lowDP' -V test.vcf -o test.out1.vcf

and then use SelectVariants to set those filtered individuals:

java -jar GenomeAnalysisTK.jar -T SelectVariants --removeUnusedAlternates --excludeNonVariants --setFilteredGtToNocall -V test.out1.vcf -o test.out2.vcf

Here is a toy example which might help you debug:

the original input was:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 1 1000 . A T 2000 PASS AC=3;AF=0.375;AN=8 GT:DP:GQ 0/0:20:30 0/1:20:30 0/1:20:30 0/1:5:10

After VariantFiltration: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 1 1000 . A T 2000 PASS AC=3;AF=0.375;AN=8 GT:DP:FT:GQ 0/0:20:PASS:30 0/1:20:PASS:30 0/1:20:PASS:30 0/1:5:lowDP;lowGQ:10

After SelectVariants: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 1 1000 . A T 2000 PASS AC=3;AF=0.375;AN=8 GT:DP:FT:GQ 0/0:20:PASS:30 0/1:20:PASS:30 0/1:20:PASS:30 ./.:5:lowDP;lowGQ:10

It seems to me that selectVariants should update the AC, AF and AN, as after we masking the individual's genotypes. I am not sure if I am doing this with the correct command. I've been reading previous threads about variantFiltration and selectVariants on individual genotypes, but not find cases like this.

Hope to hear from you. Thanks a lot.


Created 2016-02-29 00:11:59 | Updated | Tags: selectvariants variantfiltration phasing genotypegvcfs

Comments (9)


I'd be grateful for your help in understanding how GATK add phase information (PID field) in GenotypeGVCFs.

Here's an example of two lines from a VCF I'm getting for running GenotypeGVCFs on a set of samples:

chr1 8864083 . T C 462.27 . AC=4;AF=0.500;AN=8;BaseQRankSum=-4.950e-01;ClippingRankSum=0.406;DP=74;FS=0.000;MLEAC=4;MLEAF=0.500;MQ=60.00;MQRankSum=0.306;QD=6.42;ReadPosRankSum=1.54;SOR=0.674 GT:AD:DP:GQ:PL 0/1:22,12:34:99:205,0,518 0/1:10,4:14:60:60,0,220 0/1:3,3:6:71:71,0,72 0/1:10,8:18:99:158,0,238 ./.:2,0:2 chr1 8864426 . C T 5302.72 . AC=5;AF=0.500;AN=10;BaseQRankSum=0.815;ClippingRankSum=-6.270e-01;DP=316;FS=7.664;MLEAC=5;MLEAF=0.500;MQ=60.00;MQRankSum=-6.130e-01;QD=17.05;ReadPosRankSum=-1.240e-01;SOR=0.777 GT:AD:DP:GQ:PGT:PID:PL 0/1:24,66:90:99:0|1:8864426_C_T:1596,0,649 0/1:20,62:82:99:.:.:1502,0,536 0/1:24,37:61:99:0|1:8864426_C_T:921,0,647 0/1:22,41:63:99:0|1:8864426_C_T:1014,0,717 0/1:2,13:15:74:0|1:8864426_C_T:304,0,74

What's the interpretation for the phase information of the last sample in chr1:8864426? in position chr1:8864083 its genotype is not reported but in the proceeding position chr1:8864426 it is phased 0|1 WRT position chr1:8864083.

If I subsequently run SelectVariants for that sample, with --excludeNonVariants this phase information is retained but clearly looses context. So another question is if there is a way to change the PID field when running SelectVariants or VariantFiltration in case the position it is phased with is filtered?

Thanks a lot

Created 2016-02-27 23:55:23 | Updated | Tags: haplotypecaller gatkdocs variantfiltration vcf developer gatk variant-calling

Comments (2)

In a discussion about using ERC, you provide some example VCF output lines like:

20  10000442   .   T  <NON_REF>  .   .   .   GT:AD:CD:DP:GQ:PL  0/0:56,0:56:56:99:0,168,2095
20  10000443   .   A  <NON_REF>  .   .   .   GT:AD:CD:DP:GQ:PL  0/0:56,0:56:56:99:0,169,2089
20  10000444   .   A  <NON_REF>  .   .   .   GT:AD:CD:DP:GQ:PL  0/0:56,0:56:56:99:0,168,2093

and say:

  1. "For each reference base not part of a variant call we emit a VCF record with the special symbolic allele <NON_REF> indicating the call is between the reference base and any possible non-reference allele that might be segregating at this site,"
  2. and
  3. "Note that there's no site-level QUAL field value. We discussed this internally and since the QUAL is the probability that the site is polymorphic, all of the QUAL field values should be 0 here, so we decided to drop it."

While the formal Variant Call Format 4.2 Specification says:

  1. "ALT - alternate base(s): Comma separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on breakends,"
  2. and
  3. "QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. −10log10 prob(call in ALT is wrong). If ALT is ‘.’ (no variant) then this is −10log10 prob(variant), and if ALT is not ‘.’ this is −10log10 prob(no variant). If unknown, the missing value should be specified."

The issue is subtle, but introduces problems with downstream processing of HaplotypeCaller generated VCF files containing reference calls. The use of the symbol "<NON_REF>" instead of "." for reference calls is a little confusing, but I also see the logic of that. More seriously: According to the VCFv4.2 specs, QUAL is NOT always a measure of "the probability that the site is polymorphic". Perhaps when a variant is called, but not when a site is called as non-variant. All those QUAL field values should NOT be 0 there. It is about the quality and correctness of the call itself. It is defined as the PHRED score of the probability that the assertion made in ALT is wrong. So if ALT asserts "<NON_REF>" or "." CONFIDENTLY (meaning a low probability that the assertion is wrong), then the QUAL PHRED score should be HIGH, not ZERO. A confident call should have a high QUAL score, whether variant or monomorphic. That is the intention of the specification. If otherwise: filtering out low quality records from a non-compliant VCF output file by removing those with low QUAL scores, for instance, would also filter out all the high quality reference calls.

I have previously been in touch with the writers/keepers of the VCF Specs (now over at samtools) and they confirm this interpretation of QUAL scores as the correct one for VCFv4.2 (and other versions) compliant output. It's unfortunately couched in mathematical double negatives, but look at it carefully and you will see the correctness of this reading. As one who uses tools from many sources, I hope you will address this discrepancy. It complicates interoperability.

-- Fred P.

Created 2016-02-08 17:01:32 | Updated | Tags: variantfiltration mqranksum

Comments (3)

Hi, I am running VariantFiltration (GATK v 3.4) using the following command:

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg19.fa -V raw_snps.vcf --filterExpression "QUAL/DP < 2.0 || DP < 10.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0" --missingValuesInExpressionsShouldEvaluateAsFailing --filterName "current" -o filtered_snps.vcf

Only half of the variants have the MQRankSum values and in spite of having the argument '--missingValuesInExpressionsShouldEvaluateAsFailing', the output file has many variants marked as PASS that do not have MQRankSum. Ideally, the ones without MQRankSum annotation should not be marked as PASS, right? Is this a bug issue?

Also, on the side, how do get GATK to remove the variants from the file that do not PASS. So, only have the PASS variants in the output file?

Thanks, ~N

Created 2016-01-28 16:21:26 | Updated | Tags: variantfiltration

Comments (5)

Dear GATK team,

Thank you for the GATK software package and the great documentation which is very helpful! However, I observed that after filtering my VCF file includes more variants than before. As I understood, the same number of Variants should remain, only on some of them will be tagged with e.g. "LOWQUAL". How should I interpret that? I checked one of the variants which was there in the filtered but not in the unfiltered. It is a LOWQUAL-tagged SNP.

Thanks a lot for your help!

Created 2016-01-23 05:03:33 | Updated | Tags: variantfiltration snp gatk

Comments (1)


I am doing an analysis on chimp variants, and do not have enough variants to train the model in order to perform VQSR. I am using hard filters at this point, as per the recommendation. I keep reading that the recommended cutoff values are only recommendations and that I should expect to tweak the parameters further. However, how do I do this in an effective manner? I'm just not sure how to approach exploring the parameter space.

Should I try out different values around the recommended cutoffs and, down the line, analyze how many mutations I get/calculate certain values for quality control and check that they agree with what is expected (CpG transition rate, and so on)? Is there a way to inspect the data to see whether it is behaving well, or do I have to wait until down the line to see how things are going? Any insight into this would help tremendously.

Thanks! Alva

Created 2015-12-04 16:53:10 | Updated | Tags: selectvariants variantfiltration

Comments (7)


I apologise if this has been answered already, I haven't managed to find the answer (I promise I tried). I have applied hard-filters to my HC-SNP data set as recommended by GATK as a starting point. However, I also want to filter by depth (I think this is no longer a popular option but variants with good coverage are surely still more reliable than variants with very low coverage?). Ideally, what I want is a filtered vcf file that contains variants which are genotyped in the majority of my samples at a minimum depth. I think this is what the CoveredByNSampleSites Walker used to do, but it seems this walker has been removed?

So instead I worked around this using VariantFiltration at the sample level: --genotypeFilterExpression 'DP < 10' --genotypeFilterName DP-10 --setFilteredGtToNocall, so all samples with a depth of less than 10 for the any given variant were set to ./.. This seemed to work fine.

Now I want to exclude variants for which a large proportion of my samples have no genotype (./.). For this, I used:

SelectVariants --maxFilteredGenotypes 10.

However, when I check how many samples have no genotype using VariantsToTable -F NO-CALL, I still have many variants where 20-40 of my samples have no genotype! The --maxFilteredGenotypes option seems to work fine, as I definitely have far fewer variants with many missing genotypes. I believe this is due to the fact that these missing genotypes weren't 'filtered', but simply have not been genotyped in the HaplotypeCaller. I guess to exclude these, I would need an option along the lines of "--maxNOCALL" instead of the --maxFilteredGenotypes, but as far as I can tell, this does not exist?

I am sure there must be a way to do this and I simply haven't found it?

Many thanks for any pointers :)


Created 2015-11-17 14:42:23 | Updated | Tags: selectvariants variantfiltration

Comments (2)


I'm hoping it's possible to mark variants in repetitive elements using variant filtration, so they can be selected out later. I've tried a few command variations, the most recent of which:

GenomeAnalysisTK -R local_reference/dm6.fa \ -T VariantFiltration \ -V my.vcf \ -filter "QD<5.0" -filterName "LowQD" \ -mask ../masking_intervals/dmel6_repMask.gatk.intervals -filterNotInMask -maskName "RepMask" \ -o my_marked.vcf

..fails with "No tribble type provided" for the intervals file.

Is this possible at all to do with VariantFiltration? Note that I'd rather mark the suspect variants for removal later, rather than just remove them altogether.

...actually in writing this, I just thought to create a new vcf of variants only in the intervals, and then use this vcf to screen out the original vcf..... Sound good ?


William Gilks

Created 2015-11-10 15:38:11 | Updated | Tags: clippingranksumtest variantfiltration baseqranksum

Comments (1)

Hi team,

I was hoping to filtered out variants with extreme rank sum values for base qual and clipping, with: echo "date +%y/%m/%d_%H:%M:%S.... Mark bad quality variants .... " GenomeAnalysisTK -R local_reference/dm6.fa \ -T VariantFiltration \ -V ${input_vcf} \ -filter "MQ<60" -filterName "lowMQ" \ -filter "BaseQRankSum<-4" -filterName "lowBQSR" \ -filter "BaseQRankSum>4" -filterName "highBQSR" \ -filter "ClippingRankSum<-2.5" -filterName "lowCRS" \ -filter "ClippingRankSum>2.5" -filterName "hiCRS" \ -filter "QD<5.0" -filterName "LowQD" \ -filter "FS>30" -filterName "hiSB" \ -G_filter "DP<10" -G_filterName "lowIDP" \ -o ${input_vcf}_marked.vcf

but it returns : WARN 15:12:34,785 Interpreter - ![0,12]: 'BaseQRankSum > 4;' undefined variable BaseQRankSum WARN 15:12:34,786 Interpreter - ![0,12]: 'BaseQRankSum < -4;' undefined variable BaseQRankSum WARN 15:12:34,786 Interpreter - ![0,15]: 'ClippingRankSum < -2.5;' undefined variable ClippingRankSum WARN 15:12:34,787 Interpreter - ![0,15]: 'ClippingRankSum > 2.5;' undefined variable ClippingRankSum

I thought this information could be filtered on. What am I doing wrong ?



Created 2015-11-06 11:07:22 | Updated | Tags: selectvariants variantannotator variantfiltration dbsnp

Comments (5)

Hi team,

Prior to submission to NCBI dbSNP a vcf generated by e.g HaplotypeCaller requires several modifications:

  1. Addition of in-house identifiers. .................................................... done
  2. Exclude if alternate allele is "*" i.e. they are in a deletion. .................................................... I'm assuming this can be done with SelectVariants or FilterVariants.
  3. Exclude if ref or alt allele is greater than 50bp .................................................... Perhaps with SelectVariants or FilterVariants --maxIndelSize 50
  4. Exclude if ref and alt alleles do not have a common leading base. .................................................... Not sure ... removing larger indels won't exclude all of these.
  5. Add VRT (variant type) to Info field .....................................................e.g VRT=1 (for an SNV), VRT=2 for an indel etc. SNPeff doesn't seem to work for this but I could be wrong.

Knowing how to effectively format vcfs between GATK output and NCBI input might be quite useful for many people, and save rather a lot of time.

It would be really useful if the three exclusion criteria could be done using GATK. Is this possible and using what commands ?

I feel as though need to use the GATK variantAnnotator command as well. I'm looking into all of this today, and will post if I get any solutions.


William Gilks

Created 2015-10-29 14:19:26 | Updated | Tags: haplotypecaller variantfiltration drosophila

Comments (4)

Hi team, (this is really two questions)

  1. Do you have any recommendations for hard-filtering haplotypecaller-generated vcfs ?

This was my previous filter for the unifiedgenotyper output"

`GenomeAnalysisTK -R ${ref} \
    -T VariantFiltration \
        -V ${my_vcf} \
            -filter "QUAL<1000.0" -filterName "LowQual" \
                -filter "MQ0>=4&&((MQ0/(1.0*DP))>0.1)" -filterName "BadVal" \
                    -filter "MQ<60" -filterName "LowMQ" \
                        -filter "QD<5.0" -filterName "LowQD" \
                            -filter "FS>60" -filterName "FishStra" \
                                  -filter "DP<2000" -filterName "lowTotDP" \
                                        -o qual_marked.vcf`

Obviously fields such as MQ0 won't work as this isn't present in the HC-generated vcf, and obviously there are many fields to filter on. (There are 222 samples and 1.9m variants in the vcf)

  1. One filter that I'm really keen to apply but never got the hang-of, is to drop all individual genotype calls where the coverage is less than 10X. (This is because I'm really interested in getting the genotype correct, rather than actually detecting mutations).


William Gilks

Created 2015-10-26 22:47:05 | Updated | Tags: variantfiltration

Comments (8)


In trying to use VariantFiltration to hard filter vcf I continue to get the same runtime warning, e.g.

WARN 15:28:05,017 Interpreter - ![0,2]: 'DP;' undefined variable DP

and all sites have 'FILTER' set to 'PASS'.

Here a typical command:

java -jar /software/gatk/3.4-46/static/GenomeAnalysisTK.jar \ -T VariantFiltration \ -L 3R:12000000-12200000 \ -R /home/chuck/shrd/reference_genomes/D_mel_RELEASE6_Sue/norm_dmel_R6_SL.fasta \ -V /home/chuck/chl_working/testgatk/DPGP3_ZI118N.raw.snps.indels.gGVCF.g.vcf \ --filterName test_DPGT38 \ --filterExpression "DP>38.0" \ -o test_filtered_CHL.vcf

I suspect I am missing an obvious requirement or constraint.

Any help appreciated.

Cheers, Chuck

Created 2015-09-02 18:50:56 | Updated | Tags: variantfiltration

Comments (1)

GATK Team,

Currently in our variant calling pipeline, one step is to use VariantFiltration to flag any genotype calls with low confidence (defined to be GQ<20). I noticed that the VariantFiltration tool is not parallelizable (either -nt or -nct), while the ApplyRecalibration tool is (via -nt). From a novice, it seems that the same logic to parallelize ApplyRecalibration could be used to parallelize VariantFiltration. Is this the case, or is there something deeper that is different between the two? If VariantFiltration is paralleizable, is this currently in the works? Can I request this feature (and help)?

Created 2015-08-27 12:34:52 | Updated | Tags: variantfiltration

Comments (6)


My problem is that the VariantFiltration command fails with the -G_filter terms.

My input is: GenomeAnalysisTK -R ${my_ref} \ -T VariantFiltration -V my_genos.vcf \ -G_filter "isNoCall == 0 || isHomVar == 0" \ -G_filterName "badgenos" \ --invalidatePreviousFilters \ -o dodgey_genos_marked.vcf

The error messages are: ERROR A USER ERROR has occurred (version 3.4-46-gbc02625): ERROR MESSAGE: Invalid argument value '==' at position 8. ERROR Invalid argument value '0' at position 9. ERROR Invalid argument value '||' at position 10. ERROR Invalid argument value 'isHomVar' at position 11. ERROR Invalid argument value '==' at position 12. ERROR Invalid argument value '0' at position 13.

I've remade the input vcf and idx files (with GATK) so there should be no issue there (as reported in other posts). I've successfully performed a VariantFiltration followed by SelectVariants earlier in the pipeline using another vcf as a mask. --FilterExpression also generates similar error messages.

I was hoping that it's a problem with my syntax but I can't find any mistakes. Do you know of a solution?



Created 2015-08-18 12:42:10 | Updated | Tags: haplotypecaller variantfiltration

Comments (5)

Hi I am new to using HC. I am just using the HapMap sample NA12878 to validate my pipeline. I ran the following command on a BAM file that I generated. I noticed that certain SNPs were filtered out most of which I can understand the rationale.

java -Xms454m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=/home/tx/tmpYdAKAC -jar /usr/local/share/gatk/3.4-46-gbc02625/GenomeAnalysisTK.jar \ -R /usr/local/share/genomes/Hsapiens/GRCh37/seq/GRCh37.fa \ -I /home/INDELanalysis/validationrun/NA12878-sort-11_0_15994663-prep.bam \ --dbsnp /usr/local/share/genomes/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz \ -L 11:1628982-1651500 \ -T HaplotypeCaller \ -o /home/INDELanalysis/validationrun/NA12878-11_0_15994663-raw.vcf.gz

I ran the same analysis using Samtools and the following SNP was in the final filtered output

Samtools output 11 1651228 rs71454096 C G 117 PASS AC=1;AN=2;BQB=0.61207;BaseQRankSum=-0.458;DB;DP=21;DP4=11,0,5,3;FS=7.225;GC=71.29;HOB=0.5;HRun=3;ICB=1;MQ=56.97;MQ0=0;MQ0F=0;MQB=0.395294;MQRankSum=-1.091;MQSB=0.798516;QD=5.57;RPB=0.902014;ReadPosRankSum=-1.162;SGB=-0.651104;VDB=0.0103095;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCg/gGg|A53G|237|KRTAP5-5|protein_coding|CODING|ENST00000399676|1|G) GT:DP:PL 0/1:19:150,0,158

However when I run it through HC with the above settings the above variant is filtered out of the analysis. When I ran HC with--emitRefConfidence GVCF I got the following output for this variant:

HC output 11 1651228 rs71454096 C G,<NON_REF> 0 . DB;DP=20;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.22 GT:AD:DP:GQ:PGT:PID:PL:SB 0/0:10,0,0:10:34:0|1:1651120_G_T:0,34,442,34,442,442:10,0,0,0

I am a little confused why it got a QUAL score of 0. So looking nearby I also noticed that there were some sizeable INDELS

11 1651198 . GAGGCTGTGGGGGCTGTGGCTCCGGCTGTGC G,<NON_REF> 0 . DP=25;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.38 GT:AD:DP:GQ:PL:SB 0/0:14,0,0:14:37:0,37,585,42,589,594:12,2,0,0 11 1651199 rs71025763 A AGGCTGTGGCTCC,<NON_REF> 0 . DB;DP=18;MLEAC=0,0;MLEAF=0.00,0.00;MQ=57.74 GT:AD:DP:GQ:PL:SB 0/0:6,0,0:6:28:0,28,332,28,332,332:5,1,0,0

However neither of these INDELS were called either. Can anyone shed some light as to why nothing is called in this region? If I look at the site 11 1651228 in the bam file I can see that most of the calls were made on the forward strand so there is likely to be some strand bias here and in actual fact we can see that HC called the genotype 0/0 with only 10 reads contributing to this genotype call. I presume the remaining reads were filtered out. However when I look at the pileup for this position I don't understand why over half the reads were filtered


Any advice help would be much appreciated.

Created 2015-06-30 12:54:03 | Updated | Tags: combinevariants variantfiltration jexl

Comments (4)

Hi GATK team,

I've been asked to merge a large number of VCF file (one sample/vcf) but only if QUAL is lower than a given value. I would like to avoid to filter+copy each VCF and then merge with CombineVariants. So my question is: is it possible to give a (JEXL?) expression to accept/reject a variant in one vcf ?

Thanks ! Pierre

Created 2015-03-20 17:34:55 | Updated | Tags: variantfiltration

Comments (0)

Hello, I'm trying to figure out what's wrong in my script for this variant filtration argument. Here is what I ran:

java -jar tools/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \ -V vcftoolsextractGATK_snps.vcf \ -o GATK_darwinfinch_filter.vcf \ --filterExpression QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2 \ --filterName “darwinfinchfilter”

I got the following error message: "2: No such file or directory."

However, I know my file path was set correctly as if i remove the last two arguments and only ran: java -jar tools/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \ -V vcftoolsextractGATK_snps.vcf \ -o GATK_darwinfinch_filter.vcf

It would run.

I also tried Geraldine example filterExpression:

java -jar tools/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \ -V vcftoolsextractGATK_snps.vcf \ -o GATK_geraldinefilter.vcf \ --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \ --filterName “Geraldine_snp_filter"

and I got the following error: Unmatched ".


Created 2015-03-20 17:30:23 | Updated | Tags: variantfiltration bug

Comments (14)

Hello, I'm trying to figure out what's wrong in my script for this variant filtration argument. Here is what I ran:

java -jar tools/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \ -V vcftoolsextractGATK_snps.vcf \ -o GATK_darwinfinch_filter.vcf \ --filterExpression QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2 \ --filterName “darwinfinchfilter”

I got the following error message: "2: No such file or directory."

However, I know my file path was set correctly as if i remove the last two arguments and only ran: java -jar tools/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \ -V vcftoolsextractGATK_snps.vcf \ -o GATK_darwinfinch_filter.vcf

It would run.

I also tried Geraldine example filterExpression:

java -jar tools/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \ -V vcftoolsextractGATK_snps.vcf \ -o GATK_geraldinefilter.vcf \ --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \ --filterName “Geraldine_snp_filter"

and I got the following error: Unmatched ".

Created 2015-03-11 20:38:44 | Updated | Tags: vqsr best-practices variantfiltration variant-calling

Comments (3)

Hi all - I'm stumped and need your help. I'm following the GATK best practices for calling variants with HaplotypeCaller in GVCF mode. One of my samples is NA12878, among 119 others samples in my cohort. For some reason GATK is missing a bunch of variants in this sample that I can clearly see in IGV but are not listed in the VCF. I discovered that the variant is being filtered out..reason being VQSRTranchesSNP99.00to99.90. The genotype is homozygous variant, DP is 243, Qual is 524742.54 and its known in dbSNP. I suspect this is happening to other variants.

How do I adjust VQSR or how tranches are used and variants get placed in? I supposed I need to fine tune my parameters...but I would think something as obvious as this variant would pass Filtering.

Created 2015-02-24 17:03:28 | Updated | Tags: variantfiltration

Comments (1)

Hi GATK team,

I need to flag poor quality SNPs and Indels in the joint VCF. Basically, we call a family (trio) together so a typical joint VCF contains calls from three samples. I followed the rules that proposed in the your post "how-to-apply-hard-filters-to-a-call-set". In fact, I want to flag the "FILTER" field based on Information of one sample in the VCF. How to specify it in "VariantFiltration"? Any suggestions?

Many thanks, Linda

Created 2015-01-27 10:30:58 | Updated | Tags: variantfiltration jexl bug variantcontext hardfilters

Comments (8)

I was pulling my hair out over this one.

I was applying a hard filter to a genotyped gVCF using JEXL to access variant context attributes to decide what filter setting I would apply. The filter was

"vc.getGenotype("%sample%").isHomRef() ? vc.getGenotype("%sample%").getAD().size() == 1 ? DP < 10 : ( DP - MQ0 ) < 10 or ( MQ0 - (1.0 * DP) ) >= 0.1 or MQRankSum <= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81 : false"

In pseudocode it says:

 `if ( isHomRef ) then
   if ( getAD().size() == 1 ) then DP < 10 else
      ( DP - MQ0 ) < 10 or ( MQ0 - (1.0 * DP) ) >= 0.1 or MQRankSum >= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81 else ignore record`

The idea being that for records where not all reads contained the reference allele, we would filter out those positions where there was evidence to suggest that the reads supporting an alternate allele were of a significantly better quality. However, running this filter I keep getting the warning (snipped for clarity):

WARN [SNIP]... MQRankSum <= 3.2905 [SNIP]... : false;' undefined variable MQRankSum

So I thought the filter was failing. However, just as a test, I changed the direction of MQRankSum from >=3.2905 to <=3.2905 (a bit nonsensical, it should basically apply the filter to almost all HomRef positions that had any reads supporting an alternate allele).

I still get the warning but I found the filter was applied to variant records as it should be. e.g. the following went from PASS to BAD_HOMREF:

Supercontig_1.1 87 . C A . BAD_HOMREF BaseQRankSum=2.79;DP=40;MQ=39.95;MQ0=0;MQRankSum=-2.710e+00;ReadPosRankSum=0.819;VariantType=SNP GT:AB:AD:DP 0/0:0.870:34,5:39

So the filter is being correctly applied, but I am not sure why all the warnings are being generated? Is this a bug? Have I done something wrong?

Created 2015-01-11 23:30:38 | Updated | Tags: selectvariants variantfiltration jexl snps

Comments (2)

I am experimenting with JEXL expressions in order to do some hard filtering on variants. I wonder if there is a method to tell the filter expression to operate on the current sample (assuming here a single sample VCF file) without knowing the sample name a priori e.g.


Works just fine to sample heterozygous positions from a VCF where I know the sample will be called "Sample1", but can I refer to a sample name programmatically, e.g. something like: vc.getGenotype( sample() ).isHet()

Sorry if this is a really dumb question. (BTW I realise you could use a genotype_filter e.g. --genotypeFilterExpression "isHet == 1" to do the same thing, but I want to annotate the VCF in the FORMAT/FILTER field, rather than the INFO field for downstream variant selection operations.

Created 2015-01-09 16:47:35 | Updated | Tags: bqsr haplotypecaller variantfiltration vcf bam workflows snps gatk3

Comments (3)

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

  • 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
  • 02- BAM: MarkDuplicates
  • 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
  • 04- BAM: Calling variants using HaplotypeCaller
  • 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
  • 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
  • 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
  • 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES \ and --emitRefConfidence GVCF \

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site-by-site about it's calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES \ and --emitRefConfidence GVCF \ I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

Created 2014-11-13 03:40:49 | Updated | Tags: haplotypecaller variantfiltration

Comments (3)

I have 23 samples and I want to look over 63807197 bp region. Many thanks before.

Kind regards, Angelica

Created 2014-10-29 10:11:15 | Updated | Tags: commandlinegatk variantfiltration filterexpression

Comments (2)

Hi all, I tried to apply the following command to my raw vcf file to filter it with the command: java -Xmx30g -jar ../GATK/GenomeAnalysisTK.jar -R ../ref.fa -T VariantFiltration --filterExpression " QD < 20.0 || ReadPosRankSum < -8.0 || FS > 10.0 || QUAL < $MEANQUAL || MQ <30.0 || DP< 10.0 " --filterName LowQualFilter --missingValuesInExpressionsShouldEvaluateAsFailing --variant ../s1.raw.vcf --logging_level ERROR -o ../s1.makered.raw.vcf

grep -v "Filter" s1.makered.raw.vcf >s1.flt.vcf

After that, I checked the result file s1.flt.vcf and found the following makered "PASS" .Obviously, the command doesn't work as ‘DP=8“ should be makered "LowQualFiter".

Chr01 231575 . A G 241.78 PASS AC=2;AF=1.00;AN=2;DP=8;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=29.00;MQ0=0;QD=30.22 GT:AD:DP:GQ:PL 1/1:0,8:8:24:270,24,0 Chr01 237476 . T C 238.78 PASS AC=2;AF=1.00;AN=2;DP=8;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=29.00;MQ0=0;QD=29.85 GT:AD:DP:GQ:PL 1/1:0,8:8:24:267,24,0

There is no error reported.Any suggestion will be appreciated.

Created 2014-10-22 22:02:52 | Updated | Tags: variantfiltration

Comments (3)

Hi ,

I am trying to filter my variants only if they have covergare of >=10, Below is the error I am getting

$ java -jar /opt/NGSTools/GATK-3.2.2/GenomeAnalysisTK.jar -T VariantFiltration -R /home/data/GATK_test/gatk/ucsc.hg19.fasta --variant 12-0116KZ_vcf_snp-indel.vcf -o 12- 0116KZ__filtered.vcf --filterExpression "DP > 10" INFO 12:59:02,106 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:59:02,108 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03 INFO 12:59:02,109 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 12:59:02,109 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 12:59:02,113 HelpFormatter - Program Args: -T VariantFiltration -R /home/data/GATK_test/gatk/ucsc.hg19.fasta --variant 12-0116KZ_vcf_snp-indel.vcf -o 12-0116KZ__filtered.vcf --filterExpression DP > 10 INFO 12:59:02,123 HelpFormatter - Executing as clnsxr@MSJMJ794LD4229 on Linux 2.6.32-431.23.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_65-mockbuild_2014_07_14_06_19-b00. INFO 12:59:02,124 HelpFormatter - Date/Time: 2014/10/22 12:59:02 INFO 12:59:02,124 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:59:02,124 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:59:02,199 GenomeAnalysisEngine - Strictness is SILENT INFO 12:59:02,352 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 12:59:02,487 GenomeAnalysisEngine - Preparing for traversal INFO 12:59:02,508 GenomeAnalysisEngine - Done preparing for traversal INFO 12:59:02,508 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 12:59:02,509 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 12:59:02,509 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime WARN 12:59:03,266 RestStorageService - Error Response: PUT '/yzrGUOlIfLoDxQzDGnPauLbDm4fD0TwM.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 809, Content-MD5: BxyNhjXmVTFtPLFC1PfHvQ==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: 071c8d8635e655316d3cb142d4f7c7bd, Date: Wed, 22 Oct 2014 17:59:02 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:xPIn4b04nmKanO/VjUHgupuql8w=, User-Agent: JetS3t/0.8.1 (Linux/2.6.32-431.23.3.el6.x86_64; amd64; en; JVM 1.7.0_65), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 7CC503EC75535B0F, x-amz-id-2: fAfDXCnwIDLYgjwyQf84lgZBbHe/DyKdCzO2K5I3pQXQPC8kSeRu/ceSKfd2FS3I, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Wed, 22 Oct 2014 21:58:59 GMT, Connection: close, Server: AmazonS3] WARN 12:59:03,415 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 14396 seconds. Retrying connection. INFO 12:59:03,557 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: Inconsistent number of provided filter names and expressions: names=[] exps=[DP > 10] at htsjdk.variant.variantcontext.VariantContextUtils.initializeMatchExps(VariantContextUtils.java:238) at htsjdk.variant.variantcontext.VariantContextUtils.initializeMatchExps(VariantContextUtils.java:249) at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.initialize(VariantFiltration.java:230) at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83) at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314) at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248) at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155) at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR MESSAGE: Inconsistent number of provided filter names and expressions: names=[] exps=[DP > 10]
ERROR ------------------------------------------------------------------------------------------

Any help will be appreciated.


Created 2014-10-11 16:43:53 | Updated | Tags: variantfiltration

Comments (11)

Hi all,

I am using GATK version-3.2-2 and called the variants using HaplotypeCaller using the below shown command:

 java -jar GenomeAnalysisTK.jar -R ref.fa -T HaplotypeCaller -I input.vcf -L region.bed -stand_emit_conf 10 -stand_call_conf 30  
 --genotyping_mode DISCOVERY -o var.vcf

And then selected the variants using SelectVariants and filtered using VariantFiltration by following the steps in the tutorial: https://www.broadinstitute.org/gatk/guide/topic?name=tutorials . However, i met with the following error:

"undefined variable ReadPosRankSum" and undefined variable "MappingQualityRankSum" . The same issue is discussed in the forum but could find a concrete solution to fix this. Could someone help?

Created 2014-10-11 15:08:10 | Updated | Tags: variantfiltration

Comments (6)


I am using GATK VariantFiltration tool to do some hardfiltering of variants and it works fine. However, the total variants remain same before and after filtering by marking the variants "PASS" that pass the filter. I explored through the documentation and forum to find out if there is a way to drop the variants from the file that do not meet the filtering criteria but couldn't find. Could someone give any suggestions to fix this.

Created 2014-08-22 10:43:41 | Updated | Tags: variantfiltration gatk filterexpression

Comments (2)

Hi all,

I tried to apply the following command to my raw vcf file to filter it using the filtering expression specified in the command:

java -XX:ConcGCThreads=4 -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=4 -jar GenomeAnalysisTK.jar -T VariantFiltration -R human_g1k_v37.fa --variant human_g1k_v37.CHL124.vcf_snps.vcf -o CHL124.vcf_snps.vcf_filter_marked.vcf --filterExpression "QD < 2.0 && MQ < 40.0 && FS > 60.0 && HaplotypeScore > 13.0 && MQRankSum < -12.5 && ReadPosRankSum < -8.0" --filterName "very_small_SNPs_default_filter"

After that, I check my result file which is CHL124.vcf_snps.vcf_filter_marked.vcf, I found that, all reads are marked as "PASS" whether its QD is > 2.0 or < 2.0. Obviously, the command doesn't work, but I cannot find why everything seems goes well, no error reported.

bless~ XL

Created 2014-06-03 11:39:15 | Updated | Tags: variantfiltration

Comments (1)

Are there any recommendations about what percentage of SNPs should be filtered out from the data set when tweaking the filter criteria?

Many thanks

Created 2014-04-25 12:58:05 | Updated | Tags: variantfiltration

Comments (2)

Hi, is it possible to have variants filtered out based on the number of samples with a GQ < 99 like you can do with DiagnoseTargets?

I believe that would come in handy if searching for denovo variants in trios (at least that's what I'm doing now and I have a lot of samples in which one of the parents has low GQ).

Thanks for your time!

Created 2014-03-19 15:37:57 | Updated | Tags: variantfiltrationwalker variantfiltration

Comments (8)

I wanted to use VariantFiltration/-G_filter/-G_filterName to filter some low quality genotype calls. With VariantFiltration (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_filters_VariantFiltration.html) I can use JEXL expressions (http://www.broadinstitute.org/gatk/guide/article?id=1255) to filter bad genotpe calls. For example I can use JEXL expression "DP<10" to filter calls in regions where coverage is too low. However, this only achieves adding a tag to the genotype calls. What I would really want is to set those genotype calls to missing (i.e. ./.), but I don't see an option to do so in the Walker. It seems like a missing feature.

Or maybe there are other ways to get what I need? Again, I don't want to filter the variant, only the genotypes.

Created 2014-02-10 19:57:41 | Updated 2014-02-10 20:24:46 | Tags: variantannotator variantstovcf variantfiltration vcf variant-calling minor-allele-frequency

Comments (5)


I have annotated my vcf file of 20 samples from Unified genotyper using the following steps.

Unified genotyper->Variantrecalibration->Applyrecalibration->VariantAnnotator

My question is how should I proceed if I have to select rare variants (MAF<1%) for the candidate genes that I have,for each of these 20 samples?

Created 2013-11-22 16:53:32 | Updated | Tags: variantfiltration

Comments (1)

Hi, I have run UnifiedGenotyper followed by application of hard filters as recommended in the GATK best practices on my targeted sequencing data. I've noticed, however, there are several variants with very high no-call rates (>90%) which still passed the variant filtration. I'm pasting below part of the vcf files for two such variants.

I've also noticed that most of high no-call rate variants have very low read depths. I read in other discussions that you don't recommend filtering variants by read depth, but I wonder if there is another filtering criteria you can recommend so that such variants wouldn't pass the filtering step (i.e. more stringent std_call_conf values?)?

I can surely filter out the variants based on their call rate before the downstream applications, but I'm trying to understand the sequencing quality metrics, and GATK's behavior here as to what quality of these variants makes them to get a pass in the filtration.

Thanks a lot,


for these two variants below, genotypes for only 2 and 1 (out of 278) people, respectively, were called:


1 11857410 rs7537955 A G 101.85 PASS AC=6;AF=1.00;AN=6;DB;DP=3;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;MLEAF=1.00;MQ=45.96;MQ0=0;QD=33.95; GT:AD:DP:GQ:PL ./. ./. 4 156661872 . C A 53.39 PASS AC=2;AF=1.00;AN=2;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=26.70; GT:AD:DP:GQ:PL ./. ./.

Created 2013-11-14 16:35:47 | Updated | Tags: selectvariants variantfiltration vcf plink vcftools

Comments (1)


I was wondering if you could use the toolkit to generate a separate VCF file containing only SNPs that are found at a predetermined chromosome and base pair position. I have a plink file which I want to convert back to VCF format and it seems unbelievably hard to do so I thought this may be a good way to get around that problem?

I am aware that vcftools offers this function with the "--positions " option, however for some reason I am getting far more variants than I listed and there is nothing wrong that is obvious with my listed positions/vcf file.

Thanks in advance, Danica

Created 2013-10-05 09:07:13 | Updated | Tags: variantfiltration

Comments (4)

Dear all,

I want to use GATK to filter a multisample VCF based on the PL numbers but I cannot figure out how to do this. I want to flag genotypes that are likely to be homozygous reference (with the aim of removing these eventually). I would like to look at the PL field and if the first PL number (which refers to the probability of a homozygous reference call) is 10 or less, flag these genotypes.

The variant Filtration page is quite helpful and suggests accessing arrays using things like: 'vc.getGenotype("NA12878").getAD().0 > 10'

but this is not quite what I want, as I want the filtering to be genotype based (not variant based) and I don't want to base the filtering on a single sample but on each sample separately. Basically I am looking for something like this:

--genotypeFilterExpression "PL.0 > 10"

where PL/0 is the first number of the PL array. I cannot figure out anywhere a way to do this. Can someone suggest a recipe to achieve this?

Thank you in advance.

Created 2013-09-16 14:32:55 | Updated | Tags: variantfiltration

Comments (2)


I would like to filter my multi-sample vcf using per sample metrics such as AD and PL. However, these are provided as comma-separated lists of numbers. Does anyone know how I can filter, for example, on PL of 0/1 genotype in sample A?

Best wishes,


Created 2013-09-06 17:43:41 | Updated | Tags: selectvariants variantfiltration variantcontext ssc

Comments (16)

Hello all,

First post. Thank you for these amazing tools. I have spent two days pulling my hair out, trying all enumerations, searching the documentation and forums, and in the end I come to you for help. Please forgive me if these topics have been covered elsewhere.

I have several VCFs generated by SomaticSniper that I'd like to filter based on the SomaticScore (SSC in the FORMAT field). I was working with VariantFiltration and SelectVariants, and trying to use different options, as well as regular expressions, to select those calls with a SSC over 40. I have been unable to do so. I also looked into trying to figure out JEXL, and using the last command listed on the documentation page, about using the VariantContext feature to drill into an array. I cannot get it to recognize the SSC column of the FORMAT field and then filter for the TUMOR sample.

Using VariantFiltration I was using -select (but I understand now that this searches the INFO field only). I was then using the --genotypeFilterExpression, but it would not add the FT tag to the FORMAT field as it said it would, it would just apply PASS to everything.

java -Xmx4-jar GenomeAnalysisTK.jar -T VariantFiltration -R ~/Documents/reference/human_g1k_v37.fasta -V '/home/registry3/Desktop/merged/104024sniperRAWSNPS.vcf' --G_filter "SSC < 40.0" --G_filterName "myFilter" -o '/home/registry3/Desktop/merged/104024sniperFILTEREDSNPS.vcf'

Using SelectVariants, I was using -sn to select the TUMOR sample and then using -select_expressions, but I guess this also only works on the INFO field. I had been trying to use --sample_expression which gives the ability to use a regular expression, but then I never had good results; it wouldn't do any filtering, and output the entire input file. Does the regular expression only apply to the sample name, and not the content of each line? Trying to select SSC over 40 from a line like this

1   10177   0   A   C   0   0   AC=1;AF=0.500;AN=2;DP=62    GT:AMQ:BCOUNT:BQ:DP:DP4:GQ:IGT:MQ:SS:SSC:VAQ    0/1:16,15:40,22,0,0:28,25:62:31,9,10,12:37:0/1:16:2:19:37

I used a line such as this, to look at the second to last number in the FORMAT field based on : dividers

java -Xmx4-jar GenomeAnalysisTK.jar -R ~/Documents/reference/human_g1k_v37.fasta -T SelectVariants --variant '/home/registry3/Desktop/merged/104024sniperSNPs.vcf' -o '/home/registry3/Desktop/merged/104024sniperSSC40.vcf' -se ".*:[4,5,6,7,8,9][1,2,3,4,5,6,7,8,9]:[1234567890]{2,3}$"

I am not a coder, as you can probably see, but I'm trying to get this worked out. This output the entire file still, with SSC values above and below 40.

Looking into use the vc.getGenotype array access, I could not find much documentation about VariantContext; I was looking through the files on github, looking through the code and looking for samples, since the .getAD() from the documentation seems to work, but alas, there is no .getSCC() available. Is using vc. the best way to drill into an array (the FORMAT field) and search for what I want?

I didn't post all the code and output, trying to keep this as short as possible. I can post pastebin outputs if that would be helpful. Thank you, David

Created 2013-05-07 16:55:34 | Updated | Tags: qualbydepth variantfiltration

Comments (2)


I am hoping to perform hard filtering on some variants from a sequencing project where, unfortunately, I do not have information from enough samples for VQSR. I was planning to filter on the QD value, but it seems to be very low for variants that seem reasonable. Example:

chr7    55249063 .       G       A       225     PASS
GT:AD:GQ:PL     0/1:1917,2657:99:255,0,255

This variant is shown in IGV in the attached file- it looks to be a true positive, but because of the high depth, QD is very low. Based on the QD documentation, it looks as QD simply cannot be used to filter high-coverage data, since the value is QUAL/unfiltered depth.

Is there an alternative annotation that expresses the same measure, since QD is recommended in all the hard filtering documentation? Would GQ be a good substitute?

Your help is much appreciated!

Created 2013-02-27 23:26:08 | Updated | Tags: variantfiltration

Comments (9)

My question could seems like here but, the answer didn't help me.

I am using VariantFiltration over a VCF file which is generated directly after UnifiedGenotype under GenomeAnalysisTK-2.3-9-ge5ebf34.

The error I am facing is

##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 126: there aren't enough columns for line 70 (we expected 9 tokens, and saw 1 )

Line number 126 is as following,

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  m1016ROUa.40287 m1023ROGa.40244 m1042ujba.40261 m1069FXFa.49470

And actually indeed it is the header of VCF file ! Should I re-run my samples ?!!

Created 2013-01-07 13:05:07 | Updated | Tags: variantfiltration dp

Comments (26)

Hi Team,

I have a multi-sample VCF file produced by UnifiedGenotyper. I now want to filter this file marking those variants with a low depth. However the DP entry in the info field is across all samples, and even if it were possible to assess the individual's DPs, I would then have to resolve the issue of a variant having low depth in one sample, and high in another. Any suggestions are appreciated.

Thanks for your time

Created 2012-10-30 15:55:37 | Updated 2013-01-07 20:11:03 | Tags: selectvariants variantfiltration vcf filtering

Comments (12)

I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.

While the raw VCF has plenty of QUAL scores that are very high, not a single call has a PASS associated with it in the Filter field- all are "." If I use SelectVaraints to filter the VCF based on high QUAL or DP values, or combination, the Filter field remains "." for the returned variants.

Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?

Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.

Created 2012-10-28 16:24:02 | Updated 2012-10-29 22:02:53 | Tags: variantfiltration filter

Comments (3)

Hi all,

I'm currently analysing non-human mammalian whole genome data (>30x). No previous variants databases are available.

I'm currently in the VariantFiltration step. I came around the following command which is used for human data, and I'm wondering if it will be good for non-human data:

java -Xmx10g -jar GenomeAnalysisTK.jar \
-R [reference.fasta] \
-T VariantFiltration \
--variant [input.recalibrated.vcf] \
-o [recalibrated.filtered.vcf] \
--clusterWindowSize 10 \
--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
--filterName "HARD_TO_VALIDATE" \
--filterExpression "DP < 5 " \
--filterName "LowCoverage" \
--filterExpression "QUAL < 30.0 " \
--filterName "VeryLowQual" \
--filterExpression "QUAL > 30.0 && QUAL < 50.0 " \
--filterName "LowQual" \
--filterExpression "QD < 1.5 " \
--filterName "LowQD" \
--filterExpression "SB > -10.0 " \
--filterName "StrandBias"

I would appreciate your thoughts on this matter.

Thank you very much!


Created 2012-10-18 19:42:38 | Updated 2013-01-07 20:12:46 | Tags: selectvariants variantfiltration

Comments (6)

I am trying to filter variant calls which have "GQ>=20.0".

GATK SelectVariants, gives no error but gives only the header in the output file

java -Xmx2g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -R xxx -T SelectVariants --variant xxx.var.flt.vcf -o xxx.vcf -select "GQ >= 20.0"

So, I tried using VariantFiltration followed by SelectVariants. The variant filtration seems to work fine adding FT tag to the format field. And then I am trying to get records having FT tag using the following commands

java -Xmx2g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -R xxx -T VariantFiltration --variant xxx.var.flt.vcf -o xxx_filtered.vcf --genotypeFilterExpression "GQ >= 20.0" --genotypeFilterName "qual_1_filters"

java -Xmx4g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -T SelectVariants -R xxx --variant xxx_filtered.vcf -select 'vc.hasAttribute("FT")' -o xxx_qual20.vcf 

but I only get header in the output vcf file.

I am not sure if this is the right approach. Any help would be appreciated.

Created 2012-09-19 21:40:58 | Updated 2013-01-07 20:36:46 | Tags: selectvariants variantfiltration

Comments (4)

Hi, I wanted to double check my methods for some targeted capture data. I ran 96 samples through UG to produce a multisample VCF. I separated snps and indels into separate files using SelectVariants, and applied filters:

For snps "QD < 2.0", "MQ < 40.0", "FS > 60.0", "HaplotypeScore > 13.0", "MQRankSum < -12.5", "ReadPosRankSum < -8.0"

For indels "QD < 2.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", "FS > 200.0"

I then went back through with SelectVariants, pulling out each sample one at a time into their own filtered VCF.

My results are... lets say, wrong. I am wondering if it would be better practice to select each sample first and then apply the filters, or if it does not matter and my errors lie elsewhere. Thank you.

Created 2012-09-18 19:16:05 | Updated 2012-09-18 20:21:25 | Tags: selectvariants variantfiltration jexl

Comments (10)


I have been trying get variants out of a VCF file where the Allele Frequency (AF) is greater than 4%. I have tried both VariantFiltration and SelectVariants but I get different errors with each. Here is my call for SelectVariants:

java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T SelectVariants -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf -select "AF > 0.04" -sn "S05-16209-1C_S4_L001_R1_001"

The error is:

MESSAGE: Invalid command line: Invalid JEXL expression detected for select-0 with message ![0,9]: 'AF > 0.04;' > error

For VariantFiltration the call is:

java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T VariantFiltration -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf --filterExpression 'AF > 0.040' --filterName "3perc"

The error is:

java.lang.ArithmeticException: Double coercion: java.util.ArrayList:([0.010, 0.010])
at org.apache.commons.jexl2.JexlArithmetic.toDouble(JexlArithmetic.java:1023)
at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:699)
at org.apache.commons.jexl2.JexlArithmetic.greaterThan(JexlArithmetic.java:790)
at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:796)
at org.apache.commons.jexl2.parser.ASTGTNode.jjtAccept(ASTGTNode.java:18)
at org.apache.commons.jexl2.Interpreter.interpret(Interpreter.java:232)
at org.apache.commons.jexl2.ExpressionImpl.evaluate(ExpressionImpl.java:65)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.evaluateExpression(VariantJEXLContext.java:267)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:233)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:118)
at org.broadinstitute.sting.utils.variantcontext.VariantContextUtils.match(VariantContextUtils.java:293)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.filter(VariantFiltration.java:331)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:270)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:80)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

For both I have tried variations of double quotes and different sigfigs. Also, it works when I select on parameters other than AF.

Am I missing something?