Tagged with #jexl
2 documentation articles | 0 announcements | 15 forum discussions



Created 2012-08-01 23:04:18 | Updated 2015-05-15 08:21:26 | Tags: selectvariants variantfiltration jexl
Comments (14)

This document describes methods for applying hard filters using JEXL queries.


1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.


2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".


4. Filtering on sample/genotype-level properties

You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. Variant Filtration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1).


5. Important caveats

Sensitivity to case and type

You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:

  • Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type

The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.


6. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The bets way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

Examples using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263

Examples using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'

Example using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'

Created 2012-07-23 17:50:07 | Updated 2015-05-15 23:55:20 | Tags: selectvariants jexl vcf callset
Comments (32)

This document describes why you might want to extract a subset of variants from a callset and how you would achieve this.


Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). The GATK tool that we use the most for subsetting calls in various ways is SelectVariants; it enables easy and convenient subsetting of VCF files according to many criteria.

Select Variants operates on VCF files (also sometimes referred to as ROD in our documentation, for Reference Ordered Data) provided at the command line using the GATK's built in --variant option. You can provide multiple VCF files for Select Variants, but at least one must be named 'variant' and this will be the file (or set of files) from which variants will be selected. Other files can be used to modify the selection based on concordance or discordance between the callsets (see --discordance / --concordance arguments in the tool documentation).

There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name, or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency. This is done using JEXL expressions; make sure to read the linked document for details, especially the section on working with complex expressions.

Note that in the output VCF, some annotations such as AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) are recalculated as appropriate to accurately reflect the composition of the subset callset. See further below for an explanation of how that works.


Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.


Subsetting by sample and ALT alleles

SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().

  • isVariant => is there an ALT allele?

  • isPolymorphic => is some sample non-ref in the samples?

For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.


How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB        MARY        LINDA
1/0:20     0/0:30      1/1:50

In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB        MARY
1/0:20     0/0:30

The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB        MARY
1/0:20     ./.:.

with AN=2, AC=1, AF=0.5, and DP=20.


Additional information

For information on how to construct regular expressions for use with this tool, see the method article on variant filtering with JEXL, or "Summary of regular-expression constructs" section here for more hardcore reading.

No posts found with the requested search criteria.

Created 2015-08-27 15:33:32 | Updated | Tags: jexl hard-filtering
Comments (8)

Does anyone know how to remove a whole variant record in which the call for any of the samples is ./. ?

It should be simple but have been trying filter-select and select+jexl for a while now but to no avail.


Created 2015-06-30 12:54:03 | Updated | Tags: combinevariants variantfiltration jexl
Comments (4)

Hi GATK team,

I've been asked to merge a large number of VCF file (one sample/vcf) but only if QUAL is lower than a given value. I would like to avoid to filter+copy each VCF and then merge with CombineVariants. So my question is: is it possible to give a (JEXL?) expression to accept/reject a variant in one vcf ?

Thanks ! Pierre


Created 2015-04-23 23:58:48 | Updated | Tags: jexl genotypeconcordance
Comments (17)

We have VCFs that we've applied genotype-level filters to using FilterVariants, and we are interested in using your –gfc/gfe flags in order to allow all genotypes with an non-"PASS" FT annotation to be set to no call for a concordance check using GenotypeConcordance.

I have tried using the follow JEXL expression 'FT!="PASS"' for –gfc and –gfe but this has not been successful in setting the genotypes that FAIL to no call. Is there way to achieve this genotype-level filtration within GATK GenotypeConcordace? Is there a different JEXL expression that I should be using?

(Asking on behalf of @azo121 who isn't able to post yet)

Thanks!


Created 2015-01-27 10:30:58 | Updated | Tags: variantfiltration jexl bug variantcontext hardfilters
Comments (8)

I was pulling my hair out over this one.

I was applying a hard filter to a genotyped gVCF using JEXL to access variant context attributes to decide what filter setting I would apply. The filter was

"vc.getGenotype("%sample%").isHomRef() ? vc.getGenotype("%sample%").getAD().size() == 1 ? DP < 10 : ( DP - MQ0 ) < 10 or ( MQ0 - (1.0 * DP) ) >= 0.1 or MQRankSum <= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81 : false"

In pseudocode it says:

 `if ( isHomRef ) then
   if ( getAD().size() == 1 ) then DP < 10 else
      ( DP - MQ0 ) < 10 or ( MQ0 - (1.0 * DP) ) >= 0.1 or MQRankSum >= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81 else ignore record`

The idea being that for records where not all reads contained the reference allele, we would filter out those positions where there was evidence to suggest that the reads supporting an alternate allele were of a significantly better quality. However, running this filter I keep getting the warning (snipped for clarity):

WARN [SNIP]... MQRankSum <= 3.2905 [SNIP]... : false;' undefined variable MQRankSum

So I thought the filter was failing. However, just as a test, I changed the direction of MQRankSum from >=3.2905 to <=3.2905 (a bit nonsensical, it should basically apply the filter to almost all HomRef positions that had any reads supporting an alternate allele).

I still get the warning but I found the filter was applied to variant records as it should be. e.g. the following went from PASS to BAD_HOMREF:

Supercontig_1.1 87 . C A . BAD_HOMREF BaseQRankSum=2.79;DP=40;MQ=39.95;MQ0=0;MQRankSum=-2.710e+00;ReadPosRankSum=0.819;VariantType=SNP GT:AB:AD:DP 0/0:0.870:34,5:39

So the filter is being correctly applied, but I am not sure why all the warnings are being generated? Is this a bug? Have I done something wrong?


Created 2015-01-26 11:01:54 | Updated 2015-01-26 11:47:17 | Tags: fisherstrand haplotypecaller jexl strand-bias filtering hardfilters vcf-filtering
Comments (1)

Hi, I need to apply hard filters to my data. In cases where I have lower coverage I plan to use the Fisher Strand annotation, and in higher coverage variant calls, SOR (using a JEXL expression to switch between them: DP < 20 ? FS > 50.0 : SOR > 3).

The variant call below (some annotations snipped), which is from a genotyped gVCF from HaplotypeCaller (using a BQSR'ed BAM file), looks well supported (high QD, high MQ, zero MQ0). However, there appears to be some strand bias (SOR=3.3):

788.77 . DP=34;FS=5.213;MQ=35.37;MQ0=0;QD=25.44;SOR=3.334 GT:AD:DP:GQ:PL 1/1:2,29:31:35:817,35,0

In this instance the filter example above would be applied.

My Question

Is this filtering out a true positive? And what kind of cut-offs should I be using for FS and SOR?

The snipped annotations ReadPosRankSum=-1.809 and BaseQRankSum=-0.8440 for this variant also indicate minor bias that the evidence to support this variant call also has some bias (the variant appears near the end of reads in low quality bases, compared to the reads supporting the reference allele).

My goal

This is part of a larger hard filter I'm applying to a set of genotyped gVCFs called from HaplotypeCaller.

I'm filtering HomRef positions using this JEXL filter:

vc.getGenotype("%sample%").isHomRef() ? ( vc.getGenotype("%sample%").getAD().size == 1 ? (DP < 10) : ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || MQRankSum > 3.2905 || ReadPosRankSum > 3.2905 || BaseQRankSum > 3.2905 ) ) : false

And filtering HomVar positions using this JEXL:

vc.getGenotype("%sample%").isHomVar() ? ( vc.getGenotype("%sample%").getAD().0 == 0 ? ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || MQ < 30.0 ) : ( BaseQRankSum < -3.2905 || MQRankSum < -3.2905 || ReadPosRankSum < -3.2905 || (MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || (DP < 20 ? FS > 60.0 : SOR > 3.5) || MQ < 30.0 || QUAL < 100.0 ) ) : false

My goal is true positive variants only and I have high coverage data, so the filtering should be relatively stringent. Unfortunately I don't have a database I could use to apply VQSR, henceforth the comprehensive filtering strategy.


Created 2015-01-15 23:50:50 | Updated | Tags: selectvariants jexl
Comments (2)

Hi,

I am trying to use select variants along with the JEXL syntax. In this specific case I wish to select those variants that

(1)Were heterozygous and,

(2)The ALT allele had more than 8 reads and,

(3) The alternate allele frequency exceeds 20% or as I define in a calculation: (ALT_AD)/(ALT_AD+REF_AD) >= 0.20

My JEXL looks like this:

(vc.getGenotype(“SAMP1").isHet()) && (vc.getGenotype(“SAMP1").getAD().1 >= 8) && (((vc.getGenotype(“SAMP1").getAD().1) / ((vc.getGenotype(“SAMP1").getAD().0)+(vc.getGenotype("SAMP1").getAD().1))) >= 0.2)

In this case, I get no results ( although GATK executes succesfully without errors). However when I exclude the third clause I get the expected results namely those variants that are (1)heterozygous and (2) The ALT_AD exceeds 8 reads. Like so:

(vc.getGenotype(“SAMP1").isHet()) && (vc.getGenotype(“SAMP1").getAD().1 >= 8)

To troubleshoot this issue I thought it might be a decimal vs. integer issue and I added a decimal value to my AD calculation, like so:

(vc.getGenotype(“SAMP1").isHet()) && (vc.getGenotype(“SAMP1").getAD().1 >= 8) && (((vc.getGenotype(“SAMP1").getAD().1 + 0.000000001) / ((vc.getGenotype(“SAMP1").getAD().0)+(vc.getGenotype("SAMP1").getAD().1))) >= 0.2)

The result was that select variants proceeded to select calls as if that third clause did not exist. I am not sure what I might be overlooking, but any insight much appreciated.


Created 2015-01-11 23:30:38 | Updated | Tags: selectvariants variantfiltration jexl snps
Comments (2)

I am experimenting with JEXL expressions in order to do some hard filtering on variants. I wonder if there is a method to tell the filter expression to operate on the current sample (assuming here a single sample VCF file) without knowing the sample name a priori e.g.

vc.getGenotype("Sample1").isHet()

Works just fine to sample heterozygous positions from a VCF where I know the sample will be called "Sample1", but can I refer to a sample name programmatically, e.g. something like: vc.getGenotype( sample() ).isHet()

Sorry if this is a really dumb question. (BTW I realise you could use a genotype_filter e.g. --genotypeFilterExpression "isHet == 1" to do the same thing, but I want to annotate the VCF in the FORMAT/FILTER field, rather than the INFO field for downstream variant selection operations.


Created 2014-09-25 00:20:55 | Updated | Tags: selectvariants jexl
Comments (2)

Hi GATK,

Thanks to you, we've got an amazing .vcf file just bursting with SNPs. If my file contains five replicates of a control group and five replicates of a treatment group (so 10 individuals total) and I'd like to select variants which are the same across all members of the control group, the same across all members of the replicate group, but different between the two groups, can I accomplish that with SelectVariants and JEXL? Or via some even easier method?

Thanks for your time. Please feel free to rather curtly request a clarification if I've garbled my question.

Aloha,

Brian


Created 2013-11-05 16:43:51 | Updated | Tags: allelebalancebysample selectvariants jexl
Comments (3)

Hello,

I would like to filter my variants using the SelectVariants walker but it throws an error when I try to filter on allele balance by sample. The jexl expression I use is:

vc.getGenotype("sample").getAB()>=0.25

error is: unknown, ambiguous or inaccessible method getAB

Is there any way of filtering on this parameter?

Best wishes,

Kath


Created 2013-06-30 14:38:27 | Updated | Tags: selectvariants jexl indels
Comments (1)

How can I select indels with lenght smaller than 10 bp from a vcf file?

I tried

java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fa --variant INDEL.vcf -o INDEL_maxLenght10.vcf -select 'vc.getIndelLengths().0 < 10'

but the output still contains all the Indels, also the ones larger than 10 bp.


Created 2013-03-14 19:37:04 | Updated 2013-03-14 19:37:44 | Tags: selectvariants jexl
Comments (5)

I'm trying to use JEXL to filter variants but something isn't working and I can't figure it out. I'm hoping someone can point me in the right direction. My VCF file contains an INFO field 1000g2012Apr_ALL. Some of the variants in my VCF have an entry for this field, some don't. I want to filter my VCF file for entries that are below a certain value or are NULL (empty).

Here's what my command looks like:

java -Xmx4G -jar GenomeAnalysisTK.jar -T SelectVariants -R hg19.fa -V my.vcf -o my.1kgfiltered.vcf -select 'vc.getAttribute("1000g2012Apr_ALL") < 0.01' -select '!vc.hasAttribute("1000g2012Apr_ALL")'

The problem is the 2nd select statement. I can't seem to get a JEXL select statement to give me the entries where 1000g2012Apr_ALL are empty. How do I accomplish this?


Created 2013-01-24 22:13:52 | Updated | Tags: varianteval jexl
Comments (2)

Hello,

I'm currently running variantEval to count up variants per individual stratified by a variety of annotations.

My GATK call looks like:

java -jar /humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar \ -T VariantEval \ -R Homo_sapiens_assembly19.fasta \ -o output.txt \ -L input.vcf \ -eval input.vcf \ -ST Sample -noST \ -noEV -EV CountVariants \ -ST JexlExpression --select_names "nonsynon" --select_exps "resource.VAT_CDS == 'nonsynonymous' && resource.FOUNDERS_FRQ > 0.05" \ -ST JexlExpression --select_names "synon" --select_exps "resource.VAT_CDS == 'synonymous' && resource.FOUNDERS_FRQ > 0.05" ...

where the VAT_CDS section of the INFO field in the VCF has a functional annotation or is set to "na" if an annotation is unavailable. I'm getting the following error:

ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Invalid command line: Invalid JEXL expression detected for synon with message For input string: "nan"
ERROR ------------------------------------------------------------------------------------------

but weirdly the error is not consistent (my data is stratified by chromosome and most chromosomes will run without throwing the error while one or two chromosomes do exhibit the error. Do you have any ideas what's causing this behavior?

Thanks!


Created 2012-12-04 20:12:44 | Updated 2012-12-04 20:15:44 | Tags: varianteval jexl genotype
Comments (1)

While running VariantEval, I'm trying to stratify by a JexlExpression by setting using

-ST Sample -ST JexlExpression -select "GQ>20"

This fails with a "variable does not exist" error despite the GQ existing in all genotypes in the vcf. Looking at the code it seems that the pathway that loads the JexlExpression in the VariantEval class specifically does not provide the genotype as context (only the VariantContext) and thus, the context for the Jexl does not include GT and the error is produced.

My question is: Is this a feature or a bug? It seems possible to add the genotype (when the VC only has one, or loop over the genotypes and either OR or AND the results (perhaps another input similar to -isr?), but perhaps I'm missing something subtle?

Would you like this behavior or are you happy with the current operation of jexlExpression?

Cheers!


Created 2012-10-10 14:48:07 | Updated 2012-10-18 01:16:20 | Tags: unifiedgenotyper jexl
Comments (2)

Dear GATK team,

I'm trying to call variants on some 'haploid' human data (Illumina reads from a mix of clones). I did exactly the following:

crd8% cd /wga/dev/jaffe
crd8% mkdir GATK2
crd8% cd GATK2
crd8% (got GenomeAnalysisTK-2.1-12.tar.bz2 from the GATK site on the web)
crd8% bunzip2 GenomeAnalysisTK-2.1-12.tar.bz2
crd8% cat GenomeAnalysisTK-2.1-12.tar | tar xf -

crd8% cd /local/scratch/jaffe/BroadCRD/fixed

crd8% java -jar /wga/dev/jaffe/GATK2/GenomeAnalysisTK-2.1-12-ga99c19d/GenomeAnalysisTK.jar -R /wga/scr2/bigrefs/human19/genome.fasta -T UnifiedGenotyper -I frag.list -o raw2.vcf -U -baq CALCULATE_AS_NECESSARY -nt 48

crd8% java -jar /wga/dev/jaffe/GATK2/GenomeAnalysisTK-2.1-12-ga99c19d/GenomeAnalysisTK.jar -R /wga/scr2/bigrefs/human19/genome.fasta -T VariantFiltration -U -V raw2.vcf -o var.vcf --filterExpression "QD<5.0||AC<2||DP<6" --filterName junk

The last command failed, with output

... 
INFO  10:38:07,429 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
WARN  10:38:08,453 Interpreter - ![12,18]: 'QD < 5.0 || AC < 2 || DP < 6;' < error 
java.lang.ArithmeticException: Long coercion: java.util.ArrayList:([1, 1])
    at org.apache.commons.jexl2.JexlArithmetic.toLong(JexlArithmetic.java:914)
    at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:718)
    at org.apache.commons.jexl2.JexlArithmetic.lessThan(JexlArithmetic.java:774)
    at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:967)
    at org.apache.commons.jexl2.parser.ASTLTNode.jjtAccept(ASTLTNode.java:18)
...

Can you please suggest a solution? You should be able to access my data, if that would help. Well actually you would have to login to crd8, but you could copy files from there.

Thank you very much.

David


Created 2012-09-18 19:16:05 | Updated 2012-09-18 20:21:25 | Tags: selectvariants variantfiltration jexl
Comments (10)

Hi,

I have been trying get variants out of a VCF file where the Allele Frequency (AF) is greater than 4%. I have tried both VariantFiltration and SelectVariants but I get different errors with each. Here is my call for SelectVariants:

java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T SelectVariants -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf -select "AF > 0.04" -sn "S05-16209-1C_S4_L001_R1_001"

The error is:

MESSAGE: Invalid command line: Invalid JEXL expression detected for select-0 with message ![0,9]: 'AF > 0.04;' > error

For VariantFiltration the call is:

java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T VariantFiltration -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf --filterExpression 'AF > 0.040' --filterName "3perc"

The error is:

java.lang.ArithmeticException: Double coercion: java.util.ArrayList:([0.010, 0.010])
at org.apache.commons.jexl2.JexlArithmetic.toDouble(JexlArithmetic.java:1023)
at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:699)
at org.apache.commons.jexl2.JexlArithmetic.greaterThan(JexlArithmetic.java:790)
at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:796)
at org.apache.commons.jexl2.parser.ASTGTNode.jjtAccept(ASTGTNode.java:18)
at org.apache.commons.jexl2.Interpreter.interpret(Interpreter.java:232)
at org.apache.commons.jexl2.ExpressionImpl.evaluate(ExpressionImpl.java:65)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.evaluateExpression(VariantJEXLContext.java:267)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:233)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:118)
at org.broadinstitute.sting.utils.variantcontext.VariantContextUtils.match(VariantContextUtils.java:293)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.filter(VariantFiltration.java:331)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:270)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:80)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

For both I have tried variations of double quotes and different sigfigs. Also, it works when I select on parameters other than AF.

Am I missing something?