Tagged with #jexl

Comments (48)

1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, consider this simple JEXL expression, which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold against which we want to evaluate variant quality
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

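To make the key/operator/value split concrete, here is a minimal Python sketch of the selection logic. It is only a toy model: the GATK hands the expression string to the JEXL library, and nothing below reflects how either one is implemented. The matches helper, the OPERATORS table and the two records are hypothetical, invented for illustration.

```python
import operator

# Toy model only: this is NOT how the GATK or JEXL parse expressions.
OPERATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

def matches(record, key, op, value):
    """Return True if record[key] <op> value holds."""
    return OPERATORS[op](record[key], value)

# Hypothetical variant records, reduced to the annotations of interest.
variants = [
    {"QUAL": 45.3, "MY_STRING_KEY": "foo"},
    {"QUAL": 12.0, "MY_STRING_KEY": "bar"},
]

# Counterpart of "QUAL > 30.0": numeric value, no quotes.
high_quality = [v for v in variants if matches(v, "QUAL", ">", 30.0)]

# Counterpart of "MY_STRING_KEY == 'foo'": string value, quoted.
foo_only = [v for v in variants if matches(v, "MY_STRING_KEY", "==", "foo")]
```

In this sketch both selections keep only the first record. Note that the string value is quoted and the numeric one is not, mirroring the single-quote rule described above.
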
3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".
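
The three expressions in this section can be mirrored in plain Python to see which records each one would pick. This is a sketch of the logic only, not of JEXL or the GATK; the records and the function names are hypothetical.

```python
# Hypothetical records; the annotation values are made up for illustration.
variants = [
    {"QUAL": 50.0, "DP": 10, "QD": 5.0, "ReadPosRankSum": 0.1, "FS": 3.0},
    {"QUAL": 50.0, "DP": 4, "QD": 1.5, "ReadPosRankSum": 0.0, "FS": 1.0},
    {"QUAL": 20.0, "DP": 10, "QD": 8.0, "ReadPosRankSum": -25.0, "FS": 250.0},
]

def low_qual_by_depth(v):
    # "QUAL / DP < 10.0": a metric derived from two annotations.
    return v["QUAL"] / v["DP"] < 10.0

def good_qual_and_depth(v):
    # "QUAL > 30.0 && DP == 10": both conditions must hold.
    return v["QUAL"] > 30.0 and v["DP"] == 10

def any_failure(v):
    # "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0": one is enough.
    return v["QD"] < 2.0 or v["ReadPosRankSum"] < -20.0 or v["FS"] > 200.0

ratio_hits = [low_qual_by_depth(v) for v in variants]
and_hits = [good_qual_and_depth(v) for v in variants]
or_hits = [any_failure(v) for v in variants]
```

With these made-up records, only the first one satisfies the AND rule, while the second and third each trip at least one condition of the OR rule.
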

4. Important caveats

Sensitivity to case and type

  • Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type

The type (i.e. string, integer, non-integer or boolean) of each value in your expression must be exactly the same as that of the annotation you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written with an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.
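
One way to build such a test panel is to enumerate every yes/no combination of the individual criteria and compare the rule you meant with the rule you actually wrote. The Python sketch below does this for three criteria. It models the boolean logic only; the criterion names and both rules are hypothetical.

```python
from itertools import product

def intended(low_qd, bad_rank, high_fs):
    # The rule we mean: flag if QD is low, or if both other criteria fail.
    return low_qd or (bad_rank and high_fs)

def as_written(low_qd, bad_rank, high_fs):
    # The same criteria, grouped differently by mistake.
    return (low_qd or bad_rank) and high_fs

# Every yes/no combination of the three criteria (2 ** 3 = 8 cases).
panel = list(product([False, True], repeat=3))

# Combinations on which the two rules give different answers.
disagreements = [c for c in panel if intended(*c) != as_written(*c)]
```

Here the two rules disagree on two of the eight combinations, both with a low QD and a normal FS, which is exactly the kind of discrepancy that is easy to miss when testing on a single site.
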

5. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Accessing the underlying VariantContext directly

If you are familiar with the VariantContext and Genotype classes and their associated methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the vc variable.

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by accessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of a JEXL expression that finds all novel SNPs in the total set that have an allele frequency above 0.25 but below 1, are not filtered, and are not homozygous-reference in sample 01-0263:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263

Using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'

6. Using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
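
The reason for picking out a single element can be illustrated with a small Python analogy. This is not JEXL and the error type differs from the Java exception the GATK would report, but the underlying problem is the same: a list cannot be compared directly to a number. The genotype record and the helper names are hypothetical.

```python
# Hypothetical per-sample genotype fields; AD lists one depth per allele.
genotype = {"GT": "0/1", "AD": [12, 3]}

def compare_whole_list(gt):
    # Comparing the list itself to a number is a type error in Python.
    return gt["AD"] > 10

def compare_first_element(gt):
    # Pick one element first, as ".0" does in the expression above.
    return gt["AD"][0] > 10

try:
    compare_whole_list(genotype)
    whole_list_ok = True
except TypeError:
    whole_list_ok = False

first_element_hit = compare_first_element(genotype)
```

In this sketch the whole-list comparison fails, while the comparison on the first element (the depth of the reference allele, by VCF convention) succeeds and evaluates to true.
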
Comments (3)

Hello,

I would like to filter my variants using the SelectVariants walker but it throws an error when I try to filter on allele balance by sample. The jexl expression I use is:

vc.getGenotype("sample").getAB()>=0.25

error is: unknown, ambiguous or inaccessible method getAB

Is there any way of filtering on this parameter?

Best wishes,

Kath

Comments (1)

How can I select indels with a length smaller than 10 bp from a VCF file?

I tried

java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fa --variant INDEL.vcf -o INDEL_maxLenght10.vcf -select 'vc.getIndelLengths().0 < 10'

but the output still contains all the Indels, also the ones larger than 10 bp.

Comments (5)

I'm trying to use JEXL to filter variants but something isn't working and I can't figure it out. I'm hoping someone can point me in the right direction. My VCF file contains an INFO field 1000g2012Apr_ALL. Some of the variants in my VCF have an entry for this field, some don't. I want to filter my VCF file for entries that are below a certain value or are NULL (empty).

Here's what my command looks like:

java -Xmx4G -jar GenomeAnalysisTK.jar -T SelectVariants -R hg19.fa -V my.vcf -o my.1kgfiltered.vcf -select 'vc.getAttribute("1000g2012Apr_ALL") < 0.01' -select '!vc.hasAttribute("1000g2012Apr_ALL")'

The problem is the 2nd select statement. I can't seem to get a JEXL select statement to give me the entries where 1000g2012Apr_ALL are empty. How do I accomplish this?

Comments (2)

Hello,

I'm currently running VariantEval to count up variants per individual stratified by a variety of annotations.

My GATK call looks like:

java -jar /humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar \
    -T VariantEval \
    -R Homo_sapiens_assembly19.fasta \
    -o output.txt \
    -L input.vcf \
    -eval input.vcf \
    -ST Sample -noST \
    -noEV -EV CountVariants \
    -ST JexlExpression --select_names "nonsynon" --select_exps "resource.VAT_CDS == 'nonsynonymous' && resource.FOUNDERS_FRQ > 0.05" \
    -ST JexlExpression --select_names "synon" --select_exps "resource.VAT_CDS == 'synonymous' && resource.FOUNDERS_FRQ > 0.05" ...

where the VAT_CDS section of the INFO field in the VCF has a functional annotation or is set to "na" if an annotation is unavailable. I'm getting the following error:

ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Invalid command line: Invalid JEXL expression detected for synon with message For input string: "nan"
ERROR ------------------------------------------------------------------------------------------

but weirdly the error is not consistent: my data is stratified by chromosome, and most chromosomes run without throwing the error while one or two chromosomes do exhibit it. Do you have any ideas what's causing this behavior?

Thanks!

Comments (1)

While running VariantEval, I'm trying to stratify by a JexlExpression by using

-ST Sample -ST JexlExpression -select "GQ>20"

This fails with a "variable does not exist" error despite the GQ existing in all genotypes in the vcf. Looking at the code it seems that the pathway that loads the JexlExpression in the VariantEval class specifically does not provide the genotype as context (only the VariantContext) and thus, the context for the Jexl does not include GT and the error is produced.

My question is: Is this a feature or a bug? It seems possible to add the genotype (when the VC only has one), or to loop over the genotypes and either OR or AND the results (perhaps with another input similar to -isr?), but perhaps I'm missing something subtle?

Would you like this behavior or are you happy with the current operation of jexlExpression?

Cheers!

Comments (2)

Dear GATK team,

I'm trying to call variants on some 'haploid' human data (Illumina reads from a mix of clones). I did exactly the following:

crd8% cd /wga/dev/jaffe
crd8% mkdir GATK2
crd8% cd GATK2
crd8% (got GenomeAnalysisTK-2.1-12.tar.bz2 from the GATK site on the web)
crd8% bunzip2 GenomeAnalysisTK-2.1-12.tar.bz2
crd8% cat GenomeAnalysisTK-2.1-12.tar | tar xf -

crd8% cd /local/scratch/jaffe/BroadCRD/fixed

crd8% java -jar /wga/dev/jaffe/GATK2/GenomeAnalysisTK-2.1-12-ga99c19d/GenomeAnalysisTK.jar -R /wga/scr2/bigrefs/human19/genome.fasta -T UnifiedGenotyper -I frag.list -o raw2.vcf -U -baq CALCULATE_AS_NECESSARY -nt 48

crd8% java -jar /wga/dev/jaffe/GATK2/GenomeAnalysisTK-2.1-12-ga99c19d/GenomeAnalysisTK.jar -R /wga/scr2/bigrefs/human19/genome.fasta -T VariantFiltration -U -V raw2.vcf -o var.vcf --filterExpression "QD<5.0||AC<2||DP<6" --filterName junk

The last command failed, with output

... 
INFO  10:38:07,429 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
WARN  10:38:08,453 Interpreter - ![12,18]: 'QD < 5.0 || AC < 2 || DP < 6;' < error 
java.lang.ArithmeticException: Long coercion: java.util.ArrayList:([1, 1])
    at org.apache.commons.jexl2.JexlArithmetic.toLong(JexlArithmetic.java:914)
    at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:718)
    at org.apache.commons.jexl2.JexlArithmetic.lessThan(JexlArithmetic.java:774)
    at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:967)
    at org.apache.commons.jexl2.parser.ASTLTNode.jjtAccept(ASTLTNode.java:18)
...

Can you please suggest a solution? You should be able to access my data, if that would help. Well actually you would have to login to crd8, but you could copy files from there.

Thank you very much.

David

Comments (10)

Hi,

I have been trying to get variants out of a VCF file where the Allele Frequency (AF) is greater than 4%. I have tried both VariantFiltration and SelectVariants but I get different errors with each. Here is my call for SelectVariants:

java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T SelectVariants -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf -select "AF > 0.04" -sn "S05-16209-1C_S4_L001_R1_001"

The error is:

MESSAGE: Invalid command line: Invalid JEXL expression detected for select-0 with message ![0,9]: 'AF > 0.04;' > error

For VariantFiltration the call is:

java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T VariantFiltration -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf --filterExpression 'AF > 0.040' --filterName "3perc"

The error is:

java.lang.ArithmeticException: Double coercion: java.util.ArrayList:([0.010, 0.010])
at org.apache.commons.jexl2.JexlArithmetic.toDouble(JexlArithmetic.java:1023)
at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:699)
at org.apache.commons.jexl2.JexlArithmetic.greaterThan(JexlArithmetic.java:790)
at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:796)
at org.apache.commons.jexl2.parser.ASTGTNode.jjtAccept(ASTGTNode.java:18)
at org.apache.commons.jexl2.Interpreter.interpret(Interpreter.java:232)
at org.apache.commons.jexl2.ExpressionImpl.evaluate(ExpressionImpl.java:65)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.evaluateExpression(VariantJEXLContext.java:267)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:233)
at org.broadinstitute.sting.utils.variantcontext.JEXLMap.get(VariantJEXLContext.java:118)
at org.broadinstitute.sting.utils.variantcontext.VariantContextUtils.match(VariantContextUtils.java:293)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.filter(VariantFiltration.java:331)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:270)
at org.broadinstitute.sting.gatk.walkers.filters.VariantFiltration.map(VariantFiltration.java:80)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:65)
at org.broadinstitute.sting.gatk.traversals.TraverseLoci.traverse(TraverseLoci.java:18)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:265)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

For both I have tried variations of double quotes and different sigfigs. Also, it works when I select on parameters other than AF.

Am I missing something?