JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.
In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.
JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:
"QUAL > 30.0"
QUAL is a key: the name of the annotation we want to look at30.0 is a value: the threshold that we want to use to evaluate variant quality against> is an operator: it determines which "side" of the threshold we want to selectThe complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:
"MY_STRING_KEY == 'foo'"
You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:
"QUAL / DP < 10.0"
You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):
"QUAL > 30.0 && DP == 10"
where && is the logical "AND".
Or if you want to select variants that have at least one of several conditions fulfilled:
"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
where || is the logical "OR".
It is very important to note that the JEXL evaluation subprogram cannot correctly handle cases where the annotations requested by the JEXL expression are missing for some variants in a VCF record. It will throw an exception (i.e. fail with an error) when it encounters this scenario. The default behavior of the GATK is to handle this by having the entire expression evaluate to FALSE in such cases (although some tools provide options to change this behavior). This is extremely important especially when constructing complex expressions, because it affects how you should interpret the result.
For example, looking again at that last expression:
"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"
When run against a VCF record with INFO field
QD=10.0;FS=300.0;ReadPosRankSum=-10.0
it will evaluate to TRUE because the FS value is greater than 200.0.
But when run against a VCF record with INFO field
QD=10.0;FS=300.0
it will evaluate to FALSE because there is no ReadPosRankSum value defined at all and JEXL fails to evaluate it.
This means that when you're trying to filter out records with VariantFiltration, for example, the previous record would be marked as PASSing, even though it contains a bad FS value.
For this reason, we highly recommend that complex expressions involving OR operations be split up into separate expressions whenever possible. For example, the previous example would have 3 distinct expressions: "QD < 2.0", "ReadPosRankSum < -20.0", and "FS > 200.0". This way, although the ReadPosRankSum expression evaluates to FALSE when the annotation is missing, the record can still get filtered (again using the example of VariantFiltration) when the FS value is greater than 200.0.
Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.
The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).
Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.
If you are familiar with the VariantContext, Genotype and its associated classes and methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the vc variable.
For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'
Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:
! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263
The classic way of evaluating a boolean goes like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'
But you can also use the VariantContext object like this:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'
Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:
java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'