# Tagged with #variantfiltration 3 documentation articles | 0 announcements | 30 forum discussions

### 1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

### 2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"

• QUAL is a key: the name of the annotation we want to look at
• 30.0 is a value: the threshold that we want to use to evaluate variant quality against
• > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"


### 3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"


You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"


where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"


where || is the logical "OR".
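To make the difference between the arithmetic, AND and OR forms concrete, here is a minimal Python sketch that applies the three expressions above to a few records. This is not GATK code: the annotation values and the helper names (`ratio_below`, `both_conditions`, `any_condition`) are made up for illustration, and GATK itself evaluates these expressions through JEXL.

```python
# Hypothetical annotation values for three variant records (illustration only).
records = [
    {"QUAL": 45.3, "DP": 10, "QD": 4.5, "ReadPosRankSum": -1.2, "FS": 3.0},
    {"QUAL": 25.0, "DP": 10, "QD": 1.5, "ReadPosRankSum": 0.4, "FS": 250.0},
    {"QUAL": 90.0, "DP": 4, "QD": 22.5, "ReadPosRankSum": -25.0, "FS": 1.0},
]

def ratio_below(rec):
    # Mirrors "QUAL / DP < 10.0": a metric computed from two annotations.
    return rec["QUAL"] / rec["DP"] < 10.0

def both_conditions(rec):
    # Mirrors "QUAL > 30.0 && DP == 10": every condition must hold.
    return rec["QUAL"] > 30.0 and rec["DP"] == 10

def any_condition(rec):
    # Mirrors "QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0":
    # one fulfilled condition is enough.
    return rec["QD"] < 2.0 or rec["ReadPosRankSum"] < -20.0 or rec["FS"] > 200.0

ratio_hits = [ratio_below(r) for r in records]
and_hits = [both_conditions(r) for r in records]
or_hits = [any_condition(r) for r in records]
```

Only the first record satisfies the AND expression, while the second and third each satisfy the OR expression through a single annotation.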

### 4. Important caveats

#### Sensitivity to case and type

• Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

• Type

The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

#### Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.
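One way to see why mixed AND/OR expressions deserve testing is to enumerate every yes/no combination of the criteria and compare the two possible readings. The sketch below does this in Python for three generic criteria A, B and C; it illustrates the testing idea only and does not claim anything about how JEXL groups operators.

```python
from itertools import product

def grouped_left(a, b, c):
    # Reading 1: (A AND B) OR C
    return (a and b) or c

def grouped_right(a, b, c):
    # Reading 2: A AND (B OR C)
    return a and (b or c)

# Every yes/no combination of three criteria: 2**3 = 8 cases.
panel = list(product([False, True], repeat=3))

# Combinations on which the two readings disagree.
disagreements = [combo for combo in panel
                 if grouped_left(*combo) != grouped_right(*combo)]

for a, b, c in disagreements:
    print(f"A={a} B={b} C={c}: {grouped_left(a, b, c)} vs {grouped_right(a, b, c)}")
```

The two readings differ whenever A is false and C is true, which is exactly the kind of site a test panel should include.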

### 5. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

#### Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The best way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

#### Examples using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by accessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'


Groovy, right? Now here's a more sophisticated example of a JEXL expression that finds all novel variants in the total set that have an allele frequency above 0.25 but below 1, are not filtered, and are non-reference in the 01-0263 sample:

-select '! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' -o 01-0263.high_freq_novels.vcf -sn 01-0263


#### Examples using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'


But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'


#### Example using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
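To show what "a list of integers" means in practice, here is a small Python sketch that pulls the AD values out of a sample column and applies the same test as `getAD().0 > 10`. The helper name `format_values` and the FORMAT and sample values are made up for illustration; GATK does this through the VariantContext, not through string parsing.

```python
def format_values(format_field, sample_field, key):
    """Return the comma-separated values of one FORMAT key as a list of ints."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    raw = values[keys.index(key)]
    return [int(x) for x in raw.split(",")]

# Illustrative FORMAT and sample columns (made-up values).
fmt = "GT:AD:DP:GQ:PL"
sample = "0/1:14,9:23:99:255,0,310"

ad = format_values(fmt, sample, "AD")   # one depth per allele
ref_depth_ok = ad[0] > 10               # same test as getAD().0 > 10
```

The same indexing idea applies to other list-valued FORMAT fields, such as PL.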


A new tool has been released!

Check out the documentation at VariantFiltration.

## VariantFiltration

For a complete, detailed argument reference, refer to the GATK document page here.

The documentation for Using JEXL expressions within the GATK contains very important information about limitations of the filtering that can be done; in particular please note the section on working with complex expressions.

## Filtering Individual Genotypes

One can now filter individual samples/genotypes in a VCF based on information from the FORMAT field: VariantFiltration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record's FILTER tag). This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, one can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods so that one can now filter out hets (isHet == 1), refs (isHomRef == 1), or homs (isHomVar == 1).
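As a rough picture of what the three convenience methods distinguish, the following Python sketch classifies a VCF GT string. It is a simplification written for illustration (the function name `classify_genotype` is hypothetical, and partially missing genotypes are simply treated as no-calls); it is not how GATK implements isHet, isHomRef or isHomVar.

```python
import re

def classify_genotype(gt):
    """Classify a VCF GT string as 'homref', 'het', 'homvar' or 'nocall'."""
    # Alleles are separated by "/" (unphased) or "|" (phased).
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "nocall"
    if len(set(alleles)) > 1:
        return "het"
    return "homref" if alleles[0] == "0" else "homvar"

calls = {gt: classify_genotype(gt) for gt in ["0/0", "0/1", "1/1", "1|2", "./."]}
```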


Hello, I'm trying to figure out what's wrong in my script for this variant filtration argument. Here is what I ran:

java -jar tools/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \
-V vcftoolsextractGATK_snps.vcf \
-o GATK_darwinfinch_filter.vcf \
--filterExpression QD < 2 && FS > 60.0 && MQ < 50.0 && HaplotypeScore > 10.0 && MappingQualityRankSum < -4 && ReadPosRankSum < -2 \
--filterName “darwinfinchfilter”

I got the following error message: "2: No such file or directory."

However, I know my file paths were set correctly, because if I remove the last two arguments and run only the following, it runs:

java -jar tools/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \
-V vcftoolsextractGATK_snps.vcf \
-o GATK_darwinfinch_filter.vcf

I also tried Geraldine's example filterExpression:

java -jar tools/GenomeAnalysisTK.jar \
-T VariantFiltration \
-R ref/Taeniopygia_guttata.taeGut3.2.4.dna_rm.toplevel.fa \
-V vcftoolsextractGATK_snps.vcf \
-o GATK_geraldinefilter.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName “Geraldine_snp_filter"

and I got the following error: Unmatched ".

Thanks!
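One plausible reading of both errors is a quoting problem on the command line rather than a GATK problem. In a typical shell an unquoted `< 2` is taken as "read input from a file named 2", and curly quotes (“ ”) do not count as quote characters. The Python sketch below uses the standard `shlex` tokenizer to illustrate how such command lines are split. `shlex` only approximates shell behaviour (it does not model redirection or `&&`), so treat this as an illustration under that assumption, not a reproduction of the errors; the filter name `snp_filter` is a placeholder.

```python
import shlex

# The expression with no quotes around it is split on whitespace, so the
# tool would never receive it as a single argument.
unquoted = shlex.split("--filterExpression QD < 2 && FS > 60.0")

# The same kind of expression inside straight double quotes: one argument.
quoted = shlex.split('--filterExpression "QD < 2.0 || FS > 60.0" --filterName "snp_filter"')

# Curly quotes (U+201C, U+201D) are ordinary characters to the tokenizer,
# so they end up inside the argument instead of delimiting it.
curly = shlex.split("--filterName \u201cdarwinfinchfilter\u201d")

# An opening curly quote paired with a closing straight quote leaves the
# straight quote unmatched.
try:
    shlex.split('--filterName \u201cGeraldine_snp_filter"')
    mixed_error = None
except ValueError as err:
    mixed_error = str(err)
```

Under this reading, wrapping the whole expression and the filter name in straight double quotes (") would be the first thing to try.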

Hi all - I'm stumped and need your help. I'm following the GATK best practices for calling variants with HaplotypeCaller in GVCF mode. One of my samples is NA12878, among 119 other samples in my cohort. For some reason GATK is missing a bunch of variants in this sample that I can clearly see in IGV but that are not listed in the VCF. I discovered that one such variant is being filtered out, the reason being VQSRTranchesSNP99.00to99.90. The genotype is homozygous variant, DP is 243, QUAL is 524742.54 and it's known in dbSNP. I suspect this is happening to other variants.

How do I adjust VQSR, or how tranches are used and which variants get placed in them? I suppose I need to fine-tune my parameters... but I would think something as obvious as this variant would pass filtering.

Hi GATK team,

I need to flag poor quality SNPs and indels in the joint VCF. Basically, we call a family (trio) together, so a typical joint VCF contains calls from three samples. I followed the rules proposed in your post "how-to-apply-hard-filters-to-a-call-set". In fact, I want to set the "FILTER" field based on the information of one sample in the VCF. How do I specify that in "VariantFiltration"? Any suggestions?

Many thanks, Linda

I was pulling my hair out over this one.

I was applying a hard filter to a genotyped gVCF using JEXL to access variant context attributes to decide what filter setting I would apply. The filter was

"vc.getGenotype("%sample%").isHomRef() ? vc.getGenotype("%sample%").getAD().size() == 1 ? DP < 10 : ( DP - MQ0 ) < 10 or ( MQ0 - (1.0 * DP) ) >= 0.1 or MQRankSum <= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81 : false"

In pseudocode it says:

if ( isHomRef ) then
    if ( getAD().size() == 1 ) then DP < 10
    else ( DP - MQ0 ) < 10 or ( MQ0 - (1.0 * DP) ) >= 0.1 or MQRankSum >= 3.2905 or ReadPosRankSum >= 3.2905 or BaseQRankSum >= 2.81
else ignore record


The idea being that for records where not all reads contained the reference allele, we would filter out those positions where there was evidence to suggest that the reads supporting an alternate allele were of a significantly better quality. However, running this filter I keep getting the warning (snipped for clarity):

WARN [SNIP]... MQRankSum <= 3.2905 [SNIP]... : false;' undefined variable MQRankSum

So I thought the filter was failing. However, just as a test, I changed the direction of MQRankSum from >=3.2905 to <=3.2905 (a bit nonsensical, it should basically apply the filter to almost all HomRef positions that had any reads supporting an alternate allele).

I still get the warning but I found the filter was applied to variant records as it should be. e.g. the following went from PASS to BAD_HOMREF:

So the filter is being correctly applied, but I am not sure why all the warnings are being generated? Is this a bug? Have I done something wrong?

I am experimenting with JEXL expressions in order to do some hard filtering on variants. I wonder if there is a method to tell the filter expression to operate on the current sample (assuming here a single sample VCF file) without knowing the sample name a priori e.g.

vc.getGenotype("Sample1").isHet()

Works just fine to sample heterozygous positions from a VCF where I know the sample will be called "Sample1", but can I refer to a sample name programmatically, e.g. something like: vc.getGenotype( sample() ).isHet()

Sorry if this is a really dumb question. (BTW I realise you could use a genotype filter, e.g. --genotypeFilterExpression "isHet == 1", to do the same thing, but I want to annotate the VCF in the FORMAT/FILTER field, rather than the INFO field, for downstream variant selection operations.)

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

• 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
• 02- BAM: MarkDuplicates
• 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
• 04- BAM: Calling variants using HaplotypeCaller
• 05- VCF: Hard filter variants for truth set for BQSR (there is no known variant site databases so we can use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
• 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
• 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
• 08- BAM: Calling variants on recalibrated BAM file from previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES and --emitRefConfidence GVCF

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site by site, about its calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES and --emitRefConfidence GVCF I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).

I have 23 samples and I want to look over 63807197 bp region. Many thanks before.

Kind regards, Angelica

Hi all, I tried to filter my raw VCF file with the following commands:

java -Xmx30g -jar ../GATK/GenomeAnalysisTK.jar -R ../ref.fa -T VariantFiltration --filterExpression " QD < 20.0 || ReadPosRankSum < -8.0 || FS > 10.0 || QUAL < $MEANQUAL || MQ <30.0 || DP< 10.0 " --filterName LowQualFilter --missingValuesInExpressionsShouldEvaluateAsFailing --variant ../s1.raw.vcf --logging_level ERROR -o ../s1.makered.raw.vcf

grep -v "Filter" s1.makered.raw.vcf >s1.flt.vcf

After that, I checked the result file s1.flt.vcf and found the following records marked "PASS". Obviously the command doesn't work, as records with "DP=8" should be marked "LowQualFilter":

Chr01 231575 . A G 241.78 PASS AC=2;AF=1.00;AN=2;DP=8;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=29.00;MQ0=0;QD=30.22 GT:AD:DP:GQ:PL 1/1:0,8:8:24:270,24,0
Chr01 237476 . T C 238.78 PASS AC=2;AF=1.00;AN=2;DP=8;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=29.00;MQ0=0;QD=29.85 GT:AD:DP:GQ:PL 1/1:0,8:8:24:267,24,0

There is no error reported. Any suggestions will be appreciated.

Hi, I am trying to filter my variants only if they have coverage of >=10. Below is the error I am getting:

$ java -jar /opt/NGSTools/GATK-3.2.2/GenomeAnalysisTK.jar -T VariantFiltration -R /home/data/GATK_test/gatk/ucsc.hg19.fasta --variant 12-0116KZ_vcf_snp-indel.vcf -o 12-0116KZ__filtered.vcf --filterExpression "DP > 10"
INFO 12:59:02,106 HelpFormatter - --------------------------------------------------------------------------------
INFO 12:59:02,108 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03
INFO 12:59:02,109 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 12:59:02,109 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 12:59:02,113 HelpFormatter - Program Args: -T VariantFiltration -R /home/data/GATK_test/gatk/ucsc.hg19.fasta --variant 12-0116KZ_vcf_snp-indel.vcf -o 12-0116KZ__filtered.vcf --filterExpression DP > 10
INFO 12:59:02,123 HelpFormatter - Executing as clnsxr@MSJMJ794LD4229 on Linux 2.6.32-431.23.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_65-mockbuild_2014_07_14_06_19-b00.
INFO 12:59:02,124 HelpFormatter - Date/Time: 2014/10/22 12:59:02 INFO 12:59:02,124 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:59:02,124 HelpFormatter - -------------------------------------------------------------------------------- INFO 12:59:02,199 GenomeAnalysisEngine - Strictness is SILENT INFO 12:59:02,352 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 12:59:02,487 GenomeAnalysisEngine - Preparing for traversal INFO 12:59:02,508 GenomeAnalysisEngine - Done preparing for traversal INFO 12:59:02,508 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 12:59:02,509 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 12:59:02,509 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime WARN 12:59:03,266 RestStorageService - Error Response: PUT '/yzrGUOlIfLoDxQzDGnPauLbDm4fD0TwM.report.xml.gz' -- ResponseCode: 403, ResponseStatus: Forbidden, Request Headers: [Content-Length: 809, Content-MD5: BxyNhjXmVTFtPLFC1PfHvQ==, Content-Type: application/octet-stream, x-amz-meta-md5-hash: 071c8d8635e655316d3cb142d4f7c7bd, Date: Wed, 22 Oct 2014 17:59:02 GMT, Authorization: AWS AKIAI22FBBJ37D5X62OQ:xPIn4b04nmKanO/VjUHgupuql8w=, User-Agent: JetS3t/0.8.1 (Linux/2.6.32-431.23.3.el6.x86_64; amd64; en; JVM 1.7.0_65), Host: broad.gsa.gatk.run.reports.s3.amazonaws.com, Expect: 100-continue], Response Headers: [x-amz-request-id: 7CC503EC75535B0F, x-amz-id-2: fAfDXCnwIDLYgjwyQf84lgZBbHe/DyKdCzO2K5I3pQXQPC8kSeRu/ceSKfd2FS3I, Content-Type: application/xml, Transfer-Encoding: chunked, Date: Wed, 22 Oct 2014 21:58:59 GMT, Connection: close, Server: AmazonS3] WARN 12:59:03,415 RestStorageService - Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 14396 seconds. Retrying connection. 
INFO 12:59:03,557 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

Any help will be appreciated.

Satish

Hi all,

I am using GATK version-3.2-2 and called the variants using HaplotypeCaller using the below shown command:

 java -jar GenomeAnalysisTK.jar -R ref.fa -T HaplotypeCaller -I input.vcf -L region.bed -stand_emit_conf 10 -stand_call_conf 30
--genotyping_mode DISCOVERY -o var.vcf


I then selected the variants using SelectVariants and filtered them using VariantFiltration by following the steps in the tutorial: https://www.broadinstitute.org/gatk/guide/topic?name=tutorials . However, I met with the following errors:

"undefined variable ReadPosRankSum" and "undefined variable MappingQualityRankSum". The same issue is discussed in the forum, but I could not find a concrete solution to fix it. Could someone help?

Hi,

I am using the GATK VariantFiltration tool to do some hard filtering of variants and it works fine. However, the total number of variants remains the same before and after filtering; variants that pass the filter are simply marked "PASS". I looked through the documentation and forum for a way to drop the variants that do not meet the filtering criteria from the file, but couldn't find one. Could someone suggest how to do this?
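For illustration only, here is a minimal Python sketch of the post-processing step being asked about: keep the header lines plus the records whose FILTER column (the seventh tab-separated column of a VCF record) is PASS or ".". The function name `drop_filtered` and the example records are made up, and this is not a GATK feature.

```python
def drop_filtered(vcf_lines):
    """Keep header lines and records whose FILTER column is PASS or '.'."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)          # header and meta-information lines
            continue
        columns = line.rstrip("\n").split("\t")
        if columns[6] in ("PASS", "."):
            kept.append(line)          # record was not flagged by any filter
    return kept

# Made-up records: one passing, one flagged by a filter.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr01\t100\t.\tA\tG\t241.78\tPASS\tDP=20",
    "Chr01\t200\t.\tT\tC\t12.50\tLowQualFilter\tDP=3",
]
passing = drop_filtered(example)
```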

Hi all,

I tried to apply the following command to my raw vcf file to filter it using the filtering expression specified in the command:

java -XX:ConcGCThreads=4 -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=4 -jar GenomeAnalysisTK.jar -T VariantFiltration -R human_g1k_v37.fa --variant human_g1k_v37.CHL124.vcf_snps.vcf -o CHL124.vcf_snps.vcf_filter_marked.vcf --filterExpression "QD < 2.0 && MQ < 40.0 && FS > 60.0 && HaplotypeScore > 13.0 && MQRankSum < -12.5 && ReadPosRankSum < -8.0" --filterName "very_small_SNPs_default_filter"

After that, I checked my result file, CHL124.vcf_snps.vcf_filter_marked.vcf, and found that all variants are marked "PASS" whether QD is > 2.0 or < 2.0. Obviously the command doesn't work, but I cannot figure out why: everything seems to go well and no error is reported.

bless~ XL

Are there any recommendations about what percentage of SNPs should be filtered out from the data set when tweaking the filter criteria?

Many thanks

Hi, is it possible to have variants filtered out based on the number of samples with a GQ < 99 like you can do with DiagnoseTargets?

I believe that would come in handy if searching for denovo variants in trios (at least that's what I'm doing now and I have a lot of samples in which one of the parents has low GQ).

I wanted to use VariantFiltration/-G_filter/-G_filterName to filter some low quality genotype calls. With VariantFiltration (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_filters_VariantFiltration.html) I can use JEXL expressions (http://www.broadinstitute.org/gatk/guide/article?id=1255) to filter bad genotype calls. For example I can use the JEXL expression "DP<10" to filter calls in regions where coverage is too low. However, this only adds a tag to the genotype calls. What I would really want is to set those genotype calls to missing (i.e. ./.), but I don't see an option to do so in the walker. It seems like a missing feature.

Or maybe there are other ways to get what I need? Again, I don't want to filter the variant, only the genotypes.

Hi,

I have annotated my vcf file of 20 samples from Unified genotyper using the following steps.

Unified genotyper->Variantrecalibration->Applyrecalibration->VariantAnnotator

My question is how should I proceed if I have to select rare variants (MAF<1%) for the candidate genes that I have,for each of these 20 samples?

Hi, I have run UnifiedGenotyper followed by application of hard filters as recommended in the GATK best practices on my targeted sequencing data. I've noticed, however, there are several variants with very high no-call rates (>90%) which still passed the variant filtration. I'm pasting below part of the vcf files for two such variants.

I've also noticed that most of high no-call rate variants have very low read depths. I read in other discussions that you don't recommend filtering variants by read depth, but I wonder if there is another filtering criteria you can recommend so that such variants wouldn't pass the filtering step (i.e. more stringent std_call_conf values?)?

I can surely filter out the variants based on their call rate before the downstream applications, but I'm trying to understand the sequencing quality metrics, and GATK's behavior here as to what quality of these variants makes them to get a pass in the filtration.

Thanks a lot,

Gulum

for these two variants below, genotypes for only 2 and 1 (out of 278) people, respectively, were called:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10134 10215
1 11857410 rs7537955 A G 101.85 PASS AC=6;AF=1.00;AN=6;DB;DP=3;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=6;MLEAF=1.00;MQ=45.96;MQ0=0;QD=33.95; GT:AD:DP:GQ:PL ./. ./.
4 156661872 . C A 53.39 PASS AC=2;AF=1.00;AN=2;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=26.70; GT:AD:DP:GQ:PL ./. ./.

Hi,

I was wondering if you could use the toolkit to generate a separate VCF file containing only SNPs that are found at a predetermined chromosome and base pair position. I have a plink file which I want to convert back to VCF format and it seems unbelievably hard to do so I thought this may be a good way to get around that problem?

I am aware that vcftools offers this function with the "--positions " option, however for some reason I am getting far more variants than I listed and there is nothing wrong that is obvious with my listed positions/vcf file.

Dear all,

I want to use GATK to filter a multisample VCF based on the PL numbers but I cannot figure out how to do this. I want to flag genotypes that are likely to be homozygous reference (with the aim of removing these eventually). I would like to look at the PL field and if the first PL number (which refers to the probability of a homozygous reference call) is 10 or less, flag these genotypes.

The variant Filtration page is quite helpful and suggests accessing arrays using things like: 'vc.getGenotype("NA12878").getAD().0 > 10'

but this is not quite what I want, as I want the filtering to be genotype based (not variant based) and I don't want to base the filtering on a single sample but on each sample separately. Basically I am looking for something like this:

--genotypeFilterExpression "PL.0 > 10"

where PL.0 is the first number of the PL array. I cannot find a way to do this anywhere. Can someone suggest a recipe to achieve this?

Hello,

I would like to filter my multi-sample vcf using per sample metrics such as AD and PL. However, these are provided as comma-separated lists of numbers. Does anyone know how I can filter, for example, on PL of 0/1 genotype in sample A?

Best wishes,

Kath

Hello all,

First post. Thank you for these amazing tools. I have spent two days pulling my hair out, trying all enumerations, searching the documentation and forums, and in the end I come to you for help. Please forgive me if these topics have been covered elsewhere.

I have several VCFs generated by SomaticSniper that I'd like to filter based on the SomaticScore (SSC in the FORMAT field). I was working with VariantFiltration and SelectVariants, and trying to use different options, as well as regular expressions, to select those calls with a SSC over 40. I have been unable to do so. I also looked into trying to figure out JEXL, and using the last command listed on the documentation page, about using the VariantContext feature to drill into an array. I cannot get it to recognize the SSC column of the FORMAT field and then filter for the TUMOR sample.

Using VariantFiltration I was using -select (but I understand now that this searches the INFO field only). I was then using the --genotypeFilterExpression, but it would not add the FT tag to the FORMAT field as it said it would, it would just apply PASS to everything.

java -Xmx4-jar GenomeAnalysisTK.jar -T VariantFiltration -R ~/Documents/reference/human_g1k_v37.fasta -V '/home/registry3/Desktop/merged/104024sniperRAWSNPS.vcf' --G_filter "SSC < 40.0" --G_filterName "myFilter" -o '/home/registry3/Desktop/merged/104024sniperFILTEREDSNPS.vcf'

Using SelectVariants, I was using -sn to select the TUMOR sample and then using -select_expressions, but I guess this also only works on the INFO field. I had been trying to use --sample_expression which gives the ability to use a regular expression, but then I never had good results; it wouldn't do any filtering, and output the entire input file. Does the regular expression only apply to the sample name, and not the content of each line? Trying to select SSC over 40 from a line like this

#CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TUMOR
1   10177   0   A   C   0   0   AC=1;AF=0.500;AN=2;DP=62    GT:AMQ:BCOUNT:BQ:DP:DP4:GQ:IGT:MQ:SS:SSC:VAQ    0/1:16,15:40,22,0,0:28,25:62:31,9,10,12:37:0/1:16:2:19:37

I used a line such as this, to look at the second to last number in the FORMAT field based on : dividers

java -Xmx4-jar GenomeAnalysisTK.jar -R ~/Documents/reference/human_g1k_v37.fasta -T SelectVariants --variant '/home/registry3/Desktop/merged/104024sniperSNPs.vcf' -o '/home/registry3/Desktop/merged/104024sniperSSC40.vcf' -se ".*:[4,5,6,7,8,9][1,2,3,4,5,6,7,8,9]:[1234567890]{2,3}\$"

I am not a coder, as you can probably see, but I'm trying to get this worked out. This output the entire file still, with SSC values above and below 40.

Looking into use the vc.getGenotype array access, I could not find much documentation about VariantContext; I was looking through the files on github, looking through the code and looking for samples, since the .getAD() from the documentation seems to work, but alas, there is no .getSCC() available. Is using vc. the best way to drill into an array (the FORMAT field) and search for what I want?

I didn't post all the code and output, trying to keep this as short as possible. I can post pastebin outputs if that would be helpful. Thank you, David

Hello,

I am hoping to perform hard filtering on some variants from a sequencing project where, unfortunately, I do not have information from enough samples for VQSR. I was planning to filter on the QD value, but it seems to be very low for variants that seem reasonable. Example:

chr7    55249063 .       G       A       225     PASS
AC=1;AC1=1;AF=0.500;AF1=0.5;AN=2;BaseQRankSum=1.307;DP=4582;DP4=937,935,1299,1316;Dels=0.00;FQ=225;FS=0.323;


This variant is shown in IGV in the attached file. It looks to be a true positive, but because of the high depth, QD is very low. Based on the QD documentation, it looks as though QD simply cannot be used to filter high-coverage data, since the value is QUAL/unfiltered depth.
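The arithmetic behind this concern can be sketched in a few lines of Python, taking the description above (QD as QUAL divided by unfiltered depth) at face value and plugging in the QUAL and DP values from the example record. The 2.0 cutoff is an assumption here, borrowed from the "QD < 2.0" hard filter quoted in other posts; the annotation GATK actually writes may be computed differently.

```python
qual = 225.0     # QUAL column of the example record
depth = 4582     # DP value from its INFO field

# QD as described in the post: QUAL divided by unfiltered depth.
qd = qual / depth

# Assumed cutoff, taken from the commonly quoted "QD < 2.0" hard filter.
fails_qd_filter = qd < 2.0

print(f"QD = {qd:.3f}, filtered: {fails_qd_filter}")
```

With these numbers the ratio comes out at roughly 0.05, far below the cutoff, even though the call itself looks solid.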

Is there an alternative annotation that expresses the same measure, since QD is recommended in all the hard filtering documentation? Would GQ be a good substitute?

My question may look like the one here, but the answer there didn't help me.

I am using VariantFiltration on a VCF file generated directly by UnifiedGenotyper under GenomeAnalysisTK-2.3-9-ge5ebf34.

The error I am facing is

##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 126: there aren't enough columns for line 70 (we expected 9 tokens, and saw 1 )


Line number 126 is as follows:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  m1016ROUa.40287 m1023ROGa.40244 m1042ujba.40261 m1069FXFa.49470


And actually indeed it is the header of VCF file ! Should I re-run my samples ?!!

Hi Team,

I have a multi-sample VCF file produced by UnifiedGenotyper. I now want to filter this file marking those variants with a low depth. However the DP entry in the info field is across all samples, and even if it were possible to assess the individual's DPs, I would then have to resolve the issue of a variant having low depth in one sample, and high in another. Any suggestions are appreciated.

I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.

While the raw VCF has plenty of very high QUAL scores, not a single call has PASS in the FILTER field; all are ".". If I use SelectVariants to filter the VCF based on high QUAL or DP values, or a combination of the two, the FILTER field remains "." for the returned variants.

Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?

Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.
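For reference, the VCF specification defines three kinds of FILTER value: "." when no filters have been applied, "PASS" when the record passed all filters, and a semicolon-separated list of filter names otherwise. The sketch below encodes that reading; the function name and the returned labels are made up for illustration.

```python
def filter_status(filter_field):
    """Interpret the FILTER column of a VCF record."""
    if filter_field == ".":
        return "unfiltered"  # no filters have been applied yet
    if filter_field == "PASS":
        return "passed"
    return "failed: " + ", ".join(filter_field.split(";"))


print(filter_status("."))               # unfiltered
print(filter_status("PASS"))            # passed
print(filter_status("LowQual;LowQD"))   # failed: LowQual, LowQD
```

Under this reading, a "." in every record says that no filtering step has been run, and says nothing about the quality of the calls.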

Hi all,

I'm currently analysing non-human mammalian whole genome data (>30x). No previous variants databases are available.

I'm currently at the VariantFiltration step. I came across the following command, which is used for human data, and I'm wondering whether it is suitable for non-human data:

java -Xmx10g -jar GenomeAnalysisTK.jar \
-R [reference.fasta] \
-T VariantFiltration \
--variant [input.recalibrated.vcf] \
-o [recalibrated.filtered.vcf] \
--clusterWindowSize 10 \
--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
--filterName "HARD_TO_VALIDATE" \
--filterExpression "DP < 5 " \
--filterName "LowCoverage" \
--filterExpression "QUAL < 30.0 " \
--filterName "VeryLowQual" \
--filterExpression "QUAL > 30.0 && QUAL < 50.0 " \
--filterName "LowQual" \
--filterExpression "QD < 1.5 " \
--filterName "LowQD" \
--filterExpression "SB > -10.0 " \
--filterName "StrandBias"
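To see how these thresholds interact, the sketch below restates each expression from the command as a Python predicate and lists the filter names a record would be flagged with. It is an illustration of the logic only, not of how GATK evaluates JEXL, and the example record is hypothetical.

```python
# Thresholds copied from the command above; each predicate returns True
# when the record should be flagged with that filter name.
FILTERS = [
    ("HARD_TO_VALIDATE", lambda a: a["MQ0"] >= 4 and a["MQ0"] / (1.0 * a["DP"]) > 0.1),
    ("LowCoverage", lambda a: a["DP"] < 5),
    ("VeryLowQual", lambda a: a["QUAL"] < 30.0),
    ("LowQual", lambda a: 30.0 < a["QUAL"] < 50.0),
    ("LowQD", lambda a: a["QD"] < 1.5),
    ("StrandBias", lambda a: a["SB"] > -10.0),
]


def failed_filters(annotations):
    """Return the names of all filters whose expression matches the record."""
    return [name for name, test in FILTERS if test(annotations)]


# Hypothetical record sitting exactly on the QUAL boundary.
record = {"MQ0": 0, "DP": 40, "QUAL": 30.0, "QD": 12.0, "SB": -50.0}
print(failed_filters(record))  # []
```

Note that a record with QUAL exactly 30.0 matches neither "QUAL < 30.0" nor "QUAL > 30.0 && QUAL < 50.0", so it escapes both quality filters as written.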


I would appreciate your thoughts on this matter.

Thank you very much!

Sagi

I am trying to filter variant calls which have "GQ>=20.0".

GATK SelectVariants gives no error, but writes only the header to the output file:

java -Xmx2g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -R xxx -T SelectVariants --variant xxx.var.flt.vcf -o xxx.vcf -select "GQ >= 20.0"


So, I tried using VariantFiltration followed by SelectVariants. The variant filtration seems to work fine, adding an FT tag to the FORMAT field. I am then trying to get the records that have the FT tag using the following commands:

java -Xmx2g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -R xxx -T VariantFiltration --variant xxx.var.flt.vcf -o xxx_filtered.vcf --genotypeFilterExpression "GQ >= 20.0" --genotypeFilterName "qual_1_filters"

java -Xmx4g -jar ~/GenomeAnalysisTKLite-2.1-8-gbb7f038/GenomeAnalysisTKLite.jar -T SelectVariants -R xxx --variant xxx_filtered.vcf -select 'vc.hasAttribute("FT")' -o xxx_qual20.vcf


but I only get the header in the output VCF file.

I am not sure if this is the right approach. Any help would be appreciated.
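One thing worth checking, stated here as an assumption about the file rather than a diagnosis: GQ is normally a per-sample FORMAT field rather than a site-level INFO field, so an expression that looks for GQ among the site-level annotations may find nothing to match. The sketch below shows the difference on a made-up record; the helper names are hypothetical.

```python
def parse_info(info_field):
    """Parse a VCF INFO string into a dict (flag entries map to True)."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_genotype(format_field, sample_field):
    """Pair the FORMAT keys with the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


# Hypothetical INFO and sample columns from a single record.
info = parse_info("DP=30;MQ=60.0")
genotype = parse_genotype("GT:DP:GQ", "0/1:30:45")

print("GQ" in info)                    # False
print(float(genotype["GQ"]) >= 20.0)   # True
```

In this made-up record the GQ threshold can only be tested against the sample column, not against INFO.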

Hi, I wanted to double-check my methods for some targeted capture data. I ran 96 samples through UG to produce a multi-sample VCF. I separated SNPs and indels into separate files using SelectVariants, and applied filters:

For SNPs "QD < 2.0", "MQ < 40.0", "FS > 60.0", "HaplotypeScore > 13.0", "MQRankSum < -12.5", "ReadPosRankSum < -8.0"

For indels "QD < 2.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", "FS > 200.0"

I then went back through with SelectVariants, pulling out each sample one at a time into their own filtered VCF.

My results are... let's say, wrong. I am wondering whether it would be better practice to select each sample first and then apply the filters, or whether it does not matter and my errors lie elsewhere. Thank you.

Hi,

I have been trying to get variants out of a VCF file where the allele frequency (AF) is greater than 4%. I have tried both VariantFiltration and SelectVariants, but I get different errors with each. Here is my call for SelectVariants:

java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T SelectVariants -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf -select "AF > 0.04" -sn "S05-16209-1C_S4_L001_R1_001"


The error is:

MESSAGE: Invalid command line: Invalid JEXL expression detected for select-0 with message ![0,9]: 'AF > 0.04;' > error


For VariantFiltration the call is:

java -Xmx4g -jar ~/tools/bin/GenomeAnalysisTK.jar -R /home/genome/human_g1k_v37.truseq_mask.fasta -T VariantFiltration -o S05-16209-1C_S4_L001_R1_001.30.10.sorted.3perc.vcf --variant S05-16209-1C_S4_L001_R1_001.30.10.sorted.vcf --filterExpression 'AF > 0.040' --filterName "3perc"


The error is:

java.lang.ArithmeticException: Double coercion: java.util.ArrayList:([0.010, 0.010])
at org.apache.commons.jexl2.JexlArithmetic.toDouble(JexlArithmetic.java:1023)
at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:699)
at org.apache.commons.jexl2.JexlArithmetic.greaterThan(JexlArithmetic.java:790)
at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:796)
at org.apache.commons.jexl2.parser.ASTGTNode.jjtAccept(ASTGTNode.java:18)
at org.apache.commons.jexl2.Interpreter.interpret(Interpreter.java:232)
at org.apache.commons.jexl2.ExpressionImpl.evaluate(ExpressionImpl.java:65)