# Tagged with #filtering 2 documentation articles | 0 announcements | 7 forum discussions

Created 2013-09-11 21:54:51 | Updated 2015-10-30 19:34:12 | Tags: vqsr callset filtering hard-filtering

### The problem:

Our preferred method for filtering variants after the calling step is to use VQSR, a.k.a. recalibration. However, it requires well-curated training/truth resources, which are typically not available for organisms other than humans, and it also requires a large amount of variant sites to operate properly, so it is not suitable for some small-scale experiments such as targeted gene panels or exome studies with fewer than 30 exomes. For the latter, it is sometimes possible to pad your cohort with exomes from another study (especially for humans -- use 1000 Genomes or ExAC!) but again for non-human organisms it is often not possible to do this.

### The solution: hard-filtering

So, if this is your case and you are sure that you cannot use VQSR, then you will need to use the VariantFiltration tool to manually filter your variants. To do this, you will need to compose filter expressions as explained here and here based on the recommendations detailed further below.

### But first, some caveats

Let's be painfully clear about this: there is no magic formula that will give you perfect results. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values is very highly dependent on the specific callset, how it was called, what the data was like, what organism it belongs to, etc.

HOWEVER, because we want to help and people always say that something is better than nothing (not necessarily true, but let's go with that for now), we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.

In case you didn't catch that bit in bold there, we're saying that you absolutely SHOULD NOT expect to run these commands and be done with your analysis. You absolutely SHOULD expect to have to evaluate your results critically and TRY AGAIN with some parameter adjustments until you find the settings that are right for your data.

In addition, please note that these recommendations are mainly designed for dealing with very small data sets (in terms of both number of samples or size of targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, then you should expect to have to do even more tweaking on the filtering parameters.

### Filtering recommendations

Here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you. Note that these JEXL expressions will tag as filtered any sites where the annotation value matches the expression. So if you use the expression QD < 2.0, any site with a QD lower than 2 will be tagged as failing that filter.

#### For SNPs:

• QD < 2.0
• MQ < 40.0
• FS > 60.0
• SOR > 4.0
• HaplotypeScore > 13.0 only for variants output byUnifiedGenotyper; for HaplotypeCaller's output it is not informative
• MQRankSum < -12.5
• ReadPosRankSum < -8.0

#### For indels:

• QD < 2.0
• ReadPosRankSum < -20.0
• InbreedingCoeff < -0.8
• FS > 200.0
• SOR > 10.0

### And now some more IMPORTANT caveats (don't skip this!)

• The InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.

• For shallow-coverage (<10x), it is virtually impossible to use manual filtering to reliably separate true positives from false positives. You really, really, really should use the protocol involving variant quality score recalibration. If you can't do that, maybe you need to take a long hard look at your experimental design. In any case you're probably in for a world of pain.

• The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.

### Finally, a note of hope

Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.

HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works, so really, whichever option you go with, you're going to suffer.

Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)

We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.

Created 2013-06-17 22:48:43 | Updated 2015-03-30 11:18:39 | Tags: official analyst filtering hardfilters

#### Objective

Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

• TBD

#### Steps

1. Extract the SNPs from the call set
2. Determine parameters for filtering SNPs
3. Apply the filter to the SNP call set
4. Extract the Indels from the call set
5. Determine parameters for filtering indels
6. Apply the filter to the Indel call set

### 1. Extract the SNPs from the call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_variants.vcf \
-L 20 \
-selectType SNP \
-o raw_snps.vcf


#### Expected Result

This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.

### 2. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

• QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

• FisherStrand (FS) 60.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

• RMSMappingQuality (MQ) 40.0

This is the Root Mean Square of the mapping quality of the reads across all samples.

• MappingQualityRankSumTest (MQRankSum) 12.5

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

### 3. Apply the filter to the SNP call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_snps.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "my_snp_filter" \
-o filtered_snps.vcf


#### Expected Result

This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

### 4. Extract the Indels from the call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_HC_variants.vcf \
-L 20 \
-selectType INDEL \
-o raw_indels.vcf


#### Expected Result

This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.

### 5. Determine parameters for filtering Indels.

Indels matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

• QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

• FisherStrand (FS) 200.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

### 6. Apply the filter to the Indel call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_indels.vcf \
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
--filterName "my_indel_filter" \
-o filtered_indels.vcf


#### Expected Result

This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

No posts found with the requested search criteria.

Created 2015-07-29 01:58:31 | Updated | Tags: filtering baseqranksum

Hello,

I was plotting the distribution of BaseQRankSum and noticed a large number of variants with a BaseQRankSum outside of Z-score of +/- 2, which suggests that a lot of variants have significant base quality differences between the REF and ALT. I plotted the distribution of ClippingRankSum, MQRankSum, and ReadPosRankSum and the majority of variants had Z-scores inside +/- 2.

Is this typical and what is this suggestive of? I followed the best practices for DNA sequencing using GATK3.

I found this post (http://gatkforums.broadinstitute.org/discussion/2035/z-scores-for-baseqranksum), which is similar to what I'm asking but has a different distribution of Z-scores.

Dave

Created 2015-01-26 11:01:54 | Updated 2015-01-26 11:47:17 | Tags: fisherstrand haplotypecaller jexl strand-bias filtering hardfilters vcf-filtering

Hi, I need to apply hard filters to my data. In cases where I have lower coverage I plan to use the Fisher Strand annotation, and in higher coverage variant calls, SOR (using a JEXL expression to switch between them: DP < 20 ? FS > 50.0 : SOR > 3).

The variant call below (some annotations snipped), which is from a genotyped gVCF from HaplotypeCaller (using a BQSR'ed BAM file), looks well supported (high QD, high MQ, zero MQ0). However, there appears to be some strand bias (SOR=3.3):

788.77 . DP=34;FS=5.213;MQ=35.37;MQ0=0;QD=25.44;SOR=3.334 GT:AD:DP:GQ:PL 1/1:2,29:31:35:817,35,0

In this instance the filter example above would be applied.

## My Question

Is this filtering out a true positive? And what kind of cut-offs should I be using for FS and SOR?

The snipped annotations ReadPosRankSum=-1.809 and BaseQRankSum=-0.8440 for this variant also indicate minor bias that the evidence to support this variant call also has some bias (the variant appears near the end of reads in low quality bases, compared to the reads supporting the reference allele).

## My goal

This is part of a larger hard filter I'm applying to a set of genotyped gVCFs called from HaplotypeCaller.

I'm filtering HomRef positions using this JEXL filter:

vc.getGenotype("%sample%").isHomRef() ? ( vc.getGenotype("%sample%").getAD().size == 1 ? (DP < 10) : ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || MQRankSum > 3.2905 || ReadPosRankSum > 3.2905 || BaseQRankSum > 3.2905 ) ) : false

And filtering HomVar positions using this JEXL:

vc.getGenotype("%sample%").isHomVar() ? ( vc.getGenotype("%sample%").getAD().0 == 0 ? ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || MQ < 30.0 ) : ( BaseQRankSum < -3.2905 || MQRankSum < -3.2905 || ReadPosRankSum < -3.2905 || (MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || (DP < 20 ? FS > 60.0 : SOR > 3.5) || MQ < 30.0 || QUAL < 100.0 ) ) : false

My goal is true positive variants only and I have high coverage data, so the filtering should be relatively stringent. Unfortunately I don't have a database I could use to apply VQSR, henceforth the comprehensive filtering strategy.

Created 2013-10-23 15:48:56 | Updated | Tags: filtering

Hi

It would be great if we can extract duplicated reads and check where they fall on the genome.

Shouyong

Created 2013-01-21 09:50:51 | Updated 2013-01-22 16:53:15 | Tags: variants filtering

Dear, GATK team, I have done raw snp and indel calling with UnifiedGenotyper following the command line below.

java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -glm BOTH -R ucsc.hg19.fasta -T UnifiedGenotyper -I ERR031029.marked.realigned.fixed.recal.bam -I ERR031030.marked.realigned.fixed.recal.bam -D dbsnp_135.hg19.vcf -o ERR031030.raw.snps.indels.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000


After that, I did snp filteration using the following command lines.

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.snpsonly.vcf -selectType SNP

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant ERR031030.raw.snps.indels.vcf -o ERR031030.indelsonly.vcf -selectType INDEL

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T VariantRecalibrator -R ucsc.hg19.fasta -input ERR031030.snpsonly.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_phase1.indels.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.hg19.vcf -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an MQ -mode SNP -recalFile ERR031030.snp.recal.vcf -tranchesFile ERR031030.snp.tranches.vcf -rscriptFile ERR031030.plots.R

java -Xmx8g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T ApplyRecalibration -input ERR031030.snpsonly.vcf -tranchesFile ERR031030.snp.tranches.vcf -recalFile ERR031030.snp.recal.vcf -o ERR031030.snps.filtered.vcf

java -Xmx16g -jar GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T VariantFiltration --variant ERR031030.snps.filtered.vcf -o ERR031030.final.filtered.vcf --filterName "Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5"  --filterExpression "HaplotypeScore > 13.0"


The filtered snp.vcf file came up, however, it seems it contains some problem.

chrM    311     .       T       C       429.19  Nov28filters **_&& QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5;VQSRTrancheSNP99.90to100.00   AC=1;AF=0.250;AN=4;BaseQRankSum=-13.010;DP=2000;Dels=0.00;FS=50.500;HaplotypeScore=382.2016;MLEAC=1;MLEAF=0.250;MQ=50.86;MQ0=0;MQRankSum=1.458;QD=0.43;ReadPosRankSum=-10.687;VQSLOD=-6.143e+02;culprit=HaplotypeScore  GT:AD:DP:GQ:PL  0/0:634,353:949:99:0,232,7697   0/1:463,521:945:99:459,0,4190
chrM    410     .       A       T       64750.20        PASS    AC=4;AF=1.00;AN=4;DP=2000;Dels=0.00;FS=0.000;HaplotypeScore=7.3762;MLEAC=4;MLEAF=1.00;MQ=56.04;MQ0=0;QD=32.38;VQSLOD=2.27;culprit=HaplotypeScore        GT:AD:DP:GQ:PL  1/1:0,998:998:99:32010,2926,0   1/1:0,999:999:99:32767,2912,0
chrM    711     .       G       A       62989.20        PASS    AC=4;AF=1.00;AN=4;BaseQRankSum=2.500;DP=2000;Dels=0.00;FS=3.751;HaplotypeScore=8.7084;MLEAC=4;MLEAF=1.00;MQ=56.74;MQ0=1;MQRankSum=-0.107;QD=31.49;ReadPosRankSum=-2.169;VQSLOD=2.46;culprit=HaplotypeScore
chrM    1121    .       T       C       16719.20        Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5;VQSRTrancheSNP99.90to100.00   AC=4;AF=1.00;AN=4;BaseQRankSum=-0.239;DP=2000;Dels=0.00;FS=2.141;HaplotypeScore=22.9003;MLEAC=4;MLEAF=1.00;MQ=21.32;MQ0=703;MQRankSum=-1.627;QD=8.36;ReadPosRankSum=-0.027;VQSLOD=-4.195e+00;culprit=HaplotypeScore     GT:AD:DP:GQ:PL  1/1:3,985:986:99:9547,976,0     1/1:4,983:983:99:7199,739,0
chrM    2489    .       A       C       34.19   LowQual;Nov28filters && QD < 2.0 && ReadPosRankSum < -8.0 && MQ < 40.0 && FS > 60.0 && MQRandkSum < -12.5       AC=1;AF=0.250;AN=4;BaseQRankSum=-17.321;DP=2000;Dels=0.00;FS=180.208;HaplotypeScore=18.7245;MLEAC=1;MLEAF=0.250;MQ=46.52;MQ0=31;MQRankSum=3.365;QD=0.03;ReadPosRankSum=-4.198   GT:AD:DP:GQ:PL  0/1:278,719:950:64:64,0,4623    0/0:309,688:950:99:0,263,6065


For the filter option, most of the filtered snps show Nov28filters rather than PASS or LowQual, what's wrong with that, Are there some problems with my command lines? Thank you so much for your reply.

Created 2012-11-21 22:39:11 | Updated | Tags: filtering

Hello,

I am calling SNPs and Indels on a non-mammalian genome and do not have an empirical truth set for either. What would you recommend are the annotations I can use for filtering out low-confidence calls?

Created 2012-11-13 10:17:41 | Updated 2013-01-07 20:05:43 | Tags: filtering

Dear all, I have a set of 48 exomes which were analysed according to the best practices (using GATK-2.2-3 and HaplotypeCaller). According to the VQRS I have this first level of "uncertainty":

##FILTER=<ID=VQSRTrancheBOTH90.00to99.00,Description="Truth sensitivity tranche level for BOTH model at VQS Lod: -1.3455 <= x < 2.62">


that sets filter=PASS for variants with VQSLOD >= 2.62. I also have an external validation of some SNPs, 3 out of 20 have a VQSLOD lower than 2.62 (1.24, .1.37 and 1.69). Now the question: should I trust the validation and set the filter to, say, VQSLOD >= 1.2 or keep the GATK filter? What is your experience about this?

Thanks

d

Created 2012-10-30 15:55:37 | Updated 2013-01-07 20:11:03 | Tags: selectvariants variantfiltration vcf filtering

I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.

While the raw VCF has plenty of QUAL scores that are very high, not a single call has a PASS associated with it in the Filter field- all are "." If I use SelectVaraints to filter the VCF based on high QUAL or DP values, or combination, the Filter field remains "." for the returned variants.

Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?

Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.