# Tagged with #filter

0 documentation articles | 0 announcements | 11 forum discussions


In my Picard/GATK pipeline, I already include the 1000G_gold_standard and dbsnp files in my VQSR step, and I am wondering whether I should further filter the final VCF files. The two files I use are Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf, downloaded from the GATK resource bundle.

I recently came across the NHLBI exome seq data http://evs.gs.washington.edu/EVS/#tabs-7, and the more complete 1000G variants ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/

These made me wonder whether I should use these available VCFs to further filter my VCF files and remove the common SNPs. If so, can I use the "--mask" parameter of GATK's VariantFiltration to do the filtering? The example below is copied from the documentation page:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
        -R ref.fasta \
        -T VariantFiltration \
        -o output.vcf \
        --variant input.vcf \
        --filterExpression "AB < 0.2 || MQ0 > 50" \
        --filterName "Nov09filters" \


So I have used the latest GATK best practices pipeline for variant detection on non-human organisms, but now I am trying to do it for human data. I downloaded the Broad bundle and I was able to run all of the steps up to and including ApplyRecalibration. However, now I am not exactly sure what to do. The VCF file that is generated contains these FILTER values:

.

PASS

VQSRTrancheSNP99.90to100.00

I am not sure what these mean. Does the "VQSRTrancheSNP99.90to100.00" filter mean that that SNP falls below the specified truth sensitivity level? Does "PASS" mean that it is above that level? Or is it vice versa? And what does "." mean? Which ones should I keep as "good" SNPs?

I'm also having some difficulty fully understanding how the VQSLOD is used.... and what does the "culprit" mean when the filter is "PASS"?

A final question.... I've been using this command to actually create a file with only SNPs that PASSed the filter:

java -Xmx2g -jar /share/apps/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T SelectVariants -R ~/broad_bundle/ucsc.hg19.fasta --variant Pt1.40300.output.recal_and_filtered.snps.chr1.vcf -o Pt1.40300.output.recal_and_filtered.passed.snps.chr1.vcf -select 'vc.isNotFiltered()'

Is this the correct way to get PASSed SNPs? Is there a better way? Any help you can give me would be highly appreciated. Thanks!

• Nikhil Joshi
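For reference, the VCF specification defines the FILTER column as follows: `PASS` means the record passed all filters that were applied, `.` means no filters have been applied to it, and anything else is a semicolon-separated list of the filters the record failed (here, the VQSR tranche name). Below is a minimal Python sketch of keeping only `PASS` records outside of GATK, assuming a plain-text, tab-delimited VCF. The function name `passed_records` and the example lines are made up for illustration.

```python
def passed_records(vcf_lines):
    """Yield header lines plus the records whose FILTER column is exactly PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Keep meta-information lines and the column header unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th tab-separated column (index 6).
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


# Hypothetical records, one for each FILTER value mentioned in the question.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50.0\tPASS\tVQSLOD=4.2\n",
    "chr1\t200\t.\tC\tT\t12.0\tVQSRTrancheSNP99.90to100.00\tVQSLOD=-3.1\n",
    "chr1\t300\t.\tG\tGA\t40.0\t.\tDP=20\n",
]
kept = list(passed_records(example))
```

Note that this sketch also drops records whose FILTER is `.`. That is a deliberate choice for the example; whether unfiltered records should be kept depends on why they were never filtered.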

Hello dear GATK Team,

It seems that the ignoreFilter argument in VariantRecalibrator does not work. I want to include variants with the LowQual filter in the calculation, but can't find the right way to do it. I tried all of these:

-ignoreFilter LowQual
-ignoreFilter [LowQual]
-ignoreFilter "LowQual"
-ignoreFilter "Low Quality"
-ignoreFilter Low Quality
-ignoreFilter [Low Quality]


and also with --ignore_filter instead of -ignoreFilter.


Found 2 solutions:

1) Remove the LowQual filter with VariantFiltration and "--invalidatePreviousFilters"

2) "-ignoreFilter LowQual" has to be applied to ApplyRecalibration also...

Hi all,

I am comparing the variant calls from samtools and GATK. For samtools, I have been using quality score cut-offs of 100 for SNPs and 1000 for INDELs (quite stringent), and as a result many variants are excluded after filtering. In the case of GATK, I have been using our default setting (99% sensitivity for SNPs and 95% sensitivity for INDELs) and have included only the variants with "PASS" in the FILTER field. Since this does not look like a fair comparison, I was wondering whether there are any more stringent filters I can apply that would be equivalent to the samtools QS thresholds. Any suggestions will be appreciated.

Hi team,

These are two separate questions:

1. Starting with a VCF file, plotting the depth (DP) distribution gives a nice, slightly asymmetrical bell-shaped curve. Given that SNPs with very high and very low coverage should be excluded, how does one decide what counts as very high and very low, e.g. 5% on either side?

2. I'm only interested in chromosomes 2L, 2R, 3L, 3R and X of my Drosophila sequences. Filtering for these is easy with a Perl script, but I'm trying to do this earlier on in the GATK processes. I've tried "-L 2L -L 2R -L 3L ..." etc., "-L 2L 2R 3L ..." etc., and "-L 2L, 2R, 3R ..." etc., but the result is either an input error message or chromosome 2L only.

Many thanks and apologies if I've missed anything in the instructions.

Cheers,

Blue
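One way to make the first question concrete is to compute percentile cutoffs from the observed DP values. The following is a rough Python sketch, assuming DP is read from the INFO column. It uses the 5% figure from the question purely as a placeholder, not as a recommended threshold. `info_dp` and `dp_cutoffs` are hypothetical helper names, and the depths are toy data.

```python
import statistics


def info_dp(info):
    """Return the integer DP value from a VCF INFO string, or None if absent."""
    for entry in info.split(";"):
        if entry.startswith("DP="):
            return int(entry[3:])
    return None


def dp_cutoffs(depths, tail_percent=5):
    """Return (low, high) depth cutoffs that trim tail_percent from each side."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(depths, n=100, method="inclusive")
    return cuts[tail_percent - 1], cuts[100 - tail_percent - 1]


# Toy INFO strings with depths 1..101 (made-up data).
infos = ["DP=%d;MQ=60" % d for d in range(1, 102)]
depths = [info_dp(i) for i in infos]
low, high = dp_cutoffs(depths)
kept = [d for d in depths if low <= d <= high]
```

Symmetric percentile tails are only one possible choice. With an asymmetrical distribution like the one described, it may make more sense to pick the low and high cutoffs separately.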

In working with a mapping that has very high coverage in certain places, I have noticed that the Unified Genotyper often calls SNPs even if the alternate allele is present in only a minuscule fraction (<1%) of the reads at a position. It is difficult to filter these SNPs out because QUAL is high for them, and QD, which is low in these situations, is also low for many other (good) SNPs.

There already is a fractional threshold option for indel calling, -minIndelFrac. Is there any similar option for SNPs – and if not, what is your reasoning for omitting this option and what steps do you recommend for filtering out these SNPs?
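Whether or not such an option exists, the fraction can be computed after the fact from the per-sample AD (allelic depths) genotype field, when the VCF carries it. Below is a minimal Python sketch for a single sample column. The 1% threshold mirrors the figure in the question and is an assumption, as are the names `alt_fraction` and `MIN_ALT_FRACTION`.

```python
MIN_ALT_FRACTION = 0.01  # assumed cutoff, taken from the "<1%" in the question


def alt_fraction(format_col, sample_col):
    """Return the largest alt-allele read fraction from the AD field, or None."""
    keys = format_col.split(":")
    values = sample_col.split(":")
    if "AD" not in keys:
        return None
    idx = keys.index("AD")
    if idx >= len(values):
        return None  # trailing genotype fields were dropped for this sample
    parts = values[idx].split(",")
    if "." in parts or len(parts) < 2:
        return None  # missing or incomplete allelic depths
    depths = [int(p) for p in parts]
    total = sum(depths)
    if total == 0:
        return None
    # depths[0] is the reference allele; the rest are alternate alleles.
    return max(depths[1:]) / total


# Hypothetical high-coverage site: 5 alt reads out of 1000.
frac = alt_fraction("GT:AD:DP", "0/1:995,5:1000")
flagged = frac is not None and frac < MIN_ALT_FRACTION
```

AD counts only the reads that the caller considered informative, so the fraction computed this way can differ from what a raw pileup shows.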

When using CombineVariants, my variants get a FILTER value of either PASS or LowQual. Would it be possible to add an option to CombineVariants that prevents the FILTER value from being set to PASS? Otherwise I have to do some file processing before I run ApplyRecalibration further downstream. It would be great if this were a feature of all walkers and not just VariantFiltration. I'm not sure if the forum is the right place for feature requests. Happy to use Bugzilla or similar instead. Thanks.

Hi, I would like to perform base quality score recalibration on only the reads that have the "properly aligned" bit (0x2) set in the FLAG column of the SAM format. Ideally, I would like to use the --read_filter argument. Below is some code that does this to my satisfaction with the PrintReads walker of GATK 2 lite. However, GATK 2 lite does not support base quality score recalibration table creation. Is there any way someone could add the code to the GATK 2 full version?

I am not sure why, but the code seems to only work with the System.out.println() line.

Thanks, Winni

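    /*
     * code written by Kiran Garimella
     */

    import net.sf.samtools.SAMRecord;
    // Assumed location of ReadFilter in GATK 2.x; adjust if your version differs.
    import org.broadinstitute.sting.gatk.filters.ReadFilter;

    public class ProperPairFilter extends ReadFilter {
        @Override
        public boolean filterOut(SAMRecord samRecord) {
            // The filter only seems to work when this line is present.
            System.out.println(samRecord.getProperPairFlag());
            // Returning true filters the read out, so drop reads that are not properly paired.
            return !samRecord.getProperPairFlag();
        }
    }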

I have completed filtering my SNP data using VariantFiltration, and now I want to use SelectVariants to output all calls marked "PASS" in the FILTER field. I used the following script, but only the header information writes to the output file.

java -Xmx20g -jar GenomeAnalysisTK.jar -T SelectVariants -R HC.fa --variant HC.SNPs.filtered.vcf -select "FILTER == 'PASS'" -o HC.SNPs.passed.vcf


My input file contains many records that should evaluate as true. Any idea why this doesn't work?

Hi all,

I'm currently analysing non-human mammalian whole genome data (>30x). No previous variants databases are available.

I'm currently at the VariantFiltration step. I came across the following command, which is used for human data, and I'm wondering whether it is suitable for non-human data:

java -Xmx10g -jar GenomeAnalysisTK.jar \
-R [reference.fasta] \
-T VariantFiltration \
--variant [input.recalibrated.vcf] \
-o [recalibrated.filtered.vcf] \
--clusterWindowSize 10 \
--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
--filterName "HARD_TO_VALIDATE" \
--filterExpression "DP < 5 " \
--filterName "LowCoverage" \
--filterExpression "QUAL < 30.0 " \
--filterName "VeryLowQual" \
--filterExpression "QUAL > 30.0 && QUAL < 50.0 " \
--filterName "LowQual" \
--filterExpression "QD < 1.5 " \
--filterName "LowQD" \
--filterExpression "SB > -10.0 " \
--filterName "StrandBias"


I would appreciate your thoughts on this matter.

Thank you very much!

Sagi
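To see how the filter expressions in the command above interact, here is a small Python restatement of the same thresholds. It only illustrates the logic: it is not how GATK evaluates JEXL expressions, it ignores `--clusterWindowSize`, and it simply skips any annotation that is missing. The function name `hard_filters` and the example values are hypothetical.

```python
def hard_filters(qual, info):
    """Return the filter names from the command above that a site would trigger."""
    failed = []
    mq0, dp = info.get("MQ0"), info.get("DP")
    # "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)"; skipped when DP is missing or zero.
    if mq0 is not None and dp and mq0 >= 4 and mq0 / (1.0 * dp) > 0.1:
        failed.append("HARD_TO_VALIDATE")
    if dp is not None and dp < 5:
        failed.append("LowCoverage")
    if qual < 30.0:
        failed.append("VeryLowQual")
    if 30.0 < qual < 50.0:
        failed.append("LowQual")
    if info.get("QD") is not None and info["QD"] < 1.5:
        failed.append("LowQD")
    if info.get("SB") is not None and info["SB"] > -10.0:
        failed.append("StrandBias")
    return failed


# Made-up annotation values for three sites.
good = hard_filters(120.0, {"MQ0": 0, "DP": 35, "QD": 12.4, "SB": -250.0})
edge = hard_filters(30.0, {"MQ0": 0, "DP": 35, "QD": 12.4, "SB": -250.0})
poor = hard_filters(25.0, {"MQ0": 6, "DP": 4, "QD": 1.0, "SB": -0.5})
```

One thing the sketch makes visible is that a site with QUAL exactly 30.0 matches neither "VeryLowQual" (QUAL < 30.0) nor "LowQual" (QUAL > 30.0 && QUAL < 50.0) as the expressions are written.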