Tagged with #filtering
2 documentation articles | 0 announcements | 0 forum discussions


Comments (35)

If you are sure that you cannot use VQSR / recalibrate variants (typically because your dataset is too small, or because there are no truth/training resources available for your organism), then you will need to use the VariantFiltration tool to manually filter your variants. To do this, you will need to compose filter expressions as explained here, here and here based on the recommendations detailed further below.

But first, some caveats

Let's be painfully clear about this: there is no magic formula that will give you perfect results. Filtering variants manually, using thresholds on annotation values, is subject to all sorts of caveats. The appropriateness of both the annotations and the threshold values is very highly dependent on the specific callset, how it was called, what the data was like, etc.

HOWEVER, because we want to help and people always say that something is better than nothing (not necessarily true, but let's go with that for now), we have formulated some generic recommendations that should at least provide a starting point for people to experiment with their data.

In case you didn't catch that bit in bold there, we're saying that you absolutely SHOULD NOT expect to run these commands and be done with your analysis. You absolutely SHOULD expect to have to evaluate your results critically and TRY AGAIN with some parameter adjustments until you find the settings that are right for your data.

In addition, please note that these recommendations are mainly designed for dealing with very small data sets (in terms of both number of samples or size of targeted regions). If you are not using VQSR because you do not have training/truth resources available for your organism, then you should expect to have to do even more tweaking on the filtering parameters.

So, here are some recommended arguments to use with VariantFiltration when ALL other options are unavailable to you:

Filtering recommendations for SNPs:

  • QD < 2.0
  • MQ < 40.0
  • FS > 60.0
  • HaplotypeScore > 13.0
  • MQRankSum < -12.5
  • ReadPosRankSum < -8.0

Filtering recommendations for indels:

  • QD < 2.0
  • ReadPosRankSum < -20.0
  • InbreedingCoeff < -0.8
  • FS > 200.0

And now some more IMPORTANT caveats (don't skip this!)

  • The InbreedingCoeff statistic is a population-level calculation that is only available with 10 or more samples. If you have fewer samples you will need to omit that particular filter statement.

  • For shallow-coverage (<10x), it is virtually impossible to use manual filtering to reliably separate true positives from false positives. You really, really, really should use the protocol involving variant quality score recalibration. If you can't do that, maybe you need to take a long hard look at your experimental design. In any case you're probably in for a world of pain.

  • The maximum DP (depth) filter only applies to whole genome data, where the probability of a site having exactly N reads given an average coverage of M is a well-behaved function. First principles suggest this should be a binomial sampling but in practice it is more a Gaussian distribution. Regardless, the DP threshold should be set a 5 or 6 sigma from the mean coverage across all samples, so that the DP > X threshold eliminates sites with excessive coverage caused by alignment artifacts. Note that for exomes, a straight DP filter shouldn't be used because the relationship between misalignments and depth isn't clear for capture data.

Finally, a note of hope

Some bits of this article may seem harsh, or depressing. Sorry. We believe in giving you the cold hard truth.

HOWEVER, we do understand that this is one of the major points of pain that GATK users encounter -- along with understanding how VQSR works, so really, whichever option you go with, you're going to suffer.

And we do genuinely want to help. So although we can't look at every single person's callset and give an opinion on how it looks (no, seriously, don't ask us to do that), we do want to hear from you about how we can best help you help yourself. What information do you feel would help you make informed decisions about how to set parameters? Are the meanings of the annotations not clear? Would knowing more about how they are computed help you understand how you can use them? Do you want more math? Less math, more concrete examples?

Tell us what you'd like to see here, and we'll do our best to make it happen. (no unicorns though, we're out of stock)

We also welcome testimonials from you. We are one small team; you are a legion of analysts all trying different things. Please feel free to come forward and share your findings on what works particularly well in your hands.

Comments (27)

Objective

Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

Prerequisites

  • TBD

Steps

  1. Extract the SNPs from the call set
  2. Determine parameters for filtering SNPs
  3. Apply the filter to the SNP call set
  4. Extract the Indels from the call set
  5. Determine parameters for filtering SNPs
  6. Apply the filter to the Indel call set

1. Extract the SNPs from the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T SelectVariants \ 
    -R reference.fa \ 
    -V raw_variants.vcf \ 
    -L 20 \ 
    -selectType SNP \ 
    -o raw_snps.vcf 

Expected Result

This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.


2. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

  • QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

  • FisherStrand (FS) 60.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

  • RMSMappingQuality (MQ) 40.0

This is the Root Mean Square of the mapping quality of the reads across all samples.

  • HaplotypeScore 13.0

This is the consistency of the site with two (and only two) segregating haplotypes. Note that this is not applicable for calls made using the UnifiedGenotyper on non-diploid organisms.

  • MappingQualityRankSumTest (MQRankSum) 12.5

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

  • ReadPosRankSumTest (ReadPosRankSum) 8.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.


3. Apply the filter to the SNP call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantFiltration \ 
    -R reference.fa \ 
    -V raw_snps.vcf \ 
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0 || MappingQualityRankSum < -12.5 || ReadPosRankSum < -8.0" \ 
    --filterName "my_snp_filter" \ 
    -o filtered_snps.vcf 

Expected Result

This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.


4. Extract the Indels from the call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T SelectVariants \ 
    -R reference.fa \ 
    -V raw_HC_variants.vcf \ 
    -L 20 \ 
    -selectType INDEL \ 
    -o raw_indels.vcf 

Expected Result

This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.


5. Determine parameters for filtering Indels.

Indels matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

  • QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

  • FisherStrand (FS) 200.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

  • ReadPosRankSumTest (ReadPosRankSum) 20.0

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.


6. Apply the filter to the Indel call set

Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \ 
    -T VariantFiltration \ 
    -R reference.fa \ 
    -V raw_indels.vcf \ 
    --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \ 
    --filterName "my_indel_filter" \ 
    -o filtered_indels.vcf 

Expected Result

This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

No posts found with the requested search criteria.
No posts found with the requested search criteria.