I need to apply hard filters to my data. In cases where I have lower coverage I plan to use the
Fisher Strand annotation, and in higher coverage variant calls,
SOR (using a JEXL expression to switch between them:
DP < 20 ? FS > 50.0 : SOR > 3).
The variant call below (some annotations snipped), which is from a genotyped gVCF from HaplotypeCaller (using a BQSR'ed BAM file), looks well supported (high
MQ0). However, there appears to be some strand bias (
788.77 . DP=34;FS=5.213;MQ=35.37;MQ0=0;QD=25.44;SOR=3.334 GT:AD:DP:GQ:PL 1/1:2,29:31:35:817,35,0
In this instance the filter example above would be applied.
Is this filtering out a true positive? And what kind of cut-offs should I be using for
The snipped annotations
BaseQRankSum=-0.8440 for this variant also indicate minor bias that the evidence to support this variant call also has some bias (the variant appears near the end of reads in low quality bases, compared to the reads supporting the reference allele).
This is part of a larger hard filter I'm applying to a set of genotyped gVCFs called from HaplotypeCaller.
HomRef positions using this JEXL filter:
vc.getGenotype("%sample%").isHomRef() ? ( vc.getGenotype("%sample%").getAD().size == 1 ? (DP < 10) : ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || MQRankSum > 3.2905 || ReadPosRankSum > 3.2905 || BaseQRankSum > 3.2905 ) ) : false
HomVar positions using this JEXL:
vc.getGenotype("%sample%").isHomVar() ? ( vc.getGenotype("%sample%").getAD().0 == 0 ? ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || MQ < 30.0 ) : ( BaseQRankSum < -3.2905 || MQRankSum < -3.2905 || ReadPosRankSum < -3.2905 || (MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || (DP < 20 ? FS > 60.0 : SOR > 3.5) || MQ < 30.0 || QUAL < 100.0 ) ) : false
My goal is true positive variants only and I have high coverage data, so the filtering should be relatively stringent. Unfortunately I don't have a database I could use to apply VQSR, henceforth the comprehensive filtering strategy.
m struggeling with some statistics given by the vcf file: the Ranksumtests. I started googleing arround, but that turned out to be not helpfult for understanding it (in may case). I really have no idea how to interprete the vcf-statistic-values comming from ranksumtest. I have no clue whether a negative, positive or value near zero is good/bad. Therefore im asking for some help here. Maybe someone knows a good tutorial-page or can give me a hint to better understand the values of MQRankSum, ReadPosRankSum and BaseQRankSum. I have the same problem with the FisherStrand statistics. Many, many thanks in advance.
I have the following variant called by Unified Genotyper (GATK version : GenomeAnalysisTK-2.6-5) :
chr9 139413211 . T G 7.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-7.913;DP=296;Dels=0.00;FS=37.414;HaplotypeScore=22.3462;MLEAC=1;MLEAF=0.500;MQ=70.00;MQ0=0;MQRankSum=0.508;QD=0.03;ReadPosRankSum=-3.354 GT:AD:DP:GQ:PL 0/1:180,115:282:35:35,0,3884
The FS score is 37.414. But a closer look at the bam file indicates that the 115 reads supporting alternate allele G are all in + strand. Shouldn't the FS score be much higher for this variant? 113 reads reads supporting the reference allele T at this position are in + strand and 67 are in - strand.
Please help me understand if I am wrong about my understanding of FS score or if this is a bug.
I am filtering looking for rare variants and found some frameshift variants in an interesting gene. Some of them are noted as PASS in the QC column of the VCF and some are noted as Indel_FS . What exactly does that second notation mean? I am almost positive that these will validate given how they segregate in my subjects.
I have seen the definition of strand bias on this site (below) but I need a little clarification. Does the FS filter (a) highlight instances where reads are only present on a single strand and contain a variant (as may occur toward the end of exome capture regions) or does it (b) specifically look for instances where there are reads on both strands but the variant allele is disproportionately represented on one strand (as might be indicative of a false positive), or does it (c) do both?
I had thought it did (b) but have encountered some disagreement.
** How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls.