Hi there, I have been struggling with the interpretation of the SOR annotation. I do get that a higher value is the sign of a strand bias and that this value is never negative.
I did read the SOR annotation documention but still cannot figure out how you calculate the SOR.
Here is one of my example where a SB is present for sure REF + strand = 2185 reads REF - strand = 5 reads ALT + strand = 7 reads ALT - strand = 2370 reads.
When I calculate "R" as indicated in the documentation, I obtain a very high value of 147 955. But for this variant in the VCF file, SOR = 11.382
How is the 11.382 obtained ? Could you please help me in understanding the computation ?
I need to apply hard filters to my data. In cases where I have lower coverage I plan to use the
Fisher Strand annotation, and in higher coverage variant calls,
SOR (using a JEXL expression to switch between them:
DP < 20 ? FS > 50.0 : SOR > 3).
The variant call below (some annotations snipped), which is from a genotyped gVCF from HaplotypeCaller (using a BQSR'ed BAM file), looks well supported (high
MQ0). However, there appears to be some strand bias (
788.77 . DP=34;FS=5.213;MQ=35.37;MQ0=0;QD=25.44;SOR=3.334 GT:AD:DP:GQ:PL 1/1:2,29:31:35:817,35,0
In this instance the filter example above would be applied.
Is this filtering out a true positive? And what kind of cut-offs should I be using for
The snipped annotations
BaseQRankSum=-0.8440 for this variant also indicate minor bias that the evidence to support this variant call also has some bias (the variant appears near the end of reads in low quality bases, compared to the reads supporting the reference allele).
This is part of a larger hard filter I'm applying to a set of genotyped gVCFs called from HaplotypeCaller.
HomRef positions using this JEXL filter:
vc.getGenotype("%sample%").isHomRef() ? ( vc.getGenotype("%sample%").getAD().size == 1 ? (DP < 10) : ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || MQRankSum > 3.2905 || ReadPosRankSum > 3.2905 || BaseQRankSum > 3.2905 ) ) : false
HomVar positions using this JEXL:
vc.getGenotype("%sample%").isHomVar() ? ( vc.getGenotype("%sample%").getAD().0 == 0 ? ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || MQ < 30.0 ) : ( BaseQRankSum < -3.2905 || MQRankSum < -3.2905 || ReadPosRankSum < -3.2905 || (MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || (DP < 20 ? FS > 60.0 : SOR > 3.5) || MQ < 30.0 || QUAL < 100.0 ) ) : false
My goal is true positive variants only and I have high coverage data, so the filtering should be relatively stringent. Unfortunately I don't have a database I could use to apply VQSR, henceforth the comprehensive filtering strategy.
Hi, again, thanks a lot for the amazing workshop in Brussels! I have a question on dealing with strand bias, regarding the SB flag in the vcf file: The value of SB (strand bias) is calculated by Fisher exact test, using a 2X2 table that contains the reference, non-reference, fwd and reverse depths. Playing a little with the numbers given to Fisher exact test through web calculator I noticed that combinations which seem as clear strand bias receive non-significant value (e.g. 30,1,110,2 for ref-fwd, ref-reverse, non-ref-fwd, non-ref-reverse receive p-value of 0.52 or 2.84 when phred-scaled). Such variant are therefore considered as unbiased. The cases that are defined as bias are ones where 3 out of the 4 values are similar to each and only one is extremely different. As far as I understand, cases as the one I mentioned should be referred as biased. Do you recommend using this strand bias value and filter variants based on it?
I have seen the definition of strand bias on this site (below) but I need a little clarification. Does the FS filter (a) highlight instances where reads are only present on a single strand and contain a variant (as may occur toward the end of exome capture regions) or does it (b) specifically look for instances where there are reads on both strands but the variant allele is disproportionately represented on one strand (as might be indicative of a false positive), or does it (c) do both?
I had thought it did (b) but have encountered some disagreement.
** How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls.
The Strand Bias (SB) value on the vcf file looks so difference from the gatk1.5 & 2.1 1. On the gatk 1.5, SB=0 or 0.1 is acceptable, the smaller SB the better ... etc 2. On gatk 2.1 SB~-6.1e03 ???
My question is what is cut-off SB value (acceptable!)
How to get all SNPs variants with Strand Bias (high SB value) ?
GenomeAnalysisTK.jar -T UnifiedGenotyper \ -R ucsc.hg19.fasta -D dbsnp_135.hg19.vcf \ -stand_emit_conf 1.5 -stand_call_conf 1.5 \ -o Sample_2.snv.vcf -I Sample_2.recal.bam -nt 10