I was wondering if it makes sense to filter for strand bias as stated in the Best Practice RNAseq Variant Calling guide as most of todays RNAseq data is strand specific. I would actually expect high strand biases of variants and be suspicious about variants which do NOT show strand bias =) ...or did i get something wrong with the Fisher Strand values?
There seems to be a documentation error on this page: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandOddsRatio.php
refRatio and altRatio should be calculated as min/max rather than max/min values. That is what is in code here: https://github.com/broadgsa/gatk-protected/blob/master/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandOddsRatio.java assuming the two correspond to each other. I coded it the same way myself and get more precise results too.
Hi GATK team, Again thanks a lot for the wonderful tools you're offering to the community.
I have recently switched from UnifiedGenotyper to Haplotype Caller (1 sample at a time, DNASeq). I was planning to use the same hard filtering procedure that I was using previously, including the filter of the variants with FS > 60. However I am facing an issue probably due to the downsampling done by HC.
I should have 5000 reads, but DP is around 500/600 which I understood is due to downsampling (even with -dt NONE). I did understand that it does not impact in the calling itself. However it is annoying me for 2 reasons 1) Calculating frequency of the variant using the AD field is not correct (not based on all reads) 2) I get variants with FS >60 whereas when you look at the entire set of reads, there is absolutely no strand bias.
Example with this variant chr17 41245466 rs1799949 G A 7441.77 STRAND_BIAS; AC=1;AF=0.500;AN=2;BaseQRankSum=7.576;DB;DP=1042;FS=63.090;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.666;QD=7.14;ReadPosRankSum=-11.896;SOR=5.810 GT:AD:GQ:PL:SB 0/1:575,258:99:7470,0,21182:424,151,254,4
When I observe all reads I have the following counts, well shared on the + and - strands Allele G : 1389 (874+, 515-) Allele A : 1445 (886+, 559-)
Could you please tell me how to avoid such an issue ? (By the way, this variant is a true one and should not be filtered out).
Thanks a lot.
Hi there, I have been struggling with the interpretation of the SOR annotation. I do get that a higher value is the sign of a strand bias and that this value is never negative.
I did read the SOR annotation documention but still cannot figure out how you calculate the SOR.
Here is one of my example where a SB is present for sure REF + strand = 2185 reads REF - strand = 5 reads ALT + strand = 7 reads ALT - strand = 2370 reads.
When I calculate "R" as indicated in the documentation, I obtain a very high value of 147 955. But for this variant in the VCF file, SOR = 11.382
How is the 11.382 obtained ? Could you please help me in understanding the computation ?
I need to apply hard filters to my data. In cases where I have lower coverage I plan to use the
Fisher Strand annotation, and in higher coverage variant calls,
SOR (using a JEXL expression to switch between them:
DP < 20 ? FS > 50.0 : SOR > 3).
The variant call below (some annotations snipped), which is from a genotyped gVCF from HaplotypeCaller (using a BQSR'ed BAM file), looks well supported (high
MQ0). However, there appears to be some strand bias (
788.77 . DP=34;FS=5.213;MQ=35.37;MQ0=0;QD=25.44;SOR=3.334 GT:AD:DP:GQ:PL 1/1:2,29:31:35:817,35,0
In this instance the filter example above would be applied.
Is this filtering out a true positive? And what kind of cut-offs should I be using for
The snipped annotations
BaseQRankSum=-0.8440 for this variant also indicate minor bias that the evidence to support this variant call also has some bias (the variant appears near the end of reads in low quality bases, compared to the reads supporting the reference allele).
This is part of a larger hard filter I'm applying to a set of genotyped gVCFs called from HaplotypeCaller.
HomRef positions using this JEXL filter:
vc.getGenotype("%sample%").isHomRef() ? ( vc.getGenotype("%sample%").getAD().size == 1 ? (DP < 10) : ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || MQRankSum > 3.2905 || ReadPosRankSum > 3.2905 || BaseQRankSum > 3.2905 ) ) : false
HomVar positions using this JEXL:
vc.getGenotype("%sample%").isHomVar() ? ( vc.getGenotype("%sample%").getAD().0 == 0 ? ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || MQ < 30.0 ) : ( BaseQRankSum < -3.2905 || MQRankSum < -3.2905 || ReadPosRankSum < -3.2905 || (MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || (DP < 20 ? FS > 60.0 : SOR > 3.5) || MQ < 30.0 || QUAL < 100.0 ) ) : false
My goal is true positive variants only and I have high coverage data, so the filtering should be relatively stringent. Unfortunately I don't have a database I could use to apply VQSR, henceforth the comprehensive filtering strategy.
Hi, again, thanks a lot for the amazing workshop in Brussels! I have a question on dealing with strand bias, regarding the SB flag in the vcf file: The value of SB (strand bias) is calculated by Fisher exact test, using a 2X2 table that contains the reference, non-reference, fwd and reverse depths. Playing a little with the numbers given to Fisher exact test through web calculator I noticed that combinations which seem as clear strand bias receive non-significant value (e.g. 30,1,110,2 for ref-fwd, ref-reverse, non-ref-fwd, non-ref-reverse receive p-value of 0.52 or 2.84 when phred-scaled). Such variant are therefore considered as unbiased. The cases that are defined as bias are ones where 3 out of the 4 values are similar to each and only one is extremely different. As far as I understand, cases as the one I mentioned should be referred as biased. Do you recommend using this strand bias value and filter variants based on it?
I have seen the definition of strand bias on this site (below) but I need a little clarification. Does the FS filter (a) highlight instances where reads are only present on a single strand and contain a variant (as may occur toward the end of exome capture regions) or does it (b) specifically look for instances where there are reads on both strands but the variant allele is disproportionately represented on one strand (as might be indicative of a false positive), or does it (c) do both?
I had thought it did (b) but have encountered some disagreement.
** How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls.
The Strand Bias (SB) value on the vcf file looks so difference from the gatk1.5 & 2.1 1. On the gatk 1.5, SB=0 or 0.1 is acceptable, the smaller SB the better ... etc 2. On gatk 2.1 SB~-6.1e03 ???
My question is what is cut-off SB value (acceptable!)
How to get all SNPs variants with Strand Bias (high SB value) ?
GenomeAnalysisTK.jar -T UnifiedGenotyper \ -R ucsc.hg19.fasta -D dbsnp_135.hg19.vcf \ -stand_emit_conf 1.5 -stand_call_conf 1.5 \ -o Sample_2.snv.vcf -I Sample_2.recal.bam -nt 10