I'm struggling with some statistics given in the VCF file: the rank sum tests. I started googling around, but that turned out not to be helpful for understanding them (in my case). I really have no idea how to interpret the VCF statistic values that come from the rank sum tests. I have no clue whether a negative value, a positive value, or a value near zero is good or bad, so I'm asking for some help here. Maybe someone knows a good tutorial page or can give me a hint to better understand the values of MQRankSum, ReadPosRankSum and BaseQRankSum. I have the same problem with the FisherStrand statistic. Many, many thanks in advance.
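For what it's worth, the *RankSum annotations are, as far as I understand them, z-scores from a Wilcoxon rank sum (Mann-Whitney) test. The test compares a per-read value (mapping quality, base quality, or position in the read) between reads supporting the alternate allele and reads supporting the reference allele. A value near zero means the two groups look alike. A negative value means the alt-supporting reads tend to have lower values than the ref-supporting reads, and a positive value means the opposite. The sketch below is only an illustration of that idea using the normal approximation. The function name `rank_sum_z` and the example qualities are made up, and this is not GATK's actual implementation (it skips the tie correction in the variance, for instance).

```python
import math

def rank_sum_z(alt_values, ref_values):
    """Approximate z-score of a rank sum test, alt reads vs. ref reads.

    Hypothetical helper for illustration only; not GATK code.
    Negative z: alt values tend to be lower than ref values.
    """
    # Tag each value with its group: 0 = alt, 1 = ref, then sort by value.
    combined = sorted([(v, 0) for v in alt_values] + [(v, 1) for v in ref_values])

    # Sum the ranks of the alt values, giving tied values their average rank.
    alt_rank_sum = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        n_alt_in_tie = sum(1 for k in range(i, j) if combined[k][1] == 0)
        alt_rank_sum += avg_rank * n_alt_in_tie
        i = j

    n_alt, n_ref = len(alt_values), len(ref_values)
    u = alt_rank_sum - n_alt * (n_alt + 1) / 2.0           # Mann-Whitney U for alt
    mean_u = n_alt * n_ref / 2.0                            # expected U under no difference
    sd_u = math.sqrt(n_alt * n_ref * (n_alt + n_ref + 1) / 12.0)
    return (u - mean_u) / sd_u

# Made-up base qualities: alt reads clearly worse than ref reads.
z_low_alt = rank_sum_z([10, 12, 11], [30, 31, 29, 32])
# Made-up base qualities: both groups identical.
z_same = rank_sum_z([20, 30], [20, 30])
print(z_low_alt, z_same)
```

Read that way, a strongly negative BaseQRankSum would suggest that the bases supporting the alternate allele have noticeably lower qualities than the bases supporting the reference allele.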
I have the following variant called by Unified Genotyper (GATK version: GenomeAnalysisTK-2.6-5):
chr9 139413211 . T G 7.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-7.913;DP=296;Dels=0.00;FS=37.414;HaplotypeScore=22.3462;MLEAC=1;MLEAF=0.500;MQ=70.00;MQ0=0;MQRankSum=0.508;QD=0.03;ReadPosRankSum=-3.354 GT:AD:DP:GQ:PL 0/1:180,115:282:35:35,0,3884
The FS score is 37.414, but a closer look at the BAM file shows that the 115 reads supporting the alternate allele G are all on the + strand. Shouldn't the FS score be much higher for this variant? Of the reads supporting the reference allele T at this position, 113 are on the + strand and 67 are on the - strand.
Please help me understand whether I have misunderstood the FS score or whether this is a bug.
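To make the expectation concrete: FS is described as the Phred-scaled p-value of a Fisher's exact test on the 2x2 table of allele (ref/alt) by strand (+/-). Below is a minimal sketch of that calculation on the raw counts quoted above (ref: 113 +, 67 -; alt: 115 +, 0 -). The helper names are hypothetical. It is a naive textbook computation and not GATK's code. GATK builds its table only from the reads it considers usable after its own filtering, and it may treat the table differently, so the number it reports does not have to match this one.

```python
import math

def _log_table_prob(a, b, c, d):
    """Log hypergeometric probability of the 2x2 table [[a, b], [c, d]]."""
    lg = math.lgamma
    return (lg(a + b + 1) + lg(c + d + 1) + lg(a + c + 1) + lg(b + d + 1)
            - lg(a + 1) - lg(b + 1) - lg(c + 1) - lg(d + 1)
            - lg(a + b + c + d + 1))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    observed = _log_table_prob(a, b, c, d)
    lo = max(0, row1 + col1 - n)
    hi = min(row1, col1)
    p = 0.0
    for x in range(lo, hi + 1):
        lp = _log_table_prob(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if lp <= observed + 1e-7:  # small tolerance for floating point ties
            p += math.exp(lp)
    return min(1.0, p)

def phred_scaled(p):
    """Phred scaling: -10 * log10(p), guarding against p == 0."""
    return -10.0 * math.log10(max(p, 5e-324))

# Rows: ref, alt. Columns: + strand, - strand. Counts from the post above.
p_value = fisher_exact_two_sided(113, 67, 115, 0)
naive_fs = phred_scaled(p_value)
print(p_value, naive_fs)
```

On these raw counts the naive value comes out well above 100, so the expectation of a higher score is reasonable if the table GATK used matched the reads visible in the BAM file. A reported FS of 37.414 suggests that the table GATK built differed from these counts, for example because some reads were filtered or not counted toward either allele.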
I am filtering for rare variants and found some frameshift variants in an interesting gene. Some of them are marked PASS in the QC column of the VCF and some are marked Indel_FS. What exactly does that second notation mean? I am almost positive that these will validate, given how they segregate in my subjects.
I have seen the definition of strand bias on this site (below) but I need a little clarification. Does the FS filter (a) highlight instances where reads are only present on a single strand and contain a variant (as may occur toward the end of exome capture regions) or does it (b) specifically look for instances where there are reads on both strands but the variant allele is disproportionately represented on one strand (as might be indicative of a false positive), or does it (c) do both?
I had thought it did (b) but have encountered some disagreement.
How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls).