I am doing variant calling of multiple RNAseq datasets using GATK/3.4.46. For limitation of computational resources, I ran HaplotypeCaller on each dataset separately. Then I ran CombineVaraints to merge all output VCF files using this command
java -Xmx10g -jar GenomeAnalysisTK.jar \ -T CombineVariants \ -R $gatk_ref \ --variant set1.vcf \ --variant set2.vcf \ --variant set3.vcf \ -o combine_output.vcf \ -genotypeMergeOptions UNIQUIFY
Then I tried to run VariantFiltration using thic command
java -Xmx2g -jar $GATK/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R $gatk_ref \ -V combine_output.vcf \ -window 35 -cluster 3 \ -filterName FS -filter "FS > 30.0" \ -filterName QD -filter "QD < 2.0" \ -o $output
Several thousands variants though warning for absence of FS and QD. According to @Sheila advise in http://gatkforums.broadinstitute.org/discussion/2334/undefined-variable-variantfiltration, I ran VariantAnnotator to add these annotations using this command
java -Xmx45g -jar GenomeAnalysisTK.jar \ -R $gatk_ref \ -T VariantAnnotator \ -I input1.bam \ -I input2.bam \ . . -I input57.bam \ -V combine_output.vcf \ -A Coverage \ -A FisherStrand \ -A QualByDepth \ -nt 7 \ -o combine_output_ann.vcf
Then I repeated the VariantFiltration but I have 2 problems: 1) about 2000 variants are still not annotated for FS. All of them are indels and many of them are not homozygous for the ALT allele). Also ~ 40 variants are still not annotated for QD. All of them have multiple ALT alleles 2) The combined VCF record has the QUAL of the first VCF record with a non-MISSING QUAL value. According to my manual calculations, I think VariantAnnotator calculates the QD value by dividing this QUAL value by the AD of samples with a non hom-ref genotype call. This cause many variants to fail the QD filter.
Hi GATK team, Again thanks a lot for the wonderful tools you're offering to the community.
I have recently switched from UnifiedGenotyper to Haplotype Caller (1 sample at a time, DNASeq). I was planning to use the same hard filtering procedure that I was using previously, including the filter of the variants with FS > 60. However I am facing an issue probably due to the downsampling done by HC.
I should have 5000 reads, but DP is around 500/600 which I understood is due to downsampling (even with -dt NONE). I did understand that it does not impact in the calling itself. However it is annoying me for 2 reasons 1) Calculating frequency of the variant using the AD field is not correct (not based on all reads) 2) I get variants with FS >60 whereas when you look at the entire set of reads, there is absolutely no strand bias.
Example with this variant chr17 41245466 rs1799949 G A 7441.77 STRAND_BIAS; AC=1;AF=0.500;AN=2;BaseQRankSum=7.576;DB;DP=1042;FS=63.090;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.666;QD=7.14;ReadPosRankSum=-11.896;SOR=5.810 GT:AD:GQ:PL:SB 0/1:575,258:99:7470,0,21182:424,151,254,4
When I observe all reads I have the following counts, well shared on the + and - strands Allele G : 1389 (874+, 515-) Allele A : 1445 (886+, 559-)
Could you please tell me how to avoid such an issue ? (By the way, this variant is a true one and should not be filtered out).
Thanks a lot.
I need to apply hard filters to my data. In cases where I have lower coverage I plan to use the
Fisher Strand annotation, and in higher coverage variant calls,
SOR (using a JEXL expression to switch between them:
DP < 20 ? FS > 50.0 : SOR > 3).
The variant call below (some annotations snipped), which is from a genotyped gVCF from HaplotypeCaller (using a BQSR'ed BAM file), looks well supported (high
MQ0). However, there appears to be some strand bias (
788.77 . DP=34;FS=5.213;MQ=35.37;MQ0=0;QD=25.44;SOR=3.334 GT:AD:DP:GQ:PL 1/1:2,29:31:35:817,35,0
In this instance the filter example above would be applied.
Is this filtering out a true positive? And what kind of cut-offs should I be using for
The snipped annotations
BaseQRankSum=-0.8440 for this variant also indicate minor bias that the evidence to support this variant call also has some bias (the variant appears near the end of reads in low quality bases, compared to the reads supporting the reference allele).
This is part of a larger hard filter I'm applying to a set of genotyped gVCFs called from HaplotypeCaller.
HomRef positions using this JEXL filter:
vc.getGenotype("%sample%").isHomRef() ? ( vc.getGenotype("%sample%").getAD().size == 1 ? (DP < 10) : ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || MQRankSum > 3.2905 || ReadPosRankSum > 3.2905 || BaseQRankSum > 3.2905 ) ) : false
HomVar positions using this JEXL:
vc.getGenotype("%sample%").isHomVar() ? ( vc.getGenotype("%sample%").getAD().0 == 0 ? ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || MQ < 30.0 ) : ( BaseQRankSum < -3.2905 || MQRankSum < -3.2905 || ReadPosRankSum < -3.2905 || (MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || (DP < 20 ? FS > 60.0 : SOR > 3.5) || MQ < 30.0 || QUAL < 100.0 ) ) : false
My goal is true positive variants only and I have high coverage data, so the filtering should be relatively stringent. Unfortunately I don't have a database I could use to apply VQSR, henceforth the comprehensive filtering strategy.
m struggeling with some statistics given by the vcf file: the Ranksumtests. I started googleing arround, but that turned out to be not helpfult for understanding it (in may case). I really have no idea how to interprete the vcf-statistic-values comming from ranksumtest. I have no clue whether a negative, positive or value near zero is good/bad. Therefore im asking for some help here. Maybe someone knows a good tutorial-page or can give me a hint to better understand the values of MQRankSum, ReadPosRankSum and BaseQRankSum. I have the same problem with the FisherStrand statistics. Many, many thanks in advance.
I have the following variant called by Unified Genotyper (GATK version : GenomeAnalysisTK-2.6-5) :
chr9 139413211 . T G 7.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-7.913;DP=296;Dels=0.00;FS=37.414;HaplotypeScore=22.3462;MLEAC=1;MLEAF=0.500;MQ=70.00;MQ0=0;MQRankSum=0.508;QD=0.03;ReadPosRankSum=-3.354 GT:AD:DP:GQ:PL 0/1:180,115:282:35:35,0,3884
The FS score is 37.414. But a closer look at the bam file indicates that the 115 reads supporting alternate allele G are all in + strand. Shouldn't the FS score be much higher for this variant? 113 reads reads supporting the reference allele T at this position are in + strand and 67 are in - strand.
Please help me understand if I am wrong about my understanding of FS score or if this is a bug.
I am filtering looking for rare variants and found some frameshift variants in an interesting gene. Some of them are noted as PASS in the QC column of the VCF and some are noted as Indel_FS . What exactly does that second notation mean? I am almost positive that these will validate given how they segregate in my subjects.
I have seen the definition of strand bias on this site (below) but I need a little clarification. Does the FS filter (a) highlight instances where reads are only present on a single strand and contain a variant (as may occur toward the end of exome capture regions) or does it (b) specifically look for instances where there are reads on both strands but the variant allele is disproportionately represented on one strand (as might be indicative of a false positive), or does it (c) do both?
I had thought it did (b) but have encountered some disagreement.
** How much evidence is there for Strand Bias (the variation being seen on only the forward or only the reverse strand) in the reads? Higher SB values denote more bias (and therefore are more likely to indicate false positive calls.