Hi, I am running VariantFiltration (GATK v 3.4) using the following command:
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg19.fa -V raw_snps.vcf --filterExpression "QUAL/DP < 2.0 || DP < 10.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0" --missingValuesInExpressionsShouldEvaluateAsFailing --filterName "current" -o filtered_snps.vcf
Only half of the variants have MQRankSum values, and despite the argument '--missingValuesInExpressionsShouldEvaluateAsFailing', the output file has many variants marked as PASS that do not have MQRankSum. Ideally, the ones without the MQRankSum annotation should not be marked as PASS, right? Is this a bug?
Also, as an aside, how do I get GATK to remove the variants that do not PASS from the file, so that only the PASS variants remain in the output file?
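To put a number on the problem, the PASS records that lack a given annotation can be listed with a few lines of Python. This is only a minimal sketch that assumes a standard tab-delimited VCF; the function name `pass_records_missing` and the example records are made up for illustration and are not part of GATK.

```python
def pass_records_missing(vcf_lines, key):
    """Return (CHROM, POS) for PASS records whose INFO field lacks `key`."""
    missing = []
    for line in vcf_lines:
        # Skip header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, filt, info = fields[0], fields[1], fields[6], fields[7]
        # INFO entries are either KEY=VALUE pairs or bare flags
        info_keys = {entry.split("=", 1)[0] for entry in info.split(";")}
        if filt == "PASS" and key not in info_keys:
            missing.append((chrom, pos))
    return missing


# Hypothetical records, invented for this example
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30;MQ=60.0;MQRankSum=0.358",
    "chr1\t200\t.\tC\tT\t45\tPASS\tDP=25;MQ=60.0",
    "chr1\t300\t.\tG\tA\t12\tcurrent\tDP=5;MQ=35.0",
]
print(pass_records_missing(example, "MQRankSum"))
```

For a real file, the same function can be fed an open file handle, e.g. `pass_records_missing(open("filtered_snps.vcf"), "MQRankSum")`.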
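To make clear what output I am after, here is a rough sketch done outside GATK in plain Python. It assumes a standard tab-delimited VCF in which column 7 is FILTER; the function name `keep_pass_only` and the example records are hypothetical.

```python
def keep_pass_only(vcf_lines):
    """Yield header lines unchanged, plus only the records whose FILTER is PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) is the FILTER field
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


# Hypothetical records, invented for this example
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30",
    "chr1\t200\t.\tC\tT\t12\tcurrent\tDP=5",
    "chr1\t300\t.\tG\tA\t48\tPASS\tDP=22",
]
for line in keep_pass_only(records):
    print(line)

# With real files (the output name is an assumption):
# with open("filtered_snps.vcf") as src, open("pass_only.vcf", "w") as dst:
#     dst.writelines(keep_pass_only(src))
```

Records with an unfiltered "." in the FILTER column are dropped as well, since only an explicit PASS is kept.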
I found that VariantAnnotator sometimes does not annotate some annotations that are requested.
A) The rank sum test annotations MQRankSum and BaseQRankSum: I was not able to identify the requirements that have to be met for them to be calculated for a variant.
B) InbreedingCoeff: This one seems to be connected to the total number of called alleles (AN). For me, at least 10% of the alleles needed to be called (19/186). The doc says "at least 10 founder samples". Maybe this has to be updated to 10%?
These are the ones I observed. Can someone tell me more about that?
On 2000 samples I have run HC3.2, CGVCFs3.2, GGVCFs3.2 and VR3.2.
For the GenotypeGVCFs step I used the current default annotations:
InbreedingCoeff FisherStrand QualByDepth ChromosomeCounts GenotypeSummaries
And these non-default annotations:
When running VariantRecalibrator and plotting each of the dimensions, I noticed that all of the non-default annotations take on discrete values; see the bottom of this post. Is it no longer recommended to use ReadPosRankSum and MQRankSum for VR? Should I calculate these annotations with VariantAnnotator instead of GenotypeGVCFs? If I have to run VariantAnnotator, should I then run it separately for SNPs and INDELs? Cf. my previous question about annotations being different when applied to BOTH and SNPs: http://gatkforums.broadinstitute.org/discussion/2620
zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep ReadPosRankSum | sort | uniq -c | awk '$1>20000' | sort -k1n,1
41649 ReadPosRankSum=0.731
41760 ReadPosRankSum=0.550
46305 ReadPosRankSum=0.720
47060 ReadPosRankSum=0.00
87348 ReadPosRankSum=0.406
105254 ReadPosRankSum=0.736
116426 ReadPosRankSum=0.727
164855 ReadPosRankSum=0.358

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep "MQ=" | sort | uniq -c | awk '$1>5000' | sort -k1n,1
5802 MQ=57.05
8382 MQ=29.00
8525 MQ=56.62
10069 MQ=51.77
10574 MQ=53.95
10682 MQ=47.12
10818 MQ=56.04
11553 MQ=55.21
802603 MQ=60.00

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep MQRankSum | sort | uniq -c | awk '$1>20000' | sort -k1n,1
21511 MQRankSum=-7.360e-01
27222 MQRankSum=0.322
33699 MQRankSum=0.550
34481 MQRankSum=0.731
37603 MQRankSum=0.720
60729 MQRankSum=0.00
76031 MQRankSum=0.406
85812 MQRankSum=0.736
98519 MQRankSum=0.727
186092 MQRankSum=0.358
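The same tally can be sketched in Python, which makes it easier to match the annotation key exactly instead of by substring. This is a minimal sketch, not the method used to produce the counts above; the function name `count_info_values` and the example records are made up for illustration.

```python
from collections import Counter


def count_info_values(vcf_lines, key):
    """Count how often each value of INFO annotation `key` occurs."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        for entry in fields[7].split(";"):
            name, sep, value = entry.partition("=")
            # Exact key match, unlike grep, which matches substrings
            if sep and name == key:
                counts[value] += 1
    return counts


# Hypothetical records, invented for this example
sample = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t1000\t.\tA\tG\t50\t.\tDP=30;MQ=60.00;MQRankSum=0.358",
    "20\t2000\t.\tC\tT\t45\t.\tDP=25;MQ=60.00;MQRankSum=0.358",
    "20\t3000\t.\tG\tA\t40\t.\tDP=12;MQ=29.00",
]
for value, n in count_info_values(sample, "MQRankSum").most_common():
    print(n, value)

# For a gzipped file, the lines can come from gzip.open(path, "rt").
```

Sorting with `most_common()` lists the most frequent values first, which is the reverse of the `sort -k1n,1` order used in the shell pipelines.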
I read the documentation for MappingQualityRankSumTest and ReadPosRankSumTest:
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_MappingQualityRankSumTest.html
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_ReadPosRankSumTest.html
Both pages read: "The ... rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles."
I have quite a few sites for which MQRankSum and ReadPosRankSum are missing. How does VariantRecalibrator handle this missing information?
I'm struggling with some statistics given in the VCF file: the rank sum tests. I started googling around, but that turned out not to be helpful for understanding them (in my case). I really have no idea how to interpret the statistic values in the VCF that come from the rank sum tests. I have no clue whether a negative value, a positive value, or a value near zero is good or bad. Therefore I'm asking for some help here. Maybe someone knows a good tutorial page or can give me a hint to better understand the values of MQRankSum, ReadPosRankSum and BaseQRankSum. I have the same problem with the FisherStrand statistic. Many, many thanks in advance.
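For orientation on the sign of such a value, here is a generic Wilcoxon rank-sum z-score using the normal approximation. This is a sketch only and not GATK's implementation: it applies no tie correction, the function names and the mapping-quality values are invented, and the orientation (alternate-allele reads as the first sample, so that a negative z means the alt reads rank lower than the ref reads) is an assumption made for this example.

```python
import math


def average_ranks(values):
    """Map each distinct value to its average 1-based rank (ties share a rank)."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Sorted positions i..j-1 hold the same value, i.e. ranks i+1..j
        ranks[ordered[i]] = (i + 1 + j) / 2.0
        i = j
    return ranks


def rank_sum_z(alt_values, ref_values):
    """Normal-approximation z-score of the rank-sum statistic, alt versus ref.

    Returns None when either group is empty, because the test needs
    observations from both groups.
    """
    n_alt, n_ref = len(alt_values), len(ref_values)
    if n_alt == 0 or n_ref == 0:
        return None
    ranks = average_ranks(list(alt_values) + list(ref_values))
    rank_sum_alt = sum(ranks[v] for v in alt_values)
    u_alt = rank_sum_alt - n_alt * (n_alt + 1) / 2.0
    mean_u = n_alt * n_ref / 2.0
    # No tie correction is applied to the standard deviation in this sketch
    sd_u = math.sqrt(n_alt * n_ref * (n_alt + n_ref + 1) / 12.0)
    return (u_alt - mean_u) / sd_u


# Invented mapping qualities: alt reads map worse than ref reads
print(rank_sum_z([20, 20, 25], [60, 60, 60, 60]))  # negative
# Identical distributions give a value of zero
print(rank_sum_z([60, 60], [60, 60]))
# Only one allele observed: the statistic is undefined
print(rank_sum_z([], [60, 60]))
```

Under these assumptions, a value near zero means the two groups of reads look alike, while a large value in either direction means the reads supporting one allele differ systematically from those supporting the other.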
I have run UnifiedGenotyper with the -glm options SNP and BOTH. These two approaches yield identical variants and identical genotype likelihoods (at least the first 100k variants I checked). However, a few of the annotations have different values: BaseQRankSum MQRankSum ReadPosRankSum
-glm SNP on the left and -glm BOTH on the right:
Why is that?
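A comparison like this can be sketched in Python by matching sites across the two VCFs and reporting the annotations whose values differ. This is a minimal sketch; the function names and the example records are hypothetical and are not taken from my actual output.

```python
def info_by_site(vcf_lines):
    """Map (CHROM, POS, REF, ALT) to a dict of that record's INFO annotations."""
    sites = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 8:
            continue
        info = {}
        for entry in f[7].split(";"):
            name, sep, value = entry.partition("=")
            # Bare flags have no value; store True for them
            info[name] = value if sep else True
        sites[(f[0], f[1], f[3], f[4])] = info
    return sites


def differing_annotations(lines_a, lines_b, keys):
    """List (site, key, value_a, value_b) where shared sites disagree on a key."""
    a, b = info_by_site(lines_a), info_by_site(lines_b)
    diffs = []
    for site in a:
        if site not in b:
            continue
        for key in keys:
            va, vb = a[site].get(key), b[site].get(key)
            if va != vb:
                diffs.append((site, key, va, vb))
    return diffs


# Hypothetical records standing in for the -glm SNP and -glm BOTH outputs
glm_snp = ["20\t1000\t.\tA\tG\t50\t.\tDP=30;MQRankSum=0.358;ReadPosRankSum=0.731"]
glm_both = ["20\t1000\t.\tA\tG\t50\t.\tDP=30;MQRankSum=0.406;ReadPosRankSum=0.731"]
keys = ["BaseQRankSum", "MQRankSum", "ReadPosRankSum"]
for diff in differing_annotations(glm_snp, glm_both, keys):
    print(diff)
```

An annotation that is absent from both records compares as equal here, so only real disagreements (including present in one file and missing in the other) are reported.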
I noticed another user got different variants, but I get the same variants and the same likelihoods: [http://gatkforums.broadinstitute.org/discussion/1782/unifiedgenotyper-different-glm-value-result-in-different-sets-of-variants]
I ran single threaded.
I use MQRankSum and ReadPosRankSum for VariantRecalibrator, so it affects my downstream results if the annotations are -glm dependent. Hence my question. I hope you can enlighten me. Thank you.