From my whole-genome (human) BAM files, I want to obtain: For each variant in dbSNP, the GQ and VQSLOD associated with seeing that variant in my data.
Here's my situation using HaplotypeCaller -ERC GVCF followed by GenotypeGVCFs:
CHROM POS ID REF ALT
chr1 1 . A
How can I get something like this to work? Besides needing a GATK-style GVCF file for dbSNP, I'm not sure how GenotypeGVCFs behaves if "tricked" with a fake GVCF not from HaplotypeCaller.
My detailed reason for needing this is below:
For positions of known variation (those in dbSNP), the reference base is arbitrary. For these positions, I need to distinguish between three cases: 1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1. 2. We have sufficient evidence to call position n as homozygous reference (0/0) with confidence scores GQ=x2 and VQSLOD=y2. 3. We do not have sufficient evidence to make any call for position n.
I was planning to use VQSR because the annotations it uses seem useful to distinguish between case 3 and either of 1 and 2. For example, excessive depth suggests a bad alignment, which decreases our confidence in making any call, homozygous reference or not.
Following the best practices pipeline using HaplotypeCaller -ERC GVCF, I get ALTs
Consequently, this seems to distinguish only between these two cases: 1. We have sufficient evidence to call position n as the variant genotype 0/1 (or 1/1) with confidence scores GQ=x1 and VQSLOD=y1. 2. We do not have sufficient evidence to call position n as a variant (it's either 0/0 or unknown).
This isn't sufficient for my application, because we care deeply about the difference between "definitely homozygous reference" and "we don't know".
Thanks in advance!
Hi Mark, Eric -
First, I wanted to thank you guys for providing advice with respect to running VQSR. I am already sold and a huge fan of the method :-).
I was wondering if either of you could comment on VQSLOD and sensitivity filter tranche? To be more specific, if I set a filter threshold of 99% for sensitivity and VQSLOD < 0 I imagine that probably is not a good idea! However, a VQSLOD of 3 or 5 may be appropriate in the statistical sense, i.e. pretty confident that this is a real variant. Finally, I am thinking we should include VQSLOD in our statistical genetic association mapping methods. I wanted to get a sense from either of you what VQSLOD you would want to completely remove from analysis?