GATK writes out consScoreGERP=NA and scorePhastCons=NA while they are declared to be type float. The fields should be set to '.' for empty values instead of NA (as per VCF standard)
This causes problems downstream when using VCF parsers that expect either a float or ".". A few other fields use 'NA' (eg granthamScore) but this doesn't break downstream parsers as they are String fields. I think it would be good to change them, though.
I've spent some time looking at the various documentation and have a few lingering questions for understanding my data a little better. I was hoping you could help to clarify a bit and make sure I am making correct assumptions.
I have a set of 96 samples that we ran targeted sequencing on. I have followed the best practices for bam file processing, etc, and ran Unified Genotyper on all 96 bam files at once using -EMIT ALL SITES. I have snp chip data on all of my samples as well so I have some "truth" to compare to. It looks like around 25% of the SNPs I expected to get (are present in dbSNP and I have chip data for them) are not present in my vcf file. Where did they go - is it normal to not get all of the dbSNP sites back? Is this because the variant site did not have good enough quality of bases to make it a variant? Does Unified Genotyper not emit sites for which all samples are homozygous reference even though it knows it's there because of the dbSNP file (like CASAVA)? I looked at the coverage at these positions and in some cases it was very high (over 100X) so I'm not sure that they could all be bad. I am somewhat new to this sort of data analysis - what are the best tools to use to look at the quality of the bases at a single site to see if this is the culprit? Any other ideas as to what could be causing this?
What are the internal things in Unified Genotyper that make a site not be called?
Would I be able to use the GENOTYPE GIVEN ALLELES option to make Unified Genotyper call all dbSNPs regardless of base quality or Phred Score? If so, do I use the same reference db snp .vcf file that I use for the rest of the steps?
For the variants I do have: If I am interested in the coverage at a site, the AD is the unfiltered depth and the DP is the filtered depth (the sum of the allele depths is the total depth), correct?
FInally, I have not run VSQR on my output. We did targeted sequencing and I'm not sure if the number of bases we did is enough got VSQR to be helpful. What is the minimum number of bases that VSQR is recommended with? I have 96 samples and I know that is enough. We are mostly interested in dbSNP values anyhow. In your opinion, is it reasonable to stop after Unified Genotyper since we are only interested in dbSNP calls or should we do the VSQR step anyhow?
Thank you for all your help and insights. I really appreciate the feedback allowed for in this forum.