Hello there! Thanks as always for the lovely tools, I continue to live in them.
Methods Thus Far
We have HiSeq reads of "mutant" and wild-type (wt) fish, three replicates of each. The sequences were captured by a size-selected digest, so some regions have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (Ti/Tv-independent).
As per my interpretation of the Best Practices, I did an initial round of variant calling (HaplotypeCaller) and filtered it very heavily, keeping only variants that could be replicated across all samples. I then reprocessed and called variants again, using that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian priors for both resources from the suggested human values. The rest of my pipeline follows the Best Practices fairly closely; GATK version was 2.7-2, and mapping was done with BWA-MEM.
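For reference, my bootstrapped resource lines look roughly like this (file names, annotations, and the exact prior values here are illustrative placeholders, not the precise values from my run):

```shell
# GATK 2.7 VariantRecalibrator, SNP mode -- a sketch of the bootstrap setup.
# replicated_truth.vcf is the heavily filtered, replicated first-pass callset;
# zebrafish_dbsnp.vcf is dbSNP used as "known" only, with lowered priors.
java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R danRer_ref.fasta \
    -input raw_variants.vcf \
    -resource:bootstrap,known=false,training=true,truth=true,prior=10.0 replicated_truth.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 zebrafish_dbsnp.vcf \
    -an QD -an MQ -an FS -an ReadPosRankSum \
    -mode SNP \
    -recalFile output.recal \
    -tranchesFile output.tranches
```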
My semi-educated guess:
The spike in VQSLOD I see for variants found across all six replicates is simply the rediscovery of those in my truth set, plus those with amazing coverage, which is probably fine/good. The part that worries me is the plots and tranches. The plots never really show a section where the "known" set clusters with one group of obviously good variants but not with another. Is that OK, or do that and my inflated VQSLOD values ring of poor practice?
I was wondering if it is possible to train a VQSR model on one set of samples and then apply it to a different set of samples. For example, I would like to run GATK on 5000 exomes using a panel of 30-50 common samples. It would be computationally desirable if I only had to run the common panel used in training the VQSR step once, and then simply apply the trained model to each additional test sample.
From my reading of the VariantRecalibrator and ApplyRecalibration documentation, it looks like the Apply step doesn't allow this flexibility.
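Concretely, the workflow I'm hoping for would look something like the sketch below (commands based on the GATK 2.x docs; the resource, panel, and sample file names are hypothetical):

```shell
# Step 1 (run once): train the VQSR model on the common panel.
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R ref.fasta -input panel_30_samples.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -an QD -an MQ -an FS -mode SNP \
    -recalFile panel.recal -tranchesFile panel.tranches

# Step 2 (run per exome): apply the trained model to each new sample's calls.
# As far as I can tell, ApplyRecalibration expects -input to be the same
# callset that was recalibrated, so this is the step in question.
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
    -R ref.fasta -input sample_0001.vcf \
    -recalFile panel.recal -tranchesFile panel.tranches \
    --ts_filter_level 99.0 -mode SNP -o sample_0001.filtered.vcf
```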