One of my projects currently has ~150 patients (exomes) that I've been processing through the standard pipeline (GATK 2.8-1, including ReduceReads). In my most recent run through HaplotypeCaller (HC), I split the cohort in half for the sake of time. A subset of these patients has undergone targeted genotyping in the clinic, and I have a list of 36 validated variants in 28 samples. When I checked these variants in the final VCF, 5 of the 36 were not called by HC even though they have moderate to excellent read support in the BAMs. Several of these (possibly all of them, I'm not sure) were present in previous HC and UnifiedGenotyper (UG) runs with fewer samples, and I verified that the one I'm focusing on is called correctly when I use only five samples.
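For reference, the check of validated variants against the final VCF can be sketched roughly as below. This is a minimal illustration, not the exact script I used: it assumes an uncompressed multi-sample VCF with a GT field in every record, and the function names (`parse_vcf_calls`, `find_missing`) and the `(chrom, pos, ref, alt, sample)` tuple layout are made up for the example. It also requires an exact match on position and alleles, so indels that are represented differently would need normalization first.

```python
def parse_vcf_calls(vcf_lines):
    """Map (chrom, pos, ref, alt) -> set of samples carrying that alt allele.

    Assumes a multi-sample VCF whose FORMAT column always includes GT.
    """
    samples = []
    calls = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        alt_list = fields[4].split(",")
        gt_idx = fields[8].split(":").index("GT")
        for sample, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[gt_idx]
            for allele in gt.replace("|", "/").split("/"):
                # "0" is the reference allele and "." is a no-call
                if allele.isdigit() and int(allele) > 0:
                    key = (chrom, pos, ref, alt_list[int(allele) - 1])
                    calls.setdefault(key, set()).add(sample)
    return calls


def find_missing(validated, calls):
    """Return validated (chrom, pos, ref, alt, sample) tuples with no matching call."""
    return [v for v in validated if v[4] not in calls.get(v[:4], set())]
```

Usage would look like `find_missing(validated, parse_vcf_calls(open("cohort.vcf")))`, where `cohort.vcf` is a placeholder file name and `validated` is the list of clinically confirmed variants.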
Debugging runs on a small region have revealed the following:
A couple of other random notes that may or may not be applicable: These are rare variants that I only expect to see in 1 or 2 samples. My testing region is ~400bp around the variant in question. There is a variant in another sample at an immediately adjacent nucleotide that is also not called (and, perhaps obviously, is also outside the active regions).
Do you have any suggestions for approaching this? I haven't messed with -minPruning yet, as increasing that value should result in a loss of sensitivity and reducing it seems like a bad idea. I suppose I could split my cohort into subsets of 30 or 40 samples, but that doesn't seem like the best approach.
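If I did end up going the subset route, the batching itself would be simple. A minimal sketch, assuming nothing more than a list of sample names (the helper name `split_cohort` and the `P001`-style IDs are hypothetical), which splits the cohort into the fewest near-equal batches that respect a maximum size:

```python
import math


def split_cohort(samples, max_size):
    """Split samples into the fewest near-equal batches of at most max_size each."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if not samples:
        return []
    n_batches = math.ceil(len(samples) / max_size)
    base, extra = divmod(len(samples), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # the first `extra` batches take one additional sample each
        size = base + (1 if i < extra else 0)
        batches.append(samples[start:start + size])
        start += size
    return batches


# Hypothetical sample IDs, for illustration only
cohort = ["P%03d" % i for i in range(1, 151)]
batches = split_cohort(cohort, 40)
```

Each batch could then be passed to its own HC run, although that would not explain why the calls drop out at larger cohort sizes.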