I'd like to be able to stratify a multi-sample VCF by values in the FORMAT fields. Almost all of the existing stratifications are based on site-specific information rather than sample-specific information. One stratification in particular that I would like to perform is by ReadDepth. I would like to be able to separate out, for instance, all samples with ReadDepth greater than 20. This works in single-sample VCFs, but it produces strange results in multi-sample ones, since each VariantContext contains multiple genotypes.
Melting my VCFs and reporting multiple lines for each position seems possible, but ugly. Splitting the VCFs so that each sample is in its own VCF is also possible, and also ugly. What is the recommended method for dealing with this sort of stratification?
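To make the "melting" option concrete, here is a minimal sketch of what I mean, in plain Python rather than GATK. It assumes an uncompressed, tab-delimited VCF with a DP key in the FORMAT column; `melt_vcf`, `depth_stratum`, and the example records are hypothetical names and data made up for illustration, not part of any GATK tool.

```python
import io

def melt_vcf(handle):
    """Yield one (chrom, pos, sample, fields) tuple per genotype in a VCF stream."""
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names follow the FORMAT column
            continue
        keys = cols[8].split(":")  # FORMAT keys, e.g. GT:DP
        for sample, value in zip(samples, cols[9:]):
            # Trailing FORMAT values may be omitted, so zip stops early.
            fields = dict(zip(keys, value.split(":")))
            yield cols[0], int(cols[1]), sample, fields

def depth_stratum(fields, cutoff=20):
    """Assign a genotype to a stratum by its DP value (hypothetical labels)."""
    dp = fields.get("DP", ".")
    if dp == ".":
        return "missing"
    return "high" if int(dp) > cutoff else "low"

# Made-up two-sample example, for illustration only.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:35\t0/0:12",
    "1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t1/1:20\t./.",
])

for chrom, pos, sample, fields in melt_vcf(io.StringIO(EXAMPLE_VCF)):
    print(chrom, pos, sample, depth_stratum(fields))
```

This gives one row per site and sample, which is exactly the duplication I would like to avoid if the tools can stratify per genotype directly.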
I would like to evaluate variant calls to produce a plot (pseudo-ROC) of sensitivity vs. specificity (or concordance, etc.) when I condition on a minimum/maximum value for a particular metric (coverage, genotype quality, etc.). I can do this by running VariantEval or GenotypeConcordance multiple times, once for each cutoff value, but this is inefficient, since I believe I should be able to compute these values in one pass. Alternatively, if there were a simple tool to annotate each variant as concordant or discordant, I could tabulate the results myself. I would like to rely upon GATK's variant comparison logic to compare variants (especially indels). Any thoughts on whether current tools can be parameterized or adapted for these purposes?
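To illustrate the tabulation I have in mind, here is a minimal sketch in plain Python. It assumes each call has already been labeled concordant or discordant (which is the part I would want GATK to do) and carries a numeric metric such as DP or GQ; `concordance_by_cutoff` and the example values are hypothetical, made up for illustration.

```python
from bisect import bisect_left
from itertools import accumulate

def concordance_by_cutoff(calls, cutoffs):
    """Tabulate calls kept and concordant calls kept for each minimum cutoff.

    calls: iterable of (metric_value, is_concordant) pairs.
    Returns {cutoff: (n_kept, n_concordant)} for calls with metric >= cutoff.
    """
    ordered = sorted(calls, key=lambda c: c[0])
    values = [v for v, _ in ordered]
    # prefix[i] = number of concordant calls among the i lowest-metric calls
    prefix = [0] + list(accumulate(int(ok) for _, ok in ordered))
    total = len(ordered)
    result = {}
    for cutoff in cutoffs:
        i = bisect_left(values, cutoff)  # index of first call with value >= cutoff
        result[cutoff] = (total - i, prefix[-1] - prefix[i])
    return result

# Made-up labeled calls: (metric value, concordant with the truth set?)
example_calls = [(5, False), (10, True), (20, True), (20, False), (35, True)]

for cutoff, (kept, concordant) in concordance_by_cutoff(example_calls, [0, 10, 20, 30]).items():
    rate = concordant / kept if kept else float("nan")
    print(f"cutoff>={cutoff}: kept={kept} concordant={concordant} rate={rate:.2f}")
```

The labeled calls are read once and sorted, and every cutoff is then answered from the cumulative counts, so the cost does not grow by a full rerun per cutoff.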
Thanks for your help in advance,