How many genomes do you run UnifiedGenotyper on?
Posted in Ask the GATK team

I'm running UnifiedGenotyper on a large number of post-processed BAMs, which of course makes it run a bit slower than it does on individual genomes. There's an accuracy advantage to calling a large number of genomes at a time, but the returns diminish as n increases.

My question is: when you have a very large number of 4x full genomes (not exomes) available - say, in the high thousands - at what point is the advantage of including another genome in a single variant-calling run outweighed by the longer runtime, higher chance of failure, gigantic VCF output file, etc.? Where is your cutoff point? Of course it depends on the speed and reliability of your computing system, but experience from any system would be useful. Do you yourself call variants on at most 50 BAMs at a time? 500? 5000?

The G1K methodology doesn't hint at where they set their run-size cutoff, but both they and UK10K seem to have VCFs of roughly a thousand genomes per file. I'm tempted to take that as a hint, but I want to ask the smart people of the community first.