I have been working primarily with non-model organisms (and mostly inbred mapping populations, but that's a topic for a different discussion). To recalibrate base qualities, I first run through indel realignment and SNP and INDEL calling, then filter around INDELs. I work with multi-sample VCFs, and my approach is as follows: I take the SNPs above the 90th percentile (based on ALTQ) among all SNPs in my filtered SNP VCF file. Then, for each SAMPLE in the VCF file (I usually have between 100 and 300 samples), I pull out those top SNPs and write them to a SEPARATE VCF file for that SAMPLE if GQ > 90 and the site is a SNP for that sample. I then use these per-SAMPLE HQ VCF files for the BQSR tools.
I have a simple Python script for this, located here.
    usage: GetHighQualVcfs.py [-h] -i INFILE -o OUTDIR [--ploidy PLOIDY] [--GQ GQ] [--percentile PERCENTILE]

    Split multi-sample VCFs into single sample VCFs of high quality SNPs.

    optional arguments:
      -h, --help            show this help message and exit
      -i INFILE, --infile INFILE
                            Multi-sample VCF file
      -o OUTDIR, --outdir OUTDIR
                            Directory to output HQ VCF files.
      --ploidy PLOIDY       1 for haploid; 2 for diploid
      --GQ GQ               Filters out variants with GQ < this limit.
      --percentile PERCENTILE
                            Reduces to variants with ALTQ > this percentile.
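For anyone who wants to see the selection logic without opening the script, here is a minimal sketch of the steps described above. It is not the actual GetHighQualVcfs.py, just an illustration under a few assumptions: ALTQ is taken to be the QUAL column, the percentile is computed with a simple nearest-rank rule, only biallelic SNPs are considered, and the function names (`percentile`, `split_high_quality`, `write_sample_vcfs`) are made up for this example. It uses only the standard library.

```python
import math
import os


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def split_high_quality(vcf_lines, pct=90, min_gq=90):
    """Return {sample: [single-sample VCF record lines]} of high-quality SNPs."""
    samples, records = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#CHROM"):
            samples = line.split("\t")[9:]
        elif line and not line.startswith("#"):
            records.append(line.split("\t"))

    # Keep biallelic SNPs that have a site quality (QUAL stands in for ALTQ).
    snps = [r for r in records
            if len(r[3]) == 1 and len(r[4]) == 1 and r[4] != "." and r[5] != "."]
    per_sample = {s: [] for s in samples}
    if not snps:
        return per_sample
    cutoff = percentile([float(r[5]) for r in snps], pct)

    for r in snps:
        if float(r[5]) <= cutoff:  # keep only sites strictly above the cutoff
            continue
        keys = r[8].split(":")
        if "GT" not in keys or "GQ" not in keys:
            continue
        gt_i, gq_i = keys.index("GT"), keys.index("GQ")
        for col, sample in enumerate(samples, start=9):
            fields = r[col].split(":")
            if len(fields) <= max(gt_i, gq_i) or fields[gq_i] == ".":
                continue
            alleles = fields[gt_i].replace("|", "/").split("/")
            carries_alt = any(a not in ("0", ".") for a in alleles)
            if carries_alt and float(fields[gq_i]) >= min_gq:
                per_sample[sample].append("\t".join(r[:9] + [r[col]]))
    return per_sample


def write_sample_vcfs(per_sample, outdir):
    """Write one minimal VCF per sample into outdir."""
    os.makedirs(outdir, exist_ok=True)
    columns = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    for sample, lines in per_sample.items():
        with open(os.path.join(outdir, sample + ".vcf"), "w") as out:
            out.write("##fileformat=VCFv4.2\n")
            out.write(columns + "\t" + sample + "\n")
            for rec in lines:
                out.write(rec + "\n")
```

With an open multi-sample VCF, something like `write_sample_vcfs(split_high_quality(open("filtered_snps.vcf")), "hq_vcfs")` would produce one file per sample (the file and directory names here are placeholders). Note that the sketch keeps genotypes with GQ at or above the limit, following the help text, and does not copy the original header lines.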
Thoughts? Concerns? Perhaps I'm going about this in a completely wrong way?