Does VQSR behave differently when the -out_mode flag in UnifiedGenotyper is set to EMIT_VARIANTS_ONLY as compared to EMIT_ALL_CONFIDENT_SITES? I think that using EMIT_ALL_CONFIDENT_SITES might give VQSR more information to train the model, but I may be wrong. Can someone please help me with this? Thanks.
When I use EMIT_ALL_CONFIDENT_SITES for SNPs, I get the expected very large list of genotypes, regardless of whether the genotypes differ from the reference. When I use the same command line but switch the model to indels, I only get a VCF of variant sites. Is the EMIT_ALL_CONFIDENT_SITES option not compatible with indel discovery?
I'm grateful for any clarification.
If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?
I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.
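One way to pin down exactly which requested positions were dropped is to compare the (CHROM, POS) pairs of the input VCF against those of the output VCF. The sketch below is only an illustration: the function names are made up for this example, it assumes tab-delimited, uncompressed VCF text, and it compares positions exactly, so a site that the caller reports at a shifted position (for example a left-aligned indel) would show up as missing.

```python
def site_keys(vcf_lines):
    """Collect the (CHROM, POS) pair of every record line in a VCF."""
    keys = set()
    for line in vcf_lines:
        # Skip blank lines and header lines (## metadata and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t", 2)[:2]
        keys.add((chrom, int(pos)))
    return keys


def missing_sites(requested_lines, emitted_lines):
    """Return requested sites that have no record in the emitted VCF."""
    return sorted(site_keys(requested_lines) - site_keys(emitted_lines))


# Hypothetical usage (the file names are placeholders):
# with open("input_sites.vcf") as req, open("hc_output.vcf") as out:
#     for chrom, pos in missing_sites(req, out):
#         print(chrom, pos)
```

The resulting list can then be checked against the BAM coverage at those positions to see whether the dropped sites share anything in common.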
I previously reported an issue in which I could not emit all sites or all confident sites when using a ploidy of 1. I downloaded the most recent version, and it now seems able to print the reference calls in these modes. The odd thing is that the quality for all of those calls is the same and very low, which I know does not reflect reality in all cases, given what we know about the sequence data and its relationship to the reference. This is also consistent across many samples, whether I run multiple samples through a single GATK run or a single sample on its own. I have pasted in an example from a recent run (multiple samples in the same run). The problem is that all of these reference calls get a LowQual filter, which makes them difficult to distinguish from calls that are genuinely low quality. Any thoughts on whether this is expected, and why it might happen?
Reference 72 . T . 3 LowQual DP=281;MQ=50.69;MQ0=0;NDA=1 GT . . . . . .
Reference 187 . T . 3 LowQual DP=301;MQ=51.00;MQ0=0;NDA=1 GT . . . . . .
Reference 188 . C . 3 LowQual DP=296;MQ=50.84;MQ0=0;NDA=1 GT . . . . . .
Reference 206 . A . 3 LowQual DP=292;MQ=50.14;MQ0=0;NDA=1 GT . . . . . .
Reference 1844 . T . 3 LowQual DP=369;MQ=58.59;MQ0=0;NDA=1 GT . . . . . .
Reference 1854 . C . 3 LowQual DP=363;MQ=58.63;MQ0=0;NDA=1 GT . . . . . .
Reference 1972 . A . 3 LowQual DP=345;MQ=59.11;MQ0=0;NDA=1 GT . . . . . .
Reference 1993 . T . 3 LowQual DP=355;MQ=58.54;MQ0=0;NDA=1 GT . . . . . .
Reference 2096 . C . 3 LowQual DP=355;MQ=58.92;MQ0=0;NDA=1 GT . . . . . .
Reference 2376 . T C 1105.23 . AC=1;AF=0.167;AN=6;BaseQRankSum=-10.910;DP=417;Dels=0.00;FS=27.994;HaplotypeScore=1.6883;MLEAC=1;MLEAF=0.167;MQ=58.90;MQ0=0;MQRankSum=0.195;NDA=1;QD=16.75;ReadPosRankSum=-5.021;SB=-4.370e+02;Samples=Ba-4599_4 GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 0:43,0:43:99:0:0.00:0,1813 1:0,66:66:99:1:1.00:1143,0 0:38,0:38:99:0:0.00:0,1627 0:127,0:127:99:0:0.00:0,5248 0:42,0:42:99:0:0.00:0,1739 0:100,0:100:99:0:0.00:0,4166
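As a possible workaround while the filter behaviour is unclear, the LowQual records could be split after the fact into monomorphic reference sites (ALT is ".") and sites that carry an alternate allele. This is a rough sketch only, assuming tab-delimited VCF text with the standard column order; the function name is invented for this example and nothing here changes how GATK assigns the filter.

```python
def split_lowqual(vcf_lines):
    """Separate LowQual records into reference-only and variant sites.

    Returns two lists of (CHROM, POS): sites whose ALT column is ".",
    and sites that list at least one alternate allele.
    """
    ref_sites, variant_sites = [], []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        alt, filters = fields[4], fields[6].split(";")
        if "LowQual" not in filters:
            continue  # only LowQual records are of interest here
        if alt == ".":
            ref_sites.append((chrom, pos))
        else:
            variant_sites.append((chrom, pos))
    return ref_sites, variant_sites
```

Run on the records above, the nine monomorphic sites would end up in the first list, and the call at position 2376 would be skipped because its FILTER column is ".".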
I am using UnifiedGenotyper to call variants from multiple samples, with the EMIT_ALL_CONFIDENT_SITES output mode. The output VCF file occasionally has two entries for one position. It is always a monomorphic site, and the depth of the two entries is quite different. Usually one entry has very high depth, and when I go back to the original BAM file, the depth does not match. Any idea what I am missing here?
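To list every position that appears more than once, along with the INFO field of each copy so the DP values can be compared side by side, something like the following could be used. It is a minimal sketch with a made-up function name, assuming tab-delimited VCF text; it only reports the duplicates and does not explain where they come from.

```python
from collections import defaultdict


def duplicated_positions(vcf_lines):
    """Map each (CHROM, POS) seen more than once to its INFO strings."""
    info_by_site = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        site = (fields[0], int(fields[1]))
        info_by_site[site].append(fields[7])  # column 8 is INFO
    # Keep only the sites that occur in two or more records.
    return {site: infos for site, infos in info_by_site.items()
            if len(infos) > 1}
```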
When I use UnifiedGenotyper with --genotype_likelihoods_model SNP --output_mode EMIT_ALL_CONFIDENT_SITES I get the reference SNP homozygote calls (or ./. if insufficient depth/quality etc). Great!
But when I use UnifiedGenotyper with --genotype_likelihoods_model INDEL --output_mode EMIT_ALL_CONFIDENT_SITES I only get non-reference calls; everything else (i.e. reference homozygotes, and anything uncallable) is ./.
I want to be able to select variants (SNPs and INDELs) on call rate across samples, as one would do for array genotype data, and so avoid case-control bias due to differential missingness.
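For reference, the kind of per-site call-rate filter I have in mind looks roughly like this. It is a sketch, not a GATK feature: the function names and the 0.95 threshold are placeholders, it assumes tab-delimited VCF text, and it counts a genotype as called only when none of its alleles is ".". It can only be meaningful for indels if reference homozygotes are actually emitted rather than reported as ./.

```python
def call_rate(record_line):
    """Fraction of samples with a fully called genotype at one site."""
    fields = record_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # column 9 is FORMAT
    samples = fields[9:]
    if not samples:
        raise ValueError("record has no sample columns")
    called = 0
    for sample in samples:
        genotype = sample.split(":")[gt_index]
        alleles = genotype.replace("|", "/").split("/")
        # "./.", "." and half-calls such as "./1" count as missing.
        if all(allele != "." for allele in alleles):
            called += 1
    return called / len(samples)


def sites_passing_call_rate(vcf_lines, min_rate=0.95):
    """Yield record lines whose call rate is at least min_rate."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        if call_rate(line) >= min_rate:
            yield line
```

Computing the rate separately for cases and controls would be the next step for checking differential missingness.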
david van heel