Hi,
I'm having problems understanding a GATK output VCF. I have read the VCF standard, but I'm obviously missing something.
I /think/ I understand how SNPs and short indels are represented, but clearly I do not. Below is an excerpt that illustrates sites which I do not understand. I suspect it may be something to do with GATK quality filters that I'm not understanding, or something about using EMIT_ALL_SITES...
The excerpt below was generated using
GATK -l INFO -I my.bam -R my.fa -T UnifiedGenotyper -S LENIENT -nt 8 --heterozygosity 0.1 -o test.vcf --genotype_likelihoods_model BOTH --min_base_quality_score 10 --output_mode EMIT_ALL_SITES -ploidy 2
Thanks!
Darren
CH1 225 . T G 12.71 LowQual AC=1;AF=0.500;AN=2;BaseQRankSum=1.978;DP=59;Dels=0.03;FS=0.000;HaplotypeScore=10.2840;MLEAC=1;MLEAF=0.500;MQ=70.25;MQ0=8;MQRankSum=-5.349;QD=0.22;ReadPosRankSum=-3.188 GT:AD:DP:GQ:PL 0/1:41,16:55:20:20,0,1435
CH1 226 . T . 121.53 . AN=2;DP=59;MQ=70.25;MQ0=8 GT:DP 0/0:43
CH1 227 . A . 121.53 . AN=2;DP=59;MQ=70.25;MQ0=8 GT:DP 0/0:43
CH1 228 . T . 121.53 . AN=2;DP=59;MQ=70.25;MQ0=8 GT:DP 0/0:43
CH1 229 . A . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:38
CH1 230 . C . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:38
CH1 231 . T . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:36
CH1 232 . G . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:36
CH1 233 . C . 115.53 . AN=2;DP=57;MQ=69.66;MQ0=8 GT:DP 0/0:37
CH1 234 . A . 139.53 . AN=2;DP=70;MQ=59.20;MQ0=14 GT:DP 0/0:63
CH1 235 . A . 175.53 . AN=2;DP=84;MQ=51.67;MQ0=15 GT:DP 0/0:79
CH1 236 . A . 175.53 . AN=2;DP=84;MQ=51.67;MQ0=15 GT:DP 0/0:79
CH1 237 . T . 175.53 . AN=2;DP=85;MQ=51.37;MQ0=16 GT:DP 0/0:80
CH1 238 . A . 175.53 . AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP 0/0:97
CH1 238 . A AGAAAGAAAGCTTGTA 83.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=6.172;DP=102;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=46.90;MQ0=0;MQRankSum=-6.190;QD=0.05;ReadPosRankSum=-5.733 GT:AD:DP:GQ:PL 0/1:27,25:57:99:121,0,4853
CH1 239 . A . 175.53 . AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP 0/0:101
CH1 240 . T . 175.53 . AN=2;DP=102;MQ=46.90;MQ0=28 GT:DP 0/0:98
CH1 241 . A . 169.53 . AN=2;DP=108;MQ=44.14;MQ0=29 GT:DP 0/0:107
* CH1 242 . T . 169.53 . AN=2;DP=109;MQ=43.94;MQ0=29 GT:DP 0/0:103
* CH1 242 . T . 118.27 . AN=2;DP=109;MQ=43.94;MQ0=29 GT:AD:DP 0/0:27:55
* CH1 243 . C . 172.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:108
* CH1 243 . CTTTT . 118.27 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:AD:DP 0/0:27:56
CH1 244 . T . 91.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:61
CH1 245 . T . 91.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:53
CH1 246 . T . 73.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:41
CH1 247 . T . 91.53 . AN=2;DP=110;MQ=43.76;MQ0=29 GT:DP 0/0:46
CH1 248 . A . 172.53 . AN=2;DP=116;MQ=42.61;MQ0=31 GT:DP 0/0:100
CH1 249 . A . 172.53 . AN=2;DP=116;MQ=42.61;MQ0=31 GT:DP 0/0:100
CH1 250 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:101
CH1 251 . T . 169.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:96
CH1 251 . T . 118.27 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:AD:DP 0/0:27:56
CH1 252 . C . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:113
CH1 253 . C . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:110
CH1 254 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:111
CH1 255 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:111
CH1 256 . T . 172.53 . AN=2;DP=117;MQ=42.43;MQ0=32 GT:DP 0/0:111
Line 1 is a SNP
Lines 14 and 15 are an indel that I do understand
Lines 19 and 20 I do /not/ understand
Lines 21 and 22 I do /not/ understand
Dear All,
Just thought I should report a possible small bug with SelectVariant --maxIndelSize as I think it's only filtering insertions greater than the given value and not deletions. Of course using JEXL expressions I can still get the variants I'm after...
Cheers, Laurent