I have attached two VCF files generated with samtools (pass.vcf and fail.vcf). One of them (fail.vcf) contains this line:
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
When I run LeftAlignAndTrimVariants (version 3.2) on the v4.2 VCF file containing the INFO line above, I get this error:
##### ERROR MESSAGE: For input string: "R"
The line is perfectly valid according to the VCF 4.2 (and 4.3) specifications:
"The Number entry is an Integer that describes the number of values that can be included with the INFO field." "If the field has one value for each possible allele (including the reference), then this value should be ‘R’."
It's an easy issue to work around, but it would be great if you could eventually fix this low-priority bug. Thanks!
I couldn't attach the two small VCF files directly ("Uploaded file type is not allowed."), but zip files are allowed, so the files are attached as a zip.
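To illustrate the point, here is a minimal Python sketch of header parsing that accepts the symbolic Number codes as well as integers. It is not GATK's code; the function names `parse_number`, `parse_info_header` and `expected_value_count` are hypothetical and only exist for this example. The error text `For input string: "R"` is what Java reports when a string is parsed as an integer, so the sketch assumes the tool treats Number as integer-only.

```python
import re

# Symbolic Number codes named in the VCF 4.2 spec:
# A = one value per alternate allele, R = one per allele including the reference,
# G = one per possible genotype, . = unknown or unbounded.
SYMBOLIC_NUMBERS = {"A", "R", "G", "."}


def parse_number(value):
    """Return an int for a fixed count, or the symbolic code unchanged."""
    if value in SYMBOLIC_NUMBERS:
        return value
    return int(value)  # raises ValueError for anything else


def parse_info_header(line):
    """Parse a ##INFO=<...> header line into a dict (hypothetical helper)."""
    prefix = "##INFO=<"
    if not (line.startswith(prefix) and line.endswith(">")):
        raise ValueError("not an INFO header line")
    body = line[len(prefix):-1]
    fields = {}
    # A value is either a quoted string or a run of characters up to the next comma.
    for match in re.finditer(r'(\w+)=("[^"]*"|[^,]*)', body):
        fields[match.group(1)] = match.group(2).strip('"')
    fields["Number"] = parse_number(fields["Number"])
    return fields


def expected_value_count(number, n_alt_alleles):
    """Number of values expected at a site, or None if it is not fixed here."""
    if isinstance(number, int):
        return number
    if number == "A":
        return n_alt_alleles
    if number == "R":
        return n_alt_alleles + 1  # reference allele plus every alternate
    return None  # "G" depends on ploidy, "." is unbounded


header = parse_info_header(
    '##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">'
)
```

With one alternate allele, `expected_value_count(header["Number"], 1)` gives 2, i.e. one QS value for the reference and one for the alternate.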
I just noticed something odd about GATK read counts. Using a tiny test data set, I generated a BAM file with marked duplicates.
This is the output for samtools flagstat:
40000 + 0 in total (QC-passed reads + QC-failed reads)
63 + 0 duplicates
38615 + 0 mapped (96.54%:-nan%)
40000 + 0 paired in sequencing
20000 + 0 read1
20000 + 0 read2
37764 + 0 properly paired (94.41%:-nan%)
38284 + 0 with itself and mate mapped
331 + 0 singletons (0.83%:-nan%)
76 + 0 with mate mapped to a different chr
54 + 0 with mate mapped to a different chr (mapQ>=5)
This is what I get as part of GATK info stats when running RealignerTargetCreator:
INFO 14:42:05,815 MicroScheduler - 5175 reads were filtered out during traversal out of 276045 total (1.87%)
INFO 14:42:05,816 MicroScheduler - -> 84 reads (0.03% of total) failing BadMateFilter
INFO 14:42:05,816 MicroScheduler - -> 1014 reads (0.37% of total) failing DuplicateReadFilter
INFO 14:42:05,816 MicroScheduler - -> 4077 reads (1.48% of total) failing MappingQualityZeroFilter
This is what I get as part of GATK info stats when running DepthOfCoverage (on the original BAM, not after realignment):
INFO 15:03:17,818 MicroScheduler - 2820 reads were filtered out during traversal out of 309863 total (0.91%)
INFO 15:03:17,818 MicroScheduler - -> 1205 reads (0.39% of total) failing DuplicateReadFilter
INFO 15:03:17,818 MicroScheduler - -> 1615 reads (0.52% of total) failing UnmappedReadFilter
Why are all of these so different? Why do the GATK stats report so many more total reads and duplicate reads than samtools flagstat?
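For anyone who wants to compare these numbers side by side, here is a small Python sketch that pulls the counts out of the MicroScheduler lines and checks that the per-filter counts add up to the reported filtered total. It does not explain why the totals exceed the 40000 reads that flagstat reports; it only rules out an arithmetic slip in the logs. The function name `summarize_filter_log` and the regular expressions are my own and assume the log format shown above.

```python
import re

# Patterns matched against the log text quoted in this post (assumed format).
SUMMARY = re.compile(
    r"(\d+) reads were filtered out during traversal out of (\d+) total"
)
PER_FILTER = re.compile(r"-> (\d+) reads \([\d.]+% of total\) failing (\w+)")


def summarize_filter_log(log_text):
    """Extract totals and per-filter counts from MicroScheduler output."""
    match = SUMMARY.search(log_text)
    if match is None:
        raise ValueError("no traversal summary line found")
    filtered, total = int(match.group(1)), int(match.group(2))
    per_filter = {name: int(count) for count, name in PER_FILTER.findall(log_text)}
    return {
        "total": total,
        "filtered": filtered,
        "per_filter": per_filter,
        # True when the individual filters account for every filtered read.
        "consistent": sum(per_filter.values()) == filtered,
    }


realigner_log = """\
INFO 14:42:05,815 MicroScheduler - 5175 reads were filtered out during traversal out of 276045 total (1.87%)
INFO 14:42:05,816 MicroScheduler - -> 84 reads (0.03% of total) failing BadMateFilter
INFO 14:42:05,816 MicroScheduler - -> 1014 reads (0.37% of total) failing DuplicateReadFilter
INFO 14:42:05,816 MicroScheduler - -> 4077 reads (1.48% of total) failing MappingQualityZeroFilter
"""

depth_log = """\
INFO 15:03:17,818 MicroScheduler - 2820 reads were filtered out during traversal out of 309863 total (0.91%)
INFO 15:03:17,818 MicroScheduler - -> 1205 reads (0.39% of total) failing DuplicateReadFilter
INFO 15:03:17,818 MicroScheduler - -> 1615 reads (0.52% of total) failing UnmappedReadFilter
"""

realigner_stats = summarize_filter_log(realigner_log)
depth_stats = summarize_filter_log(depth_log)
```

Both logs are internally consistent (84 + 1014 + 4077 = 5175 and 1205 + 1615 = 2820), so the discrepancy lies in what each tool counts as a read, not in the sums.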
I have used GATK for multi-sample SNP calling. For a few of the variants, the total depth (DP) at the variant site does not match the sum of the individual sample depths. For example, for the variant site shown below, DP=24 in the INFO column, whereas summing the depths from the FORMAT column gives 3+3+8+2=16.
chr1 3015369 . C A 39.22 AC=4;AF=1.00;AN=4;DP=24;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=4;MLEAF=1.00;MQ=6.96;MQ0=21;QD=2.45;SB=-2.321e+01 GT:AD:DP:GQ:PL ./. 1/1:3,3:6:6:48,6,0 1/1:8,2:10:3:25,3,0
Could someone explain the difference in depths?
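To make the comparison reproducible, here is a short Python sketch that recomputes both numbers from the record above. The INFO string, FORMAT string and sample columns are copied from the post; the helper names `info_depth` and `sample_depth_totals` are hypothetical. As far as I understand, the INFO DP counts reads before the caller's filtering while the per-sample DP counts only reads that passed it, and the record also reports MQ0=21; whether that accounts for the whole gap at this site is an assumption, not something the sketch verifies.

```python
def info_depth(info):
    """Return the integer DP value from a semicolon-separated INFO string."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP":
            return int(value)
    raise KeyError("DP not present in INFO")


def sample_depth_totals(format_keys, samples):
    """Sum AD and DP over all samples, skipping uncalled samples such as './.'."""
    keys = format_keys.split(":")
    ad_total = 0
    dp_total = 0
    for sample in samples:
        # zip stops at the shorter sequence, so './.' yields only the GT key.
        values = dict(zip(keys, sample.split(":")))
        if values.get("AD", ".") != ".":
            ad_total += sum(int(x) for x in values["AD"].split(","))
        if values.get("DP", ".") != ".":
            dp_total += int(values["DP"])
    return ad_total, dp_total


# Fields copied from the record in this post.
INFO = (
    "AC=4;AF=1.00;AN=4;DP=24;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;"
    "MLEAC=4;MLEAF=1.00;MQ=6.96;MQ0=21;QD=2.45;SB=-2.321e+01"
)
FORMAT = "GT:AD:DP:GQ:PL"
SAMPLES = ["./.", "1/1:3,3:6:6:48,6,0", "1/1:8,2:10:3:25,3,0"]

site_dp = info_depth(INFO)
ad_sum, dp_sum = sample_depth_totals(FORMAT, SAMPLES)
unaccounted = site_dp - dp_sum
```

Here the AD values and the per-sample DP values both sum to 16, so the 8 reads that separate them from the INFO DP of 24 are not attributed to any sample in the FORMAT fields.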