I was wondering whether there is an option to remove duplicate reads when coverage is computed from a BAM file with DepthOfCoverage. If not, is there an alternative way to remove the duplicate reads?
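For reference, duplicate marking is stored in the FLAG field of each alignment record as bit 0x400 (per the SAM specification), so duplicates can in principle be dropped before any coverage calculation. The following is only a minimal sketch in plain Python working on SAM text lines; the function names and the three example records are hypothetical and are not taken from the data set discussed here.

```python
# Minimal sketch: drop records whose duplicate bit is set.
# Assumption: input is SAM text (tab-separated), not binary BAM.

FLAG_DUPLICATE = 0x400  # "PCR or optical duplicate" bit in the SAM specification


def is_duplicate(sam_line: str) -> bool:
    """Return True if the record's FLAG field (column 2) has bit 0x400 set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & FLAG_DUPLICATE)


def drop_duplicates(sam_lines):
    """Yield header lines (starting with '@') and all non-duplicate records."""
    for line in sam_lines:
        if line.startswith("@") or not is_duplicate(line):
            yield line


# Hypothetical example records: flag 99 is a normal paired read,
# flag 1123 is the same flag with the duplicate bit added (99 + 1024).
records = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
]

kept = list(drop_duplicates(records))
print(len(kept))  # header plus the one non-duplicate record
```

A filter like this could be fed with the text output of `samtools view -h`, assuming duplicates were marked beforehand.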
I just noticed something odd about GATK read counts. Using a tiny test data set, I generated a BAM file with marked duplicates.
This is the output for samtools flagstat:
40000 + 0 in total (QC-passed reads + QC-failed reads)
63 + 0 duplicates
38615 + 0 mapped (96.54%:-nan%)
40000 + 0 paired in sequencing
20000 + 0 read1
20000 + 0 read2
37764 + 0 properly paired (94.41%:-nan%)
38284 + 0 with itself and mate mapped
331 + 0 singletons (0.83%:-nan%)
76 + 0 with mate mapped to a different chr
54 + 0 with mate mapped to a different chr (mapQ>=5)
This is what I get as part of GATK info stats when running RealignerTargetCreator:
INFO 14:42:05,815 MicroScheduler - 5175 reads were filtered out during traversal out of 276045 total (1.87%)
INFO 14:42:05,816 MicroScheduler - -> 84 reads (0.03% of total) failing BadMateFilter
INFO 14:42:05,816 MicroScheduler - -> 1014 reads (0.37% of total) failing DuplicateReadFilter
INFO 14:42:05,816 MicroScheduler - -> 4077 reads (1.48% of total) failing MappingQualityZeroFilter
This is what I get as part of GATK info stats when running DepthOfCoverage (on the original BAM, not after realignment):
INFO 15:03:17,818 MicroScheduler - 2820 reads were filtered out during traversal out of 309863 total (0.91%)
INFO 15:03:17,818 MicroScheduler - -> 1205 reads (0.39% of total) failing DuplicateReadFilter
INFO 15:03:17,818 MicroScheduler - -> 1615 reads (0.52% of total) failing UnmappedReadFilter
Why are all of these so different? Why do the GATK stats report so many more total reads and duplicate reads than samtools flagstat?
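To make the size of the discrepancy concrete, here is a small Python sketch that only redoes the arithmetic on the numbers quoted above. The dictionary layout and names are my own; nothing here queries samtools or GATK, and it does not explain the cause of the difference.

```python
# Duplicate and total read counts copied from the outputs quoted above.
counts = {
    "samtools flagstat": (63, 40000),
    "RealignerTargetCreator": (1014, 276045),
    "DepthOfCoverage": (1205, 309863),
}

flagstat_total = counts["samtools flagstat"][1]

fractions = {}     # duplicate fraction per tool
total_ratios = {}  # reported total relative to the flagstat total

for tool, (duplicates, total) in counts.items():
    fractions[tool] = duplicates / total
    total_ratios[tool] = total / flagstat_total
    print(
        f"{tool}: {duplicates} / {total} duplicates "
        f"({fractions[tool]:.2%}), total is {total_ratios[tool]:.2f}x flagstat"
    )
```

So the GATK totals are roughly seven to eight times the 40000 reads that flagstat reports, and the duplicate fraction is more than twice as high (about 0.37% and 0.39% versus about 0.16%).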