Tagged with #duplicatereadfilter
1 documentation article | 0 announcements | 4 forum discussions


Comments (1)

Hello GATK team,

BaseRecalibrator applies the filters DuplicateReadFilter and MappingQualityZeroFilter. I've noticed that in the BAM produced by PrintReads, most of those reads are indeed filtered out, but a few of them are left: about 2% of the reads that were marked as duplicates by Picard, and 4% of the reads with mapping quality zero.

What exactly happens when a tool applies a filter?

Maya
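
To illustrate the general mechanics, here is a minimal sketch of how a GATK-style read filter chain behaves during traversal. This is illustrative only, not GATK's actual implementation: filters are predicates evaluated per read, and a read failing any filter is skipped for that analysis rather than deleted from the input file. The filter names and the assumption that a read is charged to the first filter it fails are modeled on the log output, not on GATK source code.

```python
# Illustrative sketch of a read-filter chain (NOT the real GATK code).
# Reads are modeled as dicts with a SAM flag and a mapping quality.

DUPLICATE_FLAG = 0x400  # SAM flag bit set by Picard MarkDuplicates

def duplicate_read_filter(read):
    """Fail reads marked as PCR/optical duplicates."""
    return read["flag"] & DUPLICATE_FLAG == 0

def mapping_quality_zero_filter(read):
    """Fail reads with mapping quality zero."""
    return read["mapq"] > 0

FILTERS = [duplicate_read_filter, mapping_quality_zero_filter]

def traverse(reads):
    """Yield only reads that pass every filter, counting failures."""
    failed = {f.__name__: 0 for f in FILTERS}
    for read in reads:
        ok = True
        for f in FILTERS:
            if not f(read):
                failed[f.__name__] += 1  # assumption: charge first failing filter
                ok = False
                break
        if ok:
            yield read

reads = [
    {"flag": 0x0,   "mapq": 60},  # passes both filters
    {"flag": 0x400, "mapq": 60},  # duplicate -> filtered
    {"flag": 0x0,   "mapq": 0},   # MAPQ 0 -> filtered
]
passed = list(traverse(reads))
print(len(passed))  # 1
```

The key point for the question above: filtering happens on the fly during traversal, so the filter counts describe what a tool ignored, not what was physically removed from the BAM.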

Comments (2)

I have a quick question: what is the difference between DuplicateReadFilter and NotPrimaryAlignmentFilter? The documentation for each of them is identical, i.e.

Filter out duplicate reads.
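
Although the doc strings read the same, the two filters test different SAM flag bits: the duplicate bit (0x400, set by tools like Picard MarkDuplicates) versus the secondary-alignment bit (0x100). A quick sketch using the standard SAM flag values from the SAM specification:

```python
# SAM flag bits, per the SAM specification:
SECONDARY_ALIGNMENT = 0x100  # read is a secondary (not primary) alignment
DUPLICATE = 0x400            # read is a PCR or optical duplicate

def fails_duplicate_read_filter(flag):
    return bool(flag & DUPLICATE)

def fails_not_primary_alignment_filter(flag):
    return bool(flag & SECONDARY_ALIGNMENT)

# A duplicate read (flag 1024) fails only the duplicate filter;
# a secondary alignment (flag 256) fails only the not-primary filter.
print(fails_duplicate_read_filter(1024), fails_not_primary_alignment_filter(1024))  # True False
print(fails_duplicate_read_filter(256), fails_not_primary_alignment_filter(256))    # False True
```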
Comments (8)

I just noticed something odd about GATK read counts. Using a tiny test data set, I generated a BAM file with marked duplicates.

This is the output for samtools flagstat:

40000 + 0 in total (QC-passed reads + QC-failed reads)
63 + 0 duplicates
38615 + 0 mapped (96.54%:-nan%)
40000 + 0 paired in sequencing
20000 + 0 read1
20000 + 0 read2
37764 + 0 properly paired (94.41%:-nan%)
38284 + 0 with itself and mate mapped
331 + 0 singletons (0.83%:-nan%)
76 + 0 with mate mapped to a different chr
54 + 0 with mate mapped to a different chr (mapQ>=5)

This is what I get as part of GATK info stats when running RealignerTargetCreator:

INFO  14:42:05,815 MicroScheduler - 5175 reads were filtered out during traversal out of 276045 total (1.87%) 
INFO  14:42:05,816 MicroScheduler -   -> 84 reads (0.03% of total) failing BadMateFilter 
INFO  14:42:05,816 MicroScheduler -   -> 1014 reads (0.37% of total) failing DuplicateReadFilter 
INFO  14:42:05,816 MicroScheduler -   -> 4077 reads (1.48% of total) failing MappingQualityZeroFilter 

This is what I get as part of GATK info stats when running DepthOfCoverage (on the original BAM, not after realignment):

INFO  15:03:17,818 MicroScheduler - 2820 reads were filtered out during traversal out of 309863 total (0.91%) 
INFO  15:03:17,818 MicroScheduler -   -> 1205 reads (0.39% of total) failing DuplicateReadFilter 
INFO  15:03:17,818 MicroScheduler -   -> 1615 reads (0.52% of total) failing UnmappedReadFilter 

Why are all of these so different? Why are there so many more total reads and duplicate reads in the GATK stats?
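
As background for comparing the two reports: every flagstat category is derived from SAM flag bits on each record. Below is a minimal sketch of that derivation on synthetic flag values (illustrative only; it does not reproduce the numbers above, and it does not by itself explain the differing totals, which depend on how each tool traverses the data).

```python
# Sketch: deriving samtools-flagstat-style categories from SAM flag bits.
PAIRED, PROPER, UNMAPPED, MATE_UNMAPPED = 0x1, 0x2, 0x4, 0x8
READ1, READ2, DUP = 0x40, 0x80, 0x400

def flagstat(flags):
    """Compute flagstat-like counts from a list of SAM flag integers."""
    return {
        "total": len(flags),
        "duplicates": sum(1 for f in flags if f & DUP),
        "mapped": sum(1 for f in flags if not f & UNMAPPED),
        "paired": sum(1 for f in flags if f & PAIRED),
        "read1": sum(1 for f in flags if f & READ1),
        "read2": sum(1 for f in flags if f & READ2),
        "properly_paired": sum(1 for f in flags if f & PROPER),
        # both this read and its mate are mapped:
        "both_mapped": sum(1 for f in flags if f & PAIRED
                           and not f & (UNMAPPED | MATE_UNMAPPED)),
        # this read is mapped but its mate is not:
        "singletons": sum(1 for f in flags if f & PAIRED
                          and not f & UNMAPPED and f & MATE_UNMAPPED),
    }

flags = [99, 147, 1123, 73]  # 73 = paired + mate unmapped + read1 -> singleton
print(flagstat(flags)["singletons"])  # 1
```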

Comments (3)

I was wondering if there is an option to remove duplicate reads when coverage is determined using DepthOfCoverage on a BAM file, or if there is an alternative way to remove the duplicate reads.

Thanks, Amin
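
Note that the DepthOfCoverage log in the previous thread already shows a DuplicateReadFilter being applied during traversal. If you instead want a BAM with duplicates physically removed beforehand, the usual approach is to drop every record whose duplicate flag bit (0x400) is set, which is what `samtools view -b -F 0x400 in.bam` does. A plain-Python sketch of the same idea, modeling records as (name, flag) pairs:

```python
# Sketch: exclude duplicate-flagged records before computing coverage.
# Equivalent in spirit to `samtools view -b -F 0x400 in.bam`;
# records here are just (name, SAM-flag) pairs.

DUPLICATE = 0x400

def drop_duplicates(records):
    """Keep only records whose duplicate flag bit is unset."""
    return [(name, flag) for name, flag in records if not flag & DUPLICATE]

records = [
    ("read1", 99),    # properly paired, not a duplicate
    ("read2", 1123),  # 1123 = 99 + 1024 -> marked duplicate
    ("read3", 147),
]
kept = drop_duplicates(records)
print([name for name, _ in kept])  # ['read1', 'read3']
```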