Following on from Pallavi's question "I don't understand what is filtering out during traversal", can you explain why the number of processed reads ("INFO 02:09:56,512 BaseRecalibrator - Processed: 1059663038 reads" in Pallavi's example) differs so much from the total number of reads ("INFO 02:09:56,514 MicroScheduler - 25351 reads were filtered out during traversal out of 45356 total")? Do these numbers refer to different sets of reads?
I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd. I have paired-end reads; my protocol so far is (in brief, omitting options etc.):
bwa aln R1.fastq
bwa aln R2.fastq
bwa sampe R1.sai R2.sai
picard/CleanSam.jar
picard/SortSam.jar
picard/MarkDuplicates.jar
picard/AddOrReplaceReadGroups.jar
picard/BuildBamIndex.jar
GATK -T RealignerTargetCreator -known dbsnp.vcf
GATK -T IndelRealigner -known dbsnp.vcf
GATK -T BaseRecalibrator -knownSites dbsnp.vcf
GATK -T PrintReads
A closer look at the output of this toolchain revealed changes in read counts that I did not quite understand.
I have 85767226 paired-end reads = 171534452 sequences in my FASTQ files.
BWA reports this number, and the cleaned SAM file has 171534452 alignments, as expected. MarkDuplicates then reports:
Read 165619516 records. 2 pairs never matched. Marking 20272927 records as duplicates. Found 2919670 optical duplicate clusters.
So nearly 6 million reads seem to be missing.
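Just to make that gap explicit (the numbers below are copied from the logs above; this is only a sanity-check calculation):

```python
# Difference between sequences in the FASTQ files and records read by MarkDuplicates
fastq_sequences = 171534452   # 85767226 pairs * 2
markdup_records = 165619516   # "Read 165619516 records." from MarkDuplicates

missing = fastq_sequences - markdup_records
print(missing)  # 5914936, i.e. almost 6 million reads unaccounted for
```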
The CreateTargets (RealignerTargetCreator) MicroScheduler reports:
35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
  -> 428072 reads (0.26% of total) failing BadMateFilter
  -> 16077607 reads (9.65% of total) failing DuplicateReadFilter
  -> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter
So nearly 5 million reads seem to be missing here, too.
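For what it's worth, the per-filter counts in that report are internally consistent; a quick check (numbers copied verbatim from the log):

```python
# Per-filter counts from the RealignerTargetCreator MicroScheduler report
bad_mate    = 428072
duplicate   = 16077607
mapq_zero   = 19409876
total_reads = 166579875

filtered = bad_mate + duplicate + mapq_zero
print(filtered)                                 # 35915555, matches the reported filtered count
print(round(100 * filtered / total_reads, 2))   # 21.56, matches the reported percentage
```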
The IndelRealigner MicroScheduler reports:
0 reads were filtered out during traversal out of 171551640 total (0.00%)
which seems like a miracle to me, since 1) there are now even more reads than input sequences, and 2) none of the crappy reads reported by CreateTargets show up.
From the BaseRecalibrator MicroScheduler, I get:
41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
  -> 16010068 reads (9.32% of total) failing DuplicateReadFilter
  -> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter
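The same consistency check works for this report as well (again, numbers copied from the log):

```python
# Per-filter counts from the BaseRecalibrator MicroScheduler report
duplicate   = 16010068
mapq_zero   = 25387311
total_reads = 171703265

filtered = duplicate + mapq_zero
print(filtered)                                 # 41397379, matches the reported filtered count
print(round(100 * filtered / total_reads, 2))   # 24.11, matches the reported percentage
```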
So my reads apparently got even more offspring, but the duplicate reads, for example, reappear in roughly the same numbers.
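To summarise the bookkeeping that confuses me, here are the totals each step reports, compared with the FASTQ input (all numbers copied from the logs above; the labels are just my shorthand for the steps):

```python
# Total read counts reported at each pipeline stage vs. the FASTQ input
fastq = 171534452  # 85767226 pairs * 2
totals = {
    "MarkDuplicates (records read)": 165619516,
    "RealignerTargetCreator":        166579875,
    "IndelRealigner":                171551640,
    "BaseRecalibrator":              171703265,
}
for step, n in totals.items():
    print(f"{step}: {n} ({n - fastq:+d} vs FASTQ)")
# The last two steps each report MORE reads than were ever sequenced.
```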
I find these varying counts a little confusing -- can someone please give me a hint about the logic behind these numbers? And does the protocol look reasonable?
Thanks for any comments!