Tagged with #read counts
0 documentation articles | 0 announcements | 2 forum discussions

No articles to display.

No articles to display.

Created 2013-05-06 05:26:08 | Updated | Tags: baserecalibrator

Comments (1)

following on from Pallavi's question "I Don't understand what is filtering out during traversal ", can you explain why the number of processed reads ("INFO 02:09:56,512 BaseRecalibrator - Processed: 1059663038 reads" from Pallavi's example) differs so much from the total number of reads ("INFO 02:09:56,514 MicroScheduler - 25351 reads were filtered out during traversal out of 45356 total"). Do these numbers relate to different sets of reads?

Created 2012-11-28 11:47:29 | Updated | Tags: indelrealigner picard microscheduler baserecalibration createtarget

Comments (2)

Hi all,

I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd. I have paired end reads, my protocol until now is (in brief, omitting options etc.)

bwa aln R1.fastq bwa aln R2.fastq bwa sampe R1.sai R2.sai picard/CleanSam.jar picard/SortSam.jar picard/MarkDuplicates.jar picard/AddOrReplaceReadGroups.jar picard/BuildBamIndex.jar GATK -T RealignerTargetCreator -known dbsnp.vcf GATK -T IndelRealigner -known dbsnp.vcf GATK -T BaseRecalibrator -knownSites dbsnp.vcf GATK -T PrintReads

A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

I have 85767226 paired end = 171534452 sequences in fastQ file

BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.

MarkDuplicates reports:

Read 165619516 records. 2 pairs never matched. Marking 20272927 records as duplicates. Found 2919670 optical duplicate clusters.

so nearly 6 million reads seem to miss.

CreateTargets MicroScheduler reports

35915555 reads were filtered out during traversal out of 166579875 total (21.56%) -> 428072 reads (0.26% of total) failing BadMateFilter -> 16077607 reads (9.65% of total) failing DuplicateReadFilter -> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

so nearly 5 million reads seem to miss

The Realigner MicroScheduler reports

0 reads were filtered out during traversal out of 171551640 total (0.00%)

which appears a miracle to me since 1) there are even more reads now than input sequences, 2) all those crappy reads reported by CreateTargets do not appear.

From Base recalibration MicroScheduler, I get

41397379 reads were filtered out during traversal out of 171703265 total (24.11%) -> 16010068 reads (9.32% of total) failing DuplicateReadFilter -> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.

I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?

Thanks for any comments!