Tagged with #microscheduler
0 documentation articles | 0 announcements | 2 forum discussions


Comments (23)

Hi guys,

I have Googled my problem, with no luck, so I am asking you directly.

I am currently testing an established pipeline on BAMs from a new source, so I am advancing step by step, and the last step, the variant calling with UnifiedGenotyper (UG), seems to be having trouble.

My BAM went through, in order: Picard AddOrReplaceReadGroups, Picard MarkDuplicates, GATK RealignerTargetCreator, GATK IndelRealigner, Picard FixMateInformation, GATK BaseRecalibrator, and GATK PrintReads.
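For reference, the chain was roughly of this form (a minimal sketch, not my exact invocations: file names, read-group fields and resource paths are placeholders, and most options are omitted):

java -jar AddOrReplaceReadGroups.jar INPUT=input.bam OUTPUT=rg.bam RGLB=lib RGPL=illumina RGPU=unit RGSM=sample
java -jar MarkDuplicates.jar INPUT=rg.bam OUTPUT=marked.bam METRICS_FILE=dup_metrics.txt
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I marked.bam -o targets.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I marked.bam -targetIntervals targets.intervals -o realigned.bam
java -jar FixMateInformation.jar INPUT=realigned.bam OUTPUT=fixed.bam
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I fixed.bam -knownSites dbsnp.vcf -o recal.grp
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I fixed.bam -BQSR recal.grp -o recal.bam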

When I get to UG, this is my (stuck) output (I have removed file names for privacy):

INFO 14:54:29,692 ArgumentTypeDescriptor - Dynamically determined type of /scratch/appli57_local_duplicates/reference/exome_target_intervals.bed to be BED
INFO 14:54:29,748 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:54:29,748 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.1-11-g13c0244, Compiled 2012/09/29 06:03:05
INFO 14:54:29,749 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 14:54:29,749 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 14:54:29,750 HelpFormatter - Program Args: -T UnifiedGenotyper -nt 6 -R /scratch/appli57_local_duplicates/reference/Homo_sapiens_assembly19.fasta -I /scratch/user/FILE.marked.realigned.fixed.recal.bam --dbsnp /scratch/appli57_local_duplicates/dbsnp/dbsnp_132.b37.vcf -L /scratch/appli57_local_duplicates/reference/exome_target_intervals.bed --metrics_file /scratch/user/FILE.snps.metrics -o /scratch/user/FILE.vcf
INFO 14:54:29,750 HelpFormatter - Date/Time: 2013/03/20 14:54:29
INFO 14:54:29,750 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:54:29,751 HelpFormatter - ---------------------------------------------------------------------------------
INFO 14:54:29,783 ArgumentTypeDescriptor - Dynamically determined type of /scratch/appli57_local_duplicates/dbsnp/dbsnp_132.b37.vcf to be VCF
INFO 14:54:29,799 GenomeAnalysisEngine - Strictness is SILENT
INFO 14:54:29,906 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 14:54:29,943 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
INFO 14:54:29,959 RMDTrackBuilder - Loading Tribble index from disk for file /scratch/appli57_local_duplicates/dbsnp/dbsnp_132.b37.vcf
WARN 14:54:30,190 VCFStandardHeaderLines$Standards - Repairing standard header line for field AF because -- count types disagree; header has UNBOUNDED but standard is A -- descriptions disagree; header has 'Allele Frequency' but standard is 'Allele Frequency, for each ALT allele, in the same order as listed'
INFO 14:54:32,484 MicroScheduler - Running the GATK in parallel mode with 6 concurrent threads

And it does not move from there. In my destination folder, a bamschedule.*.tmp file appears every 5 minutes or so, and in top, the program seems to be running:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29468 valleem 19 0 20.6g 3.9g 10m S 111.5 4.2 28:45.46 java

Can you help me?

Comments (2)

Hi all,

I am doing an exome analysis with BWA 0.6.1-r104, Picard 1.79 and GATK v2.2-8-gec077cd. I have paired-end reads; my protocol so far is (in brief, omitting options etc.):

bwa aln R1.fastq
bwa aln R2.fastq
bwa sampe R1.sai R2.sai
picard/CleanSam.jar
picard/SortSam.jar
picard/MarkDuplicates.jar
picard/AddOrReplaceReadGroups.jar
picard/BuildBamIndex.jar
GATK -T RealignerTargetCreator -known dbsnp.vcf
GATK -T IndelRealigner -known dbsnp.vcf
GATK -T BaseRecalibrator -knownSites dbsnp.vcf
GATK -T PrintReads
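Written out a bit more fully, the chain is of this general shape (a sketch with placeholder file names and minimal options, not my exact commands):

bwa aln ref.fasta R1.fastq > R1.sai
bwa aln ref.fasta R2.fastq > R2.sai
bwa sampe ref.fasta R1.sai R2.sai R1.fastq R2.fastq > aln.sam
java -jar CleanSam.jar INPUT=aln.sam OUTPUT=clean.sam
java -jar SortSam.jar INPUT=clean.sam OUTPUT=sorted.bam SORT_ORDER=coordinate
java -jar MarkDuplicates.jar INPUT=sorted.bam OUTPUT=marked.bam METRICS_FILE=dup_metrics.txt
java -jar AddOrReplaceReadGroups.jar INPUT=marked.bam OUTPUT=rg.bam RGLB=lib RGPL=illumina RGPU=unit RGSM=sample
java -jar BuildBamIndex.jar INPUT=rg.bam
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I rg.bam -known dbsnp.vcf -o targets.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I rg.bam -known dbsnp.vcf -targetIntervals targets.intervals -o realigned.bam
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I realigned.bam -knownSites dbsnp.vcf -o recal.grp
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I realigned.bam -BQSR recal.grp -o recal.bam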

A closer look at the output of the above toolchain revealed changes in read counts that I did not quite understand.

I have 85767226 read pairs = 171534452 sequences in the FASTQ files.

BWA reports this number, and the cleaned SAM file has 171534452 alignments, as expected.

MarkDuplicates reports:

Read 165619516 records. 2 pairs never matched. Marking 20272927 records as duplicates. Found 2919670 optical duplicate clusters.

so nearly 6 million reads (171534452 - 165619516 = 5914936) seem to be missing.

The RealignerTargetCreator (CreateTargets) MicroScheduler reports:

35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
  -> 428072 reads (0.26% of total) failing BadMateFilter
  -> 16077607 reads (9.65% of total) failing DuplicateReadFilter
  -> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

so nearly 5 million reads (171534452 - 166579875 = 4954577) seem to be missing.

The IndelRealigner MicroScheduler reports:

0 reads were filtered out during traversal out of 171551640 total (0.00%)

which seems miraculous to me, since 1) there are now even more reads than input sequences, and 2) none of those problematic reads reported by RealignerTargetCreator show up here.

From the BaseRecalibrator MicroScheduler, I get:

41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
  -> 16010068 reads (9.32% of total) failing DuplicateReadFilter
  -> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

... so my reads have multiplied even further, yet, for example, the duplicate reads reappear in roughly the same numbers.

I find these varying counts a little confusing -- can someone please give me a hint about the logic behind these numbers? And does the protocol look sensible?

Thanks for any comments!