Tagged with #dataprocessingpipeline
0 documentation articles | 0 announcements | 12 forum discussions

No posts found with the requested search criteria.
No posts found with the requested search criteria.
Comments (1)

My lab is trying to use Queue, but out pipeline is spawning very large number of jobs (as would be expected). Our lab has very spiky data processing requiernments and it would be useful to be able to limit the number of jobs queue submits at a time so other, non-queue jobs can process in parallel. Is this possible?


Comments (1)

basically, I am using DataProcessingPiepline, and interestingly, given an unaligned bam file which is 6gig, the DataProcessingPipeline sometimes does the all processes of the pipeline but produces a very small output (200K) or most of the times, doesn't produce anything. The log file is as following,

INFO 11:57:55,892 QScriptManager - Compiling 1 QScript INFO 11:58:08,656 QScriptManager - Compilation complete INFO 11:58:09,062 HelpFormatter - ---------------------------------------------------------------------- INFO 11:58:09,062 HelpFormatter - Queue v2.4-3-g2a7af43, Compiled 2013/02/27 12:20:10 INFO 11:58:09,062 HelpFormatter - Copyright (c) 2012 The Broad Institute INFO 11:58:09,063 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 11:58:09,063 HelpFormatter - Program Args: -S /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-6-gebbba25/DataProcessingPipeline.scala -bwa /broad/software/free/Linux/redhat_5_x86_64/pkgs/bwa_0.5.9/bwa -i /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/xc1094WELa.49487._new.bam -R /medpop/mpg-psrl/Parabase/hg19/ucsc.hg19.fasta -D /medpop/mpg-psrl/Parabase/hg19/dbsnp_137.hg19.vcf -p c1094WELa.49487 -outputDir /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/ -bwape -run INFO 11:58:09,063 HelpFormatter - Date/Time: 2013/03/04 11:58:09 INFO 11:58:09,063 HelpFormatter - ---------------------------------------------------------------------- INFO 11:58:09,063 HelpFormatter - ---------------------------------------------------------------------- INFO 11:58:09,071 QCommandLine - Scripting DataProcessingPipeline INFO 11:58:09,314 QCommandLine - Added 13 functions INFO 11:58:09,314 QGraph - Generating graph. INFO 11:58:09,342 QGraph - Running jobs. INFO 11:58:09,358 QGraph - 0 Pend, 0 Run, 0 Fail, 13 Done INFO 11:58:09,409 QCommandLine - Writing final jobs report... INFO 11:58:09,410 QJobsReporter - Writing JobLogging GATKReport to file /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.txt INFO 11:58:09,432 QJobsReporter - Plotting JobLogging GATKReport to file /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.pdf WARN 11:58:09,452 RScriptExecutor - Skipping: Rscript (resource)org/broadinstitute/sting/queue/util/queueJobReport.R /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.txt /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/DataProcessingPipeline.jobreport.pdf INFO 11:58:09,456 QCommandLine - Script completed successfully with 13 total jobs

and the command I am using,

java -Xmx4g -Djava.io.tmpdir=/medpop/mpg-psrl/Parabase/tmp/ -jar /medpop/mpg-psrl/Parabase/Tools/Queue-2.4-3-g2a7af43/Queue.jar -S /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-6-gebbba25/DataProcessingPipeline.scala -bwa /broad/software/free/Linux/redhat_5_x86_64/pkgs/bwa_0.5.9/bwa -i /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/c1094WELa.49487._new.bam -R /medpop/mpg-psrl/Parabase/hg19/ucsc.hg19.fasta -D /medpop/mpg-psrl/Parabase/hg19/dbsnp_137.hg19.vcf -p c1094WELa.49487 -outputDir /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/ -bwape -run > /medpop/mpg-psrl/Parabase/Projects/Seattle/c1094WELa.49487/c1094WELa.49487.gatkpipline.log

What could be wrong here, since it doesn't really complain about anything ! Thanks,

Comments (1)

I am using Queue v2.3-5, and GenomeAnalysisTK-2.3-6-gebbba25 for a couple of my Bam files, I have encountered the following error, and I was wondering what could be the reason ...

ERROR 16:14:17,964 FunctionEdge - Error: hg19/ucsc.hg19.fasta /c1193JYUa.49483._new.1.2.sai /c1193JYUa.49483._new.reverted.bam /c1193JYUa.49483._new.reverted.bam > /c1193JYUa.49483._new.1.realigned.sam 
ERROR 16:14:18,264 FunctionEdge - Contents of /medpop/mpg-psrl/Parabase/Projects/Seattle/c1193JYUa.49483._new.1.realigned.sam.out:

my command is as following

java -Xmx4g -Djava.io.tmpdir=/tmp/ -jar /Tools/Queue-2.3-5-g49ed93c/Queue.jar -S /Tools/Queue-2.3-6-gebbba25/DataProcessingPipeline.scala -bwa /broad/software/free/Linux/redhat_5_x86_64/pkgs/bwa_0.5.9/bwa -i /c1193JYUa.49483/c1193JYUa.49483._new.bam -R /hg19/ucsc.hg19.fasta -D hg19/dbsnp_137.hg19.vcf -p /c1193JYUa.49483/ -bwape -run > /c1193JYUa.49483/c1193JYUa.49483.gatkpipline.log

I have shortened the path for the sake of clarity ... thank you

Comments (6)

I am facing a "fatal error by java runtime enviormnet" after using GATK DataProcessingPipeline, my java version is

> java version "1.6.0_35"
> Java(TM) SE Runtime Environment (build 1.6.0_35-b10)
> Java HotSpot(TM) 64-Bit Server VM (build 20.10-b01, mixed mode)

and I am using GATK 2.3-6-gebbba25 The pipeline spit out the following error,

> lelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/medpop/mpg-psrl/Parabase/tmp'  '-cp' '/medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/Queue.jar'  'org.broadinstitute.sting.gatk.CommandLineGATK'  '-T' 'RealignerTargetCreator'  '-I' '/medpop/mpg-psrl/Parabase/MyRuns/EX_c1004CARa_1ln_hg19._new.1.realigned.rg.bam'  '-R' '/medpop/mpg-psrl/Parabase/hg19/ucsc.hg19.fasta'  '-o' '/medpop/mpg-psrl/Parabase/MyRuns/EX_c1004CARa_1ln_hg19.EX_c1004CARa_1ln_hg19.intervals'  '-known' '/medpop/mpg-psrl/Parabase/hg19/dbsnp_137.hg19.vcf'  '-mismatch' '0.0'  
> INFO  11:32:30,896 FunctionEdge - Output written to /medpop/mpg-psrl/Parabase/MyRuns/EX_c1004CARa_1ln_hg19.EX_c1004CARa_1ln_hg19.intervals.out 
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00002ae7a5a50380, pid=562, tid=47174397446800
> #
> # JRE version: 6.0_35-b10
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.10-b01 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x712380]  SR_handler(int, siginfo*, ucontext*)+0x30
> #
> # An error report file with more information is saved as:
> # /medpop/mpg-psrl/Parabase/MyRuns/hs_err_pid562.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #

the log file is lengthy; but If you like to have a look I can paste here it later on, thanks for your support,

Comments (3)

Is there any rule of thumb for allocating memory through "bsub" for running DataProcessingPipeline per bam file or per number of reads ?


Comments (16)

I am running the following command to test my scala using GATK-2.3.5

> java -Xmx4g -Djava.io.tmpdir=tmp -jar /Queue-2.3-5-g49ed93c/Queue.jar -S Queue-2.3-5-g49ed93c/ExampleCountReads.scala -R /GATK-2.3-5-g49ed93c/resources/exampleFASTA.fasta -I /GATK-2.3-5-g49ed93c/resources/exampleBAM.bam

and I am getting this error

> INFO  13:42:55,166 QScriptManager - Compiling 1 QScript 
> ERROR 13:42:55,348 QScriptManager - ExampleCountReads.scala:39: in XML literal: '=' expected instead of 'G' 
> ERROR 13:42:55,356 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,356 QScriptManager -                           ^ 
> ERROR 13:42:55,357 QScriptManager - ExampleCountReads.scala:39: in XML literal: ' or " delimited attribute value or '{' scala-expr '}' expected 
> ERROR 13:42:55,358 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,358 QScriptManager -                            ^ 
> ERROR 13:42:55,358 QScriptManager - ExampleCountReads.scala:39: in XML literal: whitespace expected 
> ERROR 13:42:55,359 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,360 QScriptManager -                             ^ 
> ERROR 13:42:55,360 QScriptManager - ExampleCountReads.scala:39: in XML literal: '=' expected instead of '>' 
> ERROR 13:42:55,361 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,362 QScriptManager -                                               ^ 
> ERROR 13:42:55,362 QScriptManager - ExampleCountReads.scala:39: in XML literal: ' or " delimited attribute value or '{' scala-expr '}' expected 
> ERROR 13:42:55,363 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,365 QScriptManager -                                                 ^ 
> ERROR 13:42:55,366 QScriptManager - ExampleCountReads.scala:39: in XML literal: whitespace expected 
> ERROR 13:42:55,367 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,367 QScriptManager -                                                  ^ 
> ERROR 13:42:55,367 QScriptManager - ExampleCountReads.scala:39: in XML literal: '>' expected instead of ' ' 
> ERROR 13:42:55,369 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,369 QScriptManager -                                                   ^ 
> ERROR 13:42:55,369 QScriptManager - ExampleCountReads.scala:67: in XML literal: in XML content, please use '}}' to express '}' 
> ERROR 13:42:55,369 QScriptManager -   } 
> ERROR 13:42:55,369 QScriptManager -   ^ 
> ERROR 13:42:55,370 QScriptManager - ExampleCountReads.scala:39:  I encountered a '}' where I didn't expect one, maybe this tag isn't closed <WalkerName> 
> ERROR 13:42:55,371 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,371 QScriptManager -                                                     ^ 
> ERROR 13:42:55,371 QScriptManager - ExampleCountReads.scala:68: '}' expected but eof found. 
> ERROR 13:42:55,371 QScriptManager - } 
> ERROR 13:42:55,372 QScriptManager -  ^ 
> ERROR 13:42:55,390 QScriptManager - 10 errors found 
> ##### ERROR ------------------------------------------------------------------------------------------
> ##### ERROR stack trace 
> org.broadinstitute.sting.queue.QException: Compile of /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/ExampleCountReads.scala failed with 10 errors
>   at org.broadinstitute.sting.queue.QScriptManager.loadScripts(QScriptManager.scala:46)
>   at org.broadinstitute.sting.queue.QCommandLine.org$broadinstitute$sting$queue$QCommandLine$$qScriptPluginManager(QCommandLine.scala:94)
>   at org.broadinstitute.sting.queue.QCommandLine.getArgumentSources(QCommandLine.scala:225)
>   at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:197)
>   at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
>   at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:61)
>   at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
> ##### ERROR ------------------------------------------------------------------------------------------
> ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-5-g49ed93c):
> ##### ERROR
> ##### ERROR Please visit the wiki to see if this is a known problem
> ##### ERROR If not, please post the error, with stack trace, to the GATK forum
> ##### ERROR Visit our website and forum for extensive documentation and answers to 
> ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
> ##### ERROR
> ##### ERROR MESSAGE: Compile of /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/ExampleCountReads.scala failed with 10 errors
> ##### ERROR ------------------------------------------------------------------------------------------
> INFO  13:42:55,487 QCommandLine - Shutting down jobs. Please wait... 
> 13.462u 1.029s 0:10.41 139.0% 0+0k 0+0io 1pf+0w

I am sure java, gatk and queue are installed properly and the files exist in the relevant directory. Thank you,

Comments (3)

Dear all , I am using Queue and DataProcessingPipeline.scala ( from https://github.com/broadgsa/gatk/blob/master/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala ) to process my bam file . The input is a sample level bam file which has been processed by BWA aligned , Samtools sampe and Picard merge and duplicate removed . The output bam file (~200GB/sample ) is much larger than the input bam file (~80GB/sample ) . I want to know what information was added into the bam file ? Thanks a lot .My Queue version is Queue-2.1-10 . My script :

java \ -Xmx4g \ -Djava.io.tmpdir=../tmp/ \ -jar ./Queue-2.1-10/Queue.jar \ -S ./DataProcessingPipeline.scala \ -i input.bam \ -R /db/human_g1k_v37.fasta \ -D /db//dbSnp_b137.vcf \ -run

Comments (6)

Hi there, I was trying to debug an error in the RScript generated after base recalibration, while running the DataProcessingPipeline.scala (run as it is). I get the following debug output

 Error in file(filename, "r", blocking = TRUE) : 
   cannot open the connection
 Calls: source ... eval.with.vis -> eval.with.vis -> gsa.read.gatkreport -> file
 In addition: Warning messages:
 1: In file(filename, "r", blocking = TRUE) :
   cannot open file '/SAN/scratch3/sample378_TTAGGC_L004_R1_001.fastq.pre_recal.table.recal': No such file or directory
  Execution halted

no file ending with "recal.table.recal" exists, but the file "recal.table" does exist. I couldn't find any step in the scala script where a ".recal" is added to "recal.table", nor a specific trait or class referring to the RScript itself, as I understand it's part of the walker BaseRecalibrator.

is this a small bug in the name handling, or am I doing something wrong somewhere?

thanks, Francesco

Comments (2)

I had a question that, while it might be more appropriate for a BWA or seqanswers audience, I noticed something in the GATK's "Data Processing Pipeline" under "Methods and Workflows" that made me wonder. The pipeline is described here and there's a nice flowchart as well: http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflows#41

The process describes a BAM of reads that are either not aligned or aligned by some process you don't want to use, so the first step seems to be Picard's RevertSam and then a realignment with BWA. I'm wondering why the process described by this GATK document splits it into per-lane BAM files. There doesn't seem to be any process done at the per-lane level other than BWA alignment. I have two guesses.. the first was to allow more parallelization at that step.

But my second guess is that perhaps BWA doesn't play nice with read groups when reading reads from BAM input files. If that is true, that would explain why I'm having trouble with BAM (a single sample, multiple lanes, merged into one file) -> BWA -> realigned-BAM -> GATK (either UnifiedGenotyper or RealignerTargetCreator, etc)--somewhere along the way, read groups are getting lost. So my guess is that the above-described pipeline splits it per-lane so it can manually respecify read groups all over again to BWA?

Is that other people's experience as well?

Also, the Methods and Workflows page describes a Queue script, but there's no link or anything to the actual Queue script. Anyone know where to find it?


Comments (2)

Looking at the DataProcessingPipeline script I noticed that the "outputDir" is only applied to the cohort list file, but not to the actually processed files, which seems a bit inconsistent with the docs, which say "Output path for the processed BAM files". Here is my fix for this:

diff --git a/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala b/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala
index 56f6460..6dd84b5 100755
--- a/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala
+++ b/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/DataProcessingPipeline.scala
@@ -249,8 +249,12 @@ class DataProcessingPipeline extends QScript {
     // put each sample through the pipeline
     for ((sample, bamList) <- sampleBAMFiles) {

-      // BAM files generated by the pipeline
-      val bam        = new File(qscript.projectName + "." + sample + ".bam")
+      // BAM files generated by the pipeline      
+      val bam        = if(outputDir.isEmpty()) 
+                                      new File(qscript.projectName + "." + sample + ".bam")
+                                  else
+                                      new File(outputDir + qscript.projectName + "." + sample + ".bam")
       val cleanedBam = swapExt(bam, ".bam", ".clean.bam")
       val dedupedBam = swapExt(bam, ".bam", ".clean.dedup.bam")
       val recalBam   = swapExt(bam, ".bam", ".clean.dedup.recal.bam")
@@ -292,6 +296,15 @@ class DataProcessingPipeline extends QScript {
     add(writeList(cohortList, cohortFile))

+  // Override the normal swapExt metod by adding the outputDir to the file path by default if it is defined.
+  override
+  def swapExt(file: File, oldExtension: String, newExtension: String) = {
+      if(outputDir.isEmpty())
+         new File(file.getName.stripSuffix(oldExtension) + newExtension)
+      else
+          swapExt(outputDir, file, oldExtension, newExtension);
+  }      


This of course, puts all the files in the directory specified by outputDir, but I think that this seems reasonable than putting them in the execution directory of the script.

Comments (4)

Can the DataProcessingPipeline (as used by Queue.jar) be used to process BAM files in the best recommended way ("Best: multi-sample realignment with known sites and recalibration")

Comments (7)

When I try to run the latest version of the DataProcessingPipeline I get the following error message:

ERROR MESSAGE: GATK Lite does not support all of the features of the full version: base insertion/deletion recalibration is not supported, please use the --disable_indel_quals argument

However in the recal case class this option is set to be true: this.disable_indel_quals = true

Any idea how to solve this? It seams to me this is a bug, but I cannot find the source for the PrintReads class (guessing that its not in the public code), so I can't check it myself.

Regards, Johan