Tagged with #gridengine
1 documentation article | 0 announcements | 4 forum discussions


Comments (11)

1. Background

Thanks to contributions from the community, Queue contains a job runner compatible with Grid Engine 6.2u5.

As of July 2011 this is the currently known list of forked distributions of Sun's Grid Engine 6.2u5. As long as they are JDRMAA 1.0 source compatible with Grid Engine 6.2u5, the compiled Queue code should run against each of these distributions. However we have yet to receive confirmation that Queue works on any of these setups.

Our internal QScript integration tests run the same tests on both LSF 7.0.6 and a Grid Engine 6.2u5 cluster setup on older software released by Sun.

If you run into trouble, please let us know. If you would like to contribute additions or bug fixes please create a fork in our github repo where we can review and pull in the patch.

2. Running Queue with GridEngine

Try out the Hello World example with -jobRunner GridEngine.

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run

If all goes well Queue should dispatch the job to Grid Engine and wait until the status returns RunningStatus.DONE and "hello world should be echoed into the output file, possibly with other grid engine log messages.

See QFunction and Command Line Options for more info on Queue options.

3. Debugging issues with Queue and GridEngine

If you run into an error with Queue submitting jobs to GridEngine, first try submitting the HelloWorld example with -memLimit 2:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run -memLimit 2

Then try the following GridEngine qsub commands. They are based on what Queue submits via the API when running the HelloWorld.scala example with and without memory reservations and limits:

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=2048M -l h_rss=2458M echo hello world

One other thing to check is if there is a memory limit on your cluster. For example try submitting jobs with up to 16G.

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=4096M -l h_rss=4915M echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=8192M -l h_rss=9830M echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=16384M -l h_rss=19960M echo hello world

If the above tests pass and GridEngine will still not dispatch jobs submitted by Queue please report the issue to our support forum.

No posts found with the requested search criteria.
Comments (5)

Hello

A quick question about GATK and SGE clusters. I need to run a set of GATK commands on our cluster but I think this was tied with the usage of Queue (may be I am wrong)

I tried to run it through a job manager I use for all the tools I have but I have a memory issue with GATK commands

Invalid maximum heap size: -Xmx4g -jar Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit.

Using this memory size on local node works great but not in cluster mode.

My question is : can I run GATK on GE without using scala or Queue ?

Thanks

Rad

Comments (2)

Hi! I am happy to report that Queue and all the necessary tests for running GridEngine passed. The issue I am having is using a custom qscript to run a job in parallel. When I run the job on the cluster via qsub it runs in serial. Would someone be willing to look at my qsub syntax and my qscript to see if I am forgetting something?

The Qscript was a modified UnifiedGenotyper script configured to work with HaplotypeCaller: ` package org.broadinstitute.sting.queue.qscripts.examples

import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._

class Haplotyper extends QScript {
  @Input(doc="The reference file for the bam files.", shortName="R")
  var referenceFile: File = _ // _ is scala shorthand for null

  @Input(doc="Bam file to genotype.", shortName="I")
  var bamFile: File = _

  @Input(doc="Output file.", shortName="o")
  var outputFile: File = _

  trait UnifiedGenotyperArguments extends CommandLineGATK {
    this.reference_sequence = qscript.referenceFile
    this.intervals = if (qscript.intervals == null) Nil else List(qscript.intervals)
    this.memoryLimit = 2
  }
  def script() {
   val genotyper = new HaplotypeCaller with UnifiedGenotyperArguments

  genotyper.scatterCount = 12
  genotyper.input_file :+= qscript.bamFile
  genotyper.out = swapExt(outputFile, qscript.bamFile, "bam", "vcf")

  add(genotyper) 
    }
}`

and my Queue syntax was: java -Djava.io.tmpdir=tmp -jar /location/of/queue/Queue.jar -S scripts/qscalascripts/haplotyper.scala -R human_g1k_v37 -I /source/input_file -o /destination/output/file -l debug -jobRunner GridEngine -run

When I use the above, the Queue script breaks up my job into 12 discrete pieces, but runs it all on one node on the cluster. Any pointers is most welcome.

Comments (10)

Good morning team!

First, I have to qualify my question with that I'm a unix sysadmin- trying to get the "queue" functionality implemented in our cluster so our analysts can play. I'm hoping my question is simple, here goes:

We have SGE, and I have downloaded the binary "queue" package.

My first attempt at executing the "hello world" example came up with this error:

kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run INFO 11:04:28,560 QScriptManager - Compiling 1 QScript INFO 11:04:31,265 QScriptManager - Compilation complete INFO 11:04:31,340 HelpFormatter - ---------------------------------------------------------------------- INFO 11:04:31,340 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04 INFO 11:04:31,340 HelpFormatter - Copyright (c) 2012 The Broad Institute INFO 11:04:31,340 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 11:04:31,341 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run INFO 11:04:31,341 HelpFormatter - Date/Time: 2013/06/05 11:04:31 INFO 11:04:31,341 HelpFormatter - ---------------------------------------------------------------------- INFO 11:04:31,341 HelpFormatter - ---------------------------------------------------------------------- INFO 11:04:31,346 QCommandLine - Scripting HelloWorld INFO 11:04:31,363 QCommandLine - Added 1 functions INFO 11:04:31,364 QGraph - Generating graph. INFO 11:04:31,373 QGraph - Running jobs. ERROR 11:04:31,427 QGraph - Uncaught error running jobs. java.lang.UnsatisfiedLinkError: Unable to load library 'drmaa': libdrmaa.so: cannot open shared object file: No such file or directory

ooops! Seems I can't find the drmaa library by default. So, I fixed that by adding the following directory to the library search path on the node: /gridware/sge/lib/lx-amd64 (which is where that library lives).

Success! Sort of. The error above is resolved, but I am now getting the error below, and this is where I'm stuck. It doesn't look like the job is actually getting submitted, OR, it's getting submitted and dies. I would really appreciate any insight the team can offer, we are very excited to try to get this environment to work, thank you in advance!

kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run INFO 11:07:52,728 QScriptManager - Compiling 1 QScript INFO 11:07:55,208 QScriptManager - Compilation complete INFO 11:07:55,271 HelpFormatter - ---------------------------------------------------------------------- INFO 11:07:55,271 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04 INFO 11:07:55,271 HelpFormatter - Copyright (c) 2012 The Broad Institute INFO 11:07:55,271 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 11:07:55,272 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run INFO 11:07:55,272 HelpFormatter - Date/Time: 2013/06/05 11:07:55 INFO 11:07:55,272 HelpFormatter - ---------------------------------------------------------------------- INFO 11:07:55,272 HelpFormatter - ---------------------------------------------------------------------- INFO 11:07:55,276 QCommandLine - Scripting HelloWorld INFO 11:07:55,292 QCommandLine - Added 1 functions INFO 11:07:55,292 QGraph - Generating graph. INFO 11:07:55,298 QGraph - Running jobs. INFO 11:07:55,481 FunctionEdge - Starting: echo hello world INFO 11:07:55,482 FunctionEdge - Output written to /shared/users/kcb/HelloWorld-1.out ERROR 11:07:55,507 Retry - Caught error during attempt 1 of 4. org.ggf.drmaa.InternalException: Error reading answer list from qmaster at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:400) at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392) at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.runJob(JnaSession.java:79) at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply$mcV$sp(DrmaaJobRunner.scala:87) at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85) at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85) at org.broadinstitute.sting.queue.util.Retry$.attempt(Retry.scala:49) at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.liftedTree1$1(DrmaaJobRunner.scala:85) at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.start(DrmaaJobRunner.scala:84) at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84) at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434) at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156) at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62) at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala) ERROR 11:07:55,510 Retry - Retrying in 1.0 minute.

Comments (5)

Hi,

I am trying to run the GATK variant detection pipeline on 112 stickleback samples. I am using a GridEngine queue to parallelize this across our different machines. I have previously run the same code on a subset of the samples (55) and it worked fine. However, when I have tried to run on the full 112, I have run into some strange errors. In particular, things like:

commlib returns can't find connection

WARN  13:58:57,655 DrmaaJobRunner - Unable to determine status of job id 4970049 
org.ggf.drmaa.DrmCommunicationException: failed receiving gdi request response for mid=19906 (can't find connection).
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:391)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:381)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.getJobProgramStatus(JnaSession.java:155)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.liftedTree2$1(DrmaaJobRunner.scala:101)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.updateJobStatus(DrmaaJobRunner.scala:100)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:55)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:55)
        at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:123)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobManager.updateStatus(DrmaaJobManager.scala:55)
        at org.broadinstitute.sting.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1076)
        at org.broadinstitute.sting.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1068)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
        at scala.collection.immutable.List.foreach(List.scala:45)
        at org.broadinstitute.sting.queue.engine.QGraph.updateStatus(QGraph.scala:1068)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:442)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:131)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:127)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)

crop up, followed by something like:

error: smallest event number 108 is greater than number 1 i'm waiting for

Does anyone have any idea of what might be going wrong? Either way, do you have any suggestions to help me move forward?

As a note, I have not tried running the 55 again, so it is possible that this would also now fail. In other words, I don't know whether the problem is due to some difference between the 55 and 112 sets, or if some part of the GATK that has been updated in the interim has introduced the problem. I can try running the original set again if it would be helpful.

Thanks, Jason