Thanks to contributions from the community, Queue contains a job runner compatible with Grid Engine 6.2u5.
Sun's Grid Engine 6.2u5 has been forked into a number of distributions (as of July 2011). As long as a fork remains JDRMAA 1.0 source-compatible with Grid Engine 6.2u5, the compiled Queue code should run against it. However, we have yet to receive confirmation that Queue works on any of these setups.
Our internal QScript integration tests run the same test suite against both LSF 7.0.6 and a Grid Engine 6.2u5 cluster set up with the older software released by Sun.
If you run into trouble, please let us know. If you would like to contribute additions or bug fixes, please create a fork of our GitHub repository where we can review and pull in the patch.
Try out the Hello World example with
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run
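For reference, HelloWorld.scala is only a few lines of Scala. A minimal sketch of such a QScript follows (the example actually shipped with Queue may differ slightly; the anonymous function here is illustrative):

import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.function.CommandLineFunction

class HelloWorld extends QScript {
  def script() {
    // Wrap a one-off shell command as a Queue job; Queue dispatches it
    // to whichever runner was selected (-jobRunner GridEngine here).
    add(new CommandLineFunction {
      def commandLine = "echo hello world"
    })
  }
}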
If all goes well, Queue should dispatch the job to Grid Engine and wait until the status returns RunningStatus.DONE, and "hello world" should be echoed into the output file, possibly along with other Grid Engine log messages.
See QFunction and Command Line Options for more info on Queue options.
If you run into an error with Queue submitting jobs to Grid Engine, first try submitting the HelloWorld example with an explicit memory limit:
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run -memLimit 2
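The -memLimit 2 argument sets a 2-gigabyte default for every job from the command line; the same limit can also be set per job inside a QScript. Building on the sketch above, and assuming memoryLimit on CommandLineFunction is an Option[Double] expressed in gigabytes (check this against your Queue version), the body of script() would look like:

// Hedged sketch: a per-job memory limit instead of the global -memLimit flag.
val hello = new CommandLineFunction {
  def commandLine = "echo hello world"
}
// ~2G; the Grid Engine runner should turn this into -l resource requests.
hello.memoryLimit = Some(2.0)
add(hello)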
Then try the following Grid Engine qsub commands. They are based on what Queue submits via the API when running the HelloWorld.scala example with and without memory reservations and limits:
qsub -w e -V -b y -N echo_hello_world \
    -o test.out -wd $PWD -j y echo hello world

qsub -w e -V -b y -N echo_hello_world \
    -o test.out -wd $PWD -j y \
    -l mem_free=2048M -l h_rss=2458M echo hello world
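Note the relationship between the two resource requests: in the commands above, h_rss is consistently about 1.2 times mem_free. A small sketch of that arithmetic (our inference from the numbers above, not the actual Queue source; the object and method names are illustrative):

object GridEngineMemoryArgs {
  // The hard RSS limit appears to be padded ~20% above the memory reservation.
  def qsubMemoryFlags(memLimitGb: Double): String = {
    val memFreeMb = (memLimitGb * 1024).round
    val hRssMb = (memLimitGb * 1024 * 1.2).round
    "-l mem_free=" + memFreeMb + "M -l h_rss=" + hRssMb + "M"
  }

  def main(args: Array[String]) {
    println(qsubMemoryFlags(2)) // prints: -l mem_free=2048M -l h_rss=2458M
  }
}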
One other thing to check is whether your cluster imposes a memory limit on jobs. For example, try submitting jobs requesting up to 16G:
qsub -w e -V -b y -N echo_hello_world \
    -o test.out -wd $PWD -j y \
    -l mem_free=4096M -l h_rss=4915M echo hello world

qsub -w e -V -b y -N echo_hello_world \
    -o test.out -wd $PWD -j y \
    -l mem_free=8192M -l h_rss=9830M echo hello world

qsub -w e -V -b y -N echo_hello_world \
    -o test.out -wd $PWD -j y \
    -l mem_free=16384M -l h_rss=19960M echo hello world
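If you prefer to script that probe, the following sketch (not part of Queue; it assumes qsub is on your PATH and reuses the resource pairs from the commands above) submits the same job at each size and reports the qsub exit codes:

import scala.sys.process._

object MemoryProbe {
  def main(args: Array[String]) {
    // (mem_free MB, h_rss MB) pairs matching the commands above.
    for ((memFree, hRss) <- Seq((4096, 4915), (8192, 9830), (16384, 19960))) {
      val cmd = Seq("qsub", "-w", "e", "-V", "-b", "y",
        "-N", "echo_hello_world", "-o", "test.out",
        "-wd", sys.props("user.dir"), "-j", "y",
        "-l", "mem_free=" + memFree + "M",
        "-l", "h_rss=" + hRss + "M",
        "echo", "hello", "world")
      // A non-zero exit code here points at a cluster-side limit, not Queue.
      println((memFree / 1024) + "G request => qsub exit code " + cmd.!)
    }
  }
}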
If the above tests pass and Grid Engine still will not dispatch jobs submitted by Queue, please report the issue to our support forum.
I am trying to run the GATK variant detection pipeline on 112 stickleback samples. I am using a GridEngine queue to parallelize this across our different machines. I have previously run the same code on a subset of the samples (55) and it worked fine. However, when I have tried to run on the full 112, I have run into some strange errors. In particular, things like:
commlib returns can't find connection
WARN 13:58:57,655 DrmaaJobRunner - Unable to determine status of job id 4970049
org.ggf.drmaa.DrmCommunicationException: failed receiving gdi request response for mid=19906 (can't find connection).
    at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:391)
    at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:381)
    at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.getJobProgramStatus(JnaSession.java:155)
    at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.liftedTree2$1(DrmaaJobRunner.scala:101)
    at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.updateJobStatus(DrmaaJobRunner.scala:100)
    at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:55)
    at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobManager$$anonfun$updateStatus$1.apply(DrmaaJobManager.scala:55)
    at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:123)
    at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
    at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
    at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:322)
    at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobManager.updateStatus(DrmaaJobManager.scala:55)
    at org.broadinstitute.sting.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1076)
    at org.broadinstitute.sting.queue.engine.QGraph$$anonfun$updateStatus$1.apply(QGraph.scala:1068)
    at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:61)
    at scala.collection.immutable.List.foreach(List.scala:45)
    at org.broadinstitute.sting.queue.engine.QGraph.updateStatus(QGraph.scala:1068)
    at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:442)
    at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:131)
    at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:127)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
    at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
crop up, followed by something like:
error: smallest event number 108 is greater than number 1 i'm waiting for
Does anyone have any idea of what might be going wrong? Either way, do you have any suggestions to help me move forward?
As a note, I have not tried running the 55 again, so it is possible that this set would also fail now. In other words, I don't know whether the problem is due to some difference between the 55-sample and 112-sample sets, or whether some part of the GATK updated in the interim has introduced the problem. I can try running the original set again if it would be helpful.