Tagged with #jobrunner
2 documentation articles | 0 announcements | 3 forum discussions

Created 2012-08-15 17:07:32 | Updated 2014-04-02 16:12:09 | Tags: jobs qfunction jobrunner

Implementing a Queue JobRunner

The following Scala methods need to be implemented for a new JobRunner. See the GridEngine and LSF implementations for complete, concrete examples.

1. class JobRunner.start()

start() should copy the settings from the CommandLineFunction into your job scheduler and invoke the command via sh <jobScript>. As an example of what needs to be implemented, here are the current contents of the start() method in MyCustomJobRunner, written as pseudocode.

  def start() {
    // TODO: Copy settings from function to your job scheduler syntax.

    val mySchedulerJob = new ...

    // Set the display name to 4000 characters of the description (or whatever your max is)
    mySchedulerJob.displayName = function.description.take(4000)

    // Set the output file for stdout
    mySchedulerJob.outputFile = function.jobOutputFile.getPath

    // Set the current working directory
    mySchedulerJob.workingDirectory = function.commandDirectory.getPath

    // If the error file is set specify the separate output for stderr
    if (function.jobErrorFile != null) {
      mySchedulerJob.errFile = function.jobErrorFile.getPath
    }

    // If a project name is set specify the project name
    if (function.jobProject != null) {
      mySchedulerJob.projectName = function.jobProject
    }

    // If the job queue is set specify the job queue
    if (function.jobQueue != null) {
      mySchedulerJob.queue = function.jobQueue
    }

    // If the resident set size is requested pass on the memory request
    if (residentRequestMB.isDefined) {
      mySchedulerJob.jobMemoryRequest = "%dM".format(residentRequestMB.get.ceil.toInt)
    }

    // If the resident set size limit is defined specify the memory limit
    if (residentLimitMB.isDefined) {
      mySchedulerJob.jobMemoryLimit = "%dM".format(residentLimitMB.get.ceil.toInt)
    }

    // If the priority is set (user specified Int) specify the priority
    if (function.jobPriority.isDefined) {
      mySchedulerJob.jobPriority = function.jobPriority.get
    }

    // Instead of running the function.commandLine, run "sh <jobScript>"
    mySchedulerJob.command = "sh " + jobScript

    // Store the status so it can be returned in the status method.
    myStatus = RunnerStatus.RUNNING

    // Start the job and store the id so it can be killed in tryStop
    myJobId = mySchedulerJob.start()
  }

2. class JobRunner.status

The status method should return one of the enum values from org.broadinstitute.sting.queue.engine.RunnerStatus:

  • RunnerStatus.RUNNING
  • RunnerStatus.DONE
  • RunnerStatus.FAILED
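
As a rough illustration of the mapping, here is a minimal sketch that translates scheduler states into the three values listed above. The RunnerStatus enumeration below is a stand-in for the real one, and SchedulerState, Queued, Executing, Exited and toRunnerStatus are hypothetical names invented for this example. Treating a queued job as RUNNING is also an assumption.

```scala
object StatusSketch {
  // Stand-in for org.broadinstitute.sting.queue.engine.RunnerStatus,
  // limited to the three values listed above.
  object RunnerStatus extends Enumeration {
    val RUNNING, DONE, FAILED = Value
  }

  // Hypothetical scheduler states; substitute whatever your scheduler reports.
  sealed trait SchedulerState
  case object Queued extends SchedulerState
  case object Executing extends SchedulerState
  case class Exited(exitCode: Int) extends SchedulerState

  // Assumption: queued and executing jobs are both reported as RUNNING,
  // and only a zero exit code counts as DONE.
  def toRunnerStatus(state: SchedulerState): RunnerStatus.Value = state match {
    case Queued | Executing => RunnerStatus.RUNNING
    case Exited(0)          => RunnerStatus.DONE
    case Exited(_)          => RunnerStatus.FAILED
  }
}
```

A status method could then poll the scheduler for the job id stored by start() and return the mapped value.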

3. object JobRunner.init()

Add any initialization code to the companion object static initializer. See the LSF or GridEngine implementations for how this is done.
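
One way to get run-once initialization in Scala is to put it in the body of the companion object, which executes the first time the object is referenced. The sketch below only illustrates that language feature and is not the LSF or GridEngine code. SchedulerSession, initCount and the empty init() trigger are hypothetical names.

```scala
object InitSketch {
  // Hypothetical stand-in for a scheduler client library session.
  class SchedulerSession {
    private var connected = false
    def connect(): Unit = { connected = true }
    def isConnected: Boolean = connected
  }

  class MyJobRunner {
    // Referencing the companion object forces its initializer to run,
    // but only the first time any runner is constructed.
    MyJobRunner.init()
  }

  object MyJobRunner {
    // The object body acts as the static initializer and runs once.
    var initCount = 0
    val session = new SchedulerSession
    session.connect()
    initCount += 1

    // Empty method whose only purpose is to trigger the initializer above.
    def init(): Unit = ()
  }
}
```

Constructing several runners still connects the session only once, because the object body is evaluated a single time.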

4. object JobRunner.tryStop()

The jobs that are still in RunnerStatus.RUNNING will be passed into this function. tryStop() should send these jobs the equivalent of a Ctrl-C or SIGTERM(15), or worst case a SIGKILL(9) if SIGTERM is not available.
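
A minimal sketch of the idea, assuming a scheduler client that exposes a single terminate call: SchedulerClient, terminate, the simplified MyJobRunner and the tryStop signature shown here are all hypothetical and do not reflect Queue's actual API. The point is to keep going when one job cannot be stopped.

```scala
object TryStopSketch {
  import scala.util.control.NonFatal

  // Hypothetical scheduler client; terminate() is assumed to deliver the
  // scheduler's equivalent of SIGTERM and may throw on failure.
  trait SchedulerClient {
    def terminate(jobId: String): Unit
  }

  // Hypothetical runner that remembers the id stored by start().
  class MyJobRunner(val jobId: String)

  // Ask the scheduler to stop every job, continuing past individual failures
  // so one bad job id does not leave the remaining jobs running.
  // Returns the ids of the jobs that could not be stopped.
  def tryStop(client: SchedulerClient, runners: Seq[MyJobRunner]): Seq[String] =
    runners.filterNot { runner =>
      try {
        client.terminate(runner.jobId)
        true
      } catch {
        case NonFatal(e) =>
          Console.err.println("Unable to stop job " + runner.jobId + ": " + e.getMessage)
          false
      }
    }.map(_.jobId)
}
```

Returning the ids that failed is a design choice for this sketch; a real implementation might only log them.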

Running Queue with a new JobRunner

Once there is a basic implementation, you can try out the Hello World example with -jobRunner MyJobRunner.

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S scala/qscript/examples/HelloWorld.scala -jobRunner MyJobRunner -run

If all goes well, Queue should dispatch the job to your job scheduler and wait until the status returns RunnerStatus.DONE, and "hello world" should be echoed into the output file, possibly with other log messages.

See QFunction and Command Line Options for more info on Queue options.

Created 2012-08-11 02:02:20 | Updated 2014-02-03 22:32:01 | Tags: queue developer gridengine jobrunner

1. Background

Thanks to contributions from the community, Queue contains a job runner compatible with Grid Engine 6.2u5.

As of July 2011 this is the currently known list of forked distributions of Sun's Grid Engine 6.2u5. As long as they are JDRMAA 1.0 source compatible with Grid Engine 6.2u5, the compiled Queue code should run against each of these distributions. However, we have yet to receive confirmation that Queue works on any of these setups.

Our internal QScript integration tests run the same tests on both LSF 7.0.6 and a Grid Engine 6.2u5 cluster set up on older software released by Sun.

If you run into trouble, please let us know. If you would like to contribute additions or bug fixes, please create a fork of our GitHub repo so that we can review and pull in the patch.

2. Running Queue with GridEngine

Try out the Hello World example with -jobRunner GridEngine.

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run

If all goes well, Queue should dispatch the job to Grid Engine and wait until the status returns RunnerStatus.DONE, and "hello world" should be echoed into the output file, possibly with other Grid Engine log messages.

See QFunction and Command Line Options for more info on Queue options.

3. Debugging issues with Queue and GridEngine

If you run into an error with Queue submitting jobs to GridEngine, first try submitting the HelloWorld example with -memLimit 2:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run -memLimit 2

Then try the following GridEngine qsub commands. They are based on what Queue submits via the API when running the HelloWorld.scala example with and without memory reservations and limits:

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=2048M -l h_rss=2458M echo hello world
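
The memory values in the second command appear to follow from -memLimit 2: a 2 GB request is 2048M, and the h_rss value is consistent with a limit of about 1.2 times the request, rounded up. The sketch below reproduces that arithmetic. The 1.2 factor is inferred from the numbers above rather than taken from Queue's source, and MemorySketch, megabytes and resourceArgs are hypothetical names.

```scala
object MemorySketch {
  // Assumed ratio between the hard limit and the request; inferred from the
  // 2048M / 2458M pair above, not taken from Queue's source.
  val limitFactor = 1.2

  // Format a size in gigabytes as whole megabytes, rounding up, in the same
  // "%dM" style used in the start() pseudocode.
  def megabytes(gb: Double): String = "%dM".format((gb * 1024).ceil.toInt)

  // Build the two -l resource arguments for a request given in gigabytes.
  def resourceArgs(requestGB: Double): Seq[String] = Seq(
    "-l", "mem_free=" + megabytes(requestGB),
    "-l", "h_rss=" + megabytes(requestGB * limitFactor)
  )
}
```

For a 2 GB request, resourceArgs(2) yields the same mem_free and h_rss values as the second qsub command.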

Another thing to check is whether your cluster enforces a memory limit. For example, try submitting jobs that request up to 16G:

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=4096M -l h_rss=4915M echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=8192M -l h_rss=9830M echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=16384M -l h_rss=19960M echo hello world

If the above tests pass and GridEngine still will not dispatch jobs submitted by Queue, please report the issue to our support forum.

Created 2015-02-16 16:20:08 | Updated 2015-02-16 16:22:23 | Tags: queue jobrunner

Hi all!

I'm working on trying to get a parallel version of the ShellJobRunner working in Queue, which would allow us to parallelize some parts of our workflows that are running single core on a full node using the ShellJobRunner and thus are wasting a lot of resources. I thought that I'd made some rather nice progress, until I noticed that if I tried to use it for any job running longer than about 5 minutes, the job runner would exit saying that its job failed, while in reality the job keeps running (so it obviously did not fail, and Queue doesn't kill it either).

The code I've come up with so far is available here: https://gist.github.com/johandahlberg/a9b7ac61c3aa2c654899 (As you can see, it's mostly stolen from the regular ShellJobRunner, with some Scala future stuff mixed in.)

I'm guessing that the problem comes from me abusing the ProcessController (and admittedly there are warnings in its source about it not being thread safe), but I'm not sure if there is any way that I can get around it. Any pointers here would be extremely appreciated. Also, if there is any general interest in this feature, I'd be happy to clean up the code a bit and submit a pull request for it upstream.


Created 2014-05-20 22:28:30 | Updated | Tags: gridengine scala jobrunner scatter-gather

Hi! I am happy to report that Queue and all the necessary tests for running GridEngine passed. The issue I am having is using a custom qscript to run a job in parallel. When I run the job on the cluster via qsub it runs in serial. Would someone be willing to look at my qsub syntax and my qscript to see if I am forgetting something?

The Qscript was a modified UnifiedGenotyper script configured to work with HaplotypeCaller:

package org.broadinstitute.sting.queue.qscripts.examples

import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._

class Haplotyper extends QScript {
  @Input(doc="The reference file for the bam files.", shortName="R")
  var referenceFile: File = _ // _ is scala shorthand for null

  @Input(doc="Bam file to genotype.", shortName="I")
  var bamFile: File = _

  @Input(doc="Output file.", shortName="o")
  var outputFile: File = _

  trait UnifiedGenotyperArguments extends CommandLineGATK {
    this.reference_sequence = qscript.referenceFile
    this.intervals = if (qscript.intervals == null) Nil else List(qscript.intervals)
    this.memoryLimit = 2
  }

  def script() {
    val genotyper = new HaplotypeCaller with UnifiedGenotyperArguments

    genotyper.scatterCount = 12
    genotyper.input_file :+= qscript.bamFile
    genotyper.out = swapExt(outputFile, qscript.bamFile, "bam", "vcf")
  }
}


and my Queue syntax was:

java -Djava.io.tmpdir=tmp -jar /location/of/queue/Queue.jar -S scripts/qscalascripts/haplotyper.scala -R human_g1k_v37 -I /source/input_file -o /destination/output/file -l debug -jobRunner GridEngine -run

When I use the above, the Queue script breaks up my job into 12 discrete pieces, but runs them all on one node of the cluster. Any pointers are most welcome.

Created 2014-03-06 21:31:54 | Updated | Tags: queue jobrunner java drmaa

Hi (once more) I am attempting to run Queue with a scala script and scheduling it with jobrunner. The script works nicely, but when I run it with jobRunner I get the error

"Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'drmaa':libdrmaa.so: cannot open shared object file: No such file or directory."

When I try to pass the location of the libdrmaa.so file (-Djava.library.path=/opt/sge625/sge/lib/lx24-amd64/) the result is the same.

How would I point jobRunner to the correct path for the libdrmaa.so library?