Tagged with #qscript
5 documentation articles | 0 announcements | 11 forum discussions

Created 2012-08-15 22:27:10 | Updated 2014-02-03 22:37:09 | Tags: test developer qscript scala

Comments (3)

In addition to testing walkers individually, you may want to also run integration tests for your QScript pipelines.

1. Brief comparison to the Walker integration tests

  • Pipeline tests should use the standard location for testing data.
  • Pipeline tests use the same test dependencies.
  • Pipeline tests which generate MD5 results will have the results stored in the MD5 database].
  • Pipeline tests, like QScripts, are written in Scala.
  • Pipeline tests dry-run under the ant target pipelinetest and run under pipelinetestrun.
  • Pipeline tests class names must end in PipelineTest to run under the ant target.
  • Pipeline tests should instantiate a PipelineTestSpec and then run it via PipelineTest.exec().

2. PipelineTestSpec

When building up a pipeline test spec specify the following variables for your test.

Variable Type Description
args String The arguments to pass to the Queue test, ex: -S scala/qscript/examples/HelloWorld.scala
jobQueue String Job Queue to run the test. Default is null which means use hour.
fileMD5s Map[Path, MD5] Expected MD5 results for each file path.
expectedException classOf[Exception] Expected exception from the test.

3. Example PipelineTest

The following example runs the ExampleCountLoci QScript on a small bam and verifies that the MD5 result is as expected.

It is checked into the Sting repository under scala/test/org/broadinstitute/sting/queue/pipeline/examples/ExampleCountLociPipelineTest.scala

package org.broadinstitute.sting.queue.pipeline.examples

import org.testng.annotations.Test
import org.broadinstitute.sting.queue.pipeline.{PipelineTest, PipelineTestSpec}
import org.broadinstitute.sting.BaseTest

class ExampleCountLociPipelineTest {
  def testCountLoci {
    val testOut = "count.out"
    val spec = new PipelineTestSpec
    spec.name = "countloci"
    spec.args = Array(
      " -S scala/qscript/examples/ExampleCountLoci.scala",
      " -R " + BaseTest.hg18Reference,
      " -I " + BaseTest.validationDataLocation + "small_bam_for_countloci.bam",
      " -o " + testOut).mkString
    spec.fileMD5s += testOut -> "67823e4722495eb10a5e4c42c267b3a6"

3. Running Pipeline Tests

Dry Run

To test if the script is at least compiling with your arguments run ant pipelinetest specifying the name of your class to -Dsingle:

ant pipelinetest -Dsingle=ExampleCountLociPipelineTest

Sample output:

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour
   [testng]   => countloci PASSED DRY RUN
   [testng] PASSED: testCountLoci


As of July 2011 the pipeline tests run against LSF 7.0.6 and Grid Engine 6.2u5. To include these two packages in your environment use the hidden dotkit .combined_LSF_SGE.

reuse .combined_LSF_SGE

Once you are satisfied that the dry run has completed without error, to actually run the pipeline test run ant pipelinetestrun.

ant pipelinetestrun -Dsingle=ExampleCountLociPipelineTest

Sample output:

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
   [testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] Checking MD5 for pipelinetests/countloci/run/count.out [calculated=67823e4722495eb10a5e4c42c267b3a6, expected=67823e4722495eb10a5e4c42c267b3a6]
   [testng]   => countloci PASSED
   [testng] PASSED: testCountLoci

Generating initial MD5s

If you don't know the MD5s yet you can run the command yourself on the command line and then MD5s the outputs yourself, or you can set the MD5s in your test to "" and run the pipeline.

When the MD5s are blank as in:

spec.fileMD5s += testOut -> ""

You run:

ant pipelinetest -Dsingle=ExampleCountLociPipelineTest -Dpipeline.run=run

And the output will look like:

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
   [testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is , equal? = false
   [testng]   => countloci PASSED
   [testng] PASSED: testCountLoci

Checking MD5s

When a pipeline test fails due to an MD5 mismatch you can use the MD5 database to diff the results.

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
   [testng] ##### Updating MD5 file: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] Checking MD5 for pipelinetests/countloci/run/count.out [calculated=67823e4722495eb10a5e4c42c267b3a6, expected=67823e4722495eb10a5e0000deadbeef]
   [testng] ##### Test countloci is going fail #####
   [testng] ##### Path to expected   file (MD5=67823e4722495eb10a5e0000deadbeef): integrationtests/67823e4722495eb10a5e0000deadbeef.integrationtest
   [testng] ##### Path to calculated file (MD5=67823e4722495eb10a5e4c42c267b3a6): integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] ##### Diff command: diff integrationtests/67823e4722495eb10a5e0000deadbeef.integrationtest integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] FAILED: testCountLoci
   [testng] java.lang.AssertionError: 1 of 1 MD5s did not match.

If you need to examine a number of MD5s which may have changed you can briefly shut off MD5 mismatch failures by setting parameterize = true.

spec.parameterize = true
spec.fileMD5s += testOut -> "67823e4722495eb10a5e4c42c267b3a6"

For this run:

ant pipelinetest -Dsingle=ExampleCountLociPipelineTest -Dpipeline.run=run

If there's a match the output will resemble:

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
   [testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is 67823e4722495eb10a5e4c42c267b3a6, equal? = true
   [testng]   => countloci PASSED
   [testng] PASSED: testCountLoci

While for a mismatch it will look like this:

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
   [testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is 67823e4722495eb10a5e0000deadbeef, equal? = false
   [testng]   => countloci PASSED
   [testng] PASSED: testCountLoci

Created 2012-08-11 02:28:42 | Updated 2015-02-10 23:51:33 | Tags: faq queue developer qscript

Comments (1)

1. Many of my GATK functions are setup with the same Reference, Intervals, etc. Is there a quick way to reuse these values for the different analyses in my pipeline?


  • Create a trait that extends from CommandLineGATK.
  • In the trait, copy common values from your qscript.
  • Mix the trait into instances of your classes.

For more information, see the ExampleUnifiedGenotyper.scala or examples of using Scala's traits/mixins illustrated in the QScripts documentation.

2. How do I accept a list of arguments to my QScript?

In your QScript, define a var list and annotate it with @Argument. Initialize the value to Nil.

@Argument(doc="filter names", shortName="filter")
var filterNames: List[String] = Nil

On the command line specify the arguments by repeating the argument name.

-filter filter1 -filter filter2 -filter filter3

Then once your QScript is run, the command line arguments will be available for use in the QScript's script method.

  def script {
     var myCommand = new MyFunction
     myCommand.filters = this.filterNames

For a full example of command line arguments see the QScripts documentation.

3. What is the best way to run a utility method at the right time?

Wrap the utility with an InProcessFunction. If your functionality is reusable code you should add it to Sting Utils with Unit Tests and then invoke your new function from your InProcessFunction. Computationally or memory intensive functions should NOT be implemented as InProcessFunctions, and should be wrapped in Queue CommandLineFunctions instead.

    class MySplitter extends InProcessFunction {
      var in: File = _

      var out: List[File] = Nil

      def run {
         StingUtilityMethod.quickSplitFile(in, out)

    var splitter = new MySplitter
    splitter.in = new File("input.txt")
    splitter.out = List(new File("out1.txt"), new File("out2.txt"))

See Queue CommandLineFunctions for more information on how @Input and @Output are used.

4. What is the best way to write a list of files?

Create an instance of a ListWriterFunction and add it in your script method.

import org.broadinstitute.sting.queue.function.ListWriterFunction

val writeBamList = new ListWriterFunction
writeBamList.inputFiles = bamFiles
writeBamList.listFile = new File("myBams.list")

5. How do I add optional debug output to my QScript?

Queue contains a trait mixin you can use to add Log4J support to your classes.

Add the import for the trait Logging to your QScript.

import org.broadinstitute.sting.queue.util.Logging

Mixin the trait to your class.

class MyScript extends Logging {

Then use the mixed in logger to write debug output when the user specifies -l DEBUG.

logger.debug("This will only be displayed when debugging is enabled.")

6. I updated Queue and now I'm getting java.lang.NoClassDefFoundError / java.lang.AbstractMethodError

Try ant clean.

Queue relies on a lot of Scala traits / mixins. These dependencies are not always picked up by the scala/java compilers leading to partially implemented classes. If that doesn't work please let us know in the forum.

7. Do I need to create directories in my QScript?

No. QScript will create all parent directories for outputs.

8. How do I specify the -W 240 for the LSF hour queue at the Broad?

Queue's LSF dispatcher automatically looks up and sets the maximum runtime for whichever LSF queue is specified. If you set your -jobQueue/.jobQueue to hour then you should see something like this under bjobs -l:

240.0 min of gsa3

9. Can I run Queue with GridEngine?

Queue GridEngine functionality is community supported. See here for full details: Queue with Grid Engine.

10. How do I pass advanced java arguments to my GATK commands, such as remote debugging?

The easiest way to do this at the moment is to mixin a trait.

First define a trait which adds your java options:

  trait RemoteDebugging extends JavaCommandLineFunction {
    override def javaOpts = super.javaOpts + " -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"

Then mix in the trait to your walker and otherwise run it as normal:

  val printReadsDebug = new PrintReads with RemoteDebugging
  printReadsDebug.reference_sequence = "my.fasta"
  // continue setting up your walker...

11. Why does Queue log "Running jobs. ... Done." but doesn't actually run anything?

If you see something like the following, it means that Queue believes that it previously successfully generated all of the outputs.

INFO 16:25:55,049 QCommandLine - Scripting ExampleUnifiedGenotyper 
INFO 16:25:55,140 QCommandLine - Added 4 functions 
INFO 16:25:55,140 QGraph - Generating graph. 
INFO 16:25:55,164 QGraph - Generating scatter gather jobs. 
INFO 16:25:55,714 QGraph - Removing original jobs. 
INFO 16:25:55,716 QGraph - Adding scatter gather jobs. 
INFO 16:25:55,779 QGraph - Regenerating graph. 
INFO 16:25:55,790 QGraph - Running jobs. 
INFO 16:25:55,853 QGraph - 0 Pend, 0 Run, 0 Fail, 10 Done 
INFO 16:25:55,902 QCommandLine - Done 

Queue will not re-run the job if a .done file is found for the all the outputs, e.g.: /path/to/.output.file.done. You can either remove the specific .done files yourself, or use the -startFromScratch command line option.

Created 2012-08-11 00:59:38 | Updated 2014-02-03 22:34:22 | Tags: queue developer qscript commandlinefunction

Comments (8)

1. Basic QScript run rules

  • In the script method, a QScript will add one or more CommandLineFunctions.
  • Queue tracks dependencies between functions via variables annotated with @Input and @Output.
  • Queue will run functions based on the dependencies between them, so if the @Input of CommandLineFunction A depends on the @Output of ComandLineFunction B, A will wait for B to finish before it starts running.

2. Command Line

Each CommandLineFunction must define the actual command line to run as follows.

class MyCommandLine extends CommandLineFunction {
  def commandLine = "myScript.sh hello world"

Constructing a Command Line Manually

If you're writing a one-off CommandLineFunction that is not destined for use by other QScripts, it's often easiest to construct the command line directly rather than through the API methods provided in the CommandLineFunction class.

For example:

def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)

Constructing a Command Line using API Methods

If you're writing a CommandLineFunction that will become part of Queue and/or will be used by other QScripts, however, our best practice recommendation is to construct your command line only using the methods provided in the CommandLineFunction class: required(), optional(), conditional(), and repeat()

The reason for this is that these methods automatically escape the values you give them so that they'll be interpreted literally within the shell scripts Queue generates to run your command, and they also manage whitespace separation of command-line tokens for you. This prevents (for example) a value like MQ > 10 from being interpreted as an output redirection by the shell, and avoids issues with values containing embedded spaces. The methods also give you the ability to turn escaping and/or whitespace separation off as needed. An example:

override def commandLine = super.commandLine +
                           required("eff") +
                           conditional(verbose, "-v") +
                           optional("-c", config) +
                           required("-i", "vcf") +
                           required("-o", "vcf") +
                           required(genomeVersion) +
                           required(inVcf) +
                           required(">", escape=false) +  // This will be shell-interpreted as an output redirection

The CommandLineFunctions built into Queue, including the CommandLineFunctions automatically generated for GATK Walkers, are all written using this pattern. This means that when you configure a GATK Walker or one of the other built-in CommandLineFunctions in a QScript, you can rely on all of your values being safely escaped and taken literally when the commands are run, including values containing characters that would normally be interpreted by the shell such as MQ > 10.

Below is a brief overview of the API methods available to you in the CommandLineFunction class for safely constructing command lines:

  • required()

Used for command-line arguments that are always present, e.g.:

required("-f", "filename")                              returns: " '-f' 'filename' "
required("-f", "filename", escape=false)                returns: " -f filename "
required("java")                                        returns: " 'java' "
required("INPUT=", "myBam.bam", spaceSeparated=false)   returns: " 'INPUT=myBam.bam' "
  • optional()

Used for command-line arguments that may or may not be present, e.g.:

optional("-f", myVar) behaves like required() if myVar has a value, but returns ""
if myVar is null/Nil/None
  • conditional()

Used for command-line arguments that should only be included if some condition is true, e.g.:

conditional(verbose, "-v") returns " '-v' " if verbose is true, otherwise returns ""
  • repeat()

Used for command-line arguments that are repeated multiple times on the command line, e.g.:

repeat("-f", List("file1", "file2", "file3")) returns: " '-f' 'file1' '-f' 'file2' '-f' 'file3' "

3. Arguments

  • CommandLineFunction arguments use a similar syntax to arguments.

  • CommandLineFunction variables are annotated with @Input, @Output, or @Argument annotations.

Input and Output Files

So that Queue can track the input and output files of a command, CommandLineFunction @Input and @Output must be java.io.File objects.

class MyCommandLine extends CommandLineFunction {
  @Input(doc="input file")
  var inputFile: File = _
  def commandLine = "myScript.sh -fileParam " + inputFile


CommandLineFunction variables can also provide indirect access to java.io.File inputs and outputs via the FileProvider trait.

class MyCommandLine extends CommandLineFunction {
  @Input(doc="named input file")
  var inputFile: ExampleFileProvider = _
  def commandLine = "myScript.sh " + inputFile

// An example FileProvider that stores a 'name' with a 'file'.
class ExampleFileProvider(var name: String, var file: File) extends org.broadinstitute.sting.queue.function.FileProvider {
  override def toString = " -fileName " + name + " -fileParam " + file

Optional Arguments

Optional files can be specified via required=false, and can use the CommandLineFunction.optional() utility method, as described above:

class MyCommandLine extends CommandLineFunction {
  @Input(doc="input file", required=false)
  var inputFile: File = _
  // -fileParam will only be added if the QScript sets inputFile on this instance of MyCommandLine
  def commandLine = required("myScript.sh") + optional("-fileParam", inputFile)

Collections as Arguments

A List or Set of files can use the CommandLineFunction.repeat() utility method, as described above:

class MyCommandLine extends CommandLineFunction {
  @Input(doc="input file")
  var inputFile: List[File] = Nil // NOTE: Do not set List or Set variables to null!
  // -fileParam will added as many times as the QScript adds the inputFile on this instance of MyCommandLine
  def commandLine = required("myScript.sh") + repeat("-fileParam", inputFile)

Non-File Arguments

A command line function can define other required arguments via @Argument.

class MyCommandLine extends CommandLineFunction {
  @Argument(doc="message to display")
  var veryImportantMessage: String = _
  // If the QScript does not specify the required veryImportantMessage, the pipeline will not run.
  def commandLine = required("myScript.sh") + required(veryImportantMessage)

4. Example: "samtools index"

class SamToolsIndex extends CommandLineFunction {
  @Input(doc="bam to index") var bamFile: File = _
  @Output(doc="bam index") var baiFile: File = _
  def commandLine = "samtools index %s %s".format(bamFile, baiFile)

Or, using the CommandLineFunction API methods to construct the command line with automatic shell escaping:

class SamToolsIndex extends CommandLineFunction {
  @Input(doc="bam to index") var bamFile: File = _
  @Output(doc="bam index") var baiFile: File = _
  def commandLine = required("samtools") + required("index") + required(bamFile) + required(baiFile)

Created 2012-08-10 20:00:40 | Updated 2014-02-03 22:35:27 | Tags: queue developer qscript

Comments (9)

1. Introduction

Queue pipelines are Scala 2.8 files with a bit of syntactic sugar, called QScripts. Check out the following as references.

QScripts are easiest to develop using an Integrated Development Environment. See Queue with IntelliJ IDEA for our recommended settings.

The following is a basic outline of a QScript:

import org.broadinstitute.sting.queue.QScript
// List other imports here

// Define the overall QScript here.
class MyScript extends QScript {
  // List script arguments here.
  @Input(doc="My QScript inputs")
  var scriptInput: File = _

  // Create and add the functions in the script here.
  def script = {
     var myCL = new MyCommandLine
     myCL.myInput = scriptInput // Example variable input
     myCL.myOutput = new File("/path/to/output") // Example hardcoded output


2. Imports

Imports can be any scala or java imports in scala syntax.

import java.io.File
import scala.util.Random
import org.favorite.my._
// etc.

3. Classes

  • To add a CommandLineFunction to a pipeline, a class must be defined that extends QScript.

  • The QScript must define a method script.

  • The QScript can define helper methods or variables.

4. Script method

The body of script should create and add Queue CommandlineFunctions.

class MyScript extends org.broadinstitute.sting.queue.QScript {
  def script = add(new CommandLineFunction { def commandLine = "echo hello world" })

5. Command Line Arguments

  • A QScript canbe set to read command line arguments by defining variables with @Input, @Output, or @Argument annotations.

  • A command line argument can be a primitive scalar, enum, File, or scala immutable Array, List, Set, or Option of a primitive, enum, or File.

  • QScript command line arguments can be marked as optional by setting required=false.

    class MyScript extends org.broadinstitute.sting.queue.QScript { @Input(doc="example message to echo") var message: String = _ def script = add(new CommandLineFunction { def commandLine = "echo " + message }) }

6. Using and writing CommandLineFunctions

Adding existing GATK walkers

See Pipelining the GATK using Queue for more information on the automatically generated Queue wrappers for GATK walkers.

After functions are defined they should be added to the QScript pipeline using add().

for (vcf <- vcfs) {
  val ve = new VariantEval
  ve.vcfFile = vcf
  ve.evalFile = swapExt(vcf, "vcf", "eval")

Defining new CommandLineFunctions

  • Queue tracks dependencies between functions via variables annotated with @Input and @Output.

  • Queue will run functions based on the dependencies between them, not based on the order in which they are added in the script! So if the @Input of CommandLineFunction A depends on the @Output of ComandLineFunction B, A will wait for B to finish before it starts running.

  • See the main article Queue CommandLineFunctions for more information.

7. Examples

Hello World QScript

The following is a "hello world" example that runs a single command line to echo hello world.

import org.broadinstitute.sting.queue.QScript

class HelloWorld extends QScript {
  def script = {
    add(new CommandLineFunction {
      def commandLine = "echo hello world"

The above file is checked into the Sting git repository under HelloWorld.scala. After building Queue from source, the QScript can be run with the following command:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run

It should produce output similar to:

INFO  16:23:27,825 QScriptManager - Compiling 1 QScript 
INFO  16:23:31,289 QScriptManager - Compilation complete 
INFO  16:23:34,631 HelpFormatter - --------------------------------------------------------- 
INFO  16:23:34,631 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine 
INFO  16:23:34,632 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run  
INFO  16:23:34,632 HelpFormatter - Date/Time: 2011/01/14 16:23:34 
INFO  16:23:34,632 HelpFormatter - --------------------------------------------------------- 
INFO  16:23:34,632 HelpFormatter - --------------------------------------------------------- 
INFO  16:23:34,634 QCommandLine - Scripting HelloWorld 
INFO  16:23:34,651 QCommandLine - Added 1 functions 
INFO  16:23:34,651 QGraph - Generating graph. 
INFO  16:23:34,660 QGraph - Running jobs. 
INFO  16:23:34,689 ShellJobRunner - Starting: echo hello world 
INFO  16:23:34,689 ShellJobRunner - Output written to /Users/kshakir/src/Sting/Q-43031@bmef8-d8e-1.out 
INFO  16:23:34,771 ShellJobRunner - Done: echo hello world 
INFO  16:23:34,773 QGraph - Deleting intermediate files. 
INFO  16:23:34,773 QCommandLine - Done 


This example uses automatically generated Queue compatible wrappers for the GATK. See Pipelining the GATK using Queue for more info on authoring Queue support into walkers and using walkers in Queue.

The ExampleUnifiedGenotyper.scala for running the UnifiedGenotyper followed by VariantFiltration can be found in the examples folder.

To list the command line parameters, including the required parameters, run with -help.

java -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -help

The help output should appear similar to this:

INFO  10:26:08,491 QScriptManager - Compiling 1 QScript
INFO  10:26:11,926 QScriptManager - Compilation complete
Program Name: org.broadinstitute.sting.queue.QCommandLine
usage: java -jar Queue.jar -S <script> [-run] [-jobRunner <job_runner>] [-bsub] [-status] [-retry <retry_failed>]
       [-startFromScratch] [-keepIntermediates] [-statusTo <status_email_to>] [-statusFrom <status_email_from>] [-dot
       <dot_graph>] [-expandedDot <expanded_dot_graph>] [-jobPrefix <job_name_prefix>] [-jobProject <job_project>] [-jobQueue
       <job_queue>] [-jobPriority <job_priority>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
       <temp_directory>] [-jobSGDir <job_scatter_gather_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>]
       [-emailTLS] [-emailSSL] [-emailUser <emailUsername>] [-emailPassFile <emailPasswordFile>] [-emailPass <emailPassword>]
       [-l <logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h] -R <referencefile> -I <bamfile> [-L <intervals>]
       [-filter <filternames>] [-filterExpression <filterexpressions>]

 -S,--script <script>                                                      QScript scala file
 -run,--run_scripts                                                        Run QScripts.  Without this flag set only
                                                                           performs a dry run.
 -jobRunner,--job_runner <job_runner>                                      Use the specified job runner to dispatch
                                                                           command line jobs
 -bsub,--bsub                                                              Equivalent to -jobRunner Lsf706
 -status,--status                                                          Get status of jobs for the qscript
 -retry,--retry_failed <retry_failed>                                      Retry the specified number of times after a
                                                                           command fails.  Defaults to no retries.
 -startFromScratch,--start_from_scratch                                    Runs all command line functions even if the
                                                                           outputs were previously output successfully.
 -keepIntermediates,--keep_intermediate_outputs                            After a successful run keep the outputs of
                                                                           any Function marked as intermediate.
 -statusTo,--status_email_to <status_email_to>                             Email address to send emails to upon
                                                                           completion or on error.
 -statusFrom,--status_email_from <status_email_from>                       Email address to send emails from upon
                                                                           completion or on error.
 -dot,--dot_graph <dot_graph>                                              Outputs the queue graph to a .dot file.  See:
 -expandedDot,--expanded_dot_graph <expanded_dot_graph>                    Outputs the queue graph of scatter gather to
                                                                           a .dot file.  Otherwise overwrites the
 -jobPrefix,--job_name_prefix <job_name_prefix>                            Default name prefix for compute farm jobs.
 -jobProject,--job_project <job_project>                                   Default project for compute farm jobs.
 -jobQueue,--job_queue <job_queue>                                         Default queue for compute farm jobs.
 -jobPriority,--job_priority <job_priority>                                Default priority for jobs.
 -memLimit,--default_memory_limit <default_memory_limit>                   Default memory limit for jobs, in gigabytes.
 -runDir,--run_directory <run_directory>                                   Root directory to run functions from.
 -tempDir,--temp_directory <temp_directory>                                Temp directory to pass to functions.
 -jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>   Default directory to place scatter gather
                                                                           output for compute farm jobs.
 -emailHost,--emailSmtpHost <emailSmtpHost>                                Email SMTP host. Defaults to localhost.
 -emailPort,--emailSmtpPort <emailSmtpPort>                                Email SMTP port. Defaults to 465 for ssl,
                                                                           otherwise 25.
 -emailTLS,--emailUseTLS                                                   Email should use TLS. Defaults to false.
 -emailSSL,--emailUseSSL                                                   Email should use SSL. Defaults to false.
 -emailUser,--emailUsername <emailUsername>                                Email SMTP username. Defaults to none.
 -emailPassFile,--emailPasswordFile <emailPasswordFile>                    Email SMTP password file. Defaults to none.
 -emailPass,--emailPassword <emailPassword>                                Email SMTP password. Defaults to none. Not
                                                                           secure! See emailPassFile.
 -l,--logging_level <logging_level>                                        Set the minimum level of logging, i.e.
                                                                           setting INFO get's you INFO up to FATAL,
                                                                           setting ERROR gets you ERROR and FATAL level
 -log,--log_to_file <log_to_file>                                          Set the logging location
 -quiet,--quiet_output_mode                                                Set the logging to quiet mode, no output to
 -debug,--debug_mode                                                       Set the logging file string to include a lot
                                                                           of debugging information (SLOW!)
 -h,--help                                                                 Generate this help message

Arguments for ExampleUnifiedGenotyper:
 -R,--referencefile <referencefile>                          The reference file for the bam files.
 -I,--bamfile <bamfile>                                      Bam file to genotype.
 -L,--intervals <intervals>                                  An optional file with a list of intervals to proccess.
 -filter,--filternames <filternames>                         A optional list of filter names.
 -filterExpression,--filterexpressions <filterexpressions>   An optional list of filter expressions.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
Argument with name '--bamfile' (-I) is missing.
Argument with name '--referencefile' (-R) is missing.
        at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:192)
        at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:172)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:199)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:57)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 1.0.5504):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Argument with name '--bamfile' (-I) is missing.
##### ERROR Argument with name '--referencefile' (-R) is missing.
##### ERROR ------------------------------------------------------------------------------------------

To dry run the pipeline:

java \
  -Djava.io.tmpdir=tmp \
  -jar dist/Queue.jar \
  -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala \
  -R human_b36_both.fasta \
  -I pilot2_daughters.chr20.10k-11k.bam \
  -L chr20.interval_list \
  -filter StrandBias -filterExpression "SB>=0.10" \
  -filter AlleleBalance -filterExpression "AB>=0.75" \
  -filter QualByDepth -filterExpression "QD<5" \
  -filter HomopolymerRun -filterExpression "HRun>=4"

The dry run output should appear similar to this:

INFO  10:45:00,354 QScriptManager - Compiling 1 QScript
INFO  10:45:04,855 QScriptManager - Compilation complete
INFO  10:45:05,058 HelpFormatter - ---------------------------------------------------------
INFO  10:45:05,059 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO  10:45:05,059 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -R human_b36_both.fasta -I pilot2_daughters.chr20.10k-11k.bam -L chr20.interval_list -filter StrandBias -filterExpression SB>=0.10 -filter AlleleBalance -filterExpression AB>=0.75 -filter QualByDepth -filterExpression QD<5 -filter HomopolymerRun -filterExpression HRun>=4 
INFO  10:45:05,059 HelpFormatter - Date/Time: 2011/03/24 10:45:05
INFO  10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO  10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO  10:45:05,061 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO  10:45:05,150 QCommandLine - Added 4 functions
INFO  10:45:05,150 QGraph - Generating graph.
INFO  10:45:05,169 QGraph - Generating scatter gather jobs.
INFO  10:45:05,182 QGraph - Removing original jobs.
INFO  10:45:05,183 QGraph - Adding scatter gather jobs.
INFO  10:45:05,231 QGraph - Regenerating graph.
INFO  10:45:05,247 QGraph - -------
INFO  10:45:05,252 QGraph - Pending: IntervalScatterFunction /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals
INFO  10:45:05,253 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/scatter/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,254 QGraph - -------
INFO  10:45:05,279 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO  10:45:05,279 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,279 QGraph - -------
INFO  10:45:05,283 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO  10:45:05,283 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,283 QGraph - -------
INFO  10:45:05,287 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO  10:45:05,287 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,288 QGraph - -------
INFO  10:45:05,288 QGraph - Pending: SimpleTextGatherFunction /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,288 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-jobOutputFile/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,289 QGraph - -------
INFO  10:45:05,291 QGraph - Pending: java -Xmx1g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T CombineVariants -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:input0,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input1,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input2,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -priority input0,input1,input2 -assumeIdenticalSamples
INFO  10:45:05,291 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-out/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,292 QGraph - -------
INFO  10:45:05,296 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.eval
INFO  10:45:05,296 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-2.out
INFO  10:45:05,296 QGraph - -------
INFO  10:45:05,299 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantFiltration -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:vcf,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -filter SB>=0.10 -filter AB>=0.75 -filter QD<5 -filter HRun>=4 -filterName StrandBias -filterName AlleleBalance -filterName QualByDepth -filterName HomopolymerRun
INFO  10:45:05,299 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-3.out
INFO  10:45:05,302 QGraph - -------
INFO  10:45:05,303 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.eval
INFO  10:45:05,303 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-4.out
INFO  10:45:05,304 QGraph - Dry run completed successfully!
INFO  10:45:05,304 QGraph - Re-run with "-run" to execute the functions.
INFO  10:45:05,304 QCommandLine - Done

8. Using traits to pass common values between QScripts to CommandLineFunctions

QScript files often create multiple CommandLineFunctions with similar arguments. Use various scala tricks such as inner classes, traits / mixins, etc. to reuse variables.

  • A self type can be useful to distinguish between this. We use qscript as an alias for the QScript's this to distinguish from the this inside of inner classes or traits.

  • A trait mixin can be used to reuse functionality. The trait below is designed to copy values from the QScript and then is mixed into different instances of the functions.

See the following example:

class MyScript extends org.broadinstitute.sting.queue.QScript {
  // Create an alias 'qscript' for 'MyScript.this'
  qscript =>

  // This is a script argument
  @Argument(doc="message to display")
  var message: String = _

  // This is a script argument
  @Argument(doc="number of times to display")
  var count: Int = _

  trait ReusableArguments extends MyCommandLineFunction {
    // Whenever a function is created 'with' this trait, it will copy the message.
    this.commandLineMessage = qscript.message

  abstract class MyCommandLineFunction extends CommandLineFunction {
     // This is a per command line argument
     @Argument(doc="message to display")
     var commandLineMessage: String = _

  class MyEchoFunction extends MyCommandLineFunction {
     def commandLine = "echo " + commandLineMessage

  class MyAlsoEchoFunction extends MyCommandLineFunction {
     def commandLine = "echo also " + commandLineMessage

  def script = {
    for (i <- 1 to count) {
      val echo = new MyEchoFunction with ReusableArguments
      val alsoEcho = new MyAlsoEchoFunction with ReusableArguments
      add(echo, alsoEcho)

Created 2012-08-10 19:20:46 | Updated 2014-10-16 21:23:42 | Tags: intro queue developer qscript

Comments (12)

1. Introduction

GATK-Queue is command-line scripting framework for defining multi-stage genomic analysis pipelines combined with an execution manager that runs those pipelines from end-to-end. Often processing genome data includes several steps to produces outputs, for example our BAM to VCF calling pipeline include among other things:

  • Local realignment around indels
  • Emitting raw SNP calls
  • Emitting indels
  • Masking the SNPs at indels
  • Annotating SNPs using chip data
  • Labeling suspicious calls based on filters
  • Creating a summary report with statistics

Running these tools one by one in series may often take weeks for processing, or would require custom scripting to try and optimize using parallel resources.

With a Queue script users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.

2. Obtaining Queue

You have two options: download the binary distribution (prepackaged, ready to run program) or build it from source.

- Download the binary

This is obviously the easiest way to go. Links are on the Downloads page. Just get the Queue package; no need to get the GATK package separately as GATK is bundled in with Queue.

- Building Queue from source

Briefly, here's what you need to know/do:

Queue is part of the GATK repository. Download the source from the public repository on Github. Run the following command:

git clone https://github.com/broadgsa/gatk.git

IMPORTANT NOTE: These instructions refer to the MIT-licensed version of the GATK+Queue source code. With that version, you will be able to build Queue itself, as well as the public portion of the GATK (the core framework), but that will not include the GATK analysis tools. If you want to use Queue to pipeline the GATK analysis tools, you need to clone the 'protected' repository. Please note however that part of the source code in that repository (the 'protected' module) is under a different license which excludes for-profit use, modification and redistribution.

Move to the git root directory and use maven to build the source.

mvn clean verify

All dependencies will be managed by Maven as needed.

See this article on how to test your installation of Queue.

3. Running Queue

See this article on running Queue for the first time for full details.

Queue arguments can be listed by running with --help

java -jar dist/Queue.jar --help

To list the arguments required by a QScript, add the script with -S and run with --help.

java -jar dist/Queue.jar -S script.scala --help

Note that by default queue runs in a "dry" mode, as explained in the link above. After verifying the generated commands execute the pipeline by adding -run.

See QFunction and Command Line Options for more info on adjusting Queue options.

4. QScripts

General Information

Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.

Every QScript includes the following steps:

  • New instances of CommandLineFunctions are created
  • Input and output arguments are specified on each function
  • The function is added with add() to Queue for dispatch and monitoring

The basic command-line to run the Queue pipelines on the command line is

java -jar Queue.jar -S <script>.scala

See the main article Queue QScripts for more info on QScripts.

Supported QScripts

Most QScripts are analysis pipelines that are custom-built for specific projects, and we currently do not offer any QScripts as supported analysis tools. However, we do provide some example scripts that you can use as basis to write your own QScripts (see below).

Example QScripts

The latest version of the example files are available in the Sting github repository under public/scala/qscript/examples

5. Visualization and Queue


Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.

Note that Queue attempts to generate a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally the script requires the gsalib to be installed on the machine, which is typically done by providing its path in your .Rprofile file:

bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile

Note that gsalib is available from the CRAN repository so you can install it with the canonical R package install command.


  • The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the completed commands. This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves

  • This feature only works for command line and LSF execution models. SGE should be easy to add for a motivated individual but we cannot test this capabilities here at the Broad. Please send us a patch if you do extend Queue to support SGE.

DOT visualization of Pipelines

Queue emits a queue.dot file to help visualize your commands. You can open this file in programs like DOT, OmniGraffle, etc to view your pipelines. By default the system will print out your LSF command lines, but this can be too much in a complex pipeline.

To clarify your pipeline, override the dotString() function:

class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
    @Input(doc="foo") var bam = bamIn
    @Input(doc="foo") var bamIndex = bai(bamIn)
    @Output(doc="foo") var recalData = recalDataIn
    memoryLimit = Some(4)
    override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
    def commandLine = gatkCommandLine("CountCovariates") + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)

Here we only see CountCovariates my.bam [-OQ], for example, in the dot file. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here:

6. Further reading

No articles to display.

Created 2016-03-01 13:02:06 | Updated 2016-03-01 13:08:47 | Tags: baserecalibrator commandlinegatk queue qscript analyzecovariates

Comments (8)

Dear GATK team,

I'd like to ask a question about the possibility of HaplotypeCaller and AnalyzeCovariates running in parallel: I developed a QScript that runs indel realignment, BQSR, variant calling (obtaining gVCF file as a result) and then -- BaseRecalibrator for the second time followed by AnalyzeCovariates. Looking in QScript jobreport PDF file I noticed that the second run of BaseRecalibrator was performed after HaplotypeCaller, though as I can understand, HaplotypeCaller and the second run of BaseRecalibrator are independent regarding data and potentially can run in paralle.

Can I run them in parallel somehow in order to save time?

Created 2016-02-23 18:01:29 | Updated | Tags: commandlinegatk queue qscript analyzecovariates

Comments (2)

Dear GATK team,

I use Queue to build a pipeline with GATK tools: RealignerTargetCreator, IndelRealigner, BaseRecalibrator, AnalyzeCovariates, and HaplotypeCaller. So, I developed the corresponding QScript. When I tested it the very first time, all functions finished with no errors. But then, when I invoked it with -startfromScratch it failed to execute AnalyzeCovariates saying:

ERROR 19:08:14,037 FunctionEdge - Error: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLi mit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv' ERROR 19:08:14,045 FunctionEdge - Contents of bqsr-report.pdf.out: [...]

In bqsr-report.out file I can see no errors:

INFO 17:25:12,764 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:25:12,767 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:40 INFO 17:25:12,767 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 17:25:12,767 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 17:25:12,771 HelpFormatter - Program Args: -T AnalyzeCovariates -L intervals_to_process.interval_list -R Homo_sapiens_assembly38.fasta -before recal-table1.txt -after recal-table2.txt -plots bqsr-report.pdf -csv bqsr-report.csv INFO 17:25:12,779 HelpFormatter - Executing as [...] INFO 17:25:12,779 HelpFormatter - Date/Time: 2016/02/23 17:25:12 INFO 17:25:12,780 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:25:12,780 HelpFormatter - -------------------------------------------------------------------------------- INFO 17:25:12,838 GenomeAnalysisEngine - Strictness is SILENT INFO 17:25:13,079 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 17:25:13,196 IntervalUtils - Processing 3088286401 bp from intervals INFO 17:25:13,277 GenomeAnalysisEngine - Preparing for traversal INFO 17:25:13,287 GenomeAnalysisEngine - Done preparing for traversal INFO 17:25:13,287 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 17:25:13,288 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 17:25:13,289 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime INFO 17:25:13,764 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3 INFO 17:25:13,921 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3 INFO 17:25:13,931 AnalyzeCovariates - Generating csv file 'bqsr-report.csv' INFO 17:25:14,164 AnalyzeCovariates - Generating plots file 'bqsr-report.pdf' INFO 17:25:20,279 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.gatk.tools.walkers.bqsr.AnalyzeCovariates$None@3ce295f9 INFO 17:25:20,282 ProgressMeter - done 0.0 6.0 s 11.6 w 100.0% 6.0 s 0.0 s INFO 17:25:20,283 ProgressMeter - Total runtime 7.00 secs, 0.12 min, 0.00 hours INFO 17:25:21,513 GATKRunReport - Uploaded run statistics report to AWS S3

I can see that AnalyzeCovariates and HaplotypeCaller start nearly at the same time:

INFO 17:25:09,264 FunctionEdge - Starting: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv' INFO 17:25:09,264 FunctionEdge - Output written to bqsr-report.pdf.out INFO 17:25:21,548 FunctionEdge - Starting: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'HaplotypeCaller' '-I' 'recaled.bam' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-variant_index_type' 'LINEAR' '-variant_index_parameter' '128000' '-o' 'sample.gvcf' '-D' 'dbsnp_144.hg38.vcf' '-ERC' 'GVCF' '-pcrModel' 'CONSERVATIVE' INFO 17:25:21,548 FunctionEdge - Output written to sample.gvcf.out

and after HaplotypeCaller finishes successfully (I checked that it produced a gVCF file), this message about AnalyzeCovariates error is printed:

INFO 19:08:14,029 QGraph - 0 Pend, 2 Run, 0 Fail, 10 Done ERROR 19:08:14,037 FunctionEdge - Error: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLi mit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv'

Of course, AnalyzeCovariates didn't produce CSV and PDF reports this way.

When I invoked the QScript again (this time not from scratch) to reproduce the situation, it executed with no errors, and AnalyzeCovariates generated both CSV and PDF reports.

After that I executed AnalyzeCovariates manually with -l DEBUG and it produced only CSV report, no PDF. Here is the central part of the debug output of AnalyzeCovariates:

`DEBUG 20:38:28,593 RecalUtils - R command line: Rscript (resource)org/broadinstitute/gatk/engine/recalibration/BQSR.R bqsr-report.csv recal-table1.txt bqsr-report1.pdf DEBUG 20:38:28,607 RScriptExecutor - Executing: DEBUG 20:38:28,607 RScriptExecutor - Rscript DEBUG 20:38:28,607 RScriptExecutor - -e DEBUG 20:38:28,608 RScriptExecutor - tempLibDir = '/tmp/Rlib.4876869209103817519';source('/tmp/BQSR.3779158431229884689.R'); DEBUG 20:38:28,608 RScriptExecutor - bqsr-report.csv DEBUG 20:38:28,608 RScriptExecutor - recal-table1.txt DEBUG 20:38:28,608 RScriptExecutor - bqsr-report1.pdf

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:


Warning messages: 1: NAs introduced by coercion 2: NAs introduced by coercion DEBUG 20:38:34,790 RScriptExecutor - Result: 0 INFO 20:38:34,792 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.gatk.tools.walkers.bqsr.AnalyzeCovariates$Non e@44723d95 INFO 20:38:34,795 ProgressMeter - done 0.0 7.0 s 11.6 w 100.0% 7.0 s 0.0 s INFO 20:38:34,795 ProgressMeter - Total runtime 7.03 secs, 0.12 min, 0.00 hours `

So, every time, when I start my QScript from scratch AnalyzeCovariates fails, but it finished successfully when I invoke my script the next time without -startFromScratch. When I execute AnalyzeCovariates manually with -l DEBUG, it produces CSV report only.

Should I use BQSR.R script directly?

I will be very grateful for any tips and help.

Created 2015-01-23 19:00:21 | Updated | Tags: queue qscript scala

Comments (6)

Right now, when I run "java -jar Queue.jar QScript.scala", scalac ("QScriptManager") complains that it can't find classes and objects that are in the same package as my QScript, as well as methods declared in my package object. When I try to import these explicitly, I get "object is not a member of package x".

How can I tell Queue to pass to scalac all of the files in my package?

I'm using Scala 2.11.4.


Created 2014-06-09 13:45:13 | Updated | Tags: queue qscript commandlinefunction

Comments (5)


I've been using Queue extensively recently, and it's great. I'm now trying to run a Qfunction that takes a list of input files, and uses them all as dependencies just as they would be if they were single inputs. Here's an example of what I want:

  case class reportWriter(inFiles:List[File], file:File, report:DataReport) extends CommandLineFunction {
    @Input(doc="files to rely on") val files:List[File] = inFiles 
    @Output(doc="report file to write") val reportFile:File = file
    def commandLine = "reportWriter " + inFiles + ">" + reportFile
    this.isIntermediate = false
    this.jobName = "writeReport"
    this.analysisName = this.jobName

I the call populate a List[File] parameters in my pipeline, and pass it as an argument like so:

  add( GeneralUtils.writeReport(fileList, reportFile, report) )

However, the files in my List[File] are not treated as dependencies and the job is run the first thing that happens in the pipeline.

Is there a way to use a List[File] as input to a CommandLineFunction, and have it use all elements of the List as dependencies?

Created 2014-04-03 23:13:39 | Updated | Tags: queue vcf qscript scala

Comments (7)


Thanks to previous replies can run Queue and the relevant walker on a distributed computing server. The question was if I define my scala script to require an argument for the output file, using the -o parameter like so:

            // Required arguments.  All initialized to empty values.

             @Input(doc="Output file.", shortName="o")
                              var outputFile: File = _

How do I direct the output to pipe the result to a specified directory? Currently I have the code: genotyper.out = swapExt(qscript.bamFile, "bam", outputFile, "unfiltered.vcf")

Currently when I include the string -o /path/to/my/output/files/MyResearch.vcf

The script creates a series of folders within the directory where I execute Queue from. In this case my results were sent to: /Queue-2-8-1-g932cd3a/MyResearch./path/to/my/output/files/MyResearch.unfiltered.vcf

when all I wanted was the output to appear in the path: /path/to/my/output/files/MyResearch.unfiltered.vcf

As always any help is much appreciated.

Created 2013-11-26 18:10:04 | Updated | Tags: selectvariants queue qscript walkers java

Comments (12)

I am working on a Queue script that uses the selectVariants walker. Two of the arguments that I am trying to use both use an enumerated type: restrictAllelesTo and selectTypeToInclude. I have tried passing these as strings however I get java type mismatch errors. What is the simplest way to pass these parameters to the selectVariant walker in the qscript?

Created 2013-11-21 15:10:07 | Updated | Tags: qscript

Comments (2)


I am trying to understand how GATK Queue works, and wrote a mini script to do alignment using bwa mem. The test code failed when running.

The error messages: not found: type not found: type ExternalCommonArgs ERROR 21:25:34,464 QScriptManager - case class bwa_mem (inFastQ: File, outBam: File) extends CommandLineFunction with ExternalCommonArgs ERROR 21:25:34,464 QScriptManager - case class bwa_mem (inFastQ: File, outBam: File) extends CommandLineFunction with ExternalCommonArgs

Then I removed ExternalCommonArgs and ran again. The error is: ERROR MESSAGE: Non-file found. Try removing the annotation, change the annotation to @Argument, or extend File with FileExtension: private int temp.bwaThreads: 1

Any suggestions for debugging? My code is attached below.


import org.broadinstitute.sting.queue.QScript

class temp extends QScript {

qscript =>

@Input(doc="input fastq file - or list of fastq files", shortName="I", required=true) var input: File = _

@Input(doc="Reference fasta file", shortName="R", required=true) var reference: File = _

@Input(doc="Number of threads BWA should use", fullName="bwa_threads", shortName="bt", required=false) var bwaThreads: Int = 1

val queueLogDir: String = ".qlog/" // Gracefully hide Queue's output

def performAlignment(fq: File): File = { var alignedBam: File = swapExt(fq, ".fq", ".sam") add(bwa_mem(fq, alignedBam)) alignedBam


def script() { var outBam: File = performAlignment(input)

case class bwa_mem (inFastQ: File, outBam: File) extends CommandLineFunction with ExternalCommonArgs{ @Input(doc="fastq file to be aligned") var fq = inFastQ @Output(doc="output bam file") var bam = outBam def commandLine = "bwa mem -t " + bwaThreads + " " + reference + " " + fq + " > " + bam this.analysisName = queueLogDir + outBam + ".bwamem" this.jobName = queueLogDir + outBam + ".bwamem" } }

Created 2013-11-20 06:41:02 | Updated 2013-11-20 06:41:55 | Tags: queue qscript picard markduplicates

Comments (9)


So I've finally taken the plunge and migrated our analysis pipeline to Queue. With some great feedback from @johandahlberg, I have gotten to a state where most of the stuff is running smoothly on the cluster.

I'm trying to add Picard's CalculateHSMetrics to the pipeline, but am having some issues. This code:

case class hsmetrics(inBam: File, baitIntervals: File, targetIntervals: File, outMetrics: File) extends CalculateHsMetrics with ExternalCommonArgs with SingleCoreJob with OneDayJob {
    @Input(doc="Input BAM file") val bam: File = inBam
    @Output(doc="Metrics file") val metrics: File = outMetrics
    this.input :+= bam
    this.targets = targetIntervals
    this.baits = baitIntervals
    this.output = metrics
    this.reference =  refGenome
    this.isIntermediate = false

Gives the following error message:

ERROR 06:56:25,047 QGraph - Missing 2 values for function:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp' null 'INPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.bam'  'TMP_DIR=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp'  'VALIDATION_STRINGENCY=SILENT'  'OUTPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.preMarkDupsHsMetrics.metrics'  'BAIT_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'TARGET_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'REFERENCE_SEQUENCE=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/bwaindex0.6/exampleFASTA.fasta'  'METRIC_ACCUMULATION_LEVEL=SAMPLE'  
ERROR 06:56:25,048 QGraph -   @Argument: jarFile - jar 
ERROR 06:56:25,049 QGraph -   @Argument: javaMainClass - Main class to run from javaClasspath 

And yeah, is seems that the jar file is currently set to null in the command line. However, MarkDuplicates runs fine without setting the jar:

case class dedup(inBam: File, outBam: File, metricsFile: File) extends MarkDuplicates with ExternalCommonArgs with SingleCoreJob with OneDayJob {
    @Input(doc = "Input bam file") var inbam = inBam
    @Output(doc = "Output BAM file with dups removed") var outbam = outBam
    this.REMOVE_DUPLICATES = true
    this.input :+= inBam
    this.output = outBam
    this.metrics = metricsFile
    this.memoryLimit = 3
    this.isIntermediate = false

Why does CalculateHSMetrics need the jar, but not MarkDuplicates? Both are imported with import org.broadinstitute.sting.queue.extensions.picard._.

Created 2013-10-29 18:43:03 | Updated | Tags: qscript argument taggedfile

Comments (2)

Is the a way to access argument tags in the arguments to a Qscript?

I have a script that takes a number of bam files as input and I would like to be able to tag them. i.e.

--input:whole-genome some_long_name.bam --input:exome a_different_bam.bam

In a walker I do this and then look up the tags by calling getToolkit().getTags(argumentValue), but this isn't available to a qscript. Is there a good way to do this?

Created 2013-07-23 19:20:19 | Updated | Tags: queue qscript scala

Comments (8)

I'm working on a set of related Queue scripts. I would like to have functionality shared between them, ideally in separate scala files which would be imported. Is there a way to specify additional paths for the Queue scala compiler to search or do I have to bake my library into the gatk when I build it?

Created 2012-11-22 15:48:43 | Updated 2012-11-22 15:49:11 | Tags: unifiedgenotyper queue qscript nullpointerexception

Comments (2)

I'm building a variant calling qscript (it's available here), heavily based on the the MethodsDevelopmentCallingPipeline.scala. I cannot however run into trouble when setting the "this.scatterCount" of the GenotyperBase to more than 1 - in which case I get a NullPointerException (I include the full error message below).

I use the following command line:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/NewVariantCalling.scala -i NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam -R /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta -res /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/ **-sg 2** -nt 8 -run -l DEBUG -startFromScratch

As you can see, I'm using the files from the gatk bundle, and I guess these should be alright for this purpose? Just to be clear I use the "-res" parameter to point to the directory where all the resource files are located, dbsnp, hapmap, etc. and the -sg parameter is what controls the scatter/gather count.

I've tried to search in the code for what might be causing this, and I can conclude that the org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc is called with str (its parameter) being an empty string, which is what causes contig to be null, which in turn creates the NullPointerException on line 408 when this line is executed: stop = getContigInfo(contig).getSequenceLength();

This, I guess, is the obvious stuff, but this far I haven't been able to figure this out any further that this. I'm not sure if this is caused by a bug in my script, or by a bug in the GATK. Right now I'm thinking the latter of the two, since I have used the scatter/gather function in other scripts without any trouble.

Any ideas of where to continue from here, or confirmation that this is indeed something related to the GATK code would be much appreciated.

Cheers, Johan

ERROR 16:22:50,781 FunctionEdge - Error: LocusScatterFunction: List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf.idx, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam) > List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_1_of_2/scatter.intervals, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_2_of_2/scatter.intervals) 
        at org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:408)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:84)
        at org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:97)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.loadIntervals(IntervalUtils.java:538)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindingsPair(IntervalUtils.java:520)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindings(IntervalUtils.java:499)
        at org.broadinstitute.sting.queue.extensions.gatk.GATKIntervals.locs(GATKIntervals.scala:60)
        at org.broadinstitute.sting.queue.extensions.gatk.LocusScatterFunction.run(LocusScatterFunction.scala:39)
        at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:28)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:83)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:432)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:154)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:145)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)