# Tagged with #queue 13 documentation articles | 3 announcements | 80 forum discussions

Created 2015-07-06 18:30:35 | Updated 2015-07-07 15:32:56 | Tags: queue gatk build

TL;DR: mvn -Ddisable.shadepackage verify

### Background

In addition to Queue's GATK-wrapper code generation and relatively slow Scala compilation, the Maven scripts still carry a lot of legacy compatibility from our Ant days. Our mvn verify behaves much like running ant used to: it builds everything needed to bundle the GATK.

As of GATK 3.4, by default the build for the "protected" code generates jar files that contain every class needed for running, one for the GATK and one for Queue. This is done by the Maven shade plugin, and each of these is called the "package jar". However, there is also a way to generate a jar file that only contains META-INF/MANIFEST.MF pointers to the dependency jar files, instead of zipping/shading them up. These are each called the "executable jar", and FYI they are always generated, as doing so takes seconds, not minutes.

### Instructions for fast compilation

While developing and recompiling Queue, disable the shaded jar with -Ddisable.shadepackage, then run java -jar target/executable/Queue.jar ... If you need to transfer this jar to another machine or directory, you can't copy (or rsync) just the jar; you'll need the entire executable directory.

# Total expected time, on a local disk, with Queue:
#   ~5.0 min from clean
#   ~1.5 min per recompile

# always available
java -jar target/executable/Queue.jar --help

# only generated when the shade step runs (i.e. without -Ddisable.shadepackage)
java -jar target/package/Queue.jar --help

If one is only developing for the GATK, skip Queue by also adding -P\!queue.

mvn -Ddisable.shadepackage -P\!queue verify

# always available
java -jar target/executable/GenomeAnalysisTK.jar --help

# not built when Queue is skipped with -P\!queue
java -jar target/executable/Queue.jar --help

Created 2012-12-18 21:35:34 | Updated 2013-01-26 05:10:36 | Tags: intro queue parallelism performance scatter-gather multithreading nct nt

This document explains the concepts involved and how they are applied within the GATK (and Queue where applicable). For specific configuration recommendations, see the companion document on parallelizing GATK tools.

### 1. Introducing the concept of parallelism

Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (i.e. waiting for each operation to finish before starting the next one).

Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.

This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.

The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see doc for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.

Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.

Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.

#### Parallel computing in practice (sort of)

OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?

Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.

Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:

• open the file, count the number of lines in the file, tell us the number, close the file

Note that tell us the number can mean writing it to the console, or storing it somewhere for use later on.

Now let's say we want to know the number of words on each line. The set of instructions would be:

• open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the number

And so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.

So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:

• open the file, index the lines

• read the first line, count the number of words, tell us the number
• read the second line, count the number of words, tell us the number
• read the third line, count the number of words, tell us the number
• [repeat for all lines]

• collect final results and close the file

Here, the read the Nth line steps can be performed in parallel, because they are all independent operations.

You'll notice that we added a step, index the lines. That's a little bit of preliminary work that allows us to perform the read the Nth line steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.

Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.
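The scatter/process/gather pattern described above can be sketched in plain Scala (the language QScripts are written in). This is a conceptual illustration only; the object and method names are invented, and Futures stand in for parallel threads:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelWordCount {
  // one small, independent task: count the words on a single line
  def countWords(line: String): Int = line.split("\\s+").count(_.nonEmpty)

  // scatter: start one Future per line; gather: wait for all results, in order
  def countAll(lines: Seq[String]): Seq[Int] =
    Await.result(Future.sequence(lines.map(l => Future(countWords(l)))), 1.minute)

  def main(args: Array[String]): Unit =
    println(countAll(Seq("open the file", "count the words", "")))
}
```

Note that the "read the Nth line" tasks run in whatever order the scheduler chooses, but Future.sequence gathers the results back in the original line order.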

### 2. Parallelizing the GATK

There are three different modes of parallelism offered by the GATK, and to really understand the difference, you first need to understand the different levels of computing that are involved.

#### A quick word about levels of computing

By levels of computing, we mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster.

• Core: the level below the machine. On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we follow the Greek terminology) but the CPUs on some professional-grade machines can have tens or hundreds of cores.

• Machine: the middle of the scale. For most of us, the machine is the laptop or desktop computer. Really we should refer to the CPU specifically, since that's the relevant part that does the processing, but the most common usage is to say machine. Except if the machine is part of a cluster, in which case it's called a node.

• Cluster: the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called nodes) networked together. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a server farm or a load-sharing facility.

Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of multi-threading at the core and machine levels, and scatter-gather at the cluster level.

In computing, a thread of execution is a set of instructions that the program issues to the processor to get work done. In single-threading mode, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In multi-threading mode, the program may send several threads to the processor at the same time.

Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results.

If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. He can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that.

Hey, if you have a better example, let us know in the forum and we'll use that instead.

Alright, now that you understand the idea of multithreading, let's get practical: how do we get the GATK to use multi-threading?

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively. They can be combined, since they act at different levels of computing:

• -nt / --num_threads controls the number of data threads sent to the processor (acting at the machine level)

• -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread (acting at the core level).

Not all GATK tools can use these options due to the nature of the analyses that they perform and how they traverse the data. Even in the case of tools that are used sequentially to perform a multi-step process, the individual tools may not support the same options. For example, at time of writing (Dec. 2012), of the tools involved in local realignment around indels, RealignerTargetCreator supports -nt but not -nct, while IndelRealigner does not support either of these options.

In addition, there are some important technical details that affect how these options can be used with optimal results. Those are explained along with specific recommendations for the main GATK tools in a companion document on parallelizing the GATK.
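For illustration only, a hypothetical invocation of UnifiedGenotyper (which supports both options) might combine the two flags as follows; the file names and memory setting are placeholders, not recommendations:

```shell
# Hypothetical example: 4 data threads, each with 3 CPU threads
# (12 effective working threads). File names are placeholders.
java -Xmx8g -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R reference.fasta \
    -I sample.bam \
    -o calls.vcf \
    -nt 4 \
    -nct 3
```

Remember from the memory discussion below that each data thread needs a full share of memory, so -Xmx must scale with -nt.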

#### Scatter-gather

If you Google it, you'll find that the term scatter-gather can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.

So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?

As you know by now, multithreading specifically refers to what happens internally when the program (in our case, the GATK) sends several sets of instructions to the processor to achieve the instructions that you originally gave it in a single command-line. In contrast, the scatter-gather strategy as used by the GATK involves a separate program, called Queue, which generates separate GATK jobs (each with its own command-line) to achieve the instructions given in a so-called Qscript (i.e. a script written for Queue in a programming language called Scala).

At the simplest level, the Qscript can involve a single GATK tool*. In that case Queue will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, Queue will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).
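To make the scatter step concrete, here is a minimal single-tool Qscript sketch. It is illustrative only: the class name and file paths are invented, and the property names (reference_sequence, input_file, out, scatterCount) are assumed to match the GATK wrapper classes Queue generates for this era; check the generated extensions for your version.

```scala
import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk.UnifiedGenotyper

// Hypothetical single-tool Qscript; all names and paths are placeholders.
class SingleToolExample extends QScript {
  def script() {
    val ug = new UnifiedGenotyper
    ug.reference_sequence = new File("reference.fasta")
    ug.input_file :+= new File("sample.bam")
    ug.out = new File("calls.vcf")
    // scatter into 4 separate GATK jobs; Queue gathers the outputs
    // into calls.vcf as if the tool had been run as a single command
    ug.scatterCount = 4
    add(ug)
  }
}
```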

Note that Queue has additional capabilities, such as managing the use of multiple GATK tools in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. To learn more about pipelining the GATK with Queue, please see the Queue documentation.

#### Compare and combine

So you see, scatter-gather is a very different process from multi-threading because the parallelization happens outside of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But Queue can dispatch scattered GATK jobs to different machines in a computing cluster by interfacing with your cluster's job management software.

That being said, multithreading has the great advantage that all the cores within a machine have access to that machine's shared memory, with very high bandwidth. In contrast, the multiple machines on a network used for scatter-gather are fundamentally limited by network communication costs.

The good news is that you can combine scatter-gather and multithreading: use Queue to scatter GATK jobs to different nodes on your cluster, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.

Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.

Created 2012-12-14 21:59:43 | Updated 2016-03-31 18:41:31 | Tags: queue parallelism performance scatter-gather multithreading nct nt

This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.

### Overview

As explained in the primer on parallelism for the GATK, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using Queue).

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:

• -nt / --num_threads controls the number of data threads sent to the processor
• -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread

Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory to run, if you use -nt 4, the multithreaded run will use 8 Gb of memory. In contrast, CPU threads will share the memory allocated to their “mother” data thread, so you don’t need to worry about allocating memory based on the number of CPU threads you use.
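The memory rule above is simple multiplication; this plain-Scala sketch (a hypothetical helper, not part of Queue or the GATK) makes it explicit:

```scala
object MemoryBudget {
  // Each data thread (-nt) needs the tool's full per-run memory.
  // CPU threads (-nct) share their parent data thread's allocation,
  // so they do not add to the total.
  def totalGb(perRunGb: Int, nt: Int): Int = perRunGb * nt

  def main(args: Array[String]): Unit =
    println(totalGb(2, 4)) // a 2 Gb tool run with -nt 4 needs 8 Gb
}
```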

#### Additional consideration when using -nct with versions 2.2 and 2.3

Because of the way the -nct option was originally implemented, in versions 2.2 and 2.3, there is one CPU thread that is reserved by the system to “manage” the rest. So if you use -nct, you’ll only really start seeing a speedup with -nct 3 (which yields two effective "working" threads) and above. This limitation has been resolved in the implementation that will be available in versions 2.4 and up.

### Scatter-gather

For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

### Applicability of parallelism to the major GATK tools

Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.

| Tool | Full name | Type of traversal | NT | NCT | SG |
| --- | --- | --- | --- | --- | --- |
| RTC | RealignerTargetCreator | RodWalker | + | - | - |
| IR | IndelRealigner | ReadWalker | - | - | + |
| BR | BaseRecalibrator | LocusWalker | - | + | + |
| UG | UnifiedGenotyper | LocusWalker | + | + | + |

### Recommended configurations

The table below summarizes configurations that we typically use for our own projects (one per tool, except we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and that we cannot give you any guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you.

| Tool | RTC | IR | BR | PR | RR | UG |
| --- | --- | --- | --- | --- | --- | --- |
| Available modes | NT | SG | NCT,SG | NCT | SG | NT,NCT,SG |
| Cluster nodes | 1 | 4 | 4 | 1 | 4 | 4 / 4 / 4 |
| CPU threads (-nct) | 1 | 1 | 8 | 4-8 | 1 | 3 / 6 / 24 |
| Data threads (-nt) | 24 | 1 | 1 | 1 | 1 | 8 / 4 / 1 |
| Memory (Gb) | 48 | 4 | 4 | 4 | 4 | 32 / 16 / 4 |

Where NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue. For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

Created 2012-08-11 02:28:42 | Updated 2015-02-10 23:51:33 | Tags: faq queue developer qscript

### 1. Many of my GATK functions are set up with the same Reference, Intervals, etc. Is there a quick way to reuse these values for the different analyses in my pipeline?

Yes.

• Create a trait that extends from CommandLineGATK.
• In the trait, copy common values from your qscript.
• Mix the trait into instances of your classes.

For more information, see the ExampleUnifiedGenotyper.scala or examples of using Scala's traits/mixins illustrated in the QScripts documentation.

### 2. How do I accept a list of arguments to my QScript?

In your QScript, define a var list and annotate it with @Argument. Initialize the value to Nil.

@Argument(doc="filter names", shortName="filter")
var filterNames: List[String] = Nil

On the command line specify the arguments by repeating the argument name.

-filter filter1 -filter filter2 -filter filter3

Then once your QScript is run, the command line arguments will be available for use in the QScript's script method.

def script {
  var myCommand = new MyFunction
  myCommand.filters = this.filterNames
}

For a full example of command line arguments see the QScripts documentation.

### 3. What is the best way to run a utility method at the right time?

Wrap the utility with an InProcessFunction. If your functionality is reusable code you should add it to Sting Utils with Unit Tests and then invoke your new function from your InProcessFunction. Computationally or memory intensive functions should NOT be implemented as InProcessFunctions, and should be wrapped in Queue CommandLineFunctions instead.

class MySplitter extends InProcessFunction {
  @Input(doc="inputs")
  var in: File = _

  @Output(doc="outputs")
  var out: List[File] = Nil

  def run {
    StingUtilityMethod.quickSplitFile(in, out)
  }
}

var splitter = new MySplitter
splitter.in = new File("input.txt")
splitter.out = List(new File("out1.txt"), new File("out2.txt"))
add(splitter)

See Queue CommandLineFunctions for more information on how @Input and @Output are used.

### 4. What is the best way to write a list of files?

Create an instance of a ListWriterFunction and add it in your script method.

import org.broadinstitute.sting.queue.function.ListWriterFunction

val writeBamList = new ListWriterFunction
writeBamList.inputFiles = bamFiles
writeBamList.listFile = new File("myBams.list")
add(writeBamList)

### 5. How do I add optional debug output to my QScript?

Queue contains a trait mixin you can use to add Log4J support to your classes.

Add the import for the trait Logging to your QScript.

import org.broadinstitute.sting.queue.util.Logging

Mixin the trait to your class.

class MyScript extends Logging {
...

Then use the mixed in logger to write debug output when the user specifies -l DEBUG.

logger.debug("This will only be displayed when debugging is enabled.")

### 6. I updated Queue and now I'm getting java.lang.NoClassDefFoundError / java.lang.AbstractMethodError

Try ant clean.

Queue relies on a lot of Scala traits / mixins. These dependencies are not always picked up by the scala/java compilers leading to partially implemented classes. If that doesn't work please let us know in the forum.

### 7. Do I need to create directories in my QScript?

No. QScript will create all parent directories for outputs.

### 8. How do I specify the -W 240 for the LSF hour queue at the Broad?

Queue's LSF dispatcher automatically looks up and sets the maximum runtime for whichever LSF queue is specified. If you set your -jobQueue/.jobQueue to hour then you should see something like this under bjobs -l:

RUNLIMIT
240.0 min of gsa3

### 9. Can I run Queue with GridEngine?

Queue GridEngine functionality is community supported. See here for full details: Queue with Grid Engine.

### 10. How do I pass advanced java arguments to my GATK commands, such as remote debugging?

The easiest way to do this at the moment is to mixin a trait.

trait RemoteDebugging extends JavaCommandLineFunction {
  override def javaOpts = super.javaOpts + " -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
}

Then mix in the trait to your walker and otherwise run it as normal:

val printReadsDebug = new PrintReads with RemoteDebugging
// continue setting up your walker...
add(printReadsDebug)

### 11. Why does Queue log "Running jobs. ... Done." but doesn't actually run anything?

If you see something like the following, it means that Queue believes that it previously successfully generated all of the outputs.

INFO 16:25:55,049 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO 16:25:55,140 QCommandLine - Added 4 functions
INFO 16:25:55,140 QGraph - Generating graph.
INFO 16:25:55,164 QGraph - Generating scatter gather jobs.
INFO 16:25:55,714 QGraph - Removing original jobs.
INFO 16:25:55,716 QGraph - Adding scatter gather jobs.
INFO 16:25:55,779 QGraph - Regenerating graph.
INFO 16:25:55,790 QGraph - Running jobs.
INFO 16:25:55,853 QGraph - 0 Pend, 0 Run, 0 Fail, 10 Done
INFO 16:25:55,902 QCommandLine - Done 

Queue will not re-run the job if a .done file is found for all the outputs, e.g.: /path/to/.output.file.done. You can either remove the specific .done files yourself, or use the -startFromScratch command line option.
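As a sketch of the .done naming convention (the output path below is illustrative): for an output file, Queue's marker is a hidden file in the same directory, with a leading dot and a .done suffix.

```shell
# Derive the .done marker path for a given output (illustrative helper).
out="results/output.vcf"
done_file="$(dirname "$out")/.$(basename "$out").done"
echo "$done_file"
# Deleting that marker forces Queue to re-run just the job producing it:
#   rm "$done_file"
```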

Created 2012-08-11 02:02:20 | Updated 2014-02-03 22:32:01 | Tags: queue developer gridengine jobrunner

### 1. Background

Thanks to contributions from the community, Queue contains a job runner compatible with Grid Engine 6.2u5.

A number of forked distributions of Sun's Grid Engine 6.2u5 were known as of July 2011. As long as they are JDRMAA 1.0 source compatible with Grid Engine 6.2u5, the compiled Queue code should run against each of these distributions. However, we have yet to receive confirmation that Queue works on any of these setups.

Our internal QScript integration tests run the same tests on both LSF 7.0.6 and a Grid Engine 6.2u5 cluster setup on older software released by Sun.

If you run into trouble, please let us know. If you would like to contribute additions or bug fixes please create a fork in our github repo where we can review and pull in the patch.

### 2. Running Queue with GridEngine

Try out the Hello World example with -jobRunner GridEngine.

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run

If all goes well, Queue should dispatch the job to Grid Engine and wait until the status returns RunningStatus.DONE, and "hello world" should be echoed into the output file, possibly along with other Grid Engine log messages.

See QFunction and Command Line Options for more info on Queue options.

### 3. Debugging issues with Queue and GridEngine

If you run into an error with Queue submitting jobs to GridEngine, first try submitting the HelloWorld example with -memLimit 2:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run -memLimit 2

Then try the following GridEngine qsub commands. They are based on what Queue submits via the API when running the HelloWorld.scala example with and without memory reservations and limits:

qsub -w e -V -b y -N echo_hello_world \
    -o test.out -wd $PWD -j y \
    echo hello world

qsub -w e -V -b y -N echo_hello_world \
    -o test.out -wd $PWD -j y \
    -l mem_free=2048M -l h_rss=2458M \
    echo hello world

One other thing to check is if there is a memory limit on your cluster. For example try submitting jobs with up to 16G.

qsub -w e -V -b y -N echo_hello_world \
    -o test.out -wd $PWD -j y \
    -l mem_free=4096M -l h_rss=4915M \
    echo hello world

qsub -w e -V -b y -N echo_hello_world \
    -o test.out -wd $PWD -j y \
    -l mem_free=8192M -l h_rss=9830M \
    echo hello world

qsub -w e -V -b y -N echo_hello_world \
    -o test.out -wd $PWD -j y \
    -l mem_free=16384M -l h_rss=19960M \
    echo hello world

If the above tests pass and Grid Engine still will not dispatch jobs submitted by Queue, please report the issue to our support forum.

Created 2012-08-11 00:59:38 | Updated 2014-02-03 22:34:22 | Tags: queue developer qscript commandlinefunction

### 1. Basic QScript run rules

• In the script method, a QScript will add one or more CommandLineFunctions.

• Queue tracks dependencies between functions via variables annotated with @Input and @Output.

• Queue will run functions based on the dependencies between them, so if the @Input of CommandLineFunction A depends on the @Output of CommandLineFunction B, A will wait for B to finish before it starts running.

### 2. Command Line

Each CommandLineFunction must define the actual command line to run, as follows:

    class MyCommandLine extends CommandLineFunction {
      def commandLine = "myScript.sh hello world"
    }

#### Constructing a Command Line Manually

If you're writing a one-off CommandLineFunction that is not destined for use by other QScripts, it's often easiest to construct the command line directly rather than through the API methods provided in the CommandLineFunction class. For example:

    def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)

#### Constructing a Command Line using API Methods

If you're writing a CommandLineFunction that will become part of Queue and/or will be used by other QScripts, however, our best practice recommendation is to construct your command line only using the methods provided in the CommandLineFunction class: required(), optional(), conditional(), and repeat(). The reason for this is that these methods automatically escape the values you give them so that they'll be interpreted literally within the shell scripts Queue generates to run your command, and they also manage whitespace separation of command-line tokens for you.
This prevents (for example) a value like MQ > 10 from being interpreted as an output redirection by the shell, and avoids issues with values containing embedded spaces. The methods also give you the ability to turn escaping and/or whitespace separation off as needed. An example:

    override def commandLine = super.commandLine +
      required("eff") +
      conditional(verbose, "-v") +
      optional("-c", config) +
      required("-i", "vcf") +
      required("-o", "vcf") +
      required(genomeVersion) +
      required(inVcf) +
      required(">", escape=false) + // This will be shell-interpreted as an output redirection
      required(outVcf)

The CommandLineFunctions built into Queue, including the CommandLineFunctions automatically generated for GATK Walkers, are all written using this pattern. This means that when you configure a GATK Walker or one of the other built-in CommandLineFunctions in a QScript, you can rely on all of your values being safely escaped and taken literally when the commands are run, including values containing characters that would normally be interpreted by the shell such as MQ > 10.
Below is a brief overview of the API methods available to you in the CommandLineFunction class for safely constructing command lines:

• required() Used for command-line arguments that are always present, e.g.:

    required("-f", "filename") returns: " '-f' 'filename' "
    required("-f", "filename", escape=false) returns: " -f filename "
    required("java") returns: " 'java' "
    required("INPUT=", "myBam.bam", spaceSeparated=false) returns: " 'INPUT=myBam.bam' "

• optional() Used for command-line arguments that may or may not be present, e.g.:

    optional("-f", myVar) behaves like required() if myVar has a value, but returns "" if myVar is null/Nil/None

• conditional() Used for command-line arguments that should only be included if some condition is true, e.g.:

    conditional(verbose, "-v") returns " '-v' " if verbose is true, otherwise returns ""

• repeat() Used for command-line arguments that are repeated multiple times on the command line, e.g.:

    repeat("-f", List("file1", "file2", "file3")) returns: " '-f' 'file1' '-f' 'file2' '-f' 'file3' "

### 3. Arguments

• CommandLineFunction arguments use a syntax similar to GATK arguments.

• CommandLineFunction variables are annotated with @Input, @Output, or @Argument annotations.

#### Input and Output Files

So that Queue can track the input and output files of a command, CommandLineFunction @Input and @Output variables must be java.io.File objects.

    class MyCommandLine extends CommandLineFunction {
      @Input(doc="input file")
      var inputFile: File = _
      def commandLine = "myScript.sh -fileParam " + inputFile
    }

#### FileProvider

CommandLineFunction variables can also provide indirect access to java.io.File inputs and outputs via the FileProvider trait.

    class MyCommandLine extends CommandLineFunction {
      @Input(doc="named input file")
      var inputFile: ExampleFileProvider = _
      def commandLine = "myScript.sh " + inputFile
    }

    // An example FileProvider that stores a 'name' with a 'file'.
    class ExampleFileProvider(var name: String, var file: File) extends org.broadinstitute.sting.queue.function.FileProvider {
      override def toString = " -fileName " + name + " -fileParam " + file
    }

#### Optional Arguments

Optional files can be specified via required=false, and can use the CommandLineFunction.optional() utility method, as described above:

    class MyCommandLine extends CommandLineFunction {
      @Input(doc="input file", required=false)
      var inputFile: File = _
      // -fileParam will only be added if the QScript sets inputFile on this instance of MyCommandLine
      def commandLine = required("myScript.sh") + optional("-fileParam", inputFile)
    }

#### Collections as Arguments

A List or Set of files can use the CommandLineFunction.repeat() utility method, as described above:

    class MyCommandLine extends CommandLineFunction {
      @Input(doc="input file")
      var inputFile: List[File] = Nil // NOTE: Do not set List or Set variables to null!
      // -fileParam will be added as many times as the QScript adds inputFile entries on this instance of MyCommandLine
      def commandLine = required("myScript.sh") + repeat("-fileParam", inputFile)
    }

#### Non-File Arguments

A command line function can define other required arguments via @Argument.

    class MyCommandLine extends CommandLineFunction {
      @Argument(doc="message to display")
      var veryImportantMessage: String = _
      // If the QScript does not specify the required veryImportantMessage, the pipeline will not run.
      def commandLine = required("myScript.sh") + required(veryImportantMessage)
    }

### 4. Example: "samtools index"

    class SamToolsIndex extends CommandLineFunction {
      @Input(doc="bam to index")
      var bamFile: File = _

      @Output(doc="bam index")
      var baiFile: File = _

      def commandLine = "samtools index %s %s".format(bamFile, baiFile)
    }

Or, using the CommandLineFunction API methods to construct the command line with automatic shell escaping:

    class SamToolsIndex extends CommandLineFunction {
      @Input(doc="bam to index")
      var bamFile: File = _

      @Output(doc="bam index")
      var baiFile: File = _

      def commandLine = required("samtools") + required("index") + required(bamFile) + required(baiFile)
    }

Created 2012-08-11 00:07:18 | Updated 2015-09-24 00:25:10 | Tags: queue developer qfunction

These are the most popular Queue command line options. For a complete and up-to-date list, run with --help or -h. QScripts may also add additional command line options.

Please note that this page is out of date. We hope to update it in future but have no resources to do so at present. If you run into trouble using any of the command line arguments listed here, we recommend you check the source code for the Q arguments. Apologies for the inconvenience.

### 1. Queue Command Line Options

| Command Line Argument | Description | Default |
| --- | --- | --- |
| -run | If passed, the scripts are run. If not passed, a dry run is executed. | dry run |
| -jobRunner <jobrunner> | The job runner used to dispatch jobs. Setting to Lsf706, GridEngine, or Drmaa will dispatch jobs to LSF or Grid Engine using the job settings (see below). Defaults to Shell, which runs jobs on a local shell one at a time. | Shell |
| -bsub | Alias for -jobRunner Lsf706 | not set |
| -qsub | Alias for -jobRunner GridEngine | not set |
| -status | Prints out a summary of progress. If a QScript is currently running via -run, you can run the same command line with -status instead to print a summary of progress. | not set |
| -retry <count> | Retries a QFunction that returns a non-zero exit code up to count times. The QFunction must not have set jobRestartable to false. | 0 = no retries |
| -startFromScratch | Restarts the graph from the beginning. If not specified, then for each output file specified on a QFunction, e.g. /path/to/output.file, Queue will not re-run the job if a .done file is found for all the outputs, e.g. /path/to/.output.file.done. | use .done files to determine if jobs are complete |
| -keepIntermediates | By default Queue deletes the output files of QFunctions that set .isIntermediate to true. | delete intermediate files |
| -statusTo <email> | Email address to send status to whenever a) a job fails, or b) Queue has run all the functions it can run and is exiting. | not set |
| -statusFrom <email> | Email address to send status emails from. | user@local.domain |
| -dot <file> | If set, renders the job graph to a dot file. | not rendered |
| -l <logging_level> | The minimum level of logging: DEBUG, INFO, WARN, or FATAL. | INFO |
| -log <file> | Sets the location to save log output in addition to standard out. | not set |
| -debug | Sets the logging to include a lot of debugging information (SLOW!) | not set |
| -jobReport | Path to write the job report text file. If R is installed and available on the $PATH, a pdf visualizing the job report will be generated. | jobPrefix.jobreport.txt |
-disableJobReport Disables writing the job report. not set
-help Lists all of the command line arguments with their descriptions. not set

### 2. QFunction Options

The following options can be specified on the command line or overridden per QFunction.

Command Line Argument QFunction Property Description Default
-jobPrefix .jobName The unique name of the job. Used to prefix directories and log files. Use -jobPrefix on the Queue command line to replace the default prefix Q-<processid>@<host>. <jobNamePrefix>-<jobNumber>
N/A .jobOutputFile Captures stdout and if jobErrorFile is null it captures stderr as well. <jobName>.out
N/A .jobErrorFile If not null captures stderr. null
N/A .commandDirectory The directory to execute the command line from. current directory
-jobProject .jobProject The project name for the job. default job runner project
-jobQueue .jobQueue The queue to dispatch the job. default job runner queue
-jobPriority .jobPriority The dispatch priority for the job. Lowest priority = 0. Highest priority = 100. default job runner priority
-jobNative .jobNativeArgs Native args to pass to the job runner. Currently only supported in GridEngine and Drmaa. The string is concatenated to the native arguments passed over DRMAA. Example: -w n. none
-jobResReq .jobResourceRequests Resource requests to pass to the job runner. On GridEngine this is multiple -l <req>. On LSF a single -R <req> is generated. memory reservations and limits on LSF and GridEngine
-jobEnv .jobEnvironmentNames Predefined environment names to pass to the job runner. On GridEngine this is -pe <env>. On LSF this is -a <env>. none
-memLimit .memoryLimit The memory limit for the job in gigabytes. Used to populate the variables residentLimit and residentRequest which can also be set separately. default job runner memory limit
-resMemLimit .residentLimit Limit for the resident memory in gigabytes. On GridEngine this is -l mem_free=<mem>. On LSF this is -R rusage[mem=<mem>]. memoryLimit * 1.2
-resMemReq .residentRequest Requested amount of resident memory in gigabytes. On GridEngine this is -l h_rss=<mem>. On LSF this is -R rusage[select=<mem>]. memoryLimit

### 3. Email Status Options

Command Line Argument Description Default
-emailHost <hostname> SMTP host name localhost
-emailPort <port> SMTP port 25
-emailTLS If set uses TLS. not set
-emailSSL If set uses SSL. not set
-emailUser <username> If set along with emailPass or emailPassFile authenticates the email with this username. not set
-emailPassFile <file> If emailUser is also set authenticates the email with contents of the file. not set
-emailPass <password> If emailUser is also set authenticates the email with this password. NOT SECURE: Use emailPassFile instead! not set

Created 2012-08-10 23:47:33 | Updated 2014-02-03 22:35:00 | Tags: queue developer pipeline

### 1. Introduction

As mentioned in the introductory materials, the core concept behind the GATK tools is the walker. The Queue scripting framework contains several mechanisms which make it easy to chain together GATK walkers.

### 2. Authoring walkers

As part of authoring your walker there are several Queue behaviors that you can specify for QScript authors using your particular walker.

#### Specifying how to partition

Queue can significantly speed up generating walker outputs by passing different instances of the GATK the same BAM or VCF data while specifying different regions of the data to analyze. After the different instances output their individual results, Queue gathers the results back to the original output path requested by the QScript.

Queue limits how finely it will split genomic data by examining the @PartitionBy() annotation on your walker, which specifies a PartitionType. This table lists the different partition types along with the default partition level for each of the different walker types.

PartitionType Default for Walker Type Description Example Intervals Example Splits
PartitionType.CONTIG Read walkers Data is grouped together so that all genomic data from the same contig is never presented to two different instances of the GATK. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 split 1: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60; split 2:chr3:10-11
PartitionType.INTERVAL (none) Data is split down to the interval level but never divides up an explicitly specified interval. If no explicit intervals are specified in the QScript for the GATK then this is effectively the same as splitting by contig. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 split 1: chr1:10-11, chr2:10-20, chr2:30-40; split 2: chr2:50-60, chr3:10-11
PartitionType.LOCUS Locus walkers, ROD walkers Data is split down to the locus level possibly dividing up intervals. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 split 1: chr1:10-11, chr2:10-20, chr2:30-35; split 2: chr2:36-40, chr2:50-60, chr3:10-11
PartitionType.NONE Read pair walkers, Duplicate walkers The data cannot be split and Queue must run the single instance of the GATK as specified in the QScript. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 no split: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11
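The CONTIG grouping rule in the table above can be illustrated with a small self-contained sketch. Note that Interval and ContigPartition here are hypothetical stand-ins for illustration, not Queue classes; this shows only the per-contig grouping step, after which Queue batches whole groups into the requested number of splits.

```scala
// Toy sketch of PartitionType.CONTIG: intervals are grouped per contig, so no
// contig's data is ever presented to two different instances of the GATK.
case class Interval(contig: String, start: Int, stop: Int)

object ContigPartition {
  // Group intervals by contig, preserving the order of first appearance.
  def split(intervals: List[Interval]): List[List[Interval]] = {
    val contigOrder = intervals.map(_.contig).distinct
    contigOrder.map(c => intervals.filter(_.contig == c))
  }
}
```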

If your walker is implemented in a way that Queue should not divide up your data, you should explicitly set @PartitionBy(PartitionType.NONE). If your walker can theoretically be run per genome location, specify @PartitionBy(PartitionType.LOCUS).

@PartitionBy(PartitionType.LOCUS)
public class ExampleWalker extends LocusWalker<Integer, Integer> {
...

#### Specifying how to join outputs

Queue will join the standard walker outputs.

Output type Default gatherer implementation
SAMFileWriter The BAM files are joined together using Picard's MergeSamFiles.
VCFWriter The VCF files are joined together using the GATK CombineVariants.
PrintStream The first two files are scanned for a common header. The header is written once into the output, and then each file is appended to the output, skipping past the header lines.

If your PrintStream is not a simple text file that can be concatenated together, you must implement a Gatherer. Extend your custom Gatherer from the abstract base class and implement the gather() method.

package org.broadinstitute.sting.commandline;

import java.io.File;
import java.util.List;

/**
* Combines a list of files into a single output.
*/
public abstract class Gatherer {
/**
* Gathers a list of files into a single output.
* @param inputs Files to combine.
* @param output Path to output file.
*/
public abstract void gather(List<File> inputs, File output);

/**
* Returns true if the caller should wait for the input files to propagate over NFS before running gather().
*/
public boolean waitForInputs() { return true; }
}

Specify your gatherer by placing the @Gather() annotation next to your @Output.

@Output
@Gather(MyGatherer.class)
public PrintStream out;

Queue will run your custom gatherer to join the intermediate outputs together.
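As a concrete illustration, here is a self-contained sketch of a gatherer that keeps a single-line header from the first input only. The abstract class is inlined (with a Scala List for simplicity) so this compiles on its own; a real gatherer extends org.broadinstitute.sting.commandline.Gatherer shown above, whose gather() takes a java.util.List<File>.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Inlined stand-in for the abstract base class shown above.
abstract class Gatherer {
  def gather(inputs: List[File], output: File): Unit
  def waitForInputs: Boolean = true
}

// Concatenates text outputs, writing the one-line header only once.
class HeaderSkippingGatherer extends Gatherer {
  def gather(inputs: List[File], output: File): Unit = {
    val out = new PrintWriter(output)
    try {
      inputs.zipWithIndex.foreach { case (f, i) =>
        val src = Source.fromFile(f)
        try {
          val lines = src.getLines().toList
          // keep the header only from the first input file
          val keep = if (i == 0) lines else lines.drop(1)
          keep.foreach(out.println)
        } finally src.close()
      }
    } finally out.close()
  }
}
```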

### 3. Using GATK walkers in Queue

#### Queue GATK Extensions

Running 'ant queue' builds a set of Queue extensions for the GATK engine. For every GATK walker and command line program in the compiled GenomeAnalysisTK.jar, a Queue-compatible wrapper is generated.

The extensions can be imported via import org.broadinstitute.sting.queue.extensions.gatk._

import org.broadinstitute.sting.queue.QScript

class MyQscript extends QScript {
...

Note that the generated GATK extensions automatically handle shell-escaping of all values assigned to the various Walker parameters, so you can rest assured that all of your values will be taken literally by the shell. Do not attempt to escape values yourself.

Do this:

filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0")

NOT this:

filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"")

#### Listing variables

In addition to the GATK documentation on this wiki you can also find the full list of arguments for each walker extension in a variety of ways.

The source code for the extensions is generated during ant queue and placed in this directory:

build/queue-extensions/src

When properly configured an IDE can provide command completion of the walker extensions. See Queue with IntelliJ IDEA for our recommended settings.

If you do not have access to an IDE you can still find the names of the generated variables from the command line. The generated variable names on each extension are based on the fullName of the Walker argument. To see the built-in documentation for each Walker, run the GATK with:

java -jar GenomeAnalysisTK.jar -T <walker name> -help

Once the import statement is specified you can add() instances of gatk extensions in your QScript's script() method.

#### Setting variables

If the GATK walker input allows more than one value, you should specify the values as a List().

  def script() {
    val snps = new UnifiedGenotyper
    snps.reference_file = new File("testdata/exampleFASTA.fasta")
    snps.input_file = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
  }

Although it may make your QScript harder for others to read, the extensions also contain short-name aliases for each of the long-name arguments.

  def script() {
    val snps = new UnifiedGenotyper
    snps.R = new File("testdata/exampleFASTA.fasta")
    snps.I = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
  }

Here are a few more examples using various list assignment operators.

  def script() {
    val countCovariates = new CountCovariates

    // Append to list using item appender :+
    countCovariates.rodBind :+= RodBind("dbsnp", "VCF", dbSNP)

    // Append to list using collection appender ++
    countCovariates.covariate ++= List("ReadGroupCovariate", "QualityScoreCovariate", "CycleCovariate", "DinucCovariate")

    // Assign list using plain old object assignment
    countCovariates.input_file = List(inBam)

    // The following is not a list, so just assigning one file to another
    countCovariates.recal_file = outRecalFile
  }
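The appenders used above are not Queue-specific: :+= and ++= are plain Scala reassignment sugar on a var holding an immutable List. A minimal standalone illustration (ListOps is just a wrapper name for this sketch):

```scala
object ListOps {
  def demo(): List[String] = {
    var covariates: List[String] = Nil  // start empty (never null!)
    covariates :+= "ReadGroupCovariate" // append a single item
    covariates ++= List("QualityScoreCovariate", "CycleCovariate") // append a collection
    covariates
  }
}
```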

#### Specifying an alternate GATK jar

By default Queue runs the GATK from the current classpath. This works best since the extensions are generated and compiled at the same time the GATK is compiled via ant queue.

If you need to swap in a different version of the GATK you may not be able to use the generated extensions. The alternate GATK jar must have the same command line arguments as the GATK compiled with Queue. Otherwise the arguments will not match and you will get an error when Queue attempts to run the alternate GATK jar. In this case you will have to create your own custom CommandLineFunction for your analysis.

  def script {
    val snps = new UnifiedGenotyper
    snps.jarFile = new File("myPatchedGATK.jar")
    snps.reference_file = new File("testdata/exampleFASTA.fasta")
    snps.input_file = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
  }
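When the alternate jar's arguments do not match the generated extensions, a hand-written wrapper might look roughly like the sketch below. The base class is stubbed so the sketch compiles on its own, and PatchedGenotyper, its variable names, and the jar name are all hypothetical; in a real QScript you would extend Queue's CommandLineFunction and annotate the files with @Input/@Output so Queue can track dependencies.

```scala
import java.io.File

// Minimal stand-in so this sketch is self-contained; a real QScript extends
// org.broadinstitute.sting.queue.function.CommandLineFunction instead.
abstract class CommandLineFunction { def commandLine: String }

// Hypothetical wrapper for a patched GATK jar whose arguments no longer match
// the compiled-in extensions.
class PatchedGenotyper extends CommandLineFunction {
  var jarFile: File = new File("myPatchedGATK.jar") // assumed jar name
  var reference: File = _
  var bam: File = _
  var out: File = _
  def commandLine =
    "java -jar %s -T UnifiedGenotyper -R %s -I %s -o %s"
      .format(jarFile, reference, bam, out)
}
```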

#### GATK scatter/gather

Queue currently allows QScript authors to explicitly invoke scatter/gather on GATK walkers by setting the scatter count on a function.

  def script {
    val snps = new UnifiedGenotyper
    snps.reference_file = new File("testdata/exampleFASTA.fasta")
    snps.input_file = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
    snps.scatterCount = 20
  }

This will run the UnifiedGenotyper up to 20 ways in parallel and then merge the partial VCFs back into the single snps.vcf.

Some walkers are still being updated to support Queue fully. For example, they may not have defined their @Input and @Output variables, so Queue is unable to correctly track their dependencies, or a custom Gatherer may not be implemented yet.

Created 2012-08-10 20:27:38 | Updated 2014-02-03 22:33:49 | Tags: queue developer intellij

We have found that Queue works best with IntelliJ IDEA Community Edition (free) or Ultimate Edition with the Scala Plugin enabled. Once you have downloaded IntelliJ IDEA, follow the instructions below to set up a Sting project with Queue and the Scala Plugin.

[[File:sting_project_libraries.png|300px|thumb|right|Project Libraries]] [[File:sting_module_sources.png|300px|thumb|right|Module Sources]] [[File:sting_module_dependencies.png|300px|thumb|right|Module Dependencies]] [[File:sting_module_scala_facet.png|300px|thumb|right|Scala Facet]]

### 1. Build Queue on the Command Line

Build Queue from source from the command line with ant queue, so that:

• The lib folder is initialized including the scala jars
• The queue-extensions for the GATK are generated to the build folder

### 2. Add the scala plugin

• In IntelliJ, open the menu File > Settings
• Under the IDE Settings in the left navigation list select Plugins
• Click on the Available tab under plugins
• Scroll down in the list of available plugins and install the scala plugin
• If asked to retrieve dependencies, click No. The correct scala libraries and compiler are already available in the lib folder from when you built Queue from the command line
• Restart IntelliJ to load the scala plugin

### 3. Creating a new Sting Project including Queue

• Select the menu File... > New Project...

• On the first page of "New Project" select Create project from scratch. Click Next >

• On the second page of "New Project":
  • Set the project Name: to Sting
  • Set the Project files location: to the directory where you checked out the Sting git repository, for example /Users/jamie/src/Sting
  • Uncheck Create Module
  • Click Finish

• The "Project Structure" window should open. If not open it via the menu File > Project Structure

• Under the Project Settings in the left panel of "Project Structure" select Project
  • Make sure that Project SDK is set to a build of 1.6
  • If the Project SDK only lists <No SDK>, add a New > JSDK pointing to /System/Library/Frameworks/JavaVM.framework/Versions/1.6

• Under the Project Settings in the left panel of "Project Structure" select Libraries
  • Click the plus (+) to create a new Project Library
  • Set the Name: to Sting/lib
  • Select Attach Jar Directories
  • Select the path to the lib folder under your git checkout

• Under the Project Settings in the left panel of "Project Structure" select Modules

• Click on the + box to add a new module

• On the first page of "Add Module" select Create module from scratch. Click Next >

• On the second page of "Add Module":
  • Set the module Name: to Sting
  • Change the Content root to: <directory where you checked out the Sting git repository>
  • Click Next >

• On the third page, uncheck all of the other source directories, leaving only the java/src directory checked. Click Next >

• On the fourth page click Finish

• Back in the Project Structure window, under the Module 'Sting', on the Sources tab make sure the following folders are selected

• Source Folders (in blue): public/java/src public/scala/src private/java/src (Broad only) private/scala/src (Broad only) build/queue-extensions/src
• Test Source Folders (in green): public/java/test public/scala/test private/java/test (Broad only) private/scala/test (Broad only)
• In the Project Structure window, under the Module 'Sting', on the Module Dependencies tab:
  • Click on the button Add...
  • Select the popup menu Library...
  • Select the Sting/lib library
  • Click Add selected

• Refresh the Project Structure window so that it becomes aware of the Scala library in Sting/lib:
  • Click the OK button
  • Reopen Project Structure via the menu File > Project Structure

• In the second panel, click on the Sting module
  • Click on the plus (+) button above the second panel
  • In the popup menu under Facet select Scala
  • On the right under Facet 'Scala' set the Compiler library: to Sting/lib
  • Click OK

### 4. Enable annotation processing

• Open the menu File > Settings
• Under Project Settings [Sting] in the left navigation list select Compiler then Annotation Processors
• Click to enable the checkbox Enable annotation processing
• Leave the radio button obtain processors from the classpath selected
• Click OK

### 5. Debugging Queue

[[File:queue_debug.png|300px|thumb|right|Queue Remote Debug]]

• In IntelliJ 10 open the menu Run > Edit Configurations.

• Click the gold [+] button at the upper left to open the Add New Configuration popup menu.

• Select Remote from the popup menu.

• With the new configuration selected on the left, change the configuration name from 'Unnamed' to something like 'Queue Remote Debug'.

• Set the Host to the hostname of your server, and the Port to an unused port. You can try the default port of 5005.

• From the Use the following command line arguments for running remote JVM field, copy the argument string.

• On the server, paste and modify your command line to run with the previously copied text, for example java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 -jar Queue.jar -S myscript.scala ....

• If you would like the program to wait for you to attach the debugger before running, change suspend=n to suspend=y.

• Back in IntelliJ, click OK to save your changes.

#### Running with the Remote Configuration

• Ensure Queue Remote Debug is selected via the configuration drop down or Run > Edit Configurations.
• Set your breakpoints as you normally would in IntelliJ.
• Start your program by running the full java path (with the above -Xdebug -Xrunjdwp ...) on the server.
• In IntelliJ go to Run > Debug.

### 6. Binding javadocs and source

From Stack Overflow:

Go to File -> Project Structure -> SDKs -> Apple 1.x -> DocumentationPaths, and then click specify URL.

In IntelliJ, open File -> Project Structure. Click on "SDKs" under "Platform Settings". Add the following path under the Sourcepath tab: /Library/Java/JavaVirtualMachines/1.6.0_29-b11-402.jdk/Contents/Home/src.jar!/src

Created 2012-08-10 20:00:40 | Updated 2014-02-03 22:35:27 | Tags: queue developer qscript

### 1. Introduction

Queue pipelines are Scala 2.8 files with a bit of syntactic sugar, called QScripts.

QScripts are easiest to develop using an Integrated Development Environment. See Queue with IntelliJ IDEA for our recommended settings.

The following is a basic outline of a QScript:

import org.broadinstitute.sting.queue.QScript
// List other imports here

// Define the overall QScript here.
class MyScript extends QScript {
  // List script arguments here.
  @Input(doc="My QScript inputs")
  var scriptInput: File = _

  // Create and add the functions in the script here.
  def script = {
    val myCL = new MyCommandLine
    myCL.myInput = scriptInput // Example variable input
    myCL.myOutput = new File("/path/to/output") // Example hardcoded output
    add(myCL)
  }
}

### 2. Imports

Imports can be any scala or java imports in scala syntax.

import java.io.File
import scala.util.Random
import org.favorite.my._
// etc.

### 3. Classes

• To add a CommandLineFunction to a pipeline, a class must be defined that extends QScript.

• The QScript must define a method script.

• The QScript can define helper methods or variables.

### 4. Script method

The body of script should create and add Queue CommandlineFunctions.

class MyScript extends org.broadinstitute.sting.queue.QScript {
  def script = add(new CommandLineFunction { def commandLine = "echo hello world" })
}

### 5. Command Line Arguments

• A QScript can read command line arguments by defining variables with @Input, @Output, or @Argument annotations.

• A command line argument can be a primitive scalar, enum, File, or scala immutable Array, List, Set, or Option of a primitive, enum, or File.

• QScript command line arguments can be marked as optional by setting required=false.

class MyScript extends org.broadinstitute.sting.queue.QScript {
  @Input(doc="example message to echo")
  var message: String = _
  def script = add(new CommandLineFunction { def commandLine = "echo " + message })
}

### 6. Using and writing CommandLineFunctions

See Pipelining the GATK using Queue for more information on the automatically generated Queue wrappers for GATK walkers.

After functions are defined they should be added to the QScript pipeline using add().

for (vcf <- vcfs) {
  val ve = new VariantEval
  ve.vcfFile = vcf
  ve.evalFile = swapExt(vcf, "vcf", "eval")
  add(ve)
}
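The swapExt helper used above derives one file name from another. A rough self-contained sketch of what it does (the real QScript helper's implementation details may differ slightly; the SwapExt object name is just a wrapper for this sketch):

```scala
import java.io.File

object SwapExt {
  // Build a sibling File with the old suffix swapped for a new one.
  def swapExt(file: File, oldExt: String, newExt: String): File =
    new File(file.getParentFile, file.getName.stripSuffix(oldExt) + newExt)
}
```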

#### Defining new CommandLineFunctions

• Queue tracks dependencies between functions via variables annotated with @Input and @Output.

• Queue will run functions based on the dependencies between them, not based on the order in which they are added in the script! So if the @Input of CommandLineFunction A depends on the @Output of CommandLineFunction B, A will wait for B to finish before it starts running.
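The dependency-driven ordering can be modeled with a toy scheduler. Func and RunOrder below are hypothetical stand-ins (not Queue classes) that capture only the scheduling rule: a function runs once all of its input files are available, regardless of add() order.

```scala
// Each function declares the files it reads and the files it produces,
// mirroring @Input/@Output on a CommandLineFunction.
case class Func(name: String, inputs: Set[String], outputs: Set[String])

object RunOrder {
  def order(fs: List[Func]): List[String] = {
    // Files available up front are those no function in the graph produces.
    var available = fs.flatMap(_.inputs).toSet -- fs.flatMap(_.outputs).toSet
    var pending = fs
    var ordered = List.empty[String]
    while (pending.nonEmpty) {
      // Run every function whose inputs already exist.
      val (ready, rest) = pending.partition(_.inputs.subsetOf(available))
      require(ready.nonEmpty, "dependency cycle or missing input")
      available ++= ready.flatMap(_.outputs)
      ordered ++= ready.map(_.name)
      pending = rest
    }
    ordered
  }
}
```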

### 7. Examples

• To print the list of arguments required by an existing QScript run with -help.
• To check if your script has all of the CommandLineFunction variables set correctly, run without -run.
• When you are ready to execute the full pipeline, add -run.

#### Hello World QScript

The following is a "hello world" example that runs a single command line to echo hello world.

import org.broadinstitute.sting.queue.QScript

class HelloWorld extends QScript {
  def script = {
    add(new CommandLineFunction {
      def commandLine = "echo hello world"
    })
  }
}

The above file is checked into the Sting git repository under HelloWorld.scala. After building Queue from source, the QScript can be run with the following command:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run

It should produce output similar to:

INFO  16:23:27,825 QScriptManager - Compiling 1 QScript
INFO  16:23:31,289 QScriptManager - Compilation complete
INFO  16:23:34,631 HelpFormatter - ---------------------------------------------------------
INFO  16:23:34,631 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO  16:23:34,632 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run
INFO  16:23:34,632 HelpFormatter - Date/Time: 2011/01/14 16:23:34
INFO  16:23:34,632 HelpFormatter - ---------------------------------------------------------
INFO  16:23:34,632 HelpFormatter - ---------------------------------------------------------
INFO  16:23:34,634 QCommandLine - Scripting HelloWorld
INFO  16:23:34,651 QCommandLine - Added 1 functions
INFO  16:23:34,651 QGraph - Generating graph.
INFO  16:23:34,660 QGraph - Running jobs.
INFO  16:23:34,689 ShellJobRunner - Starting: echo hello world
INFO  16:23:34,689 ShellJobRunner - Output written to /Users/kshakir/src/Sting/Q-43031@bmef8-d8e-1.out
INFO  16:23:34,771 ShellJobRunner - Done: echo hello world
INFO  16:23:34,773 QGraph - Deleting intermediate files.
INFO  16:23:34,773 QCommandLine - Done 

#### ExampleUnifiedGenotyper.scala

This example uses automatically generated Queue compatible wrappers for the GATK. See Pipelining the GATK using Queue for more info on authoring Queue support into walkers and using walkers in Queue.

The ExampleUnifiedGenotyper.scala for running the UnifiedGenotyper followed by VariantFiltration can be found in the examples folder.

To list the command line parameters, including the required parameters, run with -help.

java -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -help

The help output should appear similar to this:

INFO  10:26:08,491 QScriptManager - Compiling 1 QScript
INFO  10:26:11,926 QScriptManager - Compilation complete
---------------------------------------------------------
---------------------------------------------------------
---------------------------------------------------------
usage: java -jar Queue.jar -S <script> [-run] [-jobRunner <job_runner>] [-bsub] [-status] [-retry <retry_failed>]
[-startFromScratch] [-keepIntermediates] [-statusTo <status_email_to>] [-statusFrom <status_email_from>] [-dot
<dot_graph>] [-expandedDot <expanded_dot_graph>] [-jobPrefix <job_name_prefix>] [-jobProject <job_project>] [-jobQueue
<job_queue>] [-jobPriority <job_priority>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
<temp_directory>] [-jobSGDir <job_scatter_gather_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>]
[-l <logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h] -R <referencefile> -I <bamfile> [-L <intervals>]
[-filter <filternames>] [-filterExpression <filterexpressions>]

-S,--script <script>                                                      QScript scala file
-run,--run_scripts                                                        Run QScripts.  Without this flag set only
performs a dry run.
-jobRunner,--job_runner <job_runner>                                      Use the specified job runner to dispatch
command line jobs
-bsub,--bsub                                                              Equivalent to -jobRunner Lsf706
-status,--status                                                          Get status of jobs for the qscript
-retry,--retry_failed <retry_failed>                                      Retry the specified number of times after a
command fails.  Defaults to no retries.
-startFromScratch,--start_from_scratch                                    Runs all command line functions even if the
outputs were previously output successfully.
-keepIntermediates,--keep_intermediate_outputs                            After a successful run keep the outputs of
any Function marked as intermediate.
-statusTo,--status_email_to <status_email_to>                             Email address to send emails to upon
completion or on error.
-statusFrom,--status_email_from <status_email_from>                       Email address to send emails from upon
completion or on error.
-dot,--dot_graph <dot_graph>                                              Outputs the queue graph to a .dot file.  See:
http://en.wikipedia.org/wiki/DOT_language
-expandedDot,--expanded_dot_graph <expanded_dot_graph>                    Outputs the queue graph of scatter gather to
a .dot file.  Otherwise overwrites the
dot_graph
-jobPrefix,--job_name_prefix <job_name_prefix>                            Default name prefix for compute farm jobs.
-jobProject,--job_project <job_project>                                   Default project for compute farm jobs.
-jobQueue,--job_queue <job_queue>                                         Default queue for compute farm jobs.
-jobPriority,--job_priority <job_priority>                                Default priority for jobs.
-memLimit,--default_memory_limit <default_memory_limit>                   Default memory limit for jobs, in gigabytes.
-runDir,--run_directory <run_directory>                                   Root directory to run functions from.
-tempDir,--temp_directory <temp_directory>                                Temp directory to pass to functions.
-jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>   Default directory to place scatter gather
output for compute farm jobs.
-emailHost,--emailSmtpHost <emailSmtpHost>                                Email SMTP host. Defaults to localhost.
-emailPort,--emailSmtpPort <emailSmtpPort>                                Email SMTP port. Defaults to 465 for ssl,
otherwise 25.
-emailTLS,--emailUseTLS                                                   Email should use TLS. Defaults to false.
-emailSSL,--emailUseSSL                                                   Email should use SSL. Defaults to false.
-l,--logging_level <logging_level>                                        Set the minimum level of logging, i.e.
setting INFO get's you INFO up to FATAL,
setting ERROR gets you ERROR and FATAL level
logging.
-log,--log_to_file <log_to_file>                                          Set the logging location
-quiet,--quiet_output_mode                                                Set the logging to quiet mode, no output to
stdout
-debug,--debug_mode                                                       Set the logging file string to include a lot
of debugging information (SLOW!)
-h,--help                                                                 Generate this help message

Arguments for ExampleUnifiedGenotyper:
-R,--referencefile <referencefile>                          The reference file for the bam files.
-I,--bamfile <bamfile>                                      Bam file to genotype.
-L,--intervals <intervals>                                  An optional file with a list of intervals to proccess.
-filter,--filternames <filternames>                         A optional list of filter names.
-filterExpression,--filterexpressions <filterexpressions>   An optional list of filter expressions.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
Argument with name '--bamfile' (-I) is missing.
Argument with name '--referencefile' (-R) is missing.
    at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala:57)
    at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 1.0.5504):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Argument with name '--bamfile' (-I) is missing.
##### ERROR Argument with name '--referencefile' (-R) is missing.
##### ERROR ------------------------------------------------------------------------------------------

To dry run the pipeline:

java \
  -Djava.io.tmpdir=tmp \
  -jar dist/Queue.jar \
  -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala \
  -R human_b36_both.fasta \
  -I pilot2_daughters.chr20.10k-11k.bam \
  -L chr20.interval_list \
  -filter StrandBias -filterExpression "SB>=0.10" \
  -filter AlleleBalance -filterExpression "AB>=0.75" \
  -filter QualByDepth -filterExpression "QD<5" \
  -filter HomopolymerRun -filterExpression "HRun>=4"

The dry run output should appear similar to this:

INFO 10:45:00,354 QScriptManager - Compiling 1 QScript
INFO 10:45:04,855 QScriptManager - Compilation complete
INFO 10:45:05,058 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,059 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO 10:45:05,059 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -R human_b36_both.fasta -I pilot2_daughters.chr20.10k-11k.bam -L chr20.interval_list -filter StrandBias -filterExpression SB>=0.10 -filter AlleleBalance -filterExpression AB>=0.75 -filter QualByDepth -filterExpression QD<5 -filter HomopolymerRun -filterExpression HRun>=4
INFO 10:45:05,059 HelpFormatter - Date/Time: 2011/03/24 10:45:05
INFO 10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,061 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO 10:45:05,150 QCommandLine - Added 4 functions
INFO 10:45:05,150 QGraph - Generating graph.
INFO 10:45:05,169 QGraph - Generating scatter gather jobs.
INFO 10:45:05,182 QGraph - Removing original jobs.
INFO 10:45:05,183 QGraph - Adding scatter gather jobs.
INFO 10:45:05,231 QGraph - Regenerating graph.
INFO 10:45:05,247 QGraph - -------
INFO 10:45:05,252 QGraph - Pending: IntervalScatterFunction /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals
INFO 10:45:05,253 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/scatter/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,254 QGraph - -------
INFO 10:45:05,279 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,279 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,279 QGraph - -------
INFO 10:45:05,283 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,283 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,283 QGraph - -------
INFO 10:45:05,287 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,287 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,288 QGraph - -------
INFO 10:45:05,288 QGraph - Pending: SimpleTextGatherFunction /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,288 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-jobOutputFile/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,289 QGraph - -------
INFO 10:45:05,291 QGraph - Pending: java -Xmx1g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T CombineVariants -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:input0,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input1,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input2,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -priority input0,input1,input2 -assumeIdenticalSamples
INFO 10:45:05,291 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-out/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,292 QGraph - -------
INFO 10:45:05,296 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.eval
INFO 10:45:05,296 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-2.out
INFO 10:45:05,296 QGraph - -------
INFO 10:45:05,299 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantFiltration -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:vcf,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -filter SB>=0.10 -filter AB>=0.75 -filter QD<5 -filter HRun>=4 -filterName StrandBias -filterName AlleleBalance -filterName QualByDepth -filterName HomopolymerRun
INFO 10:45:05,299 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-3.out
INFO 10:45:05,302 QGraph - -------
INFO 10:45:05,303 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.eval
INFO 10:45:05,303 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-4.out
INFO 10:45:05,304 QGraph - Dry run completed successfully!
INFO 10:45:05,304 QGraph - Re-run with "-run" to execute the functions.
INFO 10:45:05,304 QCommandLine - Done

### 8. Using traits to pass common values from QScripts to CommandLineFunctions

QScript files often create multiple CommandLineFunctions with similar arguments. Use various Scala tricks such as inner classes, traits/mixins, etc. to reuse variables.

• A self-type alias can be useful to distinguish between `this` references. We use qscript as an alias for the QScript's `this`, to distinguish it from the `this` inside inner classes or traits.
• A trait mixin can be used to reuse functionality. The trait below is designed to copy values from the QScript, and is then mixed into different instances of the functions.

See the following example:

class MyScript extends org.broadinstitute.sting.queue.QScript {
  // Create an alias 'qscript' for 'MyScript.this'
  qscript =>

  // This is a script argument
  @Argument(doc="message to display")
  var message: String = _

  // This is a script argument
  @Argument(doc="number of times to display")
  var count: Int = _

  trait ReusableArguments extends MyCommandLineFunction {
    // Whenever a function is created 'with' this trait, it will copy the message.
    this.commandLineMessage = qscript.message
  }

  abstract class MyCommandLineFunction extends CommandLineFunction {
    // This is a per command line argument
    @Argument(doc="message to display")
    var commandLineMessage: String = _
  }

  class MyEchoFunction extends MyCommandLineFunction {
    def commandLine = "echo " + commandLineMessage
  }

  class MyAlsoEchoFunction extends MyCommandLineFunction {
    def commandLine = "echo also " + commandLineMessage
  }

  def script = {
    for (i <- 1 to count) {
      val echo = new MyEchoFunction with ReusableArguments
      val alsoEcho = new MyAlsoEchoFunction with ReusableArguments
      add(echo, alsoEcho)
    }
  }
}

Created 2012-08-10 19:20:46 | Updated 2014-10-16 21:23:42 | Tags: intro queue developer qscript

### 1. Introduction

GATK-Queue is a command-line scripting framework for defining multi-stage genomic analysis pipelines, combined with an execution manager that runs those pipelines from end to end. Processing genome data often involves several steps to produce outputs; for example, our BAM-to-VCF calling pipeline includes, among other things:

• Local realignment around indels
• Emitting raw SNP calls
• Emitting indels
• Masking the SNPs at indels
• Annotating SNPs using chip data
• Labeling suspicious calls based on filters
• Creating a summary report with statistics

Running these tools one by one in series can take weeks of processing, or would require custom scripting to make use of parallel resources. With a Queue script, users can define the multiple steps of the pipeline declaratively and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques, such as running multiple copies of the same program on different portions of the genome, to produce outputs faster.

### 2. Obtaining Queue

You have two options: download the binary distribution (a prepackaged, ready-to-run program) or build it from source.
#### - Download the binary

This is obviously the easiest way to go. Links are on the Downloads page. Just get the Queue package; there is no need to get the GATK package separately, as GATK is bundled in with Queue.

#### - Building Queue from source

Briefly, here's what you need to know/do: Queue is part of the GATK repository. Download the source from the public repository on Github by running the following command:

git clone https://github.com/broadgsa/gatk.git

IMPORTANT NOTE: These instructions refer to the MIT-licensed version of the GATK+Queue source code. With that version, you will be able to build Queue itself, as well as the public portion of the GATK (the core framework), but that will not include the GATK analysis tools. If you want to use Queue to pipeline the GATK analysis tools, you need to clone the 'protected' repository. Please note however that part of the source code in that repository (the 'protected' module) is under a different license which excludes for-profit use, modification and redistribution.

Move to the git root directory and use Maven to build the source:

mvn clean verify

All dependencies will be managed by Maven as needed. See this article on how to test your installation of Queue.

### 3. Running Queue

See this article on running Queue for the first time for full details.

Queue arguments can be listed by running with --help:

java -jar dist/Queue.jar --help

To list the arguments required by a QScript, add the script with -S and run with --help:

java -jar dist/Queue.jar -S script.scala --help

Note that by default Queue runs in a "dry" mode, as explained in the link above. After verifying the generated commands, execute the pipeline by adding -run.

See QFunction and Command Line Options for more info on adjusting Queue options.

### 4. QScripts

#### General Information

Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.

Every QScript includes the following steps:

• New instances of CommandLineFunctions are created
• Input and output arguments are specified on each function
• The function is added with add() to Queue for dispatch and monitoring

The basic command line to run Queue pipelines is:

java -jar Queue.jar -S <script>.scala

See the main article Queue QScripts for more info on QScripts.

#### Supported QScripts

Most QScripts are analysis pipelines that are custom-built for specific projects, and we currently do not offer any QScripts as supported analysis tools. However, we do provide some example scripts that you can use as a basis to write your own QScripts (see below).

#### Example QScripts

The latest version of the example files is available in the Sting github repository under public/scala/qscript/examples

### 5. Visualization and Queue

#### QJobReport

Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.

Note that Queue attempts to generate a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally, the script requires the gsalib to be installed on the machine, which is typically done by providing its path in your .Rprofile file:

bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile
.libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")

Note that gsalib is available from the CRAN repository, so you can install it with the canonical R package install command.

#### Caveats

• The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the completed commands.
This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves.
• This feature only works for command line and LSF execution models. SGE should be easy to add for a motivated individual, but we cannot test this capability here at the Broad. Please send us a patch if you do extend Queue to support SGE.

#### DOT visualization of Pipelines

Queue emits a queue.dot file to help visualize your commands. You can open this file in programs like DOT, OmniGraffle, etc. to view your pipelines. By default the system will print out your LSF command lines, but this can be too much in a complex pipeline. To clarify your pipeline, override the dotString() function:

class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
  @Input(doc="foo") var bam = bamIn
  @Input(doc="foo") var bamIndex = bai(bamIn)
  @Output(doc="foo") var recalData = recalDataIn
  memoryLimit = Some(4)
  override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
  def commandLine = gatkCommandLine("CountCovariates") + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}

Here we only see CountCovariates my.bam [-OQ], for example, in the dot file. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here:

### 6. Further reading

Created 2012-08-09 04:08:01 | Updated 2013-06-17 21:09:33 | Tags: test intro queue developer install

#### Objective

Test that Queue is correctly installed, and that the supporting tools like Java are in your path.

#### Prerequisites

• Basic familiarity with the command-line environment
• Understand what a PATH variable is
• GATK installed
• Queue downloaded and placed on path

#### Steps

1. Invoke the Queue usage/help message
2. Troubleshooting

### 1. Invoke the Queue usage/help message

The command we're going to run is a very simple command that asks Queue to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your Queue package is installed correctly.

Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.

#### Action

Type the following command:

java -jar <path to Queue.jar> --help

replacing the <path to Queue.jar> bit with the path you have set up in your command-line environment.

#### Expected Result

You should see usage output similar to the following:

usage: java -jar Queue.jar -S <script> [-jobPrefix <job_name_prefix>] [-jobQueue <job_queue>] [-jobProject <job_project>]
       [-jobSGDir <job_scatter_gather_directory>] [-memLimit <default_memory_limit>] [-runDir <run_directory>]
       [-tempDir <temp_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>] [-emailTLS] [-emailSSL]
       [-emailUser <emailUsername>] [-emailPass <emailPassword>] [-emailPassFile <emailPasswordFile>] [-bsub] [-run]
       [-dot <dot_graph>] [-expandedDot <expanded_dot_graph>] [-startFromScratch] [-status] [-statusFrom <status_email_from>]
       [-statusTo <status_email_to>] [-keepIntermediates] [-retry <retry_failed>] [-l <logging_level>] [-log <log_to_file>]
       [-quiet] [-debug] [-h]

-S,--script <script>                                                      QScript scala file
-jobPrefix,--job_name_prefix <job_name_prefix>                            Default name prefix for compute farm jobs.
-jobQueue,--job_queue <job_queue>                                         Default queue for compute farm jobs.
-jobProject,--job_project <job_project>                                   Default project for compute farm jobs.
-jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>   Default directory to place scatter gather output for compute farm jobs.
-memLimit,--default_memory_limit <default_memory_limit>                   Default memory limit for jobs, in gigabytes.
-runDir,--run_directory <run_directory>                                   Root directory to run functions from.
-tempDir,--temp_directory <temp_directory>                                Temp directory to pass to functions.
-emailHost,--emailSmtpHost <emailSmtpHost>                                Email SMTP host. Defaults to localhost.
-emailPort,--emailSmtpPort <emailSmtpPort>                                Email SMTP port. Defaults to 465 for ssl, otherwise 25.
-emailTLS,--emailUseTLS                                                   Email should use TLS. Defaults to false.
-emailSSL,--emailUseSSL                                                   Email should use SSL. Defaults to false.
-emailUser,--emailUsername <emailUsername>                                Email SMTP username. Defaults to none.
-emailPass,--emailPassword <emailPassword>                                Email SMTP password. Defaults to none. Not secure! See emailPassFile.
-emailPassFile,--emailPasswordFile <emailPasswordFile>                    Email SMTP password file. Defaults to none.
-bsub,--bsub_all_jobs                                                     Use bsub to submit jobs
-run,--run_scripts                                                        Run QScripts. Without this flag set only performs a dry run.
-dot,--dot_graph <dot_graph>                                              Outputs the queue graph to a .dot file. See: http://en.wikipedia.org/wiki/DOT_language
-expandedDot,--expanded_dot_graph <expanded_dot_graph>                    Outputs the queue graph of scatter gather to a .dot file. Otherwise overwrites the dot_graph
-startFromScratch,--start_from_scratch                                    Runs all command line functions even if the outputs were previously output successfully.
-status,--status                                                          Get status of jobs for the qscript
-statusFrom,--status_email_from <status_email_from>                       Email address to send emails from upon completion or on error.
-statusTo,--status_email_to <status_email_to>                             Email address to send emails to upon completion or on error.
-keepIntermediates,--keep_intermediate_outputs                            After a successful run keep the outputs of any Function marked as intermediate.
-retry,--retry_failed <retry_failed>                                      Retry the specified number of times after a command fails. Defaults to no retries.
-l,--logging_level <logging_level>                                        Set the minimum level of logging, i.e. setting INFO gets you INFO up to FATAL, setting ERROR gets you ERROR and FATAL level logging.
-log,--log_to_file <log_to_file>                                          Set the logging location
-quiet,--quiet_output_mode                                                Set the logging to quiet mode, no output to stdout
-debug,--debug_mode                                                       Set the logging file string to include a lot of debugging information (SLOW!)
-h,--help                                                                 Generate this help message

If you see this message, your Queue installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.

### 2. Troubleshooting

Let's try to figure out what's not working.

#### Action

First, make sure that your Java version is at least 1.6, by typing the following command:

java -version

#### Expected Result

You should see something similar to the following text:

java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)

#### Remedial actions

If the version is less than 1.6, install the newest version of Java onto the system. If you instead see something like

java: Command not found

make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.

On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.

Created 2012-07-23 17:05:10 | Updated 2013-03-25 22:18:53 | Tags: dataprocessingpipeline queue workflow pacbio qscript intermediate

### Introduction

Processing of data originating from the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented on numerous occasions. The guidelines we describe in this document are the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute.
These guidelines produced better results than those obtained using alternative pipelines up to this date (September 2011) for the datasets tested, but there is no guarantee that they will be the best for every dataset, or that other pipelines won't supersede them in the future.

The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

### BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and relax the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea of what parameters to use, check the suggestions given by the BWA author in the BWA manual page that are specific to Pacbio. The goal is to account for the Pacific Biosciences RS known error mode and benefit from the long reads for a high-scoring overall match. (For older versions, you can use the filtered_subreads.fasta and combine the base quality scores extracted from the h5 files using the Pacific Biosciences SMRT pipeline python tools.)

To produce a BAM file that is sorted by coordinate with adequate read group information, we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practice to have your BAM file conform to these specifications.

### Best Practices for Variant Calling

Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup.
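In this generation of the GATK, those covariates correspond to repeated -cov flags on CountCovariates; the covariate class names below match the dotString example elsewhere on this page. As a rough sketch only, a plain-Scala helper could assemble such a command string; the jar name and all file paths are placeholders, not a tested invocation:

```scala
// Illustrative sketch: build a CountCovariates command line using the
// covariates named above (QualityScore, Dinucleotide, ReadGroup).
// Jar name and paths are placeholders for your own setup.
object RecalCommand {
  def countCovariatesCommand(bam: String, dbsnp: String, recalOut: String): String = {
    val covariates =
      Seq("QualityScoreCovariate", "DinucCovariate", "ReadGroupCovariate")
    (Seq("java", "-jar", "GenomeAnalysisTK.jar", "-T", "CountCovariates",
         "-I", bam, "-D", dbsnp, "-recalFile", recalOut)
       ++ covariates.flatMap(c => Seq("-cov", c))).mkString(" ")
  }
}
```

For instance, RecalCommand.countCovariatesCommand("sample.bam", "dbsnp.rod", "recal.csv") yields a single string ending in the three -cov arguments, ready to wrap in a CommandLineFunction's commandLine.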
You can follow the GATK's Best Practices for Variant Detection according to the type of data you have, with the exception of indel realignment, because that tool has not been adapted for Pacific Biosciences RS data.

### Problems with Variant Calling with Pacific Biosciences

• Calling must be more permissive of indels in the data. You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

• Base quality thresholds should be adjusted to the specifics of your data. Be aware that the Unified Genotyper has cutoffs for base quality score, and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here; you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold, and modify as necessary.

• Reference bias. To account for the high insertion and deletion error rate of the Pacific Biosciences instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note, however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Biosciences data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.
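The reference-bias point above hinges on setting the gap open penalty below the mismatch penalty. As a hedged sketch of assembling that bwasw invocation in Scala: the -q (gap open) and -b (mismatch) flag letters are taken from the BWA manual's bwasw options and should be double-checked against your BWA version, and the numeric defaults are placeholders, not validated recommendations:

```scala
// Sketch: build the `bwa bwasw` command described above with the gap open
// penalty (-q, per the BWA manual) below the mismatch penalty (-b).
// Flag letters and numeric defaults are assumptions for illustration only.
object BwaswCommand {
  def bwaswCommand(reference: String, reads: String,
                   gapOpen: Int = 2, mismatch: Int = 5): String = {
    require(gapOpen < mismatch, "gap open penalty should be below the mismatch penalty")
    Seq("bwa", "bwasw", "-q", gapOpen.toString, "-b", mismatch.toString,
        reference, reads).mkString(" ")
  }
}
```

The require() guard encodes the constraint from the text, so a CommandLineFunction built on this helper fails fast if the penalties are misconfigured.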
Created 2013-12-20 22:57:53 | Updated 2014-02-07 17:01:39 | Tags: best-practices workshop queue presentations videos topstory

The presentation videos for:

• Day 1 (Best Practices for Variant Calling with GATK)
• Day 2 (Building Analysis Pipelines with Queue)

are available here: http://www.broadinstitute.org/gatk/guide/events?id=3391

Created 2013-10-22 02:40:27 | Updated 2013-10-22 14:49:48 | Tags: best-practices workshop queue presentations slides

Slides for:

• Day 1 (Best Practices for Variant Calling with GATK)
• Day 2 (Building Analysis Pipelines with Queue)

are available at this link in our DropBox slide archive: https://www.dropbox.com/sh/ryz7bx4aeqc7mkl/44Q_2jvIyH

Created 2013-09-17 18:18:29 | Updated | Tags: best-practices workshop queue pipeline

Register now for a spot at the upcoming GATK workshop, which will be held in Cambridge, MA on October 21-22.

http://www.cvent.com/events/broade-workshop-gatk-best-practices-building-analysis-pipelines-with-queue/event-summary-de1eaa027413404ba6dc04c128d52c63.aspx

This workshop will cover the following topics:

• GATK Best Practices for Variant Detection
• Building Analysis Pipelines with Queue

The workshop is scheduled right before ASHG Boston, so if you're going to be in town for the conference, make sure you come a couple of days early and attend the GATK workshop!

Created 2016-04-17 10:29:00 | Updated | Tags: queue licence

Last week I had a chat with @Geraldine_VdAuwera in Edinburgh about the split licence in our project, which arises because we rely on some protected code of the Queue extensions of the GATK walkers. To prevent problems, I split our repository into public and protected parts, just like you are doing in GATK 3 right now. This prohibits us from distributing our jar.

Our repository is available here: https://github.com/biopet/biopet

This is the module that causes the problem:

<dependency>
    <groupId>org.broadinstitute.gatk</groupId>
    <artifactId>gatk-queue-extensions-distribution</artifactId>
    <version>3.5</version>
</dependency>

We don't include GATK itself in our project; the user should download it themselves and supply it in a config file. It would be better to get rid of this split licence so we can be fully open source. This would also allow us to distribute our jar. As discussed with @Geraldine_VdAuwera, it might be possible to get permission to use the protected modules in our public code. Before I do this, I would like explicit permission from Broad, to prevent any legal issues. So would it be possible to get this permission? Thanks already for looking into this.

Created 2016-04-06 09:42:21 | Updated | Tags: haplotypecaller queue

Hi! I'm attempting to use Queue on a PBSPro HPC cluster. I have tested the functionality of a custom scala script for HaplotypeCaller and it is runnable. However, following the discussion on the GATK forum, it seems I need a job scheduler to dispatch Queue output onto several nodes. Could you give me some advice or examples of the type of scheduler I need on a PBSPro system? I tried to run Queue on a single node and it seems to work faster. The question is: when I run Queue on a single node, am I actually multithreading HC, or am I wrong? Thanks a lot. Best regards, Marco

Created 2016-03-07 21:24:06 | Updated | Tags: queue picard

I cannot find the Picard version included in Queue 3.5. Who can tell me that? Just for documentation purposes. Many thanks in advance.
Created 2016-03-01 13:02:06 | Updated 2016-03-01 13:08:47 | Tags: baserecalibrator commandlinegatk queue qscript analyzecovariates Dear GATK team, I'd like to ask a question about the possibility of HaplotypeCaller and AnalyzeCovariates running in parallel: I developed a QScript that runs indel realignment, BQSR, variant calling (obtaining gVCF file as a result) and then -- BaseRecalibrator for the second time followed by AnalyzeCovariates. Looking in QScript jobreport PDF file I noticed that the second run of BaseRecalibrator was performed after HaplotypeCaller, though as I can understand, HaplotypeCaller and the second run of BaseRecalibrator are independent regarding data and potentially can run in paralle. Can I run them in parallel somehow in order to save time? Created 2016-02-23 18:01:29 | Updated | Tags: commandlinegatk queue qscript analyzecovariates Dear GATK team, I use Queue to build a pipeline with GATK tools: RealignerTargetCreator, IndelRealigner, BaseRecalibrator, AnalyzeCovariates, and HaplotypeCaller. So, I developed the corresponding QScript. When I tested it the very first time, all functions finished with no errors. But then, when I invoked it with -startfromScratch it failed to execute AnalyzeCovariates saying: ERROR 19:08:14,037 FunctionEdge - Error: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLi mit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv' ERROR 19:08:14,045 FunctionEdge - Contents of bqsr-report.pdf.out: [...] 
In the bqsr-report.out file I can see no errors:

INFO 17:25:12,764 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:12,767 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:40
INFO 17:25:12,767 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:25:12,767 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:25:12,771 HelpFormatter - Program Args: -T AnalyzeCovariates -L intervals_to_process.interval_list -R Homo_sapiens_assembly38.fasta -before recal-table1.txt -after recal-table2.txt -plots bqsr-report.pdf -csv bqsr-report.csv
INFO 17:25:12,779 HelpFormatter - Executing as [...]
INFO 17:25:12,779 HelpFormatter - Date/Time: 2016/02/23 17:25:12
INFO 17:25:12,780 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:12,780 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:25:12,838 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:25:13,079 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 17:25:13,196 IntervalUtils - Processing 3088286401 bp from intervals
INFO 17:25:13,277 GenomeAnalysisEngine - Preparing for traversal
INFO 17:25:13,287 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:25:13,287 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:25:13,288 ProgressMeter -          | processed |    time |    per 1M |           |   total | remaining
INFO 17:25:13,289 ProgressMeter - Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO 17:25:13,764 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 17:25:13,921 ContextCovariate - Context sizes: base substitution model 2, indel substitution model 3
INFO 17:25:13,931 AnalyzeCovariates - Generating csv file 'bqsr-report.csv'
INFO 17:25:14,164 AnalyzeCovariates - Generating plots file 'bqsr-report.pdf'
INFO 17:25:20,279 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.gatk.tools.walkers.bqsr.AnalyzeCovariates$None@3ce295f9
INFO 17:25:20,282 ProgressMeter -     done       0.0     6.0 s     11.6 w   100.0%     6.0 s     0.0 s
INFO 17:25:20,283 ProgressMeter - Total runtime 7.00 secs, 0.12 min, 0.00 hours
INFO 17:25:21,513 GATKRunReport - Uploaded run statistics report to AWS S3

I can see that AnalyzeCovariates and HaplotypeCaller start nearly at the same time:

INFO 17:25:09,264 FunctionEdge - Starting: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv'
INFO 17:25:09,264 FunctionEdge - Output written to bqsr-report.pdf.out
INFO 17:25:21,548 FunctionEdge - Starting: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'HaplotypeCaller' '-I' 'recaled.bam' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-variant_index_type' 'LINEAR' '-variant_index_parameter' '128000' '-o' 'sample.gvcf' '-D' 'dbsnp_144.hg38.vcf' '-ERC' 'GVCF' '-pcrModel' 'CONSERVATIVE'
INFO 17:25:21,548 FunctionEdge - Output written to sample.gvcf.out

and after HaplotypeCaller finishes successfully (I checked that it produced a gVCF file), this message about the AnalyzeCovariates error is printed:

INFO 19:08:14,029 QGraph - 0 Pend, 2 Run, 0 Fail, 10 Done
ERROR 19:08:14,037 FunctionEdge - Error: 'java' '-Xmx16384m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=tmp' '-cp' 'Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'AnalyzeCovariates' '-L' 'intervals_to_process.interval_list' '-R' 'Homo_sapiens_assembly38.fasta' '-before' 'recal-table1.txt' '-after' 'recal-table2.txt' '-plots' 'bqsr-report.pdf' '-csv' 'bqsr-report.csv'

As a result, AnalyzeCovariates did not produce the CSV and PDF reports on that run.

When I invoked the QScript again (this time not from scratch) to reproduce the situation, it executed with no errors, and AnalyzeCovariates generated both CSV and PDF reports.
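For the parallelism question above: Queue decides this automatically. It builds a dependency graph from each function's declared @Input/@Output files and dispatches every job whose inputs have already been produced, so two functions with no data dependency run concurrently. A rough illustration of that wave-based scheduling, sketched in Python with hypothetical job and file names (this is not Queue's actual QGraph code):

```python
def schedule_waves(jobs):
    """Group jobs into dispatch waves: a job becomes ready as soon as
    every job producing one of its inputs has finished.

    jobs: dict name -> (inputs, outputs), with inputs/outputs as sets
    of file names, mirroring Queue's declared @Input/@Output files."""
    produced_by = {}
    for name, (_, outs) in jobs.items():
        for f in outs:
            produced_by[f] = name
    # A job depends on whichever jobs produce its input files.
    deps = {name: {produced_by[f] for f in ins if f in produced_by}
            for name, (ins, _) in jobs.items()}
    waves, done = [], set()
    while len(done) < len(jobs):
        ready = sorted(n for n in jobs if n not in done and deps[n] <= done)
        if not ready:
            raise ValueError("cycle in dependency graph")
        waves.append(ready)
        done |= set(ready)
    return waves

# Simplified BQSR pipeline from the post (file names made up):
pipeline = {
    "BaseRecalibrator1": ({"in.bam"}, {"recal-table1.txt"}),
    "PrintReads":        ({"in.bam", "recal-table1.txt"}, {"recaled.bam"}),
    "HaplotypeCaller":   ({"recaled.bam"}, {"sample.gvcf"}),
    "BaseRecalibrator2": ({"recaled.bam"}, {"recal-table2.txt"}),
    "AnalyzeCovariates": ({"recal-table1.txt", "recal-table2.txt"}, {"bqsr-report.pdf"}),
}
```

In this model the second BaseRecalibrator pass and HaplotypeCaller land in the same wave, because both read only the recalibrated BAM.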

After that I executed AnalyzeCovariates manually with -l DEBUG and it produced only the CSV report, no PDF. Here is the central part of the AnalyzeCovariates debug output:

DEBUG 20:38:28,593 RecalUtils - R command line: Rscript (resource)org/broadinstitute/gatk/engine/recalibration/BQSR.R bqsr-report.csv recal-table1.txt bqsr-report1.pdf
DEBUG 20:38:28,607 RScriptExecutor - Executing:
DEBUG 20:38:28,607 RScriptExecutor -   Rscript
DEBUG 20:38:28,607 RScriptExecutor -   -e
DEBUG 20:38:28,608 RScriptExecutor -   tempLibDir = '/tmp/Rlib.4876869209103817519';source('/tmp/BQSR.3779158431229884689.R');
DEBUG 20:38:28,608 RScriptExecutor -   bqsr-report.csv
DEBUG 20:38:28,608 RScriptExecutor -   recal-table1.txt
DEBUG 20:38:28,608 RScriptExecutor -   bqsr-report1.pdf

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

lowess

Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
DEBUG 20:38:34,790 RScriptExecutor - Result: 0
INFO 20:38:34,792 Walker - [REDUCE RESULT] Traversal result is: org.broadinstitute.gatk.tools.walkers.bqsr.AnalyzeCovariates$None@44723d95
INFO 20:38:34,795 ProgressMeter -     done       0.0     7.0 s     11.6 w   100.0%     7.0 s     0.0 s
INFO 20:38:34,795 ProgressMeter - Total runtime 7.03 secs, 0.12 min, 0.00 hours

So, every time I start my QScript from scratch, AnalyzeCovariates fails, but it finishes successfully when I invoke the script again without -startFromScratch. And when I execute AnalyzeCovariates manually with -l DEBUG, it produces the CSV report only. Should I use the BQSR.R script directly? I will be very grateful for any tips and help.

Created 2015-12-15 16:21:48 | Updated | Tags: queue picard collectmultiplemetrics

Greetings, Some of Picard's collect commands (e.g. CollectMultipleMetrics, CollectInsertSizeMetrics) load R script resources. Those resources were present in Queue <= v3.3-0 but are absent from subsequent versions:

$ jar tf 3.3-0/Queue.jar | grep "picard.*R$"
picard/analysis/baseDistributionByCycle.R
picard/analysis/gcBias.R
picard/analysis/meanQualityByCycle.R
picard/analysis/rnaSeqCoverage.R
picard/analysis/qualityScoreDistribution.R
picard/analysis/insertSizeHistogram.R
$ jar tf 3.4-0/Queue.jar | grep "picard.*R$"
$ jar tf 3.5-0/Queue.jar | grep "picard.*R$"
$

Without those resources, CollectMultipleMetrics throws something similar to java.lang.IllegalArgumentException: Script [picard/analysis/qualityScoreDistribution.R] not found in classpath, depending on which command it hits first. I thought I would be able to overcome that by adding picard.jar to the CLASSPATH, but it did not help; I had to add the scripts to Queue.jar to make it work.

It would be nice to have those resources back in the distributed Queue.jar.

Thanks!

Roman
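Since a jar is a zip archive, the workaround Roman describes (adding the R scripts back into Queue.jar) can be scripted. A minimal sketch in Python, with hypothetical paths; run it on copies of the jars:

```python
import zipfile

def copy_jar_entries(src_jar, dest_jar, name_filter):
    """Append entries matching name_filter from src_jar into dest_jar.

    Opening dest_jar in "a" mode appends members to the archive; entries
    already present in the destination are skipped to avoid duplicates."""
    with zipfile.ZipFile(src_jar) as src, zipfile.ZipFile(dest_jar, "a") as dest:
        existing = set(dest.namelist())
        for name in src.namelist():
            if name_filter(name) and name not in existing:
                dest.writestr(name, src.read(name))

# Example (hypothetical file locations):
# copy_jar_entries("picard.jar", "Queue.jar",
#                  lambda n: n.startswith("picard/analysis/") and n.endswith(".R"))
```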

Created 2015-12-01 13:17:38 | Updated | Tags: haplotypecaller queue

I tried HaplotypeCaller on a whole-genome sample in GVCF mode; it takes several days and ends up incomplete. I would like to use a Queue script to parallelize the jobs. I am new to Queue scripts and put together the script below from posts on the forum, but I am getting errors.

package org.broadinstitute.sting.queue.qscripts
import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._

class HaplotypeCaller extends QScript {
// Create an alias 'qscript' to be able to access variables in the VariantCaller.
// 'qscript' is now the same as 'VariantCaller.this'
qscript =>

// Required arguments. All initialized to empty values.
@Input(doc="The reference file for the bam files.", shortName="R", required=true)
var referenceFile: File = _

@Input(doc="One or more bam files.", shortName="I")
var bamFiles: List[File] = Nil

@Input(doc="Output core filename.", shortName="O", required=true)
var out: File = _

@Argument(doc="Maxmem.", shortName="mem", required=true)
var maxMem: Int = _

@Argument(doc="Number of scatters", shortName="nsc", required=true)
var numScatters: Int = _

@Argument(doc="Minimum phred-scaled confidence to call variants", shortName="stand_call_conf", required=true)
var standCallConf: Int = _ //30 //default: best-practices value

@Argument(doc="Minimum phred-scaled confidence to emit variants", shortName="stand_emit_conf", required=true)
var standEmitConf: Int = _ //10 //default: best-practices value

@Argument(doc="Mode for emitting reference confidenc scores", shortName="ERC", required=true)
var EmitRefConfidence: Boolean = true

// The following arguments are all optional.
@Input(doc="An optional file with known SNP sites.", shortName="D", required=false)
var dbsnpFile: File = _

@Input(doc="An optional file with targets intervals.", shortName="L", required=false)
var targetFile: File = _

@Argument(doc="Amount of padding (in bp) to add to each interval", shortName="ip", required=false)
var intervalPadding: Int = _

def script() {
val haplotypeCaller = new HaplotypeCaller

// All required input
haplotypeCaller.input_file = bamFiles
haplotypeCaller.reference_sequence = referenceFile
haplotypeCaller.out = qscript.out + ".g.vcf"

haplotypeCaller.scatterCount = numScatters
haplotypeCaller.memoryLimit = maxMem

haplotypeCaller.stand_emit_conf = standEmitConf
haplotypeCaller.stand_call_conf = standCallConf
haplotypeCaller.emitRefConfidence = GVCF
// Optional input
if (dbsnpFile != null) {
haplotypeCaller.D = dbsnpFile
}
if (targetFile != null) {
haplotypeCaller.L :+= targetFile
}
add(haplotypeCaller)

}
}

The script is run from the command line as follows:

java -jar /proj/lohi/Canine_Tools/GATK-Queue-3.5/Queue.jar -S HaplotypeCaller.scala -R canFam3.fa -I BD01_recalibrated.bam -stand_call_conf 30 -stand_emit_conf 10 -nct 4 -ERC GVCF -O testQueue -run -debug

which throws the error:

INFO  15:06:10,940 QScriptManager - Compiling 1 QScript
ERROR 15:06:11,096 QScriptManager - HaplotypeCaller.scala:3: object QScript is not a member of package org.broadinstitute.sting.queue
ERROR 15:06:11,102 QScriptManager - import org.broadinstitute.sting.queue.QScript
ERROR 15:06:11,103 QScriptManager -        ^
ERROR 15:06:11,129 QScriptManager - HaplotypeCaller.scala:5: object extensions is not a member of package org.broadinstitute.sting.queue
ERROR 15:06:11,133 QScriptManager - import org.broadinstitute.sting.queue.extensions.gatk._
ERROR 15:06:11,134 QScriptManager -                                       ^
ERROR 15:06:11,144 QScriptManager - class HaplotypeCaller extends QScript {
ERROR 15:06:11,145 QScriptManager -                               ^
ERROR 15:06:11,335 QScriptManager -     var referenceFile: File = _
ERROR 15:06:11,335 QScriptManager -                        ^
ERROR 15:06:11,338 QScriptManager -     @Input(doc="The reference file for the bam files.", shortName="R", required=true)
ERROR 15:06:11,339 QScriptManager -      ^
ERROR 15:06:11,347 QScriptManager -     var bamFiles: List[File] = Nil
ERROR 15:06:11,347 QScriptManager -                        ^
ERROR 15:06:11,350 QScriptManager -     @Input(doc="One or more bam files.", shortName="I")
ERROR 15:06:11,351 QScriptManager -      ^
ERROR 15:06:11,623 QScriptManager -     var out: File = _
ERROR 15:06:11,623 QScriptManager -              ^
ERROR 15:06:11,625 QScriptManager -     @Input(doc="Output core filename.", shortName="O", required=true)
ERROR 15:06:11,626 QScriptManager -      ^
ERROR 15:06:11,632 QScriptManager -     @Argument(doc="Maxmem.", shortName="mem", required=true)
ERROR 15:06:11,633 QScriptManager -      ^
ERROR 15:06:11,635 QScriptManager -     @Argument(doc="Number of cpu threads per data thread", shortName="nct", required=true)
ERROR 15:06:11,636 QScriptManager -      ^
ERROR 15:06:11,638 QScriptManager -     @Argument(doc="Number of scatters", shortName="nsc", required=true)
ERROR 15:06:11,639 QScriptManager -      ^
ERROR 15:06:11,641 QScriptManager -     @Argument(doc="Minimum phred-scaled confidence to call variants", shortName="stand_call_conf", required=true)
ERROR 15:06:11,642 QScriptManager -      ^
ERROR 15:06:11,644 QScriptManager -     @Argument(doc="Minimum phred-scaled confidence to emit variants", shortName="stand_emit_conf", required=true)
ERROR 15:06:11,644 QScriptManager -      ^
ERROR 15:06:11,651 QScriptManager -     @Argument(doc="Mode for emitting reference confidenc scores", shortName="ERC", required=true)
ERROR 15:06:11,652 QScriptManager -      ^
ERROR 15:06:11,654 QScriptManager -     var dbsnpFile: File = _
ERROR 15:06:11,654 QScriptManager -                    ^
ERROR 15:06:11,656 QScriptManager -     @Input(doc="An optional file with known SNP sites.", shortName="D", required=false)
ERROR 15:06:11,656 QScriptManager -      ^
ERROR 15:06:11,658 QScriptManager -     var targetFile: File = _
ERROR 15:06:11,658 QScriptManager -                     ^
ERROR 15:06:11,660 QScriptManager -     @Input(doc="An optional file with targets intervals.", shortName="L", required=false)
ERROR 15:06:11,660 QScriptManager -      ^
ERROR 15:06:11,662 QScriptManager -     @Argument(doc="Amount of padding (in bp) to add to each interval", shortName="ip", required=false)
ERROR 15:06:11,663 QScriptManager -      ^
ERROR 15:06:11,888 QScriptManager - HaplotypeCaller.scala:55: value input_file is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,888 QScriptManager -     haplotypeCaller.input_file = bamFiles
ERROR 15:06:11,889 QScriptManager -                         ^
ERROR 15:06:11,896 QScriptManager - HaplotypeCaller.scala:56: value reference_sequence is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,897 QScriptManager -     haplotypeCaller.reference_sequence = referenceFile
ERROR 15:06:11,897 QScriptManager -                         ^
ERROR 15:06:11,907 QScriptManager - HaplotypeCaller.scala:59: value scatterCount is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,908 QScriptManager -     haplotypeCaller.scatterCount = numScatters
ERROR 15:06:11,909 QScriptManager -                         ^
ERROR 15:06:11,915 QScriptManager - HaplotypeCaller.scala:60: value memoryLimit is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,916 QScriptManager -     haplotypeCaller.memoryLimit = maxMem
ERROR 15:06:11,916 QScriptManager -                         ^
ERROR 15:06:11,924 QScriptManager -                         ^
ERROR 15:06:11,930 QScriptManager - HaplotypeCaller.scala:63: value stand_emit_conf is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,930 QScriptManager -     haplotypeCaller.stand_emit_conf = standEmitConf
ERROR 15:06:11,931 QScriptManager -                         ^
ERROR 15:06:11,937 QScriptManager - HaplotypeCaller.scala:64: value stand_call_conf is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,938 QScriptManager -     haplotypeCaller.stand_call_conf = standCallConf
ERROR 15:06:11,938 QScriptManager -                         ^
ERROR 15:06:11,944 QScriptManager - HaplotypeCaller.scala:65: value emitRefConfidence is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,945 QScriptManager -         haplotypeCaller.emitRefConfidence = GVCF
ERROR 15:06:11,945 QScriptManager -                         ^
ERROR 15:06:11,952 QScriptManager - HaplotypeCaller.scala:68: value D is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,952 QScriptManager -         haplotypeCaller.D = dbsnpFile
ERROR 15:06:11,953 QScriptManager -                             ^
ERROR 15:06:11,960 QScriptManager - HaplotypeCaller.scala:71: value L is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,961 QScriptManager -         haplotypeCaller.L :+= targetFile
ERROR 15:06:11,961 QScriptManager -                             ^
ERROR 15:06:11,967 QScriptManager - HaplotypeCaller.scala:72: value ip is not a member of org.broadinstitute.sting.queue.qscripts.HaplotypeCaller
ERROR 15:06:11,967 QScriptManager -         haplotypeCaller.ip = intervalPadding
ERROR 15:06:11,968 QScriptManager -                             ^
ERROR 15:06:11,971 QScriptManager -         ^
ERROR 15:06:11,981 QScriptManager - 64 errors found
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.gatk.queue.QException: Compile of HaplotypeCaller.scala failed with 64 errors
at org.broadinstitute.gatk.queue.QCommandLine.org$broadinstitute$gatk$queue$QCommandLine$$qScriptPluginManager$lzycompute(QCommandLine.scala:94)
at org.broadinstitute.gatk.queue.QCommandLine.org$broadinstitute$gatk$queue$QCommandLine$$qScriptPluginManager(QCommandLine.scala:92)
at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61)
at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Compile of HaplotypeCaller.scala failed with 64 errors
##### ERROR ------------------------------------------------------------------------------------------
INFO 15:06:12,064 QCommandLine - Shutting down jobs. Please wait...

Could anyone offer some suggestions to fix this? Any help is appreciated.

Created 2015-10-01 15:41:09 | Updated | Tags: queue combinegvcfs

Hi GATK team/users, We're doing a targeted sequencing project with ~1500 samples, so I'm trying to make run-level files from batches of 36 samples with CombineGVCFs. I wrote a super simple QScript in the past that worked well. However, we've since moved to a new server that uses Slurm instead of SGE. I've installed slurm-drmaa and that seems to be working well, but now I can't specify the list of VCF files correctly, and I don't know enough Scala syntax to see what I'm doing wrong. Any help with this would be very much appreciated; I've spent too many hours trying to figure it out with no success! I've tried submitting a list of VCFs as well as individual file names; both give me the same error.

Here is the Scala script:
// Import classes
package org.broadinstitute.sting.queue.qscripts.examples
import collection.JavaConversions._
import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class RunCombineGVCFs extends QScript {
  @Input(fullName="variant", shortName="V", doc="List of input gVCF files", required=true)
  var gvcfFile: List[File] = Nil
  @Argument(doc="Name of output merged GVCF file.", shortName="o")
  var output: File = _
  var stupiderror: Boolean = true
  def script() {
    val combineGVCFs = new CombineGVCFs
    combineGVCFs.reference_sequence = new File("hs37d5.fa")
    combineGVCFs.variant = gvcfFile
    combineGVCFs.disable_auto_index_creation_and_locking_when_reading_rods = stupiderror
    combineGVCFs.out = output
    combineGVCFs.scatterCount = 26
    add(combineGVCFs)
  }
}

Submission command:

java -Xmx5G -jar Queue.jar \
    -S RunCombineGVCFs.test.scala \
    -qsub \
    --memory_limit 10 \
    --run_name VCFFILE \
    -o ../RunVCF/VCFFILE \
    --variant SampleFR.79707.GATGAATC.g.vcf \
    --variant SampleCADB.79202.GACAGTGC.g.vcf \
    --variant SampleNFBC.1183222.CGGATTGC.g.vcf \
    --variant SampleFR.57509.AGTCACTA.g.vcf \
    -startFromScratch \
    -run

Error message:

INFO 10:41:14,300 QScriptManager - Compiling 1 QScript
INFO 10:41:26,173 QScriptManager - Compilation complete
INFO 10:41:26,246 HelpFormatter - ----------------------------------------------------------------------
INFO 10:41:26,246 HelpFormatter - Queue v3.4-46-gbc02625, Compiled 2015/07/09 17:37:55
INFO 10:41:26,246 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 10:41:26,247 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 10:41:26,247 HelpFormatter - Program Args: -S /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/RunCombineGVCFs.test.scala -qsub --memory_limit 10 --run_name VCFFILE -o ../RunVCF/VCFFILE --variant SampleFR.79707.GATGAATC.g.vcf --variant SampleCADB.79202.GACAGTGC.g.vcf --variant SampleNFBC.1183222.CGGATTGC.g.vcf --variant
SampleFR.57509.AGTCACTA.g.vcf -startFromScratch -run INFO 10:41:26,248 HelpFormatter - Executing as fashar1@jhu.edu@login-node04 on Linux 2.6.32-504.30.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_85-mockbuild_2015_07_15_12_57-b00. INFO 10:41:26,248 HelpFormatter - Date/Time: 2015/10/01 10:41:26 INFO 10:41:26,248 HelpFormatter - ---------------------------------------------------------------------- INFO 10:41:26,248 HelpFormatter - ---------------------------------------------------------------------- INFO 10:41:26,253 QCommandLine - Scripting RunCombineGVCFs INFO 10:41:26,300 QCommandLine - Added 1 functions INFO 10:41:26,300 QGraph - Generating graph. INFO 10:41:26,312 QGraph - Generating scatter gather jobs. INFO 10:41:26,341 QGraph - Removing original jobs. INFO 10:41:26,342 QGraph - Adding scatter gather jobs. INFO 10:41:26,475 QGraph - Regenerating graph. INFO 10:41:26,523 QGraph - Running jobs. INFO 10:41:26,555 QGraph - Removing outputs from previous runs. INFO 10:41:26,629 FunctionEdge - Starting: LocusScatterFunction: List(/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleCADB.79202.GACAGTGC.g.vcf.idx, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleNFBC.1183222.CGGATTGC.g.vcf.idx, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleNFBC.1183222.CGGATTGC.g.vcf, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleFR.79707.GATGAATC.g.vcf, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleFR.57509.AGTCACTA.g.vcf.idx, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleFR.79707.GATGAATC.g.vcf.idx, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleCADB.79202.GACAGTGC.g.vcf, 
/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/hs37d5.fa, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleFR.57509.AGTCACTA.g.vcf) > List(/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_01_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_02_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_03_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_04_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_05_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_06_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_07_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_08_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_09_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_10_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_11_of_26/scatter.intervals, 
/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_12_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_13_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_14_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_15_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_16_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_17_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_18_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_19_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_20_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_21_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_22_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_23_of_26/scatter.intervals, 
/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_24_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_25_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_26_of_26/scatter.intervals) INFO 10:41:26,629 FunctionEdge - Output written to /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/scatter/scatter.out INFO 10:41:26,747 QGraph - 28 Pend, 1 Run, 0 Fail, 0 Done INFO 10:41:56,651 FunctionEdge - Done: LocusScatterFunction: List(/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleCADB.79202.GACAGTGC.g.vcf.idx, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleNFBC.1183222.CGGATTGC.g.vcf.idx, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleNFBC.1183222.CGGATTGC.g.vcf, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleFR.79707.GATGAATC.g.vcf, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleFR.57509.AGTCACTA.g.vcf.idx, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleFR.79707.GATGAATC.g.vcf.idx, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleCADB.79202.GACAGTGC.g.vcf, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/hs37d5.fa, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleFR.57509.AGTCACTA.g.vcf) > 
List(/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_01_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_02_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_03_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_04_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_05_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_06_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_07_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_08_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_09_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_10_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_11_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_12_of_26/scatter.intervals, 
/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_13_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_14_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_15_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_16_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_17_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_18_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_19_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_20_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_21_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_22_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_23_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_24_of_26/scatter.intervals, 
/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_25_of_26/scatter.intervals, /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_26_of_26/scatter.intervals)
INFO 10:41:56,652 QGraph - Writing incremental jobs reports...
INFO 10:41:56,652 QJobsReporter - Writing JobLogging GATKReport to file /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/VCFFILE.jobreport.txt
INFO 10:41:56,686 FunctionEdge - Starting: 'java' '-Xmx10240m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/tmp' '-cp' '/work-zfs/darking1/software/src/gatk/Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'CombineGVCFs' '-L' '/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_01_of_26/scatter.intervals' '-R' '/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/hs37d5.fa' '-disable_auto_index_creation_and_locking_when_reading_rods' '-V' '/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleFR.79707.GATGAATC.g.vcf' '-V' '/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleCADB.79202.GACAGTGC.g.vcf' '-V' '/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleNFBC.1183222.CGGATTGC.g.vcf' '-V' '/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/SampleFR.57509.AGTCACTA.g.vcf' '-o' '/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_01_of_26/VCFFILE'
INFO 10:41:56,686 FunctionEdge - Output written
to /work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.calling/SampleVCF/.queue/scatterGather/VCFFILE-1-sg/temp_01_of_26/VCFFILE.out
INFO 10:41:56,707 GridEngineJobRunner - Native spec is: -V -l h_rss=12288M
ERROR 10:41:56,709 Retry - Caught error during attempt 1 of 4.

Created 2015-09-22 12:51:29 | Updated | Tags: bqsr queue scala

Hi, I am trying to speed up BaseRecalibrator by using GATK Queue. However, it does not seem to play well with scatter-gather: when I run it with the -BQSR option, it fails with an error message that seems to point at GATK Queue itself and not (at least not directly) at something in my script. My script is just a wrapper:

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class MyBaseRecalibrator extends QScript {
  @Input(shortName="R") var referenceFile: File = _
  @Input(shortName="I") var bamFile: File = _
  @Input(shortName="L", required=false) var targetIntervals: File = _
  @Argument(shortName="interval_padding", required=false) var intervalPadding: Int = _
  @Input(shortName="knownSites") var knownSites: List[File] = Nil
  @Input(shortName="BQSR", required=false) var bqsrFile: File = _
  @Output(shortName="o") var outFile: File = _
  @Argument(shortName="scatterCount") var scatterCount: Int = _

  def script() {
    val br = new BaseRecalibrator
    br.R = referenceFile
    br.I :+= bamFile
    br.L :+= targetIntervals
    br.interval_padding = intervalPadding
    br.BQSR = bqsrFile
    br.knownSites = knownSites
    br.o = outFile
    br.scatterCount = scatterCount
    br.memoryLimit = 8
    add(br)
  }
}

When the BQSR option is left out, everything works great, but when it is included, the error message is as follows:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NullPointerException
    at org.broadinstitute.gatk.queue.extensions.gatk.GATKScatterFunction$$anonfun$intervalFilesExist$1.apply(GATKScatterFunction.scala:90)
    at org.broadinstitute.gatk.queue.extensions.gatk.GATKScatterFunction$$anonfun$intervalFilesExist$1.apply(GATKScatterFunction.scala:90)
    at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
    at scala.collection.immutable.List.exists(List.scala:84)
    at org.broadinstitute.gatk.queue.extensions.gatk.GATKScatterFunction$class.intervalFilesExist(GATKScatterFunction.scala:90)
    at org.broadinstitute.gatk.queue.extensions.gatk.ContigScatterFunction.intervalFilesExist(ContigScatterFunction.scala:35)
    at org.broadinstitute.gatk.queue.extensions.gatk.ContigScatterFunction.scatterCount(ContigScatterFunction.scala:37)
    at org.broadinstitute.gatk.queue.function.scattergather.ScatterGatherableFunction$class.generateFunctions(ScatterGatherableFunction.scala:131)
    at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$fillGraph$1.apply(QGraph.scala:182)
    at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$fillGraph$1.apply(QGraph.scala:179)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61)
    at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------
INFO 11:26:09,842 QCommandLine - Shutting down jobs. Please wait...

None of the lines seem to refer to my code. Any clues? Thanks, Michael

Created 2015-09-16 07:21:04 | Updated | Tags: queue drmaa memory slurm

Hi, I am trying to use GATK Queue to submit scatter-gather jobs on our local cluster. However, the jobs almost always run out of memory, because only 1GB is allocated by default. I run Queue with the -jobRunner Drmaa option, and I have tried all different combinations of -memLimit 8 -resMemLimit 8 -resMemReq 8 to see if I could somehow force it to allocate more, but the problem persists. Any ideas how to increase the memory allocation? Our cluster uses SLURM for job submission.

Created 2015-07-11 00:37:01 | Updated 2015-07-11 00:42:28 | Tags: queue

I'm using Queue 3.4 with a Univa GridEngine cluster. Everything runs fine, and at first the time between job submissions is only a few seconds. But the rate at which Queue submits jobs to the cluster slows to a crawl - up to 40 seconds per job after a few hundred submissions, and slowly increasing. Any idea what might be happening here?
With a scatterCount of 1000, it can take a full day just to submit all the jobs to the cluster for a HaplotypeCaller run.

Created 2015-05-05 00:40:58 | Updated | Tags: haplotypecaller queue gatk

Hey folks, We've run into a situation where we have hundreds of Queue-managed GATK HaplotypeCaller.scala jobs we'd like to run. This has led to hundreds of instances of Queue.jar sitting in their post-scatter watching state, holding on to cores in our cluster. I'm curious about plans for Queue and whether a job-dependency-tree model has been considered for future versions, or whether it is already available. By that I mean something along these lines: the scatter kicks off the child jobs, and a check/resubmit-faileds/gather job is queued on the cluster with qsub -W depend=after:child1jobid[:child2jobid: ...:childNjobid]. The initial scatter job would end, freeing up a core, and resources would become available to the sub-jobs. Then, when the child jobs finish up, the dependencies would be met and the next job would execute (when resources are available), pick up the successful ones, resubmit the failed ones, lather, rinse, repeat. Thanks in advance for your patience with the phrasing of my question, as I am approaching this as a cluster admin, not as the developer who implemented the queue. Cheers, Scott

Created 2015-04-30 06:34:42 | Updated 2015-05-01 15:02:14 | Tags: unifiedgenotyper ploidy queue

Hi, I have several questions about the "UnifiedGenotyper" tool and its corresponding QScript. 1. I first tested this tool on a single chromosome of my chicken sample (the BAM file is about 800M) to call SNPs, with the ploidy value set to 30. It took half a day to finish. Is it normal for a single chromosome to take so long? Because of this, I tried to use scatter-gather to parallelize my job in order to make full use of my multi-node server. I built my own script based on modifying the ExampleUnifiedGenotyper.scala.
I defined "ploidy" as an argument in my QScript, then assigned a value of 30 to it on my bash command line:

java -Djava.io.tmpdir=$tmp_dir \
    -jar /sw/apps/bioinfo/GATK-Queue/3.2.2/Queue.jar \
    -S /path/to/ug.scala \
    -R $ref_file \
    -I /path/to/bamfile.bam \
    -o /path/to/my.vcf \
    -ploidy 30 \
    -run

2. The script ran successfully, but the output VCF file reported sample_ploidy=2; it seems not to respond to the "ploidy" argument. If I then add one line to the def script() part, namely genotyper.sample_ploidy = qscript.sample_ploidy, GATK gives the error "sample_ploidy is not a member of TestUnifiedGenotyper.this.UnifiedGenotyperArguments". Could you please help point out where I went wrong? I also have some questions about how to assign values to arguments such as this.memoryLimit and genotyper.scatterCount. 1. Based on the explanation given in the example scala file, this.memoryLimit sets the memory limit for each command. In my case I have only one command in my script, so can I assign it as much memory as I have? I am running on a server with multiple nodes, each node with 16 cores, and each core with 8G RAM. So, in my understanding, if I run this job on one node with 16 cores, can I set this.memoryLimit=128 (16*8)? Do I understand correctly? 2. About scatterCount: can I set it to the number of cores I have on the server, i.e. 16 in the example above? Thank you very much for your help and time! Attached is the QScript relating to my question.
My QScript is as follows:

package org.broadinstitute.gatk.queue.qscripts.examples

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class TestUnifiedGenotyper extends QScript {
  qscript =>

  @Input(doc="The reference file for the bam files.", shortName="R")
  var referenceFile: File = _

  @Input(doc="Bam file to genotype.", shortName="o")
  var outputFile: File = _

  @Input(doc="Bam file to genotype.", shortName="I")
  var bamFile: File = _

  @Input(doc="An optional file with dbSNP information.", shortName="D", required=false)
  var dbsnp: File = _

  @Argument(doc="Ploidy (number of chromosomes) per sample.", shortName="ploidy")
  var sample_ploidy: List[String] = Nil

  trait UnifiedGenotyperArguments extends CommandLineGATK {
    this.reference_sequence = qscript.referenceFile
    this.memoryLimit = 2
  }

  def script() {
    val genotyper = new UnifiedGenotyper with UnifiedGenotyperArguments
    genotyper.scatterCount = 3
    genotyper.input_file :+= qscript.bamFile
    // genotyper.sample_ploidy = qscript.sample_ploidy
    genotyper.out = swapExt(outputFile, qscript.bamFile, "bam", "unfiltered.vcf")
    add(genotyper)
  }
}

Created 2015-04-15 01:56:33 | Updated | Tags: queue scala

I'm just learning to create QScripts right now. Currently I'm using CommandLineFunction to create the calls for bcl2fastq, bwa, samtools, and picard. Right now I'm stuck trying to find a way to pass a system path to my script as an argument: I'm trying to run "cd /path/to/rundirectory/" followed by "bcl2fastq". How do I pass the path to the run directory to the "cd" command?

Created 2015-04-13 20:30:12 | Updated | Tags: queue

I'm running Queue-3.3-0 on an Ubuntu server, and running it deletes every file in the working directory. I've only tried this on one machine so far, but I have reproduced this behavior repeatedly in different situations and confirmed that the directory emptied is precisely the current directory, at least in my setup. This behavior seems dangerous to me.
I can submit a detailed bug report. Thanks, Douglas

Created 2015-03-17 21:15:25 | Updated 2015-03-17 21:21:48 | Tags: queue

The HaplotypeCaller documentation recommends using Queue to parallelize HaplotypeCaller instead of -nct, so I've been attempting to do that; however, I can't seem to get Queue to do any kind of parallel processing. I'm currently working on a machine with 8 cores, and I consistently get Queue to run, but only single-threaded. I don't have access to a distributed computing environment, but I don't see why Queue wouldn't be able to parallelize on one machine with multiple cores, and I see no documentation indicating that threading by Queue is only available in distributed computing environments. What I've done is a minimal modification of the ExampleUnifiedGenotyper.scala script so that it runs HaplotypeCaller. I tried running it a couple of times with just the reference file and mapping file as input, and a couple more times with an intervals file listing each chromosome as a separate interval. Every time, it ran single-threaded. I've found several articles and comments indicating that Queue should be used to scatter/gather a job, and even explaining how scatter/gather works, so I was under the assumption that this is just what Queue does and that it would use multi-core systems to their full advantage; however, this has not been my experience, and I don't see anything in the documentation to explain why. If it could be explained to me either how I'm running the command wrong, or why Queue can't be used to parallelize on one machine, I would be very grateful.
Created 2015-03-11 13:50:49 | Updated | Tags: queue commandlinefunction

I have something like this:

case class Command1(param1, param2) extends CommandLineFunction {
  println("This is Command1")
}

case class Command2(param1, param2, param3) extends CommandLineFunction {
  Command1(param1, param2)
}

The call to Command1 inside Command2 seems to be skipped. How should I do this properly? The motivation is that Command2 is too long and complicated, so I want to split the work into smaller steps.

Created 2015-02-16 16:20:08 | Updated 2015-02-16 16:22:23 | Tags: queue jobrunner

Hi all! I'm working on getting a parallel version of the ShellJobRunner working in Queue, which would allow us to parallelize some parts of our workflows that currently run single-core on a full node under the ShellJobRunner and thus waste a lot of resources. I thought I'd made some rather nice progress, until I noticed that for any job running longer than about 5 minutes the job runner would exit saying its job had failed, while in reality the job keeps running (so it obviously did not fail, and Queue doesn't kill it either). The code I've come up with so far is available here: https://gist.github.com/johandahlberg/a9b7ac61c3aa2c654899 (as you can see, it's mostly stolen from the regular ShellJobRunner, with some Scala Future stuff mixed in). I'm guessing the problem comes from me abusing the ProcessController (and admittedly there are warnings in its source about it not being thread-safe), but I'm not sure if there is any way to get around that. Any pointers here would be extremely appreciated - also, if there is any general interest in this feature, I'd be happy to clean up the code a bit and submit a pull request upstream. /Johan

Created 2015-02-09 11:59:49 | Updated | Tags: queue scatter-gather wrappers

Hi guys, I've been trying to do something supposedly simple, i.e.
annotating a VCF file with a custom annotation, using Queue with a custom wrapper. I followed the instructions here: https://www.broadinstitute.org/gatk/events/3391/GATKw1310-Q-4-Advanced_Queue.pdf However, since I'm working with a VCF file, I thought I could distribute my job(s) better by scattering/gathering the input, benefiting from Queue's functionality. Following this presentation, https://www.broadinstitute.org/gatk/events/3391/GATKw1310-Q-3-Queue_Parallelism.pdf, I thought .scatterCount would be available natively by extending CommandLineFunction, but apparently I get a message saying it's not a member of my class. Could you please suggest how I can scatter/gather a VCF file when I have to process it with a custom wrapper? I haven't found this question answered before, but I'm happy to read elsewhere if it has been already. This is my script:

package org.broadinstitute.gatk.queue.qscripts

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._
import org.broadinstitute.gatk.queue.util.QScriptUtils
import org.broadinstitute.gatk.utils.commandline._
import org.broadinstitute.gatk.queue.function.scattergather._
import collection.JavaConversions._

class customAnnotation extends QScript {
  // Create an alias 'qscript' to be able to access variables
  qscript =>

  // Required arguments. All initialized to empty values.
  @Input(doc="VCF file to be annotated", fullName="vcf", shortName="V", required=true)
  var inVcf: File = _

  /*********************************************************
   * definitions of names
   *********************************************************/
  val baseName = swapExt(qscript.inVcf, "vcf", "anno")
  val myOut = new File(baseName + ".customAnno.vcf")
  val annotationOut = new File(baseName + ".parsed.vcf")
  val testFile = new File(baseName + ".TEST")

  /*********************************************************
   * CUSTOM annotation as command line
   *********************************************************/
  class MyAnnotation extends CommandLineFunction {
    @Input(doc = "input VCF file")
    val input: File = qscript.inVcf

    @Output(doc = "output VCF file")
    val output: File = qscript.myOut

    this.jobNativeArgs = Seq("--mem=12000")
    this.jobNativeArgs ++= Seq("--time=12:00:00")

    // job name
    override def jobRunnerJobName = "myAnno"

    this.scatterCount = 30

    override def commandLine = required("perl ~/tools/customAnno.pl") +
      required("-i", input) +
      required("-o", output)
  }

  /***************************************************
   * main script
   ***************************************************/
  def script() {
    val myanno = new MyAnnotation
    add(myanno)
  }
}

and this is the error I get:

Picked up _JAVA_OPTIONS: -XX:ParallelGCThreads=1
INFO 12:01:38,767 QScriptManager - Compiling 1 QScript
ERROR 12:01:40,198 QScriptManager - testAnno.scala:56: value scatterCount is not a member of customAnnotation.this.MyAnnotation
ERROR 12:01:40,200 QScriptManager - this.scatterCount = 30
ERROR 12:01:40,200 QScriptManager - ^
ERROR 12:01:40,225 QScriptManager - two errors found
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.gatk.queue.QException: Compile of /home/lescai/pipeline/annotation/testAnno.scala failed with 2 errors
    at org.broadinstitute.gatk.queue.QScriptManager.loadScripts(QScriptManager.scala:79)
    at org.broadinstitute.gatk.queue.QCommandLine.org$broadinstitute$gatk$queue$QCommandLine$$qScriptPluginManager$lzycompute(QCommandLine.scala:95)
    at org.broadinstitute.gatk.queue.QCommandLine.org$broadinstitute$gatk$queue$QCommandLine$$qScriptPluginManager(QCommandLine.scala:93)
    at org.broadinstitute.gatk.queue.QCommandLine.getArgumentSources(QCommandLine.scala:230)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:205)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:62)
##### ERROR ------------------------------------------------------------------------------------------

thanks for helping an inexperienced .scala user :) Francesco
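The compile error in the post above lines up with where scatterCount lives in Queue's class hierarchy: the stack traces elsewhere on this page place it with org.broadinstitute.gatk.queue.function.scattergather.ScatterGatherableFunction, which the generated GATK walker wrappers mix in, while a plain CommandLineFunction has no such field. A minimal sketch of the distinction, assuming the GATK 3.x package layout shown in those traces (not a working scatter/gather implementation):

```scala
import org.broadinstitute.gatk.queue.function.CommandLineFunction
import org.broadinstitute.gatk.queue.function.scattergather.ScatterGatherableFunction

// A plain shell wrapper: compiles fine, but has no scatterCount member,
// so `this.scatterCount = 30` is a compile error here.
class PlainWrapper extends CommandLineFunction {
  def commandLine = "perl customAnno.pl -i in.vcf -o out.vcf"
}

// scatterCount is only available on functions that mix in
// ScatterGatherableFunction (as the generated GATK walker extensions do).
// A custom tool would also need its own scatter and gather functions wired
// up before Queue could actually split the input -- assumed API, sketch only.
```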

Created 2015-01-23 19:00:21 | Updated | Tags: queue qscript scala

Right now, when I run "java -jar Queue.jar QScript.scala", scalac ("QScriptManager") complains that it can't find classes and objects that are in the same package as my QScript, as well as methods declared in my package object. When I try to import these explicitly, I get "object is not a member of package x".

How can I tell Queue to pass to scalac all of the files in my package?

I'm using Scala 2.11.4.

Thanks!

Created 2014-12-23 01:56:20 | Updated 2014-12-23 01:57:55 | Tags: variantrecalibrator queue

Hello GATK team, When I use the VariantRecalibrator, I need to set the resource as

-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf

When I use the VariantRecalibrator with Queue, I don't know how to set this in the scala script. In Queue, we pass parameters as (key value) pairs, but how can it be (key:value value)?
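One pattern worth noting: Queue carries tags on command-line inputs via its TaggedFile class (the 2014-08-07 post further down this page uses it for tagged -V arguments). A hedged sketch of how the resource line might look in a QScript; the `resource` field name on the generated VariantRecalibrator wrapper and the TaggedFile import path are assumptions, not verified against the 3.x codegen:

```scala
import org.broadinstitute.gatk.queue.extensions.gatk._
import org.broadinstitute.gatk.utils.commandline.TaggedFile  // import path assumed

val vr = new VariantRecalibrator
// The tag string holds everything between "-resource:" and the file path;
// the field name `resource` is assumed from the walker's argument name.
vr.resource :+= new TaggedFile("hapmap_3.3.b37.sites.vcf",
  "hapmap,known=false,training=true,truth=true,prior=15.0")
```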

Created 2014-12-08 23:26:39 | Updated | Tags: queue

Hi all:

When using IndelRealigner with Queue (v3.3) we are getting an error from ConcatenateLogsFunction about one of the log files being missing:

ERROR 15:07:41,484 FunctionEdge - Error: Concat: List(/share/data/resources/gatk_v3.3/tests/.queue/scatterGather/gatk-2-sg/scatter/scatter.out, /share/data/resources/gatk_v3.3/tests/.queue/scatterGather/gatk-2-sg/temp_1_of_5/realigned.bam.out, /share/data/resources/gatk_v3.3/tests/.queue/scatterGather/gatk-2-sg/temp_2_of_5/realigned.bam.out, /share/data/resources/gatk_v3.3/tests/.queue/scatterGather/gatk-2-sg/temp_3_of_5/realigned.bam.out, /share/data/resources/gatk_v3.3/tests/.queue/scatterGather/gatk-2-sg/temp_4_of_5/realigned.bam.out, /share/data/resources/gatk_v3.3/tests/.queue/scatterGather/gatk-2-sg/temp_5_of_5/realigned.bam.out, /share/data/resources/gatk_v3.3/tests/.queue/scatterGather/gatk-2-sg/gather-out/gather-realigned.bam.out) > /share/data/resources/gatk_v3.3/tests/scala_test_out/realigned.bam.out
org.broadinstitute.gatk.queue.QException: Unable to find log: /share/data/resources/gatk_v3.3/tests/.queue/scatterGather/gatk-2-sg/gather-out/gather-realigned.bam.out
    at org.broadinstitute.gatk.queue.function.scattergather.ConcatenateLogsFunction.run(ConcatenateLogsFunction.scala:50)
    at org.broadinstitute.gatk.queue.engine.InProcessRunner.start(InProcessRunner.scala:53)
    at org.broadinstitute.gatk.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
    at org.broadinstitute.gatk.queue.engine.QGraph.runJobs(QGraph.scala:434)
    at org.broadinstitute.gatk.queue.engine.QGraph.run(QGraph.scala:156)
    at org.broadinstitute.gatk.queue.QCommandLine.execute(QCommandLine.scala:171)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala:62)
    at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)

The contents of the folder /share/data/resources/gatk_v3.3/tests/.queue/scatterGather/gatk-2-sg/gather-out/ is:

gather-realigned.bam.utt

So it appears that
the name of this file is getting mangled at some point by Queue. The other parts of the pipeline we have tried so far (BaseRecalibrator, RealignerTargetCreator) seem to work, so we are not sure whether this is specific to BAM output. We are using Queue/GATK (3.3-0-geee94ec), cloned from the gatk-protected repository, in conjunction with a custom jobRunner for HTCondor. We can provide additional info as needed. Any thoughts would be appreciated. Thanks, Dan

Created 2014-10-28 21:43:30 | Updated | Tags: queue

I put FindCoveredIntervals into a Queue script and ran it with 25 scatters (one for each chromosome in hg19). The 25_of_25 job fails all attempts to run:

Failed functions: Attempt 4 of 4. 'java' '-Xmx4096m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/temp' '-cp' '/usr/local/bio_apps/Queue-3.2/Queue.jar' 'org.broadinstitute.gatk.engine.CommandLineGATK' '-T' 'FindCoveredIntervals' '-I' '/path/bwa.bam' '-L' '/path/.queue/scatterGather/Coverage-1-sg/temp_25_of_25/scatter.intervals' '-L' 'unmapped' '-R' '/path/hg19.fa' '-o' '/path/.queue/scatterGather/Coverage-1-sg/temp_25_of_25/bwa.20xCov.Queue.list' '-cov' '20' '-minBQ' '17' '-minMQ' '20'
Logs: /path/.queue/scatterGather/Coverage-1-sg/temp_25_of_25/bwa.20xCov.Queue.list.out

Note that there are two -L arguments in the job submission command. In the log I see this error:

ERROR
ERROR MESSAGE: Interval list specifies unmapped region. Only read walkers may include the unmapped region.
ERROR ------------------------------------------------------------------------------------------

It appears that the "-L unmapped" parameter is the reason for the error. Is this a bug in how Queue creates the scatter for FindCoveredIntervals? The error says that only read walkers may process with "-L unmapped". Is there a way to force it to not include the unmapped reads so I can avoid the error?
Thanks, Andrew

Created 2014-10-22 12:38:00 | Updated | Tags: haplotypecaller queue

I would like to run HaplotypeCaller on some WGS samples from 24 individuals. I have attempted to use the multithreading option, but would instead like to use the scatter-gather approach with Queue on our cluster, as this will hopefully run much quicker. I see that a patch was written some time back for Queue to work with the PBS scheduler; however, I am struggling to find the associated documentation for it. Could you please point me in the right direction or provide some advice on how to do this?

Created 2014-09-10 11:55:40 | Updated | Tags: queue scala

Dear GATK-Team, I am trying to add non-GATK software to my current Queue pipeline and have been following the Advanced Queue Usage. However, I get the following error when running my bash script, and I don't see where this error is coming from. Do I fail to import a needed library?

INFO 13:40:59,347 QScriptManager - Compiling 1 QScript
ERROR 13:40:59,551 QScriptManager - map.scala:18: not found: type CommandLineFunction
ERROR 13:40:59,555 QScriptManager - class RunBlast extends CommandLineFunction {
ERROR 13:40:59,557 QScriptManager - ^

Here is my scala script:

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class RunBlast extends CommandLineFunction {
  @Input(doc="File to use as query")
  var queryFile: File = _

  @Ouput(doc="File to write output results")
  var blastHits: File = _

  @Argument(doc="BLAST algorithm to use")
  var algorithm: String = "blastn"

  @Argument(doc="Database to query against")
  var database: String = "nt"

  def commandLine = algorithm + " -db " + database + " -query " + queryFile + " -out " + blastHits
}

class ScriptToRunBlast extends QScript {
  def script() {
    val runBlast = new RunBlast
    runBlast.queryFile = new File("mySequence.fasta")
    runBlast.blastHits = new File("blastHits.out")
    add(runBlast)
  }
}

And my command line invocation:

java -Xmx500M \
    -jar $HOME"/src/Queue-3.2-2/Queue.jar" \
    -S $HOME"/scripts/humanomics/scala/map.scala" \
    -run \
    -startFromScratch

Thanks in advance

Created 2014-08-27 14:38:55 | Updated | Tags: queue developer pipeline

Let's say I have a case class in a qscript like so:

case class getFileFromRemote(remoteFile: File, localFile: File) extends ExternalCommonArgs with SixteenCoreJob with FourDayJob {
  @Input(doc = "remote") val _remoteFile = remoteFile
  @Output(doc = "local") val _localFile = localFile
  def commandLine = "fetchFile.sh " + remoteFile + " " + localFile
  this.isIntermediate = true
  this.jobName = "fetchFile"
  this.analysisName = this.jobName
}

Then I can add it to my script() like so:

add( getFileFromRemote("user@server:/path/to/remote", "path/to/local") )

All is well so far. Let's say that I have lots of files to fetch, so I add jobs in a for loop over a Seq of file names, and then add downstream jobs as usual. The problem that I run into is that all 1000+ fetchFile.sh sessions (which use irods/irsync behind the scenes) will start at the same time, choking the system, and nothing will get downloaded. One solution to my problem would be to be able to set a limit in my fetcher case class, telling Queue to never submit more than 5 (or so) of these jobs to the cluster. Is that possible, or can anyone see another way around this?

Created 2014-08-07 17:34:25 | Updated | Tags: queue

The 3.2 CombineVariants Queue extension does not seem to pass through the TaggedFile information. I assume the extension generator is not recognizing the particular input type used by CombineVariants. For example:

this.V = List(new TaggedFile("file1.vcf","file1"), new TaggedFile("file2.vcf","file2"))

in a Scala class that extends CombineVariants will produce:

-V file1.vcf -V file2.vcf

without the tags.

Created 2014-06-12 12:22:10 | Updated | Tags: queue scala

Dear GATK Team, I am wondering about future plans for the Queue framework. I find it a useful framework for writing and running pipelines on computing clusters.
However, I have often found myself wanting to use Queue in pipelines without any GATK walkers at all. Are there any plans to release Queue in the future as its own, GATK-independent package? I know that internally there are some shared classes (e.g. the command line parser), and refactoring them so that Queue can be GATK-free may require a little more work. But I'm just interested to know whether there are already plans to do this (or whether it is perhaps even already ongoing). Cheers, konuva

Created 2014-06-09 13:45:13 | Updated | Tags: queue qscript commandlinefunction

Hi, I've been using Queue extensively recently, and it's great. I'm now trying to run a QFunction that takes a list of input files and uses them all as dependencies, just as they would be if they were single inputs. Here's an example of what I want:

case class reportWriter(inFiles: List[File], file: File, report: DataReport) extends CommandLineFunction {
  @Input(doc="files to rely on") val files: List[File] = inFiles
  @Output(doc="report file to write") val reportFile: File = file
  def commandLine = "reportWriter " + inFiles + " > " + reportFile
  this.isIntermediate = false
  this.jobName = "writeReport"
  this.analysisName = this.jobName
}

I then populate a List[File] parameter in my pipeline and pass it as an argument like so:

add( GeneralUtils.writeReport(fileList, reportFile, report) )

However, the files in my List[File] are not treated as dependencies, and the job runs as the first thing that happens in the pipeline. Is there a way to use a List[File] as input to a CommandLineFunction, and have it use all elements of the List as dependencies?

Created 2014-06-06 07:46:07 | Updated | Tags: queue scala

Is there a way in Queue to have a job executed locally instead of submitting it to the cluster with drmaa? I know this must be possible, since the scatter jobs are also not submitted to the cluster; I just could not find out how to do it. For example, here is a snippet of a job that can easily run locally:

class Ln extends CommandLineFunction {
  @Input(doc="Input file") var in: File = _
  @Output(doc="Link destination") var out: File = _
  def commandLine = {
    "ln -sf " + required(in) + required(out)
  }
}

I hope you can help me out with this.

Created 2014-06-03 16:39:33 | Updated | Tags: haplotypecaller queue performance

Hello, I adapted the example UnifiedGenotyper script to run HaplotypeCaller. Unfortunately, so far I haven't been able to get it to use multiple CPUs. I am using a server with 64 cores (AMD Opteron) and 512 GB RAM. Invoking the -nct option made no difference either. The example job is a highly heterogeneous set of 6 samples with a 270 Mbp reference; the predicted runtime is 6 weeks (and we will need to genotype batches of up to 40 samples...). The script creates ten output directories (/.queue/scatterGather/hc-1-sg/temp_01_of_10, etc.), but files are only produced in one of them. The local guru cannot spot the problem either. I attach the script - I will appreciate some feedback. As a sidenote, does the memoryLimit refer to the entire process, or is it per core? Many thanks, Chris

# Command: /queue -S hc.scala -R /whale-data/kk443/mapping/scaffolds_cinnabar_W.fasta -I test_bams.list -NCT 4 -SCC 20 -MBQ 20 -H 0.1 -IH 0.001 -O test2.10cpu.4nct.output.vcf -run
# Script: //kk443 AT cam.ac.uk 30/05/2014 Script to run the GATK HaplotypeCaller with Queue parallelism.

package org.broadinstitute.sting.queue.qscripts.examples

import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._

class ExampleUnifiedGenotyper extends QScript {
  qscript =>

  // Required arguments. All initialized to empty values.
      @Input(doc="The reference file for the bam files.", shortName="R")
      var referenceFile: File = _ // _ is Scala shorthand for null

      @Input(doc="Bam file to genotype.", shortName="I")
      var bamFile: File = _

      @Argument(shortName="H")
      var hets: Float = _

      @Argument(shortName="IH")
      var indelHeterozygosity: Float = _

      @Output(shortName="O")
      var o: File = _

      // The following arguments are all optional.
      @Argument(shortName="SCC", required=false)
      var stand_callconf: Int = _

      @Argument(shortName="MBQ", required=false)
      var mbq: Int = _

      @Argument(shortName="NCT", required=false)
      var nct: Int = _

      // This trait allows us to set the variables below in one place,
      // and then reuse this trait on each CommandLineGATK function below.
      trait HaplotypeCallerArguments extends CommandLineGATK {
        this.reference_sequence = qscript.referenceFile
        this.intervals = if (qscript.intervals == null) Nil else List(qscript.intervals)
        // Set the memory limit on each command.
        this.memoryLimit = 40
      }

      def script() {
        // Create the functions that we may run depending on options.
        val genotyper = new HaplotypeCaller with HaplotypeCallerArguments
        genotyper.scatterCount = 10
        genotyper.input_file :+= qscript.bamFile
        genotyper.out = swapExt(qscript.bamFile, "bam", "unfiltered.vcf")
        add(genotyper)
      }
    }

    # StdOut:
    INFO 23:27:36,528 QScriptManager - Compiling 1 QScript
    INFO 23:27:44,663 QScriptManager - Compilation complete
    INFO 23:27:44,806 HelpFormatter - ----------------------------------------------------------------------
    INFO 23:27:44,806 HelpFormatter - Queue v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:09
    INFO 23:27:44,807 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO 23:27:44,807 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 23:27:44,808 HelpFormatter - Program Args: -S hc.scala -R /whale-data/kk443/mapping/scaffolds_cinnabar_W.fasta -I test_bams.list -NCT 2 -SCC 20 -MBQ 20 -H 0.1 -IH 0.001 -O test4.20cpu.2nct.200GB.output.vcf -run
    INFO 23:27:44,809 HelpFormatter - Executing as kk443@bonobo--bio on Linux 3.2.0-60-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_25-b15.
    INFO 23:27:44,809 HelpFormatter - Date/Time: 2014/06/01 23:27:44
    INFO 23:27:44,810 HelpFormatter - ----------------------------------------------------------------------
    INFO 23:27:44,810 HelpFormatter - ----------------------------------------------------------------------
    INFO 23:27:44,818 QCommandLine - Scripting ExampleUnifiedGenotyper
    INFO 23:27:44,969 QCommandLine - Added 1 functions
    INFO 23:27:44,971 QGraph - Generating graph.
    INFO 23:27:44,994 QGraph - Generating scatter gather jobs.
    INFO 23:27:45,058 QGraph - Removing original jobs.
    INFO 23:27:45,061 QGraph - Adding scatter gather jobs.
    INFO 23:27:45,282 QGraph - Regenerating graph.
    INFO 23:27:45,362 QGraph - Running jobs.
    INFO 23:27:45,512 FunctionEdge - Starting: 'java' '-Xmx40960m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/whale-data/kk443/mapping/AAAfinished/.queue/tmp' '-cp' '/whale-data/biosoft/src/Queue-3.1-1/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'HaplotypeCaller' '-I' '/whale-data/kk443/mapping/AAAfinished/test_bams.list' '-L' '/whale-data/kk443/mapping/AAAfinished/.queue/scatterGather/hc-1-sg/temp_01_of_10/scatter.intervals' '-R' '/whale-data/kk443/mapping/scaffolds_cinnabar_W.fasta' '-o' '/whale-data/kk443/mapping/AAAfinished/.queue/scatterGather/hc-1-sg/temp_01_of_10/test_bams.listunfiltered.vcf'
    INFO 23:27:45,512 FunctionEdge - Output written to /whale-data/kk443/mapping/AAAfinished/.queue/scatterGather/hc-1-sg/temp_01_of_10/test_bams.listunfiltered.vcf.out

Created 2014-06-02 17:23:35 | Updated | Tags: haplotypecaller queue scatter-gather

Gidday, I have a question about choosing the number of scatter jobs when running the HaplotypeCaller in Queue. Basically, is there a hard and fast rule about how small you can split up the job? From what I understand of HC, given that it does local reconstruction of haplotypes anyway, splitting into more jobs shouldn't affect the results. (My current dataset is mouse whole-genome data with 24 samples, and even scattered into 250 jobs, the longest jobs still took ~6 days to run... I'd love to be able to speed it up, if I have to re-run HC, by splitting into more jobs - as long as it doesn't affect the results!) Thanks!

Created 2014-04-18 21:12:07 | Updated 2014-04-18 22:00:11 | Tags: queue

Hi all, I'm stumped by some behavior regarding job submission by Queue (although I'm not sure whether the problem is related to Queue). I wrote a QScript and tested it on a single machine running SGE; everything worked entirely as expected. I've since moved the analysis over to a cluster using UGE.
Everything works mostly as expected, except for two issues: one I posted about on the forums here, and one where job submissions take longer and longer over time. When the pipeline is first started, jobs are submitted rapidly (within a few seconds of each other). Over time, jobs take increasingly longer (e.g. minutes) to submit, regardless of the job. The trend can be seen in the attached .pdf file. I can kill the pipeline, immediately restart it (either from scratch or not), and reproduce the behavior. Additionally, I can qsub the same jobs from a bash for loop without any problems. The terminal pauses after outputting the "FunctionEdge - Output written to..." line for a minute (or more) between each submission, even with (what I think are) simple jobs:

    INFO 13:07:14,093 FunctionEdge - Starting: rm 12.sam 12.sorted.bam 12.sorted.bai 12.sorted.intervals 12.sorted.realigned.bam 12.sorted.realigned.bai
    INFO 13:07:14,093 FunctionEdge - Output written to /data/ryanabashbash/src/Queue1104bDev/CleanupIntermediates.out
    INFO 13:08:08,810 DrmaaJobRunner - Submitted job id: 37461

On the single machine with SGE, these same jobs were submitted almost instantaneously, and I can qsub the same jobs from a bash for loop on the cluster just as fast. I'm guessing it's some sort of interaction between Queue and the architecture that I'm overlooking? Does anyone have suggestions on how to get some traction in tracking down the cause, or has anyone seen similar behavior before? As an aside, my .out files all contain something like the following errors at the beginning of the file (I haven't been able to track down the source, and I'm not sure whether it's related; it doesn't seem to affect the output of the job):

    sh: module: line 1: syntax error: unexpected end of file
    sh: error importing function definition for module'

Thanks for any suggestions or pointers!
Created 2014-04-03 23:13:39 | Updated | Tags: queue vcf qscript scala

Hi, thanks to previous replies, I can now run Queue and the relevant walker on a distributed computing server. My question: I define my Scala script to require an argument for the output file, using the -o parameter, like so:

    // Required arguments. All initialized to empty values.
    ....
    @Input(doc="Output file.", shortName="o")
    var outputFile: File = _

How do I direct the result to a specified directory? Currently I have the code:

    genotyper.out = swapExt(qscript.bamFile, "bam", outputFile, "unfiltered.vcf")

When I include the string -o /path/to/my/output/files/MyResearch.vcf, the script creates a series of folders within the directory from which I execute Queue. In this case my results were sent to:

    /Queue-2-8-1-g932cd3a/MyResearch./path/to/my/output/files/MyResearch.unfiltered.vcf

when all I wanted was the output to appear at:

    /path/to/my/output/files/MyResearch.unfiltered.vcf

As always, any help is much appreciated.

Created 2014-04-02 15:37:38 | Updated | Tags: queue

Our cluster is set up in such a way that each compute node has a fairly generous scratch disk. This means it would be really nice to write temporary files to those scratch disks instead of writing them over the network to the globally accessible disk. I've been experimenting with different ways of doing this using Queue, but so far I've come up short, since Queue tries to create all temporary directories (or at least check for their existence) before the jobs are run. I would like to know if there is any way to redirect the temporary outputs to the scratch disk using environment variables (to achieve something like this in the end: mycommand -tmp $TMPDIR ...). Since Queue basically just sends commands to be executed on the compute nodes, I don't understand why it's necessary to check all temporary dirs beforehand.
I'm sure I'm missing something here, and I'd like to know what.

From my understanding of Queue I could see two types of temporary files that need to be written to globally accessible disk. The first being the compiled QScript, which location I guess could be set by changing the following piece of code in org.broadinstitute.sting.queue.QCommandLine from:

  private lazy val qScriptPluginManager = {
qScriptClasses = IOUtils.tempDir("Q-Classes-", "", settings.qSettings.tempDirectory)
new PluginManager[QScript](qPluginType, Seq(qScriptClasses.toURI.toURL))
}

to:

  private lazy val qScriptPluginManager = {
qScriptClasses = IOUtils.tempDir("Q-Classes-", "", "/path/to/globally/accessible/disk")
new PluginManager[QScript](qPluginType, Seq(qScriptClasses.toURI.toURL))
}

Of course not hard coded, but feed into the the QCommandLine as an argument, but this is just to illustrate the point.

And the other one would be scatter-gather jobs, which can be explicitly redirected using: --job_scatter_gather_directory, to make sure that all of those end up in a globally accessible part of the file system.

All other temporary files should be possible to write to local non-globally-accessible disk? Or am I completely wrong here, and this is not a path worth pursuing further?
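One hedged workaround for the scratch-disk question above (a sketch only, not a confirmed Queue feature; the command name and its -tmp flag are placeholders): keep the path out of Queue's own bookkeeping entirely by embedding the environment variable in the command string, so it is expanded by the shell on the compute node rather than checked up front on the submit host.

```scala
// Sketch: Queue checks/creates directories it knows about before running,
// but a path that exists only inside the command string is never inspected;
// ${TMPDIR:-/tmp} is resolved by the shell on the node itself.
case class MyCommand(inFile: File, outFile: File) extends CommandLineFunction {
  @Input(doc = "input")   val in: File  = inFile
  @Output(doc = "output") val out: File = outFile
  def commandLine = "mycommand -tmp ${TMPDIR:-/tmp} " + required(in) + required(out)
}
```

The @Input and @Output files still have to live on globally accessible disk, since Queue uses them for dependency tracking; only the command's internal temporaries land on the node-local scratch.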

Created 2014-03-21 22:43:33 | Updated 2014-03-21 22:44:48 | Tags: queue pbsengine

Hi,

I'm trying to use Queue with Torque through the PbsEngine jobRunner. The institutional cluster I'm trying to use doesn't allow users to select a queue. You are supposed to request the proper resources and a job routing algorithm selects the right queue for you.

I was able to confirm that making this change allows me to use Queue in this kind of environment.

In gatk-protected/public/queue-framework/src/main/scala/org/broadinstitute/sting/queue/engine/pbsengine/PbsEngineJobRunner.scala:

Change:

    // If the job queue is set specify the job queue
if (function.jobQueue != null)
nativeSpec += " -q " + function.jobQueue
else
nativeSpec += " -q normal "

to:

    // If the job queue is set specify the job queue
if (function.jobQueue != null)
nativeSpec += " -q " + function.jobQueue

Thanks, Carlos

PS: Does the GATK team accept pull requests? If so, I could submit a pull request with this change.

Created 2014-03-06 21:31:54 | Updated | Tags: queue jobrunner java drmaa

Hi (once more) I am attempting to run Queue with a scala script and scheduling it with jobrunner. The script works nicely, but when I run it with jobRunner I get the error

"Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'drmaa':libdrmaa.so: cannot open shared object file: No such file or directory."

When I try to pass the location of the libdrmaa.so file (-Djava.library.path=/opt/sge625/sge/lib/lx24-amd64/) the result is the same.

How would I point jobRunner to the correct path for the Drmaa.so library?
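One hedged suggestion (not a confirmed fix): on Linux the JVM's native library loader also searches the dynamic linker path, so exporting LD_LIBRARY_PATH before launching Queue often resolves this where -Djava.library.path does not. The SGE directory below is the one from the post; the Queue invocation is an example only.

```shell
# Make libdrmaa.so findable via the dynamic linker search path.
SGE_LIB=/opt/sge625/sge/lib/lx24-amd64
export LD_LIBRARY_PATH="$SGE_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
# then launch Queue as before, e.g.:
# java -jar Queue.jar -S myScript.scala -jobRunner Drmaa -run
```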

Created 2014-03-03 09:19:41 | Updated 2014-03-03 09:20:25 | Tags: queue commandline rsem

I'm working on adding RSEM to our RNA-seq pipeline, which uses Queue. RSEM takes a number of inputs on the command line, so I have a case class and override commandLine to make this work. Nothing special there.

However, RSEM wants a prefix for the output sample names. If I give it sample_name, it will generate a whole bunch of files: sample_name.genes.results with expression values for genes, sample_name.isoforms.results with expression values for isoforms, and sample_name.genome.bam, sample_name.genome.sorted.bam and sample_name.genome.sorted.bam.bai with mappings, etc.

What's the best way to handle this in terms of @Output?

Should I use (1):

case class rsem(inFq1: File, inFq2: File, prefix: String) extends ExternalCommonArgs {
...
@Output val myPrefix = prefix
...

and then use the prefix in the downstream jobs? Or should I use (2):

case class rsem(inFq1: File, inFq2: File, prefix: String, bam: File, geneResults: File) extends ExternalCommonArgs {
...
@Output val myBam = bam
@Output val myGeneRes = geneResults
...

In (2), I would still use prefix in the def commandLine, of course.

Is there a preferred way to handle this in Queue?
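A hedged sketch of option (2), assuming RSEM's fixed output-naming convention described above: derive the concrete @Output files from the prefix inside the case class, so Queue tracks real files for dependency resolution while commandLine still only passes the prefix. The RSEM command name and flags below are illustrative, not verified.

```scala
// Sketch: @Output fields are concrete files derived from the prefix, so
// downstream jobs can depend on them directly.
case class rsem(inFq1: File, inFq2: File, prefix: String) extends ExternalCommonArgs {
  @Input(doc = "first-mate FASTQ")  val fq1: File = inFq1
  @Input(doc = "second-mate FASTQ") val fq2: File = inFq2
  @Output(doc = "per-gene expression")   val geneResults: File = new File(prefix + ".genes.results")
  @Output(doc = "genome-coordinate BAM") val bam: File = new File(prefix + ".genome.bam")
  def commandLine = "rsem-calculate-expression --paired-end " +
    required(fq1) + required(fq2) + required(prefix)
}
```

The advantage over option (1) is that a String prefix is not a File, so Queue cannot check its existence or wire it into the dependency graph; the derived files can be both.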

Created 2014-02-24 22:44:00 | Updated 2014-02-24 22:45:40 | Tags: queue rscript

Hi,

I found that if I add library(grid) to gatk-protected/public/R/scripts/org/broadinstitute/sting/queue/util/queueJobReport.R, the error goes away.

INFO  15:49:53,263 QCommandLine - Writing final jobs report...
INFO  15:49:53,263 QJobsReporter - Writing JobLogging GATKReport to file /Users/cborroto/workspace/sciencemodule/data_processing/GenePeeksPipeline.jobreport.txt
INFO  15:49:53,270 QJobsReporter - Plotting JobLogging GATKReport to file /Users/cborroto/workspace/sciencemodule/data_processing/GenePeeksPipeline.jobreport.pdf
DEBUG 15:49:53,278 RScriptExecutor - Executing:
DEBUG 15:49:53,278 RScriptExecutor -   Rscript
DEBUG 15:49:53,278 RScriptExecutor -   -e
DEBUG 15:49:53,278 RScriptExecutor -   tempLibDir = '/Users/cborroto/workspace/sciencemodule/data_processing/tmp/Rlib.5689446532231761075';install.packages(pkgs=c('/Users/cborroto/workspace/sciencemodule/data_processing/tmp/RlibSources.4324191911112824298/gsalib'), lib=tempLibDir, repos=NULL, type='source', INSTALL_opts=c('--no-libs', '--no-data', '--no-help', '--no-demo', '--no-exec'));library('gsalib', lib.loc=tempLibDir);source('/Users/cborroto/workspace/sciencemodule/data_processing/tmp/queueJobReport.6856864909021911181.R');
DEBUG 15:49:53,278 RScriptExecutor -   /Users/cborroto/workspace/sciencemodule/data_processing/GenePeeksPipeline.jobreport.txt
DEBUG 15:49:53,278 RScriptExecutor -   /Users/cborroto/workspace/sciencemodule/data_processing/GenePeeksPipeline.jobreport.pdf
* installing *source* package ‘gsalib’ ...
** R
** building package indices
** testing if installed package can be loaded
* DONE (gsalib)
Attaching package: ‘gplots’
The following object is masked from ‘package:stats’:
lowess
Attaching package: ‘reshape’
The following objects are masked from ‘package:plyr’:
rename, round_any
[1] "Report"
[1] "Project          : /Users/cborroto/workspace/sciencemodule/data_processing/GenePeeksPipeline.jobreport.txt"
Error in do.call("layer", list(mapping = mapping, data = data, stat = stat,  :
could not find function "arrow"
Calls: source ... geom_segment -> <Anonymous> -> <Anonymous> -> do.call
Execution halted
DEBUG 15:49:55,207 RScriptExecutor - Result: 1
WARN  15:49:55,207 RScriptExecutor - RScript exited with 1

Thanks, Carlos

Created 2014-02-24 21:47:00 | Updated | Tags: haplotypecaller queue

I wrote my first script in Scala to run the HaplotypeCaller walker of GATK. However, I am running into some errors when I execute the *.scala script. I am unable to figure out the source of the error; any help will be appreciated.

package org.broadinstitute.sting.queue.qscripts

class haplotypeCaller extends QScript {

@Input(doc="Reference file for the bam files",shortName="R")
var referenceFile: File = _

@Input(doc="One or more bam files",shortName="I")
var bamFiles: List[File] = Nil

@Argument(doc="the interval string",shortName="L")
var intervalString: String = ""

@Argument(doc="heterozygosity to be considered",shortName="heterozygosity")
var het: Double = 0.006

@Argument(doc="which sites to emit",shortName="output_mode")
var outmode: String = "EMIT_ALL_SITES"

@Output(doc="outFile to write snps and indels",shortName="out")
var outFile: File = _

def script(){
val hc = new HaplotypeCaller
hc.memoryLimit=20
}

}

Command line arguments passed:

    java -jar ./Queue-2.7-4-g6f46d11/Queue.jar -S haplotypeCaller.scala -R ./reference/xx.fasta \
      -I x1.realigned.bam -I x2.realigned.bam -I x3.realigned.bam -I x4.realigned.bam -I x5.realigned.bam \
      -I x6.realigned.bam -I x7.realigned.bam -I x8.realigned.bam -I x9.realigned.bam -I x10.realigned.bam \
      -I x11.realigned.bam -I x12.realigned.bam \
      -L chr1 -heterozygosity 0.006 -output_mode EMIT_ALL_SITES -out gatk.hc.chr1.raw.snps.indels.vcf -run

error: ##### ERROR MESSAGE: Walker requires a reference but none was provided.

The reference file exists in the above mentioned path. I even tried running with absolute path for the reference file but was not successful. Any help will be appreciated.
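A hedged sketch of what the script above appears to be missing: the -R and -I values are parsed into the QScript's fields, but they are never copied onto the HaplotypeCaller function, and the function is never add()ed, so Queue launches the GATK without a reference. Something along these lines (untested, based only on the code shown):

```scala
// Sketch: wire the QScript arguments onto the walker and register it.
import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._

class haplotypeCaller extends QScript {
  @Input(doc="Reference file for the bam files", shortName="R")
  var referenceFile: File = _

  @Input(doc="One or more bam files", shortName="I")
  var bamFiles: List[File] = Nil

  @Output(doc="outFile to write snps and indels", shortName="out")
  var outFile: File = _

  def script() {
    val hc = new HaplotypeCaller
    hc.reference_sequence = referenceFile // this is where -R has to end up
    hc.input_file = bamFiles
    hc.out = outFile
    hc.memoryLimit = 20
    add(hc) // without add(), Queue has nothing to run
  }
}
```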

Created 2013-12-20 20:40:37 | Updated | Tags: queue jobs

Hi,

I've been using Queue for pipelining now for a little while, and have run into an issue with the job report. The issue is that it's blank; after the pipeline has finished, the entire contents of the file is

#:GATKReport.v1.1:0

I'm getting output from Queue saying that

 INFO  21:37:35,646 QJobsReporter - Writing JobLogging GATKReport to file /home/daniel.klevebring/bin/autoseqer/AutoSeqPipeline.jobreport.txt 

Any ideas why? The report would be a great tool, so I'm really interested in getting it to work properly.

Happy holidays,

Daniel

Created 2013-12-06 08:13:46 | Updated 2013-12-06 16:20:26 | Tags: queue

import org.broadinstitute.sting.gatk.walkers.variantrecalibration

@Input(doc="The reference file.", shortName="R")
val reference:File= null

def script ={
class leftsaiFormationFunction extends CommandLineFunction{
@Output(doc="leftSai")
val leftsai:File=new File("leftsai.sai")
required(leftsai)
}
val leftsaiFormation=new leftsaiFormationFunction()

class rightsaiFormationFunction extends CommandLineFunction{
@Output(doc="rightSai")
val rightsai:File=new File("rightsai.sai")
required(rightsai)
}
val rightsaiFormation=new rightsaiFormationFunction()

class combinedSamFormationFunction extends CommandLineFunction{
@Input(doc="leftSai")
val leftsai:File=leftsaiFormation.leftsai
@Input(doc="rightSai")
val rightsai:File=rightsaiFormation.rightsai
@Output(doc="combinedSam")
val combinedSam:File=new File("combined.sam")
def commandLine=required("bwa")+required("sampe")+required(reference)+required(leftsai)+
}
val combinedSamFormation=new combinedSamFormationFunction()

}
}

The error message reads as follows:
    lxd@lxd-System-Product-Name:~/Documents/queue$ java -jar Queue.jar -S 111111.scala -R human_g1k_v37.fasta -lr C166_Modified_1.fq -rr C166_Modified_2.fq -run -startFromScratch
    INFO 16:05:35,810 QScriptManager - Compiling 1 QScript
    INFO 16:05:39,696 QScriptManager - Compilation complete
    INFO 16:05:39,759 HelpFormatter - ----------------------------------------------------------------------
    INFO 16:05:39,759 HelpFormatter - Queue v2.7-2-g6bda569, Compiled 2013/08/28 16:33:34
    INFO 16:05:39,759 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO 16:05:39,759 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 16:05:39,760 HelpFormatter - Program Args: -S 111111.scala -R human_g1k_v37.fasta -lr C166_Modified_1.fq -rr C166_Modified_2.fq -run -startFromScratch
    INFO 16:05:39,760 HelpFormatter - Date/Time: 2013/12/06 16:05:39
    INFO 16:05:39,760 HelpFormatter - ----------------------------------------------------------------------
    INFO 16:05:39,760 HelpFormatter - ----------------------------------------------------------------------
    INFO 16:05:39,766 QCommandLine - Scripting MyScript
    INFO 16:05:39,800 QCommandLine - Added 3 functions
    INFO 16:05:39,801 QGraph - Generating graph.
    INFO 16:05:39,811 QGraph - Running jobs.
    INFO 16:05:39,812 QGraph - Removing outputs from previous runs.
    INFO 16:05:39,840 FunctionEdge - Starting: 'bwa' 'aln' 'human_g1k_v37.fasta' 'C166_Modified_1.fq' '>' '/home/lxd/Documents/queue/leftsai.sai'
    INFO 16:05:39,840 FunctionEdge - Output written to /home/lxd/Documents/queue/leftsai.sai.out
    INFO 16:05:39,858 FunctionEdge - Starting: 'bwa' 'aln' 'human_g1k_v37.fasta' 'C166_Modified_2.fq' '>' '/home/lxd/Documents/queue/rightsai.sai'
    INFO 16:05:39,858 FunctionEdge - Output written to /home/lxd/Documents/queue/rightsai.sai.out
    INFO 16:05:39,862 QGraph - 1 Pend, 2 Run, 0 Fail, 0 Done
    ERROR 16:06:09,834 FunctionEdge - Error: 'bwa' 'aln' 'human_g1k_v37.fasta' 'C166_Modified_1.fq' '>' '/home/lxd/Documents/queue/leftsai.sai'
    ERROR 16:06:09,839 FunctionEdge - Contents of /home/lxd/Documents/queue/leftsai.sai.out: /home/lxd/Documents/queue/.queue/tmp/.exec6752111377935879956: 2: /home/lxd/Documents/queue/.queue/tmp/.exec6752111377935879956: bwa: not found
    ERROR 16:06:09,842 FunctionEdge - Error: 'bwa' 'aln' 'human_g1k_v37.fasta' 'C166_Modified_2.fq' '>' '/home/lxd/Documents/queue/rightsai.sai'
    ERROR 16:06:09,842 FunctionEdge - Contents of /home/lxd/Documents/queue/rightsai.sai.out: /home/lxd/Documents/queue/.queue/tmp/.exec7253750678399969841: 2: /home/lxd/Documents/queue/.queue/tmp/.exec7253750678399969841: bwa: not found
    INFO 16:06:09,843 QGraph - Writing incremental jobs reports...
    INFO 16:06:09,844 QJobsReporter - Writing JobLogging GATKReport to file /home/lxd/Documents/queue/111111.jobreport.txt
    INFO 16:06:09,869 QGraph - 1 Pend, 0 Run, 2 Fail, 0 Done
    INFO 16:06:09,870 QCommandLine - Writing final jobs report...
    INFO 16:06:09,870 QJobsReporter - Writing JobLogging GATKReport to file /home/lxd/Documents/queue/111111.jobreport.txt
    INFO 16:06:09,876 QJobsReporter - Plotting JobLogging GATKReport to file /home/lxd/Documents/queue/111111.jobreport.pdf
    WARN 16:06:10,262 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info.
    INFO 16:06:10,263 QCommandLine - Done with errors
    INFO 16:06:10,264 QGraph - -------
    INFO 16:06:10,266 QGraph - Failed: 'bwa' 'aln' 'human_g1k_v37.fasta' 'C166_Modified_1.fq' '>' '/home/lxd/Documents/queue/leftsai.sai'
    INFO 16:06:10,266 QGraph - Log: /home/lxd/Documents/queue/leftsai.sai.out
    INFO 16:06:10,266 QGraph - -------
    INFO 16:06:10,268 QGraph - Failed: 'bwa' 'aln' 'human_g1k_v37.fasta' 'C166_Modified_2.fq' '>' '/home/lxd/Documents/queue/rightsai.sai'
    INFO 16:06:10,268 QGraph - Log: /home/lxd/Documents/queue/rightsai.sai.out
    INFO 16:06:10,269 QCommandLine - Script failed with 3 total jobs

Created 2013-12-04 15:59:35 | Updated | Tags: queue qnode debug

Hi,

I've been running Queue using the old DataProcessingPipeline.scala script (unmodified) for over a day now, and I'm starting to think there's an infinite loop. The output keeps adding Qnodes. Right now it's up to almost 1700000 QNodes. I've run the same script before with 1000 smaller bam files (about 1/2Gb each) and that seemed to work fine. Now I'm trying to run it on 1000 exomes, each on average 8Gb. I'm using the following options:

Here's the output:

INFO 11:09:17,548 HelpFormatter - ---------------------------------------------

    INFO 11:09:17,548 HelpFormatter - Queue v2.7-4-g6f46d11, Compiled 2013/10/10 17:29:52
    INFO 11:09:17,548 HelpFormatter - Copyright (c) 2012 The Broad Institute
    INFO 11:09:17,549 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    DEBUG 11:09:17,549 HelpFormatter - Current directory: /cluster/ifs/projects/Exomes/Biesecker_CS_WE/mito_express/ES/tmp
    INFO 11:09:17,549 HelpFormatter - Program Args: -S /home/singhln/Projects/ES/src/DataProcessingPipeline.scala -bwa /home/singhln/bin/bwa -bt 10 -i /home/singhln/Projects/ES/Data/allalignedbams.list -R /home/singhln/Projects/ES/Data/human_g1k_v37.fasta -D /home/singhln/Projects/ES/Data/dbsnp_137.b37.vcf -p CS -bwape -log cses.log -gv -qsub -startFromScratch -jobReport queuereport.txt -memLimit 4 -tempDir /home/singhln/Projects/ES/tmp -runDir /home/singhln/Projects/ES/Dedup -l DEBUG -run
    INFO 11:09:17,550 HelpFormatter - Date/Time: 2013/12/03 11:09:17
    INFO 11:09:17,550 HelpFormatter - ---------------------------------------------

INFO 11:09:17,550 HelpFormatter - ---------------------------------------------

    INFO 11:09:17,561 QCommandLine - Scripting DataProcessingPipeline
    DEBUG 11:10:46,667 QGraph - adding QNode: 0
    DEBUG 11:10:46,859 QGraph - adding QNode: 100
    DEBUG 11:10:47,024 QGraph - adding QNode: 200
    DEBUG 11:10:47,136 QGraph - adding QNode: 300
    DEBUG 11:10:47,241 QGraph - adding QNode: 400
    ...
    DEBUG 10:54:46,948 QGraph - adding QNode: 1647200
    DEBUG 10:55:11,294 QGraph - adding QNode: 1647300
    DEBUG 10:55:23,623 QGraph - adding QNode: 1647400
    DEBUG 10:55:40,081 QGraph - adding QNode: 1647500
    DEBUG 10:55:52,678 QGraph - adding QNode: 1647600
    DEBUG 10:56:02,486 QGraph - adding QNode: 1647700

I'm not even sure how I'd go about debugging this or if this is normal, but it does seem very strange to me. No output seems to have been created during the last 24 hours either, other than the log file.

Thanks for any help, -Larry.

Created 2013-11-26 18:10:04 | Updated | Tags: selectvariants queue qscript walkers java

I am working on a Queue script that uses the SelectVariants walker. Two of the arguments that I am trying to use, restrictAllelesTo and selectTypeToInclude, take an enumerated type. I have tried passing these as strings, but I get Java type mismatch errors. What is the simplest way to pass these parameters to the SelectVariants walker in the QScript?

Created 2013-11-25 16:47:02 | Updated | Tags: queue

I've submitted a pull request that adds a "logDirectory" argument to Queue. Quoting myself:

Complex Queue pipelines with a large number of intermediate steps can generate more log files than actual output, cluttering the working directory. This patch introduces a command-line parameter --logDirectory/-logDir that allows those files to be sequestered into a single directory.

The use of absolute pathnames changes the behavior of logDir. The possibilities are:

1. logDir is not specified: No change from the existing behavior (the .out file is in the same directory as the @Output)
2. The location for the @Output element is relative: The relative path for @Output will be rooted in logDir (whether logDir is relative or absolute)
3. The @Output element is absolute: logDir is ignored, as if it were not specified

My assumption is that absolute directories will be rare, and that when they are used they should override any other considerations. I've tested that this works as described when logDir is not specified and when it has a value of "log" (i.e., a relative path).

At the urging of the GSA team, and following a nice model introduced by @Johan_Dahlberg, I'd like to open this change up for public comment. Obviously, I'm only considering a fairly restricted use case. Does this behavior make sense to others? Does it break anybody's workflow in any significant way?

Created 2013-11-20 06:41:02 | Updated 2013-11-20 06:41:55 | Tags: queue qscript picard markduplicates

Hi,

So I've finally taken the plunge and migrated our analysis pipeline to Queue. With some great feedback from @johandahlberg, I have gotten to a state where most of the stuff is running smoothly on the cluster.

I'm trying to add Picard's CalculateHSMetrics to the pipeline, but am having some issues. This code:

case class hsmetrics(inBam: File, baitIntervals: File, targetIntervals: File, outMetrics: File) extends CalculateHsMetrics with ExternalCommonArgs with SingleCoreJob with OneDayJob {
@Input(doc="Input BAM file") val bam: File = inBam
@Output(doc="Metrics file") val metrics: File = outMetrics
this.input :+= bam
this.targets = targetIntervals
this.baits = baitIntervals
this.output = metrics
this.reference =  refGenome
this.isIntermediate = false
}

Gives the following error message:

ERROR 06:56:25,047 QGraph - Missing 2 values for function:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp' null 'INPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.bam'  'TMP_DIR=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp'  'VALIDATION_STRINGENCY=SILENT'  'OUTPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.preMarkDupsHsMetrics.metrics'  'BAIT_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'TARGET_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'REFERENCE_SEQUENCE=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/bwaindex0.6/exampleFASTA.fasta'  'METRIC_ACCUMULATION_LEVEL=SAMPLE'
ERROR 06:56:25,048 QGraph -   @Argument: jarFile - jar
ERROR 06:56:25,049 QGraph -   @Argument: javaMainClass - Main class to run from javaClasspath 

And yeah, it seems that the jar file is currently set to null in the command line. However, MarkDuplicates runs fine without setting the jar:

case class dedup(inBam: File, outBam: File, metricsFile: File) extends MarkDuplicates with ExternalCommonArgs with SingleCoreJob with OneDayJob {
@Input(doc = "Input bam file") var inbam = inBam
@Output(doc = "Output BAM file with dups removed") var outbam = outBam
this.REMOVE_DUPLICATES = true
this.input :+= inBam
this.output = outBam
this.metrics = metricsFile
this.memoryLimit = 3
this.isIntermediate = false
}

Why does CalculateHSMetrics need the jar, but not MarkDuplicates? Both are imported with import org.broadinstitute.sting.queue.extensions.picard._.
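Given that the error output lists @Argument: jarFile as one of the two missing values, a hedged workaround (not a confirmed fix) is to set the jar explicitly on the function; the jar path below is a placeholder for wherever Picard lives on your system.

```scala
// Sketch: same case class as above, with the jar supplied explicitly.
case class hsmetrics(inBam: File, baitIntervals: File, targetIntervals: File, outMetrics: File)
    extends CalculateHsMetrics with ExternalCommonArgs with SingleCoreJob with OneDayJob {
  @Input(doc = "Input BAM file") val bam: File = inBam
  @Output(doc = "Metrics file") val metrics: File = outMetrics
  this.input :+= bam
  this.targets = targetIntervals
  this.baits = baitIntervals
  this.output = metrics
  this.reference = refGenome
  // Placeholder path: point this at the Picard CalculateHsMetrics jar on disk.
  this.jarFile = new File("/path/to/picard/CalculateHsMetrics.jar")
  this.isIntermediate = false
}
```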

Created 2013-11-13 12:52:32 | Updated | Tags: variantrecalibrator queue

Hi there, I do apologise in case this question has been answered already but I couldn't find an updated thread on it.

In my previous code I had a conditional based on the number of samples to modify the percentBad parameter in VariantRecalibrator. Now I get an error from my Scala code, as if the parameter were no longer a member of the class.

Checking again in the Best Practices and in the VariantRecalibrator documentation, I can't find it anymore. Could you confirm that it's no longer used? Or am I doing something wrong?

cheers, Francesco

Created 2013-11-06 19:02:46 | Updated | Tags: queue gatk

I am reading through the most recent workshop slides on Queue, and trying to write a Scala script to connect the GATK walkers. However, I'm confused about how to use the output of one walker as input for the next, especially when the previous walker has multiple outputs. For example, I wrote the following script to connect RealignerTargetCreator and IndelRealigner, with a list of BAM files as input to RealignerTargetCreator. I don't know whether I should have multiple outputs from RealignerTargetCreator, or how to use multiple outputs from RealignerTargetCreator as input for IndelRealigner. My confusion is highlighted as bold comment text below:

def script() {
  val bams = QScriptUtils.createSeqFromFile(qscript.input)

  if (nContigs < 0)
    nContigs = QScriptUtils.getNumberOfContigs(bams(0))

  val baseName = ""
  val outputDir = if (qscript.outputDir.isEmpty()) baseName else qscript.outputDir + "/" + baseName

  val realigner = new RealignerTargetCreator
  realigner.reference_sequence = qscript.referenceFile
  realigner.input_file = bams
  realigner.out = new File(outputDir + "/Realigner")  // **should I have multiple outputs? how do you name individual output files?**

  val indelRealigner = new IndelRealigner
  indelRealigner.input_file :+= realigner.out  // **do I need a for-loop to go through each input file from the last walker?**
  indelRealigner.targetIntervals = swapExt(realigner.out, "bam", "intervals")
  indelRealigner.nWayOut = new File(".realigned.bam")
}
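For what it's worth, Queue infers the job graph by matching one step's output files to the next step's input files, so the usual pattern is to give each step explicit file names derived with swapExt rather than to rewire things by hand. A self-contained sketch of that derivation (a simplified string-based re-implementation for illustration, not Queue's real File-based swapExt helper):

```scala
object SwapExtSketch {
  // Simplified stand-in for Queue's swapExt: replace one suffix with another.
  def swapExt(file: String, oldExt: String, newExt: String): String =
    file.stripSuffix(oldExt) + newExt

  def main(args: Array[String]): Unit = {
    val bams = Seq("sampleA.bam", "sampleB.bam")
    // One realigned BAM name per input BAM: because each output name is
    // derived from its input, a downstream job that lists these files as
    // inputs is automatically ordered after the job that produces them.
    val realigned = bams.map(b => swapExt(b, ".bam", ".realigned.bam"))
    realigned.foreach(println)
  }
}
```

The same idea applies to the intervals file: derive its name from the BAM it was computed from, and pass that derived name to the next step.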

Created 2013-10-08 05:10:07 | Updated 2013-10-08 05:14:30 | Tags: queue java

Hi team,
I have Java 1.6 installed on my system. I know that GATK now requires Java 1.7, but I work on a shared system and cannot change the default java version, so I downloaded Java 1.7 and specify the java version at the call:

If I run:
/jre1.7.0/bin/java -Djava.io.tmpdir=tmp -jar Queue.jar --help
works fine, but if I try:
/jre1.7.0/bin/java -Djava.io.tmpdir=tmp -jar Queue.jar -S myScala.file ....
I get the following error:

DEBUG 13:42:50,953 FunctionEdge - Starting: /Qscripts > 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/Qscripts/tmp' '-cp' 'Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'HaplotypeCaller' '-I' '/resources/exampleBAM.bam' '-R' '/resources/exampleFASTA.fasta' '-nct' '1' '-o' '/raw1.vcf' '-hets' '0.005'
INFO 13:42:50,954 FunctionEdge - Output written to /raw1.vcf.out
DEBUG 13:42:50,954 IOUtils - Deleted /raw1.vcf.out
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/sting/gatk/CommandLineGATK : Unsupported major.minor version 51.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.broadinstitute.sting.gatk.CommandLineGATK. Program will exit.

It seems as if the HaplotypeCaller job calls java again and uses the default java on the system, which is Java 1.6. Can I change the "java" version parameter? Thanks in advance,

Created 2013-10-02 00:56:39 | Updated | Tags: variantrecalibrator haplotypecaller queue mode

Hi team, I have a basic question, sorry. I am trying to use HaplotypeCaller and VariantRecalibrator with Queue, but I don't know how to add:

VariantRecalibrator.mode = BOTH
HaplotypeCaller.genotyping_mode = DISCOVERY

I get a "not found: value BOTH/DISCOVERY" error. Thank you in advance,

Created 2013-09-30 16:54:27 | Updated | Tags: queue scala

I am trying to build Queue from the Sting package downloaded from GitHub, but the ant build process always fails with different errors. I wonder if there's any alternative way to build Queue. Is there any scala script available that I can study or customize for automating GATK runs?
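Regarding the "not found: value BOTH/DISCOVERY" error above: in a QScript those arguments are enum-typed on the generated wrappers, so a bare BOTH is an unbound identifier; you need to reference the enum constant through its (possibly fully qualified) type, and the exact package path varies by GATK version. A self-contained sketch with stand-in enums (not the real GATK classes):

```scala
object EnumArgSketch {
  // Stand-ins for the enum types the generated wrappers expose.
  object Mode extends Enumeration { val SNP, INDEL, BOTH = Value }
  object GenotypingMode extends Enumeration { val DISCOVERY, GENOTYPE_GIVEN_ALLELES = Value }

  // Stand-ins for the generated walker wrappers.
  class VariantRecalibratorStub { var mode: Mode.Value = _ }
  class HaplotypeCallerStub { var genotyping_mode: GenotypingMode.Value = _ }

  def main(args: Array[String]): Unit = {
    val vr = new VariantRecalibratorStub
    val hc = new HaplotypeCallerStub
    vr.mode = Mode.BOTH                          // qualified reference compiles
    hc.genotyping_mode = GenotypingMode.DISCOVERY
    // vr.mode = BOTH                            // a bare BOTH: "not found: value BOTH"
    println(vr.mode)                             // prints BOTH
  }
}
```

In a real QScript the qualifier would be the wrapper's own enum type from the GATK packages rather than these stubs.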
Created 2013-09-27 13:45:04 | Updated | Tags: test queue pipeline

I'm wondering if there's any way to skip the GATKCommandLine line in the VCF header (in a VCF file generated by UnifiedGenotyper). I thought that --remove_program_records would do this, but it doesn't seem to do the trick; I'm still seeing the header line:

##GATKCommandLine=<ID=UnifiedGenotyper,Version=2.7-2-g6bda569,Date="Fri Sep 27 15:17:59 CEST 2013", [...] >

The reason this is important to me is that I'm using the pipeline test code provided in Queue which, as you know, is based on md5 sums; since the time the test was run is included in the header, the md5 hash changes for each run. If there is no way to skip the header, is there any other, better way to do this?

Cheers, Johan

Created 2013-09-23 07:53:40 | Updated | Tags: bqsr queue parallelism

When using Queue for BQSR scatter/gather parallelism I've been seeing the following:

java.lang.IllegalArgumentException: Table1 188,3 not equal to 189,3
    at org.broadinstitute.sting.utils.recalibration.RecalUtils.combineTables(RecalUtils.java:808)
    at org.broadinstitute.sting.utils.recalibration.RecalibrationReport.combine(RecalibrationReport.java:147)
    at org.broadinstitute.sting.gatk.walkers.bqsr.BQSRGatherer.gather(BQSRGatherer.java:86)
    at org.broadinstitute.sting.queue.function.scattergather.GathererFunction.run(GathererFunction.scala:42)
    at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:53)
    at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
    at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434)
    at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156)
    at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
    at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)

I'm using gatk version v2.4-7-g5e89f01 (I can't keep up the pace with you guys). I'm wondering if this is a known issue, and if so, whether there is a workaround or a fix in later GATK versions.

Cheers, Johan

Created 2013-09-10 14:25:18 | Updated | Tags: queue

When trying to run the Haplotype caller using GATK Queue all my jobs failed because they couldn't write to the tmp files in the ./.queue directory.

This seems to be related to the working directory from which the Queue command is started, and the ./.queue directory is created.

The head node of the SGE grid I want to use and from where I start the Queue is server8.

When I start the Queue command on that server the working directory is translated from a path that is accessible from everywhere on the SGE cluster to a path that is only accessible locally.

Example working directory: /home/sge_share_sever8/variantCalling is translated to the absolute path /data/variantCalling, which is only accessible locally.

If I start Queue from a working directory that is mounted from another server, everything works fine.

Example working directory: /home/sge_share_sever12/variantCalling stays /home/sge_share_sever12/variantCalling and everything works fine.

Is it a bug or feature that the ./.queue directory is translated from a normal path to an absolute path?

Or can I specify another location for the ./.queue directory that will not be translated to an absolute path?

Created 2013-08-27 18:48:00 | Updated | Tags: queue lsf

This is not a question, per se - I suppose it's more of an observation.

We recently upgraded LSF on one of our clusters to v9.0.1, and quickly discovered that Queue can't submit jobs. The reaction was rather violent - the entire JVM crashed, and the stack trace showed it dying in lsb_submit(). We downgraded LSF to v8.3.0, and everything is working fine (so far).

I know Queue is compiled against the LSF v7.0.6 API; it would appear that it's not binary-compatible with LSF 9.x.

Hope this helps others in the future...

Created 2013-08-13 05:52:52 | Updated | Tags: queue scala

Hi, I am trying to assign a queue name to the individual bsub commands generated by Queue.jar. Basically I want it to generate

bsub -q short "the command"
bsub "the command"

Any advice on how to assign queue names, to submit to, from within the scala file would be really helpful. Thanks.
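For what it's worth, Queue's command-line functions expose a jobQueue property (elsewhere in this thread collection the same thing is set globally with -jobQueue priority on the Queue command line), and assigning this.jobQueue = "short" inside a case class is the usual way to pick the LSF queue per job. A self-contained sketch of how a queue name folds into the generated bsub line (a stub Job class, not the real Queue API):

```scala
object JobQueueSketch {
  // Stand-in for a CommandLineFunction's jobQueue field (None = cluster default).
  case class Job(command: String, jobQueue: Option[String] = None) {
    def bsubCommand: String = jobQueue match {
      case Some(q) => s"""bsub -q $q "$command""""
      case None    => s"""bsub "$command""""
    }
  }

  def main(args: Array[String]): Unit = {
    println(Job("the command", Some("short")).bsubCommand) // bsub -q short "the command"
    println(Job("the command").bsubCommand)                // bsub "the command"
  }
}
```

Setting the property on only some case classes gives exactly the mixed output the question asks for: queued jobs get -q, the rest fall through to the default queue.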

Created 2013-07-23 19:20:19 | Updated | Tags: queue qscript scala

I'm working on a set of related Queue scripts. I would like to have functionality shared between them, ideally in separate scala files which would be imported. Is there a way to specify additional paths for the Queue scala compiler to search or do I have to bake my library into the gatk when I build it?

Created 2013-06-25 15:41:09 | Updated | Tags: unifiedgenotyper queue scala development scattergatter

I've noticed some strange behavior from Queue where, in some cases, scatter/gathering the UnifiedGenotyper in indel mode will introduce cycles in the graph. This causes Queue to die with a StackOverflowError, which seems to be caused by the graphDepth function in QGraph due to the recursion becoming unbounded. This caused me some headaches yesterday as I tried to figure out how to make the function tail-recursive, before noticing the message ERROR 17:18:21,292 QGraph - Cycles were detected in the graph this morning.

This leads me to one request and one question. First the request: it would be nice if Queue would exit when graph validation fails, as that would make identifying the source of the problem simpler. Is this possible?

Secondly the question: do you have any ideas as to what might cause the cycles?

I have tried looking at the graphviz files and I cannot identify any cycles from those (though when looking at the s/g-plots it's really difficult to make any sense of it).

My code looks like this:

val candidateSnps = new File(outputDir + "/" + projectName + ".candidate.snp.vcf")
val candidateIndels = new File(outputDir + "/" + projectName + ".candidate.indel.vcf")

// SNP and INDEL Calls

val targets = new File(outputDir + "/" + projectName + ".targets")

// Take regions based on indels called in previous step
val postCleaningBamList =
  for (bam <- cohortList) yield {
    val indelRealignedBam = swapExt(bam, ".bam", ".clean.bam")
    indelRealignedBam
  }

val afterCleanupSnps = swapExt(candidateSnps, ".candidate.snp.vcf", ".cleaned.snp.vcf")
val afterCleanupIndels = swapExt(candidateIndels, ".candidate.indel.vcf", ".cleaned.indel.vcf")

// Call snps/indels again
add(indelCall(postCleaningBamList, afterCleanupIndels))

Where the cohortList is a Seq[File].

Right now I've solved this by setting this.scatterCount = 1 in the indelCall case class, however this doesn't feel quite satisfactory to me, so any pointers for a more robust solution would be greatly appreciated.
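On the cycle question: Queue builds its graph from file-name matches between outputs and inputs, so a cycle generally means some file is (transitively) both upstream and downstream of itself, which is worth checking for in the swapExt-derived names. A self-contained sketch of the kind of check involved (a plain DFS over a file-dependency map; an illustration only, not Queue's actual QGraph code):

```scala
object CycleCheckSketch {
  // deps maps an output file to the input files it depends on,
  // modeled here as plain file-name strings.
  def hasCycle(deps: Map[String, Seq[String]]): Boolean = {
    // DFS from each node, tracking the files on the current path;
    // revisiting a file already on the path means a cycle.
    def visit(node: String, stack: Set[String]): Boolean =
      stack.contains(node) ||
        deps.getOrElse(node, Nil).exists(visit(_, stack + node))
    deps.keys.exists(visit(_, Set.empty))
  }

  def main(args: Array[String]): Unit = {
    val acyclic = Map(
      "a.clean.bam" -> Seq("a.bam"),
      "a.vcf"       -> Seq("a.clean.bam"))
    val cyclic = acyclic + ("a.bam" -> Seq("a.vcf")) // closes the loop
    println(hasCycle(acyclic)) // false
    println(hasCycle(cyclic))  // true
  }
}
```

Dumping each scattered function's inputs and outputs and running a check like this over them can pinpoint which generated file closes the loop.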

Created 2013-06-06 21:08:30 | Updated | Tags: commandlinegatk queue hadoop mapreduce google

Hello, I'm new to GATK and Queue. I understand that we can write a QScript in Queue to generate separate GATK jobs and run them on a cluster of several nodes. Can we run GATK or Queue on Google's Hadoop/MapReduce infrastructure?

Created 2013-06-05 16:09:04 | Updated | Tags: queue gridengine

Good morning team!

First, I have to qualify my question: I'm a unix sysadmin trying to get the Queue functionality implemented on our cluster so our analysts can play. I'm hoping my question is simple; here goes:

My first attempt at executing the "hello world" example came up with this error:

kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
INFO 11:04:28,560 QScriptManager - Compiling 1 QScript
INFO 11:04:31,265 QScriptManager - Compilation complete
INFO 11:04:31,340 HelpFormatter - ----------------------------------------------------------------------
INFO 11:04:31,340 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04
INFO 11:04:31,340 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 11:04:31,340 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 11:04:31,341 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
INFO 11:04:31,341 HelpFormatter - Date/Time: 2013/06/05 11:04:31
INFO 11:04:31,341 HelpFormatter - ----------------------------------------------------------------------
INFO 11:04:31,341 HelpFormatter - ----------------------------------------------------------------------
INFO 11:04:31,346 QCommandLine - Scripting HelloWorld
INFO 11:04:31,363 QCommandLine - Added 1 functions
INFO 11:04:31,364 QGraph - Generating graph.
INFO 11:04:31,373 QGraph - Running jobs.
ERROR 11:04:31,427 QGraph - Uncaught error running jobs.
java.lang.UnsatisfiedLinkError: Unable to load library 'drmaa': libdrmaa.so: cannot open shared object file: No such file or directory

Oops! It seems the drmaa library can't be found by default. So I fixed that by adding the following directory to the library search path on the node: /gridware/sge/lib/lx-amd64 (which is where that library lives).

Success! Sort of. The error above is resolved, but I am now getting the error below, and this is where I'm stuck. It doesn't look like the job is actually getting submitted, or it's getting submitted and dies. I would really appreciate any insight the team can offer; we are very excited to get this environment working. Thank you in advance!

kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
INFO 11:07:52,728 QScriptManager - Compiling 1 QScript
INFO 11:07:55,208 QScriptManager - Compilation complete
INFO 11:07:55,271 HelpFormatter - ----------------------------------------------------------------------
INFO 11:07:55,271 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04
INFO 11:07:55,271 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 11:07:55,271 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 11:07:55,272 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
INFO 11:07:55,272 HelpFormatter - Date/Time: 2013/06/05 11:07:55
INFO 11:07:55,272 HelpFormatter - ----------------------------------------------------------------------
INFO 11:07:55,272 HelpFormatter - ----------------------------------------------------------------------
INFO 11:07:55,276 QCommandLine - Scripting HelloWorld
INFO 11:07:55,292 QCommandLine - Added 1 functions
INFO 11:07:55,292 QGraph - Generating graph.
INFO 11:07:55,298 QGraph - Running jobs.
INFO 11:07:55,481 FunctionEdge - Starting: echo hello world
INFO 11:07:55,482 FunctionEdge - Output written to /shared/users/kcb/HelloWorld-1.out
ERROR 11:07:55,507 Retry - Caught error during attempt 1 of 4.
org.ggf.drmaa.InternalException: Error reading answer list from qmaster
    at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:400)
    at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392)
    at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.runJob(JnaSession.java:79)
    at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply$mcV$sp(DrmaaJobRunner.scala:87)
    at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
    at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
    at org.broadinstitute.sting.queue.util.Retry.attempt(Retry.scala:49)
    at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.liftedTree1$1(DrmaaJobRunner.scala:85)
    at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.start(DrmaaJobRunner.scala:84)
    at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
    at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434)
    at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156)
    at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
    at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
ERROR 11:07:55,510 Retry - Retrying in 1.0 minute.

Created 2013-05-13 12:39:01 | Updated | Tags: haplotypecaller queue scatter-gather

Hi, I just managed to use HaplotypeCaller with the latest version of Queue to call variants on 40 human exomes. The HaplotypeCaller job was scattered into 50 sub-jobs and spread across our cluster with Sun Grid Engine.
The problem I found is that the sub-jobs take quite variable time to finish, from 5 hours to 80 hours, with the majority below 55 hours; hence the whole job was actually slowed down by just a few longer sub-jobs. I know that part of the difference is caused by the performance of the cluster node running each job, but I think the major cause is how the job was split. The qscript I used is adapted from here (without the filtering part), from which I cannot figure out how the job was split. Hence, I am wondering if anyone could tell me on what basis (genomic regions?) the HaplotypeCaller job is actually scattered, and how I can split the job more evenly so most of the sub-jobs finish at about the same time. Thanks in advance, Best, Yaobo

Created 2013-04-22 20:49:54 | Updated 2013-04-22 21:03:16 | Tags: queue scatter-gather

At the Minnesota Supercomputing Institute, our environment requires that jobs on our high performance clusters reserve an entire node. I have implemented my own Torque Manager/Runner for our environment based on the Grid Engine Manager/Runner. The way I have gotten this to work in our environment is to set the nCoresRequest for the scatter/gather method to the minimum required of eight. My understanding is that for the IndelRealigner, for example, the job reserves a node with eight cores but only uses one. That means our users would have their compute time allocation consumed eight times faster than is necessary. What I am wondering is: are there options that I am missing whereby some number of the scatter/gather requests can be grouped into a single job submission? If I were writing this as a PBS script for our environment and I wanted to use 16 cores in a scatter/gather implementation, I would write two jobs, each with eight commands. They would look something like the following:

#PBS Job Configuration stuff
pbsdsh -n 0 java -jar ... &
pbsdsh -n 1 java -jar ... &
pbsdsh -n 2 java -jar ... &
pbsdsh -n 3 java -jar ... &
pbsdsh -n 4 java -jar ... &
pbsdsh -n 5 java -jar ... &
pbsdsh -n 6 java -jar ... &
pbsdsh -n 7 java -jar ... &
wait

Has anyone done something similar in Queue? Any pointers? Thanks in advance!

Created 2013-03-20 07:44:16 | Updated | Tags: printreadswalker queue

My qscript for the GATK PrintReads walker takes a very long time in the gather function (BAM gather), which uses Picard MergeSamFiles. How can I write a custom gather function in my qscript to improve the performance?
1. I want to try multi-threading for Picard MergeSamFiles (which is off by default)
2. I also want to try simple concatenation of the bam files (the scattered bam files are already sorted)

Created 2013-03-15 00:06:36 | Updated | Tags: dataprocessingpipeline queue

My lab is trying to use Queue, but our pipeline is spawning a very large number of jobs (as would be expected). Our lab has very spiky data processing requirements, and it would be useful to be able to limit the number of jobs Queue submits at a time so other, non-Queue jobs can run in parallel. Is this possible? Thanks

Created 2013-02-27 21:55:09 | Updated | Tags: queue

Hi, I am trying to build GATK w/ Queue from the HEAD of the gatk-protected repository (2a7af4316478348f7ea58e0803b3391593d6dbd6).
I am getting the following error during the Queue extensions building phase:

[java] Exception in thread "main" java.lang.NoClassDefFoundError: org/broadinstitute/sting/commandline/MissingArgumentException
[java]     at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:180)
[java]     at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
[java]     at org.broadinstitute.sting.queue.extensions.gatk.GATKExtensionsGenerator.main(GATKExtensionsGenerator.java:104)
[java] Caused by: java.lang.ClassNotFoundException: org.broadinstitute.sting.commandline.MissingArgumentException
[java]     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
[java]     at java.security.AccessController.doPrivileged(Native Method)
[java]     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
[java]     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
[java]     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
[java]     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
[java]     ... 3 more

BUILD FAILED
/hpc/users/lindem03/packages/gatk-mssm/dev/build.xml:509: Java returned: 1
    at org.apache.tools.ant.taskdefs.Java.execute(Java.java:111)
    at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
    at org.apache.tools.ant.Task.perform(Task.java:348)
    at org.apache.tools.ant.Target.execute(Target.java:392)
    at org.apache.tools.ant.Target.performTasks(Target.java:413)
    at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
    at org.apache.tools.ant.Project.executeTarget(Project.java:1368)
    at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
    at org.apache.tools.ant.Project.executeTargets(Project.java:1251)
    at org.apache.tools.ant.Main.runBuild(Main.java:811)
    at org.apache.tools.ant.Main.startAnt(Main.java:217)
    at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
    at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)

When I run ant in debug mode I see the directory with the newly built GATK classes on the classpath. Is there something else I might be missing? I am trying to upgrade from 1.6... Thanks, Michael

Created 2013-02-25 11:45:46 | Updated | Tags: queue intervals

GATK Queue implements a scatter/gather algorithm that creates a set of intervals in order to parallelise data analysis. If the -scatter_gather option is issued, a corresponding number of interval files will be created and the input BAM files will be processed using these intervals. However, a question arises: what happens if the point of subsequent analysis is at or near the start/end of an interval? Are all the tools which support -l/-L options robust with respect to interval positions?
Since the input files are not actually spliced, all information is available to the processing program, which could make the right decisions so that no artefacts are produced. Are there any restrictions on interval creation? Perhaps an interval should be at least a few read lengths long. Anything else? Thanks in advance!

Created 2013-01-24 22:01:36 | Updated | Tags: queue markduplicates

I was frustrated by the .metrics file from MarkDuplicates getting deleted as an intermediate file, so I set isIntermediate=false for that step in the DataProcessingPipeline. But now I'm getting tired of manually deleting the intermediate bams. So my request is: could that field be changed from an @Output to an @Argument? This would be on line 50 of org.broadinstitute.sting.queue.extensions.picard.MarkDuplicates.scala. I also made that a required field in my local copy, since it is required to run the Picard tool.

A similar but opposite problem is that the bai file from the IndelRealigner step is not deleted - but that looks like it would require either special handling for that walker in Queue or for the index file to be an argument to the Java walker. Neither is a particularly appealing solution.

Created 2013-01-15 19:38:43 | Updated 2013-01-15 20:02:25 | Tags: queue lsf

I had no problem running GATK two weeks ago. But today, when I run the following GATK command, I get an error message. It seems it cannot load the library "liblsf.so". Please see below. Has there been any recent change to the GATK libraries?
2:15pm qyu@vbronze /bit/data01/projects/GC_coverage
java -Xmx4g -Djava.io.tmpdir=/broad/hptmp/vdauwera -jar /humgen/gsa-scr1/vdauwera/gatk/walker/dist/Queue.jar -S /bit/data01/projects/GC_coverage/src/IntervalCovGG.scala -i /humgen/gsa-hpprojects/GATK/data/genes_of_interest.exon_targets.interval_list -b /bit/data01/projects/GC_coverage/testGCcoverage/data/DEV-2129.list -o /bit/data01/projects/GC_coverage/testGCcoverage/DEV-2129.gatkreport -m 16 -sc 100 -bsub -jobQueue priority -startFromScratch -run

The error shows:

ERROR 14:13:06,238 QGraph - Uncaught error running jobs.
java.lang.UnsatisfiedLinkError: Unable to load library 'lsf': liblsf.so: cannot open shared object file: No such file or directory
    at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:163)
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:236)
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:199)
    at org.broadinstitute.sting.jna.lsf.v7_0_6.LibBat.<clinit>(LibBat.java:90)
    at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner.<init>(Lsf706JobRunner.scala:233)
    at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner.<clinit>(Lsf706JobRunner.scala)
    at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner.<init>(Lsf706JobRunner.scala:47)
    at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobManager.create(Lsf706JobManager.scala:35)
    at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobManager.create(Lsf706JobManager.scala:33)
    at org.broadinstitute.sting.queue.engine.QGraph.newRunner(QGraph.scala:632)
    at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:408)
    at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:131)
    at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:127)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
    at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
    at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'lsf': liblsf.so: cannot open shared object file: No such file or directory
    at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:163)
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:236)
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:199)
    at org.broadinstitute.sting.jna.lsf.v7_0_6.LibBat.<clinit>(LibBat.java:9 .........

Thanks, Qing

Created 2013-01-10 20:43:28 | Updated 2013-01-10 20:49:09 | Tags: dataprocessingpipeline queue dpp performance community

Is there any rule of thumb for allocating memory through "bsub" when running the DataProcessingPipeline, per bam file or per number of reads? Thanks

Created 2013-01-08 18:47:46 | Updated 2013-01-08 19:34:57 | Tags: dataprocessingpipeline queue scala

I am running the following command to test my scala script using GATK-2.3.5:

> java -Xmx4g -Djava.io.tmpdir=tmp -jar /Queue-2.3-5-g49ed93c/Queue.jar -S Queue-2.3-5-g49ed93c/ExampleCountReads.scala -R /GATK-2.3-5-g49ed93c/resources/exampleFASTA.fasta -I /GATK-2.3-5-g49ed93c/resources/exampleBAM.bam

and I am getting this error:

> INFO 13:42:55,166 QScriptManager - Compiling 1 QScript
> ERROR 13:42:55,348 QScriptManager - ExampleCountReads.scala:39: in XML literal: '=' expected instead of 'G'
> ERROR 13:42:55,356 QScriptManager - // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,356 QScriptManager - ^
> ERROR 13:42:55,357 QScriptManager - ExampleCountReads.scala:39: in XML literal: ' or " delimited attribute value or '{' scala-expr '}' expected
> ERROR 13:42:55,358 QScriptManager - // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,358 QScriptManager - ^
> ERROR 13:42:55,358 QScriptManager - ExampleCountReads.scala:39: in XML literal: whitespace expected
> ERROR 13:42:55,359 QScriptManager - // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,360 QScriptManager - ^
> ERROR 13:42:55,360 QScriptManager - ExampleCountReads.scala:39: in XML literal: '=' expected instead of '>'
> ERROR 13:42:55,361 QScriptManager - // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,362 QScriptManager - ^
> ERROR 13:42:55,362 QScriptManager - ExampleCountReads.scala:39: in XML literal: ' or " delimited attribute value or '{' scala-expr '}' expected
> ERROR 13:42:55,363 QScriptManager - // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,365 QScriptManager - ^
> ERROR 13:42:55,366 QScriptManager - ExampleCountReads.scala:39: in XML literal: whitespace expected
> ERROR 13:42:55,367 QScriptManager - // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,367 QScriptManager - ^
> ERROR 13:42:55,367 QScriptManager - ExampleCountReads.scala:39: in XML literal: '>' expected instead of ' '
> ERROR 13:42:55,369 QScriptManager - // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,369 QScriptManager - ^
> ERROR 13:42:55,369 QScriptManager - ExampleCountReads.scala:67: in XML literal: in XML content, please use '}}' to express '}'
> ERROR 13:42:55,369 QScriptManager - }
> ERROR 13:42:55,369 QScriptManager - ^
> ERROR 13:42:55,370 QScriptManager - ExampleCountReads.scala:39: I encountered a '}' where I didn't expect one, maybe this tag isn't closed <WalkerName>
> ERROR 13:42:55,371 QScriptManager - // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help
> ERROR 13:42:55,371 QScriptManager - ^
> ERROR 13:42:55,371 QScriptManager - ExampleCountReads.scala:68: '}' expected but eof found.
> ERROR 13:42:55,371 QScriptManager - }
> ERROR 13:42:55,372 QScriptManager - ^
> ERROR 13:42:55,390 QScriptManager - 10 errors found
> ##### ERROR ------------------------------------------------------------------------------------------
> ##### ERROR stack trace
> org.broadinstitute.sting.queue.QException: Compile of /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/ExampleCountReads.scala failed with 10 errors
> at org.broadinstitute.sting.queue.QScriptManager.loadScripts(QScriptManager.scala:46)
> at org.broadinstitute.sting.queue.QCommandLine.org$broadinstitute$sting$queue$QCommandLine$$qScriptPluginManager(QCommandLine.scala:94)
> at org.broadinstitute.sting.queue.QCommandLine.getArgumentSources(QCommandLine.scala:225)
> at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:197)
> at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
> at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala:61)
> ##### ERROR ------------------------------------------------------------------------------------------
> ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-5-g49ed93c):
> ##### ERROR
> ##### ERROR Please visit the wiki to see if this is a known problem
> ##### ERROR If not, please post the error, with stack trace, to the GATK forum
> ##### ERROR
> ##### ERROR MESSAGE: Compile of /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/ExampleCountReads.scala failed with 10 errors
> ##### ERROR ------------------------------------------------------------------------------------------
> INFO  13:42:55,487 QCommandLine - Shutting down jobs. Please wait...
> 13.462u 1.029s 0:10.41 139.0% 0+0k 0+0io 1pf+0w
> 

I am sure java, gatk and queue are installed properly and the files exist in the relevant directory. Thank you,

Created 2012-12-24 12:14:53 | Updated 2013-01-07 19:44:02 | Tags: queue scala

Hi, folks,

I am trying out my Queue Scala scripts and often get this error message:

Conflicting collector combinations in option list; please refer to the release notes for the combinations allowed

However, from the debug information, the command line looks correct (?):

ERROR 12:07:05,460 FunctionEdge - Error: 'java' '-Xmx8192m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/Users/wxing/TPU/run/.queue/tmp' '-cp' '/Users/wxing/TPU/gatk/dist/Queue.jar' 'net.sf.picard.sam.SortSam' 'INPUT=/Users/wxing/TPU/run/SRR064286_1.fastq.aligned.sam' 'TMP_DIR=/Users/wxing/TPU/run/.queue/tmp' 'OUTPUT=/Users/wxing/TPU/run/SRR064286_1.fastq.aligned.bam' 'VALIDATION_STRINGENCY=SILENT' 'SO=coordinate' 'CREATE_INDEX=true'

Has anyone had similar issues? Any tricks for debugging?

Many thanks, Wei

Created 2012-12-13 18:44:03 | Updated | Tags: queue lsf

Hi,

I am testing queue scripts with new installed LSF v8.3. The test script is:

java -Djava.io.tmpdir=/tmp -jar jar2216/Queue.jar -S Queue-2.2-16-g9f648cb/resources/ExampleCountReads.scala -R Queue-2.2-16-g9f648cb/resources/exampleFASTA.fasta -I Queue-2.2-16-g9f648cb/resources/exampleBAM.bam --bsub -run

where I get error message as follows:

'java' '-Xmx1024m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/data/cmb/wxing/gatk/.queue/tmp' '-cp' '/data/cmb/wxing/gatk/jar2216/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'CountReads' '-I' '/data/cmb/wxing/gatk/Queue-2.2-16-g9f648cb/resources/exampleBAM.bam' '-R' '/data/cmb/wxing/gatk/Queue-2.2-16-g9f648cb/resources/exampleFASTA.fasta'

java.lang.UnsatisfiedLinkError: Error looking up function 'ls_getLicenseUsage': /usr/local/lsf/8.3/linux2.6-glibc2.3-x86_64/lib/liblsf.so: undefined symbol: ls_getLicenseUsage
at com.sun.jna.Function.<init>(Function.java:179)
at com.sun.jna.NativeLibrary.getFunction(NativeLibrary.java:344)
at com.sun.jna.NativeLibrary.getFunction(NativeLibrary.java:324)
at com.sun.jna.Native.register(Native.java:1341)
at com.sun.jna.Native.register(Native.java:1018)
at org.broadinstitute.sting.jna.lsf.v7_0_6.LibLsf.<clinit>(LibLsf.java:86)
at java.lang.J9VMInternals.initializeImpl(Native Method)
at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner$.unitDivisor(Lsf706JobRunner.scala:401)
at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner$.org$broadinstitute$sting$queue$engine$lsf$Lsf706JobRunner$$convertUnits(Lsf706JobRunner.scala:416)
at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner.start(Lsf706JobRunner.scala:98)
at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:83)
at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:432)
at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:154)
at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:145)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)

Any clues on the issue "java.lang.UnsatisfiedLinkError: Error looking up function 'ls_getLicenseUsage': /usr/local/lsf/8.3/linux2.6-glibc2.3-x86_64/lib/liblsf.so"? Or has anyone had similar problems? Could it be the version of our LSF (v8.3), since the code seems to be based on version 7.0.6?

Many thanks, Wei

Created 2012-11-22 15:48:43 | Updated 2012-11-22 15:49:11 | Tags: unifiedgenotyper queue qscript nullpointerexception

I'm building a variant calling qscript (it's available here), heavily based on the MethodsDevelopmentCallingPipeline.scala. However, I run into trouble when setting the "this.scatterCount" of the GenotyperBase to more than 1 - in which case I get a NullPointerException (I include the full error message below). I use the following command line:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/NewVariantCalling.scala -i NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam -R /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta -res /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/ **-sg 2** -nt 8 -run -l DEBUG -startFromScratch

As you can see, I'm using the files from the GATK bundle, and I guess these should be alright for this purpose? Just to be clear, I use the "-res" parameter to point to the directory where all the resource files are located (dbsnp, hapmap, etc.), and the -sg parameter is what controls the scatter/gather count.
I've tried to search the code for what might be causing this, and I can conclude that org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc is called with str (its parameter) being an empty string, which is what causes contig to be null, which in turn creates the NullPointerException on line 408 when this line is executed: stop = getContigInfo(contig).getSequenceLength();

This, I guess, is the obvious stuff, but so far I haven't been able to figure this out any further than that. I'm not sure if this is caused by a bug in my script or by a bug in the GATK. Right now I'm thinking the latter, since I have used the scatter/gather function in other scripts without any trouble. Any ideas on where to continue from here, or confirmation that this is indeed something related to the GATK code, would be much appreciated.

Cheers, Johan

ERROR 16:22:50,781 FunctionEdge - Error: LocusScatterFunction: List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf.idx, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam) > List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_1_of_2/scatter.intervals, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_2_of_2/scatter.intervals)
java.lang.NullPointerException
at org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:408)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:84)
at org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:97)
at org.broadinstitute.sting.utils.interval.IntervalUtils.loadIntervals(IntervalUtils.java:538)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindingsPair(IntervalUtils.java:520)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindings(IntervalUtils.java:499)
at org.broadinstitute.sting.queue.extensions.gatk.GATKIntervals.locs(GATKIntervals.scala:60)
at org.broadinstitute.sting.queue.extensions.gatk.LocusScatterFunction.run(LocusScatterFunction.scala:39)
at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:28)
at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:83)
at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:432)
at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:154)
at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:145)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)

Created 2012-11-08 13:13:47 | Updated | Tags: queue resources

Is there any where I can find the integration test file with the md5sum "45d97df6d291695b92668e8a55c54cd0", which is used in the DataProcessingPipelineTest class? Since my tests fail with another md5sum calculated I would be interested to know what the differences between the files are.

Created 2012-11-08 12:58:35 | Updated 2013-01-07 20:07:00 | Tags: queue pipeline scala

Hi there, I wanted to reproduce in my variant calling Queue script the same conditional you have in MethodsDevelopmentCallingPipeline, i.e. including InbreedingCoeff depending on the number of samples. However, in that script the number of samples is passed to the Target object as an integer, and I would like to count it from the BAM file list passed as an input to the script.

Therefore I followed the method in DataProcessingPipeline, i.e.

import org.broadinstitute.sting.queue.util.QScriptUtils
[...]
@Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
var bamFile: File = _
[...]
val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
val sampleNo = bamFilesList.size

But unfortunately, although DataProcessingPipeline works just fine, when I put these lines in my other script I get the following error:

INFO  12:48:08,616 HelpFormatter - Date/Time: 2012/11/08 12:48:08
INFO  12:48:08,616 HelpFormatter - ----------------------------------------------------------------------
INFO  12:48:08,616 HelpFormatter - ----------------------------------------------------------------------
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.DynamicClassResolutionException: Could not create module HaplotypeCallerStep because Cannot instantiate class (Invocation failure) caused by exception null
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-2-gf44cc4e):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: Could not create module HaplotypeCallerStep because Cannot instantiate class (Invocation failure) caused by exception null
##### ERROR ------------------------------------------------------------------------------------------

I tried several alternatives looking at the imports in DataProcessingPipeline but maybe I am missing something. Could you please advise?

thanks very much Francesco
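One plausible cause (an assumption on my part, not confirmed in the post): a class-body `val` like `bamFilesList` is evaluated when the QScript is instantiated, before Queue has bound the `@Input` arguments, so `bamFile` is still null at that point and the constructor throws, which matches the "Cannot instantiate class (Invocation failure) caused by exception null" message. A minimal sketch of deferring the evaluation until after argument binding (class and field names are illustrative):

```scala
import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.util.QScriptUtils

class MyCallingScript extends QScript {
  @Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
  var bamFile: File = _

  // lazy defers evaluation until first use inside script(), which runs
  // after the command-line parser has populated bamFile.
  lazy val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)

  def script() {
    val sampleNo = bamFilesList.size
    // e.g. only request InbreedingCoeff when there are enough samples
    if (sampleNo >= 10) { /* ... add the annotation ... */ }
  }
}
```

Alternatively, move the `createSeqFromFile` call directly into `script()`; either way the point is that nothing should dereference an `@Input` field during class construction.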

Created 2012-11-08 12:21:41 | Updated | Tags: queue

What is the best way to get Queue to optimize utilization of a given number of cores in an SGE cluster? The DataProcessingPipeline.scala has a hidden parameter "scatter_gather" which sets the nContigs variable. Is it safe to use this option? For example, if you had 100 cores available in your cluster could you set the option to 100? Is there any advantage to setting it higher?

Without setting it, Queue appears to set the nContigs value based on the number of chromosomes in the BAM input: with a whole-genome BAM it's 25, with your example Chr20 data it's 1, and with an unaligned BAM it's 0. So if starting with unaligned data, does it run on a single core?

Created 2012-09-12 15:24:46 | Updated 2013-01-07 20:40:00 | Tags: queue

Queue purports to offer users a seamless pipelined integration of BWA, Picard and GATK. Yet Queue requires BAM files as input, implying I would need to call BWA separately to do alignment and then Picard or samtools to convert to BAM, then go back to Queue to work with the BAMs. At this point I run into compatibility issues for instance as described here. So is there a way I can set Queue to take FASTQ files as input?
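One workaround worth noting: a qscript is not limited to GATK walkers, so it can wrap arbitrary command-line steps such as bwa by extending CommandLineFunction. A rough sketch (the bwa/samtools command line and all names here are illustrative assumptions, not a tested pipeline):

```scala
import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.function.CommandLineFunction

class FastqToBam extends QScript {
  @Input(doc="FASTQ file", shortName="fq") var fastq: File = _
  @Input(doc="reference FASTA", shortName="R") var reference: File = _

  // Hypothetical wrapper: align with bwa and convert the SAM stream to BAM.
  class BwaAlign extends CommandLineFunction {
    @Input(doc="fastq") var fq: File = _
    @Input(doc="reference") var ref: File = _
    @Output(doc="aligned bam") var bam: File = _
    def commandLine =
      "bwa mem %s %s | samtools view -bS - > %s".format(ref, fq, bam)
  }

  def script() {
    val align = new BwaAlign
    align.fq = fastq
    align.ref = reference
    align.bam = swapExt(fastq, ".fastq", ".bam")
    add(align)
    // downstream GATK steps can now declare align.bam as their @Input,
    // and Queue will order the jobs from the dependency graph.
  }
}
```

Because Queue builds its job graph from @Input/@Output declarations, the GATK steps will simply wait for the alignment job to produce the BAM.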

Created 2012-08-14 13:15:37 | Updated 2013-01-07 19:51:08 | Tags: queue drmaa

I've been running Queue using the DRMAA, and I've noticed one thing which I would like to bring up for discussion. The job names are generated using the following code at this point:

// Set the display name to < 512 characters of the description
// NOTE: Not sure if this is configuration specific?
protected val jobNameLength = 500
protected val jobNameFilter = """[^A-Za-z0-9_]"""
protected def functionNativeSpec = function.jobNativeArgs.mkString(" ")

def start() {
  session.synchronized {
    val drmaaJob: JobTemplate = session.createJobTemplate

    drmaaJob.setJobName(function.description.take(jobNameLength).replaceAll(jobNameFilter, "_"))
    [...]

For me this yields names looking something like this:

"java__Xmx3072mD"

This is not very useful for telling the jobs apart. I'm running my jobs via DRMAA on a system using the SLURM resource manager, so the cut-off in the name above can be attributed to SLURM truncating the name. Even so, I think there should be a more reasonable way to create the name - using the function.jobName, for example.

So, this leads me to my question - is there any particular reason that the job names are generated the way they are? And if not, do you (the GATK team) want a patch changing this to use the function.jobName instead?

Furthermore, I would be interested in hearing from other users running GATK Queue over DRMAA, since I think it might be interesting to develop this further. As an example, I had to implement setting a hard wall time in the job runner, since the cluster I'm running on demands this. I'm sure there are more solutions like that out there, and I would be thrilled to hear about them.
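The suggested patch would amount to a one-line change in start() (an untested sketch, keeping the jobNameLength/jobNameFilter definitions shown above and assuming function.jobName carries Queue's short per-job name):

```scala
// Sketch: name the DRMAA job after the short, human-readable jobName
// rather than the full java command-line description, so that SLURM's
// truncated display still distinguishes the jobs.
drmaaJob.setJobName(function.jobName.take(jobNameLength).replaceAll(jobNameFilter, "_"))
```

The take/replaceAll filtering is kept as-is so the name still satisfies whatever character restrictions the resource manager imposes.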

Created 2012-08-02 13:34:54 | Updated 2012-08-02 13:35:20 | Tags: unifiedgenotyper queue

It says in the "ExampleUnifiedGenotyper" qscript that it runs an incomplete version of the UnifiedGenotyper. I have two questions about this.

• I cannot find the org.broadinstitute.sting.queue.extensions.gatk.UnifiedGenotyper class, is this not in the public source or am I blind?
• Does "incomplete" in the comment at the top of the script mean that this extension is incomplete in its functionality? Or is it something else that is incomplete?

Basically I'm wondering if there are any pitfalls for me if I start using the example script as a base for my own variant calling qscript.
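On the first question: the gatk extension classes, UnifiedGenotyper included, are generated at build time by Queue's GATK-wrapper codegen, which is why they don't appear in the checked-in source tree. A minimal usage sketch (the argument field names below follow the codegen's convention of mirroring the walker's command-line options, but treat them as assumptions):

```scala
import org.broadinstitute.sting.queue.QScript
// Generated at build time by the GATK-wrapper codegen; not in the source tree.
import org.broadinstitute.sting.queue.extensions.gatk.UnifiedGenotyper

class CallVariants extends QScript {
  @Input(doc="BAM file", shortName="I") var bam: File = _
  @Input(doc="reference FASTA", shortName="R") var reference: File = _

  def script() {
    val ug = new UnifiedGenotyper
    ug.reference_sequence = reference  // mirrors -R
    ug.input_file :+= bam              // mirrors -I (a Seq, hence :+=)
    ug.out = swapExt(bam, ".bam", ".vcf")
    add(ug)
  }
}
```

If the class seems missing, building the project (so the codegen runs) is the first thing to check.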