Tagged with #queue
6 documentation articles | 4 announcements | 50 forum discussions


Comments (15)

This document explains the concepts involved and how they are applied within the GATK (and Queue where applicable). For specific configuration recommendations, see the companion document on parallelizing GATK tools.

1. Introducing the concept of parallelism

Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (i.e. waiting for each operation to finish before starting the next one).

Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.

This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.

The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see doc for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.

A quick warning about tradeoffs

Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.

Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.

Parallel computing in practice (sort of)

OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?

Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.

Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:

  • open the file, count the number of lines in the file, tell us the number, close the file

Note that tell us the number can mean writing it to the console, or storing it somewhere for use later on.

Now let's say we want to know the number of words on each line. The set of instructions would be:

  • open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the number

And so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.

So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:

  • open the file, index the lines

  • read the first line, count the number of words, tell us the number
  • read the second line, count the number of words, tell us the number
  • read the third line, count the number of words, tell us the number
  • [repeat for all lines]

  • collect final results and close the file

Here, the read the Nth line steps can be performed in parallel, because they are all independent operations.

You'll notice that we added a step, index the lines. That's a little bit of preliminary work that allows us to perform the read the Nth line steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.

Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.
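
To make this concrete, here is a minimal sketch of the word-counting example in Scala (the language Qscripts are written in), using parallel collections. This is only an illustration of the principle, not how the GATK implements it internally:

    import scala.io.Source

    object WordCounter {
      def main(args: Array[String]) {
        // "Open the file, index the lines": materialize the lines up front
        // so each one can be processed independently, in any order.
        val lines = Source.fromFile(args(0)).getLines().toVector

        // The "middle" subsets run in parallel: each line is an independent task.
        val counts = lines.par.map(line => line.split("\\s+").count(_.nonEmpty))

        // "Collect final results": gather the counts back in their original order.
        counts.seq.zipWithIndex.foreach {
          case (n, i) => println("line " + (i + 1) + ": " + n + " words")
        }
      }
    }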

2. Parallelizing the GATK

There are three different modes of parallelism offered by the GATK, and to really understand the difference you first need to understand the different levels of computing involved.

A quick word about levels of computing

By levels of computing, we mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster.

  • Core: the level below the machine. On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we follow the Greek terminology), but the CPUs on some professional-grade machines can have tens or hundreds of cores.

  • Machine: the middle of the scale. For most of us, the machine is the laptop or desktop computer. Strictly speaking we should refer to the CPU, since that's the part that actually does the processing, but the most common usage is to say machine. If the machine is part of a cluster, however, it's called a node.

  • Cluster: the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called nodes) networked together. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a server farm or a load-sharing facility.

Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of multi-threading at the core and machine levels, and scatter-gather at the cluster level.

Multi-threading

In computing, a thread of execution is a set of instructions that the program issues to the processor to get work done. In single-threading mode, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In multi-threading mode, the program may send several threads to the processor at the same time.

Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results.

If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. He can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that.

Hey, if you have a better example, let us know in the forum and we'll use that instead.

Alright, now that you understand the idea of multithreading, let's get practical: how do we get the GATK to use multi-threading?

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively. They can be combined, since they act at different levels of computing:

  • -nt / --num_threads controls the number of data threads sent to the processor (acting at the machine level)

  • -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread (acting at the core level).

Not all GATK tools can use these options, due to the nature of the analyses that they perform and how they traverse the data. Even in the case of tools that are used sequentially to perform a multi-step process, the individual tools may not support the same options. For example, at the time of writing (Dec. 2012), of the tools involved in local realignment around indels, RealignerTargetCreator supports -nt but not -nct, while IndelRealigner does not support either of these options.
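
For illustration, here is what a command combining both levels might look like for a tool that supports both, such as the UnifiedGenotyper (a sketch with placeholder file names, not a tuned configuration):

    java -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R reference.fasta \
        -I input.bam \
        -o output.vcf \
        -nt 4 \
        -nct 3

This requests 4 data threads, each running 3 CPU threads, so it will keep roughly 12 cores busy.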

In addition, there are some important technical details that affect how these options can be used with optimal results. Those are explained along with specific recommendations for the main GATK tools in a companion document on parallelizing the GATK.

Scatter-gather

If you Google it, you'll find that the term scatter-gather can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.

So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?

As you know by now, multithreading specifically refers to what happens internally when the program (in our case, the GATK) sends several sets of instructions to the processor to achieve the instructions that you originally gave it in a single command-line. In contrast, the scatter-gather strategy as used by the GATK involves a separate program, called Queue, which generates separate GATK jobs (each with its own command-line) to achieve the instructions given in a so-called Qscript (i.e. a script written for Queue in a programming language called Scala).

At the simplest level, the Qscript can involve a single GATK tool*. In that case Queue will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, Queue will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).
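
To make this concrete, here is a minimal sketch of such a single-tool Qscript, modeled on the example scripts distributed with Queue (the property names follow the standard Queue GATK extensions; the file names are placeholders):

    import org.broadinstitute.sting.queue.QScript
    import org.broadinstitute.sting.queue.extensions.gatk._

    class SingleToolExample extends QScript {
      @Input(doc="Reference fasta file", shortName="R")
      var referenceFile: File = _

      @Input(doc="Input BAM file", shortName="I")
      var bamFile: File = _

      def script() {
        val genotyper = new UnifiedGenotyper
        genotyper.reference_sequence = referenceFile
        genotyper.input_file :+= bamFile
        genotyper.out = swapExt(bamFile, "bam", "unfiltered.vcf")
        // The scatter step: Queue splits this single run into 20 separate
        // GATK jobs, then gathers their outputs into the final VCF.
        genotyper.scatterCount = 20
        add(genotyper)
      }
    }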

Note that Queue has additional capabilities, such as managing the use of multiple GATK tools in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. To learn more about pipelining the GATK with Queue, please see the Queue documentation.

Compare and combine

So you see, scatter-gather is a very different process from multi-threading because the parallelization happens outside of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But Queue can dispatch scattered GATK jobs to different machines in a computing cluster by interfacing with your cluster's job management software.

That being said, multithreading has the great advantage that all the threads share the machine's memory, which they can access with very high bandwidth. In contrast, the multiple machines used for scatter-gather on a cluster are fundamentally limited by the cost of moving data over the network.

The good news is that you can combine scatter-gather and multithreading: use Queue to scatter GATK jobs to different nodes on your cluster, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.

Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.

Comments (19)

This document provides technical details and recommendations on how the parallelism options offered by the GATK can be used to yield optimal performance results.

Overview

As explained in the primer on parallelism for the GATK, there are two main kinds of parallelism that can be applied to the GATK: multi-threading and scatter-gather (using Queue).

Multi-threading options

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:

  • -nt / --num_threads controls the number of data threads sent to the processor
  • -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread

For more information on how these multi-threading options work, please read the primer on parallelism for the GATK.

Memory considerations for multi-threading

Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory to run, if you use -nt 4, the multithreaded run will use 8 Gb of memory. In contrast, CPU threads will share the memory allocated to their “mother” data thread, so you don’t need to worry about allocating memory based on the number of CPU threads you use.
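
For example (a sketch with placeholder values), a tool that normally runs comfortably with -Xmx2g should be given four times that amount when run with four data threads:

    java -Xmx8g -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R reference.fasta \
        -I input.bam \
        -o output.vcf \
        -nt 4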

Additional consideration when using -nct with versions 2.2 and 2.3

Because of the way the -nct option was originally implemented, in versions 2.2 and 2.3, there is one CPU thread that is reserved by the system to “manage” the rest. So if you use -nct, you’ll only really start seeing a speedup with -nct 3 (which yields two effective "working" threads) and above. This limitation has been resolved in the implementation that will be available in versions 2.4 and up.

Scatter-gather

For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

Applicability of parallelism to the major GATK tools

Please note that not all tools support all parallelization modes. The parallelization modes that are available for each tool depend partly on the type of traversal that the tool uses to walk through the data, and partly on the nature of the analyses it performs.

Tool  Full name               Type of traversal  NT  NCT  SG
RTC   RealignerTargetCreator  RodWalker          +   -    -
IR    IndelRealigner          ReadWalker         -   -    +
BR    BaseRecalibrator        LocusWalker        -   +    +
PR    PrintReads              ReadWalker         -   +    -
RR    ReduceReads             ReadWalker         -   -    +
UG    UnifiedGenotyper        LocusWalker        +   +    +

Recommended configurations

The table below summarizes configurations that we typically use for our own projects (one per tool, except we give three alternate possibilities for the UnifiedGenotyper). The different values allocated for each tool reflect not only the technical capabilities of these tools (which options are supported), but also our empirical observations of what provides the best tradeoffs between performance gains and commitment of resources. Please note however that this is meant only as a guide, and that we cannot give you any guarantee that these configurations are the best for your own setup. You will probably have to experiment with the settings to find the configuration that is right for you.

Tool                RTC  IR  BR      PR   RR  UG
Available modes     NT   SG  NCT,SG  NCT  SG  NT,NCT,SG
Cluster nodes       1    4   4       1    4   4 / 4 / 4
CPU threads (-nct)  1    1   8       4-8  1   3 / 6 / 24
Data threads (-nt)  24   1   1       1    1   8 / 4 / 1
Memory (Gb)         48   4   4       4    4   32 / 16 / 4

Where NT is data multithreading, NCT is CPU multithreading and SG is scatter-gather using Queue. For more details on scatter-gather, see the primer on parallelism for the GATK and the Queue documentation.

Comments (1)

1. What is Scala?

Scala is a combination of an object-oriented framework and a functional programming language. For a good introduction see the free online book Programming Scala.

The following are extremely brief answers to frequently asked questions about Scala which often pop up when first viewing or editing QScripts. For more information on Scala there are a multitude of resources available around the web, including the Scala home page and the online Scala Doc.

2. Where do I learn more about Scala?

  • http://www.scala-lang.org
  • http://programming-scala.labs.oreilly.com
  • http://www.scala-lang.org/docu/files/ScalaByExample.pdf
  • http://devcheatsheet.com/tag/scala/
  • http://davetron5000.github.com/scala-style/index.html

3. What is the difference between var and val?

var is a value you can later modify, while val is similar to final in Java.
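
A quick illustration:

    var count = 0      // var: can be reassigned later
    count = 1          // fine
    val name = "Queue" // val: fixed once assigned, like final in Java
    // name = "GATK"   // would not compile: reassignment to val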

4. What is the difference between Scala collections and Java collections? / Why do I get the error: type mismatch?

Because the GATK and Queue are a mix of Scala and Java, sometimes you'll run into problems when you need a Scala collection but a Java collection is returned instead.

   MyQScript.scala:39: error: type mismatch;
     found   : java.util.List[java.lang.String]
     required: scala.List[String]
        val wrapped: List[String] = TextFormattingUtils.wordWrap(text, width)

Use the implicit definitions in JavaConversions to automatically convert the basic Java collections to and from Scala collections.

import collection.JavaConversions._
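
For example (a minimal sketch, not code taken from Queue itself):

    import collection.JavaConversions._

    val javaList: java.util.List[String] = java.util.Arrays.asList("a", "b", "c")
    // With the implicit conversions in scope, the Java list behaves like a
    // Scala collection; .toList makes it an immutable scala.List.
    val scalaList: List[String] = javaList.toList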

Scala has a very rich collections framework which you should take the time to enjoy. One of the first things you'll notice is that the default Scala collections are immutable, which means you should treat them as you would a String. When you want to 'modify' an immutable collection you need to capture the result of the operation, often assigning the result back to the original variable.

var str = "A"
str + "B"
println(str) // prints: A
str += "C"
println(str) // prints: AC

var set = Set("A")
set + "B"
println(set) // prints: Set(A)
set += "C"
println(set) // prints: Set(A, C)

5. How do I append to a list?

Use the :+ operator for a single value.

  var myList = List.empty[String]
  myList :+= "a"
  myList :+= "b"
  myList :+= "c"

Use ++ for appending a list.

  var myList = List.empty[String]
  myList ++= List("a", "b", "c")

6. How do I add to a set?

Use the + operator.

  var mySet = Set.empty[String]
  mySet += "a"
  mySet += "b"
  mySet += "c"

7. How do I add to a map?

Use the + and -> operators.

  var myMap = Map.empty[String,Int]
  myMap += "a" -> 1
  myMap += "b" -> 2
  myMap += "c" -> 3

8. What are Option, Some, and None?

Option is a Scala generic type that can either be some generic value or None. Queue often uses it to represent primitives that may be null.

  var myNullableInt1: Option[Int] = Some(1)
  var myNullableInt2: Option[Int] = None
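
To read the value back out safely, supply a default for the None case:

    myNullableInt1.getOrElse(0)  // returns 1
    myNullableInt2.getOrElse(0)  // returns 0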

9. What is _ / What is the underscore?

François Armand's slide deck is a good introduction: http://www.slideshare.net/normation/scala-dreaded

To quote from his slides:

Give me a variable name but
- I don't care of what it is
- and/or
- don't want to pollute my namespace with it

10. How do I format a String?

Use the .format() method.

This Java snippet:

String formatted = String.format("%s %d", myString, myInt);

In Scala would be:

val formatted = "%s %d".format(myString, myInt)

11. Can I use Scala Enumerations as QScript @Arguments?

No. Currently Scala's Enumeration class does not interact with the Java reflection API in a way that could be used for Queue command line arguments. You can use Java enums if, for example, you are importing a Java-based walker's enum type.

If/when we find a workaround for Queue we'll update this entry. In the meantime try using a String.

Comments (10)

Objective

Test that Queue is correctly installed, and that the supporting tools like Java are in your path.

Prerequisites

  • Basic familiarity with the command-line environment
  • Understand what a PATH variable is
  • GATK installed
  • Queue downloaded and placed on path

Steps

  1. Invoke the Queue usage/help message
  2. Troubleshooting

1. Invoke the Queue usage/help message

The command we're going to run is a very simple command that asks Queue to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your Queue package is installed correctly.

Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.

Action

Type the following command:

java -jar <path to Queue.jar> --help

replacing the <path to Queue.jar> bit with the path you have set up in your command-line environment.

Expected Result

You should see usage output similar to the following:

usage: java -jar Queue.jar -S <script> [-jobPrefix <job_name_prefix>] [-jobQueue <job_queue>] [-jobProject <job_project>]
       [-jobSGDir <job_scatter_gather_directory>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
       <temp_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>] [-emailTLS] [-emailSSL] [-emailUser
       <emailUsername>] [-emailPass <emailPassword>] [-emailPassFile <emailPasswordFile>] [-bsub] [-run] [-dot <dot_graph>]
       [-expandedDot <expanded_dot_graph>] [-startFromScratch] [-status] [-statusFrom <status_email_from>] [-statusTo
       <status_email_to>] [-keepIntermediates] [-retry <retry_failed>] [-l <logging_level>] [-log <log_to_file>] [-quiet]
       [-debug] [-h]

 -S,--script <script>                                                      QScript scala file
 -jobPrefix,--job_name_prefix <job_name_prefix>                            Default name prefix for compute farm jobs.
 -jobQueue,--job_queue <job_queue>                                         Default queue for compute farm jobs.
 -jobProject,--job_project <job_project>                                   Default project for compute farm jobs.
 -jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>   Default directory to place scatter gather
                                                                           output for compute farm jobs.
 -memLimit,--default_memory_limit <default_memory_limit>                   Default memory limit for jobs, in gigabytes.
 -runDir,--run_directory <run_directory>                                   Root directory to run functions from.
 -tempDir,--temp_directory <temp_directory>                                Temp directory to pass to functions.
 -emailHost,--emailSmtpHost <emailSmtpHost>                                Email SMTP host. Defaults to localhost.
 -emailPort,--emailSmtpPort <emailSmtpPort>                                Email SMTP port. Defaults to 465 for ssl,
                                                                           otherwise 25.
 -emailTLS,--emailUseTLS                                                   Email should use TLS. Defaults to false.
 -emailSSL,--emailUseSSL                                                   Email should use SSL. Defaults to false.
 -emailUser,--emailUsername <emailUsername>                                Email SMTP username. Defaults to none.
 -emailPass,--emailPassword <emailPassword>                                Email SMTP password. Defaults to none. Not
                                                                           secure! See emailPassFile.
 -emailPassFile,--emailPasswordFile <emailPasswordFile>                    Email SMTP password file. Defaults to none.
 -bsub,--bsub_all_jobs                                                     Use bsub to submit jobs
 -run,--run_scripts                                                        Run QScripts.  Without this flag set only
                                                                           performs a dry run.
 -dot,--dot_graph <dot_graph>                                              Outputs the queue graph to a .dot file.  See:
                                                                           http://en.wikipedia.org/wiki/DOT_language
 -expandedDot,--expanded_dot_graph <expanded_dot_graph>                    Outputs the queue graph of scatter gather to
                                                                           a .dot file.  Otherwise overwrites the
                                                                           dot_graph
 -startFromScratch,--start_from_scratch                                    Runs all command line functions even if the
                                                                           outputs were previously output successfully.
 -status,--status                                                          Get status of jobs for the qscript
 -statusFrom,--status_email_from <status_email_from>                       Email address to send emails from upon
                                                                           completion or on error.
 -statusTo,--status_email_to <status_email_to>                             Email address to send emails to upon
                                                                           completion or on error.
 -keepIntermediates,--keep_intermediate_outputs                            After a successful run keep the outputs of
                                                                           any Function marked as intermediate.
 -retry,--retry_failed <retry_failed>                                      Retry the specified number of times after a
                                                                           command fails.  Defaults to no retries.
 -l,--logging_level <logging_level>                                        Set the minimum level of logging, i.e.
                                                                           setting INFO get's you INFO up to FATAL,
                                                                           setting ERROR gets you ERROR and FATAL level
                                                                           logging.
 -log,--log_to_file <log_to_file>                                          Set the logging location
 -quiet,--quiet_output_mode                                                Set the logging to quiet mode, no output to
                                                                           stdout
 -debug,--debug_mode                                                       Set the logging file string to include a lot
                                                                           of debugging information (SLOW!)
 -h,--help                                                                 Generate this help message

If you see this message, your Queue installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.


2. Troubleshooting

Let's try to figure out what's not working.

Action

First, make sure that your Java version is at least 1.6, by typing the following command:

java -version

Expected Result

You should see something similar to the following text:

java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)  

Remedial actions

If the version is less than 1.6, install the newest version of Java onto the system. If you instead see something like

java: Command not found  

make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.

On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.

Comments (17)

Introduction

Processing of data originating from the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented on numerous occasions. The guidelines we describe in this document are the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (September 2011) for the datasets tested, but there is no guarantee that they will be the best for every dataset or that other pipelines won't supersede them in the future.

The pipeline we propose here is illustrated in a Qscript (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

BWA alignment

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and relax the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea of what parameters to use, check the Pacbio-specific suggestions given by the BWA author on the BWA manual page. The goal is to account for the known error modes of the Pacific Biosciences RS and to benefit from the long reads for a high-scoring overall match. (For older versions, you can use the filtered_subreads.fasta file and combine the base quality scores extracted from the h5 files using the Pacific Biosciences SMRT pipeline python tools.)
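
As an illustration of the kind of command involved (the penalty values here are only an assumption to be tuned against your own data, following the BWA manual's Pacbio suggestions):

    bwa bwasw -b 5 -q 2 -r 1 -z 10 reference.fasta filtered_subreads.fq > aligned.sam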

To produce a BAM file that is sorted by coordinate and carries adequate read group information, we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practice to have your BAM file conform to these specifications.
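
The corresponding Picard commands look something like the following sketch (the read group values are placeholders that you must replace with your own metadata):

    java -jar SortSam.jar INPUT=aligned.sam OUTPUT=sorted.bam SORT_ORDER=coordinate

    java -jar AddOrReplaceReadGroups.jar INPUT=sorted.bam OUTPUT=final.bam \
        RGID=run1 RGLB=lib1 RGPL=PACBIO RGPU=unit1 RGSM=sample1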

Best Practices for Variant Calling

Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according to the type of data you have, with the exception of indel realignment, because that tool has not been adapted for Pacific Biosciences RS data.
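
With the GATK versions current at the time this was written (1.x), the recalibration step would look something like this sketch (placeholder file names; check the documentation of your GATK version for the exact tools and arguments):

    java -jar GenomeAnalysisTK.jar -T CountCovariates \
        -R reference.fasta -I final.bam -knownSites dbsnp.vcf \
        -cov ReadGroupCovariate -cov QualityScoreCovariate -cov DinucCovariate \
        -recalFile recal_data.csv

    java -jar GenomeAnalysisTK.jar -T TableRecalibration \
        -R reference.fasta -I final.bam -recalFile recal_data.csv \
        -o final.recalibrated.bam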

Problems with Variant Calling with Pacific Biosciences

  • Calling must be more permissive of indels in the data.

You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.

  • Base quality thresholds should be adjusted to the specifics of your data.

Be aware that the Unified Genotyper has cutoffs for base quality score, and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here; you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold, and modify as necessary.

  • Reference bias

To account for the high insertion and deletion error rate of the Pacific Biosciences instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note, however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Biosciences data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.

Comments (25)

Please note that the DataProcessingPipeline qscript is no longer available.

The DPP script was only provided as an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have made the difficult decision to remove the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how Qscripts work, and read the Best Practices documentation to understand what the processing steps are and what parameters you need to set/adjust.

Data Processing Pipeline

The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK.

Introduction

Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking and base score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis ready BAM files (one per sample sequenced).

Requirements

This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.

Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible however, with some effort, to modify it for your needs.

Command-line arguments

Required Parameters

Argument (short-name)     Argument (long-name)           Description
-i <BAM file / BAM list>  --input <BAM file / BAM list>  Input BAM file, or list of BAM files.
-R <fasta>                --reference <fasta>            Reference fasta file.
-D <vcf>                  --dbsnp <dbsnp vcf>            dbSNP ROD to use (must be in VCF format).

Optional Parameters

Argument (short-name)            Argument (long-name)                           Description
-indels <vcf>                    --extra_indels <vcf>                           VCF files to use as reference indels for Indel Realignment.
-bwa <path>                      --path_to_bwa <path>                           Path to the bwa binary (usually BAM files have already been mapped, but use this option if you want to remap).
-outputDir <path>                --output_directory <path>                      Output path for the processed BAM files.
-L <GATK interval string>        --gatk_interval_string <GATK interval string>  The -L interval string to be used by the GATK (output BAMs at this interval only).
-intervals <GATK interval file>  --gatk_interval_file <GATK interval file>      An intervals file to be used by the GATK (output BAMs at these intervals only).

Modes of Operation (also optional parameters)

Argument (short-name)  Argument (long-name)    Description
-p <name>              --project <name>        The project name determines the final output (BAM file) base name. Example: NA12878 yields NA12878.processed.bam.
-knowns                --knowns_only           Perform cleaning on known indels only.
-sw                    --use_smith_waterman    Perform cleaning using Smith-Waterman.
-bwase                 --use_bwa_single_ended  Decompose the input BAM file and fully realign it using BWA, assuming single-ended reads.
-bwape                 --use_bwa_pair_ended    Decompose the input BAM file and fully realign it using BWA, assuming pair-ended reads.

The Pipeline

The data processing pipeline implements the Best Practices for raw data processing, from sequencer data (fastq files) to analysis-ready reads (BAM file):

[Flowchart: the data processing pipeline]

Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:

BWA alignment

This option is for datasets that have already been processed using a different pipeline or different criteria, and that you want to reprocess using this pipeline. One example is a BAM file that has been processed at the lane level, or that did not go through some of the Best Practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before creating sample-level BAMs and entering the pipeline.

Sample Level Processing

This is where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration.

Indel Realignment

This is a two-step process. First we create targets using the Realigner Target Creator (either for known indels only, or including indels seen in the data), then we realign the targets using the Indel Realigner (see [Local realignment around indels]), with optional Smith-Waterman realignment. The Indel Realigner also fixes mate pair information for reads that get realigned.

Base Quality Score Recalibration

This is a crucial step that re-adjusts the quality scores using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate and Dinucleotide Covariate.

The Outputs

The Data Processing Pipeline produces three types of output for each file: a fully processed BAM file, a validation report on the input and output BAM files, and an analysis of base qualities before and after recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output.

Processed Bam File

The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.

Validation Files

We validate each unprocessed sample-level BAM file and each final processed sample-level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't require that the input BAM pass all validation, but we provide the log of the validation as an informative companion to your input. The validation files are named <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.

Notice that even if your BAM file fails validation, the pipeline can still complete successfully. The validation is a strict report on the state of your BAM file. Some errors are not critical, and the output files (both pre.validation and post.validation) should give you some pointers on how to make your dataset conform better to the BAM format.

Base Quality Score Recalibration Analysis

PDF plots of the base qualities are generated before and after recalibration, for further analysis of the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named <project name>.<sample name>.pre and <project name>.<sample name>.post.

Examples

  1. Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa)

    java \
      -Xmx4g \
      -Djava.io.tmpdir=/path/to/tmpdir \
      -jar path/to/GATK/Queue.jar \
      -S path/to/DataProcessingPipeline.scala \
      -p myFancyProjectName \
      -i myDataSet.list \
      -R reference.fasta \
      -D dbSNP.vcf \
      -run

  2. Performing realignment and the full data processing pipeline in one pair-ended bam file

    java \
      -Xmx4g \
      -Djava.io.tmpdir=/path/to/tmpdir \
      -jar path/to/Queue.jar \
      -S path/to/DataProcessingPipeline.scala \
      -bwa path/to/bwa \
      -i test.bam \
      -R reference.fasta \
      -D dbSNP.vcf \
      -p myProjectWithRealignment \
      -bwape \
      -run

Comments (0)

The presentation videos for:

  • Day 1 (Best Practices for Variant Calling with GATK)
  • Day 2 (Building Analysis Pipelines with Queue)

are available here: http://www.broadinstitute.org/gatk/guide/events?id=3391

Comments (0)

Slides for:

  • Day 1 (Best Practices for Variant Calling with GATK)
  • Day 2 (Building Analysis Pipelines with Queue)

are available at this link in our DropBox slide archive:

https://www.dropbox.com/sh/ryz7bx4aeqc7mkl/44Q_2jvIyH

Comments (2)

Register now for a spot at the upcoming GATK workshop, which will be held in Cambridge, MA on October 21-22.

http://www.cvent.com/events/broade-workshop-gatk-best-practices-building-analysis-pipelines-with-queue/event-summary-de1eaa027413404ba6dc04c128d52c63.aspx

This workshop will cover the following topics:

  • GATK Best Practices for Variant Detection
  • Building Analysis Pipelines with Queue

The workshop is scheduled right before ASHG Boston, so if you're going to be in town for the conference, make sure you come a couple of days early and attend the GATK workshop!

Comments (0)

We have discovered a bug that seriously impacts the results of BQSR / BaseRecalibrator when it is run with the scatter-gather functionality of Queue. To be clear, the bug does NOT affect BaseRecalibrator runs performed "normally", i.e. WITHOUT Queue's scatter-gather.

Consequences / Solution:

Please be aware that if you have been using BaseRecalibrator scatter-gathered with Queue (GATK versions 2.0 and 2.1), your results may be wrong. You will need to redo the base recalibration of your data WITHOUT scatter-gathering.

This issue will be fixed in the next release (version 2.2). We apologize for any inconvenience this may cause you!

Comments (0)

Hi all,

I'm stumped by some behavior regarding job submission by Queue (although I'm not sure if the problem is related to Queue).

I wrote a QScript and tested it on a single machine running SGE; everything worked entirely as expected. I've since moved the analysis over to a cluster using UGE. Everything works mostly as expected, except for 2 issues: one I posted about on the forums here, and one where the job submissions begin taking longer over time.

When the pipeline is first started, jobs are submitted rapidly (within a few seconds of each other). Over time, jobs take increasingly longer (e.g. minutes) to submit, regardless of the job. The trend can be seen in the attached .pdf file. I can kill the pipeline, immediately restart it (both from scratch or not), and reproduce the behavior. Additionally, I can qsub the same jobs from a bash for loop without any problems.

The terminal pauses after outputting the "FunctionEdge - Output written to..." line for a minute (or more) between each submission, even with (what I think are) simple jobs:

INFO  13:07:14,093 FunctionEdge - Starting: rm 12.sam 12.sorted.bam 12.sorted.bai 12.sorted.intervals 12.sorted.realigned.bam 12.sorted.realigned.bai 
INFO  13:07:14,093 FunctionEdge - Output written to /data/ryanabashbash/src/Queue1104bDev/CleanupIntermediates.out 
INFO  13:08:08,810 DrmaaJobRunner - Submitted job id: 37461

On the single machine with SGE, these same jobs were submitted almost instantaneously, and I can qsub the same jobs from a bash for loop on the cluster just as fast. I'm guessing it's some sort of interaction between Queue and the architecture that I'm overlooking? Does anyone have any suggestions on how to get some traction in tracking down the cause, or has anyone seen similar behavior before?

As an aside, my .out files all contain something similar to the following errors at the beginning of the file (I haven't been able to track down the source, and I'm not sure if it is related; it doesn't seem to affect the output of the job):

sh: module: line 1: syntax error: unexpected end of file
sh: error importing function definition for `module'

Thanks for any suggestions or pointers!

Comments (7)

Hi,

Thanks to previous replies, I can now run Queue and the relevant walker on a distributed computing server. My question: if I define my Scala script to require an argument for the output file, using the -o parameter like so:

    // Required arguments.  All initialized to empty values.
    ....

    @Input(doc="Output file.", shortName="o")
    var outputFile: File = _

How do I direct the pipeline to write the result to a specified directory? Currently I have the code:

    genotyper.out = swapExt(qscript.bamFile, "bam", outputFile, "unfiltered.vcf")

Currently when I include the string -o /path/to/my/output/files/MyResearch.vcf

The script creates a series of folders within the directory where I execute Queue from. In this case my results were sent to: /Queue-2-8-1-g932cd3a/MyResearch./path/to/my/output/files/MyResearch.unfiltered.vcf

when all I wanted was the output to appear in the path: /path/to/my/output/files/MyResearch.unfiltered.vcf

As always any help is much appreciated.

Comments (7)

Our cluster is set up in such a way that each compute node has a fairly generous scratch disk associated with it. This means that it would be really nice to be able to write temporary files to those scratch disks instead of having to write them over the network to the globally accessible disk. I've been experimenting with different ways of trying to do this using Queue, but so far I've come up short.

Since Queue tries to create all temporary directories (or at least checks for their existence) before the jobs are run, I would like to know if there is any way to redirect the temporary outputs to the scratch disk using environment variables (to achieve something like this in the end: mycommand -tmp $TMPDIR ...). Since Queue basically just sends commands to be executed on the compute nodes, I don't understand why it's necessary to check all temporary dirs beforehand. I'm sure I'm missing something here, and I'd like to know what.

From my understanding of Queue I can see two types of temporary files that need to be written to globally accessible disk. The first is the compiled QScript, whose location I guess could be set by changing the following piece of code in org.broadinstitute.sting.queue.QCommandLine from:

  private lazy val qScriptPluginManager = {
    qScriptClasses = IOUtils.tempDir("Q-Classes-", "", settings.qSettings.tempDirectory)
    qScriptManager.loadScripts(scripts, qScriptClasses)
    new PluginManager[QScript](qPluginType, Seq(qScriptClasses.toURI.toURL))
  }

to:

  private lazy val qScriptPluginManager = {
    qScriptClasses = IOUtils.tempDir("Q-Classes-", "", "/path/to/globally/accessible/disk")
    qScriptManager.loadScripts(scripts, qScriptClasses)
    new PluginManager[QScript](qPluginType, Seq(qScriptClasses.toURI.toURL))
  }

Of course this shouldn't be hard coded, but fed into QCommandLine as an argument; the above is just to illustrate the point.

And the other one would be scatter-gather jobs, which can be explicitly redirected using: --job_scatter_gather_directory, to make sure that all of those end up in a globally accessible part of the file system.

It should then be possible to write all other temporary files to local, non-globally-accessible disk? Or am I completely wrong here, and is this not a path worth pursuing further?

Comments (3)

Hi,

I'm trying to use Queue with Torque through the PbsEngine jobRunner. The institutional cluster I'm trying to use doesn't allow users to select a queue. You are supposed to request the proper resources and a job routing algorithm selects the right queue for you.

I was able to confirm that making this change allows me to use Queue in this kind of environment.

In "gatk-protected / public / queue-framework / src / main / scala / org / broadinstitute / sting / queue / engine / pbsengine / PbsEngineJobRunner.scala":

Change:

    // If the job queue is set specify the job queue
    if (function.jobQueue != null) nativeSpec += " -q " + function.jobQueue
    else nativeSpec += " -q normal "

to:

    // If the job queue is set specify the job queue
    if (function.jobQueue != null) nativeSpec += " -q " + function.jobQueue

Thanks, Carlos

PS: Does the GATK team accept pull requests? If so, I could submit a pull request with this change.

Comments (2)

Hi (once more). I am attempting to run Queue with a Scala script and schedule it with JobRunner. The script works nicely, but when I run it with JobRunner I get the error

"Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'drmaa':libdrmaa.so: cannot open shared object file: No such file or directory."

When I try to pass the location of the libdrmaa.so file (-Djava.library.path=/opt/sge625/sge/lib/lx24-amd64/) the result is the same.

How would I point JobRunner to the correct path for the libdrmaa.so library?

Comments (3)

I'm working on adding RSEM to our RNAseq pipeline, which uses Queue. RSEM takes a number of inputs on the command line, so I have a case class and override commandLine for this to work. Nothing special there.

However, RSEM wants a prefix for the output sample names. If I give it sample_name, it will generate a whole bunch of files: sample_name.genes.results with expression values for genes, sample_name.isoforms.results with expression values for isoforms, sample_name.genome.bam, sample_name.genome.sorted.bam and sample_name.genome.sorted.bam.bai with mappings, etc.

What's the best way to handle this in terms of @Output?

Should I use (1):

case class rsem(inFq1: File, inFq2: File, prefix: String) extends ExternalCommonArgs {
   ...
   @Output val myPrefix = prefix
   ...

and then use the prefix in the downstream jobs? Or should I use (2):

case class rsem(inFq1: File, inFq2: File, prefix: String, bam: File, geneResults: File) extends ExternalCommonArgs {
   ...
   @Output val myBam = bam
   @Output val myGeneRes = geneResults
   ...

In (2), I would still use prefix in the def commandLine, of course.

Is there a preferred way to handle this in Queue?

Comments (2)

Hi,

I found that if I add library(grid) to gatk-protected / public / R / scripts / org / broadinstitute / sting / queue / util / queueJobReport.R, the error goes away.

INFO 15:49:53,263 QCommandLine - Writing final jobs report...
INFO 15:49:53,263 QJobsReporter - Writing JobLogging GATKReport to file /Users/cborroto/workspace/sciencemodule/data_processing/GenePeeksPipeline.jobreport.txt
INFO 15:49:53,270 QJobsReporter - Plotting JobLogging GATKReport to file /Users/cborroto/workspace/sciencemodule/data_processing/GenePeeksPipeline.jobreport.pdf
DEBUG 15:49:53,278 RScriptExecutor - Executing:
DEBUG 15:49:53,278 RScriptExecutor - Rscript
DEBUG 15:49:53,278 RScriptExecutor - -e
DEBUG 15:49:53,278 RScriptExecutor - tempLibDir = '/Users/cborroto/workspace/sciencemodule/data_processing/tmp/Rlib.5689446532231761075';install.packages(pkgs=c('/Users/cborroto/workspace/sciencemodule/data_processing/tmp/RlibSources.4324191911112824298/gsalib'), lib=tempLibDir, repos=NULL, type='source', INSTALL_opts=c('--no-libs', '--no-data', '--no-help', '--no-demo', '--no-exec'));library('gsalib', lib.loc=tempLibDir);source('/Users/cborroto/workspace/sciencemodule/data_processing/tmp/queueJobReport.6856864909021911181.R');
DEBUG 15:49:53,278 RScriptExecutor - /Users/cborroto/workspace/sciencemodule/data_processing/GenePeeksPipeline.jobreport.txt
DEBUG 15:49:53,278 RScriptExecutor - /Users/cborroto/workspace/sciencemodule/data_processing/GenePeeksPipeline.jobreport.pdf
* installing *source* package ‘gsalib’ ...
** R
** preparing package for lazy loading
** building package indices
** testing if installed package can be loaded
* DONE (gsalib)
Loading required package: methods
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
Attaching package: ‘gplots’
The following object is masked from ‘package:stats’: lowess
Loading required package: plyr
Attaching package: ‘reshape’
The following objects are masked from ‘package:plyr’: rename, round_any
[1] "Report"
[1] "Project : /Users/cborroto/workspace/sciencemodule/data_processing/GenePeeksPipeline.jobreport.txt"
Error in do.call("layer", list(mapping = mapping, data = data, stat = stat, : could not find function "arrow"
Calls: source ... geom_segment -> <Anonymous> -> <Anonymous> -> do.call
Execution halted
DEBUG 15:49:55,207 RScriptExecutor - Result: 1
WARN 15:49:55,207 RScriptExecutor - RScript exited with 1

Thanks, Carlos

Comments (2)

I wrote my first script in Scala to run the HaplotypeCaller walker of the GATK. However, I am running into some errors when I execute the *.scala script. I am unable to figure out the source of the error; any help will be appreciated.

package org.broadinstitute.sting.queue.qscripts
import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._

class haplotypeCaller extends QScript {

                @Input(doc="Reference file for the bam files",shortName="R")
                var referenceFile: File = _

                @Input(doc="One or more bam files",shortName="I")
                var bamFiles: List[File] = Nil

                @Argument(doc="the interval string",shortName="L")
                var intervalString: String = ""

                @Argument(doc="heterozygosity to be considered",shortName="heterozygosity")
                var het: Double = 0.006

                @Argument(doc="which sites to emit",shortName="output_mode")
                var outmode: String = "EMIT_ALL_SITES"

                @Output(doc="outFile to write snps and indels",shortName="out")
                var outFile: File = _

        def script(){
                val hc = new HaplotypeCaller
                hc.memoryLimit=20
                add(hc)
        }

}

command line arguments passed:

    java -jar ./Queue-2.7-4-g6f46d11/Queue.jar -S haplotypeCaller.scala -R ./reference/xx.fasta \
        -I x1.realigned.bam -I x2.realigned.bam -I x3.realigned.bam -I x4.realigned.bam \
        -I x5.realigned.bam -I x6.realigned.bam -I x7.realigned.bam -I x8.realigned.bam \
        -I x9.realigned.bam -I x10.realigned.bam -I x11.realigned.bam -I x12.realigned.bam \
        -L chr1 -heterozygosity 0.006 -output_mode EMIT_ALL_SITES -out gatk.hc.chr1.raw.snps.indels.vcf -run

error: ##### ERROR MESSAGE: Walker requires a reference but none was provided.

The reference file exists at the above-mentioned path. I even tried running with an absolute path for the reference file, but was not successful. Any help will be appreciated.
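
A likely cause (an assumption based on the script above, not a confirmed diagnosis): script() builds the HaplotypeCaller but never passes it the parsed arguments, and declaring @Input and @Argument fields does not wire them into the walker automatically. A minimal sketch of the script body with the reference and BAMs wired in (exact property names depend on the Queue version):

    def script() {
        val hc = new HaplotypeCaller
        // Pass the parsed arguments on to the walker explicitly; without
        // this the GATK never sees -R, hence "Walker requires a reference".
        hc.reference_sequence = referenceFile
        hc.input_file = bamFiles
        hc.out = outFile
        hc.memoryLimit = 20
        // The interval, heterozygosity and output-mode arguments need the
        // same treatment.
        add(hc)
    }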


Comments (22)

Hi,

I've been using Queue for pipelining now for a little while, and have run into an issue with the job report: it's blank. After the pipeline has finished, the entire contents of the file are

#:GATKReport.v1.1:0

I'm getting output from Queue saying that

 INFO  21:37:35,646 QJobsReporter - Writing JobLogging GATKReport to file /home/daniel.klevebring/bin/autoseqer/AutoSeqPipeline.jobreport.txt 

Any ideas why? The report would be a great tool, so I'm really interested in getting it to work properly.

Happy holidays,

Daniel

Comments (1)

import org.broadinstitute.sting.gatk.walkers.variantrecalibration
import org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorArgumentCollection
import org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorArgumentCollection.Mode
import org.broadinstitute.sting.queue.extensions.gatk.CountLoci
import org.broadinstitute.sting.queue.extensions.gatk._
import org.broadinstitute.sting.queue.function.JavaCommandLineFunction

class MyScript extends org.broadinstitute.sting.queue.QScript {
  @Input(doc="The reference file.", shortName="R")
  val reference:File= null
  @Input(doc="The left reads.", shortName="lr")
  val leftreads:File= null
  @Input(doc="The right reads.", shortName="rr")
  val rightreads:File= null


 def script ={
   class leftsaiFormationFunction extends CommandLineFunction{
      @Output(doc="leftSai")
      val leftsai:File=new File("leftsai.sai")
      def commandLine=required("bwa")+required("aln")+required(reference)+required(leftreads)+required(">")+
        required(leftsai)
    }
    val leftsaiFormation=new leftsaiFormationFunction()

    class rightsaiFormationFunction extends CommandLineFunction{
      @Output(doc="rightSai")
      val rightsai:File=new File("rightsai.sai")
      def commandLine=required("bwa")+required("aln")+required(reference)+required(rightreads)+required(">")+
        required(rightsai)
    }
    val rightsaiFormation=new rightsaiFormationFunction()

    class combinedSamFormationFunction extends CommandLineFunction{
      @Input(doc="leftSai")
      val leftsai:File=leftsaiFormation.leftsai
      @Input(doc="rightSai")
      val rightsai:File=rightsaiFormation.rightsai
      @Output(doc="combinedSam")
      val combinedSam:File=new File("combined.sam")
      def commandLine=required("bwa")+required("sampe")+required(reference)+required(leftsai)+
        required(rightsai)+required(leftreads)+required(
        rightreads)+required(">")+required(combinedSam)
    }
    val combinedSamFormation=new combinedSamFormationFunction()


    add(leftsaiFormation)
    add(rightsaiFormation)
    add(combinedSamFormation)

  }
}


The error message is as follows:
lxd@lxd-System-Product-Name:~/Documents/queue$ java -jar Queue.jar -S 111111.scala -R human_g1k_v37.fasta -lr C166_Modified_1.fq -rr C166_Modified_2.fq -run -startFromScratch 
INFO  16:05:35,810 QScriptManager - Compiling 1 QScript 
INFO  16:05:39,696 QScriptManager - Compilation complete 
INFO  16:05:39,759 HelpFormatter - ---------------------------------------------------------------------- 
INFO  16:05:39,759 HelpFormatter - Queue v2.7-2-g6bda569, Compiled 2013/08/28 16:33:34 
INFO  16:05:39,759 HelpFormatter - Copyright (c) 2012 The Broad Institute 
INFO  16:05:39,759 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:05:39,760 HelpFormatter - Program Args: -S 111111.scala -R human_g1k_v37.fasta -lr C166_Modified_1.fq -rr C166_Modified_2.fq -run -startFromScratch 
INFO  16:05:39,760 HelpFormatter - Date/Time: 2013/12/06 16:05:39 
INFO  16:05:39,760 HelpFormatter - ---------------------------------------------------------------------- 
INFO  16:05:39,760 HelpFormatter - ---------------------------------------------------------------------- 
INFO  16:05:39,766 QCommandLine - Scripting MyScript 
INFO  16:05:39,800 QCommandLine - Added 3 functions 
INFO  16:05:39,801 QGraph - Generating graph. 
INFO  16:05:39,811 QGraph - Running jobs. 
INFO  16:05:39,812 QGraph - Removing outputs from previous runs. 
INFO  16:05:39,840 FunctionEdge - Starting:  'bwa'  'aln'  'human_g1k_v37.fasta'  'C166_Modified_1.fq'  '>'  '/home/lxd/Documents/queue/leftsai.sai'  
INFO  16:05:39,840 FunctionEdge - Output written to /home/lxd/Documents/queue/leftsai.sai.out 
INFO  16:05:39,858 FunctionEdge - Starting:  'bwa'  'aln'  'human_g1k_v37.fasta'  'C166_Modified_2.fq'  '>'  '/home/lxd/Documents/queue/rightsai.sai'  
INFO  16:05:39,858 FunctionEdge - Output written to /home/lxd/Documents/queue/rightsai.sai.out 
INFO  16:05:39,862 QGraph - 1 Pend, 2 Run, 0 Fail, 0 Done 
ERROR 16:06:09,834 FunctionEdge - Error:  'bwa'  'aln'  'human_g1k_v37.fasta'  'C166_Modified_1.fq'  '>'  '/home/lxd/Documents/queue/leftsai.sai'  
ERROR 16:06:09,839 FunctionEdge - Contents of /home/lxd/Documents/queue/leftsai.sai.out:
/home/lxd/Documents/queue/.queue/tmp/.exec6752111377935879956: 2: /home/lxd/Documents/queue/.queue/tmp/.exec6752111377935879956: bwa: not found 
ERROR 16:06:09,842 FunctionEdge - Error:  'bwa'  'aln'  'human_g1k_v37.fasta'  'C166_Modified_2.fq'  '>'  '/home/lxd/Documents/queue/rightsai.sai'  
ERROR 16:06:09,842 FunctionEdge - Contents of /home/lxd/Documents/queue/rightsai.sai.out:
/home/lxd/Documents/queue/.queue/tmp/.exec7253750678399969841: 2: /home/lxd/Documents/queue/.queue/tmp/.exec7253750678399969841: bwa: not found 
INFO  16:06:09,843 QGraph - Writing incremental jobs reports... 
INFO  16:06:09,844 QJobsReporter - Writing JobLogging GATKReport to file /home/lxd/Documents/queue/111111.jobreport.txt 
INFO  16:06:09,869 QGraph - 1 Pend, 0 Run, 2 Fail, 0 Done 
INFO  16:06:09,870 QCommandLine - Writing final jobs report... 
INFO  16:06:09,870 QJobsReporter - Writing JobLogging GATKReport to file /home/lxd/Documents/queue/111111.jobreport.txt 
INFO  16:06:09,876 QJobsReporter - Plotting JobLogging GATKReport to file /home/lxd/Documents/queue/111111.jobreport.pdf 
WARN  16:06:10,262 RScriptExecutor - RScript exited with 1. Run with -l DEBUG for more info. 
INFO  16:06:10,263 QCommandLine - Done with errors 
INFO  16:06:10,264 QGraph - ------- 
INFO  16:06:10,266 QGraph - Failed:   'bwa'  'aln'  'human_g1k_v37.fasta'  'C166_Modified_1.fq'  '>'  '/home/lxd/Documents/queue/leftsai.sai'  
INFO  16:06:10,266 QGraph - Log:     /home/lxd/Documents/queue/leftsai.sai.out 
INFO  16:06:10,266 QGraph - ------- 
INFO  16:06:10,268 QGraph - Failed:   'bwa'  'aln'  'human_g1k_v37.fasta'  'C166_Modified_2.fq'  '>'  '/home/lxd/Documents/queue/rightsai.sai'  
INFO  16:06:10,268 QGraph - Log:     /home/lxd/Documents/queue/rightsai.sai.out 
INFO  16:06:10,269 QCommandLine - Script failed with 3 total jobs 
lxd@lxd-System-Product-Name:~/Documents/queue$ 

Your attention and an answer would be really appreciated!
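
The failing steps report bwa: not found: Queue runs each CommandLineFunction through a generated shell script, and in that shell bwa is not on the PATH. Pointing the command at an absolute path (or exporting a suitable PATH on the execution hosts) is the usual remedy. A minimal sketch under that assumption; the /usr/local/bin/bwa location is hypothetical, and since a quoted ">" is passed to bwa as a literal argument rather than performing shell redirection, the sketch uses bwa aln's -f option to write the output file directly:

import java.io.File
import org.broadinstitute.sting.commandline.{Input, Output}
import org.broadinstitute.sting.queue.function.CommandLineFunction

class LeftSaiFormationFunction(reference: File, reads: File) extends CommandLineFunction {
  @Input(doc = "reads") val in: File = reads
  @Output(doc = "left sai") val leftsai: File = new File("leftsai.sai")
  // Hypothetical absolute path; adjust to where bwa lives on the cluster nodes.
  def commandLine = required("/usr/local/bin/bwa") + required("aln") +
    required("-f", leftsai) + required(reference) + required(in)
}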

Comments (4)

Hi,

I've been running Queue using the old DataProcessingPipeline.scala script (unmodified) for over a day now, and I'm starting to think there's an infinite loop. The output keeps adding QNodes; right now it's up to almost 1,700,000. I've run the same script before with 1000 smaller bam files (about 0.5 GB each) and that seemed to work fine. Now I'm trying to run it on 1000 exomes, each on average 8 GB. I'm using the options shown in the Program Args line of the output below.

Here's the output:

INFO 11:09:17,548 HelpFormatter - ---------------------------------------------
INFO 11:09:17,548 HelpFormatter - Queue v2.7-4-g6f46d11, Compiled 2013/10/10 17:29:52
INFO 11:09:17,548 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 11:09:17,549 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
DEBUG 11:09:17,549 HelpFormatter - Current directory: /cluster/ifs/projects/Exomes/Biesecker_CS_WE/mito_express/ES/tmp
INFO 11:09:17,549 HelpFormatter - Program Args: -S /home/singhln/Projects/ES/src/DataProcessingPipeline.scala -bwa /home/singhln/bin/bwa -bt 10 -i /home/singhln/Projects/ES/Data/allalignedbams.list -R /home/singhln/Projects/ES/Data/human_g1k_v37.fasta -D /home/singhln/Projects/ES/Data/dbsnp_137.b37.vcf -p CS -bwape -log cses.log -gv -qsub -startFromScratch -jobReport queuereport.txt -memLimit 4 -tempDir /home/singhln/Projects/ES/tmp -runDir /home/singhln/Projects/ES/Dedup -l DEBUG -run
INFO 11:09:17,550 HelpFormatter - Date/Time: 2013/12/03 11:09:17
INFO 11:09:17,550 HelpFormatter - ---------------------------------------------
INFO 11:09:17,550 HelpFormatter - ---------------------------------------------
INFO 11:09:17,561 QCommandLine - Scripting DataProcessingPipeline
DEBUG 11:10:46,667 QGraph - adding QNode: 0
DEBUG 11:10:46,859 QGraph - adding QNode: 100
DEBUG 11:10:47,024 QGraph - adding QNode: 200
DEBUG 11:10:47,136 QGraph - adding QNode: 300
DEBUG 11:10:47,241 QGraph - adding QNode: 400
...
DEBUG 10:54:46,948 QGraph - adding QNode: 1647200
DEBUG 10:55:11,294 QGraph - adding QNode: 1647300
DEBUG 10:55:23,623 QGraph - adding QNode: 1647400
DEBUG 10:55:40,081 QGraph - adding QNode: 1647500
DEBUG 10:55:52,678 QGraph - adding QNode: 1647600
DEBUG 10:56:02,486 QGraph - adding QNode: 1647700

I'm not even sure how I'd go about debugging this or if this is normal, but it does seem very strange to me. No output seems to have been created during the last 24 hours either, other than the log file.

Thanks for any help, -Larry.

Comments (12)

I am working on a Queue script that uses the SelectVariants walker. Two of the arguments that I am trying to use take an enumerated type: restrictAllelesTo and selectTypeToInclude. I have tried passing these as strings, but I get Java type-mismatch errors. What is the simplest way to pass these parameters to the SelectVariants walker in the qscript?
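
The generated Queue extensions type enum-valued walker arguments with the walker's own enum classes, so a plain String cannot compile; the usual pattern is to import the enum and assign the constant. A hedged sketch of that pattern; the package paths below are assumptions, and the field's declared type in the generated SelectVariants extension names the real classes:

import org.broadinstitute.sting.queue.extensions.gatk.SelectVariants

val sv = new SelectVariants
// Assumed enum locations; verify against the generated extension source.
sv.restrictAllelesTo = org.broadinstitute.sting.gatk.walkers.variantutils.SelectVariants.NumberAlleleRestriction.BIALLELIC
sv.selectTypeToInclude :+= org.broadinstitute.variant.variantcontext.VariantContext.Type.SNP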

Comments (1)

I've submitted a pull request that adds a "logDirectory" argument to Queue. Quoting myself:

Complex Queue pipelines with a large number of intermediate steps can generate more log files than actual output, cluttering the working directory. This patch introduces a command-line parameter --logDirectory/-logDir that allows those files to be sequestered into a single directory.

The use of absolute pathnames changes the behavior of logDir. The possibilities are:

  1. logDir is not specified: No change from the existing behavior (the .out file is in the same directory as the @Output)
  2. The location for the @Output element is relative: The relative path for @Output will be rooted in logDir (whether logDir is relative or absolute)
  3. The @Output element is absolute: logDir is ignored, as if it were not specified

My assumption is that absolute directories will be rare, and that when they are used they should override any other considerations. I've tested that this works as described when logDir is not specified and when it has a value of "log" (i.e., a relative path).

At the urging of the GSA team, and following a nice model introduced by @Johan_Dahlberg, I'd like to open this change up for public comment. Obviously, I'm only considering a fairly restricted use case. Does this behavior make sense to others? Does it break anybody's workflow in any significant way?

Comments (9)

Hi,

So I've finally taken the plunge and migrated our analysis pipeline to Queue. With some great feedback from @johandahlberg, I have gotten to a state where most of the stuff is running smoothly on the cluster.

I'm trying to add Picard's CalculateHSMetrics to the pipeline, but am having some issues. This code:

case class hsmetrics(inBam: File, baitIntervals: File, targetIntervals: File, outMetrics: File) extends CalculateHsMetrics with ExternalCommonArgs with SingleCoreJob with OneDayJob {
    @Input(doc="Input BAM file") val bam: File = inBam
    @Output(doc="Metrics file") val metrics: File = outMetrics
    this.input :+= bam
    this.targets = targetIntervals
    this.baits = baitIntervals
    this.output = metrics
    this.reference =  refGenome
    this.isIntermediate = false
}

Gives the following error message:

ERROR 06:56:25,047 QGraph - Missing 2 values for function:  'java'  '-Xmx2048m'  '-XX:+UseParallelOldGC'  '-XX:ParallelGCThreads=4'  '-XX:GCTimeLimit=50'  '-XX:GCHeapFreeLimit=10'  '-Djava.io.tmpdir=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp' null 'INPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.bam'  'TMP_DIR=/Users/dankle/IdeaProjects/eclipse/AutoSeq/.queue/tmp'  'VALIDATION_STRINGENCY=SILENT'  'OUTPUT=/Users/dankle/tmp/autoseqscala/exampleIND2/exampleIND2.panel.preMarkDupsHsMetrics.metrics'  'BAIT_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'TARGET_INTERVALS=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/exampleINTERVAL.intervals'  'REFERENCE_SEQUENCE=/Users/dankle/IdeaProjects/eclipse/AutoSeq/resources/bwaindex0.6/exampleFASTA.fasta'  'METRIC_ACCUMULATION_LEVEL=SAMPLE'  
ERROR 06:56:25,048 QGraph -   @Argument: jarFile - jar 
ERROR 06:56:25,049 QGraph -   @Argument: javaMainClass - Main class to run from javaClasspath 

And yeah, it seems that the jar file is currently set to null in the command line. However, MarkDuplicates runs fine without setting the jar:

case class dedup(inBam: File, outBam: File, metricsFile: File) extends MarkDuplicates with ExternalCommonArgs with SingleCoreJob with OneDayJob {
    @Input(doc = "Input bam file") var inbam = inBam
    @Output(doc = "Output BAM file with dups removed") var outbam = outBam
    this.REMOVE_DUPLICATES = true
    this.input :+= inBam
    this.output = outBam
    this.metrics = metricsFile
    this.memoryLimit = 3
    this.isIntermediate = false
}

Why does CalculateHSMetrics need the jar, but not MarkDuplicates? Both are imported with import org.broadinstitute.sting.queue.extensions.picard._.
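
For what it's worth, the two values QGraph reports as missing are exactly what the sketch below supplies: the post above shows MarkDuplicates getting a usable default while this CalculateHsMetrics extension apparently does not, so the function has to be told which jar (or main class) to run. The Picard path is hypothetical:

import java.io.File
import org.broadinstitute.sting.queue.extensions.picard.CalculateHsMetrics

class HsMetricsWithJar extends CalculateHsMetrics {
  // Supplying jarFile (or javaMainClass) satisfies the "Missing 2 values" check;
  // the path below is a placeholder for a Picard distribution that contains the tool.
  this.jarFile = new File("/path/to/picard/CalculateHsMetrics.jar")
}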

Comments (1)

Hi there, I do apologise in case this question has been answered already but I couldn't find an updated thread on it.

In my previous code I had a conditional based on the number of samples to modify the percentBad parameter in VariantRecalibrator. Now I get an error from my scala code, as if the value were no longer a member of the class.

Checking again in the Best Practices and in the VariantRecalibrator documentation, I can't find it anymore. Could you confirm it's no longer used, or am I doing something wrong?

cheers, Francesco

Comments (1)

I am reading through the most recent workshop slides on Queue and trying to write a scala script to connect the GATK walkers. However, I'm confused about how to use the output of the previous walker as input for the next walker, especially when the previous walker has multiple outputs. For example, I wrote the following script to connect RealignerTargetCreator and IndelRealigner, with a list of bam files as input to RealignerTargetCreator. I don't know whether I should have multiple outputs from RealignerTargetCreator, or how to use multiple outputs from RealignerTargetCreator as input for IndelRealigner. My confusion is highlighted as bold comment text below; a sketch of one possible wiring follows the script.

def script() {
    val bams = QScriptUtils.createSeqFromFile(qscript.input)

    if (nContigs < 0)
      nContigs = QScriptUtils.getNumberOfContigs(bams(0))
    val baseName = ""
    val outputDir = if (qscript.outputDir.isEmpty()) baseName else qscript.outputDir + "/" + baseName


    val realigner = new RealignerTargetCreator
    realigner.reference_sequence = qscript.referenceFile
    realigner.input_file = bams
    realigner.out = new File(outputDir + "/Realigner")  // **should I have multiple outputs? how do you name individual output files?**

    val indelRealigner = new IndelRealigner
    indelRealigner.input_file :+= realigner.out  // **do I need a for-loop to go through each input files from last walker?**
    indelRealigner.targetIntervals = swapExt(realigner.out, "bam", "intervals")
    indelRealigner.nWayOut = new File(".realigned.bam")

    add(realigner)
    add(indelRealigner)

  }
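
A hedged sketch of the usual wiring, reusing the names from the script above and assuming the standard generated extension fields: RealignerTargetCreator emits a single intervals file covering every input BAM, and IndelRealigner's nWayOut mode derives one realigned BAM name per input, so neither multiple target-creator outputs nor a for-loop over inputs is needed:

val realigner = new RealignerTargetCreator
realigner.reference_sequence = qscript.referenceFile
realigner.input_file = bams
realigner.out = new File(outputDir + "/cohort.intervals")  // one intervals file for the whole cohort

val indelRealigner = new IndelRealigner
indelRealigner.reference_sequence = qscript.referenceFile
indelRealigner.input_file = bams                 // the same BAM list, not realigner.out
indelRealigner.targetIntervals = realigner.out   // the intervals file produced above
indelRealigner.nWayOut = new File(".realigned.bam")  // suffix-based naming, one BAM per input

add(realigner)
add(indelRealigner)
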
Comments (3)

Hi team,
I have Java 1.6 installed on my system. I know that GATK now works with Java 1.7, but I work on a shared system and cannot change the default Java version, so I downloaded Java 1.7 and assign the Java version in the call:

If I run:
/jre1.7.0/bin/java -Djava.io.tmpdir=tmp -jar Queue.jar --help
works fine, but if I try:
/jre1.7.0/bin/java -Djava.io.tmpdir=tmp -jar Queue.jar -S myScala.file ....
I get the following error:

DEBUG 13:42:50,953 FunctionEdge - Starting: /Qscripts > 'java' '-Xmx2048m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/Qscripts/tmp' '-cp' 'Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'HaplotypeCaller' '-I' '/resources/exampleBAM.bam' '-R' '/resources/exampleFASTA.fasta' '-nct' '1' '-o' '/raw1.vcf' '-hets' '0.005'
INFO 13:42:50,954 FunctionEdge - Output written to /raw1.vcf.out
DEBUG 13:42:50,954 IOUtils - Deleted /raw1.vcf.out
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/sting/gatk/CommandLineGATK : Unsupported major.minor version 51.0
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.broadinstitute.sting.gatk.CommandLineGATK. Program will exit.

It seems like HaplotypeCaller calls java again and uses the default java of the system, which is Java 1.6.

Can I change the "java" version parameter?

Thanks in advance,

Comments (4)

Hi team, I have a basic question, sorry. I am trying to use HaplotypeCaller and VariantRecalibrator with Queue but I don't know how to add:

VariantRecalibrator.mode = BOTH

HaplotypeCaller.genotyping_mode = DISCOVERY

I get a "not found: value BOTH/DISCOVERY" error.

Thank you in advance,

Comments (15)

I am trying to build Queue from the Sting package downloaded from GitHub, but the ant build process always fails with different errors. I wonder if there is any alternative way to build Queue. Is there any scala script available that I can study or customize for automating GATK runs?

Comments (1)

I'm wondering if there's any way to skip the GATKCommandLine line in the VCF header (in a VCF file generated by UnifiedGenotyper). I thought that the --remove_program_records flag would do this, but it doesn't seem to do the trick. I'm still seeing the header line.

##GATKCommandLine=<ID=UnifiedGenotyper,Version=2.7-2-g6bda569,Date="Fri Sep 27 15:17:59 CEST 2013", [...] >

The reason this is important to me is that I'm using the Pipeline test code provided in Queue and, as you know, this is based on md5 sums; since the time when the tests were run is included, the md5 hash changes for each run. So, if there is no way to skip the header, is there any other, better way to do this?

Cheers, Johan

Comments (19)

When using Queue for BQSR scatter/gather parallelism I've been seeing the following:

java.lang.IllegalArgumentException: Table1 188,3 not equal to 189,3
        at org.broadinstitute.sting.utils.recalibration.RecalUtils.combineTables(RecalUtils.java:808)
        at org.broadinstitute.sting.utils.recalibration.RecalibrationReport.combine(RecalibrationReport.java:147)
        at org.broadinstitute.sting.gatk.walkers.bqsr.BQSRGatherer.gather(BQSRGatherer.java:86)
        at org.broadinstitute.sting.queue.function.scattergather.GathererFunction.run(GathererFunction.scala:42)
        at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:53)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)

I'm using GATK version v2.4-7-g5e89f01 (I can't keep up the pace with you guys). I'm wondering if this is a known issue, and if so, whether there is a workaround or a fix in later GATK versions.

Cheers, Johan

Comments (3)

When trying to run the HaplotypeCaller using GATK Queue, all my jobs failed because they couldn't write to the tmp files in the ./.queue directory.

This seems to be related to the working directory from which the Queue command is started and in which the ./.queue directory is created.

The head node of the SGE grid I want to use, and from which I start Queue, is server8.

When I start the Queue command on that server, the working directory is translated from a path that is accessible from everywhere on the SGE cluster to a path that is only accessible locally.

Example working directory: /home/sge_share_sever8/variantCalling is translated to the absolute path /data/variantCalling, which is only accessible locally.

If I start Queue from a working directory that is mounted from another server, everything works fine.

Example working directory: /home/sge_share_sever12/variantCalling stays /home/sge_share_sever12/variantCalling and everything works fine.

Is it a bug or a feature that the ./.queue directory is translated from a normal path to an absolute path?

Or can I specify another location for the ./.queue directory that will not be translated to an absolute path?

Comments (7)

This is not a question, per se - I suppose it's more of an observation.

We recently upgraded LSF on one of our clusters to v9.0.1, and quickly discovered that Queue can't submit jobs. The reaction was rather violent - the entire JVM crashed, and the stack trace showed it dying in lsb_submit(). We downgraded LSF to v8.3.0, and everything is working fine (so far).

I know Queue is compiled against the LSF v7.0.6 API; it would appear that it's not binary-compatible with LSF 9.x.

Hope this helps others in the future...

Comments (2)

Hi, I am trying to assign a queue name to the individual bsub commands generated by Queue.jar. Basically I want it to generate

bsub -q short "the command"
instead of
bsub "the command"

Any advice on how to assign queue names to submit to from within the scala file would be really helpful. Thanks.
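
A hedged sketch of one way to do this, assuming the standard jobQueue property on CommandLineFunction (Queue also accepts a global -jobQueue argument on the command line):

import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.function.CommandLineFunction

class QueueNameExample extends QScript {
  class EchoFunction extends CommandLineFunction {
    def commandLine = required("echo") + required("hello")
  }

  def script() {
    val f = new EchoFunction
    f.jobQueue = "short"  // with -bsub this should surface as: bsub -q short ...
    add(f)
  }
}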

Comments (8)

I'm working on a set of related Queue scripts. I would like to have functionality shared between them, ideally in separate scala files which would be imported. Is there a way to specify additional paths for the Queue scala compiler to search, or do I have to bake my library into the GATK when I build it?

Comments (1)

I've noticed some strange behavior from Queue where in some cases, when I scatter/gather the Unified Genotyper in indel mode, it will introduce cycles in the graph. This causes Queue to die with a StackOverflowError, which seems to be caused by the graphDepth function in QGraph, the recursion becoming unbounded. This caused me some headaches yesterday as I tried to figure out how to make the function tail-recursive, before noticing the message ERROR 17:18:21,292 QGraph - Cycles were detected in the graph this morning.

This leads me to one request and one question. First the request: it would be nice if Queue would exit when graph validation fails, as that would make identifying the source of the problem simpler. Is this possible?

Secondly the question: do you have any ideas as to what might cause the cycles?

I have tried looking at the graphviz files and I cannot identify any cycles from those (though when looking at the s/g-plots it's really difficult to make any sense of it).

My code looks like this:

val candidateSnps = new File(outputDir + "/" + projectName + ".candidate.snp.vcf")
val candidateIndels = new File(outputDir + "/" + projectName + ".candidate.indel.vcf")

// SNP and INDEL Calls
add(snpCall(cohortList, candidateSnps))
add(indelCall(cohortList, candidateIndels))

val targets = new File(outputDir + "/" + projectName + ".targets")
add(target(candidateIndels, targets))

// Take regions based on indels called in previous step
val postCleaningBamList =
  for (bam <- cohortList) yield {
    val indelRealignedBam = swapExt(bam, ".bam", ".clean.bam")
    add(clean(Seq(bam), targets, indelRealignedBam))
    indelRealignedBam
  }

val afterCleanupSnps = swapExt(candidateSnps, ".candidate.snp.vcf", ".cleaned.snp.vcf")
val afterCleanupIndels = swapExt(candidateIndels, ".candidate.indel.vcf", ".cleaned.indel.vcf")

// Call snps/indels again
add(snpCall(postCleaningBamList, afterCleanupSnps))
add(indelCall(postCleaningBamList, afterCleanupIndels))

Where the cohortList is a Seq[File].

Right now I've solved this by setting this.scatterCount = 1 in the indelCall case class; however, this doesn't feel quite satisfactory to me, so any pointers for a more robust solution would be greatly appreciated.

Comments (6)

Hello, I'm new to GATK and Queue. I understand that we can write a QScript in Queue to generate separate GATK jobs and run them on a cluster of several nodes. Can we implement GATK or Queue on Google Hadoop?

Comments (10)

Good morning team!

First, I have to qualify my question: I'm a unix sysadmin trying to get the "queue" functionality implemented in our cluster so our analysts can play. I'm hoping my question is simple, here goes:

We have SGE, and I have downloaded the binary "queue" package.

My first attempt at executing the "hello world" example came up with this error:

kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
INFO 11:04:28,560 QScriptManager - Compiling 1 QScript
INFO 11:04:31,265 QScriptManager - Compilation complete
INFO 11:04:31,340 HelpFormatter - ----------------------------------------------------------------------
INFO 11:04:31,340 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04
INFO 11:04:31,340 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 11:04:31,340 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 11:04:31,341 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
INFO 11:04:31,341 HelpFormatter - Date/Time: 2013/06/05 11:04:31
INFO 11:04:31,341 HelpFormatter - ----------------------------------------------------------------------
INFO 11:04:31,341 HelpFormatter - ----------------------------------------------------------------------
INFO 11:04:31,346 QCommandLine - Scripting HelloWorld
INFO 11:04:31,363 QCommandLine - Added 1 functions
INFO 11:04:31,364 QGraph - Generating graph.
INFO 11:04:31,373 QGraph - Running jobs.
ERROR 11:04:31,427 QGraph - Uncaught error running jobs.
java.lang.UnsatisfiedLinkError: Unable to load library 'drmaa': libdrmaa.so: cannot open shared object file: No such file or directory

ooops! Seems I can't find the drmaa library by default. So, I fixed that by adding the following directory to the library search path on the node: /gridware/sge/lib/lx-amd64 (which is where that library lives).

Success! Sort of. The error above is resolved, but I am now getting the error below, and this is where I'm stuck. It doesn't look like the job is actually getting submitted, OR, it's getting submitted and dies. I would really appreciate any insight the team can offer, we are very excited to try to get this environment to work, thank you in advance!

kcb@lima:~> java -jar /apps/Queue-2.5-2-gf57256b/Queue.jar -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
INFO 11:07:52,728 QScriptManager - Compiling 1 QScript
INFO 11:07:55,208 QScriptManager - Compilation complete
INFO 11:07:55,271 HelpFormatter - ----------------------------------------------------------------------
INFO 11:07:55,271 HelpFormatter - Queue v2.5-2-gf57256b, Compiled 2013/05/01 09:29:04
INFO 11:07:55,271 HelpFormatter - Copyright (c) 2012 The Broad Institute
INFO 11:07:55,271 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 11:07:55,272 HelpFormatter - Program Args: -S /apps/Queue-2.5-2-gf57256b/examples/HelloWorld.scala -jobRunner GridEngine -run
INFO 11:07:55,272 HelpFormatter - Date/Time: 2013/06/05 11:07:55
INFO 11:07:55,272 HelpFormatter - ----------------------------------------------------------------------
INFO 11:07:55,272 HelpFormatter - ----------------------------------------------------------------------
INFO 11:07:55,276 QCommandLine - Scripting HelloWorld
INFO 11:07:55,292 QCommandLine - Added 1 functions
INFO 11:07:55,292 QGraph - Generating graph.
INFO 11:07:55,298 QGraph - Running jobs.
INFO 11:07:55,481 FunctionEdge - Starting: echo hello world
INFO 11:07:55,482 FunctionEdge - Output written to /shared/users/kcb/HelloWorld-1.out
ERROR 11:07:55,507 Retry - Caught error during attempt 1 of 4.
org.ggf.drmaa.InternalException: Error reading answer list from qmaster
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:400)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.checkError(JnaSession.java:392)
        at org.broadinstitute.sting.jna.drmaa.v1_0.JnaSession.runJob(JnaSession.java:79)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply$mcV$sp(DrmaaJobRunner.scala:87)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner$$anonfun$liftedTree1$1$1.apply(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.util.Retry$.attempt(Retry.scala:49)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.liftedTree1$1(DrmaaJobRunner.scala:85)
        at org.broadinstitute.sting.queue.engine.drmaa.DrmaaJobRunner.start(DrmaaJobRunner.scala:84)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:84)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:434)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:156)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:171)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
ERROR 11:07:55,510 Retry - Retrying in 1.0 minute.

Comments (3)

Hi,

I just managed to use HaplotypeCaller with the latest version of Queue to call variants on 40 human exomes. The HaplotypeCaller job was scattered into 50 sub-jobs and spread across our cluster with Sun Grid Engine.

The problem I found is that the sub-jobs take quite variable times to finish, from 5 hours to 80 hours, with the majority below 55 hours; hence the whole job was actually slowed down by just a few longer sub-jobs. I know that part of the difference was definitely caused by the performance of the cluster node running the job, but I think the major cause of the difference lies in how the job was split. The qscript I used is adapted from here (without the filtering part), and from it I cannot figure out how the job was split. Hence, I am wondering if anyone could tell me on what basis (genomic regions?) the HaplotypeCaller job was actually scattered, and how I can split the job more evenly so most of the sub-jobs finish at about the same time.

Thanks in advance,

Best,

Yaobo

Comments (2)

At the Minnesota Supercomputing Institute, our environment requires that jobs on our high-performance clusters reserve an entire node. I have implemented my own Torque manager/runner for our environment based on the Grid Engine manager/runner. The way I have gotten this to work in our environment is to set the nCoresRequest for the scatter/gather method to the minimum required of eight. My understanding is that for the IndelRealigner, for example, the job reserves a node with eight cores but only uses one. That means our users would have their compute-time allocation consumed eight times faster than necessary.

What I am wondering is: are there options I am missing by which some number of the scatter/gather requests can be grouped into a single job submission? If I were writing this as a PBS script for our environment and I wanted to use 16 cores in a scatter/gather implementation, I would write two jobs, each with eight commands. They would look something like the following:

#PBS Job Configuration stuff
pbsdsh -n 0 java -jar ... &
pbsdsh -n 1 java -jar ... &
pbsdsh -n 2 java -jar ... &
pbsdsh -n 3 java -jar ... &
pbsdsh -n 4 java -jar ... &
pbsdsh -n 5 java -jar ... &
pbsdsh -n 6 java -jar ... &
pbsdsh -n 7 java -jar ... &
wait

Has anyone done something similar in Queue? Any pointers? Thanks in advance!

Comments (2)

My qscript for the GATK PrintReads walker takes a very long time in the gather function (the BAM gather), which uses Picard MergeSamFiles. How can I write a custom gather function in the qscript to improve performance? 1. I want to try multi-threading for Picard MergeSamFiles (it is off by default). 2. I also want to try simple concatenation of the BAM files (the scattered BAM files are already sorted).
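
One possible route, sketched under the assumption that the scatter/gather machinery picks its gatherer from the @Gather annotation on the walker's output field and that GatherFunction exposes gatherParts and originalOutput as the bundled gatherers do. The toy gatherer below just concatenates the scattered parts byte for byte; raw concatenation is not by itself a valid BAM (each part carries its own header and BGZF terminator), so treat it purely as a skeleton showing where a faster gather would plug in:

import org.broadinstitute.sting.queue.function.InProcessFunction
import org.broadinstitute.sting.queue.function.scattergather.GatherFunction

class ConcatenateGatherFunction extends InProcessFunction with GatherFunction {
  def run() {
    val out = new java.io.FileOutputStream(originalOutput)
    try {
      gatherParts.foreach { part =>
        val in = new java.io.FileInputStream(part)
        try {
          val buf = new Array[Byte](1 << 16)
          Iterator.continually(in.read(buf)).takeWhile(_ > 0).foreach(n => out.write(buf, 0, n))
        } finally in.close()
      }
    } finally out.close()
  }
}

// Selected on the walker's output field, e.g.:
//   @Output @Gather(classOf[ConcatenateGatherFunction]) var outBam: File = _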

Comments (1)

My lab is trying to use Queue, but our pipeline is spawning a very large number of jobs (as would be expected). Our lab has very spiky data-processing requirements, and it would be useful to be able to limit the number of jobs Queue submits at a time so other, non-Queue jobs can process in parallel. Is this possible?

Thanks

Comments (2)

Hi,

I am trying to build GATK with Queue from the HEAD of the gatk-protected repository (2a7af4316478348f7ea58e0803b3391593d6dbd6). I am getting the following error during the Queue extensions build phase:

[java] Exception in thread "main" java.lang.NoClassDefFoundError: org/broadinstitute/sting/commandline/MissingArgumentException
[java]     at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:180)
[java]     at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
[java]     at org.broadinstitute.sting.queue.extensions.gatk.GATKExtensionsGenerator.main(GATKExtensionsGenerator.java:104)
[java] Caused by: java.lang.ClassNotFoundException: org.broadinstitute.sting.commandline.MissingArgumentException
[java]     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
[java]     at java.security.AccessController.doPrivileged(Native Method)
[java]     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
[java]     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
[java]     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
[java]     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
[java]     ... 3 more

BUILD FAILED
/hpc/users/lindem03/packages/gatk-mssm/dev/build.xml:509: Java returned: 1
        at org.apache.tools.ant.taskdefs.Java.execute(Java.java:111)
        at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
        at org.apache.tools.ant.Task.perform(Task.java:348)
        at org.apache.tools.ant.Target.execute(Target.java:392)
        at org.apache.tools.ant.Target.performTasks(Target.java:413)
        at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
        at org.apache.tools.ant.Project.executeTarget(Project.java:1368)
        at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
        at org.apache.tools.ant.Project.executeTargets(Project.java:1251)
        at org.apache.tools.ant.Main.runBuild(Main.java:811)
        at org.apache.tools.ant.Main.startAnt(Main.java:217)
        at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
        at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)

When I run ant in debug mode I see the directory with the newly built GATK classes on the classpath. Is there something else I might be missing? I am trying to upgrade from 1.6...

Thanks,

Michael

Comments (4)

GATK Queue implements a Scatter/Gather algorithm to create a set of intervals in order to parallelise data analysis. If the -scatter_gather option is issued, a corresponding number of interval files will be created and the input BAM files will be processed using these intervals. However, a question arises: what happens if the point of a subsequent analysis is at or near the start/end of an interval? Are all the tools which support the -l/-L options robust with respect to interval positions? Since the input files are not actually spliced, all information is available to the processing program, which could make the right decisions so that no artefacts are produced. Are there any restrictions on interval creation? Perhaps an interval should be at least a few read lengths long. Anything else? Thanks in advance!

Comments (4)

I was frustrated by the .metrics file from MarkDuplicates getting deleted as an intermediate file, so I set isIntermediate=false for that step in the DataProcessingPipeline. But now I'm getting tired of manually deleting the intermediate bams.

So my request is, could that field be changed from an @Output to an @Argument? This would be on line 50 of org.broadinstitute.sting.queue.extensions.picard.MarkDuplicates.scala. I also made that a required field in my local copy, since it is required to run the Picard tool.

A similar but opposite problem is that the bai file from the IndelRealigner step is not deleted - but that looks like it would require either special handling for that walker in Queue or for the index file to be an argument to the Java walker. Neither is a particularly appealing solution.

Comments (6)

I had no problem running GATK two weeks ago. But today, when I run the following GATK command, I get an error message. It seems it cannot load the library "liblsf.so". Please see below. Has there been any recent change to the GATK libraries?

2:15pm qyu@vbronze /bit/data01/projects/GC_coverage $ java -Xmx4g -Djava.io.tmpdir=/broad/hptmp/vdauwera -jar /humgen/gsa-scr1/vdauwera/gatk/walker/dist/Queue.jar -S /bit/data01/projects/GC_coverage/src/IntervalCovGG.scala -i /humgen/gsa-hpprojects/GATK/data/genes_of_interest.exon_targets.interval_list -b /bit/data01/projects/GC_coverage/testGCcoverage/data/DEV-2129.list -o /bit/data01/projects/GC_coverage/testGCcoverage/DEV-2129.gatkreport -m 16 -sc 100 -bsub -jobQueue priority -startFromScratch -run

The error shows:

ERROR 14:13:06,238 QGraph - Uncaught error running jobs.
java.lang.UnsatisfiedLinkError: Unable to load library 'lsf': liblsf.so: cannot open shared object file: No such file or directory
        at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:163)
        at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:236)
        at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:199)
        at org.broadinstitute.sting.jna.lsf.v7_0_6.LibBat.<clinit>(LibBat.java:90)
        at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner$.<init>(Lsf706JobRunner.scala:233)
        at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner$.<clinit>(Lsf706JobRunner.scala)
        at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner.<init>(Lsf706JobRunner.scala:47)
        at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobManager.create(Lsf706JobManager.scala:35)
        at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobManager.create(Lsf706JobManager.scala:33)
        at org.broadinstitute.sting.queue.engine.QGraph.newRunner(QGraph.scala:632)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:408)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:131)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:127)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'lsf': liblsf.so: cannot open shared object file: No such file or directory
        at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:163)
        at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:236)
        at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:199)
        at org.broadinstitute.sting.jna.lsf.v7_0_6.LibBat.<clinit>(LibBat.java:9
.........

Thanks, Qing

Comments (3)

Is there any rule of thumb for allocating memory through "bsub" for running the DataProcessingPipeline, per bam file or per number of reads?

Thanks

Comments (16)

I am running the following command to test my Scala script using GATK 2.3.5:

> java -Xmx4g -Djava.io.tmpdir=tmp -jar /Queue-2.3-5-g49ed93c/Queue.jar -S Queue-2.3-5-g49ed93c/ExampleCountReads.scala -R /GATK-2.3-5-g49ed93c/resources/exampleFASTA.fasta -I /GATK-2.3-5-g49ed93c/resources/exampleBAM.bam

and I am getting this error

> INFO  13:42:55,166 QScriptManager - Compiling 1 QScript 
> ERROR 13:42:55,348 QScriptManager - ExampleCountReads.scala:39: in XML literal: '=' expected instead of 'G' 
> ERROR 13:42:55,356 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,356 QScriptManager -                           ^ 
> ERROR 13:42:55,357 QScriptManager - ExampleCountReads.scala:39: in XML literal: ' or " delimited attribute value or '{' scala-expr '}' expected 
> ERROR 13:42:55,358 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,358 QScriptManager -                            ^ 
> ERROR 13:42:55,358 QScriptManager - ExampleCountReads.scala:39: in XML literal: whitespace expected 
> ERROR 13:42:55,359 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,360 QScriptManager -                             ^ 
> ERROR 13:42:55,360 QScriptManager - ExampleCountReads.scala:39: in XML literal: '=' expected instead of '>' 
> ERROR 13:42:55,361 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,362 QScriptManager -                                               ^ 
> ERROR 13:42:55,362 QScriptManager - ExampleCountReads.scala:39: in XML literal: ' or " delimited attribute value or '{' scala-expr '}' expected 
> ERROR 13:42:55,363 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,365 QScriptManager -                                                 ^ 
> ERROR 13:42:55,366 QScriptManager - ExampleCountReads.scala:39: in XML literal: whitespace expected 
> ERROR 13:42:55,367 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,367 QScriptManager -                                                  ^ 
> ERROR 13:42:55,367 QScriptManager - ExampleCountReads.scala:39: in XML literal: '>' expected instead of ' ' 
> ERROR 13:42:55,369 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,369 QScriptManager -                                                   ^ 
> ERROR 13:42:55,369 QScriptManager - ExampleCountReads.scala:67: in XML literal: in XML content, please use '}}' to express '}' 
> ERROR 13:42:55,369 QScriptManager -   } 
> ERROR 13:42:55,369 QScriptManager -   ^ 
> ERROR 13:42:55,370 QScriptManager - ExampleCountReads.scala:39:  I encountered a '}' where I didn't expect one, maybe this tag isn't closed <WalkerName> 
> ERROR 13:42:55,371 QScriptManager -     // java -jar <path to GenomeAnalysisTK.jar> -T <WalkerName> -help 
> ERROR 13:42:55,371 QScriptManager -                                                     ^ 
> ERROR 13:42:55,371 QScriptManager - ExampleCountReads.scala:68: '}' expected but eof found. 
> ERROR 13:42:55,371 QScriptManager - } 
> ERROR 13:42:55,372 QScriptManager -  ^ 
> ERROR 13:42:55,390 QScriptManager - 10 errors found 
> ##### ERROR ------------------------------------------------------------------------------------------
> ##### ERROR stack trace 
> org.broadinstitute.sting.queue.QException: Compile of /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/ExampleCountReads.scala failed with 10 errors
>   at org.broadinstitute.sting.queue.QScriptManager.loadScripts(QScriptManager.scala:46)
>   at org.broadinstitute.sting.queue.QCommandLine.org$broadinstitute$sting$queue$QCommandLine$$qScriptPluginManager(QCommandLine.scala:94)
>   at org.broadinstitute.sting.queue.QCommandLine.getArgumentSources(QCommandLine.scala:225)
>   at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:197)
>   at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
>   at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:61)
>   at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
> ##### ERROR ------------------------------------------------------------------------------------------
> ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.3-5-g49ed93c):
> ##### ERROR
> ##### ERROR Please visit the wiki to see if this is a known problem
> ##### ERROR If not, please post the error, with stack trace, to the GATK forum
> ##### ERROR Visit our website and forum for extensive documentation and answers to 
> ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
> ##### ERROR
> ##### ERROR MESSAGE: Compile of /medpop/mpg-psrl/Parabase/Tools/Queue-2.3-5-g49ed93c/ExampleCountReads.scala failed with 10 errors
> ##### ERROR ------------------------------------------------------------------------------------------
> INFO  13:42:55,487 QCommandLine - Shutting down jobs. Please wait... 
> 13.462u 1.029s 0:10.41 139.0% 0+0k 0+0io 1pf+0w
> 

I am sure java, gatk and queue are installed properly and the files exist in the relevant directories. Thank you,

Comments (0)

Hi, folks,

I am trying out my Queue scala scripts and often get the error message:

Conflicting collector combinations in option list; please refer to the release notes for the combinations allowed

However, from the debug information, the command line looks correct (?):

ERROR 12:07:05,460 FunctionEdge - Error: 'java' '-Xmx8192m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/Users/wxing/TPU/run/.queue/tmp' '-cp' '/Users/wxing/TPU/gatk/dist/Queue.jar' 'net.sf.picard.sam.SortSam' 'INPUT=/Users/wxing/TPU/run/SRR064286_1.fastq.aligned.sam' 'TMP_DIR=/Users/wxing/TPU/run/.queue/tmp' 'OUTPUT=/Users/wxing/TPU/run/SRR064286_1.fastq.aligned.bam' 'VALIDATION_STRINGENCY=SILENT' 'SO=coordinate' 'CREATE_INDEX=true'

Has anyone had similar issues? Any tricks for debugging?

Many thanks, Wei

Comments (7)

Hi,

I am testing Queue scripts with a newly installed LSF v8.3. The test script is:

java -Djava.io.tmpdir=/tmp -jar jar2216/Queue.jar -S Queue-2.2-16-g9f648cb/resources/ExampleCountReads.scala -R Queue-2.2-16-g9f648cb/resources/exampleFASTA.fasta -I Queue-2.2-16-g9f648cb/resources/exampleBAM.bam --bsub -run

where I get the following error message:

'java' '-Xmx1024m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/data/cmb/wxing/gatk/.queue/tmp' '-cp' '/data/cmb/wxing/gatk/jar2216/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'CountReads' '-I' '/data/cmb/wxing/gatk/Queue-2.2-16-g9f648cb/resources/exampleBAM.bam' '-R' '/data/cmb/wxing/gatk/Queue-2.2-16-g9f648cb/resources/exampleFASTA.fasta'
java.lang.UnsatisfiedLinkError: Error looking up function 'ls_getLicenseUsage': /usr/local/lsf/8.3/linux2.6-glibc2.3-x86_64/lib/liblsf.so: undefined symbol: ls_getLicenseUsage
        at com.sun.jna.Function.<init>(Function.java:179)
        at com.sun.jna.NativeLibrary.getFunction(NativeLibrary.java:344)
        at com.sun.jna.NativeLibrary.getFunction(NativeLibrary.java:324)
        at com.sun.jna.Native.register(Native.java:1341)
        at com.sun.jna.Native.register(Native.java:1018)
        at org.broadinstitute.sting.jna.lsf.v7_0_6.LibLsf.<clinit>(LibLsf.java:86)
        at java.lang.J9VMInternals.initializeImpl(Native Method)
        at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
        at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner$.unitDivisor(Lsf706JobRunner.scala:401)
        at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner$.org$broadinstitute$sting$queue$engine$lsf$Lsf706JobRunner$$convertUnits(Lsf706JobRunner.scala:416)
        at org.broadinstitute.sting.queue.engine.lsf.Lsf706JobRunner.start(Lsf706JobRunner.scala:98)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:83)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:432)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:154)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:145)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:237)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:147)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)

Any clues on the issue "java.lang.UnsatisfiedLinkError: Error looking up function 'ls_getLicenseUsage': /usr/local/lsf/8.3/linux2.6-glibc2.3-x86_64/lib/liblsf.so"? Or has anyone had similar problems?

Does anyone think it could be the version of our LSF (v8.3), as the code seems to be based on version 7.0.6?

Many thanks, Wei

Comments (2)

I'm building a variant calling qscript (it's available here), heavily based on the MethodsDevelopmentCallingPipeline.scala. I do however run into trouble when setting the "this.scatterCount" of the GenotyperBase to more than 1, in which case I get a NullPointerException (I include the full error message below).

I use the following command line:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/NewVariantCalling.scala -i NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam -R /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta -res /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/ **-sg 2** -nt 8 -run -l DEBUG -startFromScratch

As you can see, I'm using the files from the gatk bundle, and I guess these should be alright for this purpose? Just to be clear, I use the "-res" parameter to point to the directory where all the resource files are located (dbsnp, hapmap, etc.), and the -sg parameter is what controls the scatter/gather count.

I've tried to search the code for what might be causing this, and I can conclude that org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc is called with str (its parameter) being an empty string, which is what causes contig to be null, which in turn creates the NullPointerException on line 408 when this line is executed: stop = getContigInfo(contig).getSequenceLength();

This, I guess, is the obvious stuff, but so far I haven't been able to figure this out any further than this. I'm not sure if this is caused by a bug in my script or by a bug in the GATK. Right now I'm thinking the latter of the two, since I have used the scatter/gather function in other scripts without any trouble.

Any ideas of where to continue from here, or confirmation that this is indeed something related to the GATK code would be much appreciated.

Cheers, Johan

ERROR 16:22:50,781 FunctionEdge - Error: LocusScatterFunction: List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf.idx, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam) > List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_1_of_2/scatter.intervals, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_2_of_2/scatter.intervals) 
java.lang.NullPointerException
        at org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:408)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:84)
        at org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:97)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.loadIntervals(IntervalUtils.java:538)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindingsPair(IntervalUtils.java:520)
        at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindings(IntervalUtils.java:499)
        at org.broadinstitute.sting.queue.extensions.gatk.GATKIntervals.locs(GATKIntervals.scala:60)
        at org.broadinstitute.sting.queue.extensions.gatk.LocusScatterFunction.run(LocusScatterFunction.scala:39)
        at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:28)
        at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:83)
        at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:432)
        at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:154)
        at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:145)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
Comments (5)

Is there anywhere I can find the integration test file with the md5sum "45d97df6d291695b92668e8a55c54cd0", which is used in the DataProcessingPipelineTest class? Since my tests fail with a different calculated md5sum, I would be interested to know what the differences between the files are.

Comments (13)

Hi there, I wanted to reproduce in my variant calling Queue script the same conditional you have in MethodsDevelopmentCallingPipeline, i.e. including InbreedingCoeff depending on the number of samples. However, in that script the number of samples is passed to the Target object as an integer, and I would like to count it from the bam file list passed as an input to the script.

Therefore I followed the method in DataProcessingPipeline, i.e.

import org.broadinstitute.sting.queue.util.QScriptUtils
[...]
@Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
var bamFile: File = _
[...]
val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
val sampleNo = bamFilesList.size

But unfortunately, despite DataProcessingPipeline working just fine, when I put these lines in my other script I get the following error:

INFO  12:48:08,616 HelpFormatter - Date/Time: 2012/11/08 12:48:08 
INFO  12:48:08,616 HelpFormatter - ---------------------------------------------------------------------- 
INFO  12:48:08,616 HelpFormatter - ---------------------------------------------------------------------- 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.sting.utils.exceptions.DynamicClassResolutionException: Could not create module HaplotypeCallerStep because      Cannot instantiate class (Invocation failure) caused by exception null
at org.broadinstitute.sting.utils.classloader.PluginManager.createByType(PluginManager.java:306)
at org.broadinstitute.sting.utils.classloader.PluginManager.createAllTypes(PluginManager.java:317)
at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:126)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-2-gf44cc4e):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Could not create module HaplotypeCallerStep because Cannot instantiate class (Invocation failure) caused by exception null
##### ERROR ------------------------------------------------------------------------------------------

I tried several alternatives looking at the imports in DataProcessingPipeline, but maybe I am missing something. Could you please advise?

Thanks very much, Francesco

Comments (3)

What is the best way to get Queue to optimize utilization of a given number of cores on an SGE cluster? DataProcessingPipeline.scala has a hidden parameter, "scatter_gather", which sets the nContigs variable. Is it safe to use this option? For example, if you had 100 cores available in your cluster, could you set the option to 100? Is there any advantage to setting it higher?

Without setting it, Queue appears to set the nContigs value based on the number of chromosomes in the BAM input: with a whole-genome BAM it's 25, with your example chr20 data it's 1, and with an unaligned BAM it's 0. So if starting with unaligned data, does it run on a single core?
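
For what it's worth, the knob that parameter feeds is each function's scatter count. A minimal sketch of setting it directly inside a qscript's script() method (the genotyper and the value 100 are just for illustration; other required arguments are omitted):

    // sketch: Queue functions that mix in ScatterGatherableFunction expose a
    // scatterCount field; DataProcessingPipeline's scatter_gather argument
    // ends up assigning nContigs to fields like this one
    val genotyper = new UnifiedGenotyper   // from the gatk extensions package
    genotyper.scatterCount = 100           // split this walker's run into 100 jobs
    add(genotyper)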

Comments (4)

I am having difficulty getting Queue to determine the order of the jobs added to the queue. Using the @Input and @Output definitions of input and output files, the dependencies are defined, and Queue should wait for one method's output to finish prior to starting the subsequent method.

Since the order in which methods are added to the queue does not determine the dependencies, my assumption is that Queue looks at the names of the variables added to the queue to determine which method's output is another method's input. I've tried working with the variable names in both added methods, along with those defined in the @Input and @Output annotations, but all of my trials come up short: Queue runs the jobs in a manner inconsistent with the @Input, @Output, and the variables defined and added as arguments to the methods added to the queue.

What is the secret to defining the order of jobs added to the queue? Are there additional rules for defining variables or the @Input/@Output annotations that I am missing?
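
For concreteness, here is a minimal sketch of the kind of chaining I am attempting (class names, commands, and file names are made up):

    import org.broadinstitute.sting.queue.QScript
    import org.broadinstitute.sting.queue.function.CommandLineFunction

    class OrderingExample extends QScript {
      class StepOne extends CommandLineFunction {
        @Input(doc="raw input")          var in: File = _
        @Output(doc="intermediate file") var out: File = _
        def commandLine = "step_one %s > %s".format(in, out)
      }

      class StepTwo extends CommandLineFunction {
        @Input(doc="intermediate file") var in: File = _
        @Output(doc="final result")     var out: File = _
        def commandLine = "step_two %s > %s".format(in, out)
      }

      def script() {
        val one = new StepOne
        one.in  = new File("raw.txt")
        one.out = new File("intermediate.txt")

        val two = new StepTwo
        two.in  = one.out                 // the shared File value is the link
        two.out = new File("final.txt")

        // my understanding is that the add() order should not matter:
        // two reads the File that one writes, so two should wait for one
        add(two, one)
      }
    }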

Any help is good help. Thanks.

Comments (3)

Is there any example of a Queue script calling variants with the HaplotypeCaller? Thanks! Francesco
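
In case it helps frame the question, this is roughly what I imagine such a qscript would look like (untested sketch; I am assuming the HaplotypeCaller extension follows the same field names as the other gatk extensions, and other required arguments are omitted):

    import org.broadinstitute.sting.queue.QScript
    import org.broadinstitute.sting.queue.extensions.gatk._

    // untested sketch of a bare-bones HaplotypeCaller qscript
    class HaplotypeCallerExample extends QScript {
      @Input(doc="reference fasta", shortName="R") var reference: File = _
      @Input(doc="input BAM file", shortName="I")  var bam: File = _

      def script() {
        val hc = new HaplotypeCaller
        hc.reference_sequence = reference
        hc.input_file :+= bam
        hc.out = new File("variants.vcf")  // assumed name of the output field
        hc.scatterCount = 4                // split into 4 jobs, gather one VCF
        add(hc)
      }
    }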

Comments (14)

Queue purports to offer users a seamless pipelined integration of BWA, Picard, and GATK. Yet Queue requires BAM files as input, implying I would need to run BWA separately to do the alignment, then Picard or samtools to convert to BAM, and then go back to Queue to work with the BAMs. At this point I run into compatibility issues, for instance as described here. So is there a way I can set Queue to take FASTQ files as input?
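
One workaround I can imagine (a sketch only, not an official feature; the class, commands, and read layout are made up, and it would live inside a qscript so the QScript File/@Input/@Output aliases apply) is to wrap the aligner in a CommandLineFunction so that FASTQ-to-BAM becomes the first node of the pipeline:

    import org.broadinstitute.sting.queue.function.CommandLineFunction

    // hypothetical wrapper: single-end bwa alignment piped straight to BAM;
    // adjust the commands to your aligner and read layout
    class AlignFastq extends CommandLineFunction {
      @Input(doc="reference fasta")  var reference: File = _
      @Input(doc="single-end fastq") var fastq: File = _
      @Output(doc="aligned BAM")     var bam: File = _
      def commandLine =
        "bwa aln %s %s | bwa samse %s - %s | samtools view -bS - > %s"
          .format(reference, fastq, reference, fastq, bam)
    }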

Comments (23)

I've been running Queue using DRMAA, and I've noticed one thing which I would like to bring up for discussion. The job names are generated using the following code at this point:

  // Set the display name to < 512 characters of the description
  // NOTE: Not sure if this is configuration specific?
  protected val jobNameLength = 500
  protected val jobNameFilter = """[^A-Za-z0-9_]"""
  protected def functionNativeSpec = function.jobNativeArgs.mkString(" ")

  def start() {
    session.synchronized {
      val drmaaJob: JobTemplate = session.createJobTemplate

      drmaaJob.setJobName(function.description.take(jobNameLength).replaceAll(jobNameFilter, "_"))
  [...]

For me this yields names looking something like this:

"__java_____Xmx3072m_____D"

This is not very useful for telling the jobs apart. I'm running my jobs via DRMAA on a system using the SLURM resource manager, so the cut-off in the name above can be attributed to SLURM truncating the name. Even so, I think there should be a more reasonable way to create the name - using function.jobName, for example.
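
Concretely, the change I have in mind is essentially a one-line patch (a sketch):

    // proposed sketch: name the job after function.jobName (which Queue
    // already uses for its log files) instead of the shell description
    drmaaJob.setJobName(function.jobName.take(jobNameLength).replaceAll(jobNameFilter, "_"))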

So, this leads me to my question - is there any particular reason that the job names are generated the way they are? And if not, do you (the GATK team) want a patch changing this to use function.jobName instead?

Furthermore, I would be interested in hearing from other users running GATK Queue over DRMAA, since I think it might be interesting to develop this further. As an example, I had to implement setting a hard wall time in the jobRunner, since the cluster I'm running on demands this. I'm sure there are more solutions like that out there, and I would be thrilled to hear about them.

Comments (15)

It says in the "ExampleUnifiedGenotyper" qscript that it runs an incomplete version of the UnifiedGenotyper. I have two questions about this.

  • I cannot find the org.broadinstitute.sting.queue.extensions.gatk.UnifiedGenotyper class; is it not in the public source, or am I blind?
  • Does the "incomplete" in the comment at the start of the script mean that this extension is incomplete in its functionality? Or is it something else that is incomplete in its nature?

Basically I'm wondering if there are any pitfalls for me if I start using the example script as a base for my own variant calling qscript.