The DPP script was only provided has an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have taken the difficult decision of removing the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how Qscripts work, and read the Best Practices documentation to understand what are the processing steps and what parameters you need to set/adjust.
The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK.
Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking and base score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis ready BAM files (one per sample sequenced).
This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.
Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible however, with some effort, to modify it for your needs.
| Argument (short-name) | Argument (long-name) | Description |
|---|---|---|
| -i <BAM file / BAM list> | --input <BAM file / BAM list> | input BAM file - or list of BAM files. |
| -R <fasta> | --reference <fasta> | Reference fasta file. |
| -D <vcf> | --dbsnp <dbsnp vcf> | dbsnp ROD to use (must be in VCF format). |
| Argument (short-name) | Argument (long-name) | Description |
|---|---|---|
| -indels <vcf> | --extra_indels <vcf> | VCF files to use as reference indels for Indel Realignment. |
| -bwa <path> | --path_to_bwa <path> | The path to the binary of bwa (usually BAM files have already been mapped - but if you want to remap this is the option) |
| -outputDir <path> | --output_directory <path> | Output path for the processed BAM files. |
| -L <GATK interval string> | --gatk_interval_string <GATK interval string> | the -L interval string to be used by GATK - output bams at interval only |
| -intervals <GATK interval file> | --gatk_interval_file <GATK interval file> | an intervals file to be used by GATK - output bams at intervals |
| Argument (short-name) | Argument (long-name) | Description |
|---|---|---|
| -p <name> | --project <name> | the project name determines the final output (BAM file) base name. Example NA12878 yields NA12878.processed.bam |
| -knowns | --knowns_only | Perform cleaning on knowns only. |
| -sw | --use_smith_waterman | Perform cleaning using Smith Waterman |
| -bwase | --use_bwa_single_ended | Decompose input BAM file and fully realign it using BWA and assume Single Ended reads |
| -bwape | --use_bwa_pair_ended | Decompose input BAM file and fully realign it using BWA and assume Pair Ended reads |
Data processing pipeline of the best practices for raw data processing, from sequencer data (fastq files) to analysis read reads (bam file):

Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:
This option is for datasets that have already been processed using a different pipeline or different criteria, and you want to reprocess it using this pipeline. One example is a BAM file that has been processed at the lane level, or did not perform some of the best practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before creating sample level bams and entering the pipeline.
This is the where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration.
This is a two step process. First we create targets using the Realigner Target Creator (either for knowns only, or including data indels), then we realign the targets using the Indel Realigner (see [Local realignment around indels]) with an optional smith waterman realignment. The Indel Realigner also fixes mate pair information for reads that get realigned.
This is a crucial step that re-adjusts the quality score using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate, Dinucleotide Covariate
The Data Processing Pipeline produces 3 types of output for each file: a fully processed bam file, a validation report on the input bam and output bam files, a analysis before and after base quality score recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output.
The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.
We validate each unprocessed sample level BAM file and each final processed sample level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't enforce that the input BAM has to pass all validation, but we provide the log of the validation as an informative companion to your input. The validation file is named : <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.
Notice that even if your BAM file fails validation, the pipeline can still go through successfully. The validation is a strict report on how your BAM file is looking. Some errors are not critical, but the output files (both pre.validation and post.validation) should give you some input on how to make your dataset better organized in the BAM format.
PDF plots of the base qualities are generated before and after recalibration for further analysis on the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named : <project name>.<sample name>.pre and <project name>.<sample name>.post.
Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa)
java \ -Xmx4g \ -Djava.io.tmpdir=/path/to/tmpdir \ -jar path/to/GATK/Queue.jar \ -S path/to/DataProcessingPipeline.scala \ -p myFancyProjectName \ -i myDataSet.list \ -R reference.fasta \ -D dbSNP.vcf \ -run
Performing realignment and the full data processing pipeline in one pair-ended bam file
java \ -Xmx4g \ -Djava.io.tmpdir=/path/to/tmpdir \ -jar path/to/Queue.jar \ -S path/to/DataProcessingPipeline.scala \ -bwa path/to/bwa \ -i test.bam \ -R reference.fasta \ -D dbSNP.vcf \ -p myProjectWithRealignment \ -bwape \ -run
Yes.
For more information, see the ExampleUnifiedGenotyper.scala or examples of using Scala's traits/mixins illustrated in the QScripts documentation.
In your QScript, define a var list and annotate it with @Argument. Initialize the value to Nil.
@Argument(doc="filter names", shortName="filter")
var filterNames: List[String] = Nil
On the command line specify the arguments by repeating the argument name.
-filter filter1 -filter filter2 -filter filter3
Then once your QScript is run, the command line arguments will be available for use in the QScript's script method.
def script {
var myCommand = new MyFunction
myCommand.filters = this.filterNames
}
For a full example of command line arguments see the QScripts documentation.
Wrap the utility with an InProcessFunction. If your functionality is reusable code you should add it to Sting Utils with Unit Tests and then invoke your new function from your InProcessFunction. Computationally or memory intensive functions should NOT be implemented as InProcessFunctions, and should be wrapped in Queue CommandLineFunctions instead.
class MySplitter extends InProcessFunction {
@Input(doc="inputs")
var in: File = _
@Output(doc="outputs")
var out: List[File] = Nil
def run {
StingUtilityMethod.quickSplitFile(in, out)
}
}
var splitter = new MySplitter
splitter.in = new File("input.txt")
splitter.out = List(new File("out1.txt"), new File("out2.txt"))
add(splitter)
See Queue CommandLineFunctions for more information on how @Input and @Output are used.
Create an instance of a ListWriterFunction and add it in your script method.
import org.broadinstitute.sting.queue.function.ListWriterFunction
</pre>
<pre>
val writeBamList = new ListWriterFunction
writeBamList.inputFiles = bamFiles
writeBamList.listFile = new File("myBams.list")
add(writeBamList)
Queue contains a trait mixin you can use to add Log4J support to your classes.
Add the import for the trait Logging to your QScript.
import org.broadinstitute.sting.queue.util.Logging
Mixin the trait to your class.
class MyScript extends Logging {
...
Then use the mixed in logger to write debug output when the user specifies -l DEBUG.
logger.debug("This will only be displayed when debugging is enabled.")
Try ant clean.
Queue relies on a lot of Scala traits / mixins. These dependencies are not always picked up by the scala/java compilers leading to partially implemented classes. If that doesn't work please let us know in the forum.
No. QScript will create all parent directories for outputs.
Queue's LSF dispatcher automatically looks up and sets the maximum runtime for whichever LSF queue is specified. If you set your -jobQueue/.jobQueue to hour then you should see something like this under bjobs -l:
RUNLIMIT
240.0 min of gsa3
Queue GridEngine functionality is community supported. See here for full details: Queue with Grid Engine.
The easiest way to do this at the moment is to mixin a trait.
First define a trait which adds your java options:
trait RemoteDebugging extends JavaCommandLineFunction {
override def javaOpts = super.javaOpts + " -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
}
Then mix in the trait to your walker and otherwise run it as normal:
val printReadsDebug = new PrintReads with RemoteDebugging
printReadsDebug.reference_sequence = "my.fasta"
// continue setting up your walker...
add(printReadsDebug)
If you see something like the following, it means that Queue believes that it previously successfully generated all of the outputs.
INFO 16:25:55,049 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO 16:25:55,140 QCommandLine - Added 4 functions
INFO 16:25:55,140 QGraph - Generating graph.
INFO 16:25:55,164 QGraph - Generating scatter gather jobs.
INFO 16:25:55,714 QGraph - Removing original jobs.
INFO 16:25:55,716 QGraph - Adding scatter gather jobs.
INFO 16:25:55,779 QGraph - Regenerating graph.
INFO 16:25:55,790 QGraph - Running jobs.
INFO 16:25:55,853 QGraph - 0 Pend, 0 Run, 0 Fail, 10 Done
INFO 16:25:55,902 QCommandLine - Done
Queue will not re-run the job if a .done file is found for the all the outputs, e.g.: /path/to/.output.file.done. You can either remove the specific .done files yourself, or use the -startFromScratch command line option.
GATK-Queue is command-line scripting framework for defining multi-stage genomic analysis pipelines combined with an execution manager that runs those pipelines from end-to-end. Often processing genome data includes several steps to produces outputs, for example our BAM to VCF calling pipeline include among other things:
Running these tools one by one in series may often take weeks for processing, or would require custom scripting to try and optimize using parallel resources.
With a Queue script users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.
You have two options: donwload the binary distribution (prepackaged, ready to run program) or build it from source.
This is obviously the easiest way to go. Links are on the Downloads page.
Briefly, here's what you need to know/do:
Queue is part of the Sting repository. Download the source from our repository on Github. Run the following command:
git clone git://github.com/broadgsa/gatk.git Sting
Use ant to build the source.
cd Sting
ant queue
Queue uses the Ivy dependency manager to fetch all other dependencies. Just make sure you have suitable versions of the JDK and Ant!
See this article on how to test your installation of Queue.
See this article on running Queue for the first time for full details.
Queue arguments can be listed by running with --help
java -jar dist/Queue.jar --help
To list the arguments required by a QScript, add the script with -S and run with --help.
java -jar dist/Queue.jar -S script.scala --help
Note that by default queue runs in a "dry" mode, as explained in the link above. After verifying the generated commands execute the pipeline by adding -run.
See QFunction and Command Line Options for more info on adjusting Queue options.
Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.
Every QScript includes the following steps:
add() to Queue for dispatch and monitoringThe basic command-line to run the Queue pipelines on the command line is
java -jar Queue.jar -S <script>.scala
See the main article Queue QScripts for more info on QScripts.
While most QScripts are analysis pipelines that are custom-built for specific projects, some have been released as supported tools. See
The latest version of the example files are available in the Sting github repository under public/scala/qscript/examples
Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.
Note that Queue attempts to generate a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally the script requires the gsalib to be installed on the machine, which is typically done by providing its path in your .Rprofile file:
bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile .libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")
The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the completed commands. This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves
This feature only works for command line and LSF execution models. SGE should be easy to add for a motivated individual but we cannot test this capabilities here at the Broad. Please send us a patch if you do extend Queue to support SGE.
Queue emits a queue.dot file to help visualize your commands. You can open this file in programs like DOT, OmniGraffle, etc to view your pipelines. By default the system will print out your LSF command lines, but this can be too much in a complex pipeline.
To clarify your pipeline, override the dotString() function:
class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
@Input(doc="foo") var bam = bamIn
@Input(doc="foo") var bamIndex = bai(bamIn)
@Output(doc="foo") var recalData = recalDataIn
memoryLimit = Some(4)
override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
def commandLine = gatkCommandLine("CountCovariates") + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}
Here we only see CountCovariates my.bam [-OQ], for example, in the dot file. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here:
Processing data originated in the Pacific Biosciences RS platform has been evaluated by the GSA and publicly presented in numerous occasions. The guidelines we describe in this document were the result of a systematic technology development experiment on some datasets (human, E. coli and Rhodobacter) from the Broad Institute. These guidelines produced better results than the ones obtained using alternative pipelines up to this date (september 2011) for the datasets tested, but there is no guarantee that it will be the best for every dataset and that other pipelines won't supersede it in the future.
The pipeline we propose here is illustrated in a Q script (PacbioProcessingPipeline.scala) distributed with the GATK as an example for educational purposes. This pipeline has not been extensively tested and is not supported by the GATK team. You are free to use it and modify it for your needs following the guidelines below.

First we take the filtered_subreads.fq file output by the Pacific Biosciences RS SMRT pipeline and align it using BWA. We use BWA with the bwasw algorithm and allow for relaxing the gap open penalty to account for the excess of insertions and deletions known to be typical error modes of the data. For an idea on what parameters to use check suggestions given by the BWA author in the BWA manual page that are specific to Pacbio. The goal is to account for Pacific Biosciences RS known error mode and benefit from the long reads for a high scoring overall match. (for older versions, you can use the filtered_subreads.fasta and combine the base quality scores extracted from the h5 files using Pacific Biosciences SMRT pipeline python tools)
To produce a BAM file that is sorted by coordinate with adequate read group information we use Picard tools: SortSam and AddOrReplaceReadGroups. These steps are necessary because all subsequent tools require that the BAM file follow these rules. It is also generally considered good practices to have your BAM file conform to these specifications.
Once we have a proper BAM file, it is important to estimate the empirical quality scores using statistics based on a known callset (e.g. latest dbSNP) and the following covariates: QualityScore, Dinucleotide and ReadGroup. You can follow the GATK's Best Practices for Variant Detection according the type of data you have, with the exception of indel realignment, because the tool has not been adapted for Pacific Biosciences RS data.
You will have to adjust your calling thresholds in the Unified Genotyper to allow sites with a higher indel rate to be analyzed.
Be aware that the Unified Genotyper has cutoffs for base quality score and if your data is on average Q20 (a common occurrence with Pacific Biosciences RS data) you may need to adjust your quality thresholds to allow the GATK to analyze your data. There is no right answer here, you have to choose parameters consistent with your average base quality scores, evaluate the calls made with the selected threshold and modify as necessary.
To account for the high insertion and deletion error rate of the Pacific Biosciences data instrument, we often have to set the gap open penalty to be lower than the base mismatch penalty in order to maximize alignment performance. Despite aligning most of the reads successfully, this creates the side effect that the aligner will sometimes prefer to "hide" a true SNP inside an insertion. The result is accurate mapping, albeit with a reference-biased alignment. It is important to note however, that reference bias is an artifact of the alignment process, not the data, and can be greatly reduced by locally realigning the reads based on the reference and the data. Presently, the available software for local realignment is not compatible with the length and the high indel rate of Pacific Bioscience data, but we expect new tools to handle this problem in the future. Ultimately reference bias will mask real calls and you will have to inspect these by hand.
script method, a QScript will add one or more CommandLineFunctions.@Input and @Output.@Input of CommandLineFunction A depends on the @Output of ComandLineFunction B, A will wait for B to finish before it starts running.Each CommandLineFunction must define the actual command line to run as follows.
class MyCommandLine extends CommandLineFunction {
def commandLine = "myScript.sh hello world"
}
If you're writing a one-off CommandLineFunction that is not destined for use by other QScripts, it's often easiest to construct the command line directly rather than through the API methods provided in the CommandLineFunction class.
For example:
def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)
If you're writing a CommandLineFunction that will become part of Queue and/or
will be used by other QScripts, however, our best practice recommendation is
to construct your command line only using the methods provided in the
CommandLineFunction class: required(), optional(), conditional(), and repeat()
The reason for this is that these methods automatically escape the values you
give them so that they'll be interpreted literally within the shell scripts
Queue generates to run your command, and they also manage whitespace separation of command-line tokens for you. This prevents (for example) a value like MQ > 10 from being interpreted as an output redirection by the shell, and avoids issues with values containing embedded spaces. The methods also give you the ability to turn escaping and/or whitespace separation off as needed. An example:
override def commandLine = super.commandLine +
required("eff") +
conditional(verbose, "-v") +
optional("-c", config) +
required("-i", "vcf") +
required("-o", "vcf") +
required(genomeVersion) +
required(inVcf) +
required(">", escape=false) + // This will be shell-interpreted as an output redirection
required(outVcf)
The CommandLineFunctions built into Queue, including the CommandLineFunctions
automatically generated for GATK Walkers, are all written using this pattern.
This means that when you configure a GATK Walker or one of the other built-in
CommandLineFunctions in a QScript, you can rely on all of your values being
safely escaped and taken literally when the commands are run, including values
containing characters that would normally be interpreted by the shell such as
MQ > 10.
Below is a brief overview of the API methods available to you in the CommandLineFunction class for safely constructing command lines:
required() Used for command-line arguments that are always present, e.g.:
required("-f", "filename") returns: " '-f' 'filename' "
required("-f", "filename", escape=false) returns: " -f filename "
required("java") returns: " 'java' "
required("INPUT=", "myBam.bam", spaceSeparated=false) returns: " 'INPUT=myBam.bam' "
optional() Used for command-line arguments that may or may not be present, e.g.:
optional("-f", myVar) behaves like required() if myVar has a value, but returns ""
if myVar is null/Nil/None
conditional() Used for command-line arguments that should only be included if some condition is true, e.g.:
conditional(verbose, "-v") returns " '-v' " if verbose is true, otherwise returns ""
repeat() Used for command-line arguments that are repeated multiple times on the command line, e.g.:
repeat("-f", List("file1", "file2", "file3")) returns: " '-f' 'file1' '-f' 'file2' '-f' 'file3' "
CommandLineFunction arguments use a similar syntax to arguments.
CommandLineFunction variables are annotated with @Input, @Output, or @Argument annotations.
So that Queue can track the input and output files of a command, CommandLineFunction @Input and @Output must be java.io.File objects.
class MyCommandLine extends CommandLineFunction {
@Input(doc="input file")
var inputFile: File = _
def commandLine = "myScript.sh -fileParam " + inputFile
}
CommandLineFunction variables can also provide indirect access to java.io.File inputs and outputs via the FileProvider trait.
class MyCommandLine extends CommandLineFunction {
@Input(doc="named input file")
var inputFile: ExampleFileProvider = _
def commandLine = "myScript.sh " + inputFile
}
// An example FileProvider that stores a 'name' with a 'file'.
class ExampleFileProvider(var name: String, var file: File) extends org.broadinstitute.sting.queue.function.FileProvider {
override def toString = " -fileName " + name + " -fileParam " + file
}
Optional files can be specified via required=false, and can use the CommandLineFunction.optional() utility method, as described above:
class MyCommandLine extends CommandLineFunction {
@Input(doc="input file", required=false)
var inputFile: File = _
// -fileParam will only be added if the QScript sets inputFile on this instance of MyCommandLine
def commandLine = required("myScript.sh") + optional("-fileParam", inputFile)
}
A List or Set of files can use the CommandLineFunction.repeat() utility method, as described above:
class MyCommandLine extends CommandLineFunction {
@Input(doc="input file")
var inputFile: List[File] = Nil // NOTE: Do not set List or Set variables to null!
// -fileParam will added as many times as the QScript adds the inputFile on this instance of MyCommandLine
def commandLine = required("myScript.sh") + repeat("-fileParam", inputFile)
}
A command line function can define other required arguments via @Argument.
class MyCommandLine extends CommandLineFunction {
@Argument(doc="message to display")
var veryImportantMessage: String = _
// If the QScript does not specify the required veryImportantMessage, the pipeline will not run.
def commandLine = required("myScript.sh") + required(veryImportantMessage)
}
class SamToolsIndex extends CommandLineFunction {
@Input(doc="bam to index") var bamFile: File = _
@Output(doc="bam index") var baiFile: File = _
def commandLine = "samtools index %s %s".format(bamFile, baiFile)
)
Or, using the CommandLineFunction API methods to construct the command line with automatic shell escaping:
class SamToolsIndex extends CommandLineFunction {
@Input(doc="bam to index") var bamFile: File = _
@Output(doc="bam index") var baiFile: File = _
def commandLine = required("samtools") + required("index") + required(bamFile) + required(baiFile)
)
Queue pipelines are Scala 2.8 files with a bit of syntactic sugar, called QScripts. Check out the following as references.
QScripts are easiest to develop using an Integrated Development Environment. See Queue with IntelliJ IDEA for our recommended settings.
The following is a basic outline of a QScript:
import org.broadinstitute.sting.queue.QScript
// List other imports here
// Define the overall QScript here.
class MyScript extends QScript {
// List script arguments here.
@Input(doc="My QScript inputs")
var scriptInput: File = _
// Create and add the functions in the script here.
def script = {
var myCL = new MyCommandLine
myCL.myInput = scriptInput // Example variable input
myCL.myOutput = new File("/path/to/output") // Example hardcoded output
add(myCL)
}
}
Imports can be any scala or java imports in scala syntax.
import java.io.File
import scala.util.Random
import org.favorite.my._
// etc.
To add a CommandLineFunction to a pipeline, a class must be defined that extends QScript.
The QScript must define a method script.
The QScript can define helper methods or variables.
The body of script should create and add Queue CommandlineFunctions.
class MyScript extends org.broadinstitute.sting.queue.QScript {
def script = add(new CommandLineFunction { def commandLine = "echo hello world" })
}
A QScript canbe set to read command line arguments by defining variables with @Input, @Output, or @Argument annotations.
A command line argument can be a primitive scalar, enum, File, or scala immutable Array, List, Set, or Option of a primitive, enum, or File.
QScript command line arguments can be marked as optional by setting required=false.
class MyScript extends org.broadinstitute.sting.queue.QScript { @Input(doc="example message to echo") var message: String = _ def script = add(new CommandLineFunction { def commandLine = "echo " + message }) }
See Pipelining the GATK using Queue for more information on the automatically generated Queue wrappers for GATK walkers.
After functions are defined they should be added to the QScript pipeline using add().
for (vcf <- vcfs) {
val ve = new VariantEval
ve.vcfFile = vcf
ve.evalFile = swapExt(vcf, "vcf", "eval")
add(ve)
}
Queue tracks dependencies between functions via variables annotated with @Input and @Output.
Queue will run functions based on the dependencies between them, not based on the order in which they are added in the script! So if the @Input of CommandLineFunction A depends on the @Output of ComandLineFunction B, A will wait for B to finish before it starts running.
See the main article Queue CommandLineFunctions for more information.
The latest version of the example files are available in the Sting git repository under public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/.
To print the list of arguments required by an existing QScript run with -help.
CommandLineFunction variables set correctly, run without -run.-run.The following is a "hello world" example that runs a single command line to echo hello world.
import org.broadinstitute.sting.queue.QScript
class HelloWorld extends QScript {
def script = {
add(new CommandLineFunction {
def commandLine = "echo hello world"
})
}
}
The above file is checked into the Sting git repository under HelloWorld.scala. After building Queue from source, the QScript can be run with the following command:
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run
It should produce output similar to:
INFO 16:23:27,825 QScriptManager - Compiling 1 QScript
INFO 16:23:31,289 QScriptManager - Compilation complete
INFO 16:23:34,631 HelpFormatter - ---------------------------------------------------------
INFO 16:23:34,631 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO 16:23:34,632 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run
INFO 16:23:34,632 HelpFormatter - Date/Time: 2011/01/14 16:23:34
INFO 16:23:34,632 HelpFormatter - ---------------------------------------------------------
INFO 16:23:34,632 HelpFormatter - ---------------------------------------------------------
INFO 16:23:34,634 QCommandLine - Scripting HelloWorld
INFO 16:23:34,651 QCommandLine - Added 1 functions
INFO 16:23:34,651 QGraph - Generating graph.
INFO 16:23:34,660 QGraph - Running jobs.
INFO 16:23:34,689 ShellJobRunner - Starting: echo hello world
INFO 16:23:34,689 ShellJobRunner - Output written to /Users/kshakir/src/Sting/Q-43031@bmef8-d8e-1.out
INFO 16:23:34,771 ShellJobRunner - Done: echo hello world
INFO 16:23:34,773 QGraph - Deleting intermediate files.
INFO 16:23:34,773 QCommandLine - Done
This example uses automatically generated Queue compatible wrappers for the GATK. See Pipelining the GATK using Queue for more info on authoring Queue support into walkers and using walkers in Queue.
The ExampleUnifiedGenotyper.scala for running the UnifiedGenotyper followed by VariantFiltration can be found in the examples folder.
To list the command line parameters, including the required parameters, run with -help.
java -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -help
The help output should appear similar to this:
INFO 10:26:08,491 QScriptManager - Compiling 1 QScript
INFO 10:26:11,926 QScriptManager - Compilation complete
---------------------------------------------------------
Program Name: org.broadinstitute.sting.queue.QCommandLine
---------------------------------------------------------
---------------------------------------------------------
usage: java -jar Queue.jar -S <script> [-run] [-jobRunner <job_runner>] [-bsub] [-status] [-retry <retry_failed>]
[-startFromScratch] [-keepIntermediates] [-statusTo <status_email_to>] [-statusFrom <status_email_from>] [-dot
<dot_graph>] [-expandedDot <expanded_dot_graph>] [-jobPrefix <job_name_prefix>] [-jobProject <job_project>] [-jobQueue
<job_queue>] [-jobPriority <job_priority>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
<temp_directory>] [-jobSGDir <job_scatter_gather_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>]
[-emailTLS] [-emailSSL] [-emailUser <emailUsername>] [-emailPassFile <emailPasswordFile>] [-emailPass <emailPassword>]
[-l <logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h] -R <referencefile> -I <bamfile> [-L <intervals>]
[-filter <filternames>] [-filterExpression <filterexpressions>]
-S,--script <script> QScript scala file
-run,--run_scripts Run QScripts. Without this flag set only
performs a dry run.
-jobRunner,--job_runner <job_runner> Use the specified job runner to dispatch
command line jobs
-bsub,--bsub Equivalent to -jobRunner Lsf706
-status,--status Get status of jobs for the qscript
-retry,--retry_failed <retry_failed> Retry the specified number of times after a
command fails. Defaults to no retries.
-startFromScratch,--start_from_scratch Runs all command line functions even if the
outputs were previously output successfully.
-keepIntermediates,--keep_intermediate_outputs After a successful run keep the outputs of
any Function marked as intermediate.
-statusTo,--status_email_to <status_email_to> Email address to send emails to upon
completion or on error.
-statusFrom,--status_email_from <status_email_from> Email address to send emails from upon
completion or on error.
-dot,--dot_graph <dot_graph> Outputs the queue graph to a .dot file. See:
http://en.wikipedia.org/wiki/DOT_language
-expandedDot,--expanded_dot_graph <expanded_dot_graph> Outputs the queue graph of scatter gather to
a .dot file. Otherwise overwrites the
dot_graph
-jobPrefix,--job_name_prefix <job_name_prefix> Default name prefix for compute farm jobs.
-jobProject,--job_project <job_project> Default project for compute farm jobs.
-jobQueue,--job_queue <job_queue> Default queue for compute farm jobs.
-jobPriority,--job_priority <job_priority> Default priority for jobs.
-memLimit,--default_memory_limit <default_memory_limit> Default memory limit for jobs, in gigabytes.
-runDir,--run_directory <run_directory> Root directory to run functions from.
-tempDir,--temp_directory <temp_directory> Temp directory to pass to functions.
-jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory> Default directory to place scatter gather
output for compute farm jobs.
-emailHost,--emailSmtpHost <emailSmtpHost> Email SMTP host. Defaults to localhost.
-emailPort,--emailSmtpPort <emailSmtpPort> Email SMTP port. Defaults to 465 for ssl,
otherwise 25.
-emailTLS,--emailUseTLS Email should use TLS. Defaults to false.
-emailSSL,--emailUseSSL Email should use SSL. Defaults to false.
-emailUser,--emailUsername <emailUsername> Email SMTP username. Defaults to none.
-emailPassFile,--emailPasswordFile <emailPasswordFile> Email SMTP password file. Defaults to none.
-emailPass,--emailPassword <emailPassword> Email SMTP password. Defaults to none. Not
secure! See emailPassFile.
-l,--logging_level <logging_level> Set the minimum level of logging, i.e.
setting INFO get's you INFO up to FATAL,
setting ERROR gets you ERROR and FATAL level
logging.
-log,--log_to_file <log_to_file> Set the logging location
-quiet,--quiet_output_mode Set the logging to quiet mode, no output to
stdout
-debug,--debug_mode Set the logging file string to include a lot
of debugging information (SLOW!)
-h,--help Generate this help message
Arguments for ExampleUnifiedGenotyper:
-R,--referencefile <referencefile> The reference file for the bam files.
-I,--bamfile <bamfile> Bam file to genotype.
-L,--intervals <intervals> An optional file with a list of intervals to proccess.
-filter,--filternames <filternames> A optional list of filter names.
-filterExpression,--filterexpressions <filterexpressions> An optional list of filter expressions.
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.commandline.MissingArgumentException:
Argument with name '--bamfile' (-I) is missing.
Argument with name '--referencefile' (-R) is missing.
at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:192)
at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:172)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:199)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:57)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 1.0.5504):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Argument with name '--bamfile' (-I) is missing.
##### ERROR Argument with name '--referencefile' (-R) is missing.
##### ERROR ------------------------------------------------------------------------------------------
To dry run the pipeline:
java \
-Djava.io.tmpdir=tmp \
-jar dist/Queue.jar \
-S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala \
-R human_b36_both.fasta \
-I pilot2_daughters.chr20.10k-11k.bam \
-L chr20.interval_list \
-filter StrandBias -filterExpression "SB>=0.10" \
-filter AlleleBalance -filterExpression "AB>=0.75" \
-filter QualByDepth -filterExpression "QD<5" \
-filter HomopolymerRun -filterExpression "HRun>=4"
The dry run output should appear similar to this:
INFO 10:45:00,354 QScriptManager - Compiling 1 QScript
INFO 10:45:04,855 QScriptManager - Compilation complete
INFO 10:45:05,058 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,059 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO 10:45:05,059 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -R human_b36_both.fasta -I pilot2_daughters.chr20.10k-11k.bam -L chr20.interval_list -filter StrandBias -filterExpression SB>=0.10 -filter AlleleBalance -filterExpression AB>=0.75 -filter QualByDepth -filterExpression QD<5 -filter HomopolymerRun -filterExpression HRun>=4
INFO 10:45:05,059 HelpFormatter - Date/Time: 2011/03/24 10:45:05
INFO 10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,061 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO 10:45:05,150 QCommandLine - Added 4 functions
INFO 10:45:05,150 QGraph - Generating graph.
INFO 10:45:05,169 QGraph - Generating scatter gather jobs.
INFO 10:45:05,182 QGraph - Removing original jobs.
INFO 10:45:05,183 QGraph - Adding scatter gather jobs.
INFO 10:45:05,231 QGraph - Regenerating graph.
INFO 10:45:05,247 QGraph - -------
INFO 10:45:05,252 QGraph - Pending: IntervalScatterFunction /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals
INFO 10:45:05,253 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/scatter/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,254 QGraph - -------
INFO 10:45:05,279 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,279 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,279 QGraph - -------
INFO 10:45:05,283 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,283 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,283 QGraph - -------
INFO 10:45:05,287 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,287 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,288 QGraph - -------
INFO 10:45:05,288 QGraph - Pending: SimpleTextGatherFunction /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,288 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-jobOutputFile/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,289 QGraph - -------
INFO 10:45:05,291 QGraph - Pending: java -Xmx1g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T CombineVariants -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:input0,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input1,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input2,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -priority input0,input1,input2 -assumeIdenticalSamples
INFO 10:45:05,291 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-out/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,292 QGraph - -------
INFO 10:45:05,296 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.eval
INFO 10:45:05,296 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-2.out
INFO 10:45:05,296 QGraph - -------
INFO 10:45:05,299 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantFiltration -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:vcf,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -filter SB>=0.10 -filter AB>=0.75 -filter QD<5 -filter HRun>=4 -filterName StrandBias -filterName AlleleBalance -filterName QualByDepth -filterName HomopolymerRun
INFO 10:45:05,299 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-3.out
INFO 10:45:05,302 QGraph - -------
INFO 10:45:05,303 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.eval
INFO 10:45:05,303 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-4.out
INFO 10:45:05,304 QGraph - Dry run completed successfully!
INFO 10:45:05,304 QGraph - Re-run with "-run" to execute the functions.
INFO 10:45:05,304 QCommandLine - Done
QScript files often create multiple CommandLineFunctions with similar arguments. Use various scala tricks such as inner classes, traits / mixins, etc. to reuse variables.
A self type can be useful to distinguish between this. We use qscript as an alias for the QScript's this to distinguish from the this inside of inner classes or traits.
A trait mixin can be used to reuse functionality. The trait below is designed to copy values from the QScript and then is mixed into different instances of the functions.
See the following example:
class MyScript extends org.broadinstitute.sting.queue.QScript {
// Create an alias 'qscript' for 'MyScript.this'
qscript =>
// This is a script argument
@Argument(doc="message to display")
var message: String = _
// This is a script argument
@Argument(doc="number of times to display")
var count: Int = _
trait ReusableArguments extends MyCommandLineFunction {
// Whenever a function is created 'with' this trait, it will copy the message.
this.commandLineMessage = qscript.message
}
abstract class MyCommandLineFunction extends CommandLineFunction {
// This is a per command line argument
@Argument(doc="message to display")
var commandLineMessage: String = _
}
class MyEchoFunction extends MyCommandLineFunction {
def commandLine = "echo " + commandLineMessage
}
class MyAlsoEchoFunction extends MyCommandLineFunction {
def commandLine = "echo also " + commandLineMessage
}
def script = {
for (i <- 1 to count) {
val echo = new MyEchoFunction with ReusableArguments
val alsoEcho = new MyAlsoEchoFunction with ReusableArguments
add(echo, alsoEcho)
}
}
}
I'm building a variant calling qscript (it's available here), heavily based on the the MethodsDevelopmentCallingPipeline.scala. I cannot however run into trouble when setting the "this.scatterCount" of the GenotyperBase to more than 1 - in which case I get a NullPointerException (I include the full error message below).
I use the following command line:
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/NewVariantCalling.scala -i NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam -R /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta -res /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/ **-sg 2** -nt 8 -run -l DEBUG -startFromScratch
As you can see, I'm using the files from the gatk bundle, and I guess these should be alright for this purpose? Just to be clear I use the "-res" parameter to point to the directory where all the resource files are located, dbsnp, hapmap, etc. and the -sg parameter is what controls the scatter/gather count.
I've tried to search in the code for what might be causing this, and I can conclude that the org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc is called with str (its parameter) being an empty string, which is what causes contig to be null, which in turn creates the NullPointerException on line 408 when this line is executed:
stop = getContigInfo(contig).getSequenceLength();
This, I guess, is the obvious stuff, but this far I haven't been able to figure this out any further that this. I'm not sure if this is caused by a bug in my script, or by a bug in the GATK. Right now I'm thinking the latter of the two, since I have used the scatter/gather function in other scripts without any trouble.
Any ideas of where to continue from here, or confirmation that this is indeed something related to the GATK code would be much appreciated.
Cheers, Johan
ERROR 16:22:50,781 FunctionEdge - Error: LocusScatterFunction: List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/human_g1k_v37.fasta, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bai, /bubo/nobackup/uppnex/reference/biodata/GATK/ftp.broadinstitute.org/bundle/2.2/b37/dbsnp_137.b37.vcf.idx, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam) > List(/bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_1_of_2/scatter.intervals, /bubo/proj/a2009002/SnpSeqPipeline/SnpSeqPipeline/gatk/.queue/scatterGather/.qlog/project.snpcall-sg/temp_2_of_2/scatter.intervals)
java.lang.NullPointerException
at org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:408)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:84)
at org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:97)
at org.broadinstitute.sting.utils.interval.IntervalUtils.loadIntervals(IntervalUtils.java:538)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindingsPair(IntervalUtils.java:520)
at org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalBindings(IntervalUtils.java:499)
at org.broadinstitute.sting.queue.extensions.gatk.GATKIntervals.locs(GATKIntervals.scala:60)
at org.broadinstitute.sting.queue.extensions.gatk.LocusScatterFunction.run(LocusScatterFunction.scala:39)
at org.broadinstitute.sting.queue.engine.InProcessRunner.start(InProcessRunner.scala:28)
at org.broadinstitute.sting.queue.engine.FunctionEdge.start(FunctionEdge.scala:83)
at org.broadinstitute.sting.queue.engine.QGraph.runJobs(QGraph.scala:432)
at org.broadinstitute.sting.queue.engine.QGraph.run(QGraph.scala:154)
at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:145)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
In addition to testing walkers individually, you may want to also run integration tests for your QScript pipelines.
ant target pipelinetest and run under pipelinetestrun.PipelineTest to run under the ant target.PipelineTestSpec and then run it via PipelineTest.exec().When building up a pipeline test spec specify the following variables for your test.
| Variable | Type | Description |
|---|---|---|
args |
String |
The arguments to pass to the Queue test, ex: -S scala/qscript/examples/HelloWorld.scala |
jobQueue |
String |
Job Queue to run the test. Default is null which means use hour. |
fileMD5s |
Map[Path, MD5] |
Expected MD5 results for each file path. |
expectedException |
classOf[Exception] |
Expected exception from the test. |
The following example runs the ExampleCountLoci QScript on a small bam and verifies that the MD5 result is as expected.
It is checked into the Sting repository under scala/test/org/broadinstitute/sting/queue/pipeline/examples/ExampleCountLociPipelineTest.scala
package org.broadinstitute.sting.queue.pipeline.examples
import org.testng.annotations.Test
import org.broadinstitute.sting.queue.pipeline.{PipelineTest, PipelineTestSpec}
import org.broadinstitute.sting.BaseTest
class ExampleCountLociPipelineTest {
@Test
def testCountLoci {
val testOut = "count.out"
val spec = new PipelineTestSpec
spec.name = "countloci"
spec.args = Array(
" -S scala/qscript/examples/ExampleCountLoci.scala",
" -R " + BaseTest.hg18Reference,
" -I " + BaseTest.validationDataLocation + "small_bam_for_countloci.bam",
" -o " + testOut).mkString
spec.fileMD5s += testOut -> "67823e4722495eb10a5e4c42c267b3a6"
PipelineTest.executeTest(spec)
}
}
To test if the script is at least compiling with your arguments run ant pipelinetest specifying the name of your class to -Dsingle:
ant pipelinetest -Dsingle=ExampleCountLociPipelineTest
Sample output:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour
[testng] => countloci PASSED DRY RUN
[testng] PASSED: testCountLoci
As of July 2011 the pipeline tests run against LSF 7.0.6 and Grid Engine 6.2u5. To include these two packages in your environment use the hidden dotkit .combined_LSF_SGE.
reuse .combined_LSF_SGE
Once you are satisfied that the dry run has completed without error, to actually run the pipeline test run ant pipelinetestrun.
ant pipelinetestrun -Dsingle=ExampleCountLociPipelineTest
Sample output:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] Checking MD5 for pipelinetests/countloci/run/count.out [calculated=67823e4722495eb10a5e4c42c267b3a6, expected=67823e4722495eb10a5e4c42c267b3a6]
[testng] => countloci PASSED
[testng] PASSED: testCountLoci
If you don't know the MD5s yet you can run the command yourself on the command line and then MD5s the outputs yourself, or you can set the MD5s in your test to "" and run the pipeline.
When the MD5s are blank as in:
spec.fileMD5s += testOut -> ""
You run:
ant pipelinetest -Dsingle=ExampleCountLociPipelineTest -Dpipeline.run=run
And the output will look like:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is , equal? = false
[testng] => countloci PASSED
[testng] PASSED: testCountLoci
When a pipeline test fails due to an MD5 mismatch you can use the MD5 database to diff the results.
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### Updating MD5 file: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] Checking MD5 for pipelinetests/countloci/run/count.out [calculated=67823e4722495eb10a5e4c42c267b3a6, expected=67823e4722495eb10a5e0000deadbeef]
[testng] ##### Test countloci is going fail #####
[testng] ##### Path to expected file (MD5=67823e4722495eb10a5e0000deadbeef): integrationtests/67823e4722495eb10a5e0000deadbeef.integrationtest
[testng] ##### Path to calculated file (MD5=67823e4722495eb10a5e4c42c267b3a6): integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] ##### Diff command: diff integrationtests/67823e4722495eb10a5e0000deadbeef.integrationtest integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] FAILED: testCountLoci
[testng] java.lang.AssertionError: 1 of 1 MD5s did not match.
If you need to examine a number of MD5s which may have changed you can briefly shut off MD5 mismatch failures by setting parameterize = true.
spec.parameterize = true
spec.fileMD5s += testOut -> "67823e4722495eb10a5e4c42c267b3a6"
For this run:
ant pipelinetest -Dsingle=ExampleCountLociPipelineTest -Dpipeline.run=run
If there's a match the output will resemble:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is 67823e4722495eb10a5e4c42c267b3a6, equal? = true
[testng] => countloci PASSED
[testng] PASSED: testCountLoci
While for a mismatch it will look like this:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is 67823e4722495eb10a5e0000deadbeef, equal? = false
[testng] => countloci PASSED
[testng] PASSED: testCountLoci