GATK-Queue
Introduction
GATK-Queue is a command-line scripting framework for defining multi-stage genomic analysis pipelines, combined with an execution manager that runs those pipelines from end to end. Processing genome data often involves several steps to produce outputs. For example, our BAM to VCF calling pipeline includes, among other things:
- Local realignment around indels
- Emitting raw SNP calls
- Emitting indels
- Masking the SNPs at indels
- Annotating SNPs using chip data
- Labeling suspicious calls based on filters
- Creating a summary report with statistics
Running these tools one by one in series can take weeks, and speeding them up with parallel resources would otherwise require custom scripting.
With a Queue script, users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques, such as running multiple copies of the same program on different portions of the genome, to produce outputs faster.
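The idea of running copies of one program on different portions of the genome and then merging the results can be sketched in plain shell. This is only a conceptual illustration and not how Queue is implemented: process_interval is a hypothetical stand-in for a real analysis tool, and the interval and file names are made up.

```shell
#!/bin/sh
# Conceptual scatter-gather sketch; NOT Queue's implementation.
# process_interval is a hypothetical stand-in for a real analysis tool.
process_interval() {
    # $1 = interval name, $2 = output file for that interval
    echo "calls for $1" > "$2"
}

scatter_gather() {
    # $1 = gathered output file, remaining arguments = intervals
    out=$1
    shift
    tmpdir=$(mktemp -d)
    # Scatter: start one background job per interval.
    for interval in "$@"; do
        process_interval "$interval" "$tmpdir/$interval.part" &
    done
    # Block until every scattered job has finished.
    wait
    # Gather: concatenate the parts in the original interval order.
    : > "$out"
    for interval in "$@"; do
        cat "$tmpdir/$interval.part" >> "$out"
    done
    rm -rf "$tmpdir"
}

scatter_gather gathered.txt chr1 chr2 chr3
cat gathered.txt
```

With Queue, this bookkeeping (job submission, waiting, merging, and retrying after transient errors) is handled by the execution manager rather than by hand-written scripts.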
Building Queue from source
- Queue is part of the Sting repository.
- Make sure you have suitable versions of the JDK and Ant. See the GATK prerequisites section for more info.
- Download the source from our GitHub repository. Run the following command:
git clone git://github.com/broadgsa/gatk.git Sting
- Use ant to build the source. Queue uses the Ivy dependency manager to fetch all other dependencies.
cd Sting
ant queue
Running Queue
- See Running Queue for the first time for full details.
- Queue arguments can be listed by running with --help
java -jar dist/Queue.jar --help
- To list the arguments required by a QScript, add the script with -S and run with --help.
java -jar dist/Queue.jar -S script.scala --help
- By default Queue runs in a "dry" mode that generates the commands without executing them.
- After verifying the generated commands, execute the pipeline with -run.
- See QFunction and Command Line Options for more info on adjusting Queue options.
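The dry-run-then-run workflow above can be wrapped in a small helper. The following is a minimal sketch: queue_command is a hypothetical function name, script.scala is a placeholder, and the helper only prints the command lines (built from the -S and -run options described above) so that it can be tried without a Queue build.

```shell
#!/bin/sh
# Path to the Queue jar produced by "ant queue"; adjust to your checkout.
QUEUE_JAR=dist/Queue.jar

# Hypothetical helper: print the Queue command line for a QScript.
queue_command() {
    # $1 = QScript path, $2 = "run" for real execution; anything else is a dry run
    if [ "$2" = "run" ]; then
        echo "java -jar $QUEUE_JAR -S $1 -run"
    else
        echo "java -jar $QUEUE_JAR -S $1"
    fi
}

# Step 1: dry run (the default), to inspect the generated commands.
queue_command script.scala
# Step 2: once the dry-run output looks right, the real execution.
queue_command script.scala run
```

To launch Queue for real, replace each echo with the command itself, or pipe the printed line to sh.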
QScripts
General Information
- Queue pipelines are defined in Scala 2.8 files with a bit of syntactic sugar.
- A QScript performs the following steps:
  - New instances of CommandLineFunctions are created
  - Input and output arguments are specified on each function
  - The function is add()'ed to Queue for dispatch and monitoring
- Run the Queue pipelines on the command line as java -jar Queue.jar -S <script>.scala.
- See the main article Queue QScripts for more info on QScripts.
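The steps listed above can be seen in a very small script. The sketch below writes such a QScript to disk and prints the command that would run it. The Scala body is an assumption modeled on the HelloWorld example shipped with the example QScripts: the class name, the echo command, and the jar path are illustrative, and the API may differ between Queue versions, so check it against the examples in your own checkout.

```shell
#!/bin/sh
# Write a minimal QScript to disk. The Scala body is an assumption modeled on
# the HelloWorld example; verify it against the examples in your checkout.
cat > HelloWorld.scala <<'EOF'
import org.broadinstitute.sting.queue.QScript

class HelloWorld extends QScript {
  def script() {
    // Step 1: create a new CommandLineFunction.
    // Step 2 is empty here: this function has no @Input or @Output files.
    // Step 3: add() it to Queue for dispatch and monitoring.
    add(new CommandLineFunction {
      def commandLine = "echo hello world"
    })
  }
}
EOF

# Print the dry-run command (the default mode); append -run to execute.
echo "java -jar dist/Queue.jar -S HelloWorld.scala"
```

A real pipeline would declare its files with @Input and @Output on each function, which is how Queue can work out the order in which jobs must run.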
Supported QScripts
While most QScripts are analysis pipelines for specific projects, some have been released as supported tools. See
Example QScripts
- The latest versions of the example files are available in the Sting GitHub repository under public/scala/qscript/examples.
- See QScript - Examples for more info on running the example QScripts.
Visualization and Queue
QJobReport
As of 8/29/11, Queue automatically generates GATKReport-formatted runtime information about executed jobs:
- See this presentation for a general introduction to QJobReport.
- Queue attempts to run a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally, the script requires the gsalib library to be installed on the machine, which is typically done by providing its path in your .Rprofile file:
bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile
.libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")
- Caveat: the system only provides information about commands that have just run. When resuming a partially completed pipeline, the report covers only the jobs executed in the current run, not those completed earlier. This is due to a structural limitation in Queue and will be fixed when the Queue infrastructure improves.
- Caveat: this feature only works for the command line and LSF execution models. SGE should be easy to add for a motivated individual, but we cannot test this capability here at the Broad. Please send us a patch if you do extend Queue to support SGE.
DOT visualization of Pipelines
Queue emits a queue.dot file to help visualize your commands. You can open this file in dot, OmniGraffle, etc. to view your pipelines. By default the system will print out your LSF command lines, but this can be too much in a complex pipeline. To make the graph easier to read, override the dotString() function:
class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
  @Input(doc="BAM file to analyze") var bam = bamIn
  @Input(doc="index of the BAM file") var bamIndex = bai(bamIn)
  @Output(doc="recalibration data file to write") var recalData = recalDataIn
  memoryLimit = Some(4)
  // Short label shown in queue.dot instead of the full command line
  override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
  // The extra " " keeps args from running into the neighboring options
  def commandLine = gatkCommandLine("CountCovariates") + " " + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}
In the dot file this function now appears only as "CountCovariates: my.bam [args -OQ]", for example, rather than as its full command line.
[Figure: the base quality score recalibration pipeline, as visualized by DOT]
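If Graphviz is installed, a dot file can also be rendered from the command line. The sketch below is self-contained: it writes a tiny hand-made graph to sample_queue.dot (a hypothetical stand-in, since the node names and layout of a real queue.dot will differ) and renders it to PDF using the standard Graphviz -T and -o options.

```shell
#!/bin/sh
# Hypothetical stand-in for a queue.dot emitted by Queue; a real file will
# contain the dotString() labels of your own pipeline.
cat > sample_queue.dot <<'EOF'
digraph pipeline {
    "CountCovariates: my.bam [args -OQ]" -> "TableRecalibration: my.bam";
}
EOF

# Render with Graphviz if it is installed; -T selects the output format
# and -o names the output file.
if command -v dot >/dev/null 2>&1; then
    dot -Tpdf sample_queue.dot -o sample_queue.pdf
else
    echo "Graphviz 'dot' not found; install it to render the graph" >&2
fi
```

For a real pipeline, the same invocation applies to the emitted file: dot -Tpdf queue.dot -o queue.pdf.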
See also
- Running Queue for the first time
- Queue with IntelliJ IDEA
- Queue QScripts
- QFunction and Command Line Options
- Queue CommandLineFunctions
- Pipelining the GATK using Queue
- Queue with Grid Engine
- Queue Frequently Asked Questions
Additional Help
- See Queue Frequently Asked Questions for answers to some common issues.
- Please contact http://getsatisfaction.com/gsa for additional help.
