GATK-Queue

Introduction

GATK-Queue is a command-line scripting framework for defining multi-stage genomic analysis pipelines, combined with an execution manager that runs those pipelines from end to end. Processing genome data often involves several steps to produce outputs. For example, our BAM to VCF calling pipeline includes, among other things:

  • Local realignment around indels
  • Emitting raw SNP calls
  • Emitting indels
  • Masking the SNPs at indels
  • Annotating SNPs using chip data
  • Labeling suspicious calls based on filters
  • Creating a summary report with statistics

Running these tools one at a time in series can take weeks, and speeding them up with parallel resources would otherwise require custom scripting.

With a Queue script, users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.
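The idea of running copies of one program on different portions of the genome and then combining the results can be sketched in plain Scala. This is only an illustration of the scatter/gather pattern, not Queue's implementation: the names Region, scatter, processChunk, and scatterGather are hypothetical, and the per-chunk "job" simply counts bases.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ScatterGatherSketch {
  // A half-open region [start, end) on one contig; a stand-in for a genomic interval.
  case class Region(contig: String, start: Int, end: Int)

  // Scatter: cut one region into consecutive chunks of at most chunkSize bases.
  def scatter(region: Region, chunkSize: Int): Seq[Region] = {
    require(chunkSize > 0, "chunkSize must be positive")
    (region.start until region.end by chunkSize).map { s =>
      Region(region.contig, s, math.min(s + chunkSize, region.end))
    }
  }

  // Hypothetical per-chunk job: it only reports how many bases the chunk covers.
  def processChunk(chunk: Region): Int = chunk.end - chunk.start

  // Run every chunk as an independent parallel task, then gather by summing the results.
  def scatterGather(region: Region, chunkSize: Int): Int = {
    val jobs = scatter(region, chunkSize).map(c => Future(processChunk(c)))
    Await.result(Future.sequence(jobs), 1.minute).sum
  }
}
```

For example, ScatterGatherSketch.scatter(Region("chr1", 0, 250), 100) yields three chunks covering 0-100, 100-200, and 200-250, and scatterGather over the same region returns the same total (250) as processing the region in one piece.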

Building Queue from source

  • Queue is part of the Sting repository.
  • Make sure you have suitable versions of the JDK and Ant. See the GATK prerequisites section for more info.
  • Download the source from our GitHub repository by running the following command:
git clone git://github.com/broadgsa/gatk.git Sting
  • Use Ant to build the source. Queue uses the Ivy dependency manager to fetch all other dependencies.
cd Sting
ant queue

Running Queue

  • To list Queue's own command-line options, run it with --help.
java -jar dist/Queue.jar --help
  • To list the arguments required by a QScript, add the script with -S and run with --help.
java -jar dist/Queue.jar -S script.scala --help
  • By default Queue runs in a "dry" mode: it generates the commands without executing them.
  • After verifying the generated commands, execute the pipeline with -run.
  • See QFunction and Command Line Options for more info on adjusting Queue options.

QScripts

General Information

  • Queue pipelines are defined in Scala 2.8 files with a bit of syntactic sugar.
  • A QScript performs the following steps:
    • New instances of CommandLineFunctions are created.
    • Input and output arguments are specified on each function.
    • Each function is passed to add() so that Queue can dispatch and monitor it.
  • Run the Queue pipelines on the command line as java -jar Queue.jar -S <script>.scala.
  • See the main article Queue QScripts for more info on QScripts.
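The three steps listed above, together with the "dry" mode described under Running Queue, can be modeled in a few lines of standalone Scala. This is a toy model under stated assumptions, not Queue's API: ToyFunction, ToyPipeline, ordered, and dryRun are hypothetical names, the -I and -o flags and the tool names in the usage example are made up, and the sketch assumes that job order follows from which function produces which file.

```scala
import scala.collection.mutable.ListBuffer

object ToyQueue {
  // Stand-in for a command-line function: a name plus declared input and output files.
  case class ToyFunction(name: String, inputs: Seq[String], outputs: Seq[String]) {
    // Hypothetical command format; real tools define their own arguments.
    def commandLine: String =
      s"$name ${inputs.map("-I " + _).mkString(" ")} ${outputs.map("-o " + _).mkString(" ")}"
  }

  class ToyPipeline {
    private val functions = ListBuffer.empty[ToyFunction]

    // Register a function with the pipeline (the add() step).
    def add(f: ToyFunction): Unit = functions += f

    // Order functions so each one comes after the functions that produce its inputs.
    def ordered: List[ToyFunction] = {
      @annotation.tailrec
      def loop(pending: List[ToyFunction], done: List[ToyFunction]): List[ToyFunction] =
        if (pending.isEmpty) done.reverse
        else {
          // Files that still have to be produced by a pending function.
          val produced = pending.flatMap(_.outputs).toSet
          val (ready, blocked) = pending.partition(f => !f.inputs.exists(produced.contains))
          require(ready.nonEmpty, "cyclic dependency between functions")
          loop(blocked, ready.reverse ::: done)
        }
      loop(functions.toList, Nil)
    }

    // Dry mode: return the commands in execution order without running anything.
    def dryRun: List[String] = ordered.map(_.commandLine)
  }
}
```

Adding a hypothetical "call" function that reads realigned.bam before the "realign" function that writes it still produces a dry run with "realign" first, because the order comes from the declared files rather than from the order of the add() calls.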

Supported QScripts

While most QScripts are analysis pipelines for specific projects, some have been released as supported tools.

Example QScripts

Visualization and Queue

QJobReport

As of 8/29/11, Queue automatically generates GATKReport-formatted runtime information about executed jobs:

  • Queue attempts to run a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. The script also requires gsalib to be installed on the machine, which is typically done by adding its path to your .Rprofile file:
bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile
.libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")
  • Caveat: the system only provides information about commands that have just run. Resuming a partially completed pipeline will only show information for the jobs that ran in the resumed session, not for any previously completed commands. This is due to a structural limitation in Queue and will be fixed when the Queue infrastructure improves.
  • Caveat: this feature only works for the command line and LSF execution models. SGE should be easy to add for a motivated individual, but we cannot test this capability here at the Broad. Please send us a patch if you do extend Queue to support SGE.

DOT visualization of Pipelines

Queue emits a queue.dot file to help visualize your commands. You can open this file in dot, OmniGraffle, etc. to view your pipelines. By default the system prints out your LSF command lines, but this can be too much in a complex pipeline. To make the graph easier to read, override the dotString() function:

class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
    // Input and output files of this job.
    @Input(doc="BAM file to analyze") var bam = bamIn
    @Input(doc="Index of the BAM file") var bamIndex = bai(bamIn)
    @Output(doc="Recalibration data file to write") var recalData = recalDataIn
    memoryLimit = Some(4)
    // Short label used in queue.dot in place of the full command line.
    override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
    // Full command line; the two %s placeholders are filled with bam and recalData.
    def commandLine = gatkCommandLine("CountCovariates") + args +
        (" -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod" +
         " -I %s --max_reads_at_locus 20000" +
         " -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate" +
         " -recalFile %s").format(bam, recalData)
}

With this override, the dot file shows only a short label such as "CountCovariates: my.bam [args -OQ]" instead of the full command line. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here:

File:QueuePipeline.jpg
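To see why short labels matter, the following standalone sketch writes a small DOT graph from job labels like the one produced by dotString above. It does not reproduce the layout of Queue's queue.dot: Job, toDot, and the graph name "pipeline" are hypothetical, and the second job label in the usage note is only illustrative.

```scala
object DotSketch {
  // One job: a short label for the graph and the labels of the jobs it depends on.
  case class Job(label: String, dependsOn: Seq[String] = Nil)

  // Wrap a label in double quotes, escaping embedded quotes, so it is a valid DOT ID.
  private def quote(s: String): String = "\"" + s.replace("\"", "\\\"") + "\""

  // Emit one node per job and one edge from each dependency to the job that needs it.
  def toDot(jobs: Seq[Job]): String = {
    val nodes = jobs.map(j => s"  ${quote(j.label)};")
    val edges = for (j <- jobs; d <- j.dependsOn) yield s"  ${quote(d)} -> ${quote(j.label)};"
    (Seq("digraph pipeline {") ++ nodes ++ edges ++ Seq("}")).mkString("\n")
  }
}
```

Calling toDot with a job labeled "CountCovariates: my.bam [args -OQ]" and a second job that depends on it gives a two-node graph that dot can render directly. With full command lines as labels, the same graph would be much harder to read.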

See also

Additional Help
