As mentioned in the introductory materials, the core concept behind the GATK tools is the walker. The Queue scripting framework contains several mechanisms which make it easy to chain together GATK walkers.
As part of authoring your walker there are several Queue behaviors that you can specify for QScript authors using your particular walker.
Queue can significantly speed up generating walker outputs by passing different instances of the GATK the same BAM or VCF data but specifying different regions of the data to analyze. After the different instances output their individual results Queue will gather the results back to the original output path requested by QScript.
Queue limits the level it will split genomic data by examining the @PartitionBy() annotation for your walker which specifies a PartitionType. This table lists the different partition types along with the default partition level for each of the different walker types.
| PartitionType | Default for Walker Type | Description | Example Intervals | Example Splits |
|---|---|---|---|---|
| PartitionType.CONTIG | Read walkers | Data is grouped together so that all genomic data from the same contig is never presented to two different instances of the GATK. | original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 | split 1: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60; split 2:chr3:10-11 |
| PartitionType.INTERVAL | (none) | Data is split down to the interval level but never divides up an explicitly specified interval. If no explicit intervals are specified in the QScript for the GATK then this is effectively the same as splitting by contig. | original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 | split 1: chr1:10-11, chr2:10-20, chr2:30-40; split 2: chr2:50-60, chr3:10-11 |
| PartitionType.LOCUS | Locus walkers, ROD walkers | Data is split down to the locus level possibly dividing up intervals. | original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 | split 1: chr1:10-11, chr2:10-20, chr2:30-35; split 2: chr2:36-40, chr2:50-60, chr3:10-11 |
| PartitionType.NONE | Read pair walkers, Duplicate walkers | The data cannot be split and Queue must run the single instance of the GATK as specified in the QScript. | original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 | no split: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 |
If you walker is implemented in a way that Queue should not divide up your data you should explicitly set the @PartitionBy(PartitionType.NONE). If your walker can theoretically be run per genome location specify @PartitionBy(PartitionType.LOCUS).
@PartitionBy(PartitionType.LOCUS)
public class ExampleWalker extends LocusWalker<Integer, Integer> {
...
Queue will join the standard walker outputs.
| Output type | Default gatherer implementation |
|---|---|
| SAMFileWriter | The BAM files are joined together using Picard's MergeSamFiles. |
| VCFWriter | The VCF files are joined together using the GATK CombineVariants. |
| PrintStream | The first two files are scanned for a common header. The header is written once into the output, and then each file is appended to the output, skipping past with the header lines. |
If your PrintStream is not a simple text file that can be concatenated together, you must implement a Gatherer. Extend your custom Gatherer from the abstract base class and implement the gather() method.
package org.broadinstitute.sting.commandline;
import java.io.File;
import java.util.List;
/**
* Combines a list of files into a single output.
*/
public abstract class Gatherer {
/**
* Gathers a list of files into a single output.
* @param inputs Files to combine.
* @param output Path to output file.
*/
public abstract void gather(List<File> inputs, File output);
/**
* Returns true if the caller should wait for the input files to propagate over NFS before running gather().
*/
public boolean waitForInputs() { return true; }
}
Specify your gatherer using the @Gather() annotation by your @Output.
@Output
@Gather(MyGatherer.class)
public PrintStream out;
Queue will run your custom gatherer to join the intermediate outputs together.
Running 'ant queue' builds a set of Queue extensions for the GATK-Engine. Every GATK walker and command line program in the compiled GenomeAnalysisTK.jar a Queue compatible wrapper is generated.
The extensions can be imported via import org.broadinstitute.sting.queue.extensions.gatk._
import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._
class MyQscript extends QScript {
...
Note that the generated GATK extensions will automatically handle shell-escaping of all values assigned to the various Walker parameters, so you can rest assured that all of your values will be taken literally by the shell. Do not attempt to escape values yourself -- ie.,
Do this:
filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0")
NOT this:
filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"")
In addition to the GATK documentation on this wiki you can also find the full list of arguments for each walker extension in a variety of ways.
The source code for the extensions is generated during ant queue and placed in this directory:
build/queue-extensions/src
When properly configured an IDE can provide command completion of the walker extensions. See Queue with IntelliJ IDEA for our recommended settings.
If you do not have access to an IDE you can still find the names of the generated variables using the command line. The generated variable names on each extension are based off of the fullName of the Walker argument. To see the built in documentation for each Walker, run the GATK with:
java -jar GenomeAnalysisTK.jar -T <walker name> -help
Once the import statement is specified you can add() instances of gatk extensions in your QScript's script() method.
If the GATK walker input allows more than one of a value you should specify the values as a List().
def script() {
val snps = new UnifiedGenotyper
snps.reference_file = new File("testdata/exampleFASTA.fasta")
snps.input_file = List(new File("testdata/exampleBAM.bam"))
snps.out = new File("snps.vcf")
add(snps)
}
Although it may be harder for others trying to read your QScript, for each of the long name arguments the extensions contain aliases to their short names as well.
def script() {
val snps = new UnifiedGenotyper
snps.R = new File("testdata/exampleFASTA.fasta")
snps.I = List(new File("testdata/exampleBAM.bam"))
snps.out = new File("snps.vcf")
add(snps)
}
Here are a few more examples using various list assignment operators.
def script() {
val countCovariates = new CountCovariates
// Append to list using item appender :+
countCovariates.rodBind :+= RodBind("dbsnp", "VCF", dbSNP)
// Append to list using collection appender ++
countCovariates.covariate ++= List("ReadGroupCovariate", "QualityScoreCovariate", "CycleCovariate", "DinucCovariate")
// Assign list using plain old object assignment
countCovariates.input_file = List(inBam)
// The following is not a list, so just assigning one file to another
countCovariates.recal_file = outRecalFile
add(countCovariates)
}
By default Queue runs the GATK from the current classpath. This works best since the extensions are generated and compiled at time same time the GATK is compiled via ant queue.
If you need to swap in a different version of the GATK you may not be able to use the generated extensions. The alternate GATK jar must have the same command line arguments as the GATK compiled with Queue. Otherwise the arguments will not match and you will get an error when Queue attempts to run the alternate GATK jar. In this case you will have to create your own custom CommandLineFunction for your analysis.
def script {
val snps = new UnifiedGenotyper
snps.jarFile = new File("myPatchedGATK.jar")
snps.reference_file = new File("testdata/exampleFASTA.fasta")
snps.input_file = List(new File("testdata/exampleBAM.bam"))
snps.out = new File("snps.vcf")
add(snps)
}
Queue currently allows QScript authors to explicitly invoke scatter/gather on GATK walkers by setting the scatter count on a function.
def script {
val snps = new UnifiedGenotyper
snps.reference_file = new File("testdata/exampleFASTA.fasta")
snps.input_file = List(new File("testdata/exampleBAM.bam"))
snps.out = new File("snps.vcf")
snps.scatterCount = 20
add(snps)
}
This will run the UnifiedGenotyper up to 20 ways parallel and then will merge the partial VCFs back into the single snps.vcf.
Some walkers are still being updated to support Queue fully. For example they may not have defined the @Input and @Output and thus Queue is unable to correctly track their dependencies, or a custom Gatherer may not be implemented yet.
Hello I'm a developer in Korea. Recently, I have been developed about Bioinformatics pipeline. I'm using BWA, Samtools, Picard, GATK. And then I wanna make this tool on hadoop. The reason is why Using MR is efficient to speed or memory something like that. So, I know GATK is made by MR. If so, did you test GATK on MR? In theory, that is more efficient than just GATK.
And, If GATK needs indexed and sorted SAM, with using hadoop-BAM library do I just make index and sort??
Because I am novice in Bioinformatics, this issue is too complicated to me.
Sincere.
e-mail : leoniz127@gmail.com phone : +821027266808
I have a pipeline someone gave me; in it, it runs the following obsolete GATK command:
java -Xms5g -Xmx5g -jar src/GenomeAnalysisTK-1.0.5506/GenomeAnalysisTK.jar -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -R data/ucsc.hg19.fasta --DBSNP data/hg19/snp131.rod -I runs/nwp/run-000/chrUn_gl000228.realn.bam -recalFile runs/nwp/run-000/chrUn_gl000228.recal.csv
How does the following differ from above?
java -Xms5g -Xmx5g -jar src/GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar -T BaseRecalibrator -I runs/nwp/run-000/chrUn_gl000228.realn.bam -R data/ucsc.hg19.fasta -knownSites data/dbsnp_135.hg19.vcf -o runs/nwp/run-000/chrUn_gl000228.recal.grp
Then there is another step at this stage of the pipeline:
java -Xms5g -Xmx5g -jar src/GenomeAnalysisTK-1.0.5506/GenomeAnalysisTK.jar -R data/ucsc.hg19.fasta -I runs/nwp/run-000/chrUn_gl000228.realn.bam -o runs/nwp/run-000/chrUn_gl000228.recal.bam -T TableRecalibration -baq RECALCULATE --doNotWriteOriginalQuals -recalFile runs/nwp/run-000/chrUn_gl000228.recal.csv
How does one run this last step map in 2.1-8 version of GATK?
Hi to all
I have just started using GATK and I have few question about some tools and about the general workflow.
I have 3 exome-seq data from a trio and I have to detect rare or private variants that segregate with the disease.
From the 3 aligned bam file I procedeed with the GATK pipeline (ADDgroupInfo, MarkDup, Realign, BQSR, Unified Genotyper and variant filtration) and I generated 3 VCF file.
As now I have to use the PhaseByTrasmission tool, should I merge the 3 VCF file?
Or it was better to merge the BAM file after adding the group info and proceed with the other analysis?
And should I create my .ped file,(I visited http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped, but I couln't understand how ped file is generated) based on the read group that I have assigned?
Thanks!!!
Hi there, I wanted to reproduce in my variant calling Queue script the same conditional you have in MethodsDevelopmenCallingPipeline, i.e. including InbreedingCoeff depending on the number of samples. However, in that script the number of samples is passed to the Target object as an integer, and I would like to count it from the bam file list passed as an input to the script.
Therefore I followed the method in DataProcessingPipeline, i.e.
import org.broadinstitute.sting.queue.util.QScriptUtils
[...]
@Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
var bamFile: File = _
[...]
val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
val sampleNo = bamFilesList.size
But unfortunately, despite DataProcessingPipeline works just fine, when I put these lines in my other script I get the following error:
INFO 12:48:08,616 HelpFormatter - Date/Time: 2012/11/08 12:48:08
INFO 12:48:08,616 HelpFormatter - ----------------------------------------------------------------------
INFO 12:48:08,616 HelpFormatter - ----------------------------------------------------------------------
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.DynamicClassResolutionException: Could not create module HaplotypeCallerStep because Cannot instantiate class (Invocation failure) caused by exception null
at org.broadinstitute.sting.utils.classloader.PluginManager.createByType(PluginManager.java:306)
at org.broadinstitute.sting.utils.classloader.PluginManager.createAllTypes(PluginManager.java:317)
at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:126)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-2-gf44cc4e):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Could not create module HaplotypeCallerStep because Cannot instantiate class (Invocation failure) caused by exception null
##### ERROR ------------------------------------------------------------------------------------------
I tried several alternatives looking at the imports in DataProcessingPipeline but maybe I am missing something. Could you please advise?
thanks very much Francesco