This section contains articles related to developing for the GATK. Topics covered include how to write new walkers and Queue scripts, as well as some deeper GATK engine information that is relevant for developers.
The AlignmentContext and ReadBackedPileup work together to provide the read data associated with a given locus. This section details the tools the GATK provides for working with collections of aligned reads.
Read backed pileups are objects that contain all of the reads and their offsets that "pile up" at a locus on the genome. They are the basic input data for the GATK LocusWalkers, and underlie most of the locus-based analysis tools like the recalibrator and SNP caller. Unfortunately, there are many ways to view this data, and version one grew unwieldy trying to support all of these approaches. Version two of the ReadBackedPileup presents a consistent and clean interface for working pileup data, as well as supporting the iterable() interface to enable the convenient for ( PileupElement p : pileup ) for-each loop support.
The best way is simply to grab the pileup (the underlying representation of the locus data) from your AlignmentContext object in map:
public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context)
ReadBackedPileup pileup = context.getPileup();
This aligns your calculations with the GATK core infrastructure, and avoids any unnecessary data copying from the engine to your walker.
public ReadBackedPileup(GenomeLoc loc, ArrayList<PileupElement> pileup )
requiring only a list, in order of read / offset in the pileup, of PileupElements.
If you happen to have lists of SAMRecords and integer offsets into them you can construct a ReadBackedPileup this way:
public ReadBackedPileup(GenomeLoc loc, List<SAMRecord> reads, List<Integer> offsets )
for ( PileupElement p : pileup ) {
System.out.printf("%c %c %d%n", p.getBase(), p.getSecondBase(), p.getQual());
// you can get the read itself too using p.getRead()
}
This is the most efficient way to get data, and should be used whenever possible.
You can use:
public byte[] getBases()
public byte[] getSecondaryBases()
public byte[] getQuals()
To get the bases and quals as a byte[] array, which is the underlying base representation in the SAM-JDK.
Use the follow function to get counts of A, C, G, T in order:
public int[] getBaseCounts()
Which returns a int[4] vector with counts according to BaseUtils.simpleBaseToBaseIndex for each base.
The GATK can very efficiently stratify pileups by sample, and less efficiently stratify by read group, strand, mapping quality, base quality, or any arbitrary filter function. The sample-specific functions can be called as follows:
pileup.getSamples();
pileup.getPileupForSample(String sampleName);
In addition to the rich set of filtering primitives built into the ReadBackedPileup, you can supply your own primitives by implmenting a PileupElementFilter:
public interface PileupElementFilter {
public boolean allow(final PileupElement pileupElement);
}
and passing it to ReadBackedPileup's generic filter function:
public ReadBackedPileup getFilteredPileup(PileupElementFilter filter);
See the ReadBackedPileup's java documentation for a complete list of built-in filtering primitives.
While ReadBackedPileup is the preferred mechanism for aligned reads, some walkers still use the StratifiedAlignmentContext to carve up selections of reads. If you find functions that you require in StratifiedAlignmentContext that seem to have no analog in ReadBackedPileup, please let us know and we'll port the required functions for you.
The GATK build system uses the Ivy dependency manager to make it easy for our users to add additional dependencies. Ivy can pull the latest jars and their dependencies from the Maven repository, making adding or updating a dependency as simple as adding a new line to the ivy.xml file.
If your tool is available in the maven repository, add a line to the ivy.xml file similar to the following:
<dependency org="junit" name="junit" rev="4.4" />
If you would like to add a dependency to a tool not available in the maven repository, please email gsahelp@broadinstitute.org
Because we work so closely with the SAM-JDK/Picard team and are critically dependent on the code they produce, we have a special procedure for updating the SAM/Picard jars. Please use the following procedure to when updating sam-*.jar or picard-*.jar.
Download and build the latest versions of Picard public and Picard private (restricted to Broad Institute users) from their respective svns.
Get the latest svn versions for picard public and picard private by running the following commands:
svn info $PICARD_PUBLIC_HOME | grep "Revision" svn info $PICARD_PRIVATE_HOME | grep "Revision"
Rename the jars and xmls in $STING_HOME/settings/repository/net.sf to {picard|sam}-$PICARD_PUBLIC_MAJOR_VERSION.$PICARD_PUBLIC_MINOR_VERSION.PICARD_PUBLIC_SVN_REV.{jar|xml}
Update the jars in $STING_HOME/settings/repository/net.sf with their newer equivalents in $PICARD_PUBLIC_HOME/dist/picard_lib.
Update the xmls in $STING_HOME/settings/repository/net.sf with the appropriate version number ($PICARD_PUBLIC_MAJOR_VERSION.$PICARD_PUBLIC_MINOR_VERSION.$PICARD_PUBLIC_SVN_REV).
Create the picard private jar with the following command:
ant clean package -Dexecutable=PicardPrivate -Dpicard.dist.dir=${PICARD_PRIVATE_HOME}/dist
Rename picard-private-parts-*.jar in $STING_HOME/settings/repository/edu.mit.broad to picard-private-parts-$PICARD_PRIVATE_SVN_REV.jar.
Update picard-private-parts-*.jar in $STING_HOME/settings/repository/edu.mit.broad with the picard-private-parts.jar in $STING_HOME/dist/packages/picard-private-parts.
Update the xml in $STING_HOME/settings/repository/edu.mit.broad to reflect the new revision and publication date.
This document describes the workflow we use within GSA to do coverage analysis of the GATK codebase. It is primarily meant as an internal reference for team members, but are making it public to provide an example of how we work. There are a few mentions of internal server names etc.; please just disregard those as they will not be applicable to you.
ant clean with.clover unittest
Note that you have to explicitly disable scala (due to a limitation in how it's currently integrated in build.xml). Note you can use things like -Dsingle="ReducerUnitTest" as well.
It seems that clover requires a lot of memory, so a few things are necessary:
setenv ANT_OPTS "-Xmx8g"
There's plenty of memory on gsa4, so it's not a problem to require so much memory
You can add the argument -Dclover.instrument.level=statement if you want line-level resolution on the report, but note this is astronomically expensive for the entire unit test suite. It's fine though if you want to run specific run tests.
> ant clover.report
Buildfile: /Users/depristo/Desktop/broadLocal/GATK/unstable/build.xml
clover.report:
[clover-html-report] Clover Version 3.1.8, built on November 13 2012 (build-876)
[clover-html-report] Loaded from: /Users/depristo/Desktop/broadLocal/GATK/unstable/private/resources/clover/lib/clover.jar
[clover-html-report] Clover: Community License registered to Broad Institute.
[clover-html-report] Loading coverage database from: '/Users/depristo/Desktop/broadLocal/GATK/unstable/.clover/clover3_1_8.db'
[clover-html-report] Writing HTML report to '/Users/depristo/Desktop/broadLocal/GATK/unstable/clover_html'
[clover-html-report] Done. Processed 132 packages in 20943ms (158ms per package).
[mkdir] Created dir: /Users/depristo/private_html/report/clover
[copy] Copying 4545 files to /Users/depristo/private_html/report/clover
BUILD SUCCESSFUL
The clover files are present in a subdirectory clover_html as well as copied to your private_html/report directory. Note this can be very expensive given our large number of tests. For example, I've been waiting for the report to generate for nearly an hour on gsa4.
ant clean with.clover unittest clover.report
will clean the source, rebuild with clover engaged, run the unit tests, and generate the clover report. Note that currently unittests may be failing due to classcast and other exceptions in the clover run. We're looking into it. But you can still run clover.report after the failed run, as the db contains all of the run information, even through it failed (though failed methods won't be counted).
Here's a real-life example of assessing coverage in all BQSR utilities at once:
ant clean with.clover unittest -Dclover.instrument.level=statement -Dsingle="recalibration/*UnitTest" clover.report
Clover can make the tests very slow. Currently we are run in method count only mode (we don't have line number resolution (looking into fixing this). Also note that running with clover over the entire unittest set requires 32G of RAM (set automatically by ant).

This workflow is appropriate for developing unit tests for a single package or class. The turn-around time for clover on a single package is very fast, even with statement-level coverage. The overall workflow looks like:
Here's a concrete example. Right now I'm looking at the unit test coverage for GenomeLoc, one of the earliest and most important classes in the GATK. I really want good unit test coverage here. So I start by running GenomeLoc unit tests specifically:
ant clean with.clover unittest -Dsingle="GenomeLocUnitTest" -Dclover.instrument.level=statement clover.report
Next, I open up the clover coverage report in clover_html/index.html in my GATK directory, and landing on the Dashboard. Everything looks pretty bad, but that's because I only ran the GenomeLoc tests, and it displays the entire project coverage. I click on the "Coverage" link in the upper-left frame, and scroll down to the package where GenomeLoc lives (org.broadinstitute.sting.utils). At the bottom of this page I find my two classes, GenomeLoc and GenomeLocParser.CachingSequenceDictionary:

These have ~50% statement-level coverage each. Not ideal, really.
Let's dive into GenomeLoc itself a bit more. Clicking on the GenomeLoc link brings up to the code coverage page. Here you can see a few things very quickly.

For methods with poor test coverage (branches or overall) I'd look into their uses, and try to answer a few questions:
If the code needs tests, I would design specific unit tests (or data providers that cover all possible cases) for these function. Once that newly-written code is in place, I would rerun the ant tasks above to get updated coverage information, and continue until I'm satisfied.
In theory, any class implementing the OutputStream interface. In practice, three types of classes are commonly used: PrintStreams for plain text files, SAMFileWriters for BAM files, and VCFWriters for VCF files.
To declare a basic PrintStream for output, use the following declaration syntax:
@Output
public PrintStream out;
And use it just as you would any other PrintStream:
out.println("Hello, world!");
By default, @Output streams prepopulate fullName, shortName, required, and doc. required in this context means that the GATK will always fill in the contents of the out field for you. If the user specifies no --out command-line argument, the 'out' field will be prepopulated with a stream pointing to System.out.
If your walker outputs a custom format that requires more than simple concatenation by Queue you should also implement a custom Gatherer.
For some applications, you might need to manage their own SAM readers and writers directly from inside your walker. Current best practice for creating these Readers / Writers is to declare arguments of type SAMFileReader or SAMFileWriter as in the following example:
@Output
SAMFileWriter outputBamFile = null;
If you do not specify the full name and short name, the writer will provide system default names for these arguments. Creating a SAMFileWriter in this way will create the type of writer most commonly used by members of the GSA group at the Broad Institute -- it will use the same header as the input BAM and require presorted data. To change either of these attributes, use the StingSAMIterator interface instead:
@Output
StingSAMFileWriter outputBamFile = null;
and later, in initialize(), run one or both of the following methods:
outputBAMFile.writeHeader(customHeader); outputBAMFile.setPresorted(false);
You can change the header or presorted state until the first alignment is written to the file.
VCFWriter outputs behave similarly to PrintStreams and SAMFileWriters. Declare a VCFWriter as follows:
@Output(doc="File to which variants should be written",required=true) protected VCFWriter writer = null;
The walkers provide a protected logger instance. Users can adjust the debug level of the walkers using the -l command line option.
Turning on verbose logging can produce more output than is really necessary. To selectively turn on logging for a class or package, specify a log4j.properties property file from the command line as follows:
-Dlog4j.configuration=file:///<your development root>/Sting/java/config/log4j.properties
An example log4j.properties file is available in the java/config directory of the Git repository.
The GATK discovers walker documentation by reading it out of the Javadoc, Sun's design pattern for providing documentation for packages and classes. This page will provide an extremely brief explanation of how to write Javadoc; more information on writing javadoc comments can be found in Sun's documentation.
The GATK's build system uses the javadoc parser to extract the javadoc for classes and packages and embed the contents of that javadoc in the help system. If you add Javadoc to your package or walker, it will automatically appear in the help. The javadoc parser will pick up on 'standard' javadoc comments, such as the following, taken from PrintReadsWalker:
/**
* This walker prints out the input reads in SAM format. Alternatively, the walker can write reads into a specified BAM file.
*/
You can add javadoc to your package by creating a special file, package-info.java, in the package directory. This file should consist of the javadoc for your package plus a package descriptor line. One such example follows:
/**
* @help.display.name Miscellaneous walkers (experimental)
*/
package org.broadinstitute.sting.playground.gatk.walkers;
Additionally, the GATK provides two extra custom tags for overriding the information that ultimately makes it into the help.
@help.display.name Changes the name of the package as it appears in help. Note that the name of the walker cannot be changed as it is required to be passed verbatim to the -T argument.
@help.summary Changes the description which appears on the right-hand column of the help text. This is useful if you'd like to provide a more concise description of the walker that should appear in the help.
@help.description Changes the description which appears at the bottom of the help text with -T <your walker> --help is specified. This is useful if you'd like to present a more complete description of your walker.
Walkers can be hidden from the documentation system by adding the @Hidden annotation to the top of each walker. @Hidden walkers can still be run from the command-line, but their documentation will not be visible to end users. Please use this functionality sparingly to avoid walkers with hidden command-line options that are required for production use.
Because the building of our help text is actually heavyweight and can dramatically increase compile time on some systems, we have a mechanism to disable help generation.
Compile with the following command:
ant -Ddisable.help=true
to disable generation of help.
Yes.
For more information, see the ExampleUnifiedGenotyper.scala or examples of using Scala's traits/mixins illustrated in the QScripts documentation.
In your QScript, define a var list and annotate it with @Argument. Initialize the value to Nil.
@Argument(doc="filter names", shortName="filter")
var filterNames: List[String] = Nil
On the command line specify the arguments by repeating the argument name.
-filter filter1 -filter filter2 -filter filter3
Then once your QScript is run, the command line arguments will be available for use in the QScript's script method.
def script {
var myCommand = new MyFunction
myCommand.filters = this.filterNames
}
For a full example of command line arguments see the QScripts documentation.
Wrap the utility with an InProcessFunction. If your functionality is reusable code you should add it to Sting Utils with Unit Tests and then invoke your new function from your InProcessFunction. Computationally or memory intensive functions should NOT be implemented as InProcessFunctions, and should be wrapped in Queue CommandLineFunctions instead.
class MySplitter extends InProcessFunction {
@Input(doc="inputs")
var in: File = _
@Output(doc="outputs")
var out: List[File] = Nil
def run {
StingUtilityMethod.quickSplitFile(in, out)
}
}
var splitter = new MySplitter
splitter.in = new File("input.txt")
splitter.out = List(new File("out1.txt"), new File("out2.txt"))
add(splitter)
See Queue CommandLineFunctions for more information on how @Input and @Output are used.
Create an instance of a ListWriterFunction and add it in your script method.
import org.broadinstitute.sting.queue.function.ListWriterFunction
</pre>
<pre>
val writeBamList = new ListWriterFunction
writeBamList.inputFiles = bamFiles
writeBamList.listFile = new File("myBams.list")
add(writeBamList)
Queue contains a trait mixin you can use to add Log4J support to your classes.
Add the import for the trait Logging to your QScript.
import org.broadinstitute.sting.queue.util.Logging
Mixin the trait to your class.
class MyScript extends Logging {
...
Then use the mixed in logger to write debug output when the user specifies -l DEBUG.
logger.debug("This will only be displayed when debugging is enabled.")
Try ant clean.
Queue relies on a lot of Scala traits / mixins. These dependencies are not always picked up by the scala/java compilers leading to partially implemented classes. If that doesn't work please let us know in the forum.
No. QScript will create all parent directories for outputs.
Queue's LSF dispatcher automatically looks up and sets the maximum runtime for whichever LSF queue is specified. If you set your -jobQueue/.jobQueue to hour then you should see something like this under bjobs -l:
RUNLIMIT
240.0 min of gsa3
Queue GridEngine functionality is community supported. See here for full details: Queue with Grid Engine.
The easiest way to do this at the moment is to mixin a trait.
First define a trait which adds your java options:
trait RemoteDebugging extends JavaCommandLineFunction {
override def javaOpts = super.javaOpts + " -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
}
Then mix in the trait to your walker and otherwise run it as normal:
val printReadsDebug = new PrintReads with RemoteDebugging
printReadsDebug.reference_sequence = "my.fasta"
// continue setting up your walker...
add(printReadsDebug)
If you see something like the following, it means that Queue believes that it previously successfully generated all of the outputs.
INFO 16:25:55,049 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO 16:25:55,140 QCommandLine - Added 4 functions
INFO 16:25:55,140 QGraph - Generating graph.
INFO 16:25:55,164 QGraph - Generating scatter gather jobs.
INFO 16:25:55,714 QGraph - Removing original jobs.
INFO 16:25:55,716 QGraph - Adding scatter gather jobs.
INFO 16:25:55,779 QGraph - Regenerating graph.
INFO 16:25:55,790 QGraph - Running jobs.
INFO 16:25:55,853 QGraph - 0 Pend, 0 Run, 0 Fail, 10 Done
INFO 16:25:55,902 QCommandLine - Done
Queue will not re-run the job if a .done file is found for the all the outputs, e.g.: /path/to/.output.file.done. You can either remove the specific .done files yourself, or use the -startFromScratch command line option.
Scala is a combination of an object oriented framework and a functional programming language. For a good introduction see the free online book Programming Scala.
The following are extremely brief answers to frequently asked questions about Scala which often pop up when first viewing or editing QScripts. For more information on Scala there a multitude of resources available around the web including the Scala home page and the online Scala Doc.
var and val?var is a value you can later modify, while val is similar to final in Java.
Because the GATK and Queue are a mix of Scala and Java sometimes you'll run into problems when you need a Scala collection and instead a Java collection is returned.
MyQScript.scala:39: error: type mismatch;
found : java.util.List[java.lang.String]
required: scala.List[String]
val wrapped: List[String] = TextFormattingUtils.wordWrap(text, width)
Use the implicit definitions in JavaConversions to automatically convert the basic Java collections to and from Scala collections.
import collection.JavaConversions._
Scala has a very rich collections framework which you should take the time to enjoy. One of the first things you'll notice is that the default Scala collections are immutable, which means you should treat them as you would a String. When you want to 'modify' an immutable collection you need to capture the result of the operation, often assigning the result back to the original variable.
var str = "A"
str + "B"
println(str) // prints: A
str += "C"
println(str) // prints: AC
var set = Set("A")
set + "B"
println(set) // prints: Set(A)
set += "C"
println(set) // prints: Set(A, C)
Use the :+ operator for a single value.
var myList = List.empty[String]
myList :+= "a"
myList :+= "b"
myList :+= "c"
Use ++ for appending a list.
var myList = List.empty[String]
myList ++= List("a", "b", "c")
Use the + operator.
var mySet = Set.empty[String]
mySet += "a"
mySet += "b"
mySet += "c"
Use the + and -> operators.
var myMap = Map.empty[String,Int]
myMap += "a" -> 1
myMap += "b" -> 2
myMap += "c" -> 3
Option is a Scala generic type that can either be some generic value or None. Queue often uses it to represent primitives that may be null.
var myNullableInt1: Option[Int] = Some(1)
var myNullableInt2: Option[Int] = None
François Armand's slide deck is a good introduction: http://www.slideshare.net/normation/scala-dreaded
To quote from his slides:
Give me a variable name but
- I don't care of what it is
- and/or
- don't want to pollute my namespace with it
Use the .format() method.
This Java snippet:
String formatted = String.format("%s %i", myString, myInt);
In Scala would be:
val formatted = "%s %i".format(myString, myInt)
No. Currently Scala's Enumeration class does not interact with the Java reflection API in a way that could be used for Queue command line arguments. You can use Java enums if for example you are importing a Java based walker's enum type.
If/when we find a workaround for Queue we'll update this entry. In the meantime try using a String.
Yes. Be sure to install the scala plugin and setup your IDE as listed in [Queue with IntelliJ IDEA(http://gatkforums.broadinstitute.org/discussion/1309/queue-with-intellij-idea).
Check if there is an update to your Scala plugin as well.
Check your IntelliJ IDEA settings to for the following:
File Types have *.scala as a registered pattern for Scala files.This document describes the current GATK coding standards for documentation and unit testing. The overall goal is that all functions be well documented, have unit tests, and conform to the coding conventions described in this guideline. It is primarily meant as an internal reference for team members, but we are making it public to provide an example of how we work. There are a few mentions of specific team member responsibilities and who to contact with questions; please just disregard those as they will not be applicable to you.
The Genome Analysis Toolkit generally follows Java coding standards and good practices, which can be viewed at Sun's site.
The original coding standard document for the GATK was developed in early 2009. It remains a reasonable starting point but may be superseded by statements on this page (available as a PDF).
Code in the GATK should be structured into clear, simple, and testable functions. Clear means that the function takes a limited number of arguments, most of which are values not modified, and in general should return newly allocated results, as opposed to directly modifying the input arguments (functional style). The max. size of functions should be approximately one screen's worth of real estate (no more than 80 lines), including inline comments. If you are writing functions that are much larger than this, you must refactor your code into modular components.
Do not duplicate code. If you are finding yourself wanting to make a copy of functionality, refactor the code you want to duplicate and enhance it. Duplicating code introduces bugs, makes the system harder to maintain, and will require more work since you will have a new function that must be tested, as opposed to expanding the tests on the existing functionality.
Functions must be documented following the javadoc conventions. That means that the first line of the comment should be a simple statement of the purpose of the function. Following that is an expanded description of the function, such as edge case conditions, requirements on the argument, state changes, etc. Finally comes the @param and @return fields, that should describe the meaning of each function argument, restrictions on the values allowed or returned. In general, the return field should be about types and ranges of those values, not the meaning of the result, as this should be in the body of the documentation.
The GATK uses Contracts for Java to help us enforce code quality during testing. See CoFoJa for more information. If you've never programmed with contracts, read their excellent description Adding contracts to a stack. Contracts are only enabled when we are testing the code (unittests and integration tests) and not during normal execution, so contracts can be reasonably expensive to compute. They are best used to enforce assumptions about the status of class variables and return results.
Contracts are tricky when it comes to input arguments. The best practice is simple:
Below is an example private function that makes good use of input argument contracts:
/**
* Helper function to write out a IGV formatted line to out, at loc, with values
*
* http://www.broadinstitute.org/software/igv/IGV
*
* @param out a non-null PrintStream where we'll write our line
* @param loc the location of values
* @param featureName string name of this feature (see IGV format)
* @param values the floating point values to associate with loc and feature name in out
*/
@Requires({
"out != null",
"loc != null",
"values.length > 0"
})
private void printIGVFormatRow(final PrintStream out, final GenomeLoc loc, final String featureName, final double ... values) {
// note that start and stop are 0 based, but the stop is exclusive so we don't subtract 1
out.printf("%s\t%d\t%d\t%s", loc.getContig(), loc.getStart() - 1, loc.getStop(), featureName);
for ( final double value : values )
out.print(String.format("\t%.3f", value));
out.println();
}
Final java fields cannot be reassigned once set. Nearly all variables you write should be final, unless they are obviously accumulator results or other things you actually want to modify. Nearly all of your function arguments should be final. Being final stops incorrect reassigns (a major bug source) as well as more clearly captures the flow of information through the code.
/**
* Get the reference bases from referenceReader spanned by the extended location of this active region,
* including additional padding bp on either side. If this expanded region would exceed the boundaries
* of the active region's contig, the returned result will be truncated to only include on-genome reference
* bases
* @param referenceReader the source of the reference genome bases
* @param padding the padding, in BP, we want to add to either side of this active region extended region
* @param genomeLoc a non-null genome loc indicating the base span of the bp we'd like to get the reference for
* @return a non-null array of bytes holding the reference bases in referenceReader
*/
@Ensures("result != null")
public byte[] getReference( final IndexedFastaSequenceFile referenceReader, final int padding, final GenomeLoc genomeLoc ) {
if ( referenceReader == null ) throw new IllegalArgumentException("referenceReader cannot be null");
if ( padding < 0 ) throw new IllegalArgumentException("padding must be a positive integer but got " + padding);
if ( genomeLoc == null ) throw new IllegalArgumentException("genomeLoc cannot be null");
if ( genomeLoc.size() == 0 ) throw new IllegalArgumentException("GenomeLoc must have size > 0 but got " + genomeLoc);
final byte[] reference = referenceReader.getSubsequenceAt( genomeLoc.getContig(),
Math.max(1, genomeLoc.getStart() - padding),
Math.min(referenceReader.getSequenceDictionary().getSequence(genomeLoc.getContig()).getSequenceLength(), genomeLoc.getStop() + padding) ).getBases();
return reference;
}
All classes and methods in the GATK should have unit tests to ensure that they work properly, and to protect yourself and others who may want to extend, modify, enhance, or optimize you code. That GATK development team assumes that anything that isn't unit tested is broken. Perhaps right now they aren't broken, but with a team of 10 people they will become broken soon if you don't ensure they are correct going forward with unit tests.
Walkers are a particularly complex issue. UnitTesting the map and reduce results is very hard, and in my view largely unnecessary. That said, you should write your walkers and supporting classes in such a way that all of the complex data processing functions are separated from the map and reduce functions, and those should be unit tested properly.
Code coverage tells you how much of your class, at the statement or function level, has unit testing coverage. The GATK development standard is to reach something >80% method coverage (and ideally >80% statement coverage). The target is flexible as some methods are trivial (they just call into another method) so perhaps don't need coverage. At the statement level, you get deducted from 100% for branches that check for things that perhaps you don't care about, such as illegal arguments, so reaching 100% statement level coverage is unrealistic for most clases.
You can find out more information about generating code coverage results at Analyzing coverage with clover
We've created a unit testing example template in the GATK codebase that provides examples of creating core GATK data structures from scratch for unit testing. The code is in class ExampleToCopyUnitTest and can be viewed here in github directly ExampleToCopyUnitTest.
As of GATK 2.5, we are moving to a full code review process, which has the following benefits:
# starting a new feature
git checkout -b rp_pairhmm_GSA-332
git commit -av
git push -u origin rp_pairhmm_GSA-332
# doing work on existing feature
git commit -av
git push
# ready to submit pull-request
git fetch origin
git rebase -i origin/master
git push -f
# after being accepted, delete your branch
git checkout master
git pull
git branch -d rp_pairhmm_GSA-332
(the reviewer will remove your github branch)
You must commit your code in small commit blocks with commit messages that follow the git best practices, which require the first line of the commit to summarize the purpose of the commit, followed by -- lines that describe the changes in more detail. For example, here's a recent commit that meets this criteria that added unit tests to the GenomeLocParser:
Refactoring and unit testing GenomeLocParser
-- Moved previously inner class to MRUCachingSAMSequenceDictionary, and unit test to 100% coverage
-- Fully document all functions in GenomeLocParser
-- Unit tests for things like parsePosition (shocking it wasn't tested!)
-- Removed function to specifically create GenomeLocs for VariantContexts. The fact that you must incorporate END attributes in the context means that createGenomeLoc(Feature) works correctly
-- Depreciated (and moved functionality) of setStart, setStop, and incPos to GenomeLoc
-- Unit test coverage at like 80%, moving to 100% with next commit
Now, git encourages you to commit code often, and develop your code in whatever order or what is best for you. So it's common to end up with 20 commits, all with strange, brief commit messages, that you want to push into the master branch. It is not acceptable to push such changes. You need to use the git command rebase to reorganize your commit history so satisfy the small number of clear commits with clear messages.
Here is a recommended git workflow using rebase:
Start every project by creating a new branch for it. From your master branch, type the following command (replacing "myBranch" with an appropriate name for the new branch):
git checkout -b myBranch
Note that you only include the -b when you're first creating the branch. After a branch is already created, you can switch to it by typing the checkout command without the -b: "git checkout myBranch"
Also note that since you're always starting a new branch from master, you should keep your master branch up-to-date by occasionally doing a "git pull" while your master branch is checked out. You shouldn't do any actual work on your master branch, however.
When you want to update your branch with the latest commits from the central repo, type this while your branch is checked out:
git fetch && git rebase origin/master
If there are conflicts while updating your branch, git will tell you what additional commands to use.
If you need to combine or reorder your commits, add "-i" to the above command, like so:
git fetch && git rebase -i origin/master
If you want to edit your commits without also retrieving any new commits, omit the "git fetch" from the above command.
If you find the above commands cumbersome or hard to remember, create aliases for them using the following commands:
git config --global alias.up '!git fetch && git rebase origin/master'
git config --global alias.edit '!git fetch && git rebase -i origin/master'
git config --global alias.done '!git push origin HEAD:master'
Then you can type "git up" to update your branch, "git edit" to combine/reorder commits, and "git done" to push your branch.
Here are more useful tutorials on how to use rebase:
If you need help with rebasing, talk to Mauricio or David and they will help you out.
The picard-public repository on sourceforge, in addition to housing the sam-jdk and picard-public, is now home to tribble and the org.broadinstitute.variant package (which includes VariantContext and associated classes as well as the VCF/BCF codecs).
If you just need to check out the sources and don't need to make any commits into the picard repository, the command is:
svn checkout svn://svn.code.sf.net/p/picard/code/trunk picard-public
Then you can attach the picard-public/src/java directory in IntelliJ as a source directory (File -> Project Structure -> Libraries -> Click the plus sign -> "Attach Files or Directories" in the latest IntelliJ).
To build picard, sam-jdk, tribble, and org.broadinstitute.variant all at once, type ant from within the picard-public directory. To run tests, type ant test
If you do need to make commits into the picard repository, first you'll need to create a sourceforge account, then contact the Picard team and request access for that account. Once you've been given access, you'll need to check out their repository using a command of this form:
svn checkout --username=YOUR_USERNAME svn+ssh://YOUR_USERNAME@svn.code.sf.net/p/picard/code/trunk picard-public
Users identify which GATK walker to run by specifying a walker name via the --analysis_type command-line argument. By default, the GATK will derive the walker name from a walker by taking the name of the walker class and removing packaging information from the start of the name, and removing the trailing text Walker from the end of the name, if it exists. For example, the GATK would, by default, assign the name PrintReads to the walker class org.broadinstitute.sting.gatk.walkers.PrintReadsWalker. To override the default walker name, annotate your walker class with @WalkerName("<my name>").
Walkers can flag exactly which primary data sources are allowed and required for a given walker. Reads, the reference, and reference-ordered data are currently considered primary data sources. Different traversal types have different default requirements for reads and reference, but currently no traversal types require reference-ordered data by default. You can add requirements to your walker with the @Requires / @Allows annotations as follows:
@Requires(DataSource.READS)
@Requires({DataSource.READS,DataSource.REFERENCE})
@Requires(value={DataSource.READS,DataSource.REFERENCE})
@Requires(value=DataSource.REFERENCE})
By default, all parameters are allowed unless you lock them down with the @Allows attribute. The command:
@Allows(value={DataSource.READS,DataSource.REFERENCE})
will only allow the reads and the reference. Any other primary data sources will cause the system to exit with an error.
Note that as of August 2011, the GATK no longer supports RMD the @Requires and @Allows syntax, as these have moved to the standard @Argument system.
Any command-line argument can be tagged with a comma-separated list of freeform tags.
The syntax for tags is as follows:
-<argument>:<tag1>,<tag2>,<tag3> <argument value>
for example:
-I:tumor <my tumor data>.bam
-eval,VCF yri.trio.chr1.vcf
There is currently no mechanism in the GATK to validate either the number of tags supplied or the content of those tags.
Tags can be accessed from within a walker by calling getToolkit().getTags(argumentValue), where argumentValue is the
parsed contents of the command-line argument to inspect.
The GATK currently has comprehensive support for tags on two built-in argument types:
-I,--input_file <input_file>
Input BAM files and BAM file lists can be tagged with any type. When a BAM file list is tagged, the tag is applied to each listed BAM file.
From within a walker, use the following code to access the supplied tag or tags:
getToolkit().getReaderIDForRead(read).getTags();
Input RODs, e.g. `-V
Tags are used to specify ROD name and ROD type. There is currently no support for adding additional tags. See the ROD system documentation for more details.
Users can create command-line arguments for walkers by creating public member variables annotated with @Argument in the walker. The @Argument annotation takes a number of differentparameters:
fullName
The full name of this argument. Defaults to the toLowerCase()’d member name. When specifying fullName on the command line, prefix with a double dash (--).
shortName
The alternate, short name for this argument. Defaults to the first letter of the member name. When specifying shortName on the command line, prefix with a single dash (-).
doc
Documentation for this argument. Will appear in help output when a user either requests help with the –-help (-h) argument or when a user specifies an invalid set of arguments. Documentation is the only argument that is always required.
required
Whether the argument is required when used with this walker. Default is required = true.
exclusiveOf
Specifies that this argument is mutually exclusive of another argument in the same walker. Defaults to not mutually exclusive of any other arguments.
validation
Specifies a regular expression used to validate the contents of the command-line argument. If the text provided by the user does not match this regex, the GATK will abort with an error.
By default, all command-line arguments will appear in the help system. To prevent new and debugging arguments from appearing in the help system,
you can add the @Hidden tag below the @Argument annotation, hiding it from the help system but allowing users to supply it on the command-line.
Please use this functionality sparingly to avoid walkers with hidden command-line options that are required for production use.
Arguments can be passed to the walker using either the full name or the short name. If passing arguments using the full name, the syntax is −−<arg full name> <value>.
--myint 6
If passing arguments using the short name, the syntax is -<arg short name> <value>. Note that there is a space between the short name and the value:
-m 6
Boolean (class) and boolean (primitive) arguments are a special in that they require no argument. The presence of a boolean indicates true, and its absence indicates false. The following example sets a flag to true.
-B
Two additional annotations can influence the behavior of command-line arguments.
@Hidden
Adding this annotation to an @Argument tells the help system to avoid displaying any evidence that this argument exists. This can be used to add additional debugging arguments that aren't suitable for mass consumption.
@Deprecated
Forces the GATK to throw an exception if this argument is supplied on the command-line. This can be used to supply extra documentation to the user as command-line parameters change for walkers that are in flux.
Create an required int parameter with full name –myint, short name -m. Pass this argument by adding –myint 6 or -m 6 to the command line.
import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
@Argument(doc="my integer")
public int myInt;
Create an optional float parameter with full name –myFloatingPointArgument, short name -m. Pass this argument by adding –myFloatingPointArgument 2.71 or -m 2.71.
import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
@Argument(fullName="myFloatingPointArgument",doc="a floating point argument",required=false)
public float myFloat;
The GATK will parse the argument differently depending on the type of the public member variable’s type. Many different argument types are supported, including primitives and their wrappers, arrays, typed and untyped collections, and any type with a String constructor. When the GATK cannot completely infer the type (such as in the case of untyped collections), it will assume that the argument is a String. GATK is aware of concrete implementations of some interfaces and abstract classes. If the argument’s member variable is of type List or Set, the GATK will fill the member variable with a concrete ArrayList or TreeSet, respectively. Maps are not currently supported.
Besides @Argument, the GATK provides two additional types for command-line arguments: @Input and @Output. These two inputs are very similar to @Argument but act as flags to indicate dataflow to Queue, our pipeline management software.
The @Input tag indicates that the contents of the tagged field represents a file that will be read by the walker.
The @Output tag indicates that the contents of the tagged field represents a file that will be written by the walker, for consumption by downstream walkers.
We're still determining the best way to model walker dependencies in our pipeline. As we determine best practices, we'll post them here.
As of August 2011, the GATK now provides a clean mechanism for creating walker @Input arguments and using these arguments to access Reference Meta Data provided by the RefMetaDataTracker in the map() call. This mechanism is preferred to the old implicit string-based mechanism, which has been retired.
At a very high level, the new RodBindings provide a handle for a walker to obtain the Feature records from Tribble from a map() call, specific to a command line binding provided by the user. This can be as simple as a single ROD file argument|one-to-one binding between a command line argument and a track, or as complex as an argument argument accepting multiple command line arguments, each with a specific name. The RodBindings are generic and type specific, so you can require users to provide files that emit VariantContexts, BedTables, etc, or simply the root type Feature from Tribble. Critically, the RodBindings interact nicely with the GATKDocs system, so you can provide summary and detailed documentation for each RodBinding accepted by your walker.
Suppose you have a walker that uses a single track of VariantContexts, such as SelectVariants, in its calculation. You declare a standard GATK-style @Input argument in the walker, of type RodBinding<VariantContext>:
@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;
This will require the user to provide a command line option --variant:vcf my.vcf to your walker. To get access to your variants, in the map() function you provide the variants variable to the tracker, as in:
Collection<VariantContext> vcs = tracker.getValues(variants, context.getLocation());
which returns all of the VariantContexts in variants that start at context.getLocation(). See RefMetaDataTracker in the javadocs to see the full range of getter routines.
Note that, as with regular tribble tracks, you have to provide the Tribble type of the file as a tag to the argument (:vcf). The system now checks up front that the corresponding Tribble codec produces Features that are type-compatible with the type of the RodBinding<T>.
The RodBinding class is generic, parameterized as RodBinding<T extends Feature>. This T class describes the type of the Feature required by the walker. The best practice for declaring a RodBinding is to choose the most general Feature type that will allow your walker to work. For example, if all you really care about is whether a Feature overlaps the site in map, you can use Feature itself, which supports this, and will allow any Tribble type to be provided, using a RodBinding<Feature>. If you are manipulating VariantContexts, you should declare a RodBinding<VariantContext>, which will restrict automatically the user to providing Tribble types that can create a object consistent with the VariantContext class (a VariantContext itself or subclass).
Note that in multi-argument RodBindings, as List<RodBinding<T>> arg, the system will require all files provided here to provide an object of type T. So List<RodBinding<VariantContext>> arg requires all -arg command line arguments to bind to files that produce VariantContexts.
The RodBinding system supports the standard @Argument style of allowing a vararg argument by wrapping it in a Java collection. For example, if you want to allow users to provide any number of comp tracks to your walker, simply declare a List<RodBinding<VariantContext>> field:
@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;
With this declaration, your walker will accept any number of -comp arguments, as in:
-comp:vcf 1.vcf -comp:vcf 2.vcf -comp:vcf 3.vcf
For such a command line, the comps field would be initialized to the List with three RodBindings, the first binding to 1.vcf, the second to 2.vcf and finally the third to 3.vcf.
Because this is a required argument, at least one -comp must be provided. Vararg @Input RodBindings can be optional, but you should follow proper varargs style to get the best results.
If you want to make a RodBinding optional, you first need to tell the @Input argument that its options (required=false):
@Input(fullName="discordance", required=false)
private RodBinding<VariantContext> discordanceTrack;
The GATK automagically sets this field to the value of the special static constructor method makeUnbound(Class c) to create a special "unbound" RodBinding here. This unbound object is type safe, can be safely passed to the RefMetaDataTracker get methods, and is guaranteed to never return any values. It also returns false when the isBound() method is called.
An example usage of isBound is to conditionally add header lines, as in:
if ( mask.isBound() ) {
hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));
}
The case for vararg style RodBindings is slightly different. If you want, as above, users to be able to omit the -comp track entirely, you should initialize the value of the collection to the appropriate emptyList/emptySet in Collections:
@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=false)
public List<RodBinding<VariantContext>> comps = Collections.emptyList();
which will ensure that comps.isEmpty() is true when no -comp is provided.
@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;
By default, the getName() method in RodBinding returns the fullName of the @Input. This can be overloaded on the command-line by providing not one but two tags. The first tag is interpreted as the name for the binding, and the second as the type. As in:
-variant:vcf foo.vcf => getName() == "variant"
-variant:foo,vcf foo.vcf => getName() == "foo"
This capability is useful when users need to provide more meaningful names for arguments, especially with variable arguments. For example, in VariantEval, there's a List<RodBinding<VariantContext>> comps, which may be dbsnp, hapmap, etc. This would be declared as:
@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;
where a normal command line usage would look like:
-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf
In the code, you might have a loop that looks like:
for ( final RodBinding comp : comps )
for ( final VariantContext vc : tracker.getValues(comp, context.getLocation())
out.printf("%s has a binding at %s%n", comp.getName(), getToolkit().getGenomeLocParser.createGenomeLoc(vc));
which would print out lines that included things like:
hapmap has a binding at 1:10
omni has a binding at 1:20
hapmap has a binding at 1:30
1000g has a binding at 1:30
This last example begs the question -- what happens with getName() when explicit names are not provided? The system goes out of its way to provide reasonable names for the variables:
The first occurrence is named for the fullName, where comp
Subsequent occurrences are postfixed with an integer count, starting at 2, so comp2, comp3, etc.
In the above example, the command line
-comp:vcf hapmap.vcf -comp:vcf omni.vcf -comp:vcf 1000g.vcf
would emit
comp has a binding at 1:10
comp2 has a binding at 1:20
comp has a binding at 1:30
comp3 has a binding at 1:30
The new RodBinding system supports a simple form of dynamic type resolution. If the input filetype can be specially associated with a single Tribble type (as VCF can), then you can omit the type entirely from the the command-line binding of a RodBinding!
So whereas a full command line would look like:
-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf
because these are VCF files they could technically be provided as:
-comp:hapmap hapmap.vcf -comp:omni omni.vcf -comp:1000g 1000g.vcf
If you don't care about naming, you can now say:
-comp hapmap.vcf -comp omni.vcf -comp 1000g.vcf
The best practice is simple: use a javadoc style comment above the @Input annotation, with the standard first line summary and subsequent detailed discussion of the meaning of the argument. These are then picked up by the GATKdocs system and added to the standard walker docs, following the standard structure of GATKDocs @Argument docs. Below is a best practice documentation example from SelectVariants, which accepts a required variant track and two optional discordance and concordance tracks.
public class SelectVariants extends RodWalker<Integer, Integer> {
/**
* Variants from this file are sent through the filtering and modifying routines as directed
* by the arguments to SelectVariants, and finally are emitted.
*/
@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;
/**
* A site is considered discordant if there exists some sample in eval that has a non-reference genotype
* and either the site isn't present in this track, the sample isn't present in this track,
* or the sample is called reference in this track.
*/
@Input(fullName="discordance", shortName = "disc", doc="Output variants that were not called in this Feature comparison track", required=false)
private RodBinding<VariantContext> discordanceTrack;
/**
* A site is considered concordant if (1) we are not looking for specific samples and there is a variant called
* in both variants and concordance tracks or (2) every sample present in eval is present in the concordance
* track and they have the sample genotype call.
*/
@Input(fullName="concordance", shortName = "conc", doc="Output variants that were also called in this Feature comparison track", required=false)
private RodBinding<VariantContext> concordanceTrack;
}
Note how much better the above version is compared to the old pre-Rodbinding syntax (code below). Below you have a required argument variant that doesn't show up as a formal argument in the GATK, different from the conceptually similar @Arguments for discordanceRodName and concordanceRodName, which have no type restrictions. There's no place to document the variant argument as well, so the system is effectively blind to this essential argument.
@Requires(value={},referenceMetaData=@RMD(name="variant", type=VariantContext.class))
public class SelectVariants extends RodWalker<Integer, Integer> {
@Argument(fullName="discordance", shortName = "disc", doc="Output variants that were not called on a ROD comparison track. Use -disc ROD_NAME", required=false)
private String discordanceRodName = "";
@Argument(fullName="concordance", shortName = "conc", doc="Output variants that were also called on a ROD comparison track. Use -conc ROD_NAME", required=false)
private String concordanceRodName = "";
}
In these examples, we have declared two RodBindings in the Walker
@Input(fullName="mask", doc="Input ROD mask", required=false)
public RodBinding<Feature> mask = RodBinding.makeUnbound(Feature.class);
@Input(fullName="comp", doc="Comparison track", required=false)
public List<RodBinding<VariantContext>> comps = new ArrayList<VariantContext>();
Get the first value
Feature f = tracker.getFirstValue(mask)
Get all of the values at a location
Collection<Feature> fs = tracker.getValues(mask, thisGenomeLoc)
Get all of the features here, regardless of track
Collection<Feature> fs = tracker.getValues(Feature.class)
Determining if an optional RodBinding was provided . if ( mask.isBound() ) // writes out the mask header line, if one was provided hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));
if ( ! comps.isEmpty() ) logger.info("At least one comp was provided")
In QScripts when you need to tag a file use the class TaggedFile which extends from java.io.File.
| Example | in the QScript | on the Command Line |
|---|---|---|
| Untagged VCF | myWalker.variant = new File("my.vcf") |
-V my.vcf |
| Tagged VCF | myWalker.variant = new TaggedFile("my.vcf", "VCF") |
-V:VCF my.vcf |
| Tagged VCF | myWalker.variant = new TaggedFile("my.vcf", "VCF,custom=value") |
-V:VCF,custom=value my.vcf |
| Labeling a tumor | myWalker.input_file :+= new TaggedFile("mytumor.bam", "tumor") |
-I:tumor mytumor.bam |
No longer need to (or can) use @Requires and @Allows for ROD data. This system is now retired.
The primary goal of the GATK is to provide a suite of small data access patterns that can easily be parallelized and otherwise externally managed. As such, rather than asking walker authors how to iterate over a data stream, the GATK asks the user how data should be presented.
Walk over the data set one location (single-base locus) at a time, presenting all overlapping reads, reference bases, and reference-ordered data.
The @By attribute can be used to control whether locus walkers see all loci or just covered loci. To switch between viewing all loci and covered loci, apply one of the following attributes:
@By(DataSource.REFERENCE)
@By(DataSource.READS)
By default, the following filters are automatically added to every locus walker.
These walkers walk over the data set one location at a time, but only those locations covered by reference-ordered data. They are essentially a special case of locus walkers. ROD walkers are read-free traversals that include operate over Reference Ordered Data and the reference genome at sites where there is ROD information. They are geared for high-performance traversal of many RODs and the reference such as VariantEval and CallSetConcordance. Programmatically they are nearly identical to RefWalkers<M,T> traversals with the following few quirks.
RODWalkers are only called at sites where there is at least one non-interval ROD bound. For example, if you are exploring dbSNP and some GELI call set, the map function of a RODWalker will be invoked at all sites where there is a dbSNP record or a GELI record.
Because of this skipping RODWalkers receive a context object where the number of reference skipped bases between map calls is provided:
nSites += context.getSkippedBases() + 1; // the skipped bases plus the current location
In order to get the final count of skipped bases at the end of an interval (or chromosome) the map function is called one last time with null ReferenceContext and RefMetaDataTracker objects. The alignment context can be accessed to get the bases skipped between the last (and final) ROD and the end of the current interval.
ROD walkers inherit the same filters as locus walkers:
Changing to a RODWalker is very easy -- here's the new top of VariantEval, changing the system to a RodWalker from its old RefWalker state:
//public class VariantEvalWalker extends RefWalker<Integer, Integer> {
public class VariantEvalWalker extends RodWalker<Integer, Integer> {
The map function must now capture the number of skipped bases and protect itself from the final interval map calls:
public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
nMappedSites += context.getSkippedBases();
if ( ref == null ) { // we are seeing the last site
return 0;
}
nMappedSites++;
That's all there is to it!
A ROD walker can be very efficient compared to a RefWalker in the situation where you have sparse RODs. Here is a comparison of ROD vs. Ref walker implementation of VariantEval:
| RODWalker | RefWalker | |
|---|---|---|
| dbSNP and 1KG Pilot 2 SNP calls on chr1 | 164u (s) | 768u (s) |
| Just 1KG Pilot 2 SNP calls on chr1 | 54u (s) | 666u (s) |
Read walkers walk over the data set one read at a time, presenting all overlapping reference bases and reference-ordered data.
By default, the following filters are automatically added to every read walker.
Read pair walkers walk over a queryname-sorted BAM, presenting each mate and its pair. No reference bases or reference-ordered data are presented.
By default, the following filters are automatically added to every read pair walker.
Duplicate walkers walk over a read and all its marked duplicates. No reference bases or reference-ordered data are presented.
By default, the following filters are automatically added to every duplicate walker.
When running either single-threaded or in shared-memory parallelism mode, the GATK guarantees that output written to an output stream created via the @Argument mechanism will ultimately be assembled in genomic order. In order to assemble the final output file, the GATK will write the output generated from each thread into a temporary output file, ultimately assembling the data via a central coordinating thread. There are three major elements in the GATK that facilitate this functionality:
Stub
The front-end interface to the output management system. Stubs will be injected into the walker by the command-line argument system and relay information from the walker to the output management system. There will be one stub per invocation of the GATK.
Storage
The back end interface, responsible for creating, writing and deleting temporary output files as well as merging their contents back into the primary output file. One Storage object will exist per shard processed in the GATK.
OutputTracker
The dispatcher; ultimately connects the stub object's output creation request back to the most appropriate storage object to satisfy that request. One OutputTracker will exist per GATK invocation.
Stubs are directly injected into the walker through the GATK's command-line argument parser as a go-between from walker to output management system. When a walker calls into the stub it's first responsibility is to call into the output tracker to retrieve an appropriate storage object. The behavior of the OutputTracker from this point forward depends mainly on the parallelization mode of this traversal of the GATK.
the OutputTracker (implemented as DirectOutputTracker) will create the storage object if necessary and return it to the stub.
The stub will forward the request to the provided storage object.
At the end of the traversal, the microscheduler will request that the OutputTracker finalize and close the file.
The OutputTracker (implemented as ThreadLocalOutputTracker) will look for a storage object associated with this thread via a ThreadLocal.
If no such storage object exists, it will be created pointing to a temporary file.
At the end of each shard processed, that file will be closed and an OutputMergeTask will be created so that the shared-memory parallelism code can merge the output at its leisure.
The shared-memory parallelism code will merge when a fixed number of temporary files appear in the input queue. The constant used to determine this frequency is fixed at compile time (see HierarchicalMicroScheduler.MAX_OUTSTANDING_OUTPUT_MERGES).
To use the output management system, declare a field in your walker of one of the existing core output types, coupled with either an @Argument or @Output annotation.
@Output(doc="Write output to this BAM filename instead of STDOUT")
SAMFileWriter out;
Currently supported output types are SAM/BAM (declare SAMFileWriter), VCF (declare VCFWriter), and any non-buffering stream extending from OutputStream.
To create a new output type, three types must be implemented: Stub, Storage, and ArgumentTypeDescriptor.
Create a new Stub class, extending/inheriting the core output type's interface and implementing the Stub interface.
OutputStreamStub extends OutputStream implements Stub<OutputStream> {
Implement a register function so that the engine can provide the stub with the session's OutputTracker.
public void register( OutputTracker outputTracker ) {
this.outputTracker = outputTracker;
}
Add as fields any parameters necessary for the storage object to create temporary storage.
private final File targetFile;
public File getOutputFile() { return targetFile; }
Implement/override every method in the core output type's interface to pass along calls to the appropriate storage object via the OutputTracker.
public void write( byte[] b, int off, int len ) throws IOException {
outputTracker.getStorage(this).write(b, off, len);
}
Create a Storage class, again extending inheriting the core output type's interface and implementing the Storage interface.
public class OutputStreamStorage extends OutputStream implements Storage<OutputStream> {
Implement constructors that will accept just the Stub or Stub + alternate file path and create a repository for data, and a close function that will close that repository.
public OutputStreamStorage( OutputStreamStub stub ) { ... }
public OutputStreamStorage( OutputStreamStub stub, File file ) { ... }
public void close() { ... }
Implement a mergeInto function capable of reconstituting the file created by the constructor, dumping it back into the core output type's interface, and removing the source file.
public void mergeInto( OutputStream targetStream ) { ... }
Add a block to StorageFactory.createStorage() capable of creating the new storage object. TODO: use reflection to generate the storage classes.
if(stub instanceof OutputStreamStub) {
if( file != null )
storage = new OutputStreamStorage((OutputStreamStub)stub,file);
else
storage = new OutputStreamStorage((OutputStreamStub)stub);
}
Create a new object inheriting from type ArgumentTypeDescriptor. Note that the ArgumentTypeDescriptor does NOT need to support the core output type's interface.
public class OutputStreamArgumentTypeDescriptor extends ArgumentTypeDescriptor {
Implement a truth function indicating which types this ArgumentTypeDescriptor can service.
@Override
public boolean supports( Class type ) {
return SAMFileWriter.class.equals(type) || StingSAMFileWriter.class.equals(type);
}
Implement a parse function that constructs the new Stub object. The function should register this type as an output by caling engine.addOutput(stub).
public Object parse( ParsingEngine parsingEngine, ArgumentSource source, Type type, ArgumentMatches matches ) {
...
OutputStreamStub stub = new OutputStreamStub(new File(fileName));
...
engine.addOutput(stub);
....
return stub;
}
Add a creator for this new ArgumentTypeDescriptor in CommandLineExecutable.getArgumentTypeDescriptors().
protected Collection<ArgumentTypeDescriptor> getArgumentTypeDescriptors() {
return Arrays.asList( new VCFWriterArgumentTypeDescriptor(engine,System.out,argumentSources),
new SAMFileWriterArgumentTypeDescriptor(engine,System.out),
new OutputStreamArgumentTypeDescriptor(engine,System.out) );
}
After creating these three objects, the new output type should be ready for usage as described above.
Only non-buffering iterators are currently supported by the GATK. Of particular note, PrintWriter will appear to drop records if created by the command-line argument system; use PrintStream instead.
For efficiency, the GATK does not reduce output files together following the tree pattern used by shared-memory parallelism; output merges happen via an independent queue. Because of this, output merges happening during a treeReduce may not behave correctly.
GATK-Queue is command-line scripting framework for defining multi-stage genomic analysis pipelines combined with an execution manager that runs those pipelines from end-to-end. Often processing genome data includes several steps to produces outputs, for example our BAM to VCF calling pipeline include among other things:
Running these tools one by one in series may often take weeks for processing, or would require custom scripting to try and optimize using parallel resources.
With a Queue script users can semantically define the multiple steps of the pipeline and then hand off the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.
You have two options: donwload the binary distribution (prepackaged, ready to run program) or build it from source.
This is obviously the easiest way to go. Links are on the Downloads page.
Briefly, here's what you need to know/do:
Queue is part of the Sting repository. Download the source from our repository on Github. Run the following command:
git clone git://github.com/broadgsa/gatk.git Sting
Use ant to build the source.
cd Sting
ant queue
Queue uses the Ivy dependency manager to fetch all other dependencies. Just make sure you have suitable versions of the JDK and Ant!
See this article on how to test your installation of Queue.
See this article on running Queue for the first time for full details.
Queue arguments can be listed by running with --help
java -jar dist/Queue.jar --help
To list the arguments required by a QScript, add the script with -S and run with --help.
java -jar dist/Queue.jar -S script.scala --help
Note that by default queue runs in a "dry" mode, as explained in the link above. After verifying the generated commands execute the pipeline by adding -run.
See QFunction and Command Line Options for more info on adjusting Queue options.
Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.
Every QScript includes the following steps:
add() to Queue for dispatch and monitoringThe basic command-line to run the Queue pipelines on the command line is
java -jar Queue.jar -S <script>.scala
See the main article Queue QScripts for more info on QScripts.
While most QScripts are analysis pipelines that are custom-built for specific projects, some have been released as supported tools. See
The latest version of the example files are available in the Sting github repository under public/scala/qscript/examples
Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.
Note that Queue attempts to generate a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally the script requires the gsalib to be installed on the machine, which is typically done by providing its path in your .Rprofile file:
bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile .libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")
The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the completed commands. This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves
This feature only works for command line and LSF execution models. SGE should be easy to add for a motivated individual but we cannot test this capabilities here at the Broad. Please send us a patch if you do extend Queue to support SGE.
Queue emits a queue.dot file to help visualize your commands. You can open this file in programs like DOT, OmniGraffle, etc to view your pipelines. By default the system will print out your LSF command lines, but this can be too much in a complex pipeline.
To clarify your pipeline, override the dotString() function:
class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
@Input(doc="foo") var bam = bamIn
@Input(doc="foo") var bamIndex = bai(bamIn)
@Output(doc="foo") var recalData = recalDataIn
memoryLimit = Some(4)
override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
def commandLine = gatkCommandLine("CountCovariates") + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}
Here we only see CountCovariates my.bam [-OQ], for example, in the dot file. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here:
The GATK team would love to hear about any applications within which of the GATK-Lite codebase is embedded or walkers which you have chosen to distribute. Please send an email to gsahelp to let us know!
When redistributing the GATK-Lite codebase, please abide by the terms of our copyright:
/*
* Copyright (c) 2009 The Broad Institute
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
* OTHER DEALINGS IN THE SOFTWARE.
*/
The packaging tool in the Sting repository can layout packages for redistribution. Currently, only walkers checked into the GATK's git repository are well supported by the packaging system. Example packaging files can be found in $STING_HOME/packages.
Create a package xml for your project inside $STING_HOME/packages.
Key elements within the package xml include:
executable
Each occurrence of this tag will create an executable jar of the given name tag, using the main method from the given main-class tag.
main-class
This is the main class for the package. When running with java -jar YOUR_JAR.jar, main-class is the class that will be executed.
dependencies
Other dependencies can be of type class or file. If of type class, a dependency analyzer will look for all dependencies of your classes and include those files as well. File dependencies will end up in the root of your package.
resources
Supplemental files can be added to the resources section. Resource files will be copied to the resources directory within the package.
To create a package, execute the following command:
cd $STING_HOME
ant package -Dexecutable=<your executable name>
The packaging system will create a layout directory in dist/packages/<your executable>. Examine the contents of this directory. When you are happy with the results, finalize the package by running the following:
tar cvhjf <your executable>.tar.bz2 <your executable>
The MapReduce architecture of the GATK allows most walkers in the GATK to be run in a parallel-processing mode. The GATK supports two basic parallel processing models known as shared memory and scatter-gather.
Shared memory parallelism
Parallelism within a single multi-threading process with access to a large, shared RAM. Shared memory parallelism is stable and supported by many tools that access pileups of bases.
Scatter/gather (SG) parallelism
In SG parallelism, the target genomic regions are divided up into N independent GATK instances that are run separately on a single machine or across a computing cluster. The outputs of each independent walker, are merged together once all are completed. SG works very efficiently in the GATK, provided the output of a walker is independent per site (e.g. Unified Genotyper) or per chromosome (e.g. Indel Realigner). SG parallelism is a completely stable approach in the GATK, and used routinely by the GATK team in processing large data sets; it is also natively supported by GATK-Queue, which automatically scatters and gathers GATK processes given a desired N number of processes to execute simultaneously.
Note that parallel-processing will significantly speed up data processing but may produce statistically insignificant differences. While this non-determinism is not ideal in practice the minute differences have been mathematically meaningless while producing consistent results in a reasonable amount of time for whole genome and whole exome data. However, if absolute determinism is more important than speed we recommend you do not use parallelism with the GATK.
There are costs and benefits to each type of parallelism in the GATK, as outlined in the following table.
Comparison of standard parallelism approaches in the GATK
| Property | Shared Memory | Scatter/Gather |
|---|---|---|
| Stability | Stable | Stable | Retired in codebase |
| Applicable walker types | By locus and by ROD only. ReadWalkers are not supported. | All walker types. ReadWalkers can only be split safely by chromosome in general |
| Example walkers | UnifiedGenotyper, BaseRecalibrator, VariantEval | All walkers, including ReadWalkers like IndelRealigner |
| Scalability | Fewer than 32 cores. Each thread operates completely independently, so N threads requires N times more memory than 1 thread alone. Best scaling at 8 or fewer threads. | Hundreds of processes. Limited by capabilities of the underlying storage system, in general. Isilon-class storage can run thousands of jobs effectively. |
| How to enable | Use the -nt argument in the Engine, on any walker that supports shared memory parallelism (the engine will tell you if it does not). |
1. Provide -L interval lists to the GATK; a different one for each of the N independent GATK tools. For example -L chr1 for first process, and -L chr2 for the second. When all processes have finished, merge the output together, as appropriate (e.g. use MergeSam.jar for BAMs, and cat or CombineVariants for VCFs). 2. Use GATK-Queue to automatically divide up your GATK jobs. For example, using this.scatterCount = 10 argument will result in 10 independent processes. |
| Pros | - Easy to enable. - Single output file merged together by internally by the GATK engine - Efficiently uses multi-core processors sharing a single memory space | - Works for all walker types, including ReadWalkers - Scales to hundreds of independent jobs - Easy to enable with single -L argument - Directly supported and managed by GATK-Queue - Totally independent processing per interval -- failed parts can be easily resumed without repeating already successfully processed regions |
| Cons | - Limited to fewer than 32 processors without significant overhead - Limited to cores physically present on the machine, cannot take advantage of computing cluster resources - Does not work for ReadWalkers (Table Recalibrator, Indel Realigner) | - Requires manual preparation of sliced genomic intervals for processing (if you aren't using GATK-Queue). - For ReadWalkers and other tools that can only be processed by chromosome, leading to load balancing problems (chr1 is much bigger than chr22) - Sensitive to data density variation over the genome. Dividing chr20 processing in 63 1MB chunks leads to 10x differences in completion times due to data pileups around the centromere, for example. - Must wait until all parts of the scatter have completed before gathering, making the process sensitive to farm scheduling problems. If a job will run for M minutes, but waits Z minutes to start on the farm, the entire SG process will not complete for at least M + Z minutes. |
Almost certainly, either shared memory or scatter/gather parallelism is the right choice for your NGS pipeline. Our go-to option for parallelism here at the Broad Institute is S/G, since S/G allows you to cut up your jobs into hundreds of pieces to run on our standard computing farm, using GATK-Queue. When I have a small job that needs to be run quickly, am testing out program options or need a quick VariantEval result, I'm using shared memory parallelism with ~10 threads on a single large computer with a lot (>64 GB) of memory.
Basically, if I have N processors, and I want to choose between shared memory or S/G, here's how I would choose:
If all N processors are on a single computer, and my walker supports it, use shared memory parallelism
If not, use S/G
The GATK currently supports a hierarchical version of parallelism. In this form of parallelism, data is divided into shards, each shard is map/reduced independently, and the results are combined with a 'tree reduce' step. While the framework handles much of the heavy lifting of data division required for parallelism, each walker must individually be reviewed to ensure that it isn't tracking state internally in a non-threadsafe way. Many tools support shared memory parallelism, including critical tools such as:
Please review the source to discover if your walker is parallelizable, or attempt to run it with -nt 2 and if it the engine doesn't complain the walker is parallelized.
In shared memory parallelism, each thread operates on a 16 kbp shard of reference sequence in a completely independent memory space from all other threads. Up to 24 threads can run efficiently in this design, but greater parallelism is limited by resource starvation from the single reduce thread and/or memory inefficiency by keeping each thread’s data totally independent. See gatkParallelism performance 082112 these plots for an analysis of the scalability of several key GATK tools as a function of nt.
Run the GATK, using the -nt command-line argument to specify the number of threads that the GATK should attempt to use.
[[image:HierarchicalParallelism.png|thumb|Shared memory parallelism architecture]]
First, be aware that some walkers may, by design, require a rewrite for complete parallelization.
When implementing a standard (non-parallelized) walker, one must implement the reduce method, which combines an individual datum returned by the map function with the aggregate of the prior map calls. When implementing a parallelizable walker, one must also implement the org.broadinstitute.sting.gatk.walkers.TreeReducible interface and the treeReduce() function. The TreeReduce function tells the GATK how to combine two adjacent reduces, rather than one map result and one reduce.
The GATK supports writing to output files from either the map or the reduce when running in parallel. However, only unbuffered writers are currently supported. Please use PrintStream rather than PrintWriter at this time.
The GATK's support for parallelism is currently limited. The following classes of walkers are not supported by our parallelization framework:
Note that each thread operates completely independently in the current GATK implementation of shared memory parallelism. So if you provide 1Gb to the GATK with nt 1, then you should provide 4Gb to run with nt 4. If you don't do this, it is possible to starve out the GATK so that it runs much much slower.
The performance of the multi-threaded GATK is really dependent on whether you are IO or CPU limited and the relative overhead of the parallelism on your computer. Additionally, nt can start to have very high overheads with nt > 20 due to our implementation and memory contention issues.
The best option for nt is a value less or equal to the number of available cores with sufficient memory to run each threads (nt x amount provided to 1 core), capped additionally by the available IO bandwidth so that the individual threads don't starve each other.
Scatter / gather is a simple approach for process-independent parallelism with the GATK. First you scatter multiple independent GATK instances out over a network of computing nodes, each operating on slightly different genomic intervals, and when they all complete, the output of each is gathered up into a merged output dataset. In the GATK S/G is extremely easy to use, as all of the GATK tools support the -L option to operate over only genomic specific intervals, and many tools emit files that can be merged together across multiple regions. Unified Genotyper, for example, can operate over the whole genome in one process, or on each chromosome independently. The output of this later approach, after merging together, should be the same as the whole genome results, minus any slight differences due to random number effects. The simplicity and efficiency of S/G parallelism makes this a key option for getting things done quickly with the GATK.
Using S/G parallelism is extremely easy, either by hand or using the automated Scatter/Gather in GATK-Queue. Suppose I have the following command line:
java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1
This runs a single process of the GATK over chr1, and let's say it takes an hour when I run it. In order to run it with S/G parallelism mode, just partition chr1 into two independent parts:
This file distributed.tracker.txt will contain genomic locations and GATK process ids that are processing each location, in text format, so you can cat it. If you run at the command line:
gsa1> java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1:1-125,000,000 -o my.1.vcf &
gsa1> java -jar GenomeAnalysisTK -R human.fasta -T UnifiedGenotyper -I my.bam -L chr1:125,000,001-249,250,621 -o my.2.vcf &
When these two jobs finish, I just merge the two VCFs together and I've got a complete data set in half the time.
As mentioned in the introductory materials, the core concept behind the GATK tools is the walker. The Queue scripting framework contains several mechanisms which make it easy to chain together GATK walkers.
As part of authoring your walker there are several Queue behaviors that you can specify for QScript authors using your particular walker.
Queue can significantly speed up generating walker outputs by passing different instances of the GATK the same BAM or VCF data but specifying different regions of the data to analyze. After the different instances output their individual results Queue will gather the results back to the original output path requested by QScript.
Queue limits the level it will split genomic data by examining the @PartitionBy() annotation for your walker which specifies a PartitionType. This table lists the different partition types along with the default partition level for each of the different walker types.
| PartitionType | Default for Walker Type | Description | Example Intervals | Example Splits |
|---|---|---|---|---|
| PartitionType.CONTIG | Read walkers | Data is grouped together so that all genomic data from the same contig is never presented to two different instances of the GATK. | original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 | split 1: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60; split 2:chr3:10-11 |
| PartitionType.INTERVAL | (none) | Data is split down to the interval level but never divides up an explicitly specified interval. If no explicit intervals are specified in the QScript for the GATK then this is effectively the same as splitting by contig. | original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 | split 1: chr1:10-11, chr2:10-20, chr2:30-40; split 2: chr2:50-60, chr3:10-11 |
| PartitionType.LOCUS | Locus walkers, ROD walkers | Data is split down to the locus level possibly dividing up intervals. | original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 | split 1: chr1:10-11, chr2:10-20, chr2:30-35; split 2: chr2:36-40, chr2:50-60, chr3:10-11 |
| PartitionType.NONE | Read pair walkers, Duplicate walkers | The data cannot be split and Queue must run the single instance of the GATK as specified in the QScript. | original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 | no split: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 |
If you walker is implemented in a way that Queue should not divide up your data you should explicitly set the @PartitionBy(PartitionType.NONE). If your walker can theoretically be run per genome location specify @PartitionBy(PartitionType.LOCUS).
@PartitionBy(PartitionType.LOCUS)
public class ExampleWalker extends LocusWalker<Integer, Integer> {
...
Queue will join the standard walker outputs.
| Output type | Default gatherer implementation |
|---|---|
| SAMFileWriter | The BAM files are joined together using Picard's MergeSamFiles. |
| VCFWriter | The VCF files are joined together using the GATK CombineVariants. |
| PrintStream | The first two files are scanned for a common header. The header is written once into the output, and then each file is appended to the output, skipping past with the header lines. |
If your PrintStream is not a simple text file that can be concatenated together, you must implement a Gatherer. Extend your custom Gatherer from the abstract base class and implement the gather() method.
package org.broadinstitute.sting.commandline;
import java.io.File;
import java.util.List;
/**
* Combines a list of files into a single output.
*/
public abstract class Gatherer {
/**
* Gathers a list of files into a single output.
* @param inputs Files to combine.
* @param output Path to output file.
*/
public abstract void gather(List<File> inputs, File output);
/**
* Returns true if the caller should wait for the input files to propagate over NFS before running gather().
*/
public boolean waitForInputs() { return true; }
}
Specify your gatherer using the @Gather() annotation by your @Output.
@Output
@Gather(MyGatherer.class)
public PrintStream out;
Queue will run your custom gatherer to join the intermediate outputs together.
Running 'ant queue' builds a set of Queue extensions for the GATK-Engine. Every GATK walker and command line program in the compiled GenomeAnalysisTK.jar a Queue compatible wrapper is generated.
The extensions can be imported via import org.broadinstitute.sting.queue.extensions.gatk._
import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._
class MyQscript extends QScript {
...
Note that the generated GATK extensions will automatically handle shell-escaping of all values assigned to the various Walker parameters, so you can rest assured that all of your values will be taken literally by the shell. Do not attempt to escape values yourself -- ie.,
Do this:
filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0")
NOT this:
filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"")
In addition to the GATK documentation on this wiki you can also find the full list of arguments for each walker extension in a variety of ways.
The source code for the extensions is generated during ant queue and placed in this directory:
build/queue-extensions/src
When properly configured an IDE can provide command completion of the walker extensions. See Queue with IntelliJ IDEA for our recommended settings.
If you do not have access to an IDE you can still find the names of the generated variables using the command line. The generated variable names on each extension are based off of the fullName of the Walker argument. To see the built in documentation for each Walker, run the GATK with:
java -jar GenomeAnalysisTK.jar -T <walker name> -help
Once the import statement is specified you can add() instances of gatk extensions in your QScript's script() method.
If the GATK walker input allows more than one of a value you should specify the values as a List().
def script() {
val snps = new UnifiedGenotyper
snps.reference_file = new File("testdata/exampleFASTA.fasta")
snps.input_file = List(new File("testdata/exampleBAM.bam"))
snps.out = new File("snps.vcf")
add(snps)
}
Although it may be harder for others trying to read your QScript, for each of the long name arguments the extensions contain aliases to their short names as well.
def script() {
val snps = new UnifiedGenotyper
snps.R = new File("testdata/exampleFASTA.fasta")
snps.I = List(new File("testdata/exampleBAM.bam"))
snps.out = new File("snps.vcf")
add(snps)
}
Here are a few more examples using various list assignment operators.
def script() {
val countCovariates = new CountCovariates
// Append to list using item appender :+
countCovariates.rodBind :+= RodBind("dbsnp", "VCF", dbSNP)
// Append to list using collection appender ++
countCovariates.covariate ++= List("ReadGroupCovariate", "QualityScoreCovariate", "CycleCovariate", "DinucCovariate")
// Assign list using plain old object assignment
countCovariates.input_file = List(inBam)
// The following is not a list, so just assigning one file to another
countCovariates.recal_file = outRecalFile
add(countCovariates)
}
By default Queue runs the GATK from the current classpath. This works best since the extensions are generated and compiled at time same time the GATK is compiled via ant queue.
If you need to swap in a different version of the GATK you may not be able to use the generated extensions. The alternate GATK jar must have the same command line arguments as the GATK compiled with Queue. Otherwise the arguments will not match and you will get an error when Queue attempts to run the alternate GATK jar. In this case you will have to create your own custom CommandLineFunction for your analysis.
def script {
val snps = new UnifiedGenotyper
snps.jarFile = new File("myPatchedGATK.jar")
snps.reference_file = new File("testdata/exampleFASTA.fasta")
snps.input_file = List(new File("testdata/exampleBAM.bam"))
snps.out = new File("snps.vcf")
add(snps)
}
Queue currently allows QScript authors to explicitly invoke scatter/gather on GATK walkers by setting the scatter count on a function.
def script {
val snps = new UnifiedGenotyper
snps.reference_file = new File("testdata/exampleFASTA.fasta")
snps.input_file = List(new File("testdata/exampleBAM.bam"))
snps.out = new File("snps.vcf")
snps.scatterCount = 20
add(snps)
}
This will run the UnifiedGenotyper up to 20 ways parallel and then will merge the partial VCFs back into the single snps.vcf.
Some walkers are still being updated to support Queue fully. For example they may not have defined the @Input and @Output and thus Queue is unable to correctly track their dependencies, or a custom Gatherer may not be implemented yet.
These are the most popular Queue command line options. For a complete and up to date list run with -help. QScripts may also add additional command line options.
| Command Line Argument | Description | Default |
|---|---|---|
-run |
If passed the scripts are run. If not passed a dry run is executed. | dry run |
-jobRunner <jobrunner> |
The job runner to dispatch jobs. Setting to Lsf706, GridEngine, or Drmaa will dispatch jobs to LSF or Grid Engine using the job settings (see below). Defaults to Shell which runs jobs on a local shell one at a time. |
Shell |
-bsub |
Alias for -jobRunner Lsf706 |
not set |
-qsub |
Alias for -jobRunner GridEngine |
not set |
-status |
Prints out a summary progress. If a QScript is currently running via -run, you can run the same command line with -status instead to print a summary of progress. |
not set |
-retry <count> |
Retries a QFunction that returns a non-zero exit code up to count times. The QFunction must not have set jobRestartable to false. |
0 = no retries |
-startFromScratch |
Restarts the graph from the beginning. If not specified for each output file specified on a QFunction, ex: /path/to/output.file, Queue will not re-run the job if a .done file is found for the all the outputs, ex: /path/to/.output.file.done. |
use .done files to determine if jobs are complete |
-keepIntermediates |
By default Queue deletes the output files of QFunctions that set .isIntermediate to true. |
delete intermediate files |
-statusTo <email> |
Email address to send status to whenever a) A job fails, or b) Queue has run all the functions it can run and is exiting. | not set |
-statusFrom <email> |
Email address to send status emails from. | user@local.domain |
-dot <file> |
If set renders the job graph to a dot file. | not rendered |
-l <logging_level> |
The minimum level of logging, DEBUG, INFO, WARN, or FATAL. |
INFO |
-log <file> |
Sets the location to save log output in addition to standard out. | not set |
-debug |
Set the logging to include a lot of debugging information (SLOW!) | not set |
-jobReport |
Path to write the job report text file. If R is installed and available on the $PATH then a pdf will be generated visualizing the job report. |
jobPrefix.jobreport.txt |
-disableJobReport |
Disables writing the job report. | not set |
-help |
Lists all of the command line arguments with their descriptions. | not set |
The following options can be specified on the command line over overridden per QFunction.
| Command Line Argument | QFunction Property | Description | Default |
|---|---|---|---|
-jobPrefix |
.jobName |
The unique name of the job. Used to prefix directories and log files. Use -jobNamePrefix on the Queue command line to replace the default prefix Q-<processid>@<host>. |
<jobNamePrefix>-<jobNumber> |
| N/A | .jobOutputFile |
Captures stdout and if jobErrorFile is null it captures stderr as well. |
<jobName>.out |
| N/A | .jobErrorFile |
If not null captures stderr. | null |
| N/A | .commandDirectory |
The directory to execute the command line from. | current directory |
-jobProject |
.jobProject |
The project name for the job. | default job runner project |
-jobQueue |
.jobQueue |
The queue to dispatch the job. | default job runner queue |
-jobPriority |
.jobPriority |
The dispatch priority for the job. Lowest priority = 0. Highest priority = 100. |
default job runner priority |
-jobNative |
.jobNativeArgs |
Native args to pass to the job runner. Currently only supported in GridEngine and Drmaa. The string is concatenated to the native arguments passed over DRMAA. Example: -w n. |
none |
-jobResReq |
.jobResourceRequests |
Resource requests to pass to the job runner. On GridEngine this is multiple -l <req>. On LSF a single -R <req> is generated. |
memory reservations and limits on LSF and GridEngine |
-jobEnv |
.jobEnvironmentNames |
Predefined environment names to pass to the job runner. On GridEngine this is -pe <env>. On LSF this is -a <env>. |
none |
-memLimit |
.memoryLimit |
The memory limit for the job in gigabytes. Used to populate the variables residentLimit and residentRequest which can also be set separately. | default job runner memory limit |
-resMemLimit |
.residentLimit |
Limit for the resident memory in gigabytes. On GridEngine this is -l mem_free=<mem>. On LSF this is -R rusage[mem=<mem>]. |
memoryLimit * 1.2 |
-resMemReq |
.residentRequest |
Requested amount of resident memory in gigabytes. On GridEngine this is -l h_rss=<mem>. On LSF this is -R rusage[select=<mem>]. |
memoryLimit |
| Command Line Argument | Description | Default |
|---|---|---|
-emailHost <hostname> |
SMTP host name | localhost |
-emailPort <port> |
SMTP port | 25 |
-emailTLS |
If set uses TLS. | not set |
-emailSSL |
If set uses SSL. | not set |
-emailUser <username> |
If set along with emailPass or emailPassFile authenticates the email with this username. | not set |
-emailPassFile <file> |
If emailUser is also set authenticates the email with contents of the file. | not set |
-emailPass <password> |
If emailUser is also set authenticates the email with this password. NOT SECURE: Use emailPassFile instead! | not set |
script method, a QScript will add one or more CommandLineFunctions.@Input and @Output.@Input of CommandLineFunction A depends on the @Output of ComandLineFunction B, A will wait for B to finish before it starts running.Each CommandLineFunction must define the actual command line to run as follows.
class MyCommandLine extends CommandLineFunction {
def commandLine = "myScript.sh hello world"
}
If you're writing a one-off CommandLineFunction that is not destined for use by other QScripts, it's often easiest to construct the command line directly rather than through the API methods provided in the CommandLineFunction class.
For example:
def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)
If you're writing a CommandLineFunction that will become part of Queue and/or
will be used by other QScripts, however, our best practice recommendation is
to construct your command line only using the methods provided in the
CommandLineFunction class: required(), optional(), conditional(), and repeat()
The reason for this is that these methods automatically escape the values you
give them so that they'll be interpreted literally within the shell scripts
Queue generates to run your command, and they also manage whitespace separation of command-line tokens for you. This prevents (for example) a value like MQ > 10 from being interpreted as an output redirection by the shell, and avoids issues with values containing embedded spaces. The methods also give you the ability to turn escaping and/or whitespace separation off as needed. An example:
override def commandLine = super.commandLine +
required("eff") +
conditional(verbose, "-v") +
optional("-c", config) +
required("-i", "vcf") +
required("-o", "vcf") +
required(genomeVersion) +
required(inVcf) +
required(">", escape=false) + // This will be shell-interpreted as an output redirection
required(outVcf)
The CommandLineFunctions built into Queue, including the CommandLineFunctions
automatically generated for GATK Walkers, are all written using this pattern.
This means that when you configure a GATK Walker or one of the other built-in
CommandLineFunctions in a QScript, you can rely on all of your values being
safely escaped and taken literally when the commands are run, including values
containing characters that would normally be interpreted by the shell such as
MQ > 10.
Below is a brief overview of the API methods available to you in the CommandLineFunction class for safely constructing command lines:
required() Used for command-line arguments that are always present, e.g.:
required("-f", "filename") returns: " '-f' 'filename' "
required("-f", "filename", escape=false) returns: " -f filename "
required("java") returns: " 'java' "
required("INPUT=", "myBam.bam", spaceSeparated=false) returns: " 'INPUT=myBam.bam' "
optional() Used for command-line arguments that may or may not be present, e.g.:
optional("-f", myVar) behaves like required() if myVar has a value, but returns ""
if myVar is null/Nil/None
conditional() Used for command-line arguments that should only be included if some condition is true, e.g.:
conditional(verbose, "-v") returns " '-v' " if verbose is true, otherwise returns ""
repeat() Used for command-line arguments that are repeated multiple times on the command line, e.g.:
repeat("-f", List("file1", "file2", "file3")) returns: " '-f' 'file1' '-f' 'file2' '-f' 'file3' "
CommandLineFunction arguments use a similar syntax to arguments.
CommandLineFunction variables are annotated with @Input, @Output, or @Argument annotations.
So that Queue can track the input and output files of a command, CommandLineFunction @Input and @Output must be java.io.File objects.
class MyCommandLine extends CommandLineFunction {
@Input(doc="input file")
var inputFile: File = _
def commandLine = "myScript.sh -fileParam " + inputFile
}
CommandLineFunction variables can also provide indirect access to java.io.File inputs and outputs via the FileProvider trait.
class MyCommandLine extends CommandLineFunction {
@Input(doc="named input file")
var inputFile: ExampleFileProvider = _
def commandLine = "myScript.sh " + inputFile
}
// An example FileProvider that stores a 'name' with a 'file'.
class ExampleFileProvider(var name: String, var file: File) extends org.broadinstitute.sting.queue.function.FileProvider {
override def toString = " -fileName " + name + " -fileParam " + file
}
Optional files can be specified via required=false, and can use the CommandLineFunction.optional() utility method, as described above:
class MyCommandLine extends CommandLineFunction {
@Input(doc="input file", required=false)
var inputFile: File = _
// -fileParam will only be added if the QScript sets inputFile on this instance of MyCommandLine
def commandLine = required("myScript.sh") + optional("-fileParam", inputFile)
}
A List or Set of files can use the CommandLineFunction.repeat() utility method, as described above:
class MyCommandLine extends CommandLineFunction {
@Input(doc="input file")
var inputFile: List[File] = Nil // NOTE: Do not set List or Set variables to null!
// -fileParam will added as many times as the QScript adds the inputFile on this instance of MyCommandLine
def commandLine = required("myScript.sh") + repeat("-fileParam", inputFile)
}
A command line function can define other required arguments via @Argument.
class MyCommandLine extends CommandLineFunction {
@Argument(doc="message to display")
var veryImportantMessage: String = _
// If the QScript does not specify the required veryImportantMessage, the pipeline will not run.
def commandLine = required("myScript.sh") + required(veryImportantMessage)
}
class SamToolsIndex extends CommandLineFunction {
@Input(doc="bam to index") var bamFile: File = _
@Output(doc="bam index") var baiFile: File = _
def commandLine = "samtools index %s %s".format(bamFile, baiFile)
)
Or, using the CommandLineFunction API methods to construct the command line with automatic shell escaping:
class SamToolsIndex extends CommandLineFunction {
@Input(doc="bam to index") var bamFile: File = _
@Output(doc="bam index") var baiFile: File = _
def commandLine = required("samtools") + required("index") + required(bamFile) + required(baiFile)
)
The following scala methods need to be implemented for a new JobRunner. See the implementations of GridEngine and LSF for concrete full examples.
Start should to copy the settings from the CommandLineFunction into your job scheduler and invoke the command via sh <jobScript>. As an example of what needs to be implemented, here is the current contents of the start() method in MyCustomJobRunner which contains the pseudo code.
def start() {
// TODO: Copy settings from function to your job scheduler syntax.
val mySchedulerJob = new ...
// Set the display name to 4000 characters of the description (or whatever your max is)
mySchedulerJob.displayName = function.description.take(4000)
// Set the output file for stdout
mySchedulerJob.outputFile = function.jobOutputFile.getPath
// Set the current working directory
mySchedulerJob.workingDirectory = function.commandDirectory.getPath
// If the error file is set specify the separate output for stderr
if (function.jobErrorFile != null) {
mySchedulerJob.errFile = function.jobErrorFile.getPath
}
// If a project name is set specify the project name
if (function.jobProject != null) {
mySchedulerJob.projectName = function.jobProject
}
// If the job queue is set specify the job queue
if (function.jobQueue != null) {
mySchedulerJob.queue = function.jobQueue
}
// If the resident set size is requested pass on the memory request
if (residentRequestMB.isDefined) {
mySchedulerJob.jobMemoryRequest = "%dM".format(residentRequestMB.get.ceil.toInt)
}
// If the resident set size limit is defined specify the memory limit
if (residentLimitMB.isDefined) {
mySchedulerJob.jobMemoryLimit = "%dM".format(residentLimitMB.get.ceil.toInt)
}
// If the priority is set (user specified Int) specify the priority
if (function.jobPriority.isDefined) {
mySchedulerJob.jobPriority = function.jobPriority.get
}
// Instead of running the function.commandLine, run "sh <jobScript>"
mySchedulerJob.command = "sh " + jobScript
// Store the status so it can be returned in the status method.
myStatus = RunnerStatus.RUNNING
// Start the job and store the id so it can be killed in tryStop
myJobId = mySchedulerJob.start()
}
The status method should return one of the enum values from org.broadinstitute.sting.queue.engine.RunnerStatus:
RunnerStatus.RUNNINGRunnerStatus.DONERunnerStatus.FAILEDAdd any initialization code to the companion object static initializer. See the LSF or GridEngine implementations for how this is done.
The jobs that are still in RunnerStatus.RUNNING will be passed into this function. tryStop() should send these jobs the equivalent of a Ctrl-C or SIGTERM(15), or worst case a SIGKILL(9) if SIGTERM is not available.
Once there is a basic implementation, you can try out the Hello World example with -jobRunner MyJobRunner.
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S scala/qscript/examples/HelloWorld.scala -jobRunner MyJobRunner -run
If all goes well Queue should dispatch the job to your job scheduler and wait until the status returns RunningStatus.DONE and hello world should be echo'ed into the output file, possibly with other log messages.
See QFunction and Command Line Options for more info on Queue options.
Queue pipelines are Scala 2.8 files with a bit of syntactic sugar, called QScripts. Check out the following as references.
QScripts are easiest to develop using an Integrated Development Environment. See Queue with IntelliJ IDEA for our recommended settings.
The following is a basic outline of a QScript:
import org.broadinstitute.sting.queue.QScript
// List other imports here
// Define the overall QScript here.
class MyScript extends QScript {
// List script arguments here.
@Input(doc="My QScript inputs")
var scriptInput: File = _
// Create and add the functions in the script here.
def script = {
var myCL = new MyCommandLine
myCL.myInput = scriptInput // Example variable input
myCL.myOutput = new File("/path/to/output") // Example hardcoded output
add(myCL)
}
}
Imports can be any scala or java imports in scala syntax.
import java.io.File
import scala.util.Random
import org.favorite.my._
// etc.
To add a CommandLineFunction to a pipeline, a class must be defined that extends QScript.
The QScript must define a method script.
The QScript can define helper methods or variables.
The body of script should create and add Queue CommandlineFunctions.
class MyScript extends org.broadinstitute.sting.queue.QScript {
def script = add(new CommandLineFunction { def commandLine = "echo hello world" })
}
A QScript canbe set to read command line arguments by defining variables with @Input, @Output, or @Argument annotations.
A command line argument can be a primitive scalar, enum, File, or scala immutable Array, List, Set, or Option of a primitive, enum, or File.
QScript command line arguments can be marked as optional by setting required=false.
class MyScript extends org.broadinstitute.sting.queue.QScript { @Input(doc="example message to echo") var message: String = _ def script = add(new CommandLineFunction { def commandLine = "echo " + message }) }
See Pipelining the GATK using Queue for more information on the automatically generated Queue wrappers for GATK walkers.
After functions are defined they should be added to the QScript pipeline using add().
for (vcf <- vcfs) {
val ve = new VariantEval
ve.vcfFile = vcf
ve.evalFile = swapExt(vcf, "vcf", "eval")
add(ve)
}
Queue tracks dependencies between functions via variables annotated with @Input and @Output.
Queue will run functions based on the dependencies between them, not based on the order in which they are added in the script! So if the @Input of CommandLineFunction A depends on the @Output of ComandLineFunction B, A will wait for B to finish before it starts running.
See the main article Queue CommandLineFunctions for more information.
The latest version of the example files are available in the Sting git repository under public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/.
To print the list of arguments required by an existing QScript run with -help.
CommandLineFunction variables set correctly, run without -run.-run.The following is a "hello world" example that runs a single command line to echo hello world.
import org.broadinstitute.sting.queue.QScript
class HelloWorld extends QScript {
def script = {
add(new CommandLineFunction {
def commandLine = "echo hello world"
})
}
}
The above file is checked into the Sting git repository under HelloWorld.scala. After building Queue from source, the QScript can be run with the following command:
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run
It should produce output similar to:
INFO 16:23:27,825 QScriptManager - Compiling 1 QScript
INFO 16:23:31,289 QScriptManager - Compilation complete
INFO 16:23:34,631 HelpFormatter - ---------------------------------------------------------
INFO 16:23:34,631 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO 16:23:34,632 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run
INFO 16:23:34,632 HelpFormatter - Date/Time: 2011/01/14 16:23:34
INFO 16:23:34,632 HelpFormatter - ---------------------------------------------------------
INFO 16:23:34,632 HelpFormatter - ---------------------------------------------------------
INFO 16:23:34,634 QCommandLine - Scripting HelloWorld
INFO 16:23:34,651 QCommandLine - Added 1 functions
INFO 16:23:34,651 QGraph - Generating graph.
INFO 16:23:34,660 QGraph - Running jobs.
INFO 16:23:34,689 ShellJobRunner - Starting: echo hello world
INFO 16:23:34,689 ShellJobRunner - Output written to /Users/kshakir/src/Sting/Q-43031@bmef8-d8e-1.out
INFO 16:23:34,771 ShellJobRunner - Done: echo hello world
INFO 16:23:34,773 QGraph - Deleting intermediate files.
INFO 16:23:34,773 QCommandLine - Done
This example uses automatically generated Queue compatible wrappers for the GATK. See Pipelining the GATK using Queue for more info on authoring Queue support into walkers and using walkers in Queue.
The ExampleUnifiedGenotyper.scala for running the UnifiedGenotyper followed by VariantFiltration can be found in the examples folder.
To list the command line parameters, including the required parameters, run with -help.
java -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -help
The help output should appear similar to this:
INFO 10:26:08,491 QScriptManager - Compiling 1 QScript
INFO 10:26:11,926 QScriptManager - Compilation complete
---------------------------------------------------------
Program Name: org.broadinstitute.sting.queue.QCommandLine
---------------------------------------------------------
---------------------------------------------------------
usage: java -jar Queue.jar -S <script> [-run] [-jobRunner <job_runner>] [-bsub] [-status] [-retry <retry_failed>]
[-startFromScratch] [-keepIntermediates] [-statusTo <status_email_to>] [-statusFrom <status_email_from>] [-dot
<dot_graph>] [-expandedDot <expanded_dot_graph>] [-jobPrefix <job_name_prefix>] [-jobProject <job_project>] [-jobQueue
<job_queue>] [-jobPriority <job_priority>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
<temp_directory>] [-jobSGDir <job_scatter_gather_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>]
[-emailTLS] [-emailSSL] [-emailUser <emailUsername>] [-emailPassFile <emailPasswordFile>] [-emailPass <emailPassword>]
[-l <logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h] -R <referencefile> -I <bamfile> [-L <intervals>]
[-filter <filternames>] [-filterExpression <filterexpressions>]
-S,--script <script> QScript scala file
-run,--run_scripts Run QScripts. Without this flag set only
performs a dry run.
-jobRunner,--job_runner <job_runner> Use the specified job runner to dispatch
command line jobs
-bsub,--bsub Equivalent to -jobRunner Lsf706
-status,--status Get status of jobs for the qscript
-retry,--retry_failed <retry_failed> Retry the specified number of times after a
command fails. Defaults to no retries.
-startFromScratch,--start_from_scratch Runs all command line functions even if the
outputs were previously output successfully.
-keepIntermediates,--keep_intermediate_outputs After a successful run keep the outputs of
any Function marked as intermediate.
-statusTo,--status_email_to <status_email_to> Email address to send emails to upon
completion or on error.
-statusFrom,--status_email_from <status_email_from> Email address to send emails from upon
completion or on error.
-dot,--dot_graph <dot_graph> Outputs the queue graph to a .dot file. See:
http://en.wikipedia.org/wiki/DOT_language
-expandedDot,--expanded_dot_graph <expanded_dot_graph> Outputs the queue graph of scatter gather to
a .dot file. Otherwise overwrites the
dot_graph
-jobPrefix,--job_name_prefix <job_name_prefix> Default name prefix for compute farm jobs.
-jobProject,--job_project <job_project> Default project for compute farm jobs.
-jobQueue,--job_queue <job_queue> Default queue for compute farm jobs.
-jobPriority,--job_priority <job_priority> Default priority for jobs.
-memLimit,--default_memory_limit <default_memory_limit> Default memory limit for jobs, in gigabytes.
-runDir,--run_directory <run_directory> Root directory to run functions from.
-tempDir,--temp_directory <temp_directory> Temp directory to pass to functions.
-jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory> Default directory to place scatter gather
output for compute farm jobs.
-emailHost,--emailSmtpHost <emailSmtpHost> Email SMTP host. Defaults to localhost.
-emailPort,--emailSmtpPort <emailSmtpPort> Email SMTP port. Defaults to 465 for ssl,
otherwise 25.
-emailTLS,--emailUseTLS Email should use TLS. Defaults to false.
-emailSSL,--emailUseSSL Email should use SSL. Defaults to false.
-emailUser,--emailUsername <emailUsername> Email SMTP username. Defaults to none.
-emailPassFile,--emailPasswordFile <emailPasswordFile> Email SMTP password file. Defaults to none.
-emailPass,--emailPassword <emailPassword> Email SMTP password. Defaults to none. Not
secure! See emailPassFile.
-l,--logging_level <logging_level> Set the minimum level of logging, i.e.
setting INFO get's you INFO up to FATAL,
setting ERROR gets you ERROR and FATAL level
logging.
-log,--log_to_file <log_to_file> Set the logging location
-quiet,--quiet_output_mode Set the logging to quiet mode, no output to
stdout
-debug,--debug_mode Set the logging file string to include a lot
of debugging information (SLOW!)
-h,--help Generate this help message
Arguments for ExampleUnifiedGenotyper:
-R,--referencefile <referencefile> The reference file for the bam files.
-I,--bamfile <bamfile> Bam file to genotype.
-L,--intervals <intervals> An optional file with a list of intervals to proccess.
-filter,--filternames <filternames> A optional list of filter names.
-filterExpression,--filterexpressions <filterexpressions> An optional list of filter expressions.
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.commandline.MissingArgumentException:
Argument with name '--bamfile' (-I) is missing.
Argument with name '--referencefile' (-R) is missing.
at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:192)
at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:172)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:199)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:57)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 1.0.5504):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Argument with name '--bamfile' (-I) is missing.
##### ERROR Argument with name '--referencefile' (-R) is missing.
##### ERROR ------------------------------------------------------------------------------------------
To dry run the pipeline:
java \
-Djava.io.tmpdir=tmp \
-jar dist/Queue.jar \
-S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala \
-R human_b36_both.fasta \
-I pilot2_daughters.chr20.10k-11k.bam \
-L chr20.interval_list \
-filter StrandBias -filterExpression "SB>=0.10" \
-filter AlleleBalance -filterExpression "AB>=0.75" \
-filter QualByDepth -filterExpression "QD<5" \
-filter HomopolymerRun -filterExpression "HRun>=4"
The dry run output should appear similar to this:
INFO 10:45:00,354 QScriptManager - Compiling 1 QScript
INFO 10:45:04,855 QScriptManager - Compilation complete
INFO 10:45:05,058 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,059 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO 10:45:05,059 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -R human_b36_both.fasta -I pilot2_daughters.chr20.10k-11k.bam -L chr20.interval_list -filter StrandBias -filterExpression SB>=0.10 -filter AlleleBalance -filterExpression AB>=0.75 -filter QualByDepth -filterExpression QD<5 -filter HomopolymerRun -filterExpression HRun>=4
INFO 10:45:05,059 HelpFormatter - Date/Time: 2011/03/24 10:45:05
INFO 10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO 10:45:05,061 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO 10:45:05,150 QCommandLine - Added 4 functions
INFO 10:45:05,150 QGraph - Generating graph.
INFO 10:45:05,169 QGraph - Generating scatter gather jobs.
INFO 10:45:05,182 QGraph - Removing original jobs.
INFO 10:45:05,183 QGraph - Adding scatter gather jobs.
INFO 10:45:05,231 QGraph - Regenerating graph.
INFO 10:45:05,247 QGraph - -------
INFO 10:45:05,252 QGraph - Pending: IntervalScatterFunction /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals
INFO 10:45:05,253 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/scatter/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,254 QGraph - -------
INFO 10:45:05,279 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,279 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,279 QGraph - -------
INFO 10:45:05,283 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,283 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,283 QGraph - -------
INFO 10:45:05,287 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO 10:45:05,287 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,288 QGraph - -------
INFO 10:45:05,288 QGraph - Pending: SimpleTextGatherFunction /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,288 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-jobOutputFile/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,289 QGraph - -------
INFO 10:45:05,291 QGraph - Pending: java -Xmx1g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T CombineVariants -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:input0,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input1,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input2,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -priority input0,input1,input2 -assumeIdenticalSamples
INFO 10:45:05,291 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-out/Q-60018@bmef8-d8e-1.out
INFO 10:45:05,292 QGraph - -------
INFO 10:45:05,296 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.eval
INFO 10:45:05,296 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-2.out
INFO 10:45:05,296 QGraph - -------
INFO 10:45:05,299 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantFiltration -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:vcf,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -filter SB>=0.10 -filter AB>=0.75 -filter QD<5 -filter HRun>=4 -filterName StrandBias -filterName AlleleBalance -filterName QualByDepth -filterName HomopolymerRun
INFO 10:45:05,299 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-3.out
INFO 10:45:05,302 QGraph - -------
INFO 10:45:05,303 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.eval
INFO 10:45:05,303 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-4.out
INFO 10:45:05,304 QGraph - Dry run completed successfully!
INFO 10:45:05,304 QGraph - Re-run with "-run" to execute the functions.
INFO 10:45:05,304 QCommandLine - Done
QScript files often create multiple CommandLineFunctions with similar arguments. Use various scala tricks such as inner classes, traits / mixins, etc. to reuse variables.
A self type can be useful to distinguish between this. We use qscript as an alias for the QScript's this to distinguish from the this inside of inner classes or traits.
A trait mixin can be used to reuse functionality. The trait below is designed to copy values from the QScript and then is mixed into different instances of the functions.
See the following example:
class MyScript extends org.broadinstitute.sting.queue.QScript {
// Create an alias 'qscript' for 'MyScript.this'
qscript =>
// This is a script argument
@Argument(doc="message to display")
var message: String = _
// This is a script argument
@Argument(doc="number of times to display")
var count: Int = _
trait ReusableArguments extends MyCommandLineFunction {
// Whenever a function is created 'with' this trait, it will copy the message.
this.commandLineMessage = qscript.message
}
abstract class MyCommandLineFunction extends CommandLineFunction {
// This is a per command line argument
@Argument(doc="message to display")
var commandLineMessage: String = _
}
class MyEchoFunction extends MyCommandLineFunction {
def commandLine = "echo " + commandLineMessage
}
class MyAlsoEchoFunction extends MyCommandLineFunction {
def commandLine = "echo also " + commandLineMessage
}
def script = {
for (i <- 1 to count) {
val echo = new MyEchoFunction with ReusableArguments
val alsoEcho = new MyAlsoEchoFunction with ReusableArguments
add(echo, alsoEcho)
}
}
}
Thanks to contributions from the community, Queue contains a job runner compatible with Grid Engine 6.2u5.
As of July 2011 this is the currently known list of forked distributions of Sun's Grid Engine 6.2u5. As long as they are JDRMAA 1.0 source compatible with Grid Engine 6.2u5, the compiled Queue code should run against each of these distributions. However we have yet to receive confirmation that Queue works on any of these setups.
Our internal QScript integration tests run the same tests on both LSF 7.0.6 and a Grid Engine 6.2u5 cluster setup on older software released by Sun.
If you run into trouble, please let us know. If you would like to contribute additions or bug fixes please create a fork in our github repo where we can review and pull in the patch.
Try out the Hello World example with -jobRunner GridEngine.
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run
If all goes well Queue should dispatch the job to Grid Engine and wait until the status returns RunningStatus.DONE and "hello world should be echoed into the output file, possibly with other grid engine log messages.
See QFunction and Command Line Options for more info on Queue options.
If you run into an error with Queue submitting jobs to GridEngine, first try submitting the HelloWorld example with -memLimit 2:
java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run -memLimit 2
Then try the following GridEngine qsub commands. They are based on what Queue submits via the API when running the HelloWorld.scala example with and without memory reservations and limits:
qsub -w e -V -b y -N echo_hello_world \
-o test.out -wd $PWD -j y echo hello world
qsub -w e -V -b y -N echo_hello_world \
-o test.out -wd $PWD -j y \
-l mem_free=2048M -l h_rss=2458M echo hello world
One other thing to check is if there is a memory limit on your cluster. For example try submitting jobs with up to 16G.
qsub -w e -V -b y -N echo_hello_world \
-o test.out -wd $PWD -j y \
-l mem_free=4096M -l h_rss=4915M echo hello world
qsub -w e -V -b y -N echo_hello_world \
-o test.out -wd $PWD -j y \
-l mem_free=8192M -l h_rss=9830M echo hello world
qsub -w e -V -b y -N echo_hello_world \
-o test.out -wd $PWD -j y \
-l mem_free=16384M -l h_rss=19960M echo hello world
If the above tests pass and GridEngine will still not dispatch jobs submitted by Queue please report the issue to our support forum.
We have found it that Queue works best with IntelliJ IDEA Community Edition (free) or Ultimate Edition installed with the Scala Plugin enabled. Once you have downloaded IntelliJ IDEA, follow the instructions below to setup a Sting project with Queue and the Scala Plugin.
[[File:sting_project_libraries.png|300px|thumb|right|Project Libraries]] [[File:sting_module_sources.png|300px|thumb|right|Module Sources]] [[File:sting_module_dependencies.png|300px|thumb|right|Module Dependencies]] [[File:sting_module_scala_facet.png|300px|thumb|right|Scala Facet]]
Build Queue from source from the command line with ant queue, so that:
- The lib folder is initialized including the scala jars
- The queue-extensions for the GATK are generated to the build folder
File > SettingsIDE Settings in the left navigation list select PluginsAvailable tab under pluginsscala pluginNo. The correct scala libraries and compiler are already available in the lib folder from when you built Queue from the command lineRestart IntelliJ to load the scala pluginSelect the menu File... > New Project...
On the first page of "New Project" select
Create project from scratch
Click Next >
On the second page of "New Project" select
Set the project Name: to Sting
Set the Project files location: to the directory where you checked out the Sting git repository, for example /Users/jamie/src/Sting
Uncheck Create Module
Click Finish
The "Project Structure" window should open. If not open it via the menu File > Project Structure
Under the Project Settings in the left panel of "Project Structure" select Project
Make sure that Project SDK is set to a build of 1.6
If the Project SDK only lists <No SDK> add a New > JSDK pointing to /System/Library/Frameworks/JavaVM.framework/Versions/1.6
Under the Project Settings in the left panel of "Project Structure" select Libraries
Click the plus (+) to create a new Project Library
Set the Name: to Sting/lib
Select Attach Jar Directories
Select the path to lib folder under your SVN checkout
Under the Project Settings in the left panel of "Project Structure" select Modules
Click on the + box to add a new module
On the first page of "Add Module" select
Create module from scratch
Click Next \>
On the second page of "Add Module" select
Set the module Name: to Sting
Change the Content root to: <directory where you checked out the Sting SVN repository>
Click Next \>
On the third page
Uncheck all of the other source directories only leaving the java/src directory checked
Click Next \>
On fourth page click Finish
Back in the Project Structure window, under the Module 'Sting', on the Sources tab make sure the following folders are selected
Source Folders (in blue):
public/java/src
public/scala/src
private/java/src (Broad only)
private/scala/src (Broad only)
build/queue-extensions/srcTest Source Folders (in green):
public/java/test
public/scala/test
private/java/test (Broad only)
private/scala/test (Broad only)In the Project Structure window, under the Module 'Sting', on the Module Dependencies tab select
Click on the button Add...
Select the popup menu Library...
Select the Sting/lib library
Click Add selected
Refresh the Project Structure window so that it becomes aware of the Scala library in Sting/lib
Click the OK button
Reopen Project Structure via the menu File > Project Structure
In the second panel, click on the Sting module
Click on the plus (+) button above the second panel module
In the popup menu under Facet select Scala
On the right under Facet 'Scala' set the Compiler library: to Sting/lib
Click OK
File > SettingsProject Settings [Sting] in the left navigation list select Compiler then Annotation ProcessorsEnable annotation processingobtain processors from the classpath selectedOK[[File:queue_debug.png|300px|thumb|right|Queue Remote Debug]]
In IntelliJ 10 open the menu Run > Edit Configurations.
Click the gold [+] button at the upper left to open the Add New Configuration popup menu.
Select Remote from the popup menu.
With the new configuration selected on the left, change the configuration name from 'Unnamed' to something like 'Queue Remote Debug'.
Set the Host to the hostname of your server, and the Port to an unused port. You can try the default port of 5005.
From the Use the following command line arguments for running remote JVM, copy the argument string.
On the server, paste / modify your command line to run with the previously copied text, for example java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 Queue.jar -S myscript.scala ....
If you would like the program to wait for you to attach the debugger before running, change suspend=n to suspend=y.
Back in IntelliJ, click OK to save your changes.
Queue Remote Debug is selected via the configuration drop down or Run > Edit Configurations.Run > Debug.From Stack overflow:
Point IntelliJ to http://download.oracle.com/javase/6/docs/api/.
Go to File -> Project Structure -> SDKs -> Apple 1.x -> DocumentationPaths, and the click specify URL.
In IntelliJ, open File -> Project Structure. Click on "SDKs" under "Platform Settings". Add the following path under the Sourcepath tab: /Library/Java/JavaVirtualMachines/1.6.0_29-b11-402.jdk/Contents/Home/src.jar!/src
Reads can be filtered out of traversals by either pileup size through one of our downsampling methods or by read property through our read filtering mechanism. Both techniques and described below.
Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.
The GATK's default downsampler exhibits the following properties:
The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.
The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.
The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.
By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.
From the command line:
To disable the downsampler, specify -dt NONE.
To change the default coverage per-sample, specify the desired coverage to the -dcov option.
To modify the walker's default behavior:
by=<value>. Override the downsampling depth by changing the toCoverage=<value>.The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:
For each sample:
Select
While the number of existing reads + the number of incoming reads is greater than the target sample size:
Walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.
If we have n slots avaiable where n is >= 1, randomly select n of the incoming reads and add them to the pileup.
Otherwise, we have zero slots available. Choose the read from the existing pileup with the least alignment start. Throw it out and add one randomly selected read from the new pileup.
To selectively filter out reads before they reach your walker, implement one or multiple net.sf.picard.filter.SamRecordFilter, and attach it to your walker as follows:
@ReadFilters({Platform454Filter.class, ZeroMappingQualityReadFilter.class})
You can add command-line arguments for filters with the @Argument tag, just as with walkers. Here's an example of our new max read length filter:
public class MaxReadLengthFilter implements SamRecordFilter {
@Argument(fullName = "maxReadLength", shortName = "maxRead", doc="Discard reads with length greater than the specified value", required=false)
private int maxReadLength;
public boolean filterOut(SAMRecord read) { return read.getReadLength() > maxReadLength; }
}
Adding this filter to the top of your walker using the @ReadFilters attribute will add a new required command-line argument, maxReadLength, which will filter reads > maxReadLength before your walker is called.
Note that when you specify a read filter, you need to strip the Filter part of its name off! E.g. in the example above, if you want to use MaxReadLengthFilter, you need to call it like this:
--read_filter MaxReadLength
The --read-filter argument will allow you to apply whatever read filters you'd like to your dataset, before the reads reach your walker. To add the MaxReadLength filter above to PrintReads, you'd add the command line parameters:
--read_filter MaxReadLength --maxReadLength 76
You can add as many filters as you like by using multiple copies of the --read_filter parameter:
--read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead
The online course Functional Programming Principles in Scala taught by Martin Odersky, creator of Scala, and a Cheat Sheet for that course
Scala by Example (PDF) - also by Martin Odersky
Programming Scala - O'Reilly Media
Scala School - Twitter
A Concise Introduction To Scala
The LocusTraversal now supports passing walkers reads that have deletions spanning the current locus. This is useful in many situation where you want to calculate coverage, call variants and need to avoid calling variants where there are a lot of deletions, etc.
Currently, the system by default will not pass you deletion-spanning reads. In order to see them, you need to overload the function:
/**
* (conceptual static) method that states whether you want to see reads piling up at a locus
* that contain a deletion at the locus.
*
* ref: ATCTGA
* read1: ATCTGA
* read2: AT--GA
*
* Normally, the locus iterator only returns a list of read1 at this locus at position 3, but
* if this function returns true, then the system will return (read1, read2) with offsets
* of (3, -1). The -1 offset indicates a deletion in the read.
*
* @return false if you don't want to see deletions, or true if you do
*/
public boolean includeReadsWithDeletionAtLoci() { return true; }
in your walker. Now you will start seeing deletion-spanning reads in your walker. These reads are flagged with offsets of -1, so that you can:
for ( int i = 0; i < context.getReads().size(); i++ ) {
SAMRecord read = context.getReads().get(i);
int offset = context.getOffsets().get(i);
if ( offset == -1 )
nDeletionReads++;
else
nCleanReads++;
}
There are also two convenience functions in AlignmentContext to extract subsets of the reads with and without spanning deletions:
/**
* Returns only the reads in ac that do not contain spanning deletions of this locus
*
* @param ac
* @return
*/
public static AlignmentContext withoutSpanningDeletions( AlignmentContext ac );
/**
* Returns only the reads in ac that do contain spanning deletions of this locus
*
* @param ac
* @return
*/
public static AlignmentContext withSpanningDeletions( AlignmentContext ac );
The Tribble project was started as an effort to overhaul our reference-ordered data system; we had many different formats that were shoehorned into a common framework that didn't really work as intended. What we wanted was a common framework that allowed for searching of reference ordered data, regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, which was incorporated into the Tribble library.
Tribble provides a lightweight interface and API for querying features and creating indexes from feature files, while allowing iteration over know feature files that we're unable to create indexes for. The main entry point for external users is the BasicFeatureReader class. It takes in a codec, an index file, and a file containing the features to be processed. With an instance of a BasicFeatureReader, you can query for features that span a specific location, or get an iterator over all the records in the file.
For developers, there are two important classes to implement: the FeatureCodec, which decodes lines of text and produces features, and the feature class, which is your underlying record type.

For developers there are two classes that are important:
Feature
This is the genomicly oriented feature that represents the underlying data in the input file. For instance in the VCF format, this is the variant call including quality information, the reference base, and the alternate base. The required information to implement a feature is the chromosome name, the start position (one based), and the stop position. The start and stop position represent a closed, one-based interval. I.e. the first base in chromosome one would be chr1:1-1.
FeatureCodec
This class takes in a line of text (from an input source, whether it's a file, compressed file, or a http link), and produces the above feature.
To implement your new format into Tribble, you need to implement the two above classes (in an appropriately named subfolder in the Tribble check-out). The Feature object should know nothing about the file representation; it should represent the data as an in-memory object. The interface for a feature looks like:
public interface Feature {
/**
* Return the features reference sequence name, e.g chromosome or contig
*/
public String getChr();
/**
* Return the start position in 1-based coordinates (first base is 1)
*/
public int getStart();
/**
* Return the end position following 1-based fully closed conventions. The length of a feature is
* end - start + 1;
*/
public int getEnd();
}
And the interface for FeatureCodec:
/**
* the base interface for classes that read in features.
* @param <T> The feature type this codec reads
*/
public interface FeatureCodec<T extends Feature> {
/**
* Decode a line to obtain just its FeatureLoc for indexing -- contig, start, and stop.
*
* @param line the input line to decode
* @return Return the FeatureLoc encoded by the line, or null if the line does not represent a feature (e.g. is
* a comment)
*/
public Feature decodeLoc(String line);
/**
* Decode a line as a Feature.
*
* @param line the input line to decode
* @return Return the Feature encoded by the line, or null if the line does not represent a feature (e.g. is
* a comment)
*/
public T decode(String line);
/**
* This function returns the object the codec generates. This is allowed to be Feature in the case where
* conditionally different types are generated. Be as specific as you can though.
*
* This function is used by reflections based tools, so we can know the underlying type
*
* @return the feature type this codec generates.
*/
public Class<T> getFeatureType();
/** Read and return the header, or null if there is no header.
*
* @return header object
*/
public Object readHeader(LineReader reader);
}
The following formats are supported in Tribble:
Updating the revision of Tribble on the system is a relatively straightforward task if the following steps are taken.
Make sure that you've checked your changes into Tribble; unversioned changes will be problematic, so you should always check in so that you have a unique version number to identify your release.
Once you've checked-in Tribble, make sure to svn update, and then run svnversion. This will give you a version number which you can use to name your release. Let's say it was 82. **If it contains an M (i.e. 82M) this means your version isn't clean (you have modifications that are not checked in), don't proceed`.
from the Tribble main directory, run ant clean, then ant (make sure it runs successfully), and ant test (also make sure it completes successfully).
copy dist/tribble-0.1.jar (or whatever the internal Tribble version currently is) to your checkout of the GATK, as the file ./settings/repository/org.broad/tribble-<svnversion>.jar.
Copy the current XML file to the new name, i.e. from the base GATK trunk directory:
cp ./settings/repository/org.broad/tribble-
Edit the ./settings/repository/org.broad/tribble-<svnversion>.xml with the new correct version number and release date (here we rev 81 to 82).
This involves changing:
<ivy-module version="1.0">
<info organisation="org.broad" module="tribble" revision="81" status="integration" publication="20100526124200" />
</ivy-module>
To:
<ivy-module version="1.0">
<info organisation="org.broad" module="tribble" revision="82" status="integration" publication="20100528123456" />
</ivy-module>
Notice the change to the revision number and the publication date.
Notice the change to the revision number and the publication date.
Remove the old files svn remove ./settings/repository/org.broad/tribble-<current_svnversion>.*
Add the new files svn add ./settings/repository/org.broad/tribble-<new_svnversion>.*
Make sure you're using the new libraries to build: remove your ant cache: rm -r ~/.ant/cache.
Run an ant clean, and then make sure to test the build with ant integrationtest and ant test.
Any check-in from the base SVN directory will now rev the Tribble version.
DiffEngine is a summarizing difference engine that allows you to compare two structured files -- such as BAMs and VCFs -- to find what are the differences between them. This is primarily useful in regression testing or optimization, where you want to ensure that the differences are those that you expect and not any others.
The GATK contains a summarizing difference engine called DiffEngine that compares hierarchical data structures to emit:
A list of specific differences between the two data structures. This is similar to saying the value in field A in record 1 in file F differs from the value in field A in record 1 in file G.
A summarized list of differences ordered by frequency of the difference. This output is similar to saying field A differed in 50 records between files F and G.
The GATK contains a private walker called DiffObjects that allows you access to the DiffEngine capabilities on the command line. Simply provide the walker with the master and test files and it will emit summarized differences for you.
The DiffEngine system compares to two hierarchical data structures for specific differences in the values of named nodes. Suppose I have two trees:
Tree1=(A=1 B=(C=2 D=3))
Tree2=(A=1 B=(C=3 D=3 E=4))
Tree3=(A=1 B=(C=4 D=3 E=4))
where every node in the tree is named, or is a raw value (here all leaf values are integers). The DiffEngine traverses these data structures by name, identifies equivalent nodes by fully qualified names (Tree1.A is distinct from Tree2.A, and determines where their values are equal (Tree1.A=1, Tree2.A=1, so they are).
These itemized differences are listed as:
Tree1.B.C=2 != Tree2.B.C=3
Tree1.B.C=2 != Tree3.B.C=4
Tree2.B.C=3 != Tree3.B.C=4
Tree1.B.E=MISSING != Tree2.B.E=4
This conceptually very similar to the output of the unix command line tool diff. What's nice about DiffEngine though is that it computes similarity among the itemized differences and displays the count of differences names in the system. In the above example, the field C is not equal three times, while the missing E in Tree1 occurs only once. So the summary is:
*.B.C : 3
*.B.E : 1
where the * operator indicates that any named field matches. This output is sorted by counts, and provides an immediate picture of the commonly occurring differences between the files.
Below is a detailed example of two VCF fields that differ because of a bug in the AC, AF, and AN counting routines, detected by the integrationtest integration (more below). You can see that in the although there are many specific instances of these differences between the two files, the summarized differences provide an immediate picture that the AC, AF, and AN fields are the major causes of the differences.
[testng] path count
[testng] *.*.*.AC 6
[testng] *.*.*.AF 6
[testng] *.*.*.AN 6
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000598.AC 1
The DiffEngine codebase that supports these calculations is integrated into the integrationtest framework, so that when a test fails the system automatically summarizes the differences between the master MD5 file and the failing MD5 file, if it is an understood type. When failing you will see in the integration test logs not only the basic information, but the detailed DiffEngine output.
For example, in the output below I broke the GATK BAQ calculation and the integration test DiffEngine clearly identifies that all of the records differ in their BQ tag value in the two BAM files:
/humgen/1kg/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam -o /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp -L 1:10,000,000-10,100,000 -baq RECALCULATE -et NO_ET
[testng] WARN 22:59:22,875 TextFormattingUtils - Unable to load help text. Help output will be sparse.
[testng] WARN 22:59:22,875 TextFormattingUtils - Unable to load help text. Help output will be sparse.
[testng] ##### MD5 file is up to date: integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] Checking MD5 for /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp [calculated=e5147656858fc4a5f470177b94b1fc1b, expected=4ac691bde1ba1301a59857694fda6ae2]
[testng] ##### Test testPrintReadsRecalBAQ is going fail #####
[testng] ##### Path to expected file (MD5=4ac691bde1ba1301a59857694fda6ae2): integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest
[testng] ##### Path to calculated file (MD5=e5147656858fc4a5f470177b94b1fc1b): integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] ##### Diff command: diff integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] ##:GATKReport.v0.1 diffences : Summarized differences between the master and test files.
[testng] See http://www.broadinstitute.org/gsa/wiki/index.php/DiffObjectsWalker_and_SummarizedDifferences for more information
[testng] Difference NumberOfOccurrences
[testng] *.*.*.BQ 895
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:2:266:272:361.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:245:474:254.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:255:178:160.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:158:682:495.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:195:591:884.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:165:236:848.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:191:223:910.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:286:279:434.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:2:106:516:354.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:3:102:580:518.BQ 1
[testng]
[testng] Note that the above list is not comprehensive. At most 20 lines of output, and 10 specific differences will be listed. Please use -T DiffObjects -R public/testdata/exampleFASTA.fasta -m integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest -t integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest to explore the differences more freely
The system dynamically finds all classes that implement the following simple interface:
public interface DiffableReader {
@Ensures("result != null")
/**
* Return the name of this DiffableReader type. For example, the VCF reader returns 'VCF' and the
* bam reader 'BAM'
*/
public String getName();
@Ensures("result != null")
@Requires("file != null")
/**
* Read up to maxElementsToRead DiffElements from file, and return them.
*/
public DiffElement readFromFile(File file, int maxElementsToRead);
/**
* Return true if the file can be read into DiffElement objects with this reader. This should
* be uniquely true/false for all readers, as the system will use the first reader that can read the
* file. This routine should never throw an exception. The VCF reader, for example, looks at the
* first line of the file for the ##format=VCF4.1 header, and the BAM reader for the BAM_MAGIC value
* @param file
* @return
*/
@Requires("file != null")
public boolean canRead(File file);
See the VCF and BAMDiffableReaders for example implementations. If you extend this to a new object types both the DiffObjects walker and the integrationtest framework will automatically work with your new file type.
The GATKDocs are what we call "Technical Documentation" in the Guide section of this website. The HTML pages are generated automatically at build time from specific blocks of documentation in the source code.
The best place to look for example documentation for a GATK walker is GATKDocsExample walker in org.broadinstitute.sting.gatk.examples. This is available here.
Below is the reproduction of that file from August 11, 2011:
/**
* [Short one sentence description of this walker]
*
* <p>
* [Functionality of this walker]
* </p>
*
* <h2>Input</h2>
* <p>
* [Input description]
* </p>
*
* <h2>Output</h2>
* <p>
* [Output description]
* </p>
*
* <h2>Examples</h2>
* PRE-TAG
* java
* -jar GenomeAnalysisTK.jar
* -T $WalkerName
* PRE-TAG
*
* @category Walker Category
* @author Your Name
* @since Date created
*/
public class GATKDocsExample extends RodWalker<Integer, Integer> {
/**
* Put detailed documentation about the argument here. No need to duplicate the summary information
* in doc annotation field, as that will be added before this text in the documentation page.
*
* Notes:
* <ul>
* <li>This field can contain HTML as a normal javadoc</li>
* <li>Don't include information about the default value, as gatkdocs adds this automatically</li>
* <li>Try your best to describe in detail the behavior of the argument, as ultimately confusing
* docs here will just result in user posts on the forum</li>
* </ul>
*/
@Argument(fullName="full", shortName="short", doc="Brief summary of argument [~ 80 characters of text]", required=false)
private boolean myWalkerArgument = false;
public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) { return 0; }
public Integer reduceInit() { return 0; }
public Integer reduce(Integer value, Integer sum) { return value + sum; }
public void onTraversalDone(Integer result) { }
}
Note that the -B flag referred to below is deprecated; these docs need to be updated
The GATK allows you to process arbitrary numbers of reference metadata (RMD) files inside of walkers (previously we called this reference ordered data, or ROD). Common RMDs are things like dbSNP, VCF call files, and refseq annotations. The only real constraints on RMD files is that:
They must contain information necessary to provide contig and position data for each element to the GATK engine so it knows with what loci to associate the RMD element.
The file must be sorted with regard to the reference fasta file so that data can be accessed sequentially by the engine.
The file must have a Tribble RMD parsing class associated with the file type so that elements in the RMD file can be parsed by the engine.
Inside of the GATK the RMD system has the concept of RMD tracks, which associate an arbitrary string name with the data in the associated RMD file. For example, the VariantEval module uses the named track eval to get calls for evaluation, and dbsnp as the track containing the database of known variants.
RMD files are extremely easy to get into the GATK using the -B syntax:
java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:variant,VCF calls.vcf
In this example, the GATK will attempt to parse the file calls.vcf using the VCF parser and bind the VCF data to the RMD track named variant.
In general, you can provide as many RMD bindings to the GATK as you like:
java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:calls1,VCF calls1.vcf -B:calls2,VCF calls2.vcf
Works just as well. Some modules may require specifically named RMD tracks -- like eval above -- and some are happy to just assess all RMD tracks of a certain class and work with those -- like VariantsToVCF.
In this snippet from SNPDensityWalker, we grab the eval track as a VariantContext object, only for the variants that are of type SNP:
public Pair<VariantContext, GenomeLoc> map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
VariantContext vc = tracker.getVariantContext(ref, "eval", EnumSet.of(VariantContext.Type.SNP), context.getLocation(), false);
}
From VariantsToVCF we call the helper function tracker.getVariantContexts to look at all of the RMDs and convert what it can to VariantContext objects.
Allele refAllele = new Allele(Character.toString(ref.getBase()), true);
Collection<VariantContext> contexts = tracker.getVariantContexts(INPUT_RMD_NAME, ALLOWED_VARIANT_CONTEXT_TYPES, context.getLocation(), refAllele, true, false);
Here's a totally general code snippet from PileupWalker.java. This code, as you can see, iterates over all of the GATKFeature objects in the reference ordered data, converting each RMD to a string and capturing these strings in a list. It finally grabs the dbSNP binding specifically for a more detailed string conversion, and then binds them all up in a single string for display along with the read pileup.
private String getReferenceOrderedData( RefMetaDataTracker tracker ) {
ArrayList
DbSNPFeature dbsnp = tracker.lookup(DbSNPHelper.STANDARD_DBSNP_TRACK_NAME, DbSNPFeature.class);
if ( dbsnp != null)
rodString += DbSNPHelper.toMediumString(dbsnp);
if ( !rodString.equals("") )
rodString = "[ROD: " + rodString + "]";
return rodString;
}
Tracks of reference metadata are loaded using the Tribble infrastructure. Tracks are loaded using the feature codec and underlying type information. See the Tribble documentation for more information.
Tribble codecs that are in the classpath are automatically found; the GATK discovers all classes that implement the FeatureCodec class. Name resolution occurs using the -B type parameter, i.e. if the user specified:
-B:calls1,VCF calls1.vcf
The GATK looks for a FeatureCodec called VCFCodec.java to decode the record type. Alternately, if the user specified:
-B:calls1,MYAwesomeFormat calls1.maft
THe GATK would look for a codec called MYAwesomeFormatCodec.java. This look-up is not case sensitive, i.e. it will resolve MyAwEsOmEfOrMaT as well, though why you would want to write something so painfully ugly to read is beyond us.
In addition to testing walkers individually, you may want to also run integration tests for your QScript pipelines.
ant target pipelinetest and run under pipelinetestrun.PipelineTest to run under the ant target.PipelineTestSpec and then run it via PipelineTest.exec().When building up a pipeline test spec specify the following variables for your test.
| Variable | Type | Description |
|---|---|---|
args |
String |
The arguments to pass to the Queue test, ex: -S scala/qscript/examples/HelloWorld.scala |
jobQueue |
String |
Job Queue to run the test. Default is null which means use hour. |
fileMD5s |
Map[Path, MD5] |
Expected MD5 results for each file path. |
expectedException |
classOf[Exception] |
Expected exception from the test. |
The following example runs the ExampleCountLoci QScript on a small bam and verifies that the MD5 result is as expected.
It is checked into the Sting repository under scala/test/org/broadinstitute/sting/queue/pipeline/examples/ExampleCountLociPipelineTest.scala
package org.broadinstitute.sting.queue.pipeline.examples
import org.testng.annotations.Test
import org.broadinstitute.sting.queue.pipeline.{PipelineTest, PipelineTestSpec}
import org.broadinstitute.sting.BaseTest
class ExampleCountLociPipelineTest {
@Test
def testCountLoci {
val testOut = "count.out"
val spec = new PipelineTestSpec
spec.name = "countloci"
spec.args = Array(
" -S scala/qscript/examples/ExampleCountLoci.scala",
" -R " + BaseTest.hg18Reference,
" -I " + BaseTest.validationDataLocation + "small_bam_for_countloci.bam",
" -o " + testOut).mkString
spec.fileMD5s += testOut -> "67823e4722495eb10a5e4c42c267b3a6"
PipelineTest.executeTest(spec)
}
}
To test if the script is at least compiling with your arguments run ant pipelinetest specifying the name of your class to -Dsingle:
ant pipelinetest -Dsingle=ExampleCountLociPipelineTest
Sample output:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour
[testng] => countloci PASSED DRY RUN
[testng] PASSED: testCountLoci
As of July 2011 the pipeline tests run against LSF 7.0.6 and Grid Engine 6.2u5. To include these two packages in your environment use the hidden dotkit .combined_LSF_SGE.
reuse .combined_LSF_SGE
Once you are satisfied that the dry run has completed without error, to actually run the pipeline test run ant pipelinetestrun.
ant pipelinetestrun -Dsingle=ExampleCountLociPipelineTest
Sample output:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] Checking MD5 for pipelinetests/countloci/run/count.out [calculated=67823e4722495eb10a5e4c42c267b3a6, expected=67823e4722495eb10a5e4c42c267b3a6]
[testng] => countloci PASSED
[testng] PASSED: testCountLoci
If you don't know the MD5s yet you can run the command yourself on the command line and then MD5s the outputs yourself, or you can set the MD5s in your test to "" and run the pipeline.
When the MD5s are blank as in:
spec.fileMD5s += testOut -> ""
You run:
ant pipelinetest -Dsingle=ExampleCountLociPipelineTest -Dpipeline.run=run
And the output will look like:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is , equal? = false
[testng] => countloci PASSED
[testng] PASSED: testCountLoci
When a pipeline test fails due to an MD5 mismatch you can use the MD5 database to diff the results.
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### Updating MD5 file: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] Checking MD5 for pipelinetests/countloci/run/count.out [calculated=67823e4722495eb10a5e4c42c267b3a6, expected=67823e4722495eb10a5e0000deadbeef]
[testng] ##### Test countloci is going fail #####
[testng] ##### Path to expected file (MD5=67823e4722495eb10a5e0000deadbeef): integrationtests/67823e4722495eb10a5e0000deadbeef.integrationtest
[testng] ##### Path to calculated file (MD5=67823e4722495eb10a5e4c42c267b3a6): integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] ##### Diff command: diff integrationtests/67823e4722495eb10a5e0000deadbeef.integrationtest integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] FAILED: testCountLoci
[testng] java.lang.AssertionError: 1 of 1 MD5s did not match.
If you need to examine a number of MD5s which may have changed you can briefly shut off MD5 mismatch failures by setting parameterize = true.
spec.parameterize = true
spec.fileMD5s += testOut -> "67823e4722495eb10a5e4c42c267b3a6"
For this run:
ant pipelinetest -Dsingle=ExampleCountLociPipelineTest -Dpipeline.run=run
If there's a match the output will resemble:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is 67823e4722495eb10a5e4c42c267b3a6, equal? = true
[testng] => countloci PASSED
[testng] PASSED: testCountLoci
While for a mismatch it will look like this:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is 67823e4722495eb10a5e0000deadbeef, equal? = false
[testng] => countloci PASSED
[testng] PASSED: testCountLoci
Most GATK walkers are really too complex to easily test using the standard unit test framework. It's just not feasible to make artificial read piles and then extrapolate from simple tests passing whether the system as a whole is working correctly. However, we need some way to determine whether changes to the core of the GATK are altering the expected output of complex walkers like BaseRecalibrator or SingleSampleGenotyper. In additional to correctness, we want to make sure that the performance of key walkers isn't degrading over time, so that calling snps, cleaning indels, etc., isn't slowly creeping down over time. Since we are now using a bamboo server to automatically build and run unit tests (as well as measure their runtimes) we want to put as many good walker tests into the test framework so we capture performance metrics over time.
To make this testing process easier, we've created a WalkerTest framework that lets you invoke the GATK using command-line GATK commands in the JUnit system and test for changes in your output files by comparing the current ant build results to previous run via an MD5 sum. It's a bit coarse grain, but it will work to ensure that changes to key walkers are detected quickly by the system, and authors can either update the expected MD5s or go track down bugs.
The system is fairly straightforward to use. Ultimately we will end up with JUnit style tests in the unit testing structure. In the piece of code below, we have a piece of code that checks the MD5 of the SingleSampleGenotyper's GELI text output at LOD 3 and LOD 10.
package org.broadinstitute.sting.gatk.walkers.genotyper;
import org.broadinstitute.sting.WalkerTest;
import org.junit.Test;
import java.util.HashMap;
import java.util.Map;
import java.util.Arrays;
public class SingleSampleGenotyperTest extends WalkerTest {
@Test
public void testLOD() {
HashMap<Double, String> e = new HashMap<Double, String>();
e.put( 10.0, "e4c51dca6f1fa999f4399b7412829534" );
e.put( 3.0, "d804c24d49669235e3660e92e664ba1a" );
for ( Map.Entry<Double, String> entry : e.entrySet() ) {
WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
"-T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout %s --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod " + entry.getKey(), 1,
Arrays.asList(entry.getValue()));
executeTest("testLOD", spec);
}
}
}
The fundamental piece here is to inherit from WalkerTest. This gives you access to the executeTest() function that consumes a WalkerTestSpec:
public WalkerTestSpec(String args, int nOutputFiles, List<String> md5s)
The WalkerTestSpec takes regular, command-line style GATK arguments describing what you want to run, the number of output files the walker will generate, and your expected MD5s for each of these output files. The args string can contain %s String.format specifications, and for each of the nOutputFiles, the executeTest() function will (1) generate a tmp file for output and (2) call String.format on your args to fill in the tmp output files in your arguments string. For example, in the above argument string varout is followed by %s, so our single SingleSampleGenotyper output is the variant output file.
When you add a walkerTest inherited unit test to the GATK, and then build test, you'll see output that looks like:
[junit] WARN 13:29:50,068 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:29:50,068 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit]
[junit] WARN 13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit]
[junit] WARN 13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a]
[junit] WARN 13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a]
[junit] WARN 13:30:39,408 WalkerTest - => testLOD PASSED
[junit] WARN 13:30:39,408 WalkerTest - => testLOD PASSED
[junit] WARN 13:30:39,409 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:30:39,409 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit]
[junit] WARN 13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit]
[junit] WARN 13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534]
[junit] WARN 13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534]
[junit] WARN 13:31:30,213 WalkerTest - => testLOD PASSED
[junit] WARN 13:31:30,213 WalkerTest - => testLOD PASSED
[junit] WARN 13:31:30,214 SingleSampleGenotyperTest -
[junit] WARN 13:31:30,214 SingleSampleGenotyperTest -
We keep all of the permenant GATK testing data in:
/humgen/gsa-scr1/GATK_Data/Validation_Data/
A good set of data to use for walker testing is the CEU daughter data from 1000 Genomes:
gsa2 ~/dev/GenomeAnalysisTK/trunk > ls -ltr /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.bam /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.calls
-rw-rw-r--+ 1 depristo wga 51M 2009-09-03 07:56 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam
-rw-rw-r--+ 1 depristo wga 185K 2009-09-04 13:21 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.variants.geli.calls
-rw-rw-r--+ 1 depristo wga 164M 2009-09-04 13:22 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.genotypes.geli.calls
-rw-rw-r--+ 1 depristo wga 24M 2009-09-04 15:00 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SOLID.bam
-rw-rw-r--+ 1 depristo wga 12M 2009-09-04 15:01 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.454.bam
-rw-r--r--+ 1 depristo wga 91M 2009-09-04 15:02 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam
The tests depend on a variety of input files, that are generally constrained to three mount points on the internal Broad network:
*/seq/
*/humgen/1kg/
*/humgen/gsa-hpprojects/GATK/Data/Validation_Data/
To run the unit and integration tests you'll have to have access to these files. They may have different mount points on your machine (say, if you're running remotely over the VPN and have mounted the directories on your own machine).
Every file that generates an MD5 sum as part of the WalkerTest framework will be copied to <MD5>. integrationtest in the integrationtests subdirectory of the GATK trunk. This MD5 database of results enables you to easily examine the results of an integration test as well as compare the results of a test before/after a code change. For example, below is an example test for the UnifiedGenotyper that, due to a code change, where the output VCF differs from the VCF with the expected MD5 value in the test code itself. The test provides provides the path to the two results files as well as a diff command to compare expected to the observed MD5:
[junit] --------------------------------------------------------------------------------
[junit] Executing test testParameter[-genotype] with GATK arguments: -T UnifiedGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05997727998894311741.tmp -L 1:10,000,000-10,010,000 -genotype
[junit] ##### MD5 file is up to date: integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest
[junit] Checking MD5 for /tmp/walktest.tmp_param.05997727998894311741.tmp [calculated=ab20d4953b13c3fc3060d12c7c6fe29d, expected=0ac7ab893a3f550cb1b8c34f28baedf6]
[junit] ##### Test testParameter[-genotype] is going fail #####
[junit] ##### Path to expected file (MD5=0ac7ab893a3f550cb1b8c34f28baedf6): integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest
[junit] ##### Path to calculated file (MD5=ab20d4953b13c3fc3060d12c7c6fe29d): integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest
[junit] ##### Diff command: diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest
Examining the diff we see a few lines that have changed the DP count in the new code
> diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest | head
385,387c385,387
< 1 10000345 . A . 106.54 . AN=2;DP=33;Dels=0.00;MQ=89.17;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:25:-0.09,-7.57,-75.74:74.78
< 1 10000346 . A . 103.75 . AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:24:-0.07,-7.27,-76.00:71.99
< 1 10000347 . A . 109.79 . AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:26:-0.05,-7.85,-84.74:78.04
---
> 1 10000345 . A . 106.54 . AN=2;DP=32;Dels=0.00;MQ=89.50;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:25:-0.09,-7.57,-75.74:74.78
> 1 10000346 . A . 103.75 . AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:24:-0.07,-7.27,-76.00:71.99
> 1 10000347 . A . 109.79 . AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:26:-0.05,-7.85,-84.74:78
Whether this is the expected change is up to you to decide, but the system makes it as easy as possible to see the consequences of your code change.
The walker test framework supports an additional syntax for ensuring that a particular java Exception is thrown when a walker executes using a simple alternate version of the WalkerSpec object. Rather than specifying the MD5 of the result, you can provide a single subclass of Exception.class and the testing framework will ensure that when the walker runs an instance (class or subclass) of your expected exception is thrown. The system also flags if no exception is thrown.
For example, the following code tests that the GATK can detect and error out when incompatible VCF and FASTA files are given:
@Test public void fail8() { executeTest("hg18lex-v-b36", test(lexHG18, callsB36)); }
private WalkerTest.WalkerTestSpec test(String ref, String vcf) {
return new WalkerTest.WalkerTestSpec("-T VariantsToTable -M 10 -B:two,vcf "
+ vcf + " -F POS,CHROM -R "
+ ref + " -o %s",
1, UserException.IncompatibleSequenceDictionaries.class);
}
During the integration test this looks like:
[junit] Executing test hg18lex-v-b36 with GATK arguments: -T VariantsToTable -M 10 -B:two,vcf /humgen/gsa-hpprojects/GATK/data/Validation_Data/lowpass.N3.chr1.raw.vcf -F POS,CHROM -R /humgen/gsa-hpprojects/GATK/data/Validation_Data/lexFasta/lex.hg18.fasta -o /tmp/walktest.tmp_param.05541601616101756852.tmp -l WARN -et NO_ET
[junit] [junit] Wanted exception class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries, saw class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries
[junit] => hg18lex-v-b36 PASSED
Please do not put any extremely long tests in the regular ant build test target. We are currently splitting the system into fast and slow tests so that unit tests can be run in \< 3 minutes while saving a test target for long-running regression tests. More information on that will be posted.
An expected MG5 string of "" means don't check for equality between the calculated and expected MD5s. Useful if you are just writing a new test and don't know the true output.
Overload parameterize() { return true; } if you want the system to just run your calculations, not throw an error if your MD5s don't match, across all tests
If your tests all of a sudden stop giving equality MD5s, you can just (1) look at the .tmp output files directly or (2) grab the printed GATK command-line options and explore what is happening.
You can always run a GATK walker on the command line and then run md5sum on its output files to obtain, outside of the testing framework, the MD5 expected results.
Don't worry about the duplication of lines in the output ; it's just an annoyance of having two global loggers. Eventually we'll bug fix this away.
The core concept behind GATK tools is the walker, a class that implements the three core operations: filtering, mapping, and reducing.
filter Reduces the size of the dataset by applying a predicate.
map Applies a function to each individual element in a dataset, effectively mapping it to a new element.
reduce
Inductively combines the elements of a list. The base case is supplied by the reduceInit() function, and the inductive step is performed by the reduce() function.
Users of the GATK will provide a walker to run their analyses. The engine will produce a result by first filtering the dataset, running a map operation, and finally reducing the map operation to a single result.
To be usable by the GATK, the walker must satisfy the following properties:
It must subclass one of the basic walkers in the org.broadinstitute.sting.gatk.walkers package, usually ReadWalker or LociWalker.
Locus walkers present all the reads, reference bases, and reference-ordered data that overlap a single base in the reference. Locus walkers are best used for analyses that look at each locus independently, such as genotyping.
Read walkers present only one read at a time, as well as the reference bases and reference-ordered data that overlap that read.
Besides read walkers and locus walkers, the GATK features several other data access patterns, described here.
The compiled class or jar must be on the current classpath. The Java classpath can be controlled using either the $CLASSPATH environment variable or the JVM's -cp option.
The best way to get started with the GATK is to explore the walkers we've written. Here are the best walkers to look at when getting started:
CountLoci
It is the simplest locus walker in our codebase. It counts the number of loci walked over in a single run of the GATK.
$STING_HOME/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountLociWalker.java
CountReads
It is the simplest read walker in our codebase. It counts the number of reads walked over in a single run of the GATK.
$STING_HOME/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountReadsWalker.java
GATKPaperGenotyper
This is a more sophisticated example, taken from our recent paper in Genome Research (and using our ReadBackedPileup to select and filter reads). It is an extremely basic Bayesian genotyper that demonstrates how to output data to a stream and execute simple base operations.
$STING_HOME/java/src/org/broadinstitute/sting/gatk/examples/papergenotyper/GATKPaperGenotyper.java
Please note that the walker above is NOT the UnifiedGenotyper. While conceptually similar to the UnifiedGenotyper, the GATKPaperGenotyper uses a much simpler calling model for increased clarity and readability.
The GATK can absorb external walkers placed in a directory of your choosing. By default, that directory is called 'external' and is relative to the Sting git root directory (for example, ~/src/Sting/external). However, you can choose to place that directory anywhere on the filesystem and specify its complete path using the ant external.dir property.
ant -Dexternal.dir=~/src/external
The GATK will check each directory under the external directory (but not the external directory itself!) for small build scripts. These build scripts must contain at least a compile target that compiles your walker and places the resulting class file into the GATK's class file output directory. The following is a sample compile target:
<target name="compile" depends="init">
<javac srcdir="." destdir="${build.dir}" classpath="${gatk.classpath}" />
</target>
As a convenience, the build.dir ant property will be predefined to be the GATK's class file output directory and the gatk.classpath property will be predefined to be the GATK's core classpath. Once this structure is defined, any invocation of the ant build scripts will build the contents of the external directory as well as the GATK itself.
At the Broad, we typically put it somewhere like this:
/home/radon01/depristo/work/local/scala-2.7.5.final
Next, create a symlink from this directory to trunk/scala/installation:
ln -s /home/radon01/depristo/work/local/scala-2.7.5.final trunk/scala/installation
Right now the only way to get scala walkers into the GATK is by explicitly setting your CLASSPATH in your .my.cshrc file:
setenv CLASSPATH /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/FourBaseRecaller.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GenomeAnalysisTK.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/Playground.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/StingUtils.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/bcel-5.2.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/colt-1.2.0.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/google-collections-0.9.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/javassist-3.7.ga.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/junit-4.4.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/log4j-1.2.15.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/picard-1.02.63.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/picard-private-875.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/reflections-0.9.2.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/sam-1.01.63.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/simple-xml-2.0.4.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar
Really this needs to be manually updated whenever any of the libraries are updated. If you see this error:
Caused by: java.lang.RuntimeException: java.util.zip.ZipException: error in opening zip file
at org.reflections.util.VirtualFile.iterable(VirtualFile.java:79)
at org.reflections.util.VirtualFile$5.transform(VirtualFile.java:169)
at org.reflections.util.VirtualFile$5.transform(VirtualFile.java:167)
at org.reflections.util.FluentIterable$3.transform(FluentIterable.java:43)
at org.reflections.util.FluentIterable$3.transform(FluentIterable.java:41)
at org.reflections.util.FluentIterable$ForkIterator.computeNext(FluentIterable.java:81)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
at org.reflections.util.FluentIterable$FilterIterator.computeNext(FluentIterable.java:102)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
at org.reflections.util.FluentIterable$TransformIterator.computeNext(FluentIterable.java:124)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
at org.reflections.Reflections.scan(Reflections.java:69)
at org.reflections.Reflections.<init>(Reflections.java:47)
at org.broadinstitute.sting.utils.PackageUtils.<clinit>(PackageUtils.java:23)
It's because the libraries aren't updated. Basically just do an ls of your trunk/dist directory after the GATK has been build, make this your classpath as above, and tack on:
/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar
A command that almost works (but you'll need to replace the spaces with colons) is:
#setenv CLASSPATH $CLASSPATH `ls /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/*.jar` /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar
All of the Scala source code lives in scala/src, which you build using ant scala
There are already some example Scala walkers in scala/src, so doing a standard checkout, installing scala, settting up your environment, should allow you to run something like:
gsa2 ~/dev/GenomeAnalysisTK/trunk > ant scala
Buildfile: build.xml
init.scala:
scala:
[echo] Sting: Compiling scala!
[scalac] Compiling 2 source files to /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/scala/classes
[scalac] warning: there were deprecation warnings; re-run with -deprecation for details
[scalac] one warning found
[scalac] Compile suceeded with 1 warning; see the compiler output for details.
[delete] Deleting: /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar
[jar] Building jar: /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar
Until we can include Scala walkers along with the main GATK jar (avoiding the classpath issue too) you have to invoke your scala walkers using this syntax:
java -Xmx2048m org.broadinstitute.sting.gatk.CommandLineGATK -T BaseTransitionTableCalculator -R /broad/1KG/reference/human_b36_both.fasta -I /broad/1KG/DCC_merged/freeze5/NA12878.pilot2.SLX.bam -l INFO -L 1:1-100
Here, the BaseTransitionTableCalculator walker is written in Scala and being loaded into the system by the GATK walker manager. Otherwise everything looks like a normal GATK module.