Developer Zone
All you need to know to write your own tools on top of the GATK

Created 2015-07-06 18:30:35 | Updated 2015-07-07 15:32:56 | Tags: queue gatk build

Comments (0)

TL;DR: mvn -Ddisable.shadepackage verify


In addition to Queue's GATK-wrapper codegen, relatively slow scala compilation, etc. there's still a lot of legacy compatibility from our ant days in the Maven scripts. Our mvn verify behaves more like when one runs ant, and builds everything needed to bundle the GATK.

As of GATK 3.4, by default the build for the "protected" code generates jar files that contains every class needed for running, one for the GATK and one for Queue. This is done by the Maven shade plugin, and are each called the "package jar". But, there's a way to generate a jar file that only contains META-INF/MANIFEST.MF pointers to the dependency jar files, instead of zipping/shading them up. These are each the "executable jar", and FYI are always generated as it takes seconds, not minutes.

Instructions for fast compilation

While developing and recompiling Queue, disable the shaded jar with -Ddisable.shadepackage. Then run java -jar target/executable/Queue.jar ... If you need to transfer this jar to another machine / directory, you can't copy (or rsync) just the jar, you'll need the entire executable directory.

# Total expected time, on a local disk, with Queue:
#   ~5.0 min from clean
#   ~1.5 min per recompile
mvn -Ddisable.shadepackage verify

# always available
java -jar target/executable/Queue.jar --help

# not found when shade disabled
java -jar target/package/Queue.jar --help

If one is only developing for the GATK, skip Queue by adding -P\!queue also.

mvn -Ddisable.shadepackage -P\!queue verify

# always available
java -jar target/executable/GenomeAnalysisTK.jar --help

# not found when queue profile disabled
java -jar target/executable/Queue.jar --help

Created 2012-08-11 05:09:36 | Updated 2012-10-18 15:36:32 | Tags: developer readbackedpileup

Comments (0)

1. Introduction

The AlignmentContext and ReadBackedPileup work together to provide the read data associated with a given locus. This section details the tools the GATK provides for working with collections of aligned reads.

2. What are read backed pileups?

Read backed pileups are objects that contain all of the reads and their offsets that "pile up" at a locus on the genome. They are the basic input data for the GATK LocusWalkers, and underlie most of the locus-based analysis tools like the recalibrator and SNP caller. Unfortunately, there are many ways to view this data, and version one grew unwieldy trying to support all of these approaches. Version two of the ReadBackedPileup presents a consistent and clean interface for working pileup data, as well as supporting the iterable() interface to enable the convenient for ( PileupElement p : pileup ) for-each loop support.

3. How do I get a ReadBackedPileup and/or how do I create one?

The best way is simply to grab the pileup (the underlying representation of the locus data) from your AlignmentContext object in map:

public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context)
    ReadBackedPileup pileup = context.getPileup();

This aligns your calculations with the GATK core infrastructure, and avoids any unnecessary data copying from the engine to your walker.

If you are trying to create your own, the best constructor is:

public ReadBackedPileup(GenomeLoc loc, ArrayList<PileupElement> pileup )

requiring only a list, in order of read / offset in the pileup, of PileupElements.

From List and List

If you happen to have lists of SAMRecords and integer offsets into them you can construct a ReadBackedPileup this way:

public ReadBackedPileup(GenomeLoc loc, List<SAMRecord> reads, List<Integer> offsets )

4. What's the best way to use them?

Best way if you just need reads, bases and quals

for ( PileupElement p : pileup ) {
  System.out.printf("%c %c %d%n", p.getBase(), p.getSecondBase(), p.getQual());
  // you can get the read itself too using p.getRead()

This is the most efficient way to get data, and should be used whenever possible.

I just want a vector of bases and quals

You can use:

public byte[] getBases()
public byte[] getSecondaryBases()
public byte[] getQuals()

To get the bases and quals as a byte[] array, which is the underlying base representation in the SAM-JDK.

All I care about are counts of bases

Use the follow function to get counts of A, C, G, T in order:

public int[] getBaseCounts()

Which returns a int[4] vector with counts according to BaseUtils.simpleBaseToBaseIndex for each base.

Can I view just the reads for a given sample, read group, or any other arbitrary filter?

The GATK can very efficiently stratify pileups by sample, and less efficiently stratify by read group, strand, mapping quality, base quality, or any arbitrary filter function. The sample-specific functions can be called as follows:

pileup.getPileupForSample(String sampleName);

In addition to the rich set of filtering primitives built into the ReadBackedPileup, you can supply your own primitives by implmenting a PileupElementFilter:

public interface PileupElementFilter {
    public boolean allow(final PileupElement pileupElement);

and passing it to ReadBackedPileup's generic filter function:

public ReadBackedPileup getFilteredPileup(PileupElementFilter filter);

See the ReadBackedPileup's java documentation for a complete list of built-in filtering primitives.

Historical: StratifiedAlignmentContext

While ReadBackedPileup is the preferred mechanism for aligned reads, some walkers still use the StratifiedAlignmentContext to carve up selections of reads. If you find functions that you require in StratifiedAlignmentContext that seem to have no analog in ReadBackedPileup, please let us know and we'll port the required functions for you.

Created 2012-08-15 18:49:01 | Updated 2015-12-19 11:02:41 | Tags: dependencies

Comments (2)

Adding Third-party Dependencies

The GATK build system uses the Ivy dependency manager to make it easy for our users to add additional dependencies. Ivy can pull the latest jars and their dependencies from the Maven repository, making adding or updating a dependency as simple as adding a new line to the ivy.xml file.

If your tool is available in the maven repository, add a line to the ivy.xml file similar to the following:

<dependency org="junit" name="junit" rev="4.4" />

If you would like to add a dependency to a tool not available in the maven repository, please email

Updating SAM-JDK and Picard

Because we work so closely with the SAM-JDK/Picard team and are critically dependent on the code they produce, we have a special procedure for updating the SAM/Picard jars. Please use the following procedure to when updating sam-*.jar or picard-*.jar.

  • Download and build the latest versions of Picard public and Picard private (restricted to Broad Institute users) from their respective svns.

  • Get the latest svn versions for picard public and picard private by running the following commands:

    svn info $PICARD_PUBLIC_HOME | grep "Revision" svn info $PICARD_PRIVATE_HOME | grep "Revision"

Updating the Picard public jars

  • Rename the jars and xmls in $STING_HOME/settings/repository/net.sf to {picard|sam}-$PICARD_PUBLIC_MAJOR_VERSION.$PICARD_PUBLIC_MINOR_VERSION.PICARD_PUBLIC_SVN_REV.{jar|xml}

  • Update the jars in $STING_HOME/settings/repository/net.sf with their newer equivalents in $PICARD_PUBLIC_HOME/dist/picard_lib.

  • Update the xmls in $STING_HOME/settings/repository/net.sf with the appropriate version number ($PICARD_PUBLIC_MAJOR_VERSION.$PICARD_PUBLIC_MINOR_VERSION.$PICARD_PUBLIC_SVN_REV).

Updating the Picard private jar

  • Create the picard private jar with the following command:

    ant clean package -Dexecutable=PicardPrivate -Dpicard.dist.dir=${PICARD_PRIVATE_HOME}/dist

  • Rename picard-private-parts-*.jar in $STING_HOME/settings/repository/ to picard-private-parts-$PICARD_PRIVATE_SVN_REV.jar.

  • Update picard-private-parts-*.jar in $STING_HOME/settings/repository/ with the picard-private-parts.jar in $STING_HOME/dist/packages/picard-private-parts.

  • Update the xml in $STING_HOME/settings/repository/ to reflect the new revision and publication date.

Created 2012-08-14 20:25:10 | Updated 2012-10-18 15:27:03 | Tags: developer output

Comments (0)

1. Analysis output overview

In theory, any class implementing the OutputStream interface. In practice, three types of classes are commonly used: PrintStreams for plain text files, SAMFileWriters for BAM files, and VCFWriters for VCF files.

2. PrintStream

To declare a basic PrintStream for output, use the following declaration syntax:

public PrintStream out;

And use it just as you would any other PrintStream:

out.println("Hello, world!");

By default, @Output streams prepopulate fullName, shortName, required, and doc. required in this context means that the GATK will always fill in the contents of the out field for you. If the user specifies no --out command-line argument, the 'out' field will be prepopulated with a stream pointing to System.out.

If your walker outputs a custom format that requires more than simple concatenation by [Queue]() you should also implement a custom Gatherer.

3. SAMFileWriter

For some applications, you might need to manage their own SAM readers and writers directly from inside your walker. Current best practice for creating these Readers / Writers is to declare arguments of type SAMFileReader or SAMFileWriter as in the following example:

SAMFileWriter outputBamFile = null;

If you do not specify the full name and short name, the writer will provide system default names for these arguments. Creating a SAMFileWriter in this way will create the type of writer most commonly used by members of the GSA group at the Broad Institute -- it will use the same header as the input BAM and require presorted data. To change either of these attributes, use the StingSAMIterator interface instead:

StingSAMFileWriter outputBamFile = null;

and later, in initialize(), run one or both of the following methods:

outputBAMFile.writeHeader(customHeader); outputBAMFile.setPresorted(false);

You can change the header or presorted state until the first alignment is written to the file.

4. VCFWriter

VCFWriter outputs behave similarly to PrintStreams and SAMFileWriters. Declare a VCFWriter as follows:

@Output(doc="File to which variants should be written",required=true) protected VCFWriter writer = null;

5. Debugging Output

The walkers provide a protected logger instance. Users can adjust the debug level of the walkers using the -l command line option.

Turning on verbose logging can produce more output than is really necessary. To selectively turn on logging for a class or package, specify a property file from the command line as follows:

-Dlog4j.configuration=file:///<your development root>/Sting/java/config/

An example file is available in the java/config directory of the Git repository.

Created 2012-08-15 17:00:11 | Updated 2013-03-25 18:25:50 | Tags: gatkdocs developer walkers intermediate

Comments (0)

The GATK discovers walker documentation by reading it out of the Javadoc, Sun's design pattern for providing documentation for packages and classes. This page will provide an extremely brief explanation of how to write Javadoc; more information on writing javadoc comments can be found in Sun's documentation.

1. Adding walker and package descriptions to the help text

The GATK's build system uses the javadoc parser to extract the javadoc for classes and packages and embed the contents of that javadoc in the help system. If you add Javadoc to your package or walker, it will automatically appear in the help. The javadoc parser will pick up on 'standard' javadoc comments, such as the following, taken from PrintReadsWalker:

 * This walker prints out the input reads in SAM format.  Alternatively, the walker can write reads into a specified BAM file.

You can add javadoc to your package by creating a special file,, in the package directory. This file should consist of the javadoc for your package plus a package descriptor line. One such example follows:

 * Miscellaneous walkers (experimental)
package org.broadinstitute.sting.playground.gatk.walkers;

Additionally, the GATK provides two extra custom tags for overriding the information that ultimately makes it into the help.

  • Changes the name of the package as it appears in help. Note that the name of the walker cannot be changed as it is required to be passed verbatim to the -T argument.

  • @help.summary Changes the description which appears on the right-hand column of the help text. This is useful if you'd like to provide a more concise description of the walker that should appear in the help.

  • @help.description Changes the description which appears at the bottom of the help text with -T <your walker> --help is specified. This is useful if you'd like to present a more complete description of your walker.

2. Hiding experimental walkers (use sparingly, please!)

Walkers can be hidden from the documentation system by adding the @Hidden annotation to the top of each walker. @Hidden walkers can still be run from the command-line, but their documentation will not be visible to end users. Please use this functionality sparingly to avoid walkers with hidden command-line options that are required for production use.

3. Disabling building of help

Because the building of our help text is actually heavyweight and can dramatically increase compile time on some systems, we have a mechanism to disable help generation.

Compile with the following command:


to disable generation of help.

Created 2012-08-11 02:40:28 | Updated 2013-03-25 18:27:55 | Tags: faq queue developer scala

Comments (1)

1. What is Scala?

Scala is a combination of an object oriented framework and a functional programming language. For a good introduction see the free online book Programming Scala.

The following are extremely brief answers to frequently asked questions about Scala which often pop up when first viewing or editing QScripts. For more information on Scala there a multitude of resources available around the web including the Scala home page and the online Scala Doc.

2. Where do I learn more about Scala?

3. What is the difference between var and val?

var is a value you can later modify, while val is similar to final in Java.

4. What is the difference between Scala collections and Java collections? / Why do I get the error: type mismatch?

Because the GATK and Queue are a mix of Scala and Java sometimes you'll run into problems when you need a Scala collection and instead a Java collection is returned.

   MyQScript.scala:39: error: type mismatch;
     found   : java.util.List[java.lang.String]
     required: scala.List[String]
        val wrapped: List[String] = TextFormattingUtils.wordWrap(text, width)

Use the implicit definitions in JavaConversions to automatically convert the basic Java collections to and from Scala collections.

import collection.JavaConversions._

Scala has a very rich collections framework which you should take the time to enjoy. One of the first things you'll notice is that the default Scala collections are immutable, which means you should treat them as you would a String. When you want to 'modify' an immutable collection you need to capture the result of the operation, often assigning the result back to the original variable.

var str = "A"
str + "B"
println(str) // prints: A
str += "C"
println(str) // prints: AC

var set = Set("A")
set + "B"
println(set) // prints: Set(A)
set += "C"
println(set) // prints: Set(A, C)

5. How do I append to a list?

Use the :+ operator for a single value.

  var myList = List.empty[String]
  myList :+= "a"
  myList :+= "b"
  myList :+= "c"

Use ++ for appending a list.

  var myList = List.empty[String]
  myList ++= List("a", "b", "c")

6. How do I add to a set?

Use the + operator.

  var mySet = Set.empty[String]
  mySet += "a"
  mySet += "b"
  mySet += "c"

7. How do I add to a map?

Use the + and -> operators.

  var myMap = Map.empty[String,Int]
  myMap += "a" -> 1
  myMap += "b" -> 2
  myMap += "c" -> 3

8. What are Option, Some, and None?

Option is a Scala generic type that can either be some generic value or None. Queue often uses it to represent primitives that may be null.

  var myNullableInt1: Option[Int] = Some(1)
  var myNullableInt2: Option[Int] = None

9. What is _ / What is the underscore?

François Armand's slide deck is a good introduction:

To quote from his slides:

Give me a variable name but
- I don't care of what it is
- and/or
- don't want to pollute my namespace with it

10. How do I format a String?

Use the .format() method.

This Java snippet:

String formatted = String.format("%s %i", myString, myInt);

In Scala would be:

val formatted = "%s %i".format(myString, myInt)

11. Can I use Scala Enumerations as QScript @Arguments?

No. Currently Scala's Enumeration class does not interact with the Java reflection API in a way that could be used for Queue command line arguments. You can use Java enums if for example you are importing a Java based walker's enum type.

If/when we find a workaround for Queue we'll update this entry. In the meantime try using a String.

Created 2013-01-29 16:01:00 | Updated 2013-03-25 18:28:58 | Tags: coding-standards

Comments (0)


This document describes the current GATK coding standards for documentation and unit testing. The overall goal is that all functions be well documented, have unit tests, and conform to the coding conventions described in this guideline. It is primarily meant as an internal reference for team members, but we are making it public to provide an example of how we work. There are a few mentions of specific team member responsibilities and who to contact with questions; please just disregard those as they will not be applicable to you.

Coding conventions

General conventions

The Genome Analysis Toolkit generally follows Java coding standards and good practices, which can be viewed at Sun's site.

The original coding standard document for the GATK was developed in early 2009. It remains a reasonable starting point but may be superseded by statements on this page (available as a PDF).

Size of functions and functional programming style

Code in the GATK should be structured into clear, simple, and testable functions. Clear means that the function takes a limited number of arguments, most of which are values not modified, and in general should return newly allocated results, as opposed to directly modifying the input arguments (functional style). The max. size of functions should be approximately one screen's worth of real estate (no more than 80 lines), including inline comments. If you are writing functions that are much larger than this, you must refactor your code into modular components.

Code duplication

Do not duplicate code. If you are finding yourself wanting to make a copy of functionality, refactor the code you want to duplicate and enhance it. Duplicating code introduces bugs, makes the system harder to maintain, and will require more work since you will have a new function that must be tested, as opposed to expanding the tests on the existing functionality.


Functions must be documented following the javadoc conventions. That means that the first line of the comment should be a simple statement of the purpose of the function. Following that is an expanded description of the function, such as edge case conditions, requirements on the argument, state changes, etc. Finally comes the @param and @return fields, that should describe the meaning of each function argument, restrictions on the values allowed or returned. In general, the return field should be about types and ranges of those values, not the meaning of the result, as this should be in the body of the documentation.

Testing for valid inputs and contracts

The GATK uses Contracts for Java to help us enforce code quality during testing. See CoFoJa for more information. If you've never programmed with contracts, read their excellent description Adding contracts to a stack. Contracts are only enabled when we are testing the code (unittests and integration tests) and not during normal execution, so contracts can be reasonably expensive to compute. They are best used to enforce assumptions about the status of class variables and return results.

Contracts are tricky when it comes to input arguments. The best practice is simple:

  • Public functions with arguments should explicitly test those input arguments for good values with live java code (such as in the example below). Because the function is public, you don't know what the caller will be passing in, so you have to check and ensure quality.
  • Private functions with arguments should use contracts instead. Because the function is private, the author of the code controls use of the function, and the contracts enforce good use. But in principal the quality of the inputs should be assumed at runtime since only the author controlled calls to the function and input QC should have happened elsewhere

Below is an example private function that makes good use of input argument contracts:

 * Helper function to write out a IGV formatted line to out, at loc, with values
 * @param out a non-null PrintStream where we'll write our line
 * @param loc the location of values
 * @param featureName string name of this feature (see IGV format)
 * @param values the floating point values to associate with loc and feature name in out
        "out != null",
        "loc != null",
        "values.length > 0"
private void printIGVFormatRow(final PrintStream out, final GenomeLoc loc, final String featureName, final double ... values) {
    // note that start and stop are 0 based, but the stop is exclusive so we don't subtract 1
    out.printf("%s\t%d\t%d\t%s", loc.getContig(), loc.getStart() - 1, loc.getStop(), featureName);
    for ( final double value : values )
        out.print(String.format("\t%.3f", value));

Final variables

Final java fields cannot be reassigned once set. Nearly all variables you write should be final, unless they are obviously accumulator results or other things you actually want to modify. Nearly all of your function arguments should be final. Being final stops incorrect reassigns (a major bug source) as well as more clearly captures the flow of information through the code.

An example high-quality GATK function

 * Get the reference bases from referenceReader spanned by the extended location of this active region,
 * including additional padding bp on either side.  If this expanded region would exceed the boundaries
 * of the active region's contig, the returned result will be truncated to only include on-genome reference
 * bases
 * @param referenceReader the source of the reference genome bases
 * @param padding the padding, in BP, we want to add to either side of this active region extended region
 * @param genomeLoc a non-null genome loc indicating the base span of the bp we'd like to get the reference for
 * @return a non-null array of bytes holding the reference bases in referenceReader
@Ensures("result != null")
public byte[] getReference( final IndexedFastaSequenceFile referenceReader, final int padding, final GenomeLoc genomeLoc ) {
    if ( referenceReader == null ) throw new IllegalArgumentException("referenceReader cannot be null");
    if ( padding < 0 ) throw new IllegalArgumentException("padding must be a positive integer but got " + padding);
    if ( genomeLoc == null ) throw new IllegalArgumentException("genomeLoc cannot be null");
    if ( genomeLoc.size() == 0 ) throw new IllegalArgumentException("GenomeLoc must have size > 0 but got " + genomeLoc);

    final byte[] reference =  referenceReader.getSubsequenceAt( genomeLoc.getContig(),
            Math.max(1, genomeLoc.getStart() - padding),
            Math.min(referenceReader.getSequenceDictionary().getSequence(genomeLoc.getContig()).getSequenceLength(), genomeLoc.getStop() + padding) ).getBases();

    return reference;

Unit testing

All classes and methods in the GATK should have unit tests to ensure that they work properly, and to protect yourself and others who may want to extend, modify, enhance, or optimize you code. That GATK development team assumes that anything that isn't unit tested is broken. Perhaps right now they aren't broken, but with a team of 10 people they will become broken soon if you don't ensure they are correct going forward with unit tests.

Walkers are a particularly complex issue. UnitTesting the map and reduce results is very hard, and in my view largely unnecessary. That said, you should write your walkers and supporting classes in such a way that all of the complex data processing functions are separated from the map and reduce functions, and those should be unit tested properly.

Code coverage tells you how much of your class, at the statement or function level, has unit testing coverage. The GATK development standard is to reach something >80% method coverage (and ideally >80% statement coverage). The target is flexible as some methods are trivial (they just call into another method) so perhaps don't need coverage. At the statement level, you get deducted from 100% for branches that check for things that perhaps you don't care about, such as illegal arguments, so reaching 100% statement level coverage is unrealistic for most clases.

You can find out more information about generating code coverage results at Analyzing coverage with clover

We've created a unit testing example template in the GATK codebase that provides examples of creating core GATK data structures from scratch for unit testing. The code is in class ExampleToCopyUnitTest and can be viewed here in github directly ExampleToCopyUnitTest.

The GSA-Workflow

As of GATK 2.5, we are moving to a full code review process, which has the following benefits:

  • Reducing obvious coding bugs seen by other eyes
  • Reducing code duplication, as reviewers will be able to see duplicated code within the commit and potentially across the codebase
  • Ensure that coding quality standards are met (style and unit testing)
  • Setting a higher code quality standard for the master GATK unstable branch
  • Providing detailed coding feedback to newer developers, so they can improve their skills over time

The GSA workflow in words :

  • Create a new branch to start any work. Never work on master.
    • branch names have to follow the convention of [author prefix][feature name][JIRA ticket] (e.g. rp_pairhmm_GSA-232)
  • Make frequent commits.
  • Push frequently your branch to origin (branch -> branch)
  • When you're done -- rewrite your commit history to tell a compelling story Git Tools Rewriting History
  • Push your rewritten history, and request a code review.
    • The entire GSA team will review your code
    • Mark DePristo assigns the reviewer responsible for making the judgment based on all reviews and merge your code into master.
  • If your pull-request gets rejected, follow the comments from the team to fix it and repeat the workflow until you're ready to submit a new pull request.
  • If your pull-request is accepted, the reviewer will merge and remove your remote branch.

Example GSA workflow in the command line:

# starting a new feature
git checkout -b rp_pairhmm_GSA-332
git commit -av 
git push -u origin rp_pairhmm_GSA-332

# doing work on existing feature
git commit -av
git push

# ready to submit pull-request
git fetch origin
git rebase -i origin/master
git push -f

# after being accepted, delete your branch
git checkout master 
git pull
git branch -d rp_pairhmm_GSA-332
(the reviewer will remove your github branch)

Commit histories and rebasing

You must commit your code in small commit blocks with commit messages that follow the git best practices, which require the first line of the commit to summarize the purpose of the commit, followed by -- lines that describe the changes in more detail. For example, here's a recent commit that meets this criteria that added unit tests to the GenomeLocParser:

Refactoring and unit testing GenomeLocParser

-- Moved previously inner class to MRUCachingSAMSequenceDictionary, and unit test to 100% coverage
-- Fully document all functions in GenomeLocParser
-- Unit tests for things like parsePosition (shocking it wasn't tested!)
-- Removed function to specifically create GenomeLocs for VariantContexts.  The fact that you must incorporate END attributes in the context means that createGenomeLoc(Feature) works correctly
-- Depreciated (and moved functionality) of setStart, setStop, and incPos to GenomeLoc
-- Unit test coverage at like 80%, moving to 100% with next commit

Now, git encourages you to commit code often, and develop your code in whatever order or what is best for you. So it's common to end up with 20 commits, all with strange, brief commit messages, that you want to push into the master branch. It is not acceptable to push such changes. You need to use the git command rebase to reorganize your commit history so satisfy the small number of clear commits with clear messages.

Here is a recommended git workflow using rebase:

  1. Start every project by creating a new branch for it. From your master branch, type the following command (replacing "myBranch" with an appropriate name for the new branch):

    git checkout -b myBranch

    Note that you only include the -b when you're first creating the branch. After a branch is already created, you can switch to it by typing the checkout command without the -b: "git checkout myBranch"

    Also note that since you're always starting a new branch from master, you should keep your master branch up-to-date by occasionally doing a "git pull" while your master branch is checked out. You shouldn't do any actual work on your master branch, however.

  2. When you want to update your branch with the latest commits from the central repo, type this while your branch is checked out:

    git fetch && git rebase origin/master

    If there are conflicts while updating your branch, git will tell you what additional commands to use.

    If you need to combine or reorder your commits, add "-i" to the above command, like so:

    git fetch && git rebase -i origin/master

    If you want to edit your commits without also retrieving any new commits, omit the "git fetch" from the above command.

If you find the above commands cumbersome or hard to remember, create aliases for them using the following commands:

    git config --global alias.up '!git fetch && git rebase origin/master'
    git config --global alias.edit '!git fetch && git rebase -i origin/master'
    git config --global alias.done '!git push origin HEAD:master'

Then you can type "git up" to update your branch, "git edit" to combine/reorder commits, and "git done" to push your branch.

Here are more useful tutorials on how to use rebase:

If you need help with rebasing, talk to Mauricio or David and they will help you out.

Created 2013-02-13 18:06:06 | Updated 2014-06-24 09:45:27 | Tags: tribble picard sam-jdk variantcontext htsjdk

Comments (5)

The picard repository on github contains all picard public tools. Libraries live under the htsjdk, which includes the samtools-jdk, tribble, and variant packages (which includes VariantContext and associated classes as well as the VCF/BCF codecs).

If you just need to check out the sources and don't need to make any commits into the picard repository, the command is:

git clone

Then within the picard directory, clone the htsjdk.

cd picard
git clone

Then you can attach the picard/src/java and picard/htsjdk/src/java directories in IntelliJ as a source directory (File -> Project Structure -> Libraries -> Click the plus sign -> "Attach Files or Directories" in the latest IntelliJ).

To build picard and the htsjdk all at once, type ant from within the picard directory. To run tests, type ant test

If you do need to make commits into the picard repository, first you'll need to create a github account, fork picard or htsjdk, make your changes, and then issue a pull request. For more info on pull requests, see:

Created 2015-09-30 03:47:55 | Updated | Tags: maven

Comments (5)

GATK 3.x releases are not currently published to Central. But it is possible to install the GATK into your local repository, where Maven can then pick up the GATK as a dependency.

TL;DR Clone GATK 3.4, mvn install, then use the GATK as any other artifact.

The repository you should use depends on what is your goal.

If you want to build your own analysis tools on top of the GATK engine (not including the GATK analysis tools), with the option of distributing your project to others, you should clone the gatk repo.

If you want to integrate the full GATK into a project for in-house purposes (redistribution is not allowed under the licensing terms), in which your tools can call GATK tools directly internally, you should clone gatk-protected. This can be done by running the following code:

: 'GATK 3.4 code has known issues with the Java 8 compiler. Make sure you are using Java 7.'
java -version

: 'The entire GATK repo is relatively large. This only downloads 3.4.'
git clone -b 3.4 --depth 1 gatk-protected-3.4
cd gatk-protected-3.4

: 'Install the gatk into a the local ~/.m2/repository, where your project can then refer to the GATK.'
mvn install

: 'Build the "external example" as a demo of using the GATK as a library.'
cd public/external-example
mvn verify
java -jar target/external-example-1.0-SNAPSHOT.jar -T MyExampleWalker --help

After the GATK is installed, add this dependency to your Maven artifact, and all other GATK dependencies will be included as well.


One thing you might run into is that the GATK artifacts, and hence the external-example, transitively depend on artifacts that are also not in Central. They are instead committed under the path public/repo. Like in the public/external-example/pom.xml, your Maven project may need to include this directory as an additional repository. That being said mvn install should copy the artifacts to ~/.m2/repository for you. For example, after the install, you should have a directory ~/.m2/repository/com/google/code/cofoja/cofoja.

If you somehow need to add the GATK's public repo as a repository, use a repository element like the one below:

        <name>GATK Public Local Repository</name>

Since the GATK is not in Central, each developer will need to install the GATK 3.4 once. Or, as an advanced step, your may also want to explore publishing the GATK on one of your shared local systems. If you have a shared filesystem you'd like to use as a repository, publish the GATK 3.4 to the directory using mvn install -Dmaven.repo.local=/mount/path/to/shared/repo, and then add a repository element to your Maven project. If your team is using a Maven repository such as Artifactory or Nexus, we can't provide guidance for publishing "third party" artifacts. But it should theoretically be possible, with instructions hopefully available through either Maven or the repository manager's help forums.

Created 2013-06-26 19:01:08 | Updated 2013-06-26 19:05:10 | Tags: developer parallelism multithreading nct nt

Comments (0)

This document provides an overview of what are the steps required to make a walker multi-threadable using the -nct and the -nt arguments, which make use of the NanoSchedulable and TreeReducible interfaces, respectively.

NanoSchedulable / -nct

Providing -nct support requires that you certify that your walker's map() method is thread-safe -- eg., if any data structures are shared across map() calls, access to these must be properly synchronized. Once your map() method is thread-safe, you can implement the NanoSchedulable interface, an empty interface with no methods that just marks your walker as having a map() method that's safe to parallelize:

 * Root parallelism interface.  Walkers that implement this
 * declare that their map function is thread-safe and so multiple
 * map calls can be run in parallel in the same JVM instance.
public interface NanoSchedulable {

TreeReducible / -nt

Providing -nt support requires that both map() and reduce() be thread-safe, and you also need to implement the TreeReducible interface. Implementing TreeReducible requires you to write a treeReduce() method that tells the engine how to combine the results of multiple reduce() calls:

public interface TreeReducible<ReduceType> {
     * A composite, 'reduce of reduces' function.
     * @param lhs 'left-most' portion of data in the composite reduce.
     * @param rhs 'right-most' portion of data in the composite reduce.
     * @return The composite reduce type.
    ReduceType treeReduce(ReduceType lhs, ReduceType rhs);

This method differs from reduce() in that while reduce() adds the result of a single map() call onto a running total, treeReduce() takes the aggregated results from multiple map/reduce tasks that have been run in parallel and combines them. So, lhs and rhs might each represent the final result from several hundred map/reduce calls.

Example treeReduce() implementation from the UnifiedGenotyper:

public UGStatistics treeReduce(UGStatistics lhs, UGStatistics rhs) {
    lhs.nBasesCallable += rhs.nBasesCallable;
    lhs.nBasesCalledConfidently += rhs.nBasesCalledConfidently;
    lhs.nBasesVisited += rhs.nBasesVisited;
    lhs.nCallsMade += rhs.nCallsMade;
    return lhs;

Created 2012-08-11 06:12:39 | Updated 2012-10-18 15:34:05 | Tags: inputs developer

Comments (2)

1. Naming walkers

Users identify which GATK walker to run by specifying a walker name via the --analysis_type command-line argument. By default, the GATK will derive the walker name from a walker by taking the name of the walker class and removing packaging information from the start of the name, and removing the trailing text Walker from the end of the name, if it exists. For example, the GATK would, by default, assign the name PrintReads to the walker class org.broadinstitute.sting.gatk.walkers.PrintReadsWalker. To override the default walker name, annotate your walker class with @WalkerName("<my name>").

2. Requiring / allowing primary inputs

Walkers can flag exactly which primary data sources are allowed and required for a given walker. Reads, the reference, and reference-ordered data are currently considered primary data sources. Different traversal types have different default requirements for reads and reference, but currently no traversal types require reference-ordered data by default. You can add requirements to your walker with the @Requires / @Allows annotations as follows:


By default, all parameters are allowed unless you lock them down with the @Allows attribute. The command:


will only allow the reads and the reference. Any other primary data sources will cause the system to exit with an error.

Note that as of August 2011, the GATK no longer supports RMD the @Requires and @Allows syntax, as these have moved to the standard @Argument system.

3. Command-line argument tagging

Any command-line argument can be tagged with a comma-separated list of freeform tags.

The syntax for tags is as follows:

-<argument>:<tag1>,<tag2>,<tag3> <argument value>

for example:

-I:tumor <my tumor data>.bam
-eval,VCF yri.trio.chr1.vcf

There is currently no mechanism in the GATK to validate either the number of tags supplied or the content of those tags.

Tags can be accessed from within a walker by calling getToolkit().getTags(argumentValue), where argumentValue is the parsed contents of the command-line argument to inspect.


The GATK currently has comprehensive support for tags on two built-in argument types:

  • -I,--input_file <input_file>

    Input BAM files and BAM file lists can be tagged with any type. When a BAM file list is tagged, the tag is applied to each listed BAM file.

From within a walker, use the following code to access the supplied tag or tags:

  • Input RODs, e.g. `-V ' or '-eval '

    Tags are used to specify ROD name and ROD type. There is currently no support for adding additional tags. See the ROD system documentation for more details.

4. Adding additional command-line arguments

Users can create command-line arguments for walkers by creating public member variables annotated with @Argument in the walker. The @Argument annotation takes a number of differentparameters:

  • fullName

    The full name of this argument. Defaults to the toLowerCase()’d member name. When specifying fullName on the command line, prefix with a double dash (--).

  • shortName

    The alternate, short name for this argument. Defaults to the first letter of the member name. When specifying shortName on the command line, prefix with a single dash (-).

  • doc

    Documentation for this argument. Will appear in help output when a user either requests help with the –-help (-h) argument or when a user specifies an invalid set of arguments. Documentation is the only argument that is always required.

  • required

    Whether the argument is required when used with this walker. Default is required = true.

  • exclusiveOf

    Specifies that this argument is mutually exclusive of another argument in the same walker. Defaults to not mutually exclusive of any other arguments.

  • validation

    Specifies a regular expression used to validate the contents of the command-line argument. If the text provided by the user does not match this regex, the GATK will abort with an error.

By default, all command-line arguments will appear in the help system. To prevent new and debugging arguments from appearing in the help system, you can add the @Hidden tag below the @Argument annotation, hiding it from the help system but allowing users to supply it on the command-line. Please use this functionality sparingly to avoid walkers with hidden command-line options that are required for production use.

Passing Command-Line Arguments

Arguments can be passed to the walker using either the full name or the short name. If passing arguments using the full name, the syntax is −−<arg full name> <value>.

--myint 6

If passing arguments using the short name, the syntax is -<arg short name> <value>. Note that there is a space between the short name and the value:

-m 6

Boolean (class) and boolean (primitive) arguments are a special in that they require no argument. The presence of a boolean indicates true, and its absence indicates false. The following example sets a flag to true.


Supplemental command-line argument annotations

Two additional annotations can influence the behavior of command-line arguments.

  • @Hidden

    Adding this annotation to an @Argument tells the help system to avoid displaying any evidence that this argument exists. This can be used to add additional debugging arguments that aren't suitable for mass consumption.

  • @Deprecated

    Forces the GATK to throw an exception if this argument is supplied on the command-line. This can be used to supply extra documentation to the user as command-line parameters change for walkers that are in flux.


Create an required int parameter with full name –myint, short name -m. Pass this argument by adding –myint 6 or -m 6 to the command line.

import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
    @Argument(doc="my integer")
    public int myInt;

Create an optional float parameter with full name –myFloatingPointArgument, short name -m. Pass this argument by adding –myFloatingPointArgument 2.71 or -m 2.71.

import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
    @Argument(fullName="myFloatingPointArgument",doc="a floating point argument",required=false)
    public float myFloat;

The GATK will parse the argument differently depending on the type of the public member variable’s type. Many different argument types are supported, including primitives and their wrappers, arrays, typed and untyped collections, and any type with a String constructor. When the GATK cannot completely infer the type (such as in the case of untyped collections), it will assume that the argument is a String. GATK is aware of concrete implementations of some interfaces and abstract classes. If the argument’s member variable is of type List or Set, the GATK will fill the member variable with a concrete ArrayList or TreeSet, respectively. Maps are not currently supported.

5. Additional argument types: @Input, @Output

Besides @Argument, the GATK provides two additional types for command-line arguments: @Input and @Output. These two inputs are very similar to @Argument but act as flags to indicate dataflow to Queue, our pipeline management software.

  • The @Input tag indicates that the contents of the tagged field represents a file that will be read by the walker.

  • The @Output tag indicates that the contents of the tagged field represents a file that will be written by the walker, for consumption by downstream walkers.

We're still determining the best way to model walker dependencies in our pipeline. As we determine best practices, we'll post them here.

6. Getting access to Reference Ordered Data (RMD) with @Input and RodBinding

As of August 2011, the GATK now provides a clean mechanism for creating walker @Input arguments and using these arguments to access Reference Meta Data provided by the RefMetaDataTracker in the map() call. This mechanism is preferred to the old implicit string-based mechanism, which has been retired.

At a very high level, the new RodBindings provide a handle for a walker to obtain the Feature records from Tribble from a map() call, specific to a command line binding provided by the user. This can be as simple as a single ROD file argument|one-to-one binding between a command line argument and a track, or as complex as an argument argument accepting multiple command line arguments, each with a specific name. The RodBindings are generic and type specific, so you can require users to provide files that emit VariantContexts, BedTables, etc, or simply the root type Feature from Tribble. Critically, the RodBindings interact nicely with the GATKDocs system, so you can provide summary and detailed documentation for each RodBinding accepted by your walker.

A single ROD file argument

Suppose you have a walker that uses a single track of VariantContexts, such as SelectVariants, in its calculation. You declare a standard GATK-style @Input argument in the walker, of type RodBinding<VariantContext>:

@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;

This will require the user to provide a command line option --variant:vcf my.vcf to your walker. To get access to your variants, in the map() function you provide the variants variable to the tracker, as in:

Collection<VariantContext> vcs = tracker.getValues(variants, context.getLocation());

which returns all of the VariantContexts in variants that start at context.getLocation(). See RefMetaDataTracker in the javadocs to see the full range of getter routines.

Note that, as with regular tribble tracks, you have to provide the Tribble type of the file as a tag to the argument (:vcf). The system now checks up front that the corresponding Tribble codec produces Features that are type-compatible with the type of the RodBinding<T>.

RodBindings are generic

The RodBinding class is generic, parameterized as RodBinding<T extends Feature>. This T class describes the type of the Feature required by the walker. The best practice for declaring a RodBinding is to choose the most general Feature type that will allow your walker to work. For example, if all you really care about is whether a Feature overlaps the site in map, you can use Feature itself, which supports this, and will allow any Tribble type to be provided, using a RodBinding<Feature>. If you are manipulating VariantContexts, you should declare a RodBinding<VariantContext>, which will restrict automatically the user to providing Tribble types that can create a object consistent with the VariantContext class (a VariantContext itself or subclass).

Note that in multi-argument RodBindings, as List<RodBinding<T>> arg, the system will require all files provided here to provide an object of type T. So List<RodBinding<VariantContext>> arg requires all -arg command line arguments to bind to files that produce VariantContexts.

An argument that can be provided any number of times

The RodBinding system supports the standard @Argument style of allowing a vararg argument by wrapping it in a Java collection. For example, if you want to allow users to provide any number of comp tracks to your walker, simply declare a List<RodBinding<VariantContext>> field:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;

With this declaration, your walker will accept any number of -comp arguments, as in:

-comp:vcf 1.vcf -comp:vcf 2.vcf -comp:vcf 3.vcf

For such a command line, the comps field would be initialized to the List with three RodBindings, the first binding to 1.vcf, the second to 2.vcf and finally the third to 3.vcf.

Because this is a required argument, at least one -comp must be provided. Vararg @Input RodBindings can be optional, but you should follow proper varargs style to get the best results.

Proper handling of optional arguments

If you want to make a RodBinding optional, you first need to tell the @Input argument that its options (required=false):

@Input(fullName="discordance", required=false)
private RodBinding<VariantContext> discordanceTrack;

The GATK automagically sets this field to the value of the special static constructor method makeUnbound(Class c) to create a special "unbound" RodBinding here. This unbound object is type safe, can be safely passed to the RefMetaDataTracker get methods, and is guaranteed to never return any values. It also returns false when the isBound() method is called.

An example usage of isBound is to conditionally add header lines, as in:

if ( mask.isBound() ) {
    hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));

The case for vararg style RodBindings is slightly different. If you want, as above, users to be able to omit the -comp track entirely, you should initialize the value of the collection to the appropriate emptyList/emptySet in Collections:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=false)
public List<RodBinding<VariantContext>> comps = Collections.emptyList();

which will ensure that comps.isEmpty() is true when no -comp is provided.

Implicit and explicit names for RodBindings

@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;

By default, the getName() method in RodBinding returns the fullName of the @Input. This can be overloaded on the command-line by providing not one but two tags. The first tag is interpreted as the name for the binding, and the second as the type. As in:

-variant:vcf foo.vcf     => getName() == "variant"
-variant:foo,vcf foo.vcf => getName() == "foo"

This capability is useful when users need to provide more meaningful names for arguments, especially with variable arguments. For example, in VariantEval, there's a List<RodBinding<VariantContext>> comps, which may be dbsnp, hapmap, etc. This would be declared as:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;

where a normal command line usage would look like:

-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf

In the code, you might have a loop that looks like:

for ( final RodBinding comp : comps )
    for ( final VariantContext vc : tracker.getValues(comp, context.getLocation())
        out.printf("%s has a binding at %s%n", comp.getName(), getToolkit().getGenomeLocParser.createGenomeLoc(vc)); 

which would print out lines that included things like:

hapmap has a binding at 1:10
omni has a binding at 1:20
hapmap has a binding at 1:30
1000g has a binding at 1:30

This last example begs the question -- what happens with getName() when explicit names are not provided? The system goes out of its way to provide reasonable names for the variables:

  • The first occurrence is named for the fullName, where comp

  • Subsequent occurrences are postfixed with an integer count, starting at 2, so comp2, comp3, etc.

In the above example, the command line

-comp:vcf hapmap.vcf -comp:vcf omni.vcf -comp:vcf 1000g.vcf

would emit

comp has a binding at 1:10
comp2 has a binding at 1:20
comp has a binding at 1:30
comp3 has a binding at 1:30

Dynamic type resolution

The new RodBinding system supports a simple form of dynamic type resolution. If the input filetype can be specially associated with a single Tribble type (as VCF can), then you can omit the type entirely from the the command-line binding of a RodBinding!

So whereas a full command line would look like:

-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf

because these are VCF files they could technically be provided as:

-comp:hapmap hapmap.vcf -comp:omni omni.vcf -comp:1000g 1000g.vcf

If you don't care about naming, you can now say:

-comp hapmap.vcf -comp omni.vcf -comp 1000g.vcf

Best practice for documenting a RodBinding

The best practice is simple: use a javadoc style comment above the @Input annotation, with the standard first line summary and subsequent detailed discussion of the meaning of the argument. These are then picked up by the GATKdocs system and added to the standard walker docs, following the standard structure of GATKDocs @Argument docs. Below is a best practice documentation example from SelectVariants, which accepts a required variant track and two optional discordance and concordance tracks.

public class SelectVariants extends RodWalker<Integer, Integer> {
     * Variants from this file are sent through the filtering and modifying routines as directed
     * by the arguments to SelectVariants, and finally are emitted.
    @Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
    public RodBinding<VariantContext> variants;

     * A site is considered discordant if there exists some sample in eval that has a non-reference genotype
     * and either the site isn't present in this track, the sample isn't present in this track,
     * or the sample is called reference in this track.
    @Input(fullName="discordance", shortName = "disc", doc="Output variants that were not called in this Feature comparison track", required=false)
    private RodBinding<VariantContext> discordanceTrack;

     * A site is considered concordant if (1) we are not looking for specific samples and there is a variant called
     * in both variants and concordance tracks or (2) every sample present in eval is present in the concordance
     * track and they have the sample genotype call.
    @Input(fullName="concordance", shortName = "conc", doc="Output variants that were also called in this Feature comparison track", required=false)
    private RodBinding<VariantContext> concordanceTrack;

Note how much better the above version is compared to the old pre-Rodbinding syntax (code below). Below you have a required argument variant that doesn't show up as a formal argument in the GATK, different from the conceptually similar @Arguments for discordanceRodName and concordanceRodName, which have no type restrictions. There's no place to document the variant argument as well, so the system is effectively blind to this essential argument.

@Requires(value={},referenceMetaData=@RMD(name="variant", type=VariantContext.class))
public class SelectVariants extends RodWalker<Integer, Integer> {
    @Argument(fullName="discordance", shortName =  "disc", doc="Output variants that were not called on a ROD comparison track. Use -disc ROD_NAME", required=false)
    private String discordanceRodName = "";

    @Argument(fullName="concordance", shortName =  "conc", doc="Output variants that were also called on a ROD comparison track. Use -conc ROD_NAME", required=false)
    private String concordanceRodName = "";

RodBinding examples

In these examples, we have declared two RodBindings in the Walker

@Input(fullName="mask", doc="Input ROD mask", required=false)
public RodBinding<Feature> mask = RodBinding.makeUnbound(Feature.class);

@Input(fullName="comp", doc="Comparison track", required=false)
public List<RodBinding<VariantContext>> comps = new ArrayList<VariantContext>();
  • Get the first value

    Feature f = tracker.getFirstValue(mask)

  • Get all of the values at a location

    Collection<Feature> fs = tracker.getValues(mask, thisGenomeLoc)

  • Get all of the features here, regardless of track

    Collection<Feature> fs = tracker.getValues(Feature.class)

  • Determining if an optional RodBinding was provided . if ( mask.isBound() ) // writes out the mask header line, if one was provided hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));

    if ( ! comps.isEmpty() )"At least one comp was provided")

Example usage in Queue scripts

In QScripts when you need to tag a file use the class TaggedFile which extends from

Example in the QScript on the Command Line
Untagged VCF myWalker.variant = new File("my.vcf") -V my.vcf
Tagged VCF myWalker.variant = new TaggedFile("my.vcf", "VCF") -V:VCF my.vcf
Tagged VCF myWalker.variant = new TaggedFile("my.vcf", "VCF,custom=value") -V:VCF,custom=value my.vcf
Labeling a tumor myWalker.input_file :+= new TaggedFile("mytumor.bam", "tumor") -I:tumor mytumor.bam


No longer need to (or can) use @Requires and @Allows for ROD data. This system is now retired.

Created 2012-08-15 18:37:57 | Updated 2012-10-18 15:20:32 | Tags: developer walkers

Comments (0)

The primary goal of the GATK is to provide a suite of small data access patterns that can easily be parallelized and otherwise externally managed. As such, rather than asking walker authors how to iterate over a data stream, the GATK asks the user how data should be presented.

Locus walkers

Walk over the data set one location (single-base locus) at a time, presenting all overlapping reads, reference bases, and reference-ordered data.

1. Switching between covered and uncovered loci

The @By attribute can be used to control whether locus walkers see all loci or just covered loci. To switch between viewing all loci and covered loci, apply one of the following attributes:


2. Filtering defaults

By default, the following filters are automatically added to every locus walker.

  • Reads with nonsensical alignments
  • Unmapped reads
  • Non-primary alignments.
  • Duplicate reads.
  • Reads failing vendor quality checks.

ROD walkers

These walkers walk over the data set one location at a time, but only those locations covered by reference-ordered data. They are essentially a special case of locus walkers. ROD walkers are read-free traversals that include operate over Reference Ordered Data and the reference genome at sites where there is ROD information. They are geared for high-performance traversal of many RODs and the reference such as VariantEval and CallSetConcordance. Programmatically they are nearly identical to RefWalkers<M,T> traversals with the following few quirks.

1. Differences from a RefWalker

  • RODWalkers are only called at sites where there is at least one non-interval ROD bound. For example, if you are exploring dbSNP and some GELI call set, the map function of a RODWalker will be invoked at all sites where there is a dbSNP record or a GELI record.

  • Because of this skipping RODWalkers receive a context object where the number of reference skipped bases between map calls is provided:

    nSites += context.getSkippedBases() + 1; // the skipped bases plus the current location

In order to get the final count of skipped bases at the end of an interval (or chromosome) the map function is called one last time with null ReferenceContext and RefMetaDataTracker objects. The alignment context can be accessed to get the bases skipped between the last (and final) ROD and the end of the current interval.

2. Filtering defaults

ROD walkers inherit the same filters as locus walkers:

  • Reads with nonsensical alignments
  • Unmapped reads
  • Non-primary alignments.
  • Duplicate reads.
  • Reads failing vendor quality checks.

3. Example change over of VariantEval

Changing to a RODWalker is very easy -- here's the new top of VariantEval, changing the system to a RodWalker from its old RefWalker state:

//public class VariantEvalWalker extends RefWalker<Integer, Integer> {
public class VariantEvalWalker extends RodWalker<Integer, Integer> {

The map function must now capture the number of skipped bases and protect itself from the final interval map calls:

public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    nMappedSites += context.getSkippedBases();

    if ( ref == null ) { // we are seeing the last site
        return 0;


That's all there is to it!

4. Performance improvements

A ROD walker can be very efficient compared to a RefWalker in the situation where you have sparse RODs. Here is a comparison of ROD vs. Ref walker implementation of VariantEval:

RODWalker RefWalker
dbSNP and 1KG Pilot 2 SNP calls on chr1 164u (s) 768u (s)
Just 1KG Pilot 2 SNP calls on chr1 54u (s) 666u (s)

Read walkers

Read walkers walk over the data set one read at a time, presenting all overlapping reference bases and reference-ordered data.

Filtering defaults

By default, the following filters are automatically added to every read walker.

  • Reads with nonsensical alignments

Read pair walkers

Read pair walkers walk over a queryname-sorted BAM, presenting each mate and its pair. No reference bases or reference-ordered data are presented.

Filtering defaults

By default, the following filters are automatically added to every read pair walker.

  • Reads with nonsensical alignments

Duplicate walkers

Duplicate walkers walk over a read and all its marked duplicates. No reference bases or reference-ordered data are presented.

Filtering defaults

By default, the following filters are automatically added to every duplicate walker.

  • Reads with nonsensical alignments
  • Unmapped reads
  • Non-primary alignments

Created 2013-11-05 22:10:00 | Updated 2014-02-17 21:29:11 | Tags: private maven build broadies

Comments (2)


We're replacing Ant with Maven. To build, run mvn verify.


In the early days of the Genome Analysis Toolkit (GATK), the code base separated the GATK genomics engine from the core java utilities, encompassed in a wider project called Sting. During this time, the build tool of choice was the relatively flexible Java build tool Apache Ant, run via the command ant.

As our code base expanded to more and more packages, groups internal and external to GSA, and the Broad, have expressed interest in using portions of Sting/GATK as modules in larger projects. Unfortunately over time, many parts of the GATK and Sting intermingled, producing the current situation where developers finds it easier to copy the monolithic GATK instead, or individual java files, instead of using the tools as libraries.

The goal of this first stage is to split the parts of the monolithic Sting/GATK into easily recognizable sub artifacts. The tool used to accomplish this task is Apache Maven, also known as Maven, and run via the command mvn. Maven convention encourages developers to separate code, and accompanying resources, into a hierarchical structure of reusable artifacts. Maven attempts to avoid build configuration, preferring source repositories to lay out code in a conventional structure. When needed, a Maven configuration file called pom.xml specifies each artifact's build configuration, that one may think of as similar to an Ant build.xml.

The actual migration consisted of zero changes to the contents of existing Java source files, easing git merges and rebasing. The Java files from public, protected, and private have all moved into Maven conventional child artifacts, with each artifact containing a separate pom.xml.


Obtaining the GATK with Maven support

Clone the repository:

git clone ssh:// cd gsa-unstable

Building GATK and Queue

Clone the repository:

git clone ssh:// cd gsa-unstable

If running on a Broad server, add maven to your environment via the dotkit:

reuse Maven-3.0.3

Build all of Sting, including packaged versions of the GATK and Queue:

mvn verify

The packaged, executable jar files will be output to:

public/gatk-package/target/gatk-package-2.8-SNAPSHOT.jar public/queue-package/target/queue-package-2.8-SNAPSHOT.jar

Find equivalent maven commands for existing ant targets:

./ <target> <properties>

Example output:

$ ./ fasttest -Dsingle=GATKKeyUnitTest Equivalent maven command mvn verify -Dsting.committests.skipped=false -pl private/gatk-private -am -Dresource.bundle.skip=true -Dit.test=disabled -Dtest=GATKKeyUnitTest $

Running the GATK and Queue

To run the GATK, or copy the compiled jar, find the packaged jar under public/gatk-package/target


To run Queue, the jar is under the similarly named public/queue-package/target


NOTE: Unlike builds with Ant, you cannot execute the jar file built by the gatk-framework module. This is because maven does not include dependent artifacts in the target folder with assembled framework jar. Instead, use the packaged jars, listed above, that contain all the classes and resources needed to run the GATK, or Queue.

Excluding Queue

NOTE: If you make changes to sting-utils, gatk-framework, or any other dependencies and disable queue, you may accidentally end up breaking the full repository build without knowing.

The Queue build contributes a majority portion of the Sting project build time. To exclude Queue from your build, run maven with either (the already shell escaped) -P\!queue or -Ddisable.queue. Currently the latter property also disables the maven queue profile. This allows one other semi-permanent option to disable building Queue as part of the Sting repository. Configure your local Maven settings to always pass the property -Ddisable.queue by adding and activating a custom profile in your local ~/.m2/settings.xml

```$ cat ~/.m2/settings.xml

disable.queue true disable.queue


Using the GATK framework as a module

Currently the GATK artifacts are not available via any centralized repository. To build code using the GATK you must still have a checkout of the GATK source code, and install the artifacts to your local mvn repository (by default ~/.m2/repository). The installation copies the artifacts to your local repo such that it may be used by your external project. The checkout of the local repo provides several artifacts under public/repo that will be required for your project.

After updating to the latest version of the Sting source code, install the Sting artifacts via:

mvn install

After the GATK has been installed locally, in your own source repository, include the artifact gatk-framework as a library.

In Apache Maven add this dependency:




For Apache Ivy, you may need to specify ~/.m2/repository as a local repo. Once the local repository has been configured, ivy may find the dependency via:

<dependency org="org.broadinstitute.sting" name="gatk-framework" rev="2.8-SNAPSHOT" />

If you decide to also use Maven to build your project, your source code should go under the conventional directory src/main/java. The pom.xml contains any special configuration for your project. To see an example pom.xml and maven conventional project structure in:


Moved directories

If you have an old git branch that needs to be merged, you may need to know where to move files in order for your classes to now build with Maven. In general, most directories were moved with minimal or no changes.

Old directory New maven directory
private/java/src/ private/gatk-private/src/main/java/
private/R/scripts/ private/gatk-private/src/main/resources/
private/java/test/ private/gatk-private/src/test/java/
private/testdata/ private/gatk-private/src/test/resources/
private/scala/qscript/ private/queue-private/src/main/qscripts/
private/scala/src/ private/queue-private/src/main/scala/
private/scala/test/ private/queue-private/src/test/scala/
protected/java/src/ protected/gatk-protected/src/main/java/
protected/java/test/ protected/gatk-protected/src/test/java/
public/java/src/ public/gatk-framework/src/main/java/
public/java/test/ public/gatk-framework/src/test/java/
public/testdata/ public/gatk-framework/src/test/resources/
public/scala/qscript/ public/queue-framework/src/main/qscripts/
public/scala/src/ public/queue-framework/src/main/scala/
public/scala/test/ public/queue-framework/src/test/scala/

Future Directions

Further segregate source code

Currently, the artifacts sting-utils and the gatk-framework contain intertwined code bases. This leads to the current setup where all sting-utils code is actually found in the gatk-framework artifact, including generic utilities that could be used by other software modules. In the future, all elements under org.broadinstitute.sting.gatk will be located the gatk-framework, while all other packages under org.broadinstitut.sting will be evaluated and then separated under the gatk-framework or sting-utils artifacts.

Publishing artifacts

Tangentially related to segregating sting-utils and the gatk-framework, the current Sting and GATK artifacts are ineligible to be pushed to the Maven Central Repository, due to several other issues:

  • Need to provide trivial workflow for Picard, and possibly SnpEff, to submit to central
  • Missing meta files for the jars:
    • *-sources.jar
    • *-javadoc.jar
    • *.md5
    • *.sha1

NOTE: Artifact jars do NOT need to actually be in Central, and may be available as pom reference only, for example Oracle ojdbc.

In the near term, we could use a private repos based on Artifactory or Nexus (comparison). After more work of adding, cleaning up, or centrally publishing all the dependencies for Sting, we may then publish into the basic Central repo. Or, we could move to a social service like BinTray (think GitHub vs. Git).

Status Updates

February 13, 2014

Maven is now the default in gsa-unstable's master branch. For GATK developers, the git migration is effectively complete. Software engineers are resolving a few remaining issues related to the automated build and testing infrastructure, but the basic workflow for developers should now be up to date.

January 30, 2014

The migration to to maven has begun in the gsa-unstable repository on the ks_new_maven_build_system branch.

November 5, 2013

The maven port of the existing ant build resides in the gsa-qc repository.

This is an old branch of Sting/GATK, with the existing files relocated to Maven appropriate locations, pom.xml files added, along with basic resources to assist in artifact generation.

Created 2012-08-11 06:36:39 | Updated 2012-10-18 15:32:05 | Tags: developer output

Comments (0)

1. Introduction

When running either single-threaded or in shared-memory parallelism mode, the GATK guarantees that output written to an output stream created via the @Argument mechanism will ultimately be assembled in genomic order. In order to assemble the final output file, the GATK will write the output generated from each thread into a temporary output file, ultimately assembling the data via a central coordinating thread. There are three major elements in the GATK that facilitate this functionality:

  • Stub

    The front-end interface to the output management system. Stubs will be injected into the walker by the command-line argument system and relay information from the walker to the output management system. There will be one stub per invocation of the GATK.

  • Storage

    The back end interface, responsible for creating, writing and deleting temporary output files as well as merging their contents back into the primary output file. One Storage object will exist per shard processed in the GATK.

  • OutputTracker

    The dispatcher; ultimately connects the stub object's output creation request back to the most appropriate storage object to satisfy that request. One OutputTracker will exist per GATK invocation.

2. Basic Mechanism

Stubs are directly injected into the walker through the GATK's command-line argument parser as a go-between from walker to output management system. When a walker calls into the stub it's first responsibility is to call into the output tracker to retrieve an appropriate storage object. The behavior of the OutputTracker from this point forward depends mainly on the parallelization mode of this traversal of the GATK.

If the traversal is single-threaded:

  • the OutputTracker (implemented as DirectOutputTracker) will create the storage object if necessary and return it to the stub.

  • The stub will forward the request to the provided storage object.

  • At the end of the traversal, the microscheduler will request that the OutputTracker finalize and close the file.

If the traversal is multi-threaded using shared-memory parallelism:

  • The OutputTracker (implemented as ThreadLocalOutputTracker) will look for a storage object associated with this thread via a ThreadLocal.

  • If no such storage object exists, it will be created pointing to a temporary file.

  • At the end of each shard processed, that file will be closed and an OutputMergeTask will be created so that the shared-memory parallelism code can merge the output at its leisure.

  • The shared-memory parallelism code will merge when a fixed number of temporary files appear in the input queue. The constant used to determine this frequency is fixed at compile time (see HierarchicalMicroScheduler.MAX_OUTSTANDING_OUTPUT_MERGES).

3. Using output management

To use the output management system, declare a field in your walker of one of the existing core output types, coupled with either an @Argument or @Output annotation.

@Output(doc="Write output to this BAM filename instead of STDOUT")
SAMFileWriter out;

Currently supported output types are SAM/BAM (declare SAMFileWriter), VCF (declare VCFWriter), and any non-buffering stream extending from OutputStream.

4. Implementing a new output type

To create a new output type, three types must be implemented: Stub, Storage, and ArgumentTypeDescriptor.

To implement Stub

Create a new Stub class, extending/inheriting the core output type's interface and implementing the Stub interface.

OutputStreamStub extends OutputStream implements Stub<OutputStream> {

Implement a register function so that the engine can provide the stub with the session's OutputTracker.

public void register( OutputTracker outputTracker ) {
    this.outputTracker = outputTracker;

Add as fields any parameters necessary for the storage object to create temporary storage.

private final File targetFile;
public File getOutputFile() { return targetFile; }

Implement/override every method in the core output type's interface to pass along calls to the appropriate storage object via the OutputTracker.

public void write( byte[] b, int off, int len ) throws IOException {
    outputTracker.getStorage(this).write(b, off, len);

To implement Storage

Create a Storage class, again extending inheriting the core output type's interface and implementing the Storage interface.

public class OutputStreamStorage extends OutputStream implements Storage<OutputStream> {

Implement constructors that will accept just the Stub or Stub + alternate file path and create a repository for data, and a close function that will close that repository.

public OutputStreamStorage( OutputStreamStub stub ) { ... }
public OutputStreamStorage( OutputStreamStub stub, File file ) { ... }
public void close() { ... }

Implement a mergeInto function capable of reconstituting the file created by the constructor, dumping it back into the core output type's interface, and removing the source file.

public void mergeInto( OutputStream targetStream ) { ... }

Add a block to StorageFactory.createStorage() capable of creating the new storage object. TODO: use reflection to generate the storage classes.

    if(stub instanceof OutputStreamStub) {
        if( file != null )
            storage = new OutputStreamStorage((OutputStreamStub)stub,file);
            storage = new OutputStreamStorage((OutputStreamStub)stub);

To implement ArgumentTypeDescriptor

Create a new object inheriting from type ArgumentTypeDescriptor. Note that the ArgumentTypeDescriptor does NOT need to support the core output type's interface.

public class OutputStreamArgumentTypeDescriptor extends ArgumentTypeDescriptor {

Implement a truth function indicating which types this ArgumentTypeDescriptor can service.

 public boolean supports( Class type ) {
     return SAMFileWriter.class.equals(type) || StingSAMFileWriter.class.equals(type);

Implement a parse function that constructs the new Stub object. The function should register this type as an output by caling engine.addOutput(stub).

 public Object parse( ParsingEngine parsingEngine, ArgumentSource source, Type type, ArgumentMatches matches )  {
     OutputStreamStub stub = new OutputStreamStub(new File(fileName));
     return stub;

Add a creator for this new ArgumentTypeDescriptor in CommandLineExecutable.getArgumentTypeDescriptors().

 protected Collection<ArgumentTypeDescriptor> getArgumentTypeDescriptors() {
     return Arrays.asList( new VCFWriterArgumentTypeDescriptor(engine,System.out,argumentSources),
                           new SAMFileWriterArgumentTypeDescriptor(engine,System.out),
                           new OutputStreamArgumentTypeDescriptor(engine,System.out) );

After creating these three objects, the new output type should be ready for usage as described above.

5. Outstanding issues

  • Only non-buffering iterators are currently supported by the GATK. Of particular note, PrintWriter will appear to drop records if created by the command-line argument system; use PrintStream instead.

  • For efficiency, the GATK does not reduce output files together following the tree pattern used by shared-memory parallelism; output merges happen via an independent queue. Because of this, output merges happening during a treeReduce may not behave correctly.

Created 2012-08-15 17:15:34 | Updated 2012-10-18 15:24:35 | Tags: developer walkers

Comments (4)

1. Introduction

The LocusTraversal now supports passing walkers reads that have deletions spanning the current locus. This is useful in many situation where you want to calculate coverage, call variants and need to avoid calling variants where there are a lot of deletions, etc.

Currently, the system by default will not pass you deletion-spanning reads. In order to see them, you need to overload the function:

 * (conceptual static) method that states whether you want to see reads piling up at a locus
 * that contain a deletion at the locus.
 * ref:   ATCTGA
 * read1: ATCTGA
 * read2: AT--GA
 * Normally, the locus iterator only returns a list of read1 at this locus at position 3, but
 * if this function returns true, then the system will return (read1, read2) with offsets
 * of (3, -1).  The -1 offset indicates a deletion in the read.
 * @return false if you don't want to see deletions, or true if you do
public boolean includeReadsWithDeletionAtLoci() { return true; }

in your walker. Now you will start seeing deletion-spanning reads in your walker. These reads are flagged with offsets of -1, so that you can:

    for ( int i = 0; i < context.getReads().size(); i++ ) {
        SAMRecord read = context.getReads().get(i);
        int offset = context.getOffsets().get(i);

       if ( offset == -1 ) 

There are also two convenience functions in AlignmentContext to extract subsets of the reads with and without spanning deletions:

 * Returns only the reads in ac that do not contain spanning deletions of this locus
 * @param ac
 * @return
public static AlignmentContext withoutSpanningDeletions( AlignmentContext ac );

 * Returns only the reads in ac that do contain spanning deletions of this locus
 * @param ac
 * @return
public static AlignmentContext withSpanningDeletions( AlignmentContext ac );

Created 2014-04-04 18:53:24 | Updated 2015-07-28 15:53:10 | Tags: intellij maven build

Comments (28)


Since GATK 3.0, we use Apache Maven (instead of Ant) as our build system, and IntelliJ as our IDE (Integrated Development Environment). This document describes how to get set up to use Maven as well as how to create an IntelliJ project around our Maven project structure.

Before you start

  • Ensure that you have git clones of our repositories on your machine. See this document for details on obtaining the GATK source code from our Git repos.

Setting up Maven

  1. Check whether you can run mvn --version on your machine. If you can't, install Maven from here.

  2. Ensure that the JAVA_HOME environment variable is properly set. If it's not, add the appropriate line to your shell's startup file:

    for tcsh:

    setenv JAVA_HOME  \`/usr/libexec/java_home\`

    for bash:

    export JAVA_HOME=\`/usr/libexec/java_home\`

Note that the commands above use backticks, not single quotes.

Basic Maven usage

  1. To compile everything, type:

    mvn verify
  2. To compile the GATK but not Queue (much faster!), the command is:

    mvn verify -P\!queue

    Note that the ! needs to be escaped with a backslash to avoid interpretation by the shell.

  3. To obtain a clean working directory, type:

    mvn clean
  4. If you're used to using ant to compile the GATK, you should be able to feed your old ant commands to the script in the root directory. For example:

    ./ test -Dsingle=MyTestClass

Setting up IntelliJ

  1. Run mvn test-compile in your git clone's root directory.

  2. Open IntelliJ

  3. File -> import project, select your git clone directory, then click "ok"

  4. On the next screen, select "import project from external model", then "maven", then click "next"

  5. Click "next" on the next screen without changing any defaults -- in particular:

    • DON'T check "Import maven projects automatically"
    • DON'T check "Create module groups for multi-module maven projects"
  6. On the "Select Profiles" screen, make sure private and protected ARE checked, then click "next".

  7. On the next screen, the "gatk-aggregator" project should already be checked for you -- if not, then check it. Click "next".

  8. Select the 1.7 SDK, then click "next".

  9. Select an appropriate project name (can be anything), then click "next" (or "finish", depending on your version of IntelliJ).

  10. Click "Finish" to create the new IntelliJ project.

  11. That's it! Due to Maven magic, everything else will be set up for you automatically, including modules, libraries, Scala facets, etc.

  12. You will see a popup "Maven projects need to be imported" on every IntelliJ startup. You should click import unless you're working on the actual pom files that make up the build system.

Created 2014-05-16 16:31:03 | Updated 2014-09-29 17:57:56 | Tags: maven build sting

Comments (2)


The GATK 3.2 source code uses new java package names, directory paths, and executable jars. Post GATK 3.2, any patches submitted via pull requests should also include classes moved to the appropriate artifact.

Note that the document includes references to the private module, which is part of our internal development codebase but is not available to the general public.


A long term ideal of the GATK is to separate out reusable parts and eventually make them available as compiled libraries via centralized binary repositories. Ahead of publishing a number of steps must be completed. One of the larger steps has been completed for GATK 3.2, where the code base rebranded all references of Sting to GATK.

Currently implemented changes include:

  • Java/Scala package names changed from org.broadinstitute.sting to org.broadinstitute.gatk
  • Renamed Maven artifacts including new directories

As of May 16, 2014, remaining TODOs ahead of publishing to central include:

  • Uploading all transitive GATK dependencies to central repositories
  • Separating a bit more of the intertwined utility, engine, and tool classes

Now that the new package names and Maven artifacts are available, any pull request should include ensuring that updated classes are also moved into the correct GATK Maven artifact. While there are a significant number of classes, cleaning up as we go along will allow the larger task to be completed in a distributed fashion.

The full lists of new Maven artifacts and renamed packages are below under [Renamed Artifact Directories]. For those developers in the middle of a git rebase around commits before and after 3.2, here is an abridged mapping of renamed directories for those trying to locate files:

Old Maven Artifact New Maven Artifact
public/sting-root public/gatk-root
public/sting-utils public/gatk-utils
public/gatk-framework public/gatk-tools-public
public/queue-framework public/gatk-queue
protected/gatk-protected protected/gatk-tools-protected
private/gatk-private private/gatk-tools-private
private/queue-private private/gatk-queue-private

QScripts are no longer located with the Queue engine, and instead are now located with the GATK wrappers implemented as Queue extensions. See [Separated Queue Extensions] for more info.


Separating the GATK Engine and Tools

Starting with GATK 3.2, separate Maven utility artifacts exist to separate reusable portions of the GATK engine apart from tool specific implementations. The biggest impact this will have on developers is the separation of the walkers packages.

In GATK versions <= 3.1 there was one package for both the base classes and the implementations of walkers:

  • org.broadinstitute.sting.gatk.walkers

In GATK versions >= 3.2 threre are two packages. The first contains the base interfaces, annotations, etc. The latter package is for the concrete tools implemented as walkers:

  • org.broadinstitute.gatk.engine.walkers

    • Ex: ReadWalker, LocusWalker, @PartitionBy, @Requires, etc.
    • Ex: PrintReads, VariantEval, IndelRealigner, HaplotypeCaller, etc.

Renamed Binary Packages

Previously, depending on how the source code was compiled, the executable gatk-package-3.1.jar and queue-package-3.1.jar (aka GenomeAnalysisTK.jar and Queue.jar) contained various mixes of public/protected/private code. For example, if the private directory was present when the source code was compiled, the same artifact named gatk-package-3.1.jar might, or might not contain private code.

Starting with 3.2, there are two versions of the jar created, each with specific file contents.

New Maven Artifact Alias in the /target folder Packaged contents
gatk-package-distribution-3.2.jar GenomeAnalysisTK.jar public,protected
gatk-package-internal-3.2.jar GenomeAnalysisTK-internal.jar public,protected,private
gatk-queue-package-distribution-3.2.jar Queue.jar public,protected
gatk-queue-package-internal-3.2.jar Queue-internal.jar public,protected,private

Separated Queue Extensions

When creating a packaged version of Queue, the GATKExtensionsGenerator builds Queue engine compatible command line wrappers around each GATK walker. Previously, the wrappers were generated during the compilation of the Queue framework. Similar to the binary packages, depending on who built the source code, queue-framework-3.1.jar would contain various mixes of public/protected/private wrappers.

Starting with GATK 3.2, the gatk-queue-3.2.jar only contains code for the Queue engine. Generated and manually created extensions for wrapping any other command line programs are all included in separate artifacts. Due to a current limitation regarding how the generator uses reflection, the generator cannot build wrappers for just private classes without also generating protected and public classes. Thus, there are three different Maven artifacts generated, that contain different mixes of public, protected and private wrappers.

Extensions Artifact Generated wrappers for GATK tools
gatk-queue-extensions-public-3.2.jar public only
gatk-queue-extensions-distribution-3.2.jar public,protected
gatk-queue-extensions-internal-3.2.jar public,protected,private

As for QScripts that used to be located with the framework, they are now located with the generated wrappers.

Old QScripts Artifact Directory New QScripts Artifact Directory
public/queue-framework/src/main/qscripts public/gatk-queue-extensions-public/src/main/qscripts
private/queue-private/src/main/qscripts private/gatk-queue-extensions-internal/src/main/qscripts

Renamed Artifact Directories

The following list shows the mapping of artifact names pre and post GATK 3.2. In addition to the engine changes, the packaging updates and extensions changes above also affected Maven artifact refactoring. The packaging artifacts have split from a single public to protected and private versions, and new queue extensions artifacts have been added as well.

Maven Artifact <= GATK 3.1 Maven Artifact >= GATK 3.2
/pom.xml (sting-aggregator) /pom.xml _(gatkaggregator)
public/sting-root public/gatk-root
public/sting-utils public/gatk-utils
none public/gatk-engine
public/gatk-framework public/gatk-tools-public
public/queue-framework public/gatk-queue
public/gatk-queue-extgen public/gatk-queue-extensions-generator
protected/gatk-protected protected/gatk-tools-protected
private/gatk-private private/gatk-tools-private
private/queue-private private/gatk-queue-private
public/gatk-package protected/gatk-package-distribution
public/queue-package protected/gatk-queue-package-distribution
none private/gatk-package-internal
none private/gatk-queue-package-internal
none public/gatk-queue-extensions-public
none protected/gatk-queue-extensions-distribution
none private/gatk-queue-extensions-internal

A note regarding the aggregator:

The aggregator is the pom.xml in the top directory level of the GATK source code. When someone clones the GATK source code and runs mvn in the top level directory, the aggregator the pom.xml executed.

The root is a pom.xml that contains all common Maven configuration. There are a couple dependent pom.xml files that inherit configuration from the root, but are NOT aggregated during normal source compilation.

As of GATK 3.2, these un-aggregated child artifacts are VectorPairHMM and picard-maven. They should not run by default with each instance of mvn run on the GATK source code.

For more clarification on Maven Inheritance vs. Aggregation, see the Maven introduction to the pom.

Renamed Java/Scala Package Names

In GATK 3.2, except for classes with Sting in the name, all file names are still the same. To locate migrated files under new java package names, developers should either use Intellij IDEA Navigation or /bin/find to locate the same file they used previously.

The biggest change most developers will face is the new package names for GATK classes. Code entanglement does not permit simply moving the classes into the correct Maven artifacts, as a few number of lines of code must be edited inside a large number of files. So post renaming only a very small number of classes were moved out of the incorrect Maven artifacts as examples.

As of the May 16, 2014, the migrated GATK package distribution is as follows. This list includes only main classes. The table excludes all tests, renamed files such as StingException, certain private Queue wrappers, and qscripts renamed to end in *.scala.

Scope Type <= 3.1 Artifact <= 3.1 Package >= GATK 3.2 Artifact >= 3.2 GATK Package Files
public java gatk-framework o.b.s gatk-utils o.b.g 4
public java gatk-framework o.b.s.gatk gatk-engine o.b.g.engine 2
public java gatk-framework o.b.s gatk-tools-public o.b.g 202
public java gatk-framework o.b.s gatk-tools-public o.b.g.utils 49
public java gatk-framework o.b.s gatk-tools-public o.b.g.engine 34
public java gatk-framework o.b.s.gatk gatk-tools-public o.b.g.engine 244
public java gatk-framework o.b.s.gatk gatk-tools-public 134
public java gatk-framework o.b.s.gatk gatk-tools-public 2
protected java gatk-protected o.b.s gatk-tools-protected o.b.g 44
protected java gatk-protected o.b.s.gatk gatk-tools-protected o.b.g.engine 1
protected java gatk-protected o.b.s.gatk gatk-tools-protected 209
private java gatk-private o.b.s gatk-tools-private o.b.g 23
private java gatk-private o.b.s gatk-tools-private o.b.g.utils 7
private java gatk-private o.b.s.gatk gatk-tools-private o.b.g.engine 5
private java gatk-private o.b.s.gatk gatk-tools-private 133
public java queue-framework o.b.s gatk-queue o.b.g 2
public scala queue-framework o.b.s gatk-queue o.b.g 72
public scala queue-framework o.b.s gatk-queue-extensions-public o.b.g 31
public qscripts queue-framework o.b.s gatk-queue-extensions-public o.b.g 12
private scala queue-private o.b.s gatk-queue-private o.b.g 2
private qscripts queue-private o.b.s gatk-queue-extensions-internal o.b.g 118

During all future code modifications and pull requests, classes should be refactored to correct artifacts and package as follows.

All non-engine tools should be in the tools artifacts, with appropriate sub-package names.

Scope Type Artifact Package(s)
public java gatk-utils o.b.g.utils
public java gatk-engine o.b.g.engine
public java gatk-tools-public
public java gatk-tools-public*
protected java gatk-tools-protected
protected java gatk-tools-protected*
private java gatk-tools-private
private java gatk-tools-private*
public java gatk-queue o.b.g.queue
public scala gatk-queue o.b.g.queue
public scala gatk-queue-extensions-public o.b.g.queue.extensions
public qscripts gatk-queue-extensions-public o.b.g.queue.qscripts
private scala gatk-queue-private o.b.g.queue
private qscripts gatk-queue-extensions-internal o.b.g.queue.qscripts

Renamed Classes

The following class names were updated to replace Sting with GATK.

Old Sting class New GATK class
ArtificialStingSAMFileWriter ArtificialGATKSAMFileWriter
ReviewedStingException ReviewedGATKException
StingException GATKException
StingSAMFileWriter GATKSAMFileWriter
StingSAMIterator GATKSAMIterator
StingSAMIteratorAdapter GATKSAMIteratorAdapter
StingSAMRecordIterator GATKSAMRecordIterator
StingTextReporter GATKTextReporter

Common Git/Maven Issues

Renamed files

The 3.2 renaming patch is actually split into two commits. The first commit renames the files without making any content changes, while the second changes the contents of the files without changing any file paths.

When dealing with renamed files, it is best to work with a clean directory during rebasing. It will be easier for you track files that you may not have added to git.

After running a git rebase or merge, you may first run into problems with files that you renamed and were moved during the GATK 3.2 package renaming. As a general rule, the renaming only changes directory names. The exception to this rule are classes such as StingException that are renamed to GATKException, and are listed under [Renamed Classes]. The workflow for resolving these merge issues is to find the list of your renamed files, put your content in the correct location, then register the changes with git.

To obtain the list of renamed directories and files:

  1. Use git status to get a list of affected files
  2. Find the common old directory and file name under "both deleted"
  3. Find your new file name under "added by them" (yes, you are "them")
  4. Find the new directory under "added by us"

Then, to resolve the issue for each file:

  1. Move your copy of your renamed file to the new directory
  2. git rm the old paths as appropriate
  3. git add the new path
  4. Repeat for other files until git status shows "all conflicts fixed"

Upon first rebasing you will see a lot of text. At this moment, you can ignore most of it, and use git status instead.

For the purposes of illustration, while running git rebase it is perfectly normal to see something similar to:

$ git rebase master
First, rewinding head to replay your work on top of it...
Applying: <<< Your first commit message here >>>
Using index info to reconstruct a base tree...
A   protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/
A   protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/
<<<Other files that you renamed.>>>
warning: squelched 12 whitespace errors
warning: 34 lines add whitespace errors.
Falling back to patching base and 3-way merge...
CONFLICT (rename/rename): Rename "protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/"->"protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/" in branch "HEAD" rename "protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/"->"protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/" in "<<< Your first commit message here >>>"
CONFLICT (rename/rename): Rename "protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/"->"protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/" in branch "HEAD" rename "protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/"->"protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/" in "<<< Your first commit message here >>>"
Failed to merge in the changes.
Patch failed at 0001 Example conflict.
The copy of the patch that failed is found in:

When you have resolved this problem, run "git rebase --continue".
If you prefer to skip this patch, run "git rebase --skip" instead.
To check out the original branch and stop rebasing, run "git rebase --abort".


While everything you need to resolve the issue is technically in the message above, it may be much easier to track what's going on using git status.

$ git status
rebase in progress; onto cba4321
You are currently rebasing branch 'zz_renaming_haplotypecallergenotypingengine' on 'cba4321'.
  (fix conflicts and then run "git rebase --continue")
  (use "git rebase --skip" to skip this patch)
  (use "git rebase --abort" to check out the original branch)

Unmerged paths:
  (use "git reset HEAD <file>..." to unstage)
  (use "git add/rm <file>..." as appropriate to mark resolution)

    added by them:      protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/
    both deleted:       protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/
    added by them:      protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/
    both deleted:       protected/gatk-protected/src/test/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/
    added by us:        protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/
    added by us:        protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/

Untracked files:
  (use "git add <file>..." to include in what will be committed)

<<< possible untracked files if your working directory is not clean>>>

no changes added to commit (use "git add" and/or "git commit -a")

Let's look at the main java file as an example. If you are having issues figuring out the new directory and new file name, they are all listed in the output.

Path in the common ancestor branch:
 |      old source directory       |                     old package name                     |   old file name     |

Path in the new master branch before merge:
 |           new source directory             |                 new package name                    |   old file name     |

Path in your branch before merge:
 |      old source directory       |                     old package name                     |           new file name            |

Path in your branch post merge:
 |           new source directory             |                 new package name                    |           new file name            |

After identifying the new paths for use post merge, use the following workflow for each file:

  1. Move or copy your version of the renamed file to the new directory
  2. git rm the three old file paths: common ancestor, old directory with new file name, and new directory with old file name
  3. git add the new file name in the new directory

After you process all files correctly, in the output of git status you should see the "all conflicts fixed" and all your files renamed.

$ git status
rebase in progress; onto cba4321
You are currently rebasing branch 'zz_renaming_haplotypecallergenotypingengine' on 'cba4321'.
  (all conflicts fixed: run "git rebase --continue")

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    renamed:    protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/ -> protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/
    renamed:    protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/ -> protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/

Untracked files:
  (use "git add <file>..." to include in what will be committed)

<<< possible untracked files if your working directory is not clean>>>


Continue your rebase, handling other merges as normal.

$ git rebase --continue

Fixing imports

Because all the packages names are different in 3.2, while rebasing you may run into conflicts due to imports you also changed. Use your favorite editor to fix the imports within the files. Then try recompiling, and repeat as necessary until your code works.

While editing the files with conflicts with a basic text editor may work, IntelliJ IDEA also offers a special merge tool that may help via the menu:

VCS > Git > Resolve Conflicts...

For each file, click on the "Merge" button in the first dialog. Use the various buttons in the Conflict Resolution Tool to automatically accept any changes that are not in conflict. Then find any edit any remaining conflicts that require further manual intervention.

Once you begin editing the import statements in the three way merge tool, another IntelliJ IDEA 13.1 feature that may speed up repairing blocks of import statements is Multiple Selections. Find a block of import lines that need the same changes. Hold down the option key as you drag your cursor vertically down the edit point on each import line. Then begin typing or deleting text from the multiple lines.

Switching branches

Even after a successful merge, you may still run into stale GATK code or links from modifications before and after the 3.2 package renaming. To significantly reduce these chances, run mvn clean before and then again after switching branches.

If this doesn't work, run mvn clean && git status, looking for any directories you don't that shouldn't be in the current branch. It is possible that some files were not correctly moved, including classes or test resources. Find the file still in the old directories via a command such as find public/gatk-framework -type f. Then move them to the correct new directories and commit them into git.

Slow Builds with Queue and Private

Due to the [Renamed Binary Packages], the separate artifacts including and excluding private code are now packaged during the Maven package build lifecycle.

When building packages, to significantly speed up the default packaging time, if you only require the GATK tools run mvn verify -P\!queue.

Alternatively, if you do not require building private source, then disable private compiling via mvn verify -P\!private.

The two may be combined as well via: mvn verify -P\!queue,\!private.

The exclamation mark is a shell command that must be escaped, in the above case with a backslash. Shell quotes may also be used: mvn verify -P'!queue,!private'.

Alternatively, developers with access to private may often want to disable packaging the protected distributions. In this case, use the gsadev profile. This may be done via mvn verify -Pgsadev or, excluding Queue, mvn verify -Pgsadev,\!queue.

Stale symlinks

Users see errors from maven when an unclean repo in git is updated. Because currently hardcodes relative paths to "public/testdata", maven creates these symbolic links all over the file system to help the various tests in different modules find the relative path "/public/testdata".

However, our Maven support has evolved from 2.8, to 3.0, to now the 3.2 renaming, each time has changed the symbolic link's target directory. Whenever a stale symbolic link to an old testdata directory remains in the users folder, maven is saying it will not remove the link, because maven basically doesn't know why the link is pointing to the wrong folder (answer, the link is from an old git checkout) and thinks it's a bug in the build.

If one doesn't have an stale / unclean maven repo when updating git via merge/rebase/checkout, you will never see this issue.

The script that can remove the stale symlinks, public/src/main/scripts/shell/, should run automatically during a mvn test-compile or mvn verify.

Created 2012-08-15 18:16:01 | Updated 2014-09-17 13:20:21 | Tags: developer tribble intermediate picard htsjdk

Comments (6)

1. Overview

The Tribble project was started as an effort to overhaul our reference-ordered data system; we had many different formats that were shoehorned into a common framework that didn't really work as intended. What we wanted was a common framework that allowed for searching of reference ordered data, regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, which was incorporated into the Tribble library.

2. Architecture Overview

Tribble provides a lightweight interface and API for querying features and creating indexes from feature files, while allowing iteration over know feature files that we're unable to create indexes for. The main entry point for external users is the BasicFeatureReader class. It takes in a codec, an index file, and a file containing the features to be processed. With an instance of a BasicFeatureReader, you can query for features that span a specific location, or get an iterator over all the records in the file.

3. Developer Overview

For developers, there are two important classes to implement: the FeatureCodec, which decodes lines of text and produces features, and the feature class, which is your underlying record type.

For developers there are two classes that are important:

  • Feature

    This is the genomicly oriented feature that represents the underlying data in the input file. For instance in the VCF format, this is the variant call including quality information, the reference base, and the alternate base. The required information to implement a feature is the chromosome name, the start position (one based), and the stop position. The start and stop position represent a closed, one-based interval. I.e. the first base in chromosome one would be chr1:1-1.

  • FeatureCodec

    This class takes in a line of text (from an input source, whether it's a file, compressed file, or a http link), and produces the above feature.

To implement your new format into Tribble, you need to implement the two above classes (in an appropriately named subfolder in the Tribble check-out). The Feature object should know nothing about the file representation; it should represent the data as an in-memory object. The interface for a feature looks like:

public interface Feature {

     * Return the features reference sequence name, e.g chromosome or contig
    public String getChr();

     * Return the start position in 1-based coordinates (first base is 1)
    public int getStart();

     * Return the end position following 1-based fully closed conventions.  The length of a feature is
     * end - start + 1;
    public int getEnd();

And the interface for FeatureCodec:

 * the base interface for classes that read in features.
 * @param <T> The feature type this codec reads
public interface FeatureCodec<T extends Feature> {
     * Decode a line to obtain just its FeatureLoc for indexing -- contig, start, and stop.
     * @param line the input line to decode
     * @return  Return the FeatureLoc encoded by the line, or null if the line does not represent a feature (e.g. is
     * a comment)
    public Feature decodeLoc(String line);

     * Decode a line as a Feature.
     * @param line the input line to decode
     * @return  Return the Feature encoded by the line,  or null if the line does not represent a feature (e.g. is
     * a comment)
    public T decode(String line);

     * This function returns the object the codec generates.  This is allowed to be Feature in the case where
     * conditionally different types are generated.  Be as specific as you can though.
     * This function is used by reflections based tools, so we can know the underlying type
     * @return the feature type this codec generates.
    public Class<T> getFeatureType();

    /**  Read and return the header, or null if there is no header.
     * @return header object
    public Object readHeader(LineReader reader);

4. Supported Formats

The following formats are supported in Tribble:

  • VCF Format
  • DbSNP Format
  • BED Format
  • GATK Interval Format

5. Updating the Tribble, htsjdk, and/or Picard library

Updating the revision of Tribble on the system is a relatively straightforward task if the following steps are taken.

NOTE: Any directory starting with ~ may be different on your machine, depending on where you cloned the various repositories for gsa-unstable, picard, and htsjdk.

A Maven script to install picard into the local repository is located under gsa-unstable/private/picard-maven. To operate, it requires a symbolic link named picard pointing to a working checkout of the picard github repository. NOTE: compiling picard requires an htsjdk github repository checkout available at picard/htsjdk, either as a subdirectory or another symbolic link. The final full path should be gsa-unstable/private/picard-maven/picard/htsjdk.

cd ~/src/gsa-unstable
cd private/picard-maven
ln -s ~/src/picard picard

Create a git branch of Picard and/or htsjdk and make your changes. To install your changes into the GATK you must run mvn install in the private/picard-maven directory. This will compile and copy the jars into gsa-unstable/public/repo, and update gsa-unstable/gatk-root/pom.xml with the corresponding version. While making changes your revision of picard and htslib will be labeled with -SNAPSHOT.

cd ~/src/gsa-unstable
cd private/picard-maven
mvn install

Continue testing in the GATK. Once your changes and updated tests for picard/htsjdk are complete, push your branch and submit your pull request to the Picard and/or htsjdk github. After your Picard/htsjdk patches are accepted, switch your Picard/htsjdk branches back to the master branch. NOTE: Leave your gsa-unstable branch on your development branch!

cd ~/src/picard
ant clean
git checkout master
git fetch
git rebase
cd htsjdk
git checkout master
git fetch
git rebase

NOTE: The version number of old and new Picard/htsjdk will vary, and during active development will end with -SNAPSHOT. While, if needed, you may push -SNAPSHOT version for testing on Bamboo, you should NOT submit a pull request with a -SNAPSHOT version. -SNAPSHOT indicates your local changes are not reproducible from source control.

When ready, run mvn install once more to create the non -SNAPSHOT versions under gsa-unstable/public/repo. In that directory, git add the new version, and git rm the old versions.

cd ~/src/gsa-unstable
cd public/repo
git add picard/picard/1.115.1499/
git add samtools/htsjdk/1.115.1509/
git rm -r picard/picard/1.112.1452/
git rm -r samtools/htsjdk/1.112.1452/

Commit and then push your gsa-unstable branch, then issue a pull request for review.

Created 2012-08-10 16:36:23 | Updated 2012-10-18 15:43:46 | Tags: test developer

Comments (0)

1. What is DiffEngine?

DiffEngine is a summarizing difference engine that allows you to compare two structured files -- such as BAMs and VCFs -- to find what are the differences between them. This is primarily useful in regression testing or optimization, where you want to ensure that the differences are those that you expect and not any others.

2. The summarized differences

The GATK contains a summarizing difference engine called DiffEngine that compares hierarchical data structures to emit:

  • A list of specific differences between the two data structures. This is similar to saying the value in field A in record 1 in file F differs from the value in field A in record 1 in file G.

  • A summarized list of differences ordered by frequency of the difference. This output is similar to saying field A differed in 50 records between files F and G.

3. The DiffObjects walker

The GATK contains a private walker called DiffObjects that allows you access to the DiffEngine capabilities on the command line. Simply provide the walker with the master and test files and it will emit summarized differences for you.

4. Understanding the output

The DiffEngine system compares to two hierarchical data structures for specific differences in the values of named nodes. Suppose I have two trees:

Tree1=(A=1 B=(C=2 D=3)) 
Tree2=(A=1 B=(C=3 D=3 E=4))
Tree3=(A=1 B=(C=4 D=3 E=4))

where every node in the tree is named, or is a raw value (here all leaf values are integers). The DiffEngine traverses these data structures by name, identifies equivalent nodes by fully qualified names (Tree1.A is distinct from Tree2.A, and determines where their values are equal (Tree1.A=1, Tree2.A=1, so they are).

These itemized differences are listed as:

Tree1.B.C=2 != Tree2.B.C=3
Tree1.B.C=2 != Tree3.B.C=4
Tree2.B.C=3 != Tree3.B.C=4
Tree1.B.E=MISSING != Tree2.B.E=4

This conceptually very similar to the output of the unix command line tool diff. What's nice about DiffEngine though is that it computes similarity among the itemized differences and displays the count of differences names in the system. In the above example, the field C is not equal three times, while the missing E in Tree1 occurs only once. So the summary is:

*.B.C : 3
*.B.E : 1

where the * operator indicates that any named field matches. This output is sorted by counts, and provides an immediate picture of the commonly occurring differences between the files.

Below is a detailed example of two VCF fields that differ because of a bug in the AC, AF, and AN counting routines, detected by the integrationtest integration (more below). You can see that in the although there are many specific instances of these differences between the two files, the summarized differences provide an immediate picture that the AC, AF, and AN fields are the major causes of the differences.

[testng] path                                                              count
[testng] *.*.*.AC                                                         6
[testng] *.*.*.AF                                                         6
[testng] *.*.*.AN                                                         6
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AC  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AF  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AN  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AC  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AF  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AN  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AC  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AF  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AN  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000598.AC  1

5. Integration tests

The DiffEngine codebase that supports these calculations is integrated into the integrationtest framework, so that when a test fails the system automatically summarizes the differences between the master MD5 file and the failing MD5 file, if it is an understood type. When failing you will see in the integration test logs not only the basic information, but the detailed DiffEngine output.

For example, in the output below I broke the GATK BAQ calculation and the integration test DiffEngine clearly identifies that all of the records differ in their BQ tag value in the two BAM files:

/humgen/1kg/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam -o /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp -L 1:10,000,000-10,100,000 -baq RECALCULATE -et NO_ET
   [testng] WARN  22:59:22,875 TextFormattingUtils - Unable to load help text.  Help output will be sparse.
   [testng] WARN  22:59:22,875 TextFormattingUtils - Unable to load help text.  Help output will be sparse.
   [testng] ##### MD5 file is up to date: integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
   [testng] Checking MD5 for /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp [calculated=e5147656858fc4a5f470177b94b1fc1b, expected=4ac691bde1ba1301a59857694fda6ae2]
   [testng] ##### Test testPrintReadsRecalBAQ is going fail #####
   [testng] ##### Path to expected   file (MD5=4ac691bde1ba1301a59857694fda6ae2): integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest
   [testng] ##### Path to calculated file (MD5=e5147656858fc4a5f470177b94b1fc1b): integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
   [testng] ##### Diff command: diff integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
   [testng] ##:GATKReport.v0.1 diffences : Summarized differences between the master and test files.
   [testng] See for more information
   [testng] Difference                                                                               NumberOfOccurrences
   [testng] *.*.*.BQ                                                                                 895
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:2:266:272:361.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:245:474:254.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:255:178:160.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:158:682:495.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:195:591:884.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:165:236:848.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:191:223:910.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:286:279:434.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:2:106:516:354.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:3:102:580:518.BQ  1
   [testng] Note that the above list is not comprehensive.  At most 20 lines of output, and 10 specific differences will be listed.  Please use -T DiffObjects -R public/testdata/exampleFASTA.fasta -m integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest -t integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest to explore the differences more freely

6. Adding your own DiffableObjects to the system

The system dynamically finds all classes that implement the following simple interface:

public interface DiffableReader {
    @Ensures("result != null")
     * Return the name of this DiffableReader type.  For example, the VCF reader returns 'VCF' and the
     * bam reader 'BAM'
    public String getName();

    @Ensures("result != null")
    @Requires("file != null")
     * Read up to maxElementsToRead DiffElements from file, and return them.
    public DiffElement readFromFile(File file, int maxElementsToRead);

     * Return true if the file can be read into DiffElement objects with this reader. This should
     * be uniquely true/false for all readers, as the system will use the first reader that can read the
     * file.  This routine should never throw an exception.  The VCF reader, for example, looks at the
     * first line of the file for the ##format=VCF4.1 header, and the BAM reader for the BAM_MAGIC value
     * @param file
     * @return
    @Requires("file != null")
    public boolean canRead(File file);

See the VCF and BAMDiffableReaders for example implementations. If you extend this to a new object types both the DiffObjects walker and the integrationtest framework will automatically work with your new file type.

Created 2012-08-11 05:31:52 | Updated 2012-10-18 15:35:49 | Tags: gatkdocs developer walkers intermediate

Comments (0)

The GATKDocs are what we call "Technical Documentation" in the Guide section of this website. The HTML pages are generated automatically at build time from specific blocks of documentation in the source code.

The best place to look for example documentation for a GATK walker is GATKDocsExample walker in org.broadinstitute.sting.gatk.examples. This is available here.

Below is the reproduction of that file from August 11, 2011:

 * [Short one sentence description of this walker]
 * <p>
 * [Functionality of this walker]
 * </p>
 * <h2>Input</h2>
 * <p>
 * [Input description]
 * </p>
 * <h2>Output</h2>
 * <p>
 * [Output description]
 * </p>
 * <h2>Examples</h2>
 *    java
 *      -jar GenomeAnalysisTK.jar
 *      -T $WalkerName
 * @category Walker Category
 * @author Your Name
 * @since Date created
public class GATKDocsExample extends RodWalker<Integer, Integer> {
     * Put detailed documentation about the argument here.  No need to duplicate the summary information
     * in doc annotation field, as that will be added before this text in the documentation page.
     * Notes:
     * <ul>
     *     <li>This field can contain HTML as a normal javadoc</li>
     *     <li>Don't include information about the default value, as gatkdocs adds this automatically</li>
     *     <li>Try your best to describe in detail the behavior of the argument, as ultimately confusing
     *          docs here will just result in user posts on the forum</li>
     * </ul>
    @Argument(fullName="full", shortName="short", doc="Brief summary of argument [~ 80 characters of text]", required=false)
    private boolean myWalkerArgument = false;

    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) { return 0; }
    public Integer reduceInit() { return 0; }
    public Integer reduce(Integer value, Integer sum) { return value + sum; }
    public void onTraversalDone(Integer result) { }

Created 2012-08-15 18:19:11 | Updated 2012-10-18 15:23:11 | Tags: developer tribble rod intermediate

Comments (2)

Brief introduction to reference metadata (RMDs)

Note that the -B flag referred to below is deprecated; these docs need to be updated

The GATK allows you to process arbitrary numbers of reference metadata (RMD) files inside of walkers (previously we called this reference ordered data, or ROD). Common RMDs are things like dbSNP, VCF call files, and refseq annotations. The only real constraints on RMD files is that:

  • They must contain information necessary to provide contig and position data for each element to the GATK engine so it knows with what loci to associate the RMD element.

  • The file must be sorted with regard to the reference fasta file so that data can be accessed sequentially by the engine.

  • The file must have a Tribble RMD parsing class associated with the file type so that elements in the RMD file can be parsed by the engine.

Inside of the GATK the RMD system has the concept of RMD tracks, which associate an arbitrary string name with the data in the associated RMD file. For example, the VariantEval module uses the named track eval to get calls for evaluation, and dbsnp as the track containing the database of known variants.

How do I get reference metadata files into my walker?

RMD files are extremely easy to get into the GATK using the -B syntax:

java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:variant,VCF calls.vcf

In this example, the GATK will attempt to parse the file calls.vcf using the VCF parser and bind the VCF data to the RMD track named variant.

In general, you can provide as many RMD bindings to the GATK as you like:

java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:calls1,VCF calls1.vcf -B:calls2,VCF calls2.vcf

Works just as well. Some modules may require specifically named RMD tracks -- like eval above -- and some are happy to just assess all RMD tracks of a certain class and work with those -- like VariantsToVCF.

1. Directly getting access to a single named track

In this snippet from SNPDensityWalker, we grab the eval track as a VariantContext object, only for the variants that are of type SNP:

public Pair<VariantContext, GenomeLoc> map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    VariantContext vc = tracker.getVariantContext(ref, "eval", EnumSet.of(VariantContext.Type.SNP), context.getLocation(), false);

2. Grabbing anything that's convertable to a VariantContext

From VariantsToVCF we call the helper function tracker.getVariantContexts to look at all of the RMDs and convert what it can to VariantContext objects.

Allele refAllele = new Allele(Character.toString(ref.getBase()), true);
Collection<VariantContext> contexts = tracker.getVariantContexts(INPUT_RMD_NAME, ALLOWED_VARIANT_CONTEXT_TYPES, context.getLocation(), refAllele, true, false);

3. Looking at all of the RMDs

Here's a totally general code snippet from This code, as you can see, iterates over all of the GATKFeature objects in the reference ordered data, converting each RMD to a string and capturing these strings in a list. It finally grabs the dbSNP binding specifically for a more detailed string conversion, and then binds them all up in a single string for display along with the read pileup.

private String getReferenceOrderedData( RefMetaDataTracker tracker ) { ArrayList rodStrings = new ArrayList(); for ( GATKFeature datum : tracker.getAllRods() ) { if ( datum != null && ! (datum.getUnderlyingObject() instanceof DbSNPFeature)) { rodStrings.add(((ReferenceOrderedDatum)datum.getUnderlyingObject()).toSimpleString()); // TODO: Aaron: this line still survives, try to remove it } } String rodString = Utils.join(", ", rodStrings);

        DbSNPFeature dbsnp = tracker.lookup(DbSNPHelper.STANDARD_DBSNP_TRACK_NAME, DbSNPFeature.class);

        if ( dbsnp != null)
            rodString += DbSNPHelper.toMediumString(dbsnp);

        if ( !rodString.equals("") )
            rodString = "[ROD: " + rodString + "]";

        return rodString;

How do I write my own RMD types?

Tracks of reference metadata are loaded using the Tribble infrastructure. Tracks are loaded using the feature codec and underlying type information. See the Tribble documentation for more information.

Tribble codecs that are in the classpath are automatically found; the GATK discovers all classes that implement the FeatureCodec class. Name resolution occurs using the -B type parameter, i.e. if the user specified:

-B:calls1,VCF calls1.vcf

The GATK looks for a FeatureCodec called to decode the record type. Alternately, if the user specified:

-B:calls1,MYAwesomeFormat calls1.maft

THe GATK would look for a codec called This look-up is not case sensitive, i.e. it will resolve MyAwEsOmEfOrMaT as well, though why you would want to write something so painfully ugly to read is beyond us.

Created 2012-08-14 18:16:53 | Updated 2012-10-18 15:28:56 | Tags: test developer walkers

Comments (0)

1. Testing core walkers is critical

Most GATK walkers are really too complex to easily test using the standard unit test framework. It's just not feasible to make artificial read piles and then extrapolate from simple tests passing whether the system as a whole is working correctly. However, we need some way to determine whether changes to the core of the GATK are altering the expected output of complex walkers like BaseRecalibrator or SingleSampleGenotyper. In additional to correctness, we want to make sure that the performance of key walkers isn't degrading over time, so that calling snps, cleaning indels, etc., isn't slowly creeping down over time. Since we are now using a bamboo server to automatically build and run unit tests (as well as measure their runtimes) we want to put as many good walker tests into the test framework so we capture performance metrics over time.

2. The WalkerTest framework

To make this testing process easier, we've created a WalkerTest framework that lets you invoke the GATK using command-line GATK commands in the JUnit system and test for changes in your output files by comparing the current ant build results to previous run via an MD5 sum. It's a bit coarse grain, but it will work to ensure that changes to key walkers are detected quickly by the system, and authors can either update the expected MD5s or go track down bugs.

The system is fairly straightforward to use. Ultimately we will end up with JUnit style tests in the unit testing structure. In the piece of code below, we have a piece of code that checks the MD5 of the SingleSampleGenotyper's GELI text output at LOD 3 and LOD 10.

package org.broadinstitute.sting.gatk.walkers.genotyper;

import org.broadinstitute.sting.WalkerTest;
import org.junit.Test;

import java.util.HashMap;
import java.util.Map;
import java.util.Arrays;

public class SingleSampleGenotyperTest extends WalkerTest {
    public void testLOD() {
        HashMap<Double, String> e = new HashMap<Double, String>();
        e.put( 10.0, "e4c51dca6f1fa999f4399b7412829534" );
        e.put( 3.0, "d804c24d49669235e3660e92e664ba1a" );

        for ( Map.Entry<Double, String> entry : e.entrySet() ) {
            WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
                   "-T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout %s --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod " + entry.getKey(), 1,
            executeTest("testLOD", spec);

The fundamental piece here is to inherit from WalkerTest. This gives you access to the executeTest() function that consumes a WalkerTestSpec:

    public WalkerTestSpec(String args, int nOutputFiles, List<String> md5s)

The WalkerTestSpec takes regular, command-line style GATK arguments describing what you want to run, the number of output files the walker will generate, and your expected MD5s for each of these output files. The args string can contain %s String.format specifications, and for each of the nOutputFiles, the executeTest() function will (1) generate a tmp file for output and (2) call String.format on your args to fill in the tmp output files in your arguments string. For example, in the above argument string varout is followed by %s, so our single SingleSampleGenotyper output is the variant output file.

3. Example output

When you add a walkerTest inherited unit test to the GATK, and then build test, you'll see output that looks like:

[junit] WARN  13:29:50,068 WalkerTest - -------------------------------------------------------------------------------- 
[junit] WARN  13:29:50,068 WalkerTest - -------------------------------------------------------------------------------- 
[junit] WARN  13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit] WARN  13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit] WARN  13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a] 
[junit] WARN  13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a] 
[junit] WARN  13:30:39,408 WalkerTest -   => testLOD PASSED 
[junit] WARN  13:30:39,408 WalkerTest -   => testLOD PASSED 
[junit] WARN  13:30:39,409 WalkerTest - -------------------------------------------------------------------------------- 
[junit] WARN  13:30:39,409 WalkerTest - -------------------------------------------------------------------------------- 
[junit] WARN  13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit] WARN  13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit] WARN  13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534] 
[junit] WARN  13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534] 
[junit] WARN  13:31:30,213 WalkerTest -   => testLOD PASSED 
[junit] WARN  13:31:30,213 WalkerTest -   => testLOD PASSED 
[junit] WARN  13:31:30,214 SingleSampleGenotyperTest -  
[junit] WARN  13:31:30,214 SingleSampleGenotyperTest -  

4. Recommended location for GATK testing data

We keep all of the permenant GATK testing data in:


A good set of data to use for walker testing is the CEU daughter data from 1000 Genomes:

gsa2 ~/dev/GenomeAnalysisTK/trunk > ls -ltr /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.bam /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.calls
-rw-rw-r--+ 1 depristo wga  51M 2009-09-03 07:56 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam
-rw-rw-r--+ 1 depristo wga 185K 2009-09-04 13:21 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.variants.geli.calls
-rw-rw-r--+ 1 depristo wga 164M 2009-09-04 13:22 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.genotypes.geli.calls
-rw-rw-r--+ 1 depristo wga  24M 2009-09-04 15:00 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SOLID.bam
-rw-rw-r--+ 1 depristo wga  12M 2009-09-04 15:01 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.454.bam
-rw-r--r--+ 1 depristo wga  91M 2009-09-04 15:02 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam

5. Test dependencies

The tests depend on a variety of input files, that are generally constrained to three mount points on the internal Broad network:


To run the unit and integration tests you'll have to have access to these files. They may have different mount points on your machine (say, if you're running remotely over the VPN and have mounted the directories on your own machine).

6. MD5 database and comparing MD5 results

Every file that generates an MD5 sum as part of the WalkerTest framework will be copied to <MD5>. integrationtest in the integrationtests subdirectory of the GATK trunk. This MD5 database of results enables you to easily examine the results of an integration test as well as compare the results of a test before/after a code change. For example, below is an example test for the UnifiedGenotyper that, due to a code change, where the output VCF differs from the VCF with the expected MD5 value in the test code itself. The test provides provides the path to the two results files as well as a diff command to compare expected to the observed MD5:

[junit] --------------------------------------------------------------------------------    
[junit] Executing test testParameter[-genotype] with GATK arguments: -T UnifiedGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05997727998894311741.tmp -L 1:10,000,000-10,010,000 -genotype    
[junit] ##### MD5 file is up to date: integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest    
[junit] Checking MD5 for /tmp/walktest.tmp_param.05997727998894311741.tmp [calculated=ab20d4953b13c3fc3060d12c7c6fe29d, expected=0ac7ab893a3f550cb1b8c34f28baedf6]    
[junit] ##### Test testParameter[-genotype] is going fail #####    
[junit] ##### Path to expected   file (MD5=0ac7ab893a3f550cb1b8c34f28baedf6): integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest    
[junit] ##### Path to calculated file (MD5=ab20d4953b13c3fc3060d12c7c6fe29d): integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest    
[junit] ##### Diff command: diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest

Examining the diff we see a few lines that have changed the DP count in the new code

> diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest  | head
< 1     10000345        .       A       .       106.54  .       AN=2;DP=33;Dels=0.00;MQ=89.17;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:25:-0.09,-7.57,-75.74:74.78
< 1     10000346        .       A       .       103.75  .       AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:24:-0.07,-7.27,-76.00:71.99
< 1     10000347        .       A       .       109.79  .       AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:26:-0.05,-7.85,-84.74:78.04
> 1     10000345        .       A       .       106.54  .       AN=2;DP=32;Dels=0.00;MQ=89.50;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:25:-0.09,-7.57,-75.74:74.78
> 1     10000346        .       A       .       103.75  .       AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:24:-0.07,-7.27,-76.00:71.99
> 1     10000347        .       A       .       109.79  .       AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:26:-0.05,-7.85,-84.74:78

Whether this is the expected change is up to you to decide, but the system makes it as easy as possible to see the consequences of your code change.

7. Testing for Exceptions

The walker test framework supports an additional syntax for ensuring that a particular java Exception is thrown when a walker executes using a simple alternate version of the WalkerSpec object. Rather than specifying the MD5 of the result, you can provide a single subclass of Exception.class and the testing framework will ensure that when the walker runs an instance (class or subclass) of your expected exception is thrown. The system also flags if no exception is thrown.

For example, the following code tests that the GATK can detect and error out when incompatible VCF and FASTA files are given:

@Test public void fail8() { executeTest("hg18lex-v-b36", test(lexHG18, callsB36)); }

private WalkerTest.WalkerTestSpec test(String ref, String vcf) {
    return new WalkerTest.WalkerTestSpec("-T VariantsToTable -M 10 -B:two,vcf "
            + vcf + " -F POS,CHROM -R "
            + ref +  " -o %s",
            1, UserException.IncompatibleSequenceDictionaries.class);


During the integration test this looks like:

[junit] Executing test hg18lex-v-b36 with GATK arguments: -T VariantsToTable -M 10 -B:two,vcf /humgen/gsa-hpprojects/GATK/data/Validation_Data/lowpass.N3.chr1.raw.vcf -F POS,CHROM -R /humgen/gsa-hpprojects/GATK/data/Validation_Data/lexFasta/lex.hg18.fasta -o /tmp/walktest.tmp_param.05541601616101756852.tmp -l WARN -et NO_ET
[junit]    [junit] Wanted exception class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries, saw class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries
[junit]   => hg18lex-v-b36 PASSED

8. Miscellaneous information

  • Please do not put any extremely long tests in the regular ant build test target. We are currently splitting the system into fast and slow tests so that unit tests can be run in \< 3 minutes while saving a test target for long-running regression tests. More information on that will be posted.

  • An expected MG5 string of "" means don't check for equality between the calculated and expected MD5s. Useful if you are just writing a new test and don't know the true output.

  • Overload parameterize() { return true; } if you want the system to just run your calculations, not throw an error if your MD5s don't match, across all tests

  • If your tests all of a sudden stop giving equality MD5s, you can just (1) look at the .tmp output files directly or (2) grab the printed GATK command-line options and explore what is happening.

  • You can always run a GATK walker on the command line and then run md5sum on its output files to obtain, outside of the testing framework, the MD5 expected results.

  • Don't worry about the duplication of lines in the output ; it's just an annoyance of having two global loggers. Eventually we'll bug fix this away.

Created 2012-08-10 17:51:14 | Updated 2013-03-06 19:55:08 | Tags: developer walkers

Comments (3)

1. Introduction

The core concept behind GATK tools is the walker, a class that implements the three core operations: filtering, mapping, and reducing.

  • filter Reduces the size of the dataset by applying a predicate.

  • map Applies a function to each individual element in a dataset, effectively mapping it to a new element.

  • reduce Inductively combines the elements of a list. The base case is supplied by the reduceInit() function, and the inductive step is performed by the reduce() function.

Users of the GATK will provide a walker to run their analyses. The engine will produce a result by first filtering the dataset, running a map operation, and finally reducing the map operation to a single result.

2. Creating a Walker

To be usable by the GATK, the walker must satisfy the following properties:

  • It must subclass one of the basic walkers in the org.broadinstitute.sting.gatk.walkers package, usually ReadWalker or LociWalker.

    • Locus walkers present all the reads, reference bases, and reference-ordered data that overlap a single base in the reference. Locus walkers are best used for analyses that look at each locus independently, such as genotyping.

    • Read walkers present only one read at a time, as well as the reference bases and reference-ordered data that overlap that read.

    • Besides read walkers and locus walkers, the GATK features several other data access patterns, described here.
  • The compiled class or jar must be on the current classpath. The Java classpath can be controlled using either the $CLASSPATH environment variable or the JVM's -cp option.

3. Examples

The best way to get started with the GATK is to explore the walkers we've written. Here are the best walkers to look at when getting started:

  • CountLoci

    It is the simplest locus walker in our codebase. It counts the number of loci walked over in a single run of the GATK.


  • CountReads

    It is the simplest read walker in our codebase. It counts the number of reads walked over in a single run of the GATK.


  • GATKPaperGenotyper

    This is a more sophisticated example, taken from our recent paper in Genome Research (and using our ReadBackedPileup to select and filter reads). It is an extremely basic Bayesian genotyper that demonstrates how to output data to a stream and execute simple base operations.


Please note that the walker above is NOT the UnifiedGenotyper. While conceptually similar to the UnifiedGenotyper, the GATKPaperGenotyper uses a much simpler calling model for increased clarity and readability.

4. External walkers and the 'external' directory

The GATK can absorb external walkers placed in a directory of your choosing. By default, that directory is called 'external' and is relative to the Sting git root directory (for example, ~/src/Sting/external). However, you can choose to place that directory anywhere on the filesystem and specify its complete path using the ant external.dir property.

ant -Dexternal.dir=~/src/external

The GATK will check each directory under the external directory (but not the external directory itself!) for small build scripts. These build scripts must contain at least a compile target that compiles your walker and places the resulting class file into the GATK's class file output directory. The following is a sample compile target:

<target name="compile" depends="init">
    <javac srcdir="." destdir="${build.dir}" classpath="${gatk.classpath}" />

As a convenience, the build.dir ant property will be predefined to be the GATK's class file output directory and the gatk.classpath property will be predefined to be the GATK's core classpath. Once this structure is defined, any invocation of the ant build scripts will build the contents of the external directory as well as the GATK itself.

Created 2012-08-15 22:36:36 | Updated 2012-10-18 15:17:47 | Tags: developer scala walkers

Comments (0)

1. Install scala somewhere

At the Broad, we typically put it somewhere like this:


Next, create a symlink from this directory to trunk/scala/installation:

ln -s /home/radon01/depristo/work/local/ trunk/scala/installation

2. Setting up your path

Right now the only way to get scala walkers into the GATK is by explicitly setting your CLASSPATH in your .my.cshrc file:

setenv CLASSPATH /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/FourBaseRecaller.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GenomeAnalysisTK.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/Playground.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/StingUtils.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/bcel-5.2.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/colt-1.2.0.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/google-collections-0.9.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/

Really this needs to be manually updated whenever any of the libraries are updated. If you see this error:

Caused by: java.lang.RuntimeException: error in opening zip file
        at org.reflections.util.VirtualFile.iterable(
        at org.reflections.util.VirtualFile$5.transform(
        at org.reflections.util.VirtualFile$5.transform(
        at org.reflections.util.FluentIterable$3.transform(
        at org.reflections.util.FluentIterable$3.transform(
        at org.reflections.util.FluentIterable$ForkIterator.computeNext(
        at org.reflections.util.FluentIterable$FilterIterator.computeNext(
        at org.reflections.util.FluentIterable$TransformIterator.computeNext(
        at org.reflections.Reflections.scan(
        at org.reflections.Reflections.<init>(
        at org.broadinstitute.sting.utils.PackageUtils.<clinit>(

It's because the libraries aren't updated. Basically just do an ls of your trunk/dist directory after the GATK has been build, make this your classpath as above, and tack on:


A command that almost works (but you'll need to replace the spaces with colons) is:

#setenv CLASSPATH $CLASSPATH `ls /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/*.jar` /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/

3. Building scala code

All of the Scala source code lives in scala/src, which you build using ant scala

There are already some example Scala walkers in scala/src, so doing a standard checkout, installing scala, settting up your environment, should allow you to run something like:

gsa2 ~/dev/GenomeAnalysisTK/trunk > ant scala
Buildfile: build.xml


     [echo] Sting: Compiling scala!
   [scalac] Compiling 2 source files to /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/scala/classes
   [scalac] warning: there were deprecation warnings; re-run with -deprecation for details
   [scalac] one warning found
   [scalac] Compile suceeded with 1 warning; see the compiler output for details.
   [delete] Deleting: /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar
      [jar] Building jar: /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar

4. Invoking a scala walker

Until we can include Scala walkers along with the main GATK jar (avoiding the classpath issue too) you have to invoke your scala walkers using this syntax:

java -Xmx2048m org.broadinstitute.sting.gatk.CommandLineGATK -T BaseTransitionTableCalculator -R /broad/1KG/reference/human_b36_both.fasta -I /broad/1KG/DCC_merged/freeze5/NA12878.pilot2.SLX.bam -l INFO -L 1:1-100

Here, the BaseTransitionTableCalculator walker is written in Scala and being loaded into the system by the GATK walker manager. Otherwise everything looks like a normal GATK module.