
This document provides an overview of the steps required to make a walker multi-threadable using the -nct and -nt arguments, which make use of the NanoSchedulable and TreeReducible interfaces, respectively.


NanoSchedulable / -nct

Providing -nct support requires that you certify that your walker's map() method is thread-safe -- e.g., if any data structures are shared across map() calls, access to them must be properly synchronized. Once your map() method is thread-safe, you can implement the NanoSchedulable interface, an empty marker interface with no methods that simply declares that your walker's map() method is safe to parallelize:

/**
 * Root parallelism interface.  Walkers that implement this
 * declare that their map function is thread-safe and so multiple
 * map calls can be run in parallel in the same JVM instance.
 */
public interface NanoSchedulable {
}
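
For example, a walker might declare itself NanoSchedulable as sketched below. This is a hypothetical, minimal example (the walker name and counter are made up, and GATK engine imports are omitted as in the other snippets in this document); the map() signature follows the locus walker convention shown elsewhere on this page. The key point is that the only state shared across map() calls is an AtomicLong, so concurrent map() invocations are safe.

import java.util.concurrent.atomic.AtomicLong;

public class CountCoveredLociWalker extends LocusWalker<Integer, Long> implements NanoSchedulable {
    // Shared across parallel map() calls, so it must be thread-safe.
    private final AtomicLong coveredLoci = new AtomicLong(0);

    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
        if ( context.getReads().size() > 0 )
            coveredLoci.incrementAndGet();  // atomic update, no explicit locking required
        return 1;
    }

    public Long reduceInit() { return 0L; }

    public Long reduce(Integer value, Long sum) { return sum + value; }
}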

TreeReducible / -nt

Providing -nt support requires that both map() and reduce() be thread-safe, and you also need to implement the TreeReducible interface. Implementing TreeReducible requires you to write a treeReduce() method that tells the engine how to combine the results of multiple reduce() calls:

public interface TreeReducible<ReduceType> {
    /**
     * A composite, 'reduce of reduces' function.
     * @param lhs 'left-most' portion of data in the composite reduce.
     * @param rhs 'right-most' portion of data in the composite reduce.
     * @return The composite reduce type.
     */
    ReduceType treeReduce(ReduceType lhs, ReduceType rhs);
}

This method differs from reduce() in that while reduce() adds the result of a single map() call onto a running total, treeReduce() takes the aggregated results from multiple map/reduce tasks that have been run in parallel and combines them. So, lhs and rhs might each represent the final result from several hundred map/reduce calls.

Example treeReduce() implementation from the UnifiedGenotyper:

public UGStatistics treeReduce(UGStatistics lhs, UGStatistics rhs) {
    lhs.nBasesCallable += rhs.nBasesCallable;
    lhs.nBasesCalledConfidently += rhs.nBasesCalledConfidently;
    lhs.nBasesVisited += rhs.nBasesVisited;
    lhs.nCallsMade += rhs.nCallsMade;
    return lhs;
}

1. Install scala somewhere

At the Broad, we typically put it somewhere like this:

/home/radon01/depristo/work/local/scala-2.7.5.final

Next, create a symlink from this directory to trunk/scala/installation:

ln -s /home/radon01/depristo/work/local/scala-2.7.5.final trunk/scala/installation

2. Setting up your path

Right now the only way to get scala walkers into the GATK is by explicitly setting your CLASSPATH in your .my.cshrc file:

setenv CLASSPATH /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/FourBaseRecaller.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GenomeAnalysisTK.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/Playground.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/StingUtils.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/bcel-5.2.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/colt-1.2.0.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/google-collections-0.9.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/javassist-3.7.ga.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/junit-4.4.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/log4j-1.2.15.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/picard-1.02.63.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/picard-private-875.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/reflections-0.9.2.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/sam-1.01.63.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/simple-xml-2.0.4.jar:/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar

Note that this classpath must be updated manually whenever any of the libraries change. If you see this error:

Caused by: java.lang.RuntimeException: java.util.zip.ZipException: error in opening zip file
        at org.reflections.util.VirtualFile.iterable(VirtualFile.java:79)
        at org.reflections.util.VirtualFile$5.transform(VirtualFile.java:169)
        at org.reflections.util.VirtualFile$5.transform(VirtualFile.java:167)
        at org.reflections.util.FluentIterable$3.transform(FluentIterable.java:43)
        at org.reflections.util.FluentIterable$3.transform(FluentIterable.java:41)
        at org.reflections.util.FluentIterable$ForkIterator.computeNext(FluentIterable.java:81)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
        at org.reflections.util.FluentIterable$FilterIterator.computeNext(FluentIterable.java:102)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
        at org.reflections.util.FluentIterable$TransformIterator.computeNext(FluentIterable.java:124)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:132)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:127)
        at org.reflections.Reflections.scan(Reflections.java:69)
        at org.reflections.Reflections.<init>(Reflections.java:47)
        at org.broadinstitute.sting.utils.PackageUtils.<clinit>(PackageUtils.java:23)

it's because the libraries haven't been updated. To fix it, do an ls of your trunk/dist directory after the GATK has been built, make that your classpath as above, and tack on:

/humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar

A command that almost works (but you'll need to replace the spaces with colons) is:

#setenv CLASSPATH $CLASSPATH `ls /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/*.jar` /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar:/humgen/gsa-scr1/depristo/local/scala-2.7.5.final/lib/scala-library.jar

3. Building scala code

All of the Scala source code lives in scala/src, which you build using ant scala.

There are already some example Scala walkers in scala/src, so after a standard checkout, installing scala, and setting up your environment, you should be able to run something like:

gsa2 ~/dev/GenomeAnalysisTK/trunk > ant scala
Buildfile: build.xml

init.scala:

scala:
     [echo] Sting: Compiling scala!
   [scalac] Compiling 2 source files to /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/scala/classes
   [scalac] warning: there were deprecation warnings; re-run with -deprecation for details
   [scalac] one warning found
   [scalac] Compile suceeded with 1 warning; see the compiler output for details.
   [delete] Deleting: /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar
      [jar] Building jar: /humgen/gsa-scr1/depristo/dev/GenomeAnalysisTK/trunk/dist/GATKScala.jar

4. Invoking a scala walker

Until we can include Scala walkers along with the main GATK jar (which would also avoid the classpath issue), you have to invoke your Scala walkers using this syntax:

java -Xmx2048m org.broadinstitute.sting.gatk.CommandLineGATK -T BaseTransitionTableCalculator -R /broad/1KG/reference/human_b36_both.fasta -I /broad/1KG/DCC_merged/freeze5/NA12878.pilot2.SLX.bam -l INFO -L 1:1-100

Here, the BaseTransitionTableCalculator walker is written in Scala and being loaded into the system by the GATK walker manager. Otherwise everything looks like a normal GATK module.


In addition to testing walkers individually, you may also want to run integration tests for your QScript pipelines.

1. Brief comparison to the Walker integration tests

  • Pipeline tests should use the standard location for testing data.
  • Pipeline tests use the same test dependencies.
  • Pipeline tests which generate MD5 results will have the results stored in the MD5 database.
  • Pipeline tests, like QScripts, are written in Scala.
  • Pipeline tests dry-run under the ant target pipelinetest and run under pipelinetestrun.
  • Pipeline test class names must end in PipelineTest to run under the ant target.
  • Pipeline tests should instantiate a PipelineTestSpec and then run it via PipelineTest.executeTest().

2. PipelineTestSpec

When building up a pipeline test spec, specify the following variables for your test.

Variable            Type                Description
args                String              The arguments to pass to the Queue test, e.g. -S scala/qscript/examples/HelloWorld.scala
jobQueue            String              Job queue to run the test. Default is null, which means use hour.
fileMD5s            Map[Path, MD5]      Expected MD5 results for each file path.
expectedException   classOf[Exception]  Expected exception from the test.

3. Example PipelineTest

The following example runs the ExampleCountLoci QScript on a small bam and verifies that the MD5 result is as expected.

It is checked into the Sting repository under scala/test/org/broadinstitute/sting/queue/pipeline/examples/ExampleCountLociPipelineTest.scala

package org.broadinstitute.sting.queue.pipeline.examples

import org.testng.annotations.Test
import org.broadinstitute.sting.queue.pipeline.{PipelineTest, PipelineTestSpec}
import org.broadinstitute.sting.BaseTest

class ExampleCountLociPipelineTest {
  @Test
  def testCountLoci {
    val testOut = "count.out"
    val spec = new PipelineTestSpec
    spec.name = "countloci"
    spec.args = Array(
      " -S scala/qscript/examples/ExampleCountLoci.scala",
      " -R " + BaseTest.hg18Reference,
      " -I " + BaseTest.validationDataLocation + "small_bam_for_countloci.bam",
      " -o " + testOut).mkString
    spec.fileMD5s += testOut -> "67823e4722495eb10a5e4c42c267b3a6"
    PipelineTest.executeTest(spec)
  }
}

4. Running Pipeline Tests

Dry Run

To test whether the script at least compiles with your arguments, run ant pipelinetest, specifying the name of your class with -Dsingle:

ant pipelinetest -Dsingle=ExampleCountLociPipelineTest

Sample output:

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour
   [testng]   => countloci PASSED DRY RUN
   [testng] PASSED: testCountLoci

Run

As of July 2011 the pipeline tests run against LSF 7.0.6 and Grid Engine 6.2u5. To include these two packages in your environment use the hidden dotkit .combined_LSF_SGE.

reuse .combined_LSF_SGE

Once you are satisfied that the dry run completes without error, actually run the pipeline test with ant pipelinetestrun:

ant pipelinetestrun -Dsingle=ExampleCountLociPipelineTest

Sample output:

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
   [testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] Checking MD5 for pipelinetests/countloci/run/count.out [calculated=67823e4722495eb10a5e4c42c267b3a6, expected=67823e4722495eb10a5e4c42c267b3a6]
   [testng]   => countloci PASSED
   [testng] PASSED: testCountLoci

Generating initial MD5s

If you don't know the MD5s yet, you can run the command yourself on the command line and then md5sum the outputs yourself, or you can set the MD5s in your test to "" and run the pipeline.

When the MD5s are blank as in:

spec.fileMD5s += testOut -> ""

You run:

ant pipelinetest -Dsingle=ExampleCountLociPipelineTest -Dpipeline.run=run

And the output will look like:

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
   [testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is , equal? = false
   [testng]   => countloci PASSED
   [testng] PASSED: testCountLoci

Checking MD5s

When a pipeline test fails due to an MD5 mismatch you can use the MD5 database to diff the results.

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
   [testng] ##### Updating MD5 file: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] Checking MD5 for pipelinetests/countloci/run/count.out [calculated=67823e4722495eb10a5e4c42c267b3a6, expected=67823e4722495eb10a5e0000deadbeef]
   [testng] ##### Test countloci is going fail #####
   [testng] ##### Path to expected   file (MD5=67823e4722495eb10a5e0000deadbeef): integrationtests/67823e4722495eb10a5e0000deadbeef.integrationtest
   [testng] ##### Path to calculated file (MD5=67823e4722495eb10a5e4c42c267b3a6): integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] ##### Diff command: diff integrationtests/67823e4722495eb10a5e0000deadbeef.integrationtest integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] FAILED: testCountLoci
   [testng] java.lang.AssertionError: 1 of 1 MD5s did not match.

If you need to examine a number of MD5s which may have changed, you can temporarily disable MD5 mismatch failures by setting parameterize = true.

spec.parameterize = true
spec.fileMD5s += testOut -> "67823e4722495eb10a5e4c42c267b3a6"

Then run:

ant pipelinetest -Dsingle=ExampleCountLociPipelineTest -Dpipeline.run=run

If there's a match the output will resemble:

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
   [testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is 67823e4722495eb10a5e4c42c267b3a6, equal? = true
   [testng]   => countloci PASSED
   [testng] PASSED: testCountLoci

While for a mismatch it will look like this:

   [testng] --------------------------------------------------------------------------------
   [testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
   [testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
   [testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is 67823e4722495eb10a5e0000deadbeef, equal? = false
   [testng]   => countloci PASSED
   [testng] PASSED: testCountLoci

Adding Third-party Dependencies

The GATK build system uses the Ivy dependency manager to make it easy for our users to add additional dependencies. Ivy can pull the latest jars and their dependencies from the Maven repository, making adding or updating a dependency as simple as adding a new line to the ivy.xml file.

If your tool is available in the maven repository, add a line to the ivy.xml file similar to the following:

<dependency org="junit" name="junit" rev="4.4" />

If you would like to add a dependency on a tool not available in the Maven repository, please email gsahelp@broadinstitute.org.

Updating SAM-JDK and Picard

Because we work so closely with the SAM-JDK/Picard team and are critically dependent on the code they produce, we have a special procedure for updating the SAM/Picard jars. Please use the following procedure when updating sam-*.jar or picard-*.jar.

  • Download and build the latest versions of Picard public and Picard private (restricted to Broad Institute users) from their respective svns.

  • Get the latest svn versions for picard public and picard private by running the following commands:

    svn info $PICARD_PUBLIC_HOME | grep "Revision"
    svn info $PICARD_PRIVATE_HOME | grep "Revision"

Updating the Picard public jars

  • Rename the jars and xmls in $STING_HOME/settings/repository/net.sf to {picard|sam}-$PICARD_PUBLIC_MAJOR_VERSION.$PICARD_PUBLIC_MINOR_VERSION.$PICARD_PUBLIC_SVN_REV.{jar|xml}

  • Update the jars in $STING_HOME/settings/repository/net.sf with their newer equivalents in $PICARD_PUBLIC_HOME/dist/picard_lib.

  • Update the xmls in $STING_HOME/settings/repository/net.sf with the appropriate version number ($PICARD_PUBLIC_MAJOR_VERSION.$PICARD_PUBLIC_MINOR_VERSION.$PICARD_PUBLIC_SVN_REV).

Updating the Picard private jar

  • Create the picard private jar with the following command:

    ant clean package -Dexecutable=PicardPrivate -Dpicard.dist.dir=${PICARD_PRIVATE_HOME}/dist

  • Rename picard-private-parts-*.jar in $STING_HOME/settings/repository/edu.mit.broad to picard-private-parts-$PICARD_PRIVATE_SVN_REV.jar.

  • Update picard-private-parts-*.jar in $STING_HOME/settings/repository/edu.mit.broad with the picard-private-parts.jar in $STING_HOME/dist/packages/picard-private-parts.

  • Update the xml in $STING_HOME/settings/repository/edu.mit.broad to reflect the new revision and publication date.


The primary goal of the GATK is to provide a suite of small data access patterns that can easily be parallelized and otherwise externally managed. As such, rather than asking walker authors how to iterate over a data stream, the GATK asks the user how data should be presented.

Locus walkers

Walk over the data set one location (single-base locus) at a time, presenting all overlapping reads, reference bases, and reference-ordered data.

1. Switching between covered and uncovered loci

The @By attribute can be used to control whether locus walkers see all loci or just covered loci. To switch between viewing all loci and covered loci, apply one of the following attributes:

@By(DataSource.REFERENCE)
@By(DataSource.READS)
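
For example, a locus walker that should only be invoked at covered loci could be declared as sketched below. This is a hypothetical, minimal example (engine imports are omitted, as in the other snippets on this page); the map() signature follows the locus walker convention used elsewhere in this document.

@By(DataSource.READS)   // only visit loci covered by at least one read
public class CoveredLocusCounterWalker extends LocusWalker<Integer, Long> {
    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
        return 1;  // every locus presented here has read coverage
    }

    public Long reduceInit() { return 0L; }

    public Long reduce(Integer value, Long sum) { return sum + value; }
}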

2. Filtering defaults

By default, the following filters are automatically added to every locus walker.

  • Reads with nonsensical alignments
  • Unmapped reads
  • Non-primary alignments
  • Duplicate reads
  • Reads failing vendor quality checks

ROD walkers

These walkers walk over the data set one location at a time, but only at locations covered by reference-ordered data. They are essentially a special case of locus walkers. ROD walkers are read-free traversals that operate over Reference Ordered Data and the reference genome at sites where there is ROD information. They are geared for high-performance traversal of many RODs and the reference, as in VariantEval and CallSetConcordance. Programmatically they are nearly identical to RefWalker<M,T> traversals, with a few quirks described below.

1. Differences from a RefWalker

  • RODWalkers are only called at sites where there is at least one non-interval ROD bound. For example, if you are exploring dbSNP and some GELI call set, the map function of a RODWalker will be invoked at all sites where there is a dbSNP record or a GELI record.

  • Because of this skipping, RODWalkers receive a context object that provides the number of reference bases skipped between map calls:

    nSites += context.getSkippedBases() + 1; // the skipped bases plus the current location

In order to get the final count of skipped bases at the end of an interval (or chromosome) the map function is called one last time with null ReferenceContext and RefMetaDataTracker objects. The alignment context can be accessed to get the bases skipped between the last (and final) ROD and the end of the current interval.

2. Filtering defaults

ROD walkers inherit the same filters as locus walkers:

  • Reads with nonsensical alignments
  • Unmapped reads
  • Non-primary alignments
  • Duplicate reads
  • Reads failing vendor quality checks

3. Example: converting VariantEval to a RODWalker

Changing to a RODWalker is very easy -- here's the new top of VariantEval, changing the system to a RodWalker from its old RefWalker state:

//public class VariantEvalWalker extends RefWalker<Integer, Integer> {
public class VariantEvalWalker extends RodWalker<Integer, Integer> {

The map function must now capture the number of skipped bases and protect itself from the final interval map calls:

public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    nMappedSites += context.getSkippedBases();

    if ( ref == null ) { // we are seeing the last site
        return 0;
    }

    nMappedSites++;

That's all there is to it!

4. Performance improvements

A ROD walker can be much more efficient than a RefWalker when your RODs are sparse. Here is a comparison of the ROD and Ref walker implementations of VariantEval:

                                          RODWalker   RefWalker
dbSNP and 1KG Pilot 2 SNP calls on chr1   164u (s)    768u (s)
Just 1KG Pilot 2 SNP calls on chr1        54u (s)     666u (s)

Read walkers

Read walkers walk over the data set one read at a time, presenting all overlapping reference bases and reference-ordered data.

Filtering defaults

By default, the following filters are automatically added to every read walker.

  • Reads with nonsensical alignments

Read pair walkers

Read pair walkers walk over a queryname-sorted BAM, presenting each read together with its mate. No reference bases or reference-ordered data are presented.

Filtering defaults

By default, the following filters are automatically added to every read pair walker.

  • Reads with nonsensical alignments

Duplicate walkers

Duplicate walkers walk over a read and all its marked duplicates. No reference bases or reference-ordered data are presented.

Filtering defaults

By default, the following filters are automatically added to every duplicate walker.

  • Reads with nonsensical alignments
  • Unmapped reads
  • Non-primary alignments

Brief introduction to reference metadata (RMDs)

Note that the -B flag referred to below is deprecated; these docs need to be updated

The GATK allows you to process arbitrary numbers of reference metadata (RMD) files inside of walkers (previously we called this reference ordered data, or ROD). Common RMDs are things like dbSNP, VCF call files, and refseq annotations. The only real constraints on RMD files are that:

  • They must contain information necessary to provide contig and position data for each element to the GATK engine so it knows with what loci to associate the RMD element.

  • The file must be sorted with regard to the reference fasta file so that data can be accessed sequentially by the engine.

  • The file must have a Tribble RMD parsing class associated with the file type so that elements in the RMD file can be parsed by the engine.

Inside of the GATK the RMD system has the concept of RMD tracks, which associate an arbitrary string name with the data in the associated RMD file. For example, the VariantEval module uses the named track eval to get calls for evaluation, and dbsnp as the track containing the database of known variants.

How do I get reference metadata files into my walker?

RMD files are extremely easy to get into the GATK using the -B syntax:

java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:variant,VCF calls.vcf

In this example, the GATK will attempt to parse the file calls.vcf using the VCF parser and bind the VCF data to the RMD track named variant.

In general, you can provide as many RMD bindings to the GATK as you like:

java -jar GenomeAnalysisTK.jar -R Homo_sapiens_assembly18.fasta -T PrintRODs -B:calls1,VCF calls1.vcf -B:calls2,VCF calls2.vcf

This works just as well. Some modules may require specifically named RMD tracks -- like eval above -- and some are happy to just assess all RMD tracks of a certain class and work with those -- like VariantsToVCF.

1. Directly getting access to a single named track

In this snippet from SNPDensityWalker, we grab the eval track as a VariantContext object, only for the variants that are of type SNP:

public Pair<VariantContext, GenomeLoc> map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    VariantContext vc = tracker.getVariantContext(ref, "eval", EnumSet.of(VariantContext.Type.SNP), context.getLocation(), false);
}

2. Grabbing anything that's convertible to a VariantContext

From VariantsToVCF we call the helper function tracker.getVariantContexts to look at all of the RMDs and convert what it can to VariantContext objects.

Allele refAllele = new Allele(Character.toString(ref.getBase()), true);
Collection<VariantContext> contexts = tracker.getVariantContexts(INPUT_RMD_NAME, ALLOWED_VARIANT_CONTEXT_TYPES, context.getLocation(), refAllele, true, false);

3. Looking at all of the RMDs

Here's a totally general code snippet from PileupWalker.java. This code, as you can see, iterates over all of the GATKFeature objects in the reference ordered data, converting each RMD to a string and capturing these strings in a list. It finally grabs the dbSNP binding specifically for a more detailed string conversion, and then binds them all up in a single string for display along with the read pileup.

private String getReferenceOrderedData( RefMetaDataTracker tracker ) {
    ArrayList rodStrings = new ArrayList();
    for ( GATKFeature datum : tracker.getAllRods() ) {
        if ( datum != null && ! (datum.getUnderlyingObject() instanceof DbSNPFeature)) {
            rodStrings.add(((ReferenceOrderedDatum)datum.getUnderlyingObject()).toSimpleString()); // TODO: Aaron: this line still survives, try to remove it
        }
    }
    String rodString = Utils.join(", ", rodStrings);

    DbSNPFeature dbsnp = tracker.lookup(DbSNPHelper.STANDARD_DBSNP_TRACK_NAME, DbSNPFeature.class);

    if ( dbsnp != null )
        rodString += DbSNPHelper.toMediumString(dbsnp);

    if ( !rodString.equals("") )
        rodString = "[ROD: " + rodString + "]";

    return rodString;
}

How do I write my own RMD types?

Tracks of reference metadata are loaded using the Tribble infrastructure. Tracks are loaded using the feature codec and underlying type information. See the Tribble documentation for more information.

Tribble codecs that are in the classpath are automatically found; the GATK discovers all classes that implement the FeatureCodec interface. Name resolution occurs using the -B type parameter, i.e. if the user specified:

-B:calls1,VCF calls1.vcf

The GATK looks for a FeatureCodec called VCFCodec.java to decode the record type. Alternately, if the user specified:

-B:calls1,MYAwesomeFormat calls1.maft

The GATK would look for a codec called MYAwesomeFormatCodec.java. This look-up is not case sensitive, i.e. it will resolve MyAwEsOmEfOrMaT as well, though why you would want to write something so painfully ugly to read is beyond us.


1. Overview

The Tribble project was started as an effort to overhaul our reference-ordered data system; we had many different formats that were shoehorned into a common framework that didn't really work as intended. What we wanted was a common framework that allowed for searching of reference ordered data, regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, which was incorporated into the Tribble library.

2. Architecture Overview

Tribble provides a lightweight interface and API for querying features and creating indexes from feature files, while allowing iteration over known feature files that we're unable to create indexes for. The main entry point for external users is the BasicFeatureReader class. It takes in a codec, an index file, and a file containing the features to be processed. With an instance of a BasicFeatureReader, you can query for features that span a specific location, or get an iterator over all the records in the file.

3. Developer Overview

For developers, there are two important classes to implement: the FeatureCodec, which decodes lines of text and produces features, and the Feature class, which is your underlying record type:

  • Feature

    This is the genomically oriented feature that represents the underlying data in the input file. For instance, in the VCF format this is the variant call, including quality information, the reference base, and the alternate base. The required information to implement a feature is the chromosome name, the start position (one-based), and the stop position. The start and stop positions represent a closed, one-based interval, i.e. the first base in chromosome one would be chr1:1-1.

  • FeatureCodec

    This class takes in a line of text (from an input source, whether it's a file, compressed file, or an http link) and produces the above feature.

To implement your new format into Tribble, you need to implement the two above classes (in an appropriately named subfolder in the Tribble check-out). The Feature object should know nothing about the file representation; it should represent the data as an in-memory object. The interface for a feature looks like:

public interface Feature {

    /**
     * Return the features reference sequence name, e.g chromosome or contig
     */
    public String getChr();

    /**
     * Return the start position in 1-based coordinates (first base is 1)
     */
    public int getStart();

    /**
     * Return the end position following 1-based fully closed conventions.  The length of a feature is
     * end - start + 1;
     */
    public int getEnd();
}

And the interface for FeatureCodec:

/**
 * the base interface for classes that read in features.
 * @param <T> The feature type this codec reads
 */
public interface FeatureCodec<T extends Feature> {
    /**
     * Decode a line to obtain just its FeatureLoc for indexing -- contig, start, and stop.
     *
     * @param line the input line to decode
     * @return  Return the FeatureLoc encoded by the line, or null if the line does not represent a feature (e.g. is
     * a comment)
     */
    public Feature decodeLoc(String line);

    /**
     * Decode a line as a Feature.
     *
     * @param line the input line to decode
     * @return  Return the Feature encoded by the line,  or null if the line does not represent a feature (e.g. is
     * a comment)
     */
    public T decode(String line);

    /**
     * This function returns the object the codec generates.  This is allowed to be Feature in the case where
     * conditionally different types are generated.  Be as specific as you can though.
     *
     * This function is used by reflections based tools, so we can know the underlying type
     *
     * @return the feature type this codec generates.
     */
    public Class<T> getFeatureType();


    /**  Read and return the header, or null if there is no header.
     *
     * @return header object
     */
    public Object readHeader(LineReader reader);
}
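
To make this concrete, here is a minimal, hypothetical pair of classes for a simple tab-delimited format (contig, start, stop, name). This is a sketch against the interfaces shown above, not an existing codec in the codebase; imports are omitted as in the other snippets.

// A minimal Feature: coordinates plus a single payload field.
public class SimpleIntervalFeature implements Feature {
    private final String chr;
    private final int start, end;
    private final String name;

    public SimpleIntervalFeature(String chr, int start, int end, String name) {
        this.chr = chr;
        this.start = start;
        this.end = end;
        this.name = name;
    }

    public String getChr()  { return chr; }
    public int getStart()   { return start; }  // 1-based
    public int getEnd()     { return end; }    // 1-based, fully closed
    public String getName() { return name; }
}

// The matching codec: one line of text in, one SimpleIntervalFeature out.
public class SimpleIntervalCodec implements FeatureCodec<SimpleIntervalFeature> {
    public Feature decodeLoc(String line) {
        return decode(line);  // location-only decoding is the same as full decoding here
    }

    public SimpleIntervalFeature decode(String line) {
        if ( line.startsWith("#") )
            return null;      // comment lines do not represent features
        String[] fields = line.split("\t");
        return new SimpleIntervalFeature(fields[0],
                                         Integer.parseInt(fields[1]),
                                         Integer.parseInt(fields[2]),
                                         fields.length > 3 ? fields[3] : null);
    }

    public Class<SimpleIntervalFeature> getFeatureType() {
        return SimpleIntervalFeature.class;
    }

    public Object readHeader(LineReader reader) {
        return null;          // this toy format has no header
    }
}

Following the codec name-resolution rule described earlier, binding a file with -B:myintervals,SimpleInterval would cause the engine to look for this SimpleIntervalCodec.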

4. Supported Formats

The following formats are supported in Tribble:

  • VCF Format
  • DbSNP Format
  • BED Format
  • GATK Interval Format

5. Updating the Tribble, htsjdk, and/or Picard library

Updating the revision of Tribble on the system is a relatively straightforward task if the following steps are taken.

NOTE: Any directory starting with ~ may be different on your machine, depending on where you cloned the various repositories for gsa-unstable, picard, and htsjdk.

A Maven script to install picard into the local repository is located under gsa-unstable/private/picard-maven. To operate, it requires a symbolic link named picard pointing to a working checkout of the picard github repository. NOTE: compiling picard requires an htsjdk github repository checkout available at picard/htsjdk, either as a subdirectory or another symbolic link. The final full path should be gsa-unstable/private/picard-maven/picard/htsjdk.

cd ~/src/gsa-unstable
cd private/picard-maven
ln -s ~/src/picard picard

Create a git branch of Picard and/or htsjdk and make your changes. To install your changes into the GATK you must run mvn install in the private/picard-maven directory. This will compile and copy the jars into gsa-unstable/public/repo and update gsa-unstable/gatk-root/pom.xml with the corresponding version. While making changes, your revisions of picard and htsjdk will be labeled with -SNAPSHOT.

cd ~/src/gsa-unstable
cd private/picard-maven
mvn install

Continue testing in the GATK. Once your changes and updated tests for picard/htsjdk are complete, push your branch and submit your pull request to the Picard and/or htsjdk github. After your Picard/htsjdk patches are accepted, switch your Picard/htsjdk branches back to the master branch. NOTE: Leave your gsa-unstable branch on your development branch!

cd ~/src/picard
ant clean
git checkout master
git fetch
git rebase
cd htsjdk
git checkout master
git fetch
git rebase

NOTE: The version numbers of the old and new Picard/htsjdk will vary, and during active development will end with -SNAPSHOT. While you may, if needed, push a -SNAPSHOT version for testing on Bamboo, you should NOT submit a pull request with a -SNAPSHOT version. -SNAPSHOT indicates your local changes are not reproducible from source control.

When ready, run mvn install once more to create the non-SNAPSHOT versions under gsa-unstable/public/repo. In that directory, git add the new versions and git rm the old versions.

cd ~/src/gsa-unstable
cd public/repo
git add picard/picard/1.115.1499/
git add samtools/htsjdk/1.115.1509/
git rm -r picard/picard/1.112.1452/
git rm -r samtools/htsjdk/1.112.1452/

Commit and then push your gsa-unstable branch, then issue a pull request for review.


1. Introduction

The LocusTraversal now supports passing walkers reads that have deletions spanning the current locus. This is useful in many situations, for example when you want to calculate coverage or call variants and need to avoid calling variants where there are a lot of deletions.

Currently, the system by default will not pass you deletion-spanning reads. In order to see them, you need to override the function:

/**
 * (conceptual static) method that states whether you want to see reads piling up at a locus
 * that contain a deletion at the locus.
 *
 * ref:   ATCTGA
 * read1: ATCTGA
 * read2: AT--GA
 *
 * Normally, the locus iterator only returns a list of read1 at this locus at position 3, but
 * if this function returns true, then the system will return (read1, read2) with offsets
 * of (3, -1).  The -1 offset indicates a deletion in the read.
 *
 * @return false if you don't want to see deletions, or true if you do
 */
public boolean includeReadsWithDeletionAtLoci() { return true; }

in your walker. Now you will start seeing deletion-spanning reads in your walker. These reads are flagged with offsets of -1, so that you can:

    for ( int i = 0; i < context.getReads().size(); i++ ) {
        SAMRecord read = context.getReads().get(i);
        int offset = context.getOffsets().get(i);

        if ( offset == -1 )
            nDeletionReads++;
        else
            nCleanReads++;
    }

There are also two convenience functions in AlignmentContext to extract subsets of the reads with and without spanning deletions:

/**
 * Returns only the reads in ac that do not contain spanning deletions of this locus
 * 
 * @param ac
 * @return
 */
public static AlignmentContext withoutSpanningDeletions( AlignmentContext ac );

/**
 * Returns only the reads in ac that do contain spanning deletions of this locus
 * 
 * @param ac
 * @return
 */
public static AlignmentContext withSpanningDeletions( AlignmentContext ac );
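
A minimal usage sketch, assuming these static methods live on AlignmentContext as declared above (the counter variables are hypothetical):

    // Split the pileup into reads with and without a deletion spanning this locus.
    AlignmentContext cleanReads    = AlignmentContext.withoutSpanningDeletions(context);
    AlignmentContext deletionReads = AlignmentContext.withSpanningDeletions(context);

    nCleanReads    += cleanReads.getReads().size();
    nDeletionReads += deletionReads.getReads().size();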

The GATK discovers walker documentation by reading it out of the Javadoc, Sun's design pattern for providing documentation for packages and classes. This page will provide an extremely brief explanation of how to write Javadoc; more information on writing javadoc comments can be found in Sun's documentation.

1. Adding walker and package descriptions to the help text

The GATK's build system uses the javadoc parser to extract the javadoc for classes and packages and embed the contents of that javadoc in the help system. If you add Javadoc to your package or walker, it will automatically appear in the help. The javadoc parser will pick up on 'standard' javadoc comments, such as the following, taken from PrintReadsWalker:

/**
 * This walker prints out the input reads in SAM format.  Alternatively, the walker can write reads into a specified BAM file.
 */

You can add javadoc to your package by creating a special file, package-info.java, in the package directory. This file should consist of the javadoc for your package plus a package descriptor line. One such example follows:

/**
 * @help.display.name Miscellaneous walkers (experimental)
 */
package org.broadinstitute.sting.playground.gatk.walkers;

Additionally, the GATK provides three extra custom tags for overriding the information that ultimately makes it into the help; an example using the walker-level tags follows the list below.

  • @help.display.name Changes the name of the package as it appears in help. Note that the name of the walker cannot be changed as it is required to be passed verbatim to the -T argument.

  • @help.summary Changes the description which appears on the right-hand column of the help text. This is useful if you'd like to provide a more concise description of the walker that should appear in the help.

  • @help.description Changes the description which appears at the bottom of the help text when -T <your walker> --help is specified. This is useful if you'd like to present a more complete description of your walker.
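
For example, a walker's class-level javadoc might use the last two tags as follows (a hypothetical sketch; the summary and description text here are placeholders):

/**
 * This walker prints out the input reads in SAM format.  Alternatively, the walker
 * can write reads into a specified BAM file.
 *
 * @help.summary Prints reads in SAM format, optionally writing them to a BAM file
 * @help.description This walker traverses every read in the input and emits it unchanged,
 * either to standard output in SAM format or to a specified BAM file.
 */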

2. Hiding experimental walkers (use sparingly, please!)

Walkers can be hidden from the documentation system by adding the @Hidden annotation to the top of each walker. @Hidden walkers can still be run from the command-line, but their documentation will not be visible to end users. Please use this functionality sparingly to avoid walkers with hidden command-line options that are required for production use.

3. Disabling building of help

Because the building of our help text is actually heavyweight and can dramatically increase compile time on some systems, we have a mechanism to disable help generation.

Compile with the following command:

ant -Ddisable.help=true

to disable generation of help.


1. Analysis output overview

In theory, an analysis output can be any class implementing the OutputStream interface. In practice, three types of classes are commonly used: PrintStreams for plain text files, SAMFileWriters for BAM files, and VCFWriters for VCF files.

2. PrintStream

To declare a basic PrintStream for output, use the following declaration syntax:

@Output
public PrintStream out;

And use it just as you would any other PrintStream:

out.println("Hello, world!");

By default, @Output streams prepopulate fullName, shortName, required, and doc. Here, required means that the GATK will always fill in the contents of the out field for you: if the user specifies no --out command-line argument, the out field will be prepopulated with a stream pointing to System.out.

If your walker outputs a custom format that requires more than simple concatenation by Queue, you should also implement a custom Gatherer.

3. SAMFileWriter

For some applications, you might need to manage your own SAM readers and writers directly from inside your walker. Current best practice for creating these readers/writers is to declare arguments of type SAMFileReader or SAMFileWriter, as in the following example:

@Output
SAMFileWriter outputBamFile = null;

If you do not specify the full name and short name, the writer will provide system default names for these arguments. Creating a SAMFileWriter in this way will create the type of writer most commonly used by members of the GSA group at the Broad Institute -- it will use the same header as the input BAM and require presorted data. To change either of these attributes, use the StingSAMFileWriter interface instead:

@Output
StingSAMFileWriter outputBamFile = null;

and later, in initialize(), run one or both of the following methods:

outputBamFile.writeHeader(customHeader);
outputBamFile.setPresorted(false);

You can change the header or presorted state until the first alignment is written to the file.

4. VCFWriter

VCFWriter outputs behave similarly to PrintStreams and SAMFileWriters. Declare a VCFWriter as follows:

@Output(doc="File to which variants should be written", required=true)
protected VCFWriter writer = null;

5. Debugging Output

The walkers provide a protected logger instance. Users can adjust the debug level of the walkers using the -l command line option.

Turning on verbose logging can produce more output than is really necessary. To selectively turn on logging for a class or package, specify a log4j.properties property file from the command line as follows:

-Dlog4j.configuration=file:///<your development root>/Sting/java/config/log4j.properties

An example log4j.properties file is available in the java/config directory of the Git repository.


1. Testing core walkers is critical

Most GATK walkers are really too complex to test easily using the standard unit test framework. It's just not feasible to make artificial read piles and then extrapolate from simple passing tests whether the system as a whole is working correctly. However, we need some way to determine whether changes to the core of the GATK are altering the expected output of complex walkers like BaseRecalibrator or SingleSampleGenotyper. In addition to correctness, we want to make sure that the performance of key walkers isn't degrading over time, so that calling SNPs, cleaning indels, etc., isn't slowly creeping down. Since we are now using a bamboo server to automatically build and run unit tests (as well as measure their runtimes), we want to put as many good walker tests as possible into the test framework so we capture performance metrics over time.

2. The WalkerTest framework

To make this testing process easier, we've created a WalkerTest framework that lets you invoke the GATK using command-line GATK commands in the JUnit system and test for changes in your output files by comparing the current ant build results to previous runs via MD5 sums. It's a bit coarse-grained, but it works to ensure that changes to key walkers are detected quickly, and authors can either update the expected MD5s or go track down bugs.

The system is fairly straightforward to use. Ultimately we will end up with JUnit-style tests in the unit testing structure. The code below checks the MD5 of the SingleSampleGenotyper's GELI text output at LOD 3 and LOD 10.

package org.broadinstitute.sting.gatk.walkers.genotyper;

import org.broadinstitute.sting.WalkerTest;
import org.junit.Test;

import java.util.HashMap;
import java.util.Map;
import java.util.Arrays;

public class SingleSampleGenotyperTest extends WalkerTest {
    @Test
    public void testLOD() {
        HashMap<Double, String> e = new HashMap<Double, String>();
        e.put( 10.0, "e4c51dca6f1fa999f4399b7412829534" );
        e.put( 3.0, "d804c24d49669235e3660e92e664ba1a" );

        for ( Map.Entry<Double, String> entry : e.entrySet() ) {
            WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
                   "-T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout %s --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod " + entry.getKey(), 1,
                    Arrays.asList(entry.getValue()));
            executeTest("testLOD", spec);
        }
    }
}

The fundamental piece here is to inherit from WalkerTest. This gives you access to the executeTest() function that consumes a WalkerTestSpec:

    public WalkerTestSpec(String args, int nOutputFiles, List<String> md5s)

The WalkerTestSpec takes regular, command-line style GATK arguments describing what you want to run, the number of output files the walker will generate, and your expected MD5s for each of these output files. The args string can contain %s String.format specifications, and for each of the nOutputFiles, the executeTest() function will (1) generate a tmp file for output and (2) call String.format on your args to fill in the tmp output files in your arguments string. For example, in the above argument string varout is followed by %s, so our single SingleSampleGenotyper output is the variant output file.

3. Example output

When you add a WalkerTest-derived unit test to the GATK and then run the test build target, you'll see output that looks like:

[junit] WARN  13:29:50,068 WalkerTest - -------------------------------------------------------------------------------- 
[junit] WARN  13:29:50,068 WalkerTest - -------------------------------------------------------------------------------- 
[junit] WARN  13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit]  
[junit] WARN  13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit]  
[junit] WARN  13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a] 
[junit] WARN  13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a] 
[junit] WARN  13:30:39,408 WalkerTest -   => testLOD PASSED 
[junit] WARN  13:30:39,408 WalkerTest -   => testLOD PASSED 
[junit] WARN  13:30:39,409 WalkerTest - -------------------------------------------------------------------------------- 
[junit] WARN  13:30:39,409 WalkerTest - -------------------------------------------------------------------------------- 
[junit] WARN  13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit]  
[junit] WARN  13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit]  
[junit] WARN  13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534] 
[junit] WARN  13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534] 
[junit] WARN  13:31:30,213 WalkerTest -   => testLOD PASSED 
[junit] WARN  13:31:30,213 WalkerTest -   => testLOD PASSED 
[junit] WARN  13:31:30,214 SingleSampleGenotyperTest -  
[junit] WARN  13:31:30,214 SingleSampleGenotyperTest -  

4. Recommended location for GATK testing data

We keep all of the permanent GATK testing data in:

/humgen/gsa-scr1/GATK_Data/Validation_Data/

A good set of data to use for walker testing is the CEU daughter data from 1000 Genomes:

gsa2 ~/dev/GenomeAnalysisTK/trunk > ls -ltr /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.bam /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.calls
-rw-rw-r--+ 1 depristo wga  51M 2009-09-03 07:56 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam
-rw-rw-r--+ 1 depristo wga 185K 2009-09-04 13:21 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.variants.geli.calls
-rw-rw-r--+ 1 depristo wga 164M 2009-09-04 13:22 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.genotypes.geli.calls
-rw-rw-r--+ 1 depristo wga  24M 2009-09-04 15:00 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SOLID.bam
-rw-rw-r--+ 1 depristo wga  12M 2009-09-04 15:01 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.454.bam
-rw-r--r--+ 1 depristo wga  91M 2009-09-04 15:02 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam

5. Test dependencies

The tests depend on a variety of input files that are generally constrained to three mount points on the internal Broad network:

  • /seq/
  • /humgen/1kg/
  • /humgen/gsa-hpprojects/GATK/Data/Validation_Data/

To run the unit and integration tests you'll have to have access to these files. They may have different mount points on your machine (say, if you're running remotely over the VPN and have mounted the directories on your own machine).

6. MD5 database and comparing MD5 results

Every file that generates an MD5 sum as part of the WalkerTest framework will be copied to <MD5>.integrationtest in the integrationtests subdirectory of the GATK trunk. This MD5 database of results enables you to easily examine the results of an integration test as well as compare the results of a test before and after a code change. For example, below is a test for the UnifiedGenotyper where, due to a code change, the output VCF differs from the VCF with the expected MD5 value in the test code itself. The test provides the paths to the two results files as well as a diff command to compare the expected to the observed MD5:

[junit] --------------------------------------------------------------------------------    
[junit] Executing test testParameter[-genotype] with GATK arguments: -T UnifiedGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05997727998894311741.tmp -L 1:10,000,000-10,010,000 -genotype    
[junit] ##### MD5 file is up to date: integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest    
[junit] Checking MD5 for /tmp/walktest.tmp_param.05997727998894311741.tmp [calculated=ab20d4953b13c3fc3060d12c7c6fe29d, expected=0ac7ab893a3f550cb1b8c34f28baedf6]    
[junit] ##### Test testParameter[-genotype] is going fail #####    
[junit] ##### Path to expected   file (MD5=0ac7ab893a3f550cb1b8c34f28baedf6): integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest    
[junit] ##### Path to calculated file (MD5=ab20d4953b13c3fc3060d12c7c6fe29d): integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest    
[junit] ##### Diff command: diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest

Examining the diff, we see a few lines where the DP count changed in the new code:

> diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest  | head
385,387c385,387
< 1     10000345        .       A       .       106.54  .       AN=2;DP=33;Dels=0.00;MQ=89.17;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:25:-0.09,-7.57,-75.74:74.78
< 1     10000346        .       A       .       103.75  .       AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:24:-0.07,-7.27,-76.00:71.99
< 1     10000347        .       A       .       109.79  .       AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:26:-0.05,-7.85,-84.74:78.04
---
> 1     10000345        .       A       .       106.54  .       AN=2;DP=32;Dels=0.00;MQ=89.50;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:25:-0.09,-7.57,-75.74:74.78
> 1     10000346        .       A       .       103.75  .       AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:24:-0.07,-7.27,-76.00:71.99
> 1     10000347        .       A       .       109.79  .       AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00   GT:DP:GL:GQ     0/0:26:-0.05,-7.85,-84.74:78

Whether this is the expected change is up to you to decide, but the system makes it as easy as possible to see the consequences of your code change.

7. Testing for Exceptions

The walker test framework supports an additional syntax for ensuring that a particular Java exception is thrown when a walker executes, using a simple alternate version of the WalkerTestSpec constructor. Rather than specifying the MD5 of the result, you provide a single subclass of Exception, and the testing framework will ensure that when the walker runs, an instance of your expected exception (that class or a subclass) is thrown. The system also flags an error if no exception is thrown.

For example, the following code tests that the GATK can detect and error out when incompatible VCF and FASTA files are given:

@Test public void fail8() { executeTest("hg18lex-v-b36", test(lexHG18, callsB36)); }

private WalkerTest.WalkerTestSpec test(String ref, String vcf) {
    return new WalkerTest.WalkerTestSpec("-T VariantsToTable -M 10 -B:two,vcf "
            + vcf + " -F POS,CHROM -R "
            + ref +  " -o %s",
            1, UserException.IncompatibleSequenceDictionaries.class);

}

During the integration test this looks like:

[junit] Executing test hg18lex-v-b36 with GATK arguments: -T VariantsToTable -M 10 -B:two,vcf /humgen/gsa-hpprojects/GATK/data/Validation_Data/lowpass.N3.chr1.raw.vcf -F POS,CHROM -R /humgen/gsa-hpprojects/GATK/data/Validation_Data/lexFasta/lex.hg18.fasta -o /tmp/walktest.tmp_param.05541601616101756852.tmp -l WARN -et NO_ET
[junit]    [junit] Wanted exception class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries, saw class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries
[junit]   => hg18lex-v-b36 PASSED

8. Miscellaneous information

  • Please do not put any extremely long tests in the regular ant build test target. We are currently splitting the system into fast and slow tests so that unit tests can be run in < 3 minutes while saving a test target for long-running regression tests. More information on that will be posted.

  • An expected MD5 string of "" means don't check for equality between the calculated and expected MD5s. This is useful if you are just writing a new test and don't yet know the true output.

  • Override parameterize() { return true; } if you want the system to just run your calculations across all tests, rather than throwing an error when your MD5s don't match.

  • If your tests suddenly stop producing matching MD5s, you can either (1) look at the .tmp output files directly or (2) grab the printed GATK command line and explore what is happening.

  • You can always run a GATK walker on the command line and then run md5sum on its output files to obtain the expected MD5 results outside of the testing framework.

  • Don't worry about the duplication of lines in the output; it's just an annoyance of having two global loggers. Eventually we'll fix this.

Comments (9)

This script can be used to sort an input file according to the contig order of a reference dictionary.

#!/usr/bin/perl -w

use strict;
use Getopt::Long;

sub usage {

    print "\nUsage:\n";
    print "sortByRef.pl [--k POS] INPUT REF_DICT\n\n";

    print " Sorts lines of the input file INFILE according\n";
    print " to the reference contig order specified by the\n";
    print " reference dictionary REF_DICT (.fai file).\n";
    print " The sort is stable. If -k option is not specified,\n";
    print " it is assumed that the contig name is the first\n";
    print " field in each line.\n\n";
    print "  INPUT      input file to sort. If '-' is specified, \n";
    print "             then reads from STDIN.\n";
    print "  REF_DICT   .fai file, or ANY file that has contigs, in the\n";
    print "             desired soting order, as its first column.\n";
    print "  --k POS :  contig name is in the field POS (1-based)\n";
    print "             of input lines.\n\n";

    exit(1);
}

my $pos = 1;
GetOptions( "k:i" => \$pos );

$pos--;

usage() if ( scalar(@ARGV) == 0 );

if ( scalar(@ARGV) != 2 ) {
    print "Wrong number of arguments\n";
    usage();
}

my $input_file = $ARGV[0];
my $dict_file = $ARGV[1];


open(DICT, "< $dict_file") or die("Can not open $dict_file: $!");

my %ref_order;

my $n = 0;
while ( <DICT> ) {
    chomp;
    my ($contig, $rest) = split "\t";
    die("Dictionary file is probably corrupt: multiple instances of contig $contig") if ( defined $ref_order{$contig} );

    $ref_order{$contig} = $n;
    $n++;
}

close DICT;
#we have loaded contig ordering now

my $INPUT;
if ($input_file eq "-" ) {
    $INPUT = "STDIN";
} else {
    open($INPUT, "< $input_file") or die("Can not open $input_file: $!");
}

my %temp_outputs;

while ( <$INPUT> ) {

    my @fields = split '\s';
    die("Specified field position exceeds the number of fields:\n$_") 
        if ( $pos >= scalar(@fields) );

    my $contig = $fields[$pos];
    if ( $contig =~ m/:/ ) {
        my @loc = split(/:/, $contig);
        # print $contig . " " . $loc[0] . "\n";
        $contig = $loc[0]
    }
    chomp $contig if ( $pos == scalar(@fields) - 1 ); # if last field in line

    my $order;
    if ( defined $ref_order{$contig} ) { $order = $ref_order{$contig}; }
    else {
        $order = $n; # input line has contig that was not in the dict; 
        $n++; # this contig will go at the end of the output, 
              # after all known contigs
    }

    my $fhandle;
    if ( defined $temp_outputs{$order} ) { $fhandle = $temp_outputs{$order} }
    else {
        #print "opening $order $$ $_\n";
        open( $fhandle, " > /tmp/sortByRef.$$.$order.tmp" ) or
            die ( "Can not open temporary file $order: $!");
        $temp_outputs{$order} = $fhandle;
    }

    # we got the handle to the temp file that keeps all 
    # lines with contig $contig

    print $fhandle $_; # send current line to its corresponding temp file
}

close $INPUT;

foreach my $f ( values %temp_outputs ) { close $f; }

# now collect back into single output stream:

for ( my $i = 0 ; $i < $n ; $i++ ) {
    # if we did not have any lines on contig $i, then there's 
    # no temp file and nothing to do
    next if ( ! defined $temp_outputs{$i} ) ; 

    my $f; 
    open ( $f, "< /tmp/sortByRef.$$.$i.tmp" );
    while ( <$f> ) { print ; }
    close $f;

    unlink "/tmp/sortByRef.$$.$i.tmp";


}
Comments (0)

1. Introduction

When running either single-threaded or in shared-memory parallelism mode, the GATK guarantees that output written to an output stream created via the @Argument mechanism will ultimately be assembled in genomic order. In order to assemble the final output file, the GATK will write the output generated from each thread into a temporary output file, ultimately assembling the data via a central coordinating thread. There are three major elements in the GATK that facilitate this functionality:

  • Stub

    The front-end interface to the output management system. Stubs will be injected into the walker by the command-line argument system and relay information from the walker to the output management system. There will be one stub per invocation of the GATK.

  • Storage

    The back end interface, responsible for creating, writing and deleting temporary output files as well as merging their contents back into the primary output file. One Storage object will exist per shard processed in the GATK.

  • OutputTracker

    The dispatcher; ultimately connects the stub object's output creation request back to the most appropriate storage object to satisfy that request. One OutputTracker will exist per GATK invocation.

2. Basic Mechanism

Stubs are directly injected into the walker through the GATK's command-line argument parser as a go-between from walker to output management system. When a walker calls into the stub, its first responsibility is to call into the output tracker to retrieve an appropriate storage object. The behavior of the OutputTracker from this point forward depends mainly on the parallelization mode of this traversal of the GATK.

If the traversal is single-threaded:

  • the OutputTracker (implemented as DirectOutputTracker) will create the storage object if necessary and return it to the stub.

  • The stub will forward the request to the provided storage object.

  • At the end of the traversal, the microscheduler will request that the OutputTracker finalize and close the file.

If the traversal is multi-threaded using shared-memory parallelism:

  • The OutputTracker (implemented as ThreadLocalOutputTracker) will look for a storage object associated with this thread via a ThreadLocal.

  • If no such storage object exists, it will be created pointing to a temporary file.

  • At the end of each shard processed, that file will be closed and an OutputMergeTask will be created so that the shared-memory parallelism code can merge the output at its leisure.

  • The shared-memory parallelism code will merge when a fixed number of temporary files appear in the input queue. The constant used to determine this frequency is fixed at compile time (see HierarchicalMicroScheduler.MAX_OUTSTANDING_OUTPUT_MERGES).

3. Using output management

To use the output management system, declare a field in your walker of one of the existing core output types, coupled with either an @Argument or @Output annotation.

@Output(doc="Write output to this BAM filename instead of STDOUT")
SAMFileWriter out;

Currently supported output types are SAM/BAM (declare SAMFileWriter), VCF (declare VCFWriter), and any non-buffering stream extending from OutputStream.
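
For example, a walker that emits variants could declare a VCF output in the same way. The snippet below is a minimal sketch; the field name and doc text are illustrative rather than taken from any specific walker:

@Output(doc="File to which variants should be written", required=true)
protected VCFWriter vcfWriter = null;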

4. Implementing a new output type

To create a new output type, three types must be implemented: Stub, Storage, and ArgumentTypeDescriptor.

To implement Stub

Create a new Stub class, extending/inheriting the core output type's interface and implementing the Stub interface.

public class OutputStreamStub extends OutputStream implements Stub<OutputStream> {

Implement a register function so that the engine can provide the stub with the session's OutputTracker.

public void register( OutputTracker outputTracker ) {
    this.outputTracker = outputTracker;
}

Add as fields any parameters necessary for the storage object to create temporary storage.

private final File targetFile;
public File getOutputFile() { return targetFile; }

Implement/override every method in the core output type's interface to pass along calls to the appropriate storage object via the OutputTracker.

public void write( byte[] b, int off, int len ) throws IOException {
    outputTracker.getStorage(this).write(b, off, len);
}

To implement Storage

Create a Storage class, again extending/inheriting the core output type's interface and implementing the Storage interface.

public class OutputStreamStorage extends OutputStream implements Storage<OutputStream> {

Implement constructors that will accept just the Stub or Stub + alternate file path and create a repository for data, and a close function that will close that repository.

public OutputStreamStorage( OutputStreamStub stub ) { ... }
public OutputStreamStorage( OutputStreamStub stub, File file ) { ... }
public void close() { ... }

Implement a mergeInto function capable of reconstituting the file created by the constructor, dumping it back into the core output type's interface, and removing the source file.

public void mergeInto( OutputStream targetStream ) { ... }
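
As an illustration, a minimal mergeInto() for the OutputStream case might simply stream the temporary file back into the target and then delete it. This is only a sketch, and it assumes the storage object keeps its temporary file in a field named file:

public void mergeInto( OutputStream targetStream ) {
    try {
        // copy the contents of the temporary repository into the primary output
        FileInputStream contents = new FileInputStream(file);
        byte[] buffer = new byte[4096];
        int count;
        while ( (count = contents.read(buffer)) != -1 )
            targetStream.write(buffer, 0, count);
        contents.close();
    }
    catch ( IOException ex ) {
        throw new RuntimeException("Unable to merge temporary storage into the target stream", ex);
    }
    // the temporary file is no longer needed once its contents have been merged
    file.delete();
}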

Add a block to StorageFactory.createStorage() capable of creating the new storage object. TODO: use reflection to generate the storage classes.

    if(stub instanceof OutputStreamStub) {
        if( file != null )
            storage = new OutputStreamStorage((OutputStreamStub)stub,file);
        else
            storage = new OutputStreamStorage((OutputStreamStub)stub);
    }

To implement ArgumentTypeDescriptor

Create a new object inheriting from type ArgumentTypeDescriptor. Note that the ArgumentTypeDescriptor does NOT need to support the core output type's interface.

public class OutputStreamArgumentTypeDescriptor extends ArgumentTypeDescriptor {

Implement a truth function indicating which types this ArgumentTypeDescriptor can service.

 @Override
 public boolean supports( Class type ) {
     return OutputStream.class.equals(type);
 }

Implement a parse function that constructs the new Stub object. The function should register this type as an output by calling engine.addOutput(stub).

 public Object parse( ParsingEngine parsingEngine, ArgumentSource source, Type type, ArgumentMatches matches )  {
     ...
     OutputStreamStub stub = new OutputStreamStub(new File(fileName));
     ...
     engine.addOutput(stub);
     ...
     return stub;
 }

Add a creator for this new ArgumentTypeDescriptor in CommandLineExecutable.getArgumentTypeDescriptors().

 protected Collection<ArgumentTypeDescriptor> getArgumentTypeDescriptors() {
     return Arrays.asList( new VCFWriterArgumentTypeDescriptor(engine,System.out,argumentSources),
                           new SAMFileWriterArgumentTypeDescriptor(engine,System.out),
                           new OutputStreamArgumentTypeDescriptor(engine,System.out) );
 }

After creating these three objects, the new output type should be ready for usage as described above.

5. Outstanding issues

  • Only non-buffering output types are currently supported by the GATK. Of particular note, PrintWriter will appear to drop records if created by the command-line argument system; use PrintStream instead.

  • For efficiency, the GATK does not reduce output files together following the tree pattern used by shared-memory parallelism; output merges happen via an independent queue. Because of this, output merges happening during a treeReduce may not behave correctly.

Comments (3)

1. Introduction

The GATK provides an implementation of the Per-Base Alignment Qualities (BAQ) algorithm developed by Heng Li in late 2010. See this SamTools page for more details.

2. Using BAQ

The BAQ algorithm is applied by the GATK engine itself, which means that all GATK walkers can potentially benefit from it. By default, BAQ is OFF, meaning that the engine will not use BAQ quality scores at all.

The GATK engine accepts the argument -baq with the following enum values:

public enum CalculationMode {
    OFF,                        // don't apply a BAQ at all, the default
    CALCULATE_AS_NECESSARY,     // do HMM BAQ calculation on the fly, as necessary, if there's no tag
    RECALCULATE                 // do HMM BAQ calculation on the fly, regardless of whether there's a tag present
}

If you want to enable BAQ, the usual thing to do is CALCULATE_AS_NECESSARY, which will calculate BAQ values if they are not in the BQ read tag. If your reads are already tagged with BQ values, then the GATK will use those. RECALCULATE will always recalculate the BAQ, regardless of the tag, which is useful if you are experimenting with the gap open penalty (see below).

If you are really an expert, the GATK allows you to specify the BAQ gap open penalty (-baqGOP) to use in the HMM. The default value is 40, which is a good value for whole genomes and exomes for highly sensitive calls. However, if you are analyzing exome data only, you may want to use 30, which seems to result in a more specific call set. We are still experimenting with these values. Some walkers, where BAQ would corrupt their analyses, forbid the use of BAQ and will throw an exception if -baq is provided.

3. Some example uses of the BAQ in the GATK

  • For UnifiedGenotyper to get more specific SNP calls.

  • For PrintReads to write out a BAM file with BAQ tagged reads

  • For TableRecalibrator or IndelRealigner to write out a BAM file with BAQ tagged reads. Make sure you use -baq RECALCULATE so the engine knows to recalculate the BAQ after these tools have updated the base quality scores or the read alignments. Note that neither of these tools will use the BAQ values on input, but both will write out the tags for analysis tools that will use them.

Note that some tools should not have BAQ applied to them.

This last option will be particularly useful for people who are already doing base quality score recalibration. Suppose you have a pipeline that does:

RealignerTargetCreator
IndelRealigner

BaseRecalibrator
PrintReads (with --BQSR input)

UnifiedGenotyper

A highly efficient BAQ extended pipeline would look like

RealignerTargetCreator
IndelRealigner // don't bother with BAQ here, since we will calculate it later in the pipeline

BaseRecalibrator
PrintReads (with --BQSR input) -baq RECALCULATE // now the reads will have a BAQ tag added.  Slows the tool down some

UnifiedGenotyper -baq CALCULATE_AS_NECESSARY // UG will use the tags added in the previous step, keeping UG fast

4. BAQ and walker control

Walkers can control how the BAQ calculation is applied via the @BAQMode annotation: as a tag, by overwriting the quality scores, or by returning only the BAQ-capped quality scores. Additionally, walkers can be set up to have BAQ applied to the incoming reads (ON_INPUT, the default), to the output reads (ON_OUTPUT), or HANDLED_BY_WALKER, which means that calling into the BAQ system is the responsibility of the individual walker.
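
As a rough illustration, a walker that wants the BAQ added as a tag on input reads might be annotated along the following lines. This is a sketch only; the annotation element names and enum constants are assumptions based on the description above, so check the @BAQMode source for the actual signature:

@BAQMode(QualityMode = BAQ.QualityMode.ADD_TAG, ApplicationTime = BAQ.ApplicationTime.ON_INPUT)
public class MyBAQAwareWalker extends LocusWalker<Integer, Long> {
    // walker body as usual
}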

Comments (2)

1. Naming walkers

Users identify which GATK walker to run by specifying a walker name via the --analysis_type command-line argument. By default, the GATK will derive the walker name from a walker by taking the name of the walker class and removing packaging information from the start of the name, and removing the trailing text Walker from the end of the name, if it exists. For example, the GATK would, by default, assign the name PrintReads to the walker class org.broadinstitute.sting.gatk.walkers.PrintReadsWalker. To override the default walker name, annotate your walker class with @WalkerName("<my name>").
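
For instance, the following sketch (with a hypothetical walker class) forces the GATK to expose the walker as CountMyReads regardless of its class name:

@WalkerName("CountMyReads")
public class MyReadCountingWalker extends ReadWalker<Integer, Long> {
    // without the annotation, this walker would be named MyReadCounting
    // (packaging stripped from the front, trailing "Walker" stripped from the end)
}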

2. Requiring / allowing primary inputs

Walkers can flag exactly which primary data sources are allowed and required for a given walker. Reads, the reference, and reference-ordered data are currently considered primary data sources. Different traversal types have different default requirements for reads and reference, but currently no traversal types require reference-ordered data by default. You can add requirements to your walker with the @Requires / @Allows annotations as follows:

@Requires(DataSource.READS)
@Requires({DataSource.READS,DataSource.REFERENCE})
@Requires(value={DataSource.READS,DataSource.REFERENCE})
@Requires(value={DataSource.REFERENCE})

By default, all parameters are allowed unless you lock them down with the @Allows attribute. The command:

@Allows(value={DataSource.READS,DataSource.REFERENCE})

will only allow the reads and the reference. Any other primary data sources will cause the system to exit with an error.

Note that as of August 2011, the GATK no longer supports reference-ordered data (RMD) in the @Requires and @Allows syntax, as these bindings have moved to the standard @Argument system.

3. Command-line argument tagging

Any command-line argument can be tagged with a comma-separated list of freeform tags.

The syntax for tags is as follows:

-<argument>:<tag1>,<tag2>,<tag3> <argument value>

for example:

-I:tumor <my tumor data>.bam
-eval:VCF yri.trio.chr1.vcf

There is currently no mechanism in the GATK to validate either the number of tags supplied or the content of those tags.

Tags can be accessed from within a walker by calling getToolkit().getTags(argumentValue), where argumentValue is the parsed contents of the command-line argument to inspect.

Applications

The GATK currently has comprehensive support for tags on two built-in argument types:

  • -I,--input_file <input_file>

    Input BAM files and BAM file lists can be tagged with any type. When a BAM file list is tagged, the tag is applied to each listed BAM file.

From within a walker, use the following code to access the supplied tag or tags:

getToolkit().getReaderIDForRead(read).getTags();
  • Input RODs, e.g. -V or -eval

    Tags are used to specify ROD name and ROD type. There is currently no support for adding additional tags. See the ROD system documentation for more details.

4. Adding additional command-line arguments

Users can create command-line arguments for walkers by creating public member variables annotated with @Argument in the walker. The @Argument annotation takes a number of different parameters:

  • fullName

    The full name of this argument. Defaults to the toLowerCase()’d member name. When specifying fullName on the command line, prefix with a double dash (--).

  • shortName

    The alternate, short name for this argument. Defaults to the first letter of the member name. When specifying shortName on the command line, prefix with a single dash (-).

  • doc

    Documentation for this argument. It will appear in the help output when a user either requests help with the --help (-h) argument or specifies an invalid set of arguments. doc is the only annotation element that is always required.

  • required

    Whether the argument is required when used with this walker. Default is required = true.

  • exclusiveOf

    Specifies that this argument is mutually exclusive of another argument in the same walker. Defaults to not mutually exclusive of any other arguments.

  • validation

    Specifies a regular expression used to validate the contents of the command-line argument. If the text provided by the user does not match this regex, the GATK will abort with an error.

By default, all command-line arguments will appear in the help system. To prevent new and debugging arguments from appearing in the help system, you can add the @Hidden tag below the @Argument annotation, hiding it from the help system but allowing users to supply it on the command-line. Please use this functionality sparingly to avoid walkers with hidden command-line options that are required for production use.

Passing Command-Line Arguments

Arguments can be passed to the walker using either the full name or the short name. If passing arguments using the full name, the syntax is --<arg full name> <value>.

--myint 6

If passing arguments using the short name, the syntax is -<arg short name> <value>. Note that there is a space between the short name and the value:

-m 6

Boolean (class) and boolean (primitive) arguments are special in that they take no value. The presence of the flag indicates true, and its absence indicates false. The following example sets a flag to true:

-B
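
A flag like the -B example above could be declared as follows (a sketch; the field and argument names are hypothetical):

@Argument(fullName="bypassFilters", shortName="B", doc="If provided, bypass the filtering step", required=false)
public boolean bypassFilters = false;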

Supplemental command-line argument annotations

Two additional annotations can influence the behavior of command-line arguments.

  • @Hidden

    Adding this annotation to an @Argument tells the help system to avoid displaying any evidence that this argument exists. This can be used to add additional debugging arguments that aren't suitable for mass consumption.

  • @Deprecated

    Forces the GATK to throw an exception if this argument is supplied on the command-line. This can be used to supply extra documentation to the user as command-line parameters change for walkers that are in flux.
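
For example (a sketch with hypothetical names), a retired argument could be kept around solely to produce a helpful error message:

@Deprecated
@Argument(fullName="oldArgumentName", doc="This argument has been retired; use --newArgumentName instead", required=false)
public String oldArgumentName;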

Examples

Create a required int parameter with full name --myint, short name -m. Pass this argument by adding --myint 6 or -m 6 to the command line.

import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
    @Argument(doc="my integer")
    public int myInt;

Create an optional float parameter with full name --myFloatingPointArgument, short name -m. Pass this argument by adding --myFloatingPointArgument 2.71 or -m 2.71.

import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
    @Argument(fullName="myFloatingPointArgument",doc="a floating point argument",required=false)
    public float myFloat;

The GATK will parse the argument differently depending on the type of the public member variable. Many different argument types are supported, including primitives and their wrappers, arrays, typed and untyped collections, and any type with a String constructor. When the GATK cannot completely infer the type (such as in the case of untyped collections), it will assume that the argument is a String. The GATK is aware of concrete implementations of some interfaces and abstract classes: if the argument's member variable is of type List or Set, the GATK will fill the member variable with a concrete ArrayList or TreeSet, respectively. Maps are not currently supported.
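
For instance, a collection-typed argument could be declared as follows (a sketch; the names are illustrative). The GATK would fill the field with a concrete ArrayList containing one entry per occurrence of the argument on the command line:

@Argument(fullName="sample", shortName="sn", doc="Restrict the analysis to these sample names", required=false)
public List<String> sampleNames = new ArrayList<String>();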

5. Additional argument types: @Input, @Output

Besides @Argument, the GATK provides two additional annotations for command-line arguments: @Input and @Output. These two annotations are very similar to @Argument but act as flags to indicate dataflow to Queue, our pipeline management software.

  • The @Input tag indicates that the contents of the tagged field represents a file that will be read by the walker.

  • The @Output tag indicates that the contents of the tagged field represents a file that will be written by the walker, for consumption by downstream walkers.

We're still determining the best way to model walker dependencies in our pipeline. As we determine best practices, we'll post them here.

6. Getting access to Reference Ordered Data (RMD) with @Input and RodBinding

As of August 2011, the GATK now provides a clean mechanism for creating walker @Input arguments and using these arguments to access Reference Meta Data provided by the RefMetaDataTracker in the map() call. This mechanism is preferred to the old implicit string-based mechanism, which has been retired.

At a very high level, the new RodBindings provide a handle for a walker to obtain Feature records from Tribble in a map() call, specific to a command line binding provided by the user. This can be as simple as a one-to-one binding between a single command line argument and a track, or as complex as an argument accepting multiple command line bindings, each with a specific name. RodBindings are generic and type-specific, so you can require users to provide files that emit VariantContexts, BedTables, etc., or simply the root Tribble type Feature. Critically, RodBindings interact nicely with the GATKDocs system, so you can provide summary and detailed documentation for each RodBinding accepted by your walker.

A single ROD file argument

Suppose you have a walker that uses a single track of VariantContexts, such as SelectVariants, in its calculation. You declare a standard GATK-style @Input argument in the walker, of type RodBinding<VariantContext>:

@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;

This will require the user to provide a command line option --variant:vcf my.vcf to your walker. To get access to your variants, in the map() function you provide the variants variable to the tracker, as in:

Collection<VariantContext> vcs = tracker.getValues(variants, context.getLocation());

which returns all of the VariantContexts in variants that start at context.getLocation(). See RefMetaDataTracker in the javadocs to see the full range of getter routines.

Note that, as with regular tribble tracks, you have to provide the Tribble type of the file as a tag to the argument (:vcf). The system now checks up front that the corresponding Tribble codec produces Features that are type-compatible with the type of the RodBinding<T>.

RodBindings are generic

The RodBinding class is generic, parameterized as RodBinding<T extends Feature>. This type T describes the type of the Feature required by the walker. The best practice for declaring a RodBinding is to choose the most general Feature type that will allow your walker to work. For example, if all you really care about is whether a Feature overlaps the site in map(), you can use Feature itself, which supports this, and will allow any Tribble type to be provided, using a RodBinding<Feature>. If you are manipulating VariantContexts, you should declare a RodBinding<VariantContext>, which will automatically restrict the user to providing Tribble types that can create an object consistent with the VariantContext class (a VariantContext itself or a subclass).

Note that in multi-argument RodBindings, such as List<RodBinding<T>> arg, the system will require all files provided to produce objects of type T. So List<RodBinding<VariantContext>> arg requires all -arg command line arguments to bind to files that produce VariantContexts.

An argument that can be provided any number of times

The RodBinding system supports the standard @Argument style of allowing a vararg argument by wrapping it in a Java collection. For example, if you want to allow users to provide any number of comp tracks to your walker, simply declare a List<RodBinding<VariantContext>> field:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;

With this declaration, your walker will accept any number of -comp arguments, as in:

-comp:vcf 1.vcf -comp:vcf 2.vcf -comp:vcf 3.vcf

For such a command line, the comps field would be initialized to the List with three RodBindings, the first binding to 1.vcf, the second to 2.vcf and finally the third to 3.vcf.

Because this is a required argument, at least one -comp must be provided. Vararg @Input RodBindings can be optional, but you should follow proper varargs style to get the best results.

Proper handling of optional arguments

If you want to make a RodBinding optional, you first need to tell the @Input argument that it is optional (required=false):

@Input(fullName="discordance", required=false)
private RodBinding<VariantContext> discordanceTrack;

If the user does not provide a value, the GATK automatically sets this field to the result of the special static constructor method makeUnbound(Class c), which creates a special "unbound" RodBinding. This unbound object is type safe, can be safely passed to the RefMetaDataTracker get methods, and is guaranteed to never return any values. It also returns false when the isBound() method is called.

An example usage of isBound is to conditionally add header lines, as in:

if ( mask.isBound() ) {
    hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));
}

The case for vararg style RodBindings is slightly different. If you want, as above, users to be able to omit the -comp track entirely, you should initialize the value of the collection to the appropriate emptyList/emptySet in Collections:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=false)
public List<RodBinding<VariantContext>> comps = Collections.emptyList();

which will ensure that comps.isEmpty() is true when no -comp is provided.

Implicit and explicit names for RodBindings

@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;

By default, the getName() method in RodBinding returns the fullName of the @Input. This can be overridden on the command line by providing not one but two tags: the first tag is interpreted as the name for the binding, and the second as the type. As in:

-variant:vcf foo.vcf     => getName() == "variant"
-variant:foo,vcf foo.vcf => getName() == "foo"

This capability is useful when users need to provide more meaningful names for arguments, especially with variable arguments. For example, in VariantEval, there's a List<RodBinding<VariantContext>> comps, which may be dbsnp, hapmap, etc. This would be declared as:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;

where a normal command line usage would look like:

-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf

In the code, you might have a loop that looks like:

for ( final RodBinding<VariantContext> comp : comps )
    for ( final VariantContext vc : tracker.getValues(comp, context.getLocation()) )
        out.printf("%s has a binding at %s%n", comp.getName(), getToolkit().getGenomeLocParser().createGenomeLoc(vc));

which would print out lines that included things like:

hapmap has a binding at 1:10
omni has a binding at 1:20
hapmap has a binding at 1:30
1000g has a binding at 1:30

This last example begs the question -- what happens with getName() when explicit names are not provided? The system goes out of its way to provide reasonable names for the variables:

  • The first occurrence is named with the fullName, here comp

  • Subsequent occurrences are postfixed with an integer count, starting at 2, so comp2, comp3, etc.

In the above example, the command line

-comp:vcf hapmap.vcf -comp:vcf omni.vcf -comp:vcf 1000g.vcf

would emit

comp has a binding at 1:10
comp2 has a binding at 1:20
comp has a binding at 1:30
comp3 has a binding at 1:30

Dynamic type resolution

The new RodBinding system supports a simple form of dynamic type resolution. If the input file type can be uniquely associated with a single Tribble type (as VCF can), then you can omit the type entirely from the command-line binding of a RodBinding!

So whereas a full command line would look like:

-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf

because these are VCF files they could technically be provided as:

-comp:hapmap hapmap.vcf -comp:omni omni.vcf -comp:1000g 1000g.vcf

If you don't care about naming, you can now say:

-comp hapmap.vcf -comp omni.vcf -comp 1000g.vcf

Best practice for documenting a RodBinding

The best practice is simple: use a javadoc style comment above the @Input annotation, with the standard first line summary and subsequent detailed discussion of the meaning of the argument. These are then picked up by the GATKdocs system and added to the standard walker docs, following the standard structure of GATKDocs @Argument docs. Below is a best practice documentation example from SelectVariants, which accepts a required variant track and two optional discordance and concordance tracks.

public class SelectVariants extends RodWalker<Integer, Integer> {
   /**
     * Variants from this file are sent through the filtering and modifying routines as directed
     * by the arguments to SelectVariants, and finally are emitted.
     */
    @Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
    public RodBinding<VariantContext> variants;

    /**
     * A site is considered discordant if there exists some sample in eval that has a non-reference genotype
     * and either the site isn't present in this track, the sample isn't present in this track,
     * or the sample is called reference in this track.
     */
    @Input(fullName="discordance", shortName = "disc", doc="Output variants that were not called in this Feature comparison track", required=false)
    private RodBinding<VariantContext> discordanceTrack;

    /**
     * A site is considered concordant if (1) we are not looking for specific samples and there is a variant called
     * in both variants and concordance tracks or (2) every sample present in eval is present in the concordance
     * track and they have the same genotype call.
     */
    @Input(fullName="concordance", shortName = "conc", doc="Output variants that were also called in this Feature comparison track", required=false)
    private RodBinding<VariantContext> concordanceTrack;
}

Note how much better the above version is compared to the old pre-RodBinding syntax (code below). Below you have a required argument variant that doesn't show up as a formal argument in the GATK, unlike the conceptually similar @Arguments for discordanceRodName and concordanceRodName, which have no type restrictions. There's no place to document the variant argument either, so the system is effectively blind to this essential argument.

@Requires(value={},referenceMetaData=@RMD(name="variant", type=VariantContext.class))
public class SelectVariants extends RodWalker<Integer, Integer> {
    @Argument(fullName="discordance", shortName =  "disc", doc="Output variants that were not called on a ROD comparison track. Use -disc ROD_NAME", required=false)
    private String discordanceRodName = "";

    @Argument(fullName="concordance", shortName =  "conc", doc="Output variants that were also called on a ROD comparison track. Use -conc ROD_NAME", required=false)
    private String concordanceRodName = "";
}

RodBinding examples

In these examples, we have declared two RodBindings in the Walker

@Input(fullName="mask", doc="Input ROD mask", required=false)
public RodBinding<Feature> mask = RodBinding.makeUnbound(Feature.class);

@Input(fullName="comp", doc="Comparison track", required=false)
public List<RodBinding<VariantContext>> comps = new ArrayList<RodBinding<VariantContext>>();

  • Get the first value

    Feature f = tracker.getFirstValue(mask)

  • Get all of the values at a location

    Collection<Feature> fs = tracker.getValues(mask, thisGenomeLoc)

  • Get all of the features here, regardless of track

    Collection<Feature> fs = tracker.getValues(Feature.class)

  • Determining if an optional RodBinding was provided

    if ( mask.isBound() )
        // writes out the mask header line, if one was provided
        hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));

    if ( ! comps.isEmpty() ) logger.info("At least one comp was provided");

Example usage in Queue scripts

In QScripts, when you need to tag a file, use the class TaggedFile, which extends java.io.File.

Example             In the QScript                                                     On the Command Line
Untagged VCF        myWalker.variant = new File("my.vcf")                              -V my.vcf
Tagged VCF          myWalker.variant = new TaggedFile("my.vcf", "VCF")                 -V:VCF my.vcf
Tagged VCF          myWalker.variant = new TaggedFile("my.vcf", "VCF,custom=value")    -V:VCF,custom=value my.vcf
Labeling a tumor    myWalker.input_file :+= new TaggedFile("mytumor.bam", "tumor")     -I:tumor mytumor.bam

Notes

You no longer need to (nor can you) use @Requires and @Allows for ROD data. That system is now retired.

Comments (0)

The GATKDocs are what we call "Technical Documentation" in the Guide section of this website. The HTML pages are generated automatically at build time from specific blocks of documentation in the source code.

The best place to look for example documentation for a GATK walker is GATKDocsExample walker in org.broadinstitute.sting.gatk.examples. This is available here.

Below is the reproduction of that file from August 11, 2011:

/**
 * [Short one sentence description of this walker]
 *
 * <p>
 * [Functionality of this walker]
 * </p>
 *
 * <h2>Input</h2>
 * <p>
 * [Input description]
 * </p>
 *
 * <h2>Output</h2>
 * <p>
 * [Output description]
 * </p>
 *
 * <h2>Examples</h2>
 * PRE-TAG
 *    java
 *      -jar GenomeAnalysisTK.jar
 *      -T $WalkerName
 * PRE-TAG
 *
 * @category Walker Category
 * @author Your Name
 * @since Date created
 */
public class GATKDocsExample extends RodWalker<Integer, Integer> {
    /**
     * Put detailed documentation about the argument here.  No need to duplicate the summary information
     * in doc annotation field, as that will be added before this text in the documentation page.
     *
     * Notes:
     * <ul>
     *     <li>This field can contain HTML as a normal javadoc</li>
     *     <li>Don't include information about the default value, as gatkdocs adds this automatically</li>
     *     <li>Try your best to describe in detail the behavior of the argument, as ultimately confusing
     *          docs here will just result in user posts on the forum</li>
     * </ul>
     */
    @Argument(fullName="full", shortName="short", doc="Brief summary of argument [~ 80 characters of text]", required=false)
    private boolean myWalkerArgument = false;

    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) { return 0; }
    public Integer reduceInit() { return 0; }
    public Integer reduce(Integer value, Integer sum) { return value + sum; }
    public void onTraversalDone(Integer result) { }
}
Comments (15)

1. Introduction

Reads can be filtered out of traversals either by pileup size, through one of our downsampling methods, or by read property, through our read filtering mechanism. Both techniques are described below.

2. Downsampling

Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.

Defaults

The GATK's default downsampler exhibits the following properties:

  • The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.

  • The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.

  • The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.

Customizing

From the command line:

  • To disable the downsampler, specify -dt NONE.

  • To change the default coverage per-sample, specify the desired coverage to the -dcov option.

To modify the walker's default behavior:

  • Add the @Downsample annotation to the top of your walker. Override the downsampling type by changing by=<value>. Override the downsampling depth by changing toCoverage=<value>. A sketch follows below.
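
For example, a locus walker that wants per-sample downsampling to 250x coverage might be annotated like this. The by and toCoverage element names come from the description above, but the DownsampleType constant is an assumption, so treat this as a sketch:

@Downsample(by = DownsampleType.BY_SAMPLE, toCoverage = 250)
public class MyLocusWalker extends LocusWalker<Integer, Long> {
    // walker body as usual
}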

Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:

For each sample:

  • Select reads with the next alignment start.

  • While the number of existing reads + the number of incoming reads is greater than the target sample size:

    Walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.

  • If we have n slots available, where n >= 1, randomly select n of the incoming reads and add them to the pileup.

  • Otherwise, we have zero slots available. Choose the read from the existing pileup with the least alignment start. Throw it out and add one randomly selected read from the new pileup.

3. Read filtering

To selectively filter out reads before they reach your walker, implement one or more net.sf.picard.filter.SamRecordFilter classes, and attach them to your walker as follows:

@ReadFilters({Platform454Filter.class, ZeroMappingQualityReadFilter.class})

4. Command-line arguments for read filters

You can add command-line arguments for filters with the @Argument tag, just as with walkers. Here's an example of our new max read length filter:

public class MaxReadLengthFilter implements SamRecordFilter {
    @Argument(fullName = "maxReadLength", shortName = "maxRead", doc="Discard reads with length greater than the specified value", required=false)
    private int maxReadLength;

    public boolean filterOut(SAMRecord read) { return read.getReadLength() > maxReadLength; }
}

Adding this filter to the top of your walker using the @ReadFilters attribute will add a new command-line argument, maxReadLength, which will filter out reads longer than maxReadLength before your walker is called.

Note that when you specify a read filter, you need to strip the Filter part of its name off! E.g. in the example above, if you want to use MaxReadLengthFilter, you need to call it like this:

--read_filter MaxReadLength

5. Adding filters dynamically using command-line arguments

The --read_filter argument will allow you to apply whatever read filters you'd like to your dataset before the reads reach your walker. To add the MaxReadLength filter above to PrintReads, you'd add the command line parameters:

--read_filter MaxReadLength --maxReadLength 76

You can add as many filters as you like by using multiple copies of the --read_filter parameter:

--read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead
Comments (0)

1. Introduction

The AlignmentContext and ReadBackedPileup work together to provide the read data associated with a given locus. This section details the tools the GATK provides for working with collections of aligned reads.

2. What are read backed pileups?

Read backed pileups are objects that contain all of the reads, and their offsets, that "pile up" at a locus on the genome. They are the basic input data for the GATK LocusWalkers, and underlie most of the locus-based analysis tools like the recalibrator and the SNP caller. Unfortunately, there are many ways to view this data, and version one grew unwieldy trying to support all of these approaches. Version two of the ReadBackedPileup presents a consistent and clean interface for working with pileup data, as well as supporting the Iterable interface to enable the convenient for ( PileupElement p : pileup ) for-each loop.

3. How do I get a ReadBackedPileup and/or how do I create one?

The best way is simply to grab the pileup (the underlying representation of the locus data) from your AlignmentContext object in map:

public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    ReadBackedPileup pileup = context.getPileup();
    ...
}

This aligns your calculations with the GATK core infrastructure, and avoids any unnecessary data copying from the engine to your walker.

If you are trying to create your own, the best constructor is:

public ReadBackedPileup(GenomeLoc loc, ArrayList<PileupElement> pileup )

requiring only a list of PileupElements, in read/offset order within the pileup.

From List<SAMRecord> and List<Integer>

If you happen to have lists of SAMRecords and integer offsets into them you can construct a ReadBackedPileup this way:

public ReadBackedPileup(GenomeLoc loc, List<SAMRecord> reads, List<Integer> offsets )

4. What's the best way to use them?

Best way if you just need reads, bases and quals

for ( PileupElement p : pileup ) {
  System.out.printf("%c %c %d%n", p.getBase(), p.getSecondBase(), p.getQual());
  // you can get the read itself too using p.getRead()
}

This is the most efficient way to get data, and should be used whenever possible.

I just want a vector of bases and quals

You can use:

public byte[] getBases()
public byte[] getSecondaryBases()
public byte[] getQuals()

to get the bases and quals as byte[] arrays, which is the underlying base representation in the SAM-JDK.

All I care about are counts of bases

Use the following function to get counts of A, C, G, and T in order:

public int[] getBaseCounts()

which returns an int[4] vector with counts according to BaseUtils.simpleBaseToBaseIndex for each base.
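
For example, you might use the returned vector like this (a sketch; it assumes BaseUtils.simpleBaseToBaseIndex accepts the reference base from the ReferenceContext):

int[] counts = pileup.getBaseCounts();   // counts of A, C, G, T, in that order
int depth = counts[0] + counts[1] + counts[2] + counts[3];
int refCount = counts[BaseUtils.simpleBaseToBaseIndex(ref.getBase())];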

Can I view just the reads for a given sample, read group, or any other arbitrary filter?

The GATK can very efficiently stratify pileups by sample, and less efficiently stratify by read group, strand, mapping quality, base quality, or any arbitrary filter function. The sample-specific functions can be called as follows:

pileup.getSamples();
pileup.getPileupForSample(String sampleName);

In addition to the rich set of filtering primitives built into the ReadBackedPileup, you can supply your own by implementing a PileupElementFilter:

public interface PileupElementFilter {
    public boolean allow(final PileupElement pileupElement);
}

and passing it to ReadBackedPileup's generic filter function:

public ReadBackedPileup getFilteredPileup(PileupElementFilter filter);
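
For example, a filter that keeps only high-quality pileup elements might look like this (a minimal sketch; the Q20 threshold is arbitrary):

PileupElementFilter highQualityOnly = new PileupElementFilter() {
    public boolean allow(final PileupElement pileupElement) {
        // keep only elements whose base quality is at least Q20
        return pileupElement.getQual() >= 20;
    }
};
ReadBackedPileup filteredPileup = pileup.getFilteredPileup(highQualityOnly);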

See the ReadBackedPileup's java documentation for a complete list of built-in filtering primitives.

Historical: StratifiedAlignmentContext

While ReadBackedPileup is the preferred mechanism for aligned reads, some walkers still use the StratifiedAlignmentContext to carve up selections of reads. If you find functions that you require in StratifiedAlignmentContext that seem to have no analog in ReadBackedPileup, please let us know and we'll port the required functions for you.

Comments (0)

Imagine a simple question like, "What's the depth of coverage at position A of the genome?"

First, you are given billions of reads that are aligned to the genome but not ordered in any particular way (except perhaps in the order they were emitted by the sequencer). This simple question is then very difficult to answer efficiently, because the algorithm is forced to examine every single read in succession, since any one of them might span position A. The algorithm must now take several hours in order to compute this value.

Instead, imagine the billions of reads are now sorted in reference order (that is to say, on each chromosome, the reads are stored on disk in the same order they appear on the chromosome). Now, answering the question above is trivial, as the algorithm can jump to the desired location, examine only the reads that span the position, and return immediately after those reads (and only those reads) are inspected. The total number of reads that need to be interrogated is only a handful, rather than several billion, and the processing time is seconds, not hours.

This reference-ordered sorting enables the GATK to process terabytes of data quickly and without tremendous memory overhead. Most GATK tools run very quickly and with less than 2 gigabytes of RAM. Without this sorting, the GATK cannot operate correctly. This requirement is so fundamental that it forms the Central Dogma of the GATK:

All datasets (reads, alignments, quality scores, variants, dbSNP information, gene tracks, interval lists - everything) must be sorted in order of one of the canonical reference sequences.

Comments (0)

1. What file formats do you support for variant callsets?

We support the Variant Call Format (VCF) for variant callsets. No other file formats are supported.

2. How can I know if my VCF file is valid?

VCFTools contains a validation tool that will allow you to verify it.

3. Are you planning to include any converters from different formats or allow different input formats than VCF?

No, we like VCF and we think it's important to have a good standard format. Multiplying formats just makes life hard for everyone, both developers and analysts.

Comments (1)

1. What is Scala?

Scala is a combination of an object oriented framework and a functional programming language. For a good introduction see the free online book Programming Scala.

The following are extremely brief answers to frequently asked questions about Scala which often come up when first viewing or editing QScripts. For more information on Scala there are a multitude of resources available around the web, including the Scala home page and the online Scala Doc.

2. Where do I learn more about Scala?

  • http://www.scala-lang.org
  • http://programming-scala.labs.oreilly.com
  • http://www.scala-lang.org/docu/files/ScalaByExample.pdf
  • http://devcheatsheet.com/tag/scala/
  • http://davetron5000.github.com/scala-style/index.html

3. What is the difference between var and val?

var declares a variable you can later modify, while val is similar to final in Java.

4. What is the difference between Scala collections and Java collections? / Why do I get the error: type mismatch?

Because the GATK and Queue are a mix of Scala and Java, sometimes you'll run into problems when you need a Scala collection but a Java collection is returned instead.

   MyQScript.scala:39: error: type mismatch;
     found   : java.util.List[java.lang.String]
     required: scala.List[String]
        val wrapped: List[String] = TextFormattingUtils.wordWrap(text, width)

Use the implicit definitions in JavaConversions to automatically convert the basic Java collections to and from Scala collections.

import collection.JavaConversions._

Scala has a very rich collections framework which you should take the time to enjoy. One of the first things you'll notice is that the default Scala collections are immutable, which means you should treat them as you would a String. When you want to 'modify' an immutable collection you need to capture the result of the operation, often assigning the result back to the original variable.

var str = "A"
str + "B"
println(str) // prints: A
str += "C"
println(str) // prints: AC

var set = Set("A")
set + "B"
println(set) // prints: Set(A)
set += "C"
println(set) // prints: Set(A, C)

5. How do I append to a list?

Use the :+ operator for a single value.

  var myList = List.empty[String]
  myList :+= "a"
  myList :+= "b"
  myList :+= "c"

Use ++ for appending a list.

  var myList = List.empty[String]
  myList ++= List("a", "b", "c")

6. How do I add to a set?

Use the + operator.

  var mySet = Set.empty[String]
  mySet += "a"
  mySet += "b"
  mySet += "c"

7. How do I add to a map?

Use the + and -> operators.

  var myMap = Map.empty[String,Int]
  myMap += "a" -> 1
  myMap += "b" -> 2
  myMap += "c" -> 3

8. What are Option, Some, and None?

Option is a Scala generic type that can either be some generic value or None. Queue often uses it to represent primitives that may be null.

  var myNullableInt1: Option[Int] = Some(1)
  var myNullableInt2: Option[Int] = None

9. What is _ / What is the underscore?

François Armand's slide deck is a good introduction: http://www.slideshare.net/normation/scala-dreaded

To quote from his slides:

Give me a variable name but
- I don't care of what it is
- and/or
- don't want to pollute my namespace with it

10. How do I format a String?

Use the .format() method.

This Java snippet:

String formatted = String.format("%s %d", myString, myInt);

In Scala would be:

val formatted = "%s %d".format(myString, myInt)

11. Can I use Scala Enumerations as QScript @Arguments?

No. Currently Scala's Enumeration class does not interact with the Java reflection API in a way that could be used for Queue command line arguments. You can, however, use Java enums, for example if you are importing a Java-based walker's enum type.

If/when we find a workaround for Queue we'll update this entry. In the meantime try using a String.

Comments (1)

1. Many of my GATK functions are set up with the same Reference, Intervals, etc. Is there a quick way to reuse these values for the different analyses in my pipeline?

Yes.

  • Create a trait that extends from CommandLineGATK.
  • In the trait, copy common values from your qscript.
  • Mix the trait into instances of your classes.

For more information, see the ExampleUnifiedGenotyper.scala or examples of using Scala's traits/mixins illustrated in the QScripts documentation.

2. How do I accept a list of arguments to my QScript?

In your QScript, define a var list and annotate it with @Argument. Initialize the value to Nil.

@Argument(doc="filter names", shortName="filter")
var filterNames: List[String] = Nil

On the command line specify the arguments by repeating the argument name.

-filter filter1 -filter filter2 -filter filter3

Then once your QScript is run, the command line arguments will be available for use in the QScript's script method.

  def script {
     var myCommand = new MyFunction
     myCommand.filters = this.filterNames
  }

For a full example of command line arguments see the QScripts documentation.

3. What is the best way to run a utility method at the right time?

Wrap the utility with an InProcessFunction. If your functionality is reusable code you should add it to Sting Utils with Unit Tests and then invoke your new function from your InProcessFunction. Computationally or memory intensive functions should NOT be implemented as InProcessFunctions, and should be wrapped in Queue CommandLineFunctions instead.

    class MySplitter extends InProcessFunction {
      @Input(doc="inputs")
      var in: File = _

      @Output(doc="outputs")
      var out: List[File] = Nil

      def run {
         StingUtilityMethod.quickSplitFile(in, out)
      }
    }

    var splitter = new MySplitter
    splitter.in = new File("input.txt")
    splitter.out = List(new File("out1.txt"), new File("out2.txt"))
    add(splitter)

See Queue CommandLineFunctions for more information on how @Input and @Output are used.

4. What is the best way to write a list of files?

Create an instance of a ListWriterFunction and add it in your script method.

import org.broadinstitute.sting.queue.function.ListWriterFunction

val writeBamList = new ListWriterFunction
writeBamList.inputFiles = bamFiles
writeBamList.listFile = new File("myBams.list")
add(writeBamList)

5. How do I add optional debug output to my QScript?

Queue contains a trait mixin you can use to add Log4J support to your classes.

Add the import for the trait Logging to your QScript.

import org.broadinstitute.sting.queue.util.Logging

Mixin the trait to your class.

class MyScript extends Logging {
...

Then use the mixed in logger to write debug output when the user specifies -l DEBUG.

logger.debug("This will only be displayed when debugging is enabled.")

6. I updated Queue and now I'm getting java.lang.NoClassDefFoundError / java.lang.AbstractMethodError

Try ant clean.

Queue relies on a lot of Scala traits / mixins. These dependencies are not always picked up by the scala/java compilers leading to partially implemented classes. If that doesn't work please let us know in the forum.

7. Do I need to create directories in my QScript?

No. QScript will create all parent directories for outputs.

8. How do I specify the -W 240 for the LSF hour queue at the Broad?

Queue's LSF dispatcher automatically looks up and sets the maximum runtime for whichever LSF queue is specified. If you set your -jobQueue/.jobQueue to hour then you should see something like this under bjobs -l:

RUNLIMIT
240.0 min of gsa3

9. Can I run Queue with GridEngine?

Queue GridEngine functionality is community supported. See here for full details: Queue with Grid Engine.

10. How do I pass advanced java arguments to my GATK commands, such as remote debugging?

The easiest way to do this at the moment is to mixin a trait.

First define a trait which adds your java options:

  trait RemoteDebugging extends JavaCommandLineFunction {
    override def javaOpts = super.javaOpts + " -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
  }

Then mix in the trait to your walker and otherwise run it as normal:

  val printReadsDebug = new PrintReads with RemoteDebugging
  printReadsDebug.reference_sequence = "my.fasta"
  // continue setting up your walker...
  add(printReadsDebug)

11. Why does Queue log "Running jobs. ... Done." but doesn't actually run anything?

If you see something like the following, it means that Queue believes that it previously successfully generated all of the outputs.

INFO 16:25:55,049 QCommandLine - Scripting ExampleUnifiedGenotyper 
INFO 16:25:55,140 QCommandLine - Added 4 functions 
INFO 16:25:55,140 QGraph - Generating graph. 
INFO 16:25:55,164 QGraph - Generating scatter gather jobs. 
INFO 16:25:55,714 QGraph - Removing original jobs. 
INFO 16:25:55,716 QGraph - Adding scatter gather jobs. 
INFO 16:25:55,779 QGraph - Regenerating graph. 
INFO 16:25:55,790 QGraph - Running jobs. 
INFO 16:25:55,853 QGraph - 0 Pend, 0 Run, 0 Fail, 10 Done 
INFO 16:25:55,902 QCommandLine - Done 

Queue will not re-run the job if a .done file is found for all the outputs, e.g.: /path/to/.output.file.done. You can either remove the specific .done files yourself, or use the -startFromScratch command line option.

Comments (11)

1. Background

Thanks to contributions from the community, Queue contains a job runner compatible with Grid Engine 6.2u5.

As of July 2011, several forked distributions of Sun's Grid Engine 6.2u5 are known. As long as they are JDRMAA 1.0 source-compatible with Grid Engine 6.2u5, the compiled Queue code should run against each of these distributions. However, we have yet to receive confirmation that Queue works on any of these setups.

Our internal QScript integration tests run the same tests on both LSF 7.0.6 and a Grid Engine 6.2u5 cluster setup on older software released by Sun.

If you run into trouble, please let us know. If you would like to contribute additions or bug fixes please create a fork in our github repo where we can review and pull in the patch.

2. Running Queue with GridEngine

Try out the Hello World example with -jobRunner GridEngine.

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run

If all goes well Queue should dispatch the job to Grid Engine and wait until the status returns RunningStatus.DONE, and "hello world" should be echoed into the output file, possibly with other Grid Engine log messages.

See QFunction and Command Line Options for more info on Queue options.

3. Debugging issues with Queue and GridEngine

If you run into an error with Queue submitting jobs to GridEngine, first try submitting the HelloWorld example with -memLimit 2:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/examples/HelloWorld.scala -jobRunner GridEngine -run -memLimit 2

Then try the following GridEngine qsub commands. They are based on what Queue submits via the API when running the HelloWorld.scala example with and without memory reservations and limits:

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=2048M -l h_rss=2458M echo hello world

One other thing to check is whether there is a memory limit on your cluster. For example, try submitting jobs requesting up to 16G:

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=4096M -l h_rss=4915M echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=8192M -l h_rss=9830M echo hello world

qsub -w e -V -b y -N echo_hello_world \
  -o test.out -wd $PWD -j y \
  -l mem_free=16384M -l h_rss=19960M echo hello world

If the above tests pass and GridEngine will still not dispatch jobs submitted by Queue please report the issue to our support forum.

Comments (7)

1. Basic QScript run rules

  • In the script method, a QScript will add one or more CommandLineFunctions.
  • Queue tracks dependencies between functions via variables annotated with @Input and @Output.
  • Queue will run functions based on the dependencies between them, so if the @Input of CommandLineFunction A depends on the @Output of CommandLineFunction B, A will wait for B to finish before it starts running (see the sketch after this list).
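
For example, in the following sketch (hypothetical functions and file names), Queue will run SortFile before CountLines regardless of the order of the add() calls, because the @Input of CountLines is the @Output of SortFile:

class SortFile extends CommandLineFunction {
  @Input(doc="unsorted input") var in: File = _
  @Output(doc="sorted output") var out: File = _
  def commandLine = required("sort") + required(in) + required("-o", out)
}

class CountLines extends CommandLineFunction {
  @Input(doc="file to count") var in: File = _
  @Output(doc="line count") var out: File = _
  def commandLine = "wc -l %s > %s".format(in, out)
}

def script() {
  val sorter = new SortFile
  sorter.in = new File("raw.txt")
  sorter.out = new File("sorted.txt")

  val counter = new CountLines
  counter.in = sorter.out    // counter's @Input is sorter's @Output
  counter.out = new File("count.txt")

  add(counter, sorter)       // the order of add() calls does not matter
}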

2. Command Line

Each CommandLineFunction must define the actual command line to run as follows.

class MyCommandLine extends CommandLineFunction {
  def commandLine = "myScript.sh hello world"
}

Constructing a Command Line Manually

If you're writing a one-off CommandLineFunction that is not destined for use by other QScripts, it's often easiest to construct the command line directly rather than through the API methods provided in the CommandLineFunction class.

For example:

def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out)

Constructing a Command Line using API Methods

If you're writing a CommandLineFunction that will become part of Queue and/or will be used by other QScripts, however, our best practice recommendation is to construct your command line only using the methods provided in the CommandLineFunction class: required(), optional(), conditional(), and repeat()

The reason for this is that these methods automatically escape the values you give them so that they'll be interpreted literally within the shell scripts Queue generates to run your command, and they also manage whitespace separation of command-line tokens for you. This prevents (for example) a value like MQ > 10 from being interpreted as an output redirection by the shell, and avoids issues with values containing embedded spaces. The methods also give you the ability to turn escaping and/or whitespace separation off as needed. An example:

override def commandLine = super.commandLine +
                           required("eff") +
                           conditional(verbose, "-v") +
                           optional("-c", config) +
                           required("-i", "vcf") +
                           required("-o", "vcf") +
                           required(genomeVersion) +
                           required(inVcf) +
                           required(">", escape=false) +  // This will be shell-interpreted as an output redirection
                           required(outVcf)

The CommandLineFunctions built into Queue, including the CommandLineFunctions automatically generated for GATK Walkers, are all written using this pattern. This means that when you configure a GATK Walker or one of the other built-in CommandLineFunctions in a QScript, you can rely on all of your values being safely escaped and taken literally when the commands are run, including values containing characters that would normally be interpreted by the shell such as MQ > 10.

Below is a brief overview of the API methods available to you in the CommandLineFunction class for safely constructing command lines:

  • required()

Used for command-line arguments that are always present, e.g.:

required("-f", "filename")                              returns: " '-f' 'filename' "
required("-f", "filename", escape=false)                returns: " -f filename "
required("java")                                        returns: " 'java' "
required("INPUT=", "myBam.bam", spaceSeparated=false)   returns: " 'INPUT=myBam.bam' "
  • optional()

Used for command-line arguments that may or may not be present, e.g.:

optional("-f", myVar) behaves like required() if myVar has a value, but returns ""
if myVar is null/Nil/None
  • conditional()

Used for command-line arguments that should only be included if some condition is true, e.g.:

conditional(verbose, "-v") returns " '-v' " if verbose is true, otherwise returns ""
  • repeat()

Used for command-line arguments that are repeated multiple times on the command line, e.g.:

repeat("-f", List("file1", "file2", "file3")) returns: " '-f' 'file1' '-f' 'file2' '-f' 'file3' "

3. Arguments

  • CommandLineFunction arguments use a similar annotation syntax to QScript arguments.

  • CommandLineFunction variables are annotated with @Input, @Output, or @Argument annotations.

Input and Output Files

So that Queue can track the input and output files of a command, CommandLineFunction @Input and @Output must be java.io.File objects.

class MyCommandLine extends CommandLineFunction {
  @Input(doc="input file")
  var inputFile: File = _
  def commandLine = "myScript.sh -fileParam " + inputFile
}

FileProvider

CommandLineFunction variables can also provide indirect access to java.io.File inputs and outputs via the FileProvider trait.

class MyCommandLine extends CommandLineFunction {
  @Input(doc="named input file")
  var inputFile: ExampleFileProvider = _
  def commandLine = "myScript.sh " + inputFile
}

// An example FileProvider that stores a 'name' with a 'file'.
class ExampleFileProvider(var name: String, var file: File) extends org.broadinstitute.sting.queue.function.FileProvider {
  override def toString = " -fileName " + name + " -fileParam " + file
}
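
In a QScript, the provider can then be assigned like any other input. A sketch using the ExampleFileProvider above (the file name is illustrative):

val cmd = new MyCommandLine
cmd.inputFile = new ExampleFileProvider("sampleA", new File("sampleA.txt"))
add(cmd)
// The command line expands to: myScript.sh  -fileName sampleA -fileParam sampleA.txt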

Optional Arguments

Optional files can be specified via required=false, and can use the CommandLineFunction.optional() utility method, as described above:

class MyCommandLine extends CommandLineFunction {
  @Input(doc="input file", required=false)
  var inputFile: File = _
  // -fileParam will only be added if the QScript sets inputFile on this instance of MyCommandLine
  def commandLine = required("myScript.sh") + optional("-fileParam", inputFile)
}

Collections as Arguments

A List or Set of files can use the CommandLineFunction.repeat() utility method, as described above:

class MyCommandLine extends CommandLineFunction {
  @Input(doc="input file")
  var inputFile: List[File] = Nil // NOTE: Do not set List or Set variables to null!
  // -fileParam will be added as many times as the QScript adds files to inputFile on this instance of MyCommandLine
  def commandLine = required("myScript.sh") + repeat("-fileParam", inputFile)
}

Non-File Arguments

A command line function can define other required arguments via @Argument.

class MyCommandLine extends CommandLineFunction {
  @Argument(doc="message to display")
  var veryImportantMessage: String = _
  // If the QScript does not specify the required veryImportantMessage, the pipeline will not run.
  def commandLine = required("myScript.sh") + required(veryImportantMessage)
}

4. Example: "samtools index"

class SamToolsIndex extends CommandLineFunction {
  @Input(doc="bam to index") var bamFile: File = _
  @Output(doc="bam index") var baiFile: File = _
  def commandLine = "samtools index %s %s".format(bamFile, baiFile)
}

Or, using the CommandLineFunction API methods to construct the command line with automatic shell escaping:

class SamToolsIndex extends CommandLineFunction {
  @Input(doc="bam to index") var bamFile: File = _
  @Output(doc="bam index") var baiFile: File = _
  def commandLine = required("samtools") + required("index") + required(bamFile) + required(baiFile)
}
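
Either version can then be used from a QScript's script method, for instance (a sketch, using swapExt to derive the index file name):

def script() {
  val index = new SamToolsIndex
  index.bamFile = new File("sample.bam")
  index.baiFile = swapExt(index.bamFile, "bam", "bai")   // sample.bai
  add(index)
}
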
Comments (9)

These are the most popular Queue command line options. For a complete and up to date list run with --help or -h. QScripts may also add additional command line options.

1. Queue Command Line Options

Command Line Argument Description Default
-run If passed the scripts are run. If not passed a dry run is executed. dry run
-jobRunner <jobrunner> The job runner to dispatch jobs. Setting to Lsf706, GridEngine, or Drmaa will dispatch jobs to LSF or Grid Engine using the job settings (see below). Defaults to Shell which runs jobs on a local shell one at a time. Shell
-bsub Alias for -jobRunner Lsf706 not set
-qsub Alias for -jobRunner GridEngine not set
-status Prints out a summary progress. If a QScript is currently running via -run, you can run the same command line with -status instead to print a summary of progress. not set
-retry <count> Retries a QFunction that returns a non-zero exit code up to count times. The QFunction must not have set jobRestartable to false. 0 = no retries
-startFromScratch Restarts the graph from the beginning. If not specified, then for each output file specified on a QFunction, ex: /path/to/output.file, Queue will not re-run the job if a .done file is found for all the outputs, ex: /path/to/.output.file.done. use .done files to determine if jobs are complete
-keepIntermediates If set, keeps the output files of QFunctions that set .isIntermediate to true; by default Queue deletes these intermediate outputs. delete intermediate files
-statusTo <email> Email address to send status to whenever a) A job fails, or b) Queue has run all the functions it can run and is exiting. not set
-statusFrom <email> Email address to send status emails from. user@local.domain
-dot <file> If set renders the job graph to a dot file. not rendered
-l <logging_level> The minimum level of logging, DEBUG, INFO, WARN, or FATAL. INFO
-log <file> Sets the location to save log output in addition to standard out. not set
-debug Set the logging to include a lot of debugging information (SLOW!) not set
-jobReport Path to write the job report text file. If R is installed and available on the $PATH then a pdf will be generated visualizing the job report. jobPrefix.jobreport.txt
-disableJobReport Disables writing the job report. not set
-help Lists all of the command line arguments with their descriptions. not set

2. QFunction Options

The following options can be specified on the command line or overridden per QFunction.

Command Line Argument QFunction Property Description Default
-jobPrefix .jobName The unique name of the job. Used to prefix directories and log files. Use -jobPrefix on the Queue command line to replace the default prefix Q-<processid>@<host>. <jobNamePrefix>-<jobNumber>
N/A .jobOutputFile Captures stdout and if jobErrorFile is null it captures stderr as well. <jobName>.out
N/A .jobErrorFile If not null captures stderr. null
N/A .commandDirectory The directory to execute the command line from. current directory
-jobProject .jobProject The project name for the job. default job runner project
-jobQueue .jobQueue The queue to dispatch the job. default job runner queue
-jobPriority .jobPriority The dispatch priority for the job. Lowest priority = 0. Highest priority = 100. default job runner priority
-jobNative .jobNativeArgs Native args to pass to the job runner. Currently only supported in GridEngine and Drmaa. The string is concatenated to the native arguments passed over DRMAA. Example: -w n. none
-jobResReq .jobResourceRequests Resource requests to pass to the job runner. On GridEngine this is multiple -l <req>. On LSF a single -R <req> is generated. memory reservations and limits on LSF and GridEngine
-jobEnv .jobEnvironmentNames Predefined environment names to pass to the job runner. On GridEngine this is -pe <env>. On LSF this is -a <env>. none
-memLimit .memoryLimit The memory limit for the job in gigabytes. Used to populate the variables residentLimit and residentRequest which can also be set separately. default job runner memory limit
-resMemLimit .residentLimit Limit for the resident memory in gigabytes. On GridEngine this is -l h_rss=<mem>. On LSF this is -R rusage[select=<mem>]. memoryLimit * 1.2
-resMemReq .residentRequest Requested amount of resident memory in gigabytes. On GridEngine this is -l mem_free=<mem>. On LSF this is -R rusage[mem=<mem>]. memoryLimit
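
Several of these properties can also be set directly on a function instance inside a QScript, overriding the command-line defaults for that job only. A hedged sketch (values are illustrative):

val snps = new UnifiedGenotyper
snps.jobQueue = "week"                     // overrides -jobQueue for this function only
snps.jobOutputFile = new File("snps.out")  // captures stdout (and stderr, since jobErrorFile is unset)
snps.isIntermediate = true                 // output is deleted unless -keepIntermediates is passed
add(snps)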

3. Email Status Options

Command Line Argument Description Default
-emailHost <hostname> SMTP host name localhost
-emailPort <port> SMTP port 25
-emailTLS If set uses TLS. not set
-emailSSL If set uses SSL. not set
-emailUser <username> If set along with emailPass or emailPassFile authenticates the email with this username. not set
-emailPassFile <file> If emailUser is also set authenticates the email with contents of the file. not set
-emailPass <password> If emailUser is also set authenticates the email with this password. NOT SECURE: Use emailPassFile instead! not set
Comments (11)

1. Introduction

As mentioned in the introductory materials, the core concept behind the GATK tools is the walker. The Queue scripting framework contains several mechanisms which make it easy to chain together GATK walkers.

2. Authoring walkers

As part of authoring your walker there are several Queue behaviors that you can specify for QScript authors using your particular walker.

Specifying how to partition

Queue can significantly speed up generating walker outputs by passing different instances of the GATK the same BAM or VCF data but specifying different regions of the data to analyze. After the different instances output their individual results, Queue will gather the results back to the original output path requested by the QScript.

Queue limits how finely it will split genomic data by examining the @PartitionBy() annotation for your walker, which specifies a PartitionType. This table lists the different partition types along with the default partition level for each of the different walker types.

PartitionType Default for Walker Type Description Example Intervals Example Splits
PartitionType.CONTIG Read walkers Data is grouped together so that all genomic data from the same contig is never presented to two different instances of the GATK. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 split 1: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60; split 2:chr3:10-11
PartitionType.INTERVAL (none) Data is split down to the interval level but never divides up an explicitly specified interval. If no explicit intervals are specified in the QScript for the GATK then this is effectively the same as splitting by contig. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 split 1: chr1:10-11, chr2:10-20, chr2:30-40; split 2: chr2:50-60, chr3:10-11
PartitionType.LOCUS Locus walkers, ROD walkers Data is split down to the locus level possibly dividing up intervals. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 split 1: chr1:10-11, chr2:10-20, chr2:30-35; split 2: chr2:36-40, chr2:50-60, chr3:10-11
PartitionType.NONE Read pair walkers, Duplicate walkers The data cannot be split and Queue must run the single instance of the GATK as specified in the QScript. original: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11 no split: chr1:10-11, chr2:10-20, chr2:30-40, chr2:50-60, chr3:10-11

If your walker is implemented in a way that Queue should not divide up your data, you should explicitly set @PartitionBy(PartitionType.NONE). If your walker can theoretically be run per genome location, specify @PartitionBy(PartitionType.LOCUS).

@PartitionBy(PartitionType.LOCUS)
public class ExampleWalker extends LocusWalker<Integer, Integer> {
...

Specifying how to join outputs

Queue will join the standard walker outputs.

Output type Default gatherer implementation
SAMFileWriter The BAM files are joined together using Picard's MergeSamFiles.
VCFWriter The VCF files are joined together using the GATK CombineVariants.
PrintStream The first two files are scanned for a common header. The header is written once into the output, and then each file is appended to the output, skipping past the header lines.

If your PrintStream is not a simple text file that can be concatenated together, you must implement a Gatherer. Extend your custom Gatherer from the abstract base class and implement the gather() method.

package org.broadinstitute.sting.commandline;

import java.io.File;
import java.util.List;

/**
 * Combines a list of files into a single output.
 */
public abstract class Gatherer {
    /**
     * Gathers a list of files into a single output.
     * @param inputs Files to combine.
     * @param output Path to output file.
     */
    public abstract void gather(List<File> inputs, File output);

    /**
     * Returns true if the caller should wait for the input files to propagate over NFS before running gather().
     */
    public boolean waitForInputs() { return true; }
}
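
As an illustration, a custom gatherer might concatenate text outputs while writing the shared header only once. This is a hypothetical sketch (written in Scala, as it would appear alongside a QScript) and assumes each input begins with a single header line:

import java.io.{File, PrintWriter}
import scala.io.Source
import org.broadinstitute.sting.commandline.Gatherer

class MyGatherer extends Gatherer {
  def gather(inputs: java.util.List[File], output: File) {
    val writer = new PrintWriter(output)
    var headerWritten = false
    val iter = inputs.iterator
    while (iter.hasNext) {
      val lines = Source.fromFile(iter.next).getLines()
      if (lines.hasNext) {
        val header = lines.next()
        if (!headerWritten) { writer.println(header); headerWritten = true }  // write the header once
        lines.foreach(line => writer.println(line))                           // append the remaining lines
      }
    }
    writer.close()
  }
}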

Specify your gatherer using the @Gather() annotation by your @Output.

@Output
@Gather(MyGatherer.class)
public PrintStream out;

Queue will run your custom gatherer to join the intermediate outputs together.

3. Using GATK walkers in Queue

Queue GATK Extensions

Running 'ant queue' builds a set of Queue extensions for the GATK engine. For every GATK walker and command line program in the compiled GenomeAnalysisTK.jar, a Queue-compatible wrapper is generated.

The extensions can be imported via import org.broadinstitute.sting.queue.extensions.gatk._

import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._

class MyQscript extends QScript {
...

Note that the generated GATK extensions will automatically handle shell-escaping of all values assigned to the various Walker parameters, so you can rest assured that all of your values will be taken literally by the shell. Do not attempt to escape values yourself -- i.e.,

Do this:

filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0")

NOT this:

filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"")

Listing variables

In addition to the GATK documentation on this wiki you can also find the full list of arguments for each walker extension in a variety of ways.

The source code for the extensions is generated during ant queue and placed in this directory:

build/queue-extensions/src

When properly configured an IDE can provide command completion of the walker extensions. See Queue with IntelliJ IDEA for our recommended settings.

If you do not have access to an IDE, you can still find the names of the generated variables using the command line. The generated variable names on each extension are based on the fullName of the Walker argument. To see the built-in documentation for each Walker, run the GATK with:

java -jar GenomeAnalysisTK.jar -T <walker name> -help

Once the import statement is specified you can add() instances of gatk extensions in your QScript's script() method.

Setting variables

If the GATK walker argument allows more than one value, you should specify the values as a List().

  def script() {
    val snps = new UnifiedGenotyper
    snps.reference_file = new File("testdata/exampleFASTA.fasta")
    snps.input_file = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
    add(snps)
  }

For each of the long-name arguments, the extensions also contain aliases to their short names, although using them may make your QScript harder for others to read.

  def script() {
    val snps = new UnifiedGenotyper
    snps.R = new File("testdata/exampleFASTA.fasta")
    snps.I = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
    add(snps)
  }

Here are a few more examples using various list assignment operators.

  def script() {
    val countCovariates = new CountCovariates

    // Append to list using item appender :+
    countCovariates.rodBind :+= RodBind("dbsnp", "VCF", dbSNP)

    // Append to list using collection appender ++
    countCovariates.covariate ++= List("ReadGroupCovariate", "QualityScoreCovariate", "CycleCovariate", "DinucCovariate")

    // Assign list using plain old object assignment
    countCovariates.input_file = List(inBam)

    // The following is not a list, so just assigning one file to another
    countCovariates.recal_file = outRecalFile

    add(countCovariates)
  }

Specifying an alternate GATK jar

By default Queue runs the GATK from the current classpath. This works best since the extensions are generated and compiled at the same time the GATK is compiled via ant queue.

If you need to swap in a different version of the GATK you may not be able to use the generated extensions. The alternate GATK jar must have the same command line arguments as the GATK compiled with Queue. Otherwise the arguments will not match and you will get an error when Queue attempts to run the alternate GATK jar. In this case you will have to create your own custom CommandLineFunction for your analysis.

  def script {
    val snps = new UnifiedGenotyper
    snps.jarFile = new File("myPatchedGATK.jar")
    snps.reference_file = new File("testdata/exampleFASTA.fasta")
    snps.input_file = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
    add(snps)
  }

GATK scatter/gather

Queue currently allows QScript authors to explicitly invoke scatter/gather on GATK walkers by setting the scatter count on a function.

  def script {
    val snps = new UnifiedGenotyper
    snps.reference_file = new File("testdata/exampleFASTA.fasta")
    snps.input_file = List(new File("testdata/exampleBAM.bam"))
    snps.out = new File("snps.vcf")
    snps.scatterCount = 20
    add(snps)
  }

This will run the UnifiedGenotyper up to 20 ways in parallel and then merge the partial VCFs back into a single snps.vcf.

Additional caveat

Some walkers are still being updated to fully support Queue. For example, they may not have defined @Input and @Output annotations, so Queue is unable to correctly track their dependencies, or a custom Gatherer may not be implemented yet.

Comments (19)

We have found that Queue works best with IntelliJ IDEA Community Edition (free) or Ultimate Edition with the Scala Plugin enabled. Once you have downloaded IntelliJ IDEA, follow the instructions below to set up a Sting project with Queue and the Scala Plugin.

[Screenshots: Project Libraries, Module Sources, Module Dependencies, Scala Facet]

1. Build Queue on the Command Line

Build Queue from source from the command line with ant queue, so that:

  • The lib folder is initialized, including the scala jars
  • The queue-extensions for the GATK are generated to the build folder

2. Add the scala plugin

  • In IntelliJ, open the menu File > Settings
  • Under the IDE Settings in the left navigation list select Plugins
  • Click on the Available tab under plugins
  • Scroll down in the list of available plugins and install the scala plugin
  • If asked to retrieve dependencies, click No. The correct scala libraries and compiler are already available in the lib folder from when you built Queue from the command line
  • Restart IntelliJ to load the scala plugin

3. Creating a new Sting Project including Queue

  • Select the menu File... > New Project...

  • On the first page of "New Project", select Create project from scratch, then click Next >

  • On the second page of "New Project": set the project Name: to Sting, set the Project files location: to the directory where you checked out the Sting git repository (for example /Users/jamie/src/Sting), uncheck Create Module, then click Finish

  • The "Project Structure" window should open. If not open it via the menu File > Project Structure

  • Under the Project Settings in the left panel of "Project Structure", select Project. Make sure that Project SDK is set to a build of 1.6. If the Project SDK only lists <No SDK>, add a New > JSDK pointing to /System/Library/Frameworks/JavaVM.framework/Versions/1.6

  • Under the Project Settings in the left panel of "Project Structure", select Libraries. Click the plus (+) to create a new Project Library, set the Name: to Sting/lib, select Attach Jar Directories, and select the path to the lib folder under your checkout

  • Under the Project Settings in the left panel of "Project Structure" select Modules

  • Click on the + box to add a new module

  • On the first page of "Add Module", select Create module from scratch, then click Next >

  • On the second page of "Add Module": set the module Name: to Sting, change the Content root to: <directory where you checked out the Sting repository>, then click Next >

  • On the third page, uncheck all of the other source directories, leaving only the java/src directory checked, then click Next >

  • On the fourth page, click Finish

  • Back in the Project Structure window, under the Module 'Sting', on the Sources tab, make sure the following folders are selected:

    • Source Folders (in blue): public/java/src, public/scala/src, private/java/src (Broad only), private/scala/src (Broad only), build/queue-extensions/src
    • Test Source Folders (in green): public/java/test, public/scala/test, private/java/test (Broad only), private/scala/test (Broad only)
  • In the Project Structure window, under the Module 'Sting', on the Module Dependencies tab, click the Add... button, select Library... from the popup menu, select the Sting/lib library, and click Add selected

  • Refresh the Project Structure window so that it becomes aware of the Scala library in Sting/lib: click the OK button, then reopen Project Structure via the menu File > Project Structure

  • In the second panel, click on the Sting module, then click the plus (+) button above the second panel. In the popup menu under Facet, select Scala. On the right, under Facet 'Scala', set the Compiler library: to Sting/lib and click OK

4. Enable annotation processing

  • Open the menu File > Settings
  • Under Project Settings [Sting] in the left navigation list select Compiler then Annotation Processors
  • Click to enable the checkbox Enable annotation processing
  • Leave the radio button obtain processors from the classpath selected
  • Click OK

5. Debugging Queue

Adding a Remote Configuration

[Screenshot: Queue Remote Debug]

  • In IntelliJ 10 open the menu Run > Edit Configurations.

  • Click the gold [+] button at the upper left to open the Add New Configuration popup menu.

  • Select Remote from the popup menu.

  • With the new configuration selected on the left, change the configuration name from 'Unnamed' to something like 'Queue Remote Debug'.

  • Set the Host to the hostname of your server, and the Port to an unused port. You can try the default port of 5005.

  • From the "Use the following command line arguments for running remote JVM" field, copy the argument string.

  • On the server, paste / modify your command line to run with the previously copied text, for example java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 -jar Queue.jar -S myscript.scala ....

  • If you would like the program to wait for you to attach the debugger before running, change suspend=n to suspend=y.

  • Back in IntelliJ, click OK to save your changes.

Running with the Remote Configuration

  • Ensure Queue Remote Debug is selected via the configuration drop down or Run > Edit Configurations.
  • Set your breakpoints as you normally would in IntelliJ.
  • Start your program by running the full java path (with the above -Xdebug -Xrunjdwp ...) on the server.
  • In IntelliJ, go to Run > Debug.

6. Binding javadocs and source

From Stack overflow:

Add javadocs:

Point IntelliJ to http://download.oracle.com/javase/6/docs/api/.
Go to File -> Project Structure -> SDKs -> Apple 1.x -> DocumentationPaths, and then click specify URL.

Add sources:

In IntelliJ, open File -> Project Structure. Click on "SDKs" under "Platform Settings". Add the following path under the Sourcepath tab: /Library/Java/JavaVirtualMachines/1.6.0_29-b11-402.jdk/Contents/Home/src.jar!/src

Comments (6)

1. Introduction

Queue pipelines are Scala 2.8 files with a bit of syntactic sugar, called QScripts. Check out the following as references.

  • http://programming-scala.labs.oreilly.com
  • http://www.scala-lang.org/docu/files/ScalaByExample.pdf
  • http://davetron5000.github.com/scala-style/index.html

QScripts are easiest to develop using an Integrated Development Environment. See Queue with IntelliJ IDEA for our recommended settings.

The following is a basic outline of a QScript:

import org.broadinstitute.sting.queue.QScript
// List other imports here

// Define the overall QScript here.
class MyScript extends QScript {
  // List script arguments here.
  @Input(doc="My QScript inputs")
  var scriptInput: File = _

  // Create and add the functions in the script here.
  def script = {
     var myCL = new MyCommandLine
     myCL.myInput = scriptInput // Example variable input
     myCL.myOutput = new File("/path/to/output") // Example hardcoded output
     add(myCL)
  }

}

2. Imports

Imports can be any scala or java imports in scala syntax.

import java.io.File
import scala.util.Random
import org.favorite.my._
// etc.

3. Classes

  • To add a CommandLineFunction to a pipeline, a class must be defined that extends QScript.

  • The QScript must define a method script.

  • The QScript can define helper methods or variables.

4. Script method

The body of script should create and add Queue CommandLineFunctions.

class MyScript extends org.broadinstitute.sting.queue.QScript {
  def script = add(new CommandLineFunction { def commandLine = "echo hello world" })
}

5. Command Line Arguments

  • A QScript can be set to read command line arguments by defining variables with @Input, @Output, or @Argument annotations.

  • A command line argument can be a primitive scalar, enum, File, or scala immutable Array, List, Set, or Option of a primitive, enum, or File.

  • QScript command line arguments can be marked as optional by setting required=false.

    class MyScript extends org.broadinstitute.sting.queue.QScript {
      @Input(doc="example message to echo")
      var message: String = _

      def script = add(new CommandLineFunction { def commandLine = "echo " + message })
    }
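
For instance, a sketch of a QScript with an additional optional argument (the names and default value are illustrative):

    class MyScript extends org.broadinstitute.sting.queue.QScript {
      @Argument(doc="number of times to echo", required=false)
      var count: Int = 1   // used when the argument is omitted on the command line

      @Input(doc="example message to echo")
      var message: String = _

      def script = {
        for (i <- 1 to count)
          add(new CommandLineFunction { def commandLine = "echo " + message + " " + i })
      }
    }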

6. Using and writing CommandLineFunctions

Adding existing GATK walkers

See Pipelining the GATK using Queue for more information on the automatically generated Queue wrappers for GATK walkers.

After functions are defined they should be added to the QScript pipeline using add().

for (vcf <- vcfs) {
  val ve = new VariantEval
  ve.vcfFile = vcf
  ve.evalFile = swapExt(vcf, "vcf", "eval")
  add(ve)
}

Defining new CommandLineFunctions

  • Queue tracks dependencies between functions via variables annotated with @Input and @Output.

  • Queue will run functions based on the dependencies between them, not based on the order in which they are added in the script! So if the @Input of CommandLineFunction A depends on the @Output of CommandLineFunction B, A will wait for B to finish before it starts running.

  • See the main article Queue CommandLineFunctions for more information.

7. Examples

  • The latest version of the example files are available in the Sting git repository under public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/.

  • To print the list of arguments required by an existing QScript run with -help.

  • To check if your script has all of the CommandLineFunction variables set correctly, run without -run.
  • When you are ready to execute the full pipeline, add -run.

Hello World QScript

The following is a "hello world" example that runs a single command line to echo hello world.

import org.broadinstitute.sting.queue.QScript

class HelloWorld extends QScript {
  def script = {
    add(new CommandLineFunction {
      def commandLine = "echo hello world"
    })
  }
}

The above file is checked into the Sting git repository under HelloWorld.scala. After building Queue from source, the QScript can be run with the following command:

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run

It should produce output similar to:

INFO  16:23:27,825 QScriptManager - Compiling 1 QScript 
INFO  16:23:31,289 QScriptManager - Compilation complete 
INFO  16:23:34,631 HelpFormatter - --------------------------------------------------------- 
INFO  16:23:34,631 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine 
INFO  16:23:34,632 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/HelloWorld.scala -run  
INFO  16:23:34,632 HelpFormatter - Date/Time: 2011/01/14 16:23:34 
INFO  16:23:34,632 HelpFormatter - --------------------------------------------------------- 
INFO  16:23:34,632 HelpFormatter - --------------------------------------------------------- 
INFO  16:23:34,634 QCommandLine - Scripting HelloWorld 
INFO  16:23:34,651 QCommandLine - Added 1 functions 
INFO  16:23:34,651 QGraph - Generating graph. 
INFO  16:23:34,660 QGraph - Running jobs. 
INFO  16:23:34,689 ShellJobRunner - Starting: echo hello world 
INFO  16:23:34,689 ShellJobRunner - Output written to /Users/kshakir/src/Sting/Q-43031@bmef8-d8e-1.out 
INFO  16:23:34,771 ShellJobRunner - Done: echo hello world 
INFO  16:23:34,773 QGraph - Deleting intermediate files. 
INFO  16:23:34,773 QCommandLine - Done 

ExampleUnifiedGenotyper.scala

This example uses automatically generated Queue compatible wrappers for the GATK. See Pipelining the GATK using Queue for more info on authoring Queue support into walkers and using walkers in Queue.

The ExampleUnifiedGenotyper.scala for running the UnifiedGenotyper followed by VariantFiltration can be found in the examples folder.

To list the command line parameters, including the required parameters, run with -help.

java -jar dist/Queue.jar -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -help

The help output should appear similar to this:

INFO  10:26:08,491 QScriptManager - Compiling 1 QScript
INFO  10:26:11,926 QScriptManager - Compilation complete
---------------------------------------------------------
Program Name: org.broadinstitute.sting.queue.QCommandLine
---------------------------------------------------------
---------------------------------------------------------
usage: java -jar Queue.jar -S <script> [-run] [-jobRunner <job_runner>] [-bsub] [-status] [-retry <retry_failed>]
       [-startFromScratch] [-keepIntermediates] [-statusTo <status_email_to>] [-statusFrom <status_email_from>] [-dot
       <dot_graph>] [-expandedDot <expanded_dot_graph>] [-jobPrefix <job_name_prefix>] [-jobProject <job_project>] [-jobQueue
       <job_queue>] [-jobPriority <job_priority>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
       <temp_directory>] [-jobSGDir <job_scatter_gather_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>]
       [-emailTLS] [-emailSSL] [-emailUser <emailUsername>] [-emailPassFile <emailPasswordFile>] [-emailPass <emailPassword>]
       [-l <logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h] -R <referencefile> -I <bamfile> [-L <intervals>]
       [-filter <filternames>] [-filterExpression <filterexpressions>]

 -S,--script <script>                                                      QScript scala file
 -run,--run_scripts                                                        Run QScripts.  Without this flag set only
                                                                           performs a dry run.
 -jobRunner,--job_runner <job_runner>                                      Use the specified job runner to dispatch
                                                                           command line jobs
 -bsub,--bsub                                                              Equivalent to -jobRunner Lsf706
 -status,--status                                                          Get status of jobs for the qscript
 -retry,--retry_failed <retry_failed>                                      Retry the specified number of times after a
                                                                           command fails.  Defaults to no retries.
 -startFromScratch,--start_from_scratch                                    Runs all command line functions even if the
                                                                           outputs were previously output successfully.
 -keepIntermediates,--keep_intermediate_outputs                            After a successful run keep the outputs of
                                                                           any Function marked as intermediate.
 -statusTo,--status_email_to <status_email_to>                             Email address to send emails to upon
                                                                           completion or on error.
 -statusFrom,--status_email_from <status_email_from>                       Email address to send emails from upon
                                                                           completion or on error.
 -dot,--dot_graph <dot_graph>                                              Outputs the queue graph to a .dot file.  See:
                                                                           http://en.wikipedia.org/wiki/DOT_language
 -expandedDot,--expanded_dot_graph <expanded_dot_graph>                    Outputs the queue graph of scatter gather to
                                                                           a .dot file.  Otherwise overwrites the
                                                                           dot_graph
 -jobPrefix,--job_name_prefix <job_name_prefix>                            Default name prefix for compute farm jobs.
 -jobProject,--job_project <job_project>                                   Default project for compute farm jobs.
 -jobQueue,--job_queue <job_queue>                                         Default queue for compute farm jobs.
 -jobPriority,--job_priority <job_priority>                                Default priority for jobs.
 -memLimit,--default_memory_limit <default_memory_limit>                   Default memory limit for jobs, in gigabytes.
 -runDir,--run_directory <run_directory>                                   Root directory to run functions from.
 -tempDir,--temp_directory <temp_directory>                                Temp directory to pass to functions.
 -jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>   Default directory to place scatter gather
                                                                           output for compute farm jobs.
 -emailHost,--emailSmtpHost <emailSmtpHost>                                Email SMTP host. Defaults to localhost.
 -emailPort,--emailSmtpPort <emailSmtpPort>                                Email SMTP port. Defaults to 465 for ssl,
                                                                           otherwise 25.
 -emailTLS,--emailUseTLS                                                   Email should use TLS. Defaults to false.
 -emailSSL,--emailUseSSL                                                   Email should use SSL. Defaults to false.
 -emailUser,--emailUsername <emailUsername>                                Email SMTP username. Defaults to none.
 -emailPassFile,--emailPasswordFile <emailPasswordFile>                    Email SMTP password file. Defaults to none.
 -emailPass,--emailPassword <emailPassword>                                Email SMTP password. Defaults to none. Not
                                                                           secure! See emailPassFile.
 -l,--logging_level <logging_level>                                        Set the minimum level of logging, i.e.
                                                                           setting INFO get's you INFO up to FATAL,
                                                                           setting ERROR gets you ERROR and FATAL level
                                                                           logging.
 -log,--log_to_file <log_to_file>                                          Set the logging location
 -quiet,--quiet_output_mode                                                Set the logging to quiet mode, no output to
                                                                           stdout
 -debug,--debug_mode                                                       Set the logging file string to include a lot
                                                                           of debugging information (SLOW!)
 -h,--help                                                                 Generate this help message

Arguments for ExampleUnifiedGenotyper:
 -R,--referencefile <referencefile>                          The reference file for the bam files.
 -I,--bamfile <bamfile>                                      Bam file to genotype.
 -L,--intervals <intervals>                                  An optional file with a list of intervals to proccess.
 -filter,--filternames <filternames>                         A optional list of filter names.
 -filterExpression,--filterexpressions <filterexpressions>   An optional list of filter expressions.


##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.commandline.MissingArgumentException:
Argument with name '--bamfile' (-I) is missing.
Argument with name '--referencefile' (-R) is missing.
        at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:192)
        at org.broadinstitute.sting.commandline.ParsingEngine.validate(ParsingEngine.java:172)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:199)
        at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:57)
        at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 1.0.5504):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: Argument with name '--bamfile' (-I) is missing.
##### ERROR Argument with name '--referencefile' (-R) is missing.
##### ERROR ------------------------------------------------------------------------------------------

To dry run the pipeline:

java \
  -Djava.io.tmpdir=tmp \
  -jar dist/Queue.jar \
  -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala \
  -R human_b36_both.fasta \
  -I pilot2_daughters.chr20.10k-11k.bam \
  -L chr20.interval_list \
  -filter StrandBias -filterExpression "SB>=0.10" \
  -filter AlleleBalance -filterExpression "AB>=0.75" \
  -filter QualByDepth -filterExpression "QD<5" \
  -filter HomopolymerRun -filterExpression "HRun>=4"

The dry run output should appear similar to this:

INFO  10:45:00,354 QScriptManager - Compiling 1 QScript
INFO  10:45:04,855 QScriptManager - Compilation complete
INFO  10:45:05,058 HelpFormatter - ---------------------------------------------------------
INFO  10:45:05,059 HelpFormatter - Program Name: org.broadinstitute.sting.queue.QCommandLine
INFO  10:45:05,059 HelpFormatter - Program Args: -S public/scala/qscript/org/broadinstitute/sting/queue/qscripts/examples/ExampleUnifiedGenotyper.scala -R human_b36_both.fasta -I pilot2_daughters.chr20.10k-11k.bam -L chr20.interval_list -filter StrandBias -filterExpression SB>=0.10 -filter AlleleBalance -filterExpression AB>=0.75 -filter QualByDepth -filterExpression QD<5 -filter HomopolymerRun -filterExpression HRun>=4 
INFO  10:45:05,059 HelpFormatter - Date/Time: 2011/03/24 10:45:05
INFO  10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO  10:45:05,059 HelpFormatter - ---------------------------------------------------------
INFO  10:45:05,061 QCommandLine - Scripting ExampleUnifiedGenotyper
INFO  10:45:05,150 QCommandLine - Added 4 functions
INFO  10:45:05,150 QGraph - Generating graph.
INFO  10:45:05,169 QGraph - Generating scatter gather jobs.
INFO  10:45:05,182 QGraph - Removing original jobs.
INFO  10:45:05,183 QGraph - Adding scatter gather jobs.
INFO  10:45:05,231 QGraph - Regenerating graph.
INFO  10:45:05,247 QGraph - -------
INFO  10:45:05,252 QGraph - Pending: IntervalScatterFunction /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals
INFO  10:45:05,253 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/scatter/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,254 QGraph - -------
INFO  10:45:05,279 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO  10:45:05,279 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,279 QGraph - -------
INFO  10:45:05,283 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO  10:45:05,283 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,283 QGraph - -------
INFO  10:45:05,287 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T UnifiedGenotyper -I /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.bam -L /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/scatter.intervals -R /Users/kshakir/src/Sting/human_b36_both.fasta -o /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf
INFO  10:45:05,287 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,288 QGraph - -------
INFO  10:45:05,288 QGraph - Pending: SimpleTextGatherFunction /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,288 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-jobOutputFile/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,289 QGraph - -------
INFO  10:45:05,291 QGraph - Pending: java -Xmx1g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T CombineVariants -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:input0,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-1/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input1,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-2/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -B:input2,VCF /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/temp-3/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -priority input0,input1,input2 -assumeIdenticalSamples
INFO  10:45:05,291 QGraph - Log: /Users/kshakir/src/Sting/queueScatterGather/Q-60018@bmef8-d8e-1-sg/gather-out/Q-60018@bmef8-d8e-1.out
INFO  10:45:05,292 QGraph - -------
INFO  10:45:05,296 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.eval
INFO  10:45:05,296 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-2.out
INFO  10:45:05,296 QGraph - -------
INFO  10:45:05,299 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantFiltration -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:vcf,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.unfiltered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -filter SB>=0.10 -filter AB>=0.75 -filter QD<5 -filter HRun>=4 -filterName StrandBias -filterName AlleleBalance -filterName QualByDepth -filterName HomopolymerRun
INFO  10:45:05,299 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-3.out
INFO  10:45:05,302 QGraph - -------
INFO  10:45:05,303 QGraph - Pending: java -Xmx2g -Djava.io.tmpdir=/Users/kshakir/src/Sting/tmp -cp "/Users/kshakir/src/Sting/dist/Queue.jar" org.broadinstitute.sting.gatk.CommandLineGATK -T VariantEval -L /Users/kshakir/src/Sting/chr20.interval_list -R /Users/kshakir/src/Sting/human_b36_both.fasta -B:eval,VCF /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.vcf -o /Users/kshakir/src/Sting/pilot2_daughters.chr20.10k-11k.filtered.eval
INFO  10:45:05,303 QGraph - Log: /Users/kshakir/src/Sting/Q-60018@bmef8-d8e-4.out
INFO  10:45:05,304 QGraph - Dry run completed successfully!
INFO  10:45:05,304 QGraph - Re-run with "-run" to execute the functions.
INFO  10:45:05,304 QCommandLine - Done

8. Using traits to pass common values from QScripts to CommandLineFunctions

QScript files often create multiple CommandLineFunctions with similar arguments. Use various scala tricks such as inner classes, traits / mixins, etc. to reuse variables.

  • A self-type alias can be useful to disambiguate this. We use qscript as an alias for the QScript's this, to distinguish it from the this inside of inner classes or traits.

  • A trait mixin can be used to reuse functionality. The trait below is designed to copy values from the QScript and then is mixed into different instances of the functions.

See the following example:

class MyScript extends org.broadinstitute.sting.queue.QScript {
  // Create an alias 'qscript' for 'MyScript.this'
  qscript =>

  // This is a script argument
  @Argument(doc="message to display")
  var message: String = _

  // This is a script argument
  @Argument(doc="number of times to display")
  var count: Int = _

  trait ReusableArguments extends MyCommandLineFunction {
    // Whenever a function is created 'with' this trait, it will copy the message.
    this.commandLineMessage = qscript.message
  }

  abstract class MyCommandLineFunction extends CommandLineFunction {
     // This is a per command line argument
     @Argument(doc="message to display")
     var commandLineMessage: String = _
  }

  class MyEchoFunction extends MyCommandLineFunction {
     def commandLine = "echo " + commandLineMessage
  }

  class MyAlsoEchoFunction extends MyCommandLineFunction {
     def commandLine = "echo also " + commandLineMessage
  }

  def script = {
    for (i <- 1 to count) {
      val echo = new MyEchoFunction with ReusableArguments
      val alsoEcho = new MyAlsoEchoFunction with ReusableArguments
      add(echo, alsoEcho)
    }
  }
}
Comments (11)

1. Introduction

GATK-Queue is a command-line scripting framework for defining multi-stage genomic analysis pipelines, combined with an execution manager that runs those pipelines from end to end. Processing genome data often includes several steps to produce outputs; for example, our BAM-to-VCF calling pipeline includes, among other things:

  • Local realignment around indels
  • Emitting raw SNP calls
  • Emitting indels
  • Masking the SNPs at indels
  • Annotating SNPs using chip data
  • Labeling suspicious calls based on filters
  • Creating a summary report with statistics

Running these tools one by one in series can often take weeks of processing, or would require custom scripting to try to optimize the use of parallel resources.

With a Queue script, users can define the multiple steps of the pipeline and then hand off to Queue the logistics of running the pipeline to completion. Queue runs independent jobs in parallel, handles transient errors, and uses various techniques such as running multiple copies of the same program on different portions of the genome to produce outputs faster.


2. Obtaining Queue

You have two options: download the binary distribution (prepackaged, ready to run program) or build it from source.

- Download the binary

This is obviously the easiest way to go. Links are on the Downloads page. Just get the Queue package; no need to get the GATK package separately as GATK is bundled in with Queue.

- Building Queue from source

Briefly, here's what you need to know/do:

Queue is part of the GATK repository. Download the source from the public repository on Github. Run the following command:

git clone https://github.com/broadgsa/gatk.git

IMPORTANT NOTE: These instructions refer to the MIT-licensed version of the GATK+Queue source code. With that version, you will be able to build Queue itself, as well as the public portion of the GATK (the core framework), but that will not include the GATK analysis tools. If you want to use Queue to pipeline the GATK analysis tools, you need to clone the 'protected' repository. Please note however that part of the source code in that repository (the 'protected' module) is under a different license which excludes for-profit use, modification and redistribution.

Move to the git root directory and use maven to build the source.

mvn clean verify

All dependencies will be managed by Maven as needed.

See this article on how to test your installation of Queue.


3. Running Queue

See this article on running Queue for the first time for full details.

Queue arguments can be listed by running with --help

java -jar dist/Queue.jar --help

To list the arguments required by a QScript, add the script with -S and run with --help.

java -jar dist/Queue.jar -S script.scala --help

Note that by default Queue runs in a "dry" mode, as explained in the link above. After verifying the generated commands, execute the pipeline by adding -run.
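
For example, assuming your script is named script.scala as above:

java -jar dist/Queue.jar -S script.scala -run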

See QFunction and Command Line Options for more info on adjusting Queue options.

4. QScripts

General Information

Queue pipelines are written as Scala 2.8 files with a bit of syntactic sugar, called QScripts.

Every QScript includes the following steps:

  • New instances of CommandLineFunctions are created
  • Input and output arguments are specified on each function
  • The function is added with add() to Queue for dispatch and monitoring

The basic command to run a Queue pipeline is

java -jar Queue.jar -S <script>.scala

See the main article Queue QScripts for more info on QScripts.

Supported QScripts

Most QScripts are analysis pipelines that are custom-built for specific projects, and we currently do not offer any QScripts as supported analysis tools. However, we do provide some example scripts that you can use as a basis for writing your own QScripts (see below).

Example QScripts

The latest versions of the example files are available in the Sting github repository under public/scala/qscript/examples


5. Visualization and Queue

QJobReport

Queue automatically generates GATKReport-formatted runtime information about executed jobs. See this presentation for a general introduction to QJobReport.

Note that Queue attempts to generate a standard visualization using an R script in the GATK public/R repository. You must provide a path to this location if you want the script to run automatically. Additionally, the script requires gsalib to be installed on the machine, which is typically done by providing its path in your .Rprofile file:

bm8da-dbe ~/Desktop/broadLocal/GATK/unstable % cat ~/.Rprofile
.libPaths("/Users/depristo/Desktop/broadLocal/GATK/unstable/public/R/")

Note that gsalib is available from the CRAN repository so you can install it with the canonical R package install command.
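
For example, from within an R session you can typically run the standard CRAN install command:

install.packages("gsalib")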

Caveats

  • The system only provides information about commands that have just run. Resuming from a partially completed job will only show the information for the jobs that just ran, and not for any of the previously completed commands. This is due to a structural limitation in Queue, and will be fixed when the Queue infrastructure improves.

  • This feature only works for the command line and LSF execution models. SGE support should be easy to add for a motivated individual, but we cannot test this capability here at the Broad. Please send us a patch if you do extend Queue to support SGE.

DOT visualization of Pipelines

Queue emits a queue.dot file to help visualize your commands. You can open this file in programs like Graphviz (dot), OmniGraffle, etc. to view your pipelines. By default the system will print out your LSF command lines, but this can be too much in a complex pipeline.

To clarify your pipeline, override the dotString() function:

class CountCovariates(bamIn: File, recalDataIn: File, args: String = "") extends GatkFunction {
    @Input(doc="foo") var bam = bamIn
    @Input(doc="foo") var bamIndex = bai(bamIn)
    @Output(doc="foo") var recalData = recalDataIn
    memoryLimit = Some(4)
    override def dotString = "CountCovariates: %s [args %s]".format(bamIn.getName, args)
    def commandLine = gatkCommandLine("CountCovariates") + args + " -l INFO -D /humgen/gsa-hpprojects/GATK/data/dbsnp_129_hg18.rod -I %s --max_reads_at_locus 20000 -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile %s".format(bam, recalData)
}

Here we only see, for example, CountCovariates: my.bam [args -OQ] in the dot file. The base quality score recalibration pipeline, as visualized by DOT, can be viewed here:

6. Further reading

Comments (1)

1. Introduction

The core concept behind GATK tools is the walker, a class that implements the three core operations: filtering, mapping, and reducing.

  • filter Reduces the size of the dataset by applying a predicate.

  • map Applies a function to each individual element in a dataset, effectively mapping it to a new element.

  • reduce Inductively combines the elements of a list. The base case is supplied by the reduceInit() function, and the inductive step is performed by the reduce() function.

Users of the GATK will provide a walker to run their analyses. The engine will produce a result by first filtering the dataset, running a map operation, and finally reducing the map operation to a single result.

2. Creating a Walker

To be usable by the GATK, the walker must satisfy the following properties:

  • It must subclass one of the basic walkers in the org.broadinstitute.sting.gatk.walkers package, usually ReadWalker or LocusWalker.

    • Locus walkers present all the reads, reference bases, and reference-ordered data that overlap a single base in the reference. Locus walkers are best used for analyses that look at each locus independently, such as genotyping.

    • Read walkers present only one read at a time, as well as the reference bases and reference-ordered data that overlap that read.

    • Besides read walkers and locus walkers, the GATK features several other data access patterns, described here.

  • The compiled class or jar must be on the current classpath. The Java classpath can be controlled using either the $CLASSPATH environment variable or the JVM's -cp option.
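
As a concrete illustration, a minimal locus walker might look like the following sketch. This class is hypothetical and loosely modeled on CountLoci; the exact package paths and method signatures may differ between GATK versions:

import org.broadinstitute.sting.gatk.contexts.AlignmentContext;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.LocusWalker;

public class MyCountLociWalker extends LocusWalker<Integer, Long> {
    // map() is called once per covered locus; here we simply emit 1 for each locus visited
    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
        return 1;
    }

    // reduceInit() supplies the base case for the running total
    public Long reduceInit() {
        return 0L;
    }

    // reduce() folds each map() result into the running total
    public Long reduce(Integer value, Long sum) {
        return value + sum;
    }
}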

3. Examples

The best way to get started with the GATK is to explore the walkers we've written. Here are the best walkers to look at when getting started:

  • CountLoci

    It is the simplest locus walker in our codebase. It counts the number of loci walked over in a single run of the GATK.

$STING_HOME/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountLociWalker.java

  • CountReads

    It is the simplest read walker in our codebase. It counts the number of reads walked over in a single run of the GATK.

$STING_HOME/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountReadsWalker.java

  • GATKPaperGenotyper

    This is a more sophisticated example, taken from our recent paper in Genome Research (and using our ReadBackedPileup to select and filter reads). It is an extremely basic Bayesian genotyper that demonstrates how to output data to a stream and execute simple base operations.

$STING_HOME/java/src/org/broadinstitute/sting/gatk/examples/papergenotyper/GATKPaperGenotyper.java

Please note that the walker above is NOT the UnifiedGenotyper. While conceptually similar to the UnifiedGenotyper, the GATKPaperGenotyper uses a much simpler calling model for increased clarity and readability.

4. External walkers and the 'external' directory

The GATK can absorb external walkers placed in a directory of your choosing. By default, that directory is called 'external' and is relative to the Sting git root directory (for example, ~/src/Sting/external). However, you can choose to place that directory anywhere on the filesystem and specify its complete path using the ant external.dir property.

ant -Dexternal.dir=~/src/external

The GATK will check each directory under the external directory (but not the external directory itself!) for small build scripts. These build scripts must contain at least a compile target that compiles your walker and places the resulting class file into the GATK's class file output directory. The following is a sample compile target:

<target name="compile" depends="init">
    <javac srcdir="." destdir="${build.dir}" classpath="${gatk.classpath}" />
</target>

As a convenience, the build.dir ant property will be predefined to be the GATK's class file output directory and the gatk.classpath property will be predefined to be the GATK's core classpath. Once this structure is defined, any invocation of the ant build scripts will build the contents of the external directory as well as the GATK itself.
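
For example, a minimal external directory might be laid out as follows (the directory and file names are purely illustrative):

external/
    mywalker/
        build.xml              (contains at least the compile target shown above)
        MyExternalWalker.java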

Comments (0)

1. What is DiffEngine?

DiffEngine is a summarizing difference engine that allows you to compare two structured files -- such as BAMs and VCFs -- to find the differences between them. This is primarily useful in regression testing or optimization, where you want to ensure that the differences are those that you expect and not any others.

2. The summarized differences

The GATK contains a summarizing difference engine called DiffEngine that compares hierarchical data structures to emit:

  • A list of specific differences between the two data structures. This is similar to saying the value in field A in record 1 in file F differs from the value in field A in record 1 in file G.

  • A summarized list of differences ordered by frequency of the difference. This output is similar to saying field A differed in 50 records between files F and G.

3. The DiffObjects walker

The GATK contains a private walker called DiffObjects that allows you access to the DiffEngine capabilities on the command line. Simply provide the walker with the master and test files and it will emit summarized differences for you.

4. Understanding the output

The DiffEngine system compares two hierarchical data structures for specific differences in the values of named nodes. Suppose I have three trees:

Tree1=(A=1 B=(C=2 D=3)) 
Tree2=(A=1 B=(C=3 D=3 E=4))
Tree3=(A=1 B=(C=4 D=3 E=4))

where every node in the tree is either named or is a raw value (here all leaf values are integers). The DiffEngine traverses these data structures by name, identifies equivalent nodes by their fully qualified names (Tree1.A is distinct from Tree2.A), and determines whether their values are equal (Tree1.A=1 and Tree2.A=1, so they are).

These itemized differences are listed as:

Tree1.B.C=2 != Tree2.B.C=3
Tree1.B.C=2 != Tree3.B.C=4
Tree2.B.C=3 != Tree3.B.C=4
Tree1.B.E=MISSING != Tree2.B.E=4

This is conceptually very similar to the output of the unix command-line tool diff. What's nice about DiffEngine, though, is that it computes similarity among the itemized differences and displays the counts of differences by name. In the above example, the field C is unequal three times, while the missing E in Tree1 occurs only once. So the summary is:

*.B.C : 3
*.B.E : 1

where the * operator indicates that any named field matches. This output is sorted by counts, and provides an immediate picture of the commonly occurring differences between the files.

Below is a detailed example of two VCF files that differ because of a bug in the AC, AF, and AN counting routines, detected by the integration test framework (more below). You can see that although there are many specific instances of these differences between the two files, the summarized differences provide an immediate picture that the AC, AF, and AN fields are the major causes of the differences.

[testng] path                                                              count
[testng] *.*.*.AC                                                         6
[testng] *.*.*.AF                                                         6
[testng] *.*.*.AN                                                         6
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AC  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AF  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AN  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AC  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AF  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AN  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AC  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AF  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AN  1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000598.AC  1

5. Integration tests

The DiffEngine codebase that supports these calculations is integrated into the integrationtest framework, so that when a test fails the system automatically summarizes the differences between the master MD5 file and the failing MD5 file, provided the file is of an understood type. When a test fails you will see in the integration test logs not only the basic information, but also the detailed DiffEngine output.

For example, in the output below I broke the GATK BAQ calculation and the integration test DiffEngine clearly identifies that all of the records differ in their BQ tag value in the two BAM files:

/humgen/1kg/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam -o /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp -L 1:10,000,000-10,100,000 -baq RECALCULATE -et NO_ET
   [testng] WARN  22:59:22,875 TextFormattingUtils - Unable to load help text.  Help output will be sparse.
   [testng] WARN  22:59:22,875 TextFormattingUtils - Unable to load help text.  Help output will be sparse.
   [testng] ##### MD5 file is up to date: integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
   [testng] Checking MD5 for /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp [calculated=e5147656858fc4a5f470177b94b1fc1b, expected=4ac691bde1ba1301a59857694fda6ae2]
   [testng] ##### Test testPrintReadsRecalBAQ is going fail #####
   [testng] ##### Path to expected   file (MD5=4ac691bde1ba1301a59857694fda6ae2): integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest
   [testng] ##### Path to calculated file (MD5=e5147656858fc4a5f470177b94b1fc1b): integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
   [testng] ##### Diff command: diff integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
   [testng] ##:GATKReport.v0.1 diffences : Summarized differences between the master and test files.
   [testng] See http://www.broadinstitute.org/gsa/wiki/index.php/DiffObjectsWalker_and_SummarizedDifferences for more information
   [testng] Difference                                                                               NumberOfOccurrences
   [testng] *.*.*.BQ                                                                                 895
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:2:266:272:361.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:245:474:254.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:255:178:160.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:158:682:495.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:195:591:884.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:165:236:848.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:191:223:910.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:286:279:434.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:2:106:516:354.BQ  1
   [testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:3:102:580:518.BQ  1
   [testng]
   [testng] Note that the above list is not comprehensive.  At most 20 lines of output, and 10 specific differences will be listed.  Please use -T DiffObjects -R public/testdata/exampleFASTA.fasta -m integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest -t integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest to explore the differences more freely

6. Adding your own DiffableObjects to the system

The system dynamically finds all classes that implement the following simple interface:

public interface DiffableReader {
    @Ensures("result != null")
    /**
     * Return the name of this DiffableReader type.  For example, the VCF reader returns 'VCF' and the
     * bam reader 'BAM'
     */
    public String getName();

    @Ensures("result != null")
    @Requires("file != null")
    /**
     * Read up to maxElementsToRead DiffElements from file, and return them.
     */
    public DiffElement readFromFile(File file, int maxElementsToRead);

    /**
     * Return true if the file can be read into DiffElement objects with this reader. This should
     * be uniquely true/false for all readers, as the system will use the first reader that can read the
     * file.  This routine should never throw an exception.  The VCF reader, for example, looks at the
     * first line of the file for the ##format=VCF4.1 header, and the BAM reader for the BAM_MAGIC value
     * @param file
     * @return
     */
    @Requires("file != null")
    public boolean canRead(File file);
}

See the VCF and BAM DiffableReaders for example implementations. If you extend this to new object types, both the DiffObjects walker and the integration test framework will automatically work with your new file type.

Comments (1)

New WGS and WEx CEU trio BAM files

We have sequenced at the Broad Institute and released to the 1000 Genomes Project the following datasets for the three members of the CEU trio (NA12878, NA12891 and NA12892):

  • WEx (150x) sequence
  • WGS (>60x) sequence

This is better data to work with than the original DePristo et al. BAM files, so we recommend you download and analyze these files if you are looking for complete, large-scale data sets to evaluate the GATK or other tools.

Here are the rough library properties of the BAMs:

CEU trio BAM libraries

These data files can be downloaded from the 1000 Genomes DCC.

NA12878 Datasets from DePristo et al. (2011) Nature Genetics

Here are the datasets we used in the GATK paper cited below.

DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D and Daly, M (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 43:491-498.

Some of the BAM and VCF files are currently hosted by the NCBI: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/

  • NA12878.hiseq.wgs.bwa.recal.bam -- BAM file for NA12878 HiSeq whole genome
  • NA12878.hiseq.wgs.bwa.raw.bam Raw reads (in BAM format, see below)
  • NA12878.ga2.exome.maq.recal.bam -- BAM file for NA12878 GenomeAnalyzer II whole exome (hg18)
  • NA12878.ga2.exome.maq.raw.bam Raw reads (in BAM format, see below)
  • NA12878.hiseq.wgs.vcf.gz -- SNP calls for NA12878 HiSeq whole genome (hg18)
  • NA12878.ga2.exome.vcf.gz -- SNP calls for NA12878 GenomeAnalyzer II whole exome (hg18)
  • BAM files for CEU + NA12878 whole genome (b36). These are the standard BAM files for the 1000 Genomes pilot CEU samples plus a 4x downsampled version of NA12878 from the pilot 2 data set, available in the DePristoNatGenet2011 directory of the GSA FTP Server
  • SNP calls for CEU + NA12878 whole genome (b36) are available in the DePristoNatGenet2011 directory of the GSA FTP Server
  • Crossbow comparison SNP calls are available in the DePristoNatGenet2011 directory of the GSA FTP Server as crossbow.filtered.vcf. The raw calls can be viewed by ignoring the FILTER field status
  • whole_exome_agilent_designed_120.Homo_sapiens_assembly18.targets.interval_list -- targets used in the analysis of the exome capture data

Please note that we have not collected the indel calls for the paper, as these are only used for filtering SNPs near indels. If you want to call accurate indels, please use the new GATK indel caller in the Unified Genotyper.

Warnings

Both the GATK and the sequencing technologies have improved significantly since the analyses performed in this paper.

  • If you are conducting a review today, we recommend using the newest version of the GATK, which performs much better than the version described in the paper. We would also recommend using the newest version of Crossbow, in case it has improved. The GATK calls for NA12878 from the paper (above) will give you a good idea of what a good whole-genome or whole-exome call set looks like.

  • The data sets used in the paper are no longer state-of-the-art. The WEx BAM is GAII data aligned with MAQ on hg18, but a state-of-the-art data set would use HiSeq and BWA on hg19. Even the 64x HiSeq WG data set is already more than one year old. For a better assessment, we would recommend you use a newer data set for these samples, if you have the capacity to generate it. This applies less to the WG NA12878 data, which is pretty good, but the NA12878 WEx from the paper is nearly 2 years old now and notably worse than our most recent data sets.

Obviously, this was an annoyance for us as well, as it would have been nice to use a state-of-the-art data set for the WEx. But we decided to freeze the data used for analysis to actually finish this paper.

How do I get the raw FASTQ file from a BAM?

If you want the raw machine output for the data analyzed in the GATK framework paper, obtain the raw BAM files above and convert them to FASTQ using the Picard tool SamToFastq.
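
For example, using Picard's command-line syntax (file names here are illustrative; substitute the BAM you downloaded and your desired output names):

java -jar SamToFastq.jar INPUT=NA12878.hiseq.wgs.bwa.raw.bam FASTQ=NA12878_1.fastq SECOND_END_FASTQ=NA12878_2.fastq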

Comments (12)

Objective

Test that Queue is correctly installed, and that the supporting tools like Java are in your path.

Prerequisites

  • Basic familiarity with the command-line environment
  • Understand what a PATH variable is
  • GATK installed
  • Queue downloaded and placed on path

Steps

  1. Invoke the Queue usage/help message
  2. Troubleshooting

1. Invoke the Queue usage/help message

The command we're going to run is a very simple command that asks Queue to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your Queue package is installed correctly.

Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.

Action

Type the following command:

java -jar <path to Queue.jar> --help

replacing the <path to Queue.jar> bit with the path you have set up in your command-line environment.

Expected Result

You should see usage output similar to the following:

usage: java -jar Queue.jar -S <script> [-jobPrefix <job_name_prefix>] [-jobQueue <job_queue>] [-jobProject <job_project>]
       [-jobSGDir <job_scatter_gather_directory>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
       <temp_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>] [-emailTLS] [-emailSSL] [-emailUser
       <emailUsername>] [-emailPass <emailPassword>] [-emailPassFile <emailPasswordFile>] [-bsub] [-run] [-dot <dot_graph>]
       [-expandedDot <expanded_dot_graph>] [-startFromScratch] [-status] [-statusFrom <status_email_from>] [-statusTo
       <status_email_to>] [-keepIntermediates] [-retry <retry_failed>] [-l <logging_level>] [-log <log_to_file>] [-quiet]
       [-debug] [-h]

 -S,--script <script>                                                      QScript scala file
 -jobPrefix,--job_name_prefix <job_name_prefix>                            Default name prefix for compute farm jobs.
 -jobQueue,--job_queue <job_queue>                                         Default queue for compute farm jobs.
 -jobProject,--job_project <job_project>                                   Default project for compute farm jobs.
 -jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>   Default directory to place scatter gather
                                                                           output for compute farm jobs.
 -memLimit,--default_memory_limit <default_memory_limit>                   Default memory limit for jobs, in gigabytes.
 -runDir,--run_directory <run_directory>                                   Root directory to run functions from.
 -tempDir,--temp_directory <temp_directory>                                Temp directory to pass to functions.
 -emailHost,--emailSmtpHost <emailSmtpHost>                                Email SMTP host. Defaults to localhost.
 -emailPort,--emailSmtpPort <emailSmtpPort>                                Email SMTP port. Defaults to 465 for ssl,
                                                                           otherwise 25.
 -emailTLS,--emailUseTLS                                                   Email should use TLS. Defaults to false.
 -emailSSL,--emailUseSSL                                                   Email should use SSL. Defaults to false.
 -emailUser,--emailUsername <emailUsername>                                Email SMTP username. Defaults to none.
 -emailPass,--emailPassword <emailPassword>                                Email SMTP password. Defaults to none. Not
                                                                           secure! See emailPassFile.
 -emailPassFile,--emailPasswordFile <emailPasswordFile>                    Email SMTP password file. Defaults to none.
 -bsub,--bsub_all_jobs                                                     Use bsub to submit jobs
 -run,--run_scripts                                                        Run QScripts.  Without this flag set only
                                                                           performs a dry run.
 -dot,--dot_graph <dot_graph>                                              Outputs the queue graph to a .dot file.  See:
                                                                           http://en.wikipedia.org/wiki/DOT_language
 -expandedDot,--expanded_dot_graph <expanded_dot_graph>                    Outputs the queue graph of scatter gather to
                                                                           a .dot file.  Otherwise overwrites the
                                                                           dot_graph
 -startFromScratch,--start_from_scratch                                    Runs all command line functions even if the
                                                                           outputs were previously output successfully.
 -status,--status                                                          Get status of jobs for the qscript
 -statusFrom,--status_email_from <status_email_from>                       Email address to send emails from upon
                                                                           completion or on error.
 -statusTo,--status_email_to <status_email_to>                             Email address to send emails to upon
                                                                           completion or on error.
 -keepIntermediates,--keep_intermediate_outputs                            After a successful run keep the outputs of
                                                                           any Function marked as intermediate.
 -retry,--retry_failed <retry_failed>                                      Retry the specified number of times after a
                                                                           command fails.  Defaults to no retries.
 -l,--logging_level <logging_level>                                        Set the minimum level of logging, i.e.
                                                                           setting INFO get's you INFO up to FATAL,
                                                                           setting ERROR gets you ERROR and FATAL level
                                                                           logging.
 -log,--log_to_file <log_to_file>                                          Set the logging location
 -quiet,--quiet_output_mode                                                Set the logging to quiet mode, no output to
                                                                           stdout
 -debug,--debug_mode                                                       Set the logging file string to include a lot
                                                                           of debugging information (SLOW!)
 -h,--help                                                                 Generate this help message

If you see this message, your Queue installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.


2. Troubleshooting

Let's try to figure out what's not working.

Action

First, make sure that your Java version is at least 1.6, by typing the following command:

java -version

Expected Result

You should see something similar to the following text:

java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)  

Remedial actions

If the version is less than 1.6, install the newest version of Java onto the system. If you instead see something like

java: Command not found  

make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.

On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.

Comments (0)

This section contains articles related to developing for the GATK. Topics covered include how to write new walkers and Queue scripts, as well as some deeper GATK engine information that is relevant for developers.

Comments (0)

The GATK is an open source project that has greatly benefited from the contributions of outside users. The GATK team welcomes contributions from anyone who produces useful functionality in line with the goals of the toolkit. You are welcome to branch the GATK main repository and develop your own tools. Sometimes these tools may be useful to the GATK user community and you may want to make it part of the main GATK distribution. If so we ask you to follow our guidelines for submission of patches.

1. Good practices

There are a few good Git practices that you should follow to simplify the ultimate goal, which is adding your changes to the main GATK repository.

  • Use branches.
    Every time you start new work that you are going to submit to the GATK team later, do it in a new branch. Make this a habit, as it will simplify many of the following procedures and allow your master branch to always be a fresh (up-to-date) copy of the main GATK repository. Take a look at how to create a new branch for submission (section 2 below).
  • Never merge.
    Merging creates a branched history with multiple parent nodes, which makes the history hard to understand, impossible to modify, and patches near-impossible to create. Merges are very useful when you need to combine multiple repositories, and should only be used when that makes sense. This means never merge and never pull (unless it is a fast-forward, or you will create a merge).
  • Commit as often as possible.
    Every change should be committed to make sure you can go back in time effectively in your own tree. The commit messages don't matter to us at this stage, as long as they're meaningful to you. You can essentially do whatever you want in your local tree with your commits, as long as you don't merge.
  • Rebase constantly.
    Your branch diverges from master by the minute, so if you rebase as often as you can, you will avoid major conflicts when it's time to send the patches. Take a look at our guide on how to rebase (section 3 below).
  • Tell a meaningful story.
    When it's time to submit your patches to us, reorder your commits and write meaningful commit messages. Each commit should be as self-contained as possible, and together the commits must tell a meaningful story so we can understand what you're adding to the codebase. Take a look at the example commit scenario (section 4 below).
  • Generate patches and email them to the group.
    This part is easy, provided you've followed the good practices above. You just have to generate the patches (section 5 below) and e-mail them to gsa-patches@broadinstitute.org.

2. How to create a new submission

You should always start your code by creating a new branch from the most recent version of the main repository with:

git checkout master                       (make sure you are in the master branch)
git fetch && git rebase origin/master     (you can substitute this line for "git pull" if you have no changes in the master branch) 
git checkout -b newtool                   (create a new branch for your new tool)

Note: If you have submitted a patch to the group, do not continue development on the same branch as we cannot guarantee that your changes will make it to the main repository unchanged.

3. How to rebase

Every time before you rebase, you have to update your copy of the main repository. To do this use:

git fetch

If you are just trying to keep up with the changes in the main repository after a fetch, you can rebase your branch at any time using (and this should be all you need to do):

git rebase origin/master

In case there are conflicts, resolve them as you normally would, then do:

git rebase --continue

If you don't know how to resolve the conflicts, you can always safely abort the whole process and go back to your branch before you started rebasing:

git rebase --abort

If you are done and want to generate your patches conforming to the latest repository changes, use the following to edit, squash and reorder your commits:

git rebase -i origin/master

At the prompt, you can follow the instructions to squash, edit and reorder accordingly. You can also do this step from IntelliJ with a visual editor that allows you to select what to edit/squash/reorder. You can also take a look at this nice tutorial on how to use interactive rebase.

4. How to make your commits

It is okay to have a list of commits (numbered) somewhat like this in your local tree:

  1. added function X
  2. fixed a b and c on X
  3. b was actually d
  4. started creating feature Y but had to go to the bathroom
  5. added Y
  6. found bug in X, fixed with e
  7. added Z
  8. fixed bug in Z with f

Before you can send your tools to us, you have to organize these commits so they tell a meaningful history and are self-contained. To achieve this you will need to rebase so you can squash, edit and reorder your commits. This tree makes a lot of sense for your development process, but it makes no sense in the main repository history, as it becomes hard to pick/revert commits and understand the history at a glance. After rebasing, you should edit your commits to look like this:

  1. added X (including commits 2, 3 and 6)
  2. added Y (including commits 4 and 5)
  3. added Z (including commits 7 and 8)

Use your commit messages wisely to help us process your patches quickly. Make sure the first line of each commit message (the title) has fewer than 50 characters. Add a blank line and write a paragraph or more explaining what the commit represents (now that it is a package of multiple commits). The 50-character title is important because it is all we see when we look at an extended history to find bugs, and it is also our quick reminder of what the commit does to the repository.
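
For example, a commit message following this convention might look like this (the content is purely illustrative):

Add walker X for counting widgets

This patch adds the new walker X, which counts widgets across the
genome. It includes the walker itself, its unit tests, and the
integration tests with their expected MD5s.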

A patch should be self-contained, meaning that if we decide to adopt features X and Z but not Y, we should be able to do so by applying only patches 1 and 3. If your patches are co-dependent, you should say so in the commits and justify why you didn't squash them together into a single patch.

5. How to generate the patches

To generate patches, use:

git format-patch <since>

The <since> parameter is the commit after which you want to generate patches; for example, HEAD~3 will generate patches for HEAD~2, HEAD~1 and HEAD. You can also specify the commit by its id or by using the head of a branch. This is where using branches will make your life easier. If master is always up to date with the main repo with no changes, you can do:

git format-patch master   (provided your master is up to date) 

This will generate a patch for each commit you've created and you can simply e-mail them as an attachment to us.

Comments (0)

1. What it is and how it helps us improve the GATK

Since September, 2010, the GATK has had a "phone-home" feature that sends us information about each GATK run via the Broad filesystem (within the Broad) and Amazon's S3 cloud storage service (outside the Broad). This feature is enabled by default.

The information provided by the phone-home feature is critical in driving improvements to the GATK:

  • By recording detailed information about each error that occurs, it enables GATK developers to identify and fix previously-unknown bugs in the GATK. We are constantly monitoring the errors our users encounter and do our best to fix those errors that are caused by bugs in our code.
  • It allows us to better understand how the GATK is used in practice and adjust our documentation and development goals for common use cases.
  • It gives us a picture of which versions of the GATK are in use over time, and how successful we've been at encouraging users to migrate from obsolete or broken versions of the GATK to newer, improved versions.
  • It tells us which tools are most commonly used, allowing us to monitor the adoption of newly-released tools and abandonment of outdated tools.
  • It provides us with a sense of the overall size of our user base and the major organizations/institutions using the GATK.

2. What information is sent to us

Below are two example GATK Run Reports showing exactly what information is sent to us each time the GATK phones home.

A successful run:

<GATK-run-report>
    <id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
    <start-time>2012/03/10 20.21.19</start-time>
    <end-time>2012/03/10 20.21.19</end-time>
    <run-time>0</run-time>
    <walker-name>CountReads</walker-name>
    <svn-version>1.4-483-g63ecdb2</svn-version>
    <total-memory>85000192</total-memory>
    <max-memory>129957888</max-memory>
    <user-name>depristo</user-name>
    <host-name>10.0.1.10</host-name>
    <java>Apple Inc.-1.6.0_26</java>
    <machine>Mac OS X-x86_64</machine>
    <iterations>105</iterations>
</GATK-run-report>

A run where an exception has occurred:

<GATK-run-report>
   <id>yX3AnltsqIlXH9kAQqTWHQUd8CQ5bikz</id>   
   <exception>
      <message>Failed to parse Genome Location string: 20:10,000,000-10,000,001x</message>
      <stacktrace class="java.util.ArrayList"> 
         <string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:377)</string>
         <string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
         <string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
         <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
         <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
         <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
         <string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
         <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
         <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
         <string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
      </stacktrace>
      <cause>
         <message>Position: &apos;10,000,001x&apos; contains invalid chars.</message>
         <stacktrace class="java.util.ArrayList">
            <string>org.broadinstitute.sting.utils.GenomeLocParser.parsePosition(GenomeLocParser.java:411)</string>
            <string>org.broadinstitute.sting.utils.GenomeLocParser.parseGenomeLoc(GenomeLocParser.java:374)</string>
            <string>org.broadinstitute.sting.utils.interval.IntervalUtils.parseIntervalArguments(IntervalUtils.java:82)</string>
            <string>org.broadinstitute.sting.commandline.IntervalBinding.getIntervals(IntervalBinding.java:106)</string>
            <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.loadIntervals(GenomeAnalysisEngine.java:618)</string>
            <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeIntervals(GenomeAnalysisEngine.java:585)</string>
            <string>org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:231)</string>
            <string>org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:128)</string>
            <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)</string>
            <string>org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)</string>
            <string>org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:92)</string>
         </stacktrace>
         <is-user-exception>false</is-user-exception>
      </cause>
      <is-user-exception>true</is-user-exception>
   </exception>
   <start-time>2012/03/10 20.19.52</start-time>
   <end-time>2012/03/10 20.19.52</end-time>
   <run-time>0</run-time>
   <walker-name>CountReads</walker-name>
   <svn-version>1.4-483-g63ecdb2</svn-version>
   <total-memory>85000192</total-memory>
   <max-memory>129957888</max-memory>
   <user-name>depristo</user-name>
   <host-name>10.0.1.10</host-name>
   <java>Apple Inc.-1.6.0_26</java>
   <machine>Mac OS X-x86_64</machine>
   <iterations>0</iterations>
</GATK-run-report>

Note that as of GATK 1.5 we no longer collect information about the command-line executed, the working directory, or tmp directory.

3. Disabling Phone Home

The GATK is currently in the process of evolving to require interaction with Amazon S3 as a normal part of each run. For this reason, and because the information contained in the GATK run reports is so critical in driving improvements to the GATK, we strongly discourage our users from disabling the phone-home feature.

At the same time, we recognize that some of our users do have legitimate reasons for needing to run the GATK with phone-home disabled, and we don't wish to make it impossible for these users to run the GATK.

Examples of legitimate reasons for disabling Phone Home

  • Technical reasons: Your local network might have restrictions in place that don't allow the GATK to access external resources, or you might need to run the GATK in a network-less environment.

  • Organizational reasons: Your organization's policies might forbid the dissemination of one or more pieces of information contained in the GATK run report.

For such users we have provided an -et NO_ET option in the GATK to disable the phone-home feature. To use this option in GATK 1.5 and later, you need to contact us to request a key. Instructions for doing so are below.

How to obtain and use a GATK key

To obtain a GATK key, please fill out the request form.

Running the GATK with a key is simple: you just need to append a -K your.key argument to your customary command line, where your.key is the path to the key file you obtained from us:

java -jar dist/GenomeAnalysisTK.jar \
    -T PrintReads \
    -I public/testdata/exampleBAM.bam \
    -R public/testdata/exampleFASTA.fasta \
    -et NO_ET \
    -K your.key

The -K argument is only necessary when running the GATK with the NO_ET option.

Troubleshooting key-related problems

  • Corrupt/Unreadable/Revoked Keys

If you get an error message from the GATK saying that your key is corrupt, unreadable, or has been revoked, please email gsahelp@broadinstitute.org to ask for a replacement key.

  • GATK Public Key Not Found

If you get an error message stating that the GATK public key could not be located or read, then something is likely wrong with your build of the GATK. If you're running the binary release, try downloading it again. If you're compiling from source, try doing an ant clean and re-compiling. If all else fails, please ask for help on our community forum.

What does GSA use Phone Home data for?

We use the phone home data for three main purposes. First, we monitor the input logs for errors that occur in the GATK, and proactively fix them in the codebase. Second, we monitor the usage rates of the GATK in general and specific versions of the GATK to explain how widely used the GATK is to funding agencies and other potential supporters. Finally, we monitor adoption rates of specific GATK tools to understand how quickly new tools reach our users. Many of these analyses require us to aggregate the data by unique user, which is why we still collect the username of the individual who ran the GATK (as you can see in the plots). Examples of all three uses are shown in the Tableau graphs below, which update each night and are sent to the GATK members each morning for review.

Comments (12)

We make various files available for public download from the GSA FTP server, such as the GATK resource bundle and presentation slides. We also maintain a public upload feature for processing bug reports from users.

There are two logins to choose from depending on whether you want to upload or download something:

Downloading

location: ftp.broadinstitute.org
username: gsapubftp-anonymous
password: <blank>

Uploading

location: ftp.broadinstitute.org
username: gsapubftp
password: 5WvQWSfi

Using a browser as FTP client

If you use your browser as an FTP client, make sure to include the login information in the address, otherwise you will access the general Broad Institute FTP instead of our team FTP. This should work as a direct link (for downloading only):

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle

Comments (1)

NOTICE:

This tutorial is slightly out of date, so the output is a little different. We'll update this soon, but in the meantime, don't freak out if you get a result that reads something like

INFO 18:32:38,826 CountReads - CountReads counted 33 reads in the traversal 

instead of

INFO  16:17:46,061 Walker - [REDUCE RESULT] Traversal result is: 33 

You're doing the right thing and getting the right result.

And of course, if in doubt, just post a comment on this article; we're here to answer your questions.


Objective

Run a basic analysis command on example data.

Prerequisites

Steps

  1. Invoke the GATK CountReads command
  2. Further exercises

1. Invoke the GATK CountReads command

A very simple analysis that you can do with the GATK is getting a count of the reads in a BAM file. The GATK is capable of much more powerful analyses, but this is a good starting example because there are very few things that can go wrong.

So we are going to count the reads in the file exampleBAM.bam, which you can find in the GATK resource bundle along with its associated index (same file name with .bai extension), as well as the example reference exampleFASTA.fasta and its associated index (same file name with .fai extension) and dictionary (same file name with .dict extension). Copy them to your working directory so that your directory contents look like this:

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x  9 vdauwera  CHARLES\Domain Users     306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera  CHARLES\Domain Users     204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users    3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users  101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users      20 Apr 10 07:39 exampleFASTA.fasta.fai

Action

Type the following command:

java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam 

where -T CountReads specifies which analysis tool we want to use, -R exampleFASTA.fasta specifies the reference sequence, and -I exampleBAM.bam specifies the file of aligned reads we want to analyze.

For any analysis that you want to run on a set of aligned reads, you will always need to use at least these three arguments:

  • -T for the tool name, which specifies the corresponding analysis
  • -R for the reference sequence file
  • -I for the input BAM file of aligned reads

They don't have to be in that order in your command, but this way you can remember that you need them if you TRI...

Expected Result

After a few seconds you should see output similar to this:

INFO  16:17:45,945 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:17:45,946 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41 
INFO  16:17:45,947 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:17:45,947 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:17:45,947 HelpFormatter - Program Args: -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam 
INFO  16:17:45,947 HelpFormatter - Date/Time: 2012/07/25 16:17:45 
INFO  16:17:45,947 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:17:45,948 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:17:45,950 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:17:45,982 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:17:45,993 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  16:17:46,060 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  16:17:46,060 TraversalEngine -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining 
INFO  16:17:46,061 Walker - [REDUCE RESULT] Traversal result is: 33 
INFO  16:17:46,061 TraversalEngine - Total runtime 0.00 secs, 0.00 min, 0.00 hours 
INFO  16:17:46,100 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) 
INFO  16:17:46,729 GATKRunReport - Uploaded run statistics report to AWS S3 

Depending on the GATK release, you may see slightly different information output, but you know everything is running correctly if you see the line:

INFO  21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33 

somewhere in your output.

If you don't see this, check your spelling (GATK commands are case-sensitive), check that the files are in your working directory, and if necessary, re-check that the GATK is properly installed.

If you do see this output, congratulations! You just successfully ran your first GATK analysis!

Basically the output you see means that the CountReadsWalker (which you invoked with the command line option -T CountReads) counted 33 reads in the exampleBAM.bam file, which is exactly what we expect to see.

Wait, what is this walker thing?

In the GATK jargon, we call the tools walkers because the way they work is that they walk through the dataset --either along the reference sequence (LocusWalkers), or down the list of reads in the BAM file (ReadWalkers)-- collecting the requested information along the way.


2. Further Exercises

Now that you're rocking the read counts, you can start to expand your use of the GATK command line.

Let's say you don't care about counting reads anymore; now you want to know the number of loci (positions on the genome) that are covered by one or more reads. The name of the tool, or walker, that does this is CountLoci. Since the structure of the GATK command is basically always the same, you can simply switch the tool name, right?

Action

Instead of this command, which we used earlier:

java -jar <path to GenomeAnalysisTK.jar> -T CountReads -R exampleFASTA.fasta -I exampleBAM.bam 

this time you type this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam 

See the difference?

Result

You should see something like this output:

INFO  16:18:26,183 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:18:26,185 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41 
INFO  16:18:26,185 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:18:26,185 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:18:26,186 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam 
INFO  16:18:26,186 HelpFormatter - Date/Time: 2012/07/25 16:18:26 
INFO  16:18:26,186 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:18:26,186 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:18:26,189 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:18:26,222 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:18:26,233 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  16:18:26,351 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  16:18:26,351 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
2052
INFO  16:18:26,411 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours 
INFO  16:18:26,450 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) 
INFO  16:18:27,124 GATKRunReport - Uploaded run statistics report to AWS S3 

Great! But wait -- where's the result? Last time the result was given on this line:

INFO  21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33 

But this time there is no line that says [REDUCE RESULT]! Is something wrong?

Not really. The program ran just fine -- but we forgot to give it an output file name. You see, the CountLoci walker is set up to output the result of its calculations to a text file, unlike CountReads, which is perfectly happy to output its result to the terminal screen.

Action

So we repeat the command, but this time we specify an output file, like this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt

where -o (lowercase o, not zero) is used to specify the output.

Result

You should get essentially the same output on the terminal screen as previously (but notice the difference in the line that contains Program Args -- the new argument is included):

INFO  16:29:15,451 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:29:15,453 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.0-22-g40f97eb, Compiled 2012/07/25 15:29:41 
INFO  16:29:15,453 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  16:29:15,453 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  16:29:15,453 HelpFormatter - Program Args: -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam -o output.txt 
INFO  16:29:15,454 HelpFormatter - Date/Time: 2012/07/25 16:29:15 
INFO  16:29:15,454 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:29:15,454 HelpFormatter - --------------------------------------------------------------------------------- 
INFO  16:29:15,457 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:29:15,488 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:29:15,499 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  16:29:15,618 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  16:29:15,618 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  16:29:15,679 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours 
INFO  16:29:15,718 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) 
INFO  16:29:16,712 GATKRunReport - Uploaded run statistics report to AWS S3 

This time, however, if we look inside the working directory, there is a newly created file called output.txt.

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la
drwxr-xr-x  9 vdauwera  CHARLES\Domain Users     306 Jul 25 16:29 .
drwxr-xr-x@ 6 vdauwera  CHARLES\Domain Users     204 Jul 25 15:31 ..
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users    3635 Apr 10 07:39 exampleBAM.bam
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     232 Apr 10 07:39 exampleBAM.bam.bai
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users     148 Apr 10 07:39 exampleFASTA.dict
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users  101673 Apr 10 07:39 exampleFASTA.fasta
-rw-r--r--@ 1 vdauwera  CHARLES\Domain Users      20 Apr 10 07:39 exampleFASTA.fasta.fai
-rw-r--r--  1 vdauwera  CHARLES\Domain Users       5 Jul 25 16:29 output.txt

This file contains the result of the analysis:

[bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% cat output.txt 
2052

This means that there are 2052 loci in the reference sequence that are covered by at least one read in the BAM file.

Discussion

Okay then, but why not show the full, correct command in the first place? Because this was a good opportunity for you to learn a few of the caveats of the GATK command system, which may save you a lot of frustration later on.

Beyond the common basic arguments that almost all GATK walkers require, most of them also have specific requirements or options that are important to how they work. Before starting an analysis, you should always check which arguments are required, recommended and/or optional for the walker you want to use.

Fortunately the GATK is set up to complain (i.e. terminate with an error message) if you try to run it without specifying a required argument. For example, if you try to run this:

java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta

the GATK will spit out a wall of text, including the basic usage guide that you can invoke with the --help option, and more importantly, the following error message:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.0-22-g40f97eb): 
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Walker requires reads but none were provided.
##### ERROR ------------------------------------------------------------------------------------------

You see the line that says ERROR MESSAGE: Walker requires reads but none were provided? This tells you exactly what was wrong with your command.

So the GATK will not run if a walker does not have all the required inputs. That's a good thing! But in the case of our first attempt at running CountLoci, the -o argument is not required for the GATK to run -- it's just highly desirable if you actually want the result of the analysis!

There will be many other cases of walkers with arguments that are not strictly required, but highly desirable if you want the results to be meaningful.

So, at the risk of getting repetitive, always read the documentation of each walker that you want to use!

Comments (25)

Please note that the DataProcessingPipeline qscript is no longer available.

The DPP script was only provided as an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have made the difficult decision to remove the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how Qscripts work, and read the Best Practices documentation to understand what the processing steps are and which parameters you need to set or adjust.

Data Processing Pipeline

The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK.

Introduction

Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking, and base quality score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis-ready BAM files (one per sample sequenced).

Requirements

This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.

Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible, however, with some effort, to modify it for your needs.

Command-line arguments

Required Parameters

Argument (short-name) Argument (long-name) Description
-i <BAM file / BAM list> --input <BAM file / BAM list> input BAM file - or list of BAM files.
-R <fasta> --reference <fasta> Reference fasta file.
-D <vcf> --dbsnp <dbsnp vcf> dbsnp ROD to use (must be in VCF format).

Optional Parameters

Argument (short-name) Argument (long-name) Description
-indels <vcf> --extra_indels <vcf> VCF files to use as reference indels for Indel Realignment.
-bwa <path> --path_to_bwa <path> The path to the bwa binary (usually BAM files have already been mapped, but if you want to remap them, this is the option).
-outputDir <path> --output_directory <path> Output path for the processed BAM files.
-L <GATK interval string> --gatk_interval_string <GATK interval string> the -L interval string to be used by GATK - output bams at interval only
-intervals <GATK interval file> --gatk_interval_file <GATK interval file> an intervals file to be used by GATK - output bams at intervals

Modes of Operation (also optional parameters)

Argument (short-name) Argument (long-name) Description
-p <name> --project <name> The project name determines the final output (BAM file) base name. Example: NA12878 yields NA12878.processed.bam
-knowns --knowns_only Perform cleaning on knowns only.
-sw --use_smith_waterman Perform cleaning using Smith-Waterman realignment.
-bwase --use_bwa_single_ended Decompose the input BAM file and fully realign it using BWA, assuming single-ended reads.
-bwape --use_bwa_pair_ended Decompose the input BAM file and fully realign it using BWA, assuming pair-ended reads.

The Pipeline

Data processing pipeline of the best practices for raw data processing, from sequencer data (fastq files) to analysis-ready reads (BAM file):

[Figure: the data processing pipeline flowchart]

Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:

BWA alignment

This option is for datasets that have already been processed using a different pipeline or different criteria, and that you want to reprocess using this pipeline. One example is a BAM file that has been processed at the lane level, or that skipped some of the best-practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before sample-level BAMs are created and enter the pipeline.

Sample Level Processing

This is where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration.

Indel Realignment

This is a two-step process. First we create targets using the Realigner Target Creator (either for known indels only, or also including indels seen in the data), then we realign the targets using the Indel Realigner (see [Local realignment around indels]), with optional Smith-Waterman realignment. The Indel Realigner also fixes mate-pair information for reads that get realigned.
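Outside of Queue, these two steps correspond roughly to the following GATK invocations (file names here are placeholders, and the exact arguments depend on your GATK version and known-indels resources):

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R reference.fasta -I sample.bam -known knownIndels.vcf -o sample.intervals

java -jar GenomeAnalysisTK.jar -T IndelRealigner \
    -R reference.fasta -I sample.bam -targetIntervals sample.intervals -known knownIndels.vcf -o sample.realigned.bam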

Base Quality Score Recalibration

This is a crucial step that re-adjusts the quality scores using statistics based on several different covariates. In this pipeline we use four: Read Group Covariate, Quality Score Covariate, Cycle Covariate, and Dinucleotide Covariate.

The Outputs

The Data Processing Pipeline produces three types of output for each file: a fully processed BAM file, a validation report on the input and output BAM files, and an analysis of base qualities before and after base quality score recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output.

Processed Bam File

The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.

Validation Files

We validate each unprocessed sample-level BAM file and each final processed sample-level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't enforce that the input BAM pass all validation, but we provide the log of the validation as an informative companion to your input. The validation files are named <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.

Notice that even if your BAM file fails validation, the pipeline can still go through successfully. The validation is simply a strict report on the state of your BAM file. Some errors are not critical, but the output files (both pre.validation and post.validation) should give you some guidance on how to better organize your dataset in the BAM format.

Base Quality Score Recalibration Analysis

PDF plots of the base qualities are generated before and after recalibration for further analysis of the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named <project name>.<sample name>.pre and <project name>.<sample name>.post.

Examples

  1. Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa)

    java \
      -Xmx4g \
      -Djava.io.tmpdir=/path/to/tmpdir \
      -jar path/to/GATK/Queue.jar \
      -S path/to/DataProcessingPipeline.scala \
      -p myFancyProjectName \
      -i myDataSet.list \
      -R reference.fasta \
      -D dbSNP.vcf \
      -run

  2. Performing realignment and the full data processing pipeline on one pair-ended BAM file

    java \
      -Xmx4g \
      -Djava.io.tmpdir=/path/to/tmpdir \
      -jar path/to/Queue.jar \
      -S path/to/DataProcessingPipeline.scala \
      -bwa path/to/bwa \
      -i test.bam \
      -R reference.fasta \
      -D dbSNP.vcf \
      -p myProjectWithRealignment \
      -bwape \
      -run

Comments (0)

By default, the forum does not send notification messages about new comments or discussions. If you want to turn on notifications or customize the type of notifications you want to receive (email, popup message etc), you need to do the following:

  • Go to your profile page by clicking on your user name (in blue box, top left corner);
  • Click on "Edit Profile" (button with silhouette of person, top right corner);
  • In the menu on the left, click on "Notification Preferences";
  • Select the categories that you want to follow and the type of notification you want to receive.
  • Be sure to click on Save Preferences.

To specifically get new GATK announcements, scroll down to "Category Notifications" and tick off the "Announcements" category for email notification for discussions (and comments if you really want to know everything).

Comments (0)

If you're a developer who uses the GATK framework, you may have noticed that the 3.2 release broke existing third-party walkers. This happened because we had to do some codebase-wide refactoring/renaming that changed package names and classpaths (most obviously, renaming all instances of "sting" to "gatk"). We're sorry for the inconvenience, but this was necessary as a preparation step on the road to providing GATK as a maven artifact.

We have posted a document that details the changes and explains how to update third-party code to work with GATK framework versions 3.2 onward here.
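As a rough illustration of the kind of change involved (the linked document has the authoritative mapping, so treat these package names as examples rather than an exact recipe), an import in a third-party walker changes along these lines:

// Before (GATK 3.1.x and earlier)
import org.broadinstitute.sting.gatk.walkers.RodWalker;

// After (GATK 3.2 onward); check the migration document for the exact new package of each class you use
import org.broadinstitute.gatk.engine.walkers.RodWalker;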

While we're on this topic, it seems that the IntelliJ set up instructions are a little outdated (in part due to the renaming), so we'll be updating that as well in the near future.

Comments (0)

We're looking for a developer (software engineer or equivalent) to join our team on a part-time basis as a software engineering consultant.

The mission is to take on tasks that are non-critical but still important, such as fixing bugs and implementing minor feature requests, usability improvements and so on in the GATK codebase. The idea is to take much of the maintenance burden off of the core development team so they can focus on developing new tools and methods.

We already have one person employed in this capacity, and it's working out very well. However there is quite a bit more to do than he can find time for, and therefore we'd like to hire a second consultant to pick up the extra work in parallel. (We were hoping to just clone him but ran into some difficulties getting IRB approval.)

This position does not require expert knowledge of GATK, but familiarity with the GATK tools is a big plus. The main language is Java, with a small side of R and Scala. We also have a growing corpus of C++ code that is not yet in the scope of this position, but may move into scope some months down the line.

Note that this opportunity is amenable to remote work so you don't need to be local to the Boston area, but it is limited to US residents with valid work authorization (no H1B sponsorship possible, sorry).

Drop us a line in the comments below or private-message me (@Geraldine_VdAuwera) to discuss details.

Comments (4)

Let's say I have a case class in a qscript like so:

case class getFileFromRemote(remoteFile: File, localFile: File) extends ExternalCommonArgs with SixteenCoreJob with FourDayJob {
    @Input(doc = "remote") val _remoteFile = remoteFile
    @Output(doc = "local") val _localFile = localFile

    def commandLine = "fetchFile.sh " + remoteFile + " " + localFile

    this.isIntermediate = true
    this.jobName = "fetchFile"
    this.analysisName = this.jobName
}

Then I can add it to my script() like so:

add( getFileFromRemote("user@server:/path/to/remote", "path/to/local") )

All is well so far.

Let's say that I have lots of files to fetch, so I add jobs in a for loop over a Seq of file names. I then add downstream jobs as usual. The problem I run into is that all 1000+ fetchFile.sh sessions (which use irods/irsync behind the scenes) start at the same time, choking the system so that nothing gets downloaded.

One solution to my problem would be to be able to set a limit in my fetcher case class, to tell Queue to never submit more than 5 (or so) of these jobs to the cluster. Is that possible, or can anyone see another way around this?

Comments (4)

Hi fellow htsjdk/picard/gatk developers!

I've been thinking about this for quite some time now, so I thought I should write up a quick post about it here.

I've been writing custom tools for our group using both picard and GATK for some time now. It's been working nicely, but I have been missing a set of basic tutorials and examples, for users to quickly get started writing walkers. My most commonly used reference has been the 20-line life savers (http://www.slideshare.net/danbolser/20line-lifesavers-coding-simple-solutions-in-the-gatk) which is getting a bit dated.

What I would like to see is something like the following:

  • What's in htsjdk? What's not in htsjdk? (from a dev's perspective - in terms of frameworks)
  • What's in picard? What's not in picard? (from a dev's perspective - in terms of frameworks)
  • What's in gatk? What's not in gatk? (from a dev's perspective - in terms of frameworks)
  • When to use htsjdk, picard and GATK. What are the strengths and weaknesses of the three. (possibly more that I've missed)
  • Your first htsjdk walker
  • Your first picard walker
  • Your first gatk walker
  • Traversing a BAM in htsjdk vs gatk - what are the differences

There might be more stuff that could go in here as well. The driving force behind this is that I'm myself a bit confused by the overlap of these three packages/frameworks. I do understand that picard uses htsjdk, and that GATK uses both as dependencies, but it's not super clear what extra functionality (for a developer) is added from htsjdk -> picard -> gatk.
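To make that last bullet a bit more concrete, here is a minimal sketch of what a plain htsjdk traversal looks like in recent htsjdk versions (the BAM path is a placeholder); the interesting part of the comparison would be spelling out what a GATK walker adds on top of this:

import java.io.File;

import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;

public class CountReadsHtsjdk {
    public static void main(String[] args) throws Exception {
        long count = 0;
        // SamReader is Closeable, so try-with-resources releases the file handle.
        try (SamReader reader = SamReaderFactory.makeDefault().open(new File("example.bam"))) {
            for (SAMRecord record : reader) {
                count++;    // raw records only: no filtering, sharding or parallelism from the framework
            }
        }
        System.out.println(count + " reads");
    }
}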

Could we assemble a small group of interested developers to contribute to this? We could set up a git repo with the examples and tutorials for easy collaboration and sharing online.

Anyone interested? I'll count myself as the first member :)

Comments (4)

Hi,

I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?

Comments (4)

I believe that I may have found an issue with the CombineVariants tool of GATK that manifests itself when there is a repeated ID in a given VCF. For us, the reason to have repeated IDs in a VCF file is to detect inconsistencies in our sample by calling variants on 2 different DNA samples and then checking the concordance. Our current process is:

1) Generate a VCF containing unique IDs (using GATK CallVariants)
2) Replace the VCF header with potentially non-unique IDs (using tabix -r)
3) Merge a single VCF to uniqify the IDs (using GATK CombineVariants)
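For reference, steps 2 and 3 of that process look roughly like the following (file names are placeholders and exact flags may vary between tabix and GATK versions):

# Step 2: swap in a header whose sample IDs are no longer unique
tabix -r new_header.txt combined.vcf.gz > combined.renamed.vcf.gz

# Step 3: run the renamed VCF through CombineVariants to uniquify the IDs
java -jar GenomeAnalysisTK.jar -T CombineVariants -R reference.fasta \
    --variant combined.renamed.vcf.gz -o combined.renamed.merged.vcf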

It seems that the genotypes in the merged VCF are off by one column. I've attached 3 files that demonstrate the issue: "combined" which is the result of step 1, "combined.renamed", which is the output of step 2, and "combined.renamed.merged", which is the output of step 3.

The relevant lines are as follows:
combined:

HG00421@123910725 HG00422 HG00422@123910706 HG00423@123910701 NA12801 NA12802
0/0:300           0/0:127 0/0:292           0/0:290           0/0:127 0/0:127
0/0:299           0/0:127 0/0:299           0/0:293           0/0:127 0/0:127

combined.renamed:

HG00421 HG00422 HG00422 HG00423 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127 0/0:127

combined.renamed.merged:

HG00421 HG00422 HG00423 NA12801 NA12802
0/0:300 0/0:127 0/0:292 0/0:290 0/0:127
0/0:299 0/0:127 0/0:299 0/0:293 0/0:127

Using the depth argument here, we can see that in the merged dataset, NA12801 has depths 290,293 whereas in the original and renamed datasets the depths were 127,127. The 290,293 depths correspond to HG00423, which is the column before.

I have confirmed this behavior in both GATK 2.7-4 and 2.8-1. If there's any more information that you need, please let me know, and I would be happy to provide it. Also, if you might know where this issue arises, I would be happy to try to provide a patch.

Thanks,

John Wallace

Comments (7)

Is there any way to write a sliding-window RodWalker? I've been trying to look in your documentation, but I have not found a clue.

Comments (4)

Hi,

Recently I've been wanting to perform tasks for each interval in a file using the GATK. Are you guys planning to create an IntervalWalker class? (Or is there a workaround?) DiagnoseTargets in gatk-protected seems to work this way, but not in a very straightforward way - also, since it's protected, I'm not sure how much code, if any, I could borrow from there.

cheers Daniel

Comments (6)

Hi,

I have developed a custom walker that I think could be useful to the community. Therefore, I'd like to distribute it.

I tried following the brief guide "Redistributing the GATK-Lite or distributing walkers", but the build fails:

/Users/dankle/Dropbox/IdeaProjects/gatk/build.xml:955: no resources specified

The command I run is ant clean && ant package -Dexecutable=MyWalker.jar, and the xml-file in packages is

<package name="MyWalker">
  <version file="StingText.properties" property="org.broadinstitute.sting.gatk.version" />
  <executable name="MyWalker">
    <main-class name="org.broadinstitute.sting.gatk.walkers.dk.MyWalker" />
    <resource-bundle file="StingText.properties" />
    <modules>
      <module file="GATKEngine.xml"/>
    </modules>
  </executable>
  <release>
    <executable directory="/humgen/gsa-hpprojects/GATK/bin" symlink="current" />
    <archive directory="/humgen/gsa-hpprojects/GATK/bin" symlink="GenomeAnalysisTK-latest.tar.bz2" />
  </release>
</package>

What am I doing wrong?

Daniel

Comments (1)

So I started to play with the GATK API to annotate my VCF. :-)

My first simple aim is to translate my VCFBigWig to the GATK. Here is my code so far (see below).

I compiled it with:

javac -cp /home/lindenb/package/GenomeAnalysisTK-2.4-3-g2a7af43/GenomeAnalysisTK.jar:/home/lindenb/package/bigwig-read-only/dist/BigWig.jar:/home/lindenb/package/bigwig-read-only/lib/log4j-1.2.15.jar -sourcepath src/main/java \
        ./src/main/java/com/github/lindenb/gatk/tools/vcfbigwig/VCFBigWig.java

(It compiles, but there are a bunch of warnings like "/home/lindenb/package/GenomeAnalysisTK-2.4-3-g2a7af43/GenomeAnalysisTK.jar(org/broadinstitute/sting/gatk/GenomeAnalysisEngine.class): warning: Cannot find annotation method 'value()' in type 'Ensures': class file for com.google.java.contract.Ensures not found".)

And I tried to run my code using various invocations of java, like

java  -cp /home/lindenb/package/GenomeAnalysisTK-2.4-3-g2a7af43/GenomeAnalysisTK.jar:/home/lindenb/package/bigwig-read-only/dist/BigWig.jar:/home/lindenb/package/bigwig-read-only/lib/log4j-1.2.15.jar:. org.broadinstitute.sting.gatk.CommandLineGATK \
            -T VariantAnnotator \
            -A com.github.lindenb.gatk.tools.vcfbigwig.VCFBigWig \
            -R /test.fa test.vcf.gz

but it raises an exception:

java.lang.ExceptionInInitializerError
    at org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine.initializeAnnotations(VariantAnnotatorEngine.java:116)
(...)
Caused by: java.lang.NullPointerException
    at org.broadinstitute.sting.utils.classloader.JVMUtils.isAnonymous(JVMUtils.java:91)
    at org.broadinstitute.sting.utils.classloader.PluginManager.<init>(PluginManager.java:158)
    at org.broadinstitute.sting.utils.classloader.PluginManager.<init>(PluginManager.java:107)
    at org.broadinstitute.sting.gatk.walkers.annotator.interfaces.AnnotationInterfaceManager.<clinit>(AnnotationInterfaceManager.java:34)
    ... 9 more

Questions:

  • What is the proper way to declare my class VCFBigWig, and how can I make it available to the VariantAnnotatorEngine?
  • I used the @Argument annotation; is that the right way to catch an argument from the command line?
  • Can I use the same Annotator in one GATK invocation but with different arguments?
  • There is an "initialize" method; is there a "dispose" method to release the resources?
  • What is the interface "RodRequiringAnnotation"?
  • Is "UserException" the right way to report an error?
  • Should an InfoFieldAnnotation or a Walker be thread-safe (like an HttpServlet)? I mean, can I safely use class members?
  • I saw in VariantAnnotatorEngine that an annotator can change either the ID or the INFO field. How should I design my class in order to alter both fields? For example, my tool VCFTabix will use another VCF file (e.g. the VCF from 1000 Genomes) to update both the ID and the INFO fields.

Thank you for your help,

Pierre


package com.github.lindenb.gatk.tools.vcfbigwig;

import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.broad.igv.bbfile.BBFileReader;
import org.broad.igv.bbfile.BigWigIterator;
import org.broad.igv.bbfile.WigItem;
import org.broadinstitute.sting.commandline.Argument;
import org.broadinstitute.sting.gatk.GenomeAnalysisEngine;
import org.broadinstitute.sting.gatk.contexts.AlignmentContext;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.annotator.interfaces.AnnotatorCompatible;
import org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation;
import org.broadinstitute.sting.gatk.walkers.annotator.interfaces.RodRequiringAnnotation;
import org.broadinstitute.sting.utils.exceptions.UserException;
import org.broadinstitute.sting.utils.genotyper.PerReadAlleleLikelihoodMap;
import org.broadinstitute.variant.variantcontext.VariantContext;
import org.broadinstitute.variant.vcf.VCFHeaderLine;
import org.broadinstitute.variant.vcf.VCFHeaderLineType;
import org.broadinstitute.variant.vcf.VCFInfoHeaderLine;


public class VCFBigWig
extends InfoFieldAnnotation 
implements RodRequiringAnnotation
{
/** PATH to the bigwig file */
@Argument(fullName="bigWigFile",shortName="bw",doc="BigWig to process",required=true)
private String bigWigFile=null;
/** the ID in the INFO field */
@Argument(fullName = "bigwigid", shortName = "bwid", doc="bigwig ID ", required=true)
private String bigWigId=null;
/** BBFileReader to the bigwig file */
private BBFileReader bbFileReader=null;

public VCFBigWig()
    {
    }

@Override
public void initialize(AnnotatorCompatible walker,
            GenomeAnalysisEngine toolkit, Set<VCFHeaderLine> headerLines) {
    super.initialize(walker, toolkit, headerLines);
    /* open the BIGFile, check it is a bigwig */
    try {
        this.bbFileReader=new BBFileReader(bigWigFile);
        }
    catch (IOException e) {
        throw new UserException("Cannot read "+bigWigFile,e);
        }
    if(!this.bbFileReader.isBigWigFile())
        {
        throw new UserException("The following file is not a BigWig File : "+bigWigFile);
        }
    }


@Override
public Map<String, Object> annotate(
        RefMetaDataTracker tracker,
        AnnotatorCompatible walker,
        ReferenceContext ref,
        Map<String, AlignmentContext> stratifiedContexts,
        VariantContext vc,
        Map<String, PerReadAlleleLikelihoodMap> stratifiedPerReadAlleleLikelihoodMap)
    {
    return annotate(tracker, walker,ref,stratifiedContexts,vc);
    }



@Override
public Map<String, Object> annotate(
        RefMetaDataTracker tracker,
        AnnotatorCompatible walker,
        ReferenceContext ref,
        Map<String, AlignmentContext> stratifiedContexts,
        VariantContext vc)
    {
    final boolean contained=true;
    double sum=0;
    int count=0;
    BigWigIterator iter=this.bbFileReader.getBigWigIterator(
            vc.getChr(), vc.getStart()-1,
            vc.getChr(), vc.getEnd(),
            contained
            );

    while(iter.hasNext())
        {
        WigItem item=iter.next();
        float v=item.getWigValue();
        sum+=v;
        count++;
        }

    if(count==0)
        {
        return Collections.emptyMap();
        }
    Map<String,Object> m=new HashMap<String, Object>(1);
    m.put(this.bigWigId, (float)(sum/count));
    return m ;
    }
@Override
public List<VCFInfoHeaderLine> getDescriptions()
    {
    return Arrays.asList(new VCFInfoHeaderLine[]{
        new VCFInfoHeaderLine(bigWigId,1, VCFHeaderLineType.Float,"Mean values for bigwig file "+this.bigWigFile)   
        });
    }
@Override
public List<String> getKeyNames() {
    return Arrays.asList(new String[]{
            bigWigId    
            });
    }

}
Comments (3)

I just quickly wrote a set of Tools to annotate my VCFs ( http://plindenbaum.blogspot.fr/2013/02/4-tools-i-wrote-today-to-annotate-vcf.html )

For example, one of those tools uses a BED/XML file indexed with tabix to annotate my VCF. (My code just uses the Java API for tabix to get the XML at a given position.)

Question: is there something in the GATK-API that would allow me to implement my code on top of it? What kind of walker should I use? What would be the benefits of using the GATK-API? For example, will using a GATK walker automatically make my code parallelizable?

Pierre

Comments (5)

Hi all,

Wouldn't it be useful to have GATK wrapped in a Python API (pygatk), as pysam is for samtools or pybedtools is for bedtools? Is anybody developing this?

Regards, Pablo.

Comments (1)

I am writing a Java program to call a number of CommandLine classes in succession to sort, reorder and index: I call SortSam, CleanSam, AddOrReplaceReadGroups and ReorderSam. Each has a main() which calls System.exit. Instead of calling that main() (which would kill my pipeline), I call instanceMain on the relevant class using reflection. I pass tmp.bam as the output file and expect tmp.bai to be generated as its index. That works.

After the call returns, I want to rename the tmp file and the tmp index to the original names. I first rename the original file to get it out of the way, and also the original index file. My problem is that the original index file has not been closed and cannot be moved out of the way.

While it is true that calling System.exit (as the CommandLine main does) addresses this issue, it is a VERY BLUNT HAMMER and precludes writing a program which chains command-line operations. How can I close the index file without calling System.exit?
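For what it's worth, a direct (non-reflective) version of the pattern described above looks roughly like this; the class and package names depend on the Picard version being bundled, so treat them as illustrative:

// Picard command-line classes extend CommandLineProgram, whose instanceMain(String[])
// runs the tool and returns an exit code instead of calling System.exit.
// Older Picard releases put SortSam in net.sf.picard.sam; newer releases use picard.sam.
import net.sf.picard.sam.SortSam;

public class ChainPicardSteps {
    public static void main(String[] args) {
        int ret = new SortSam().instanceMain(new String[] {
            "INPUT=original.bam",
            "OUTPUT=tmp.bam",
            "SORT_ORDER=coordinate",
            "CREATE_INDEX=true"     // produces tmp.bai alongside tmp.bam
        });
        if (ret != 0) {
            throw new RuntimeException("SortSam failed with exit code " + ret);
        }
        // Subsequent steps (CleanSam, AddOrReplaceReadGroups, ReorderSam) would follow the same
        // pattern, followed by the renaming of tmp.bam / tmp.bai described above.
    }
}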