# Tagged with #advanced

This document provides an overview of the steps required to make a walker multi-threadable using the -nct and -nt arguments, which make use of the NanoSchedulable and TreeReducible interfaces, respectively.

### NanoSchedulable / -nct

Providing -nct support requires that you certify that your walker's map() method is thread-safe -- e.g., if any data structures are shared across map() calls, access to them must be properly synchronized. Once your map() method is thread-safe, you can implement the NanoSchedulable interface, an empty interface with no methods that simply marks your walker as having a map() method that's safe to parallelize:

/**
 * Root parallelism interface.  Walkers that implement this
 * declare that their map function is thread-safe and so multiple
 * map calls can be run in parallel in the same JVM instance.
 */
public interface NanoSchedulable {
}
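
As a hedged illustration (the walker and field names below are hypothetical, not part of the GATK), the key point is simply that any state shared across map() calls must be safe to touch from several threads at once:

import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative sketch only -- not a real GATK walker.  The only state shared
 * across map() calls is an AtomicLong, so map() is thread-safe and the walker
 * can be marked NanoSchedulable, making it eligible for -nct.
 */
public class MyNanoSchedulableWalker implements NanoSchedulable {
    private final AtomicLong basesVisited = new AtomicLong();

    public Integer map(/* per-locus arguments omitted for brevity */) {
        basesVisited.incrementAndGet();  // safe: atomic update of the shared counter
        return 1;                        // everything else is local to this call
    }
}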


### TreeReducible / -nt

Providing -nt support requires that both map() and reduce() be thread-safe, and you also need to implement the TreeReducible interface. Implementing TreeReducible requires you to write a treeReduce() method that tells the engine how to combine the results of multiple reduce() calls:

public interface TreeReducible<ReduceType> {
    /**
     * A composite, 'reduce of reduces' function.
     *
     * @param lhs 'left-most' portion of data in the composite reduce.
     * @param rhs 'right-most' portion of data in the composite reduce.
     * @return The composite reduce type.
     */
    ReduceType treeReduce(ReduceType lhs, ReduceType rhs);
}


This method differs from reduce() in that while reduce() adds the result of a single map() call onto a running total, treeReduce() takes the aggregated results from multiple map/reduce tasks that have been run in parallel and combines them. So, lhs and rhs might each represent the final result from several hundred map/reduce calls.

Example treeReduce() implementation from the UnifiedGenotyper:

public UGStatistics treeReduce(UGStatistics lhs, UGStatistics rhs) {
    lhs.nBasesCallable += rhs.nBasesCallable;
    lhs.nBasesCalledConfidently += rhs.nBasesCalledConfidently;
    lhs.nBasesVisited += rhs.nBasesVisited;
    return lhs;
}
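
To see how the two methods fit together, here is a hedged sketch (the class and method bodies are made up for illustration; reduce() itself belongs to the walker base class, not to these interfaces) of a walker whose ReduceType is a simple Integer count. reduce() folds one map() result into a per-thread running total, while treeReduce() merges the totals of two parallel lineages:

/**
 * Illustrative sketch only -- not a real GATK walker.  ReduceType is Integer.
 */
public class SiteCountingExample implements TreeReducible<Integer>, NanoSchedulable {
    // reduce(): called once per map() result, folding it into this thread's running total
    public Integer reduce(Integer mapResult, Integer runningTotal) {
        return runningTotal + mapResult;
    }

    // treeReduce(): called by the engine to combine the totals of two groups of
    // map/reduce calls that were executed in parallel
    public Integer treeReduce(Integer lhs, Integer rhs) {
        return lhs + rhs;
    }
}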


The GATK build system uses the Ivy dependency manager to make it easy for our users to add additional dependencies. Ivy can pull the latest jars and their dependencies from the Maven repository, making adding or updating a dependency as simple as adding a new line to the ivy.xml file.

If your tool is available in the Maven repository, add a line to the ivy.xml file similar to the following:

<dependency org="junit" name="junit" rev="4.4" />


If you would like to add a dependency on a tool that is not available in the Maven repository, please email gsahelp@broadinstitute.org.

## Updating SAM-JDK and Picard

Because we work so closely with the SAM-JDK/Picard team and are critically dependent on the code they produce, we have a special procedure for updating the SAM/Picard jars. Please use the following procedure when updating sam-*.jar or picard-*.jar.

• Get the latest svn versions for picard public and picard private by running the following commands:

svn info $PICARD_PUBLIC_HOME | grep "Revision"
svn info $PICARD_PRIVATE_HOME | grep "Revision"

• Rename the jars and xmls in $STING_HOME/settings/repository/net.sf to {picard|sam}-$PICARD_PUBLIC_MAJOR_VERSION.$PICARD_PUBLIC_MINOR_VERSION.$PICARD_PUBLIC_SVN_REV.{jar|xml}

• Update the jars in $STING_HOME/settings/repository/net.sf with their newer equivalents in $PICARD_PUBLIC_HOME/dist/picard_lib.

• Update the xmls in $STING_HOME/settings/repository/net.sf with the appropriate version number ($PICARD_PUBLIC_MAJOR_VERSION.$PICARD_PUBLIC_MINOR_VERSION.$PICARD_PUBLIC_SVN_REV).

### Updating the Picard private jar

• Create the picard private jar with the following command:

ant clean package -Dexecutable=PicardPrivate -Dpicard.dist.dir=${PICARD_PRIVATE_HOME}/dist

• Rename picard-private-parts-*.jar in $STING_HOME/settings/repository/edu.mit.broad to picard-private-parts-$PICARD_PRIVATE_SVN_REV.jar.

• Update picard-private-parts-*.jar in $STING_HOME/settings/repository/edu.mit.broad with the picard-private-parts.jar in $STING_HOME/dist/packages/picard-private-parts.

• Update the xml in $STING_HOME/settings/repository/edu.mit.broad to reflect the new revision and publication date.

## 1. Introduction

The LocusTraversal now supports passing walkers reads that have deletions spanning the current locus. This is useful in many situations, for example when calculating coverage or calling variants, where you need to avoid calling variants at loci covered by many deletions. By default, the system will not pass you deletion-spanning reads. In order to see them, you need to override the following function in your walker:

/**
 * (conceptual static) method that states whether you want to see reads piling up at a locus
 * that contain a deletion at the locus.
 *
 * ref:   ATCTGA
 * read1: ATCTGA
 * read2: AT--GA
 *
 * Normally, the locus iterator only returns a list of read1 at this locus at position 3, but
 * if this function returns true, then the system will return (read1, read2) with offsets
 * of (3, -1).  The -1 offset indicates a deletion in the read.
 *
 * @return false if you don't want to see deletions, or true if you do
 */
public boolean includeReadsWithDeletionAtLoci() { return true; }

Now you will start seeing deletion-spanning reads in your walker. These reads are flagged with offsets of -1, so that you can:

for ( int i = 0; i < context.getReads().size(); i++ ) {
    SAMRecord read = context.getReads().get(i);
    int offset = context.getOffsets().get(i);
    if ( offset == -1 )
        nDeletionReads++;
    else
        nCleanReads++;
}

There are also two convenience functions in AlignmentContext to extract subsets of the reads with and without spanning deletions:

/**
 * Returns only the reads in ac that do not contain spanning deletions of this locus
 *
 * @param ac
 * @return
 */
public static AlignmentContext withoutSpanningDeletions( AlignmentContext ac );

/**
 * Returns only the reads in ac that do contain spanning deletions of this locus
 *
 * @param ac
 * @return
 */
public static AlignmentContext withSpanningDeletions( AlignmentContext ac );
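
As a small hedged example (the surrounding walker is hypothetical; only the two helpers and getReads() come from the text above), a map() body could use these helpers to count clean and deletion-spanning reads without inspecting the offsets directly:

// Inside a hypothetical map(...) body, given an AlignmentContext named context:
AlignmentContext clean = AlignmentContext.withoutSpanningDeletions(context);
AlignmentContext dels  = AlignmentContext.withSpanningDeletions(context);
int nCleanReads    = clean.getReads().size();   // reads without a spanning deletion here
int nDeletionReads = dels.getReads().size();    // reads with a deletion spanning this locus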
## Implementing a Queue JobRunner

The following scala methods need to be implemented for a new JobRunner. See the implementations of GridEngine and LSF for concrete full examples.

### 1. class JobRunner.start()

start() should copy the settings from the CommandLineFunction into your job scheduler and invoke the command via sh <jobScript>. As an example of what needs to be implemented, here are the current contents of the start() method in MyCustomJobRunner, which contains the pseudo code.

def start() {
  // TODO: Copy settings from function to your job scheduler syntax.

  val mySchedulerJob = new ...

  // Set the display name to 4000 characters of the description (or whatever your max is)
  mySchedulerJob.displayName = function.description.take(4000)

  // Set the output file for stdout
  mySchedulerJob.outputFile = function.jobOutputFile.getPath

  // Set the current working directory
  mySchedulerJob.workingDirectory = function.commandDirectory.getPath

  // If the error file is set specify the separate output for stderr
  if (function.jobErrorFile != null) {
    mySchedulerJob.errFile = function.jobErrorFile.getPath
  }

  // If a project name is set specify the project name
  if (function.jobProject != null) {
    mySchedulerJob.projectName = function.jobProject
  }

  // If the job queue is set specify the job queue
  if (function.jobQueue != null) {
    mySchedulerJob.queue = function.jobQueue
  }

  // If the resident set size is requested pass on the memory request
  if (residentRequestMB.isDefined) {
    mySchedulerJob.jobMemoryRequest = "%dM".format(residentRequestMB.get.ceil.toInt)
  }

  // If the resident set size limit is defined specify the memory limit
  if (residentLimitMB.isDefined) {
    mySchedulerJob.jobMemoryLimit = "%dM".format(residentLimitMB.get.ceil.toInt)
  }

  // If the priority is set (user specified Int) specify the priority
  if (function.jobPriority.isDefined) {
    mySchedulerJob.jobPriority = function.jobPriority.get
  }

  // Instead of running the function.commandLine, run "sh <jobScript>"
  mySchedulerJob.command = "sh " + jobScript

  // Store the status so it can be returned in the status method.
  myStatus = RunnerStatus.RUNNING

  // Start the job and store the id so it can be killed in tryStop
  myJobId = mySchedulerJob.start()
}

### 2. class JobRunner.status

The status method should return one of the enum values from org.broadinstitute.sting.queue.engine.RunnerStatus:

• RunnerStatus.RUNNING
• RunnerStatus.DONE
• RunnerStatus.FAILED

### 3. object JobRunner.init()

Add any initialization code to the companion object static initializer. See the LSF or GridEngine implementations for how this is done.

### 4. object JobRunner.tryStop()

The jobs that are still in RunnerStatus.RUNNING will be passed into this function. tryStop() should send these jobs the equivalent of a Ctrl-C or SIGTERM(15), or worst case a SIGKILL(9) if SIGTERM is not available.

## Running Queue with a new JobRunner

Once there is a basic implementation, you can try out the Hello World example with -jobRunner MyJobRunner.

java -Djava.io.tmpdir=tmp -jar dist/Queue.jar -S scala/qscript/examples/HelloWorld.scala -jobRunner MyJobRunner -run

If all goes well, Queue should dispatch the job to your job scheduler and wait until the status returns RunnerStatus.DONE, and hello world should be echoed into the output file, possibly with other log messages. See QFunction and Command Line Options for more info on Queue options.

## 1. Testing core walkers is critical

Most GATK walkers are really too complex to easily test using the standard unit test framework. It's just not feasible to make artificial read piles and then extrapolate from simple tests passing whether the system as a whole is working correctly. However, we need some way to determine whether changes to the core of the GATK are altering the expected output of complex walkers like BaseRecalibrator or SingleSampleGenotyper. In addition to correctness, we want to make sure that the performance of key walkers isn't degrading over time, so that calling SNPs, cleaning indels, etc., isn't slowly creeping downward.
Since we are now using a bamboo server to automatically build and run unit tests (as well as measure their runtimes), we want to put as many good walker tests into the test framework as possible, so that we capture performance metrics over time.

## 2. The WalkerTest framework

To make this testing process easier, we've created a WalkerTest framework that lets you invoke the GATK using command-line GATK commands in the JUnit system and test for changes in your output files by comparing the current ant build results to previous runs via an MD5 sum. It's a bit coarse-grained, but it works to ensure that changes to key walkers are detected quickly by the system, and authors can either update the expected MD5s or go track down bugs.

The system is fairly straightforward to use. Ultimately we will end up with JUnit-style tests in the unit testing structure. The piece of code below checks the MD5 of the SingleSampleGenotyper's GELI text output at LOD 3 and LOD 10.

package org.broadinstitute.sting.gatk.walkers.genotyper;

import org.broadinstitute.sting.WalkerTest;
import org.junit.Test;

import java.util.HashMap;
import java.util.Map;
import java.util.Arrays;

public class SingleSampleGenotyperTest extends WalkerTest {
    @Test
    public void testLOD() {
        HashMap<Double, String> e = new HashMap<Double, String>();
        e.put( 10.0, "e4c51dca6f1fa999f4399b7412829534" );
        e.put( 3.0, "d804c24d49669235e3660e92e664ba1a" );

        for ( Map.Entry<Double, String> entry : e.entrySet() ) {
            WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
                    "-T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout %s --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod " + entry.getKey(),
                    1, Arrays.asList(entry.getValue()));
            executeTest("testLOD", spec);
        }
    }
}

The fundamental piece here is to inherit from WalkerTest. This gives you access to the executeTest() function that consumes a WalkerTestSpec:

public WalkerTestSpec(String args, int nOutputFiles, List<String> md5s)

The WalkerTestSpec takes regular, command-line style GATK arguments describing what you want to run, the number of output files the walker will generate, and your expected MD5s for each of these output files. The args string can contain %s String.format specifications, and for each of the nOutputFiles the executeTest() function will (1) generate a tmp file for output and (2) call String.format on your args to fill in the tmp output files in your arguments string. For example, in the above argument string -varout is followed by %s, so our single SingleSampleGenotyper output is the variant output file.

## 3. Example output

When you add a WalkerTest-inherited unit test to the GATK and then run the ant test target, you'll see output that looks like:

[junit] WARN 13:29:50,068 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:29:50,068 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit]
[junit] WARN 13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit]
[junit] WARN 13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a]
[junit] WARN 13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a]
[junit] WARN 13:30:39,408 WalkerTest - => testLOD PASSED
[junit] WARN 13:30:39,408 WalkerTest - => testLOD PASSED
[junit] WARN 13:30:39,409 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:30:39,409 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit]
[junit] WARN 13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit]
[junit] WARN 13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534]
[junit] WARN 13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534]
[junit] WARN 13:31:30,213 WalkerTest - => testLOD PASSED
[junit] WARN 13:31:30,213 WalkerTest - => testLOD PASSED
[junit] WARN 13:31:30,214 SingleSampleGenotyperTest -
[junit] WARN 13:31:30,214 SingleSampleGenotyperTest -

## 4. Recommended location for GATK testing data

We keep all of the permanent GATK testing data in:

/humgen/gsa-scr1/GATK_Data/Validation_Data/

A good set of data to use for walker testing is the CEU daughter data from 1000 Genomes:

gsa2 ~/dev/GenomeAnalysisTK/trunk > ls -ltr /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.bam /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.calls
-rw-rw-r--+ 1 depristo wga  51M 2009-09-03 07:56 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam
-rw-rw-r--+ 1 depristo wga 185K 2009-09-04 13:21 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.variants.geli.calls
-rw-rw-r--+ 1 depristo wga 164M 2009-09-04 13:22 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.genotypes.geli.calls
-rw-rw-r--+ 1 depristo wga  24M 2009-09-04 15:00 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SOLID.bam
-rw-rw-r--+ 1 depristo wga  12M 2009-09-04 15:01 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.454.bam
-rw-r--r--+ 1 depristo wga  91M 2009-09-04 15:02 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam

## 5. Test dependencies

The tests depend on a variety of input files that are generally constrained to three mount points on the internal Broad network:

• /seq/
• /humgen/1kg/
• /humgen/gsa-hpprojects/GATK/Data/Validation_Data/

To run the unit and integration tests you'll have to have access to these files. They may have different mount points on your machine (say, if you're running remotely over the VPN and have mounted the directories on your own machine).

## 6. MD5 database and comparing MD5 results

Every file that generates an MD5 sum as part of the WalkerTest framework will be copied to <MD5>.integrationtest in the integrationtests subdirectory of the GATK trunk. This MD5 database of results enables you to easily examine the results of an integration test as well as compare the results of a test before/after a code change. For example, below is an example test for the UnifiedGenotyper where, due to a code change, the output VCF differs from the VCF with the expected MD5 value in the test code itself.
The test provides the path to the two results files as well as a diff command to compare the expected to the observed MD5:

[junit] --------------------------------------------------------------------------------
[junit] Executing test testParameter[-genotype] with GATK arguments: -T UnifiedGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05997727998894311741.tmp -L 1:10,000,000-10,010,000 -genotype
[junit] ##### MD5 file is up to date: integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest
[junit] Checking MD5 for /tmp/walktest.tmp_param.05997727998894311741.tmp [calculated=ab20d4953b13c3fc3060d12c7c6fe29d, expected=0ac7ab893a3f550cb1b8c34f28baedf6]
[junit] ##### Test testParameter[-genotype] is going fail #####
[junit] ##### Path to expected file (MD5=0ac7ab893a3f550cb1b8c34f28baedf6): integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest
[junit] ##### Path to calculated file (MD5=ab20d4953b13c3fc3060d12c7c6fe29d): integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest
[junit] ##### Diff command: diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest

Examining the diff, we see a few lines where the new code has changed the DP count:

> diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest | head
385,387c385,387
< 1 10000345 . A . 106.54 . AN=2;DP=33;Dels=0.00;MQ=89.17;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:25:-0.09,-7.57,-75.74:74.78
< 1 10000346 . A . 103.75 . AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:24:-0.07,-7.27,-76.00:71.99
< 1 10000347 . A . 109.79 . AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:26:-0.05,-7.85,-84.74:78.04
---
> 1 10000345 . A . 106.54 . AN=2;DP=32;Dels=0.00;MQ=89.50;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:25:-0.09,-7.57,-75.74:74.78
> 1 10000346 . A . 103.75 . AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:24:-0.07,-7.27,-76.00:71.99
> 1 10000347 . A . 109.79 . AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:26:-0.05,-7.85,-84.74:78

Whether this is the expected change is up to you to decide, but the system makes it as easy as possible to see the consequences of your code change.

## 7. Testing for Exceptions

The walker test framework supports an additional syntax for ensuring that a particular Java Exception is thrown when a walker executes, using a simple alternate version of the WalkerTestSpec object. Rather than specifying the MD5 of the result, you provide a single subclass of Exception.class and the testing framework will ensure that when the walker runs, an instance (class or subclass) of your expected exception is thrown. The system also flags if no exception is thrown.
For example, the following code tests that the GATK can detect and error out when incompatible VCF and FASTA files are given:

@Test
public void fail8() {
    executeTest("hg18lex-v-b36", test(lexHG18, callsB36));
}

private WalkerTest.WalkerTestSpec test(String ref, String vcf) {
    return new WalkerTest.WalkerTestSpec("-T VariantsToTable -M 10 -B:two,vcf " + vcf + " -F POS,CHROM -R " + ref + " -o %s",
            1, UserException.IncompatibleSequenceDictionaries.class);
}

During the integration test this looks like:

[junit] Executing test hg18lex-v-b36 with GATK arguments: -T VariantsToTable -M 10 -B:two,vcf /humgen/gsa-hpprojects/GATK/data/Validation_Data/lowpass.N3.chr1.raw.vcf -F POS,CHROM -R /humgen/gsa-hpprojects/GATK/data/Validation_Data/lexFasta/lex.hg18.fasta -o /tmp/walktest.tmp_param.05541601616101756852.tmp -l WARN -et NO_ET
[junit]
[junit] Wanted exception class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries, saw class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries
[junit] => hg18lex-v-b36 PASSED

## 8. Miscellaneous information

• Please do not put any extremely long tests in the regular ant build test target. We are currently splitting the system into fast and slow tests so that unit tests can be run in < 3 minutes while saving a test target for long-running regression tests. More information on that will be posted.
• An expected MD5 string of "" means don't check for equality between the calculated and expected MD5s. This is useful if you are just writing a new test and don't know the true output.
• Override parameterize() { return true; } if you want the system to just run your calculations across all tests and not throw an error if your MD5s don't match.
• If your tests suddenly stop giving matching MD5s, you can just (1) look at the .tmp output files directly or (2) grab the printed GATK command-line options and explore what is happening.
• You can always run a GATK walker on the command line and then run md5sum on its output files to obtain, outside of the testing framework, the expected MD5 results.
• Don't worry about the duplication of lines in the output; it's just an annoyance of having two global loggers. Eventually we'll bug-fix this away.

### 1. Introduction

When running either single-threaded or in shared-memory parallelism mode, the GATK guarantees that output written to an output stream created via the @Argument mechanism will ultimately be assembled in genomic order. In order to assemble the final output file, the GATK will write the output generated from each thread into a temporary output file, ultimately assembling the data via a central coordinating thread. There are three major elements in the GATK that facilitate this functionality:

• Stub: the front-end interface to the output management system. Stubs will be injected into the walker by the command-line argument system and relay information from the walker to the output management system. There will be one stub per invocation of the GATK.
• Storage: the back-end interface, responsible for creating, writing and deleting temporary output files as well as merging their contents back into the primary output file. One Storage object will exist per shard processed in the GATK.
• OutputTracker: the dispatcher; ultimately connects the stub object's output creation request back to the most appropriate storage object to satisfy that request. One OutputTracker will exist per GATK invocation.
### 2. Basic Mechanism

Stubs are directly injected into the walker through the GATK's command-line argument parser as a go-between from walker to output management system. When a walker calls into the stub, its first responsibility is to call into the output tracker to retrieve an appropriate storage object. The behavior of the OutputTracker from this point forward depends mainly on the parallelization mode of this traversal of the GATK.

#### If the traversal is single-threaded:

• The OutputTracker (implemented as DirectOutputTracker) will create the storage object if necessary and return it to the stub.
• The stub will forward the request to the provided storage object.
• At the end of the traversal, the microscheduler will request that the OutputTracker finalize and close the file.

#### If the traversal is multi-threaded using shared-memory parallelism:

• The OutputTracker (implemented as ThreadLocalOutputTracker) will look for a storage object associated with this thread via a ThreadLocal.
• If no such storage object exists, it will be created pointing to a temporary file.
• At the end of each shard processed, that file will be closed and an OutputMergeTask will be created so that the shared-memory parallelism code can merge the output at its leisure.
• The shared-memory parallelism code will merge when a fixed number of temporary files appear in the input queue. The constant used to determine this frequency is fixed at compile time (see HierarchicalMicroScheduler.MAX_OUTSTANDING_OUTPUT_MERGES).

### 3. Using output management

To use the output management system, declare a field in your walker of one of the existing core output types, coupled with either an @Argument or @Output annotation.

@Output(doc="Write output to this BAM filename instead of STDOUT")
SAMFileWriter out;

Currently supported output types are SAM/BAM (declare SAMFileWriter), VCF (declare VCFWriter), and any non-buffering stream extending from OutputStream.

### 4. Implementing a new output type

To create a new output type, three types must be implemented: Stub, Storage, and ArgumentTypeDescriptor.

#### To implement Stub

Create a new Stub class, extending/inheriting the core output type's interface and implementing the Stub interface.

OutputStreamStub extends OutputStream implements Stub<OutputStream> {

Implement a register function so that the engine can provide the stub with the session's OutputTracker.

public void register( OutputTracker outputTracker ) {
    this.outputTracker = outputTracker;
}

Add as fields any parameters necessary for the storage object to create temporary storage.

private final File targetFile;
public File getOutputFile() { return targetFile; }

Implement/override every method in the core output type's interface to pass along calls to the appropriate storage object via the OutputTracker.

public void write( byte[] b, int off, int len ) throws IOException {
    outputTracker.getStorage(this).write(b, off, len);
}

#### To implement Storage

Create a Storage class, again extending/inheriting the core output type's interface and implementing the Storage interface.

public class OutputStreamStorage extends OutputStream implements Storage<OutputStream> {

Implement constructors that will accept just the Stub, or Stub + alternate file path, and create a repository for data, and a close function that will close that repository.

public OutputStreamStorage( OutputStreamStub stub ) { ... }
public OutputStreamStorage( OutputStreamStub stub, File file ) { ... }
public void close() { ... }
Implement a mergeInto function capable of reconstituting the file created by the constructor, dumping it back into the core output type's interface, and removing the source file.

public void mergeInto( OutputStream targetStream ) { ... }

Add a block to StorageFactory.createStorage() capable of creating the new storage object. TODO: use reflection to generate the storage classes.

if (stub instanceof OutputStreamStub) {
    if ( file != null )
        storage = new OutputStreamStorage((OutputStreamStub)stub, file);
    else
        storage = new OutputStreamStorage((OutputStreamStub)stub);
}

#### To implement ArgumentTypeDescriptor

Create a new object inheriting from type ArgumentTypeDescriptor. Note that the ArgumentTypeDescriptor does NOT need to support the core output type's interface.

public class OutputStreamArgumentTypeDescriptor extends ArgumentTypeDescriptor {

Implement a truth function indicating which types this ArgumentTypeDescriptor can service.

@Override
public boolean supports( Class type ) {
    return SAMFileWriter.class.equals(type) || StingSAMFileWriter.class.equals(type);
}

Implement a parse function that constructs the new Stub object. The function should register this type as an output by calling engine.addOutput(stub).

public Object parse( ParsingEngine parsingEngine, ArgumentSource source, Type type, ArgumentMatches matches ) {
    ...
    OutputStreamStub stub = new OutputStreamStub(new File(fileName));
    ...
    engine.addOutput(stub);
    ...
    return stub;
}

Add a creator for this new ArgumentTypeDescriptor in CommandLineExecutable.getArgumentTypeDescriptors().

protected Collection<ArgumentTypeDescriptor> getArgumentTypeDescriptors() {
    return Arrays.asList( new VCFWriterArgumentTypeDescriptor(engine,System.out,argumentSources),
                          new SAMFileWriterArgumentTypeDescriptor(engine,System.out),
                          new OutputStreamArgumentTypeDescriptor(engine,System.out) );
}

After creating these three objects, the new output type should be ready for usage as described above.

### 5. Outstanding issues

• Only non-buffering iterators are currently supported by the GATK. Of particular note, PrintWriter will appear to drop records if created by the command-line argument system; use PrintStream instead.
• For efficiency, the GATK does not reduce output files together following the tree pattern used by shared-memory parallelism; output merges happen via an independent queue. Because of this, output merges happening during a treeReduce may not behave correctly.

### 1. What is DiffEngine?

DiffEngine is a summarizing difference engine that allows you to compare two structured files -- such as BAMs and VCFs -- to find the differences between them. This is primarily useful in regression testing or optimization, where you want to ensure that the differences are those that you expect and not any others.

### 2. The summarized differences

The GATK contains a summarizing difference engine called DiffEngine that compares hierarchical data structures to emit:

• A list of specific differences between the two data structures. This is similar to saying the value in field A in record 1 in file F differs from the value in field A in record 1 in file G.
• A summarized list of differences ordered by frequency of the difference. This output is similar to saying field A differed in 50 records between files F and G.

### 3. The DiffObjects walker

The GATK contains a private walker called DiffObjects that allows you access to the DiffEngine capabilities on the command line.
Simply provide the walker with the master and test files and it will emit summarized differences for you.

### 4. Understanding the output

The DiffEngine system compares two hierarchical data structures for specific differences in the values of named nodes. Suppose I have the following trees:

Tree1=(A=1 B=(C=2 D=3))
Tree2=(A=1 B=(C=3 D=3 E=4))
Tree3=(A=1 B=(C=4 D=3 E=4))

where every node in the tree is named, or is a raw value (here all leaf values are integers). The DiffEngine traverses these data structures by name, identifies equivalent nodes by fully qualified names (Tree1.A is distinct from Tree2.A), and determines whether their values are equal (Tree1.A=1, Tree2.A=1, so they are). These itemized differences are listed as:

Tree1.B.C=2 != Tree2.B.C=3
Tree1.B.C=2 != Tree3.B.C=4
Tree2.B.C=3 != Tree3.B.C=4
Tree1.B.E=MISSING != Tree2.B.E=4

This is conceptually very similar to the output of the unix command-line tool diff. What's nice about DiffEngine, though, is that it computes similarity among the itemized differences and displays the counts of differences by name. In the above example, the field C is not equal three times, while the missing E in Tree1 occurs only once. So the summary is:

*.B.C : 3
*.B.E : 1

where the * operator indicates that any named field matches. This output is sorted by counts, and provides an immediate picture of the commonly occurring differences between the files.

Below is a detailed example of two VCF fields that differ because of a bug in the AC, AF, and AN counting routines, detected by the integrationtest framework (more below). You can see that although there are many specific instances of these differences between the two files, the summarized differences provide an immediate picture that the AC, AF, and AN fields are the major causes of the differences.

[testng] path count
[testng] *.*.*.AC 6
[testng] *.*.*.AF 6
[testng] *.*.*.AN 6
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000598.AC 1

### 5. Integration tests

The DiffEngine codebase that supports these calculations is integrated into the integrationtest framework, so that when a test fails the system automatically summarizes the differences between the master MD5 file and the failing MD5 file, if it is an understood type. When failing, you will see in the integration test logs not only the basic information, but also the detailed DiffEngine output.
For example, in the output below I broke the GATK BAQ calculation, and the integration test DiffEngine clearly identifies that all of the records differ in their BQ tag value in the two BAM files:

/humgen/1kg/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam -o /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp -L 1:10,000,000-10,100,000 -baq RECALCULATE -et NO_ET
[testng] WARN 22:59:22,875 TextFormattingUtils - Unable to load help text. Help output will be sparse.
[testng] WARN 22:59:22,875 TextFormattingUtils - Unable to load help text. Help output will be sparse.
[testng] ##### MD5 file is up to date: integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] Checking MD5 for /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp [calculated=e5147656858fc4a5f470177b94b1fc1b, expected=4ac691bde1ba1301a59857694fda6ae2]
[testng] ##### Test testPrintReadsRecalBAQ is going fail #####
[testng] ##### Path to expected file (MD5=4ac691bde1ba1301a59857694fda6ae2): integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest
[testng] ##### Path to calculated file (MD5=e5147656858fc4a5f470177b94b1fc1b): integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] ##### Diff command: diff integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] ##:GATKReport.v0.1 diffences : Summarized differences between the master and test files.
[testng] See http://www.broadinstitute.org/gsa/wiki/index.php/DiffObjectsWalker_and_SummarizedDifferences for more information
[testng] Difference NumberOfOccurrences
[testng] *.*.*.BQ 895
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:2:266:272:361.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:245:474:254.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:255:178:160.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:158:682:495.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:195:591:884.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:165:236:848.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:191:223:910.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:286:279:434.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:2:106:516:354.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:3:102:580:518.BQ 1
[testng]
[testng] Note that the above list is not comprehensive. At most 20 lines of output, and 10 specific differences will be listed. Please use -T DiffObjects -R public/testdata/exampleFASTA.fasta -m integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest -t integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest to explore the differences more freely

### 6. Adding your own DiffableObjects to the system

The system dynamically finds all classes that implement the following simple interface:
public interface DiffableReader {
    @Ensures("result != null")
    /**
     * Return the name of this DiffableReader type.  For example, the VCF reader returns 'VCF'
     * and the bam reader 'BAM'
     */
    public String getName();

    @Ensures("result != null")
    @Requires("file != null")
    /**
     * Read up to maxElementsToRead DiffElements from file, and return them.
     */
    public DiffElement readFromFile(File file, int maxElementsToRead);

    /**
     * Return true if the file can be read into DiffElement objects with this reader. This should
     * be uniquely true/false for all readers, as the system will use the first reader that can read the
     * file.  This routine should never throw an exception.  The VCF reader, for example, looks at the
     * first line of the file for the ##format=VCF4.1 header, and the BAM reader for the BAM_MAGIC value
     * @param file
     * @return
     */
    @Requires("file != null")
    public boolean canRead(File file);
}

See the VCF and BAMDiffableReaders for example implementations. If you extend this to a new object type, both the DiffObjects walker and the integrationtest framework will automatically work with your new file type.

### 1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

### 2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply. JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"

• QUAL is a key: the name of the annotation we want to look at
• 30.0 is a value: the threshold that we want to use to evaluate variant quality against
• > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

### 3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND". Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".
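
The GATK's JEXL support is built on the Apache Commons JEXL library. The following standalone sketch (not GATK code; it assumes the commons-jexl2 artifact is on the classpath, and the class name is made up) shows how such an expression is evaluated against a set of key/value pairs, which is conceptually what happens with the annotations of each variant record:

import org.apache.commons.jexl2.Expression;
import org.apache.commons.jexl2.JexlContext;
import org.apache.commons.jexl2.JexlEngine;
import org.apache.commons.jexl2.MapContext;

public class JexlDemo {
    public static void main(String[] args) {
        JexlEngine jexl = new JexlEngine();
        // The same kind of expression you would pass to VariantFiltration/SelectVariants
        Expression expr = jexl.createExpression("QUAL > 30.0 && DP >= 10");

        // The context plays the role of a single variant record's annotations
        JexlContext ctx = new MapContext();
        ctx.set("QUAL", 45.3);
        ctx.set("DP", 12);

        System.out.println(expr.evaluate(ctx)); // prints "true"
    }
}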
### 4. Important caveats

#### Sensitivity to case and type

• Case: Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.
• Type: The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

#### Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.

### 5. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

#### Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of the GATK; it's part of the variant library included within the Picard tools source code, which the GATK uses for convenience. The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line. In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The best way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

#### Examples using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of a JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, that are not filtered, and that are non-reference in the 01-0263 sample:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263

#### Examples using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'

#### Example using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column.
However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'

WARNING: unfortunately we do not have the resources to directly support the HLA typer at this time. As such, this tool is no longer under active development or supported by our group. The source code is available in the GATK as is. This tool may or may not work without substantial experimentation by an analyst.

### Introduction

Inherited DNA sequence variation in the major histocompatibility complex (MHC) on human chromosome 6 significantly influences the inherited risk for autoimmune diseases and the host response to pathogenic infections. Collecting allelic sequence information at the classical human leukocyte antigen (HLA) genes is critical for matching in organ transplantation and for genetic association studies, but is complicated due to the high degree of polymorphism across the MHC. Next-generation sequencing offers a cost-effective alternative to Sanger-based sequencing, which has been the standard for classical HLA typing. To bridge the gap between traditional typing and newer sequencing technologies, we developed a generic algorithm to call HLA alleles at 4-digit resolution from next-generation sequence data.

### Downloading the HLA tools

The HLA-specific walkers/tools (FindClosestHLA, CalculateBaseLikelihoods, and HLACaller) are available as a separate download from our FTP site and as source code only. Instructions for obtaining and compiling them are as follows.

Download the source code (in a tar ball):

location: ftp://gsapubftp-anonymous@ftp.broadinstitute.org
password: <blank>
subdirectory: HLA/

Untar the file, 'cd' into the untar'ed directory and compile with 'ant'.

Remember that we no longer support this tool, so if you encounter issues with any of these steps please do NOT post them to our support forum.

### The algorithm

Algorithmic components of the HLA caller: The HLA caller algorithm, developed as part of the open-source GATK, examines sequence reads aligned to the classical HLA loci, taking SAM/BAM formatted files as input, and calculates, for each locus, the posterior probabilities for all pairs of classical alleles based on three key considerations: (1) genotype calls at each base position, (2) phase information of nearby variants, and (3) population-specific allele frequencies. See the diagram below for a visualization of the heuristic. The output of the algorithm is a list of HLA allele pairs with the highest posterior probabilities.

Functionally, the HLA caller was designed to run in three steps:

1. The "FindClosestAllele" walker detects misaligned reads by comparing each read to the dictionary of HLA alleles (reads with < 75% SNP homology to the closest matching allele are removed)
2. The "CalculateBaseLikelihoods" walker calculates the likelihoods for each genotype at each position within the HLA loci and finds the polymorphic sites in relation to the reference
3. The "HLACaller" walker reads the output of the previous steps, and makes the likelihood / probability calculations based on base genotypes, phase information, and allele frequencies.

### Required inputs

These inputs are required:

• Aligned sequence (.bam) file - input data
• Genomic reference (.fasta) file - human genome build 36
• HLA exons (HLA.intervals) file - list of HLA loci / exons to examine
• HLA dictionary - list of HLA alleles, DNA sequences, and genomic positions
• HLA allele frequencies - allele frequencies for HLA alleles across multiple populations
• HLA polymorphic sites - list of polymorphic sites (used by the FindClosestHLA walker)

You can download the latter four here.

### Usage and Arguments

#### Standard GATK arguments (applies to subsequent functions)

The GATK contains a wealth of tools for analysis of sequencing data. Required inputs include an aligned bam file and reference fasta file. The following example shows how to calculate depth of coverage.

Usage:

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -I input.bam -R ref.fasta -L input.intervals > output.doc

Arguments:

• -T (required) name of walker/function
• -I (required) Input (.bam) file
• -R (required) Genomic reference (.fasta) file
• -L (optional) Interval or list of genomic intervals to run the genotyper on

#### FindClosestHLA

The FindClosestHLA walker traverses each read, compares it to all overlapping HLA alleles (at specific polymorphic sites), and identifies the closest matching alleles. This is useful for detecting misalignments (low concordance with best-matching alleles), and helps narrow the list of candidate alleles (narrowing the search space reduces computation time) for subsequent analysis by the HLACaller walker. Inputs include the HLA dictionary, a list of polymorphic sites in the HLA, and the exons of interest. Output is a file (output.filter) that includes the closest matching alleles and statistics for each read.

Usage:

java -jar GenomeAnalysisTK.jar -T FindClosestHLA -I input.bam -R ref.fasta -L HLA_EXONS.intervals -HLAdictionary HLA_DICTIONARY.txt \
    -PolymorphicSites HLA_POLYMORPHIC_SITES.txt -useInterval HLA_EXONS.intervals | grep -v INFO > output.filter

Arguments:

• -HLAdictionary (required) HLA_DICTIONARY.txt file
• -PolymorphicSites (required) HLA_POLYMORPHIC_SITES.txt file
• -useInterval (required) HLA_EXONS.intervals file

#### CalculateBaseLikelihoods

The CalculateBaseLikelihoods walker traverses each base position to determine the likelihood for each of the 10 diploid genotypes. These calculations are used later by HLACaller to determine likelihoods for HLA allele pairs based on genotypes, as well as determining the polymorphic sites used in the phasing algorithm. Inputs include the aligned bam input, (optional) results from FindClosestHLA (to remove misalignments), and cutoff values for inclusion or exclusion of specific reads. Output is a file (output.baselikelihoods) that contains base likelihoods at each position.

Usage:

java -jar GenomeAnalysisTK.jar -T CalculateBaseLikelihoods -I input.bam -R ref.fasta -L HLA_EXONS.intervals -filter output.filter \
    -maxAllowedMismatches 6 -minRequiredMatches 0 | grep -v "INFO" | grep -v "MISALIGNED" > output.baselikelihoods

Arguments:

• -filter (optional) file = output of FindClosestHLA walker (output.filter - to exclude misaligned reads in genotype calculations)
• -maxAllowedMismatches (optional) max number of mismatches tolerated between a read and the closest allele (default = 6)
• -minRequiredMatches (optional) min number of base matches required between a read and the closest allele (default = 0)

#### HLACaller

The HLACaller walker calculates the likelihoods for observing pairs of HLA alleles given the data based on genotype, phasing, and allele frequency information.
It traverses through each read as part of the phasing algorithm to determine likelihoods based on phase information. The inputs include an aligned bam file, the outputs from FindClosestHLA and CalculateBaseLikelihoods, the HLA dictionary and allele frequencies, and optional cutoffs for excluding specific reads due to misalignment (maxAllowedMismatches and minRequiredMatches).

Usage:

java -jar GenomeAnalysisTK.jar -T HLACaller -I input.bam -R ref.fasta -L HLA_EXONS.intervals -filter output.filter -baselikelihoods output.baselikelihoods \
    -maxAllowedMismatches 6 -minRequiredMatches 5 -HLAdictionary HLA_DICTIONARY.txt -HLAfrequencies HLA_FREQUENCIES.txt | grep -v "INFO" > output.calls

Arguments:

• -baseLikelihoods (required) output of CalculateBaseLikelihoods walker (output.baselikelihoods - genotype likelihoods / list of polymorphic sites from the data)
• -HLAdictionary (required) HLA_DICTIONARY.txt file
• -HLAfrequencies (required) HLA_FREQUENCIES.txt file
• -useInterval (required) HLA_EXONS.intervals file
• -filter (optional) file = output of FindClosestAllele walker (to exclude misaligned reads in genotype calculations)
• -maxAllowedMismatches (optional) max number of mismatched bases tolerated between a read and the closest allele (default = 6)
• -minRequiredMatches (optional) min number of base matches required between a read and the closest allele (default = 5)
• -minFreq (optional) minimum allele frequency required to consider the HLA allele (default = 0.0)

#### Example workflow (genome-wide HiSeq data in NA12878 from HapMap; computations were performed on the Broad servers)

Extract sequences from the HLA loci and make a new bam file:

use Java-1.6
set HLA=/seq/NKseq/sjia/HLA_CALLER
set GATK=/seq/NKseq/sjia/Sting/dist/GenomeAnalysisTK.jar
set REF=/humgen/1kg/reference/human_b36_both.fasta
cp $HLA/samheader NA12878.HLA.sam

java -jar $GATK -T PrintReads \
    -I /seq/dirseq/ftp/NA12878_exome/NA12878.bam -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta \
    -L $HLA/HLA.intervals | grep -v RESULT | sed 's/chr6/6/g' >> NA12878.HLA.sam



Use FindClosestHLA to find closest matching HLA alleles and to detect possible misalignments:

java -jar $GATK -T FindClosestHLA -I NA12878.HLA.bam -R $REF -L $HLA/HLA_EXONS.intervals -useInterval $HLA/HLA_EXONS.intervals \
    -HLAdictionary $HLA/HLA_DICTIONARY.txt -PolymorphicSites $HLA/HLA_POLYMORPHIC_SITES.txt | grep -v INFO > NA12878.HLA.filter

READ_NAME   START-END       S   %Match  Matches Discord Alleles
20FUKAAXX100202:7:63:8309:75917 30018423-30018523   1.0 1.000   1   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,...
20GAVAAXX100126:3:24:13495:18608    30018441-30018541   1.0 1.000   3   0   HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,...
20FUKAAXX100202:8:44:16857:92134    30018442-30018517   1.0 1.000   1   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,HLA_A*010105,...
20FUKAAXX100202:8:5:4309:85338  30018452-30018552   1.0 1.000   3   0   HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,HLA_A*110105,...
20GAVAAXX100126:3:28:7925:160832    30018453-30018553   1.0 1.000   3   0   HLA_A*0312,HLA_A*110101,HLA_A*110102,HLA_A*110103,HLA_A*110104,HLA_A*110105,...
20FUKAAXX100202:1:2:10539:169258    30018459-30018530   1.0 1.000   1   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,.
20FUKAAXX100202:8:43:18611:44456    30018460-30018560   1.0 1.000   3   0   HLA_A*01010101,HLA_A*01010102N,HLA_A*010102,HLA_A*010103,HLA_A*010104,...


Use CalculateBaseLikelihoods to determine genotype likelihoods at every base position:

java -jar $GATK -T CalculateBaseLikelihoods -I NA12878.HLA.bam -R $REF -L $HLA/HLA_EXONS.intervals \
    -filter NA12878.HLA.filter -maxAllowedMismatches 6 -minRequiredMatches 0 | grep -v INFO | grep -v MISALIGNED > NA12878.HLA.baselikelihoods

chr:pos    Ref Counts              AA      AC      AG      AT      CC      CG      CT      GG      GT      TT
6:30018513 G   A[0]C[0]T[1]G[39]   -113.58 -113.58 -13.80  -113.29 -113.58 -13.80  -113.29 -3.09   -13.50  -113.11
6:30018514 C   A[0]C[39]T[0]G[0]   -119.91 -13.00  -119.91 -119.91 -2.28   -13.00  -13.00  -119.91 -119.91 -119.91
6:30018515 T   A[0]C[0]T[39]G[0]   -118.21 -118.21 -118.21 -13.04  -118.21 -118.21 -13.04  -118.21 -13.04  -2.35
6:30018516 C   A[0]C[38]T[1]G[0]   -106.91 -13.44  -106.91 -106.77 -3.05   -13.44  -13.30  -106.91 -106.77 -106.66
6:30018517 C   A[0]C[38]T[0]G[0]   -103.13 -13.45  -103.13 -103.13 -3.64   -13.45  -13.45  -103.13 -103.13 -103.13
6:30018518 C   A[0]C[38]T[0]G[0]   -112.23 -12.93  -112.23 -112.23 -2.71   -12.93  -12.93  -112.23 -112.23 -112.23
...

Run HLACaller using outputs from previous steps to determine the most likely alleles at each locus:

java -jar $GATK -T HLACaller -I NA12878.HLA.bam -R $REF -L $HLA/HLA_EXONS.intervals -useInterval $HLA/HLA_EXONS.intervals \
    -bl NA12878.HLA.baselikelihoods -filter NA12878.HLA.filter -maxAllowedMismatches 6 -minRequiredMatches 5 \
    -HLAdictionary $HLA/HLA_DICTIONARY.txt -HLAfrequencies $HLA/HLA_FREQUENCIES.txt > NA12878.HLA.info

grep -v INFO NA12878.HLA.info > NA12878.HLA.calls

Locus A1   A2   Geno    Phase  Frq1  Frq2  L       Prob Reads1 Reads2 Locus EXP  White Black Asian
A     0101 1101 -1229.5 -15.2  -0.82 -0.73 -1244.7 1.00 180    191    229   1.62 -1.99 -3.13 -2.07
B     0801 5601 -832.3  -37.3  -1.01 -2.15 -872.1  1.00 58     59     100   1.17 -3.31 -4.10 -3.95
C     0102 0701 -1344.8 -37.5  -0.87 -0.86 -1384.2 1.00 91     139    228   1.01 -2.35 -2.95 -2.31
DPA1  0103 0201 -842.1  -1.8   -0.12 -0.79 -846.7  1.00 72     48     120   1.00 -0.90 -INF  -1.27
DPB1  0401 1401 -991.5  -18.4  -0.45 -1.55 -1010.7 1.00 64     48     113   0.99 -2.24 -3.14 -2.64
DQA1  0101 0501 -1077.5 -15.9  -0.90 -0.62 -1095.4 1.00 160    77     247   0.96 -1.53 -1.60 -1.87
DQB1  0201 0501 -709.6  -18.6  -0.77 -0.76 -729.7  0.95 50     87     137   1.00 -1.76 -1.54 -2.23
DRB1  0101 0301 -1513.8 -317.3 -1.06 -0.94 -1832.6 1.00 52     32     101   0.83 -1.99 -2.83 -2.34

Make a SAM/BAM file of the called alleles:

awk '{if (NR > 1){print $1 "*" $2 "\n" $1 "*" $3}}' NA12878.HLA.calls | sort -u > NA12878.HLA.calls.unique
cp $HLA/samheader NA12878.HLA.calls.sam
awk '{split($1,a,"*"); print "grep \"" a[1] "[*]" a[2] "\" '$HLA/HLA_DICTIONARY.sam' >> 'NA12878.HLA'.tmp";}' NA12878.HLA.calls.unique | sh
sort -k4 -n NA12878.HLA.tmp >> NA12878.HLA.calls.sam
rm NA12878.HLA.tmp
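If a BAM version of this file is needed downstream, the SAM produced above can be converted with samtools. This is a small sketch and not part of the original command listing:

# convert the SAM to BAM (the records were already position-sorted by the sort step above)
samtools view -bS NA12878.HLA.calls.sam > NA12878.HLA.calls.bam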


As in any algorithm, there are a few performance/accuracy tradeoffs in the HLA caller. The following are key considerations to keep in mind when using the software for HLA typing.

#### Robustness to sequencing/alignment artifact vs. Ability to recognize rare alleles

In polymorphic regions of the genome like the HLA, misaligned reads (the presence of erroneous reads or the absence of the proper sequences) and sequencing errors (indels, systematic PCR errors) may cause the HLA caller to call rare alleles with polymorphisms at the affected bases. The user can manually spot these errors when the algorithm calls a rare allele (the Frq1 and Frq2 columns in the output of HLACaller indicate log10 of the allele frequencies). Alternatively, the user can choose to consider only non-rare alleles (use the "-minFreq 0.001" option in HLACaller) to make the algorithm faster and more robust against sequencing or alignment errors. The drawback to this approach is that the algorithm may not be able to correctly identify rare alleles when they are truly present. We recommend using the -minFreq option for genome-wide sequencing datasets, but not for high-quality (targeted PCR 454) data specifically captured for HLA typing in large cohorts.
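For example, restricting the search to non-rare alleles only requires appending -minFreq to the HLACaller command shown in the previous section (a sketch; all other arguments are unchanged):

java -jar $GATK -T HLACaller -I NA12878.HLA.bam -R $REF -L $HLA/HLA_EXONS.intervals -useInterval $HLA/HLA_EXONS.intervals \
  -bl NA12878.HLA.baselikelihoods -filter NA12878.HLA.filter -maxAllowedMismatches 6 -minRequiredMatches 5 \
  -HLAdictionary $HLA/HLA_DICTIONARY.txt -HLAfrequencies $HLA/HLA_FREQUENCIES.txt \
  -minFreq 0.001 > NA12878.HLA.info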

#### Misalignment Detection and Data Pre-Processing

The FindClosestAllele walker (optional step) is recommended for two reasons:

1. It detects misalignments, i.e. reads that do not match well even to their closest-matching HLA allele; removing these misaligned reads improves calling accuracy.

2. Creating a list of closest-matching HLA alleles reduces the search space (over 3,000 HLA alleles across the class I and class II loci) that HLACaller has to iterate through, reducing the computational burden.

However, using this pre-processing step is not without costs:

1. Any cutoff chosen for %concordance, min base matches, or max base mismatches will not distinguish between correctly aligned and misaligned reads 100% of the time - there is a chance that correctly aligned reads may be removed, and misaligned reads not removed.

2. The list of closest-matching alleles may in some cases not contain the true allele if there is sufficient sequencing error, in which case the true allele will not be considered by the HLACaller walker. In our experience, the advantages of the FindClosestAllele pre-processing step greatly outweigh the disadvantages, so we recommend including it in the pipeline as long as the user understands the possible risks of using this function.

### Contributions

The HLA caller algorithm was developed by Xiaoming (Sherman) Jia with the generous support of the GATK team (especially Mark DePristo and Eric Banks) and Paul de Bakker.

• xiaomingjia at gmail dot com
• depristo at broadinstitute dot org
• ebanks at broadinstitute dot org
• pdebakker at rics dot bwh dot harvard dot edu

### Introduction

Genotype and Validate is a tool to assess the quality of a technology dataset for calling SNPs and indels, given a secondary (validation) data source.

The simplest scenario is when you have a VCF of hand-annotated SNPs and indels, and you want to know how well a particular technology performs at calling them. With a dataset (BAM file) generated by the technology under test and the hand-annotated VCF, you can run GenotypeAndValidate to assess the accuracy of the calls made from the new technology's dataset.

Another option is to validate the calls in a VCF file using a deep-coverage BAM file whose calls you trust. The GenotypeAndValidate walker will make calls using the reads in the BAM file and take them as truth, then compare them to the calls in the VCF file and produce a truth table.

### Command-line arguments

Usage of GenotypeAndValidate and its command line arguments are described here.

### The VCF Annotations

The annotations can be either true positive (T) or false positive (F). 'T' means the site is known to be a true SNP/indel, while 'F' means it is known not to be a SNP/indel even though the technology used to create the VCF calls it. To annotate the VCF, simply add an INFO field GV with the value T or F.

### The Outputs

GenotypeAndValidate has two outputs: the truth table and an optional VCF file. The truth table is a 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true positive or a false positive). The table should look like this:

|            | ALT                 | REF                 | Predictive Value |
|------------|---------------------|---------------------|------------------|
| called alt | True Positive (TP)  | False Positive (FP) | Positive PV      |
| called ref | False Negative (FN) | True Negative (TN)  | Negative PV      |

The positive predictive value (PPV) is the proportion of subjects with positive test results who are correctly diagnosed.

The negative predictive value (NPV) is the proportion of subjects with a negative test result who are correctly diagnosed.
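In terms of the counts in the table above, these are the usual ratios (stated here for convenience; they follow directly from the definitions):

PPV = TP / (TP + FP)
NPV = TN / (TN + FN)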

The optional VCF file will contain only the variants that were called or not called, excluding the ones that were uncovered or didn't pass the filters (-depth). This file is useful if you are trying to compare the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to apples).

• You should always use -BTI alleles, so that the GATK only looks at the sites in the VCF file; this speeds up the process a lot (it will soon be added as a default GATK engine mode).

• The total number of visited bases may be greater than the number of variants in the original VCF file because of extended indels, as they trigger one call per new insertion or deletion. (i.e. ACTG/- will count as 4 genotyper calls, but it's only one line in the VCF).

### Examples

Genotypes BAM file from new technology using the VCF as a truth dataset:

java \
-jar /GenomeAnalysisTK.jar \
-T  GenotypeAndValidate \
-R human_g1k_v37.fasta \
-alleles handAnnotatedVCF.vcf \
-BTI alleles \
-o gav.vcf


An annotated VCF example (info field clipped for clarity)

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
1   20568807    .   C   T   0    HapMapHet        AC=1;AF=0.50;AN=2;DP=0;GV=T  GT  0/1
1   22359922    .   T   C   282  WG-CG-HiSeq      AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ  1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99    ./.
13  102391461   .   G   A   341  Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ  ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99   ./.
1   175516757   .   C   G   655  SnpCluster,WG    AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ  ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99   ./.


Using a BAM file as the truth dataset:

java \
-jar /GenomeAnalysisTK.jar \
-T  GenotypeAndValidate \
-R human_g1k_v37.fasta \
-I myTruthDataset.bam \
-alleles callsToValidate.vcf \
-BTI alleles \
-bt \
-o gav.vcf


Example truth table of PacBio reads (BAM) used to validate a HiSeq-annotated dataset (VCF) with the GenotypeAndValidate walker (figure not reproduced here).

Note: As of version 4, BEAGLE reads and outputs VCF files directly, and can handle multiallelic sites. We have not yet evaluated what this means for the GATK-BEAGLE interface; it is possible that some of the information provided below is no longer applicable as a result.

### Introduction

BEAGLE is a state of the art software package for analysis of large-scale genetic data sets with hundreds of thousands of markers genotyped on thousands of samples. BEAGLE can

• phase genotype data (i.e. infer haplotypes) for unrelated individuals, parent-offspring pairs, and parent-offspring trios.
• infer sporadic missing genotype data.
• impute ungenotyped markers that have been genotyped in a reference panel.
• perform single marker and haplotypic association analysis.
• detect genetic regions that are homozygous-by-descent in an individual or identical-by-descent in pairs of individuals.

The GATK provides an experimental interface to BEAGLE. Currently, the only use cases supported by this interface are a) inferring missing genotype data from call sets (e.g. for lack of coverage in low-pass data), and b) genotype inference for unrelated individuals.

### Workflow

The basic workflow for this interface is as follows:

• After variants are called and possibly filtered, the GATK walker ProduceBeagleInput takes the resulting VCF as input and produces a likelihood file in BEAGLE format.
• The user then runs BEAGLE with this likelihood file specified as input.
• After BEAGLE runs, the user must unzip the resulting output files (.gprobs, .phased) containing posterior genotype probabilities and phased haplotypes.
• The user can then run the GATK walker BeagleOutputToVCF to produce a new VCF with updated data. The new VCF will contain updated genotypes as well as updated annotations.

### Producing Beagle input likelihoods file

Before running BEAGLE, we need to first take an input VCF file with genotype likelihoods and produce the BEAGLE likelihoods file using the ProduceBeagleInput walker, as described in detail in its documentation page.
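A minimal invocation sketch is shown below; the file names are placeholders and the -V/-o argument style is assumed to follow the other GATK commands in this document, so check the walker's documentation page for the exact arguments:

java -Xmx2g -jar /path/to/GenomeAnalysisTK.jar \
  -T ProduceBeagleInput \
  -R reffile.fasta \
  -V inputvcf.vcf \
  -o path_to_beagle_output/beagle_output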

For each variant in inputvcf.vcf, ProduceBeagleInput will extract the genotype likelihoods, convert from log to linear space, and produce a BEAGLE input file in Genotype likelihoods file format (see BEAGLE documentation for more details). Essentially, this file is a text file in tabular format, a snippet of which is pasted below:

marker    alleleA alleleB NA07056 NA07056 NA07056 NA11892 NA11892 NA11892
20:60251    T        C     10.00   1.26    0.00     9.77   2.45    0.00
20:60321    G        T     10.00   5.01    0.01    10.00   0.31    0.00
20:60467    G        C      9.55   2.40    0.00     9.55   1.20    0.00


Note that BEAGLE only supports biallelic sites. Markers can have an arbitrary label, but they need to be in chromosomal order. Sites that are not genotyped in the input VCF (i.e. which are annotated with a "./." string and have no Genotype Likelihood annotation) are assigned a likelihood value of (0.33, 0.33, 0.33).

IMPORTANT: Due to BEAGLE memory restrictions, it is strongly recommended that BEAGLE be run separately on a chromosome-by-chromosome basis. In the current use case, BEAGLE uses RAM in a manner approximately proportional to the number of input markers. After BEAGLE has been run and an output VCF produced as described below, CombineVariants can be used to combine the resulting VCFs, using the "-variantMergeOptions UNION" argument.

### Running Beagle

We currently support only a subset of BEAGLE functionality: only unphased, unrelated input likelihood data is supported. To run an imputation analysis, run, for example:

java -Xmx4000m -jar path_to_beagle/beagle.jar like=path_to_beagle_output/beagle_output out=myrun


Extra BEAGLE arguments can be added as required.

Empirically, BEAGLE can handle up to about 800,000 markers with 4 GB of RAM. Larger chromosomes require additional memory.
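Since the recommendation above is to process one chromosome at a time, a simple shell loop can drive the per-chromosome runs. This is only a sketch: the per-chromosome file naming (beagle_output.chr${chr}, myrun_chr${chr}) is an assumption rather than a BEAGLE convention, and it assumes you produced one likelihood file per chromosome:

# run BEAGLE separately for each autosome and chromosome X
for chr in $(seq 1 22) X; do
  java -Xmx4000m -jar path_to_beagle/beagle.jar \
    like=path_to_beagle_output/beagle_output.chr${chr} \
    out=myrun_chr${chr}
done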

### Processing BEAGLE output files

BEAGLE will produce several output files. The following shell commands unzip the output files in preparation for processing and put them all in the same place:

# unzip gzip'd files, force overwrite if existing
gunzip -f path_to_beagle_output/myrun.beagle_output.gprobs.gz
gunzip -f path_to_beagle_output/myrun.beagle_output.phased.gz
# also rename the Beagle likelihood file to maintain consistency
mv path_to_beagle_output/beagle_output path_to_beagle_output/myrun.beagle_output.like


### Creating a new VCF from BEAGLE data with BeagleOutputToVCF

Once BEAGLE files are produced, we can update our original VCF with BEAGLE's data. This is done using the BeagleOutputToVCF tool.

The walker looks for the files specified with the -B(type,BEAGLE,file) triplets as above for the output posterior genotype probabilities, the output r^2 values and the output phased genotypes. The order in which these are given in the command line is arbitrary, but all three must be present for correct operation.
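A hedged sketch of such a command line is given below. Only the -B(name,BEAGLE,file) triplet pattern and the requirement that all three BEAGLE files be present come from the text above; the track names, file names, and the -V/-o bindings are placeholders and should be checked against the walker documentation:

java -Xmx4g -jar /path/to/GenomeAnalysisTK.jar \
  -T BeagleOutputToVCF \
  -R reffile.fasta \
  -V inputvcf.vcf \
  -B:beagleProbs,BEAGLE path_to_beagle_output/myrun.beagle_output.gprobs \
  -B:beagleR2,BEAGLE path_to_beagle_output/myrun.beagle_output.r2 \
  -B:beaglePhased,BEAGLE path_to_beagle_output/myrun.beagle_output.phased \
  -o updated_output.vcf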

The output VCF has the new genotypes that Beagle produced, and several annotations are also updated. By default, the walker will update the per-genotype annotations GQ (Genotype Quality), the genotypes themselves, as well as the per-site annotations AF (Allele Frequency), AC (Allele Count) and AN (Allele Number).

The resulting VCF can now be used for further downstream analysis.

### Merging VCFs broken up by chromosome into a single genome-wide file

Assuming you have run Beagle on your calls chromosome by chromosome (as recommended above), you can use the CombineVariants tool to merge the resulting VCFs into a single callset.

java -jar /path/to/dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R reffile.fasta \
--out genome_wide_output.vcf \
-V:input1 beagle_output_chr1.vcf \
-V:input2 beagle_output_chr2.vcf \
...
-V:inputX beagle_output_chrX.vcf \
-type UNION -priority input1,input2,...,inputX


### Please note that the DataProcessingPipeline qscript is no longer available.

The DPP script was only provided as an example, but many people were using it "out of the box" without properly understanding how it works. In order to protect users from mishandling this tool, and to decrease our support burden, we have made the difficult decision to remove the script from our public repository. If you would like to put together your own version of the DPP, please have a look at our other example scripts to understand how QScripts work, and read the Best Practices documentation to understand what the processing steps are and which parameters you need to set/adjust.

## Data Processing Pipeline

The Data Processing Pipeline is a Queue script designed to take BAM files from the NGS machines to analysis ready BAMs for the GATK.

### Introduction

Reads come off the sequencers in a raw state that is not suitable for analysis using the GATK. In order to prepare the dataset, one must perform the steps described here. This pipeline performs the following steps: indel cleaning, duplicate marking, and base quality score recalibration, following the GSA's latest definition of best practices. The product of this pipeline is a set of analysis-ready BAM files (one per sample sequenced).

### Requirements

This pipeline is a Queue script that uses tools from the GATK, Picard and BWA (optional) software suites which are all freely available through their respective websites. Queue is a GATK companion that is included in the GATK package.

Warning: This pipeline was designed specifically to handle the Broad Institute's main sequencing pipeline with Illumina BAM files and BWA alignment. The GSA cannot support its use for other types of datasets. It is possible however, with some effort, to modify it for your needs.

### Command-line arguments

#### Required Parameters

| Argument (short-name) | Argument (long-name) | Description |
|---|---|---|
| -i <BAM file / BAM list> | --input <BAM file / BAM list> | Input BAM file, or list of BAM files. |
| -R <fasta> | --reference <fasta> | Reference fasta file. |
| -D <vcf> | --dbsnp <dbsnp vcf> | dbSNP ROD to use (must be in VCF format). |

#### Optional Parameters

| Argument (short-name) | Argument (long-name) | Description |
|---|---|---|
| -indels <vcf> | --extra_indels <vcf> | VCF files to use as reference indels for indel realignment. |
| -bwa <path> | --path_to_bwa <path> | Path to the bwa binary (usually BAM files have already been mapped, but use this if you want to remap). |
| -outputDir <path> | --output_directory <path> | Output path for the processed BAM files. |
| -L <GATK interval string> | --gatk_interval_string <GATK interval string> | The -L interval string to be used by the GATK; output BAMs at this interval only. |
| -intervals <GATK interval file> | --gatk_interval_file <GATK interval file> | An intervals file to be used by the GATK; output BAMs at these intervals only. |

#### Modes of Operation (also optional parameters)

| Argument (short-name) | Argument (long-name) | Description |
|---|---|---|
| -p <name> | --project <name> | The project name determines the final output (BAM file) base name. Example: NA12878 yields NA12878.processed.bam. |
| -knowns | --knowns_only | Perform cleaning on known indels only. |
| -sw | --use_smith_waterman | Perform cleaning using Smith-Waterman. |
| -bwase | --use_bwa_single_ended | Decompose the input BAM file and fully realign it using BWA, assuming single-ended reads. |
| -bwape | --use_bwa_pair_ended | Decompose the input BAM file and fully realign it using BWA, assuming pair-ended reads. |

## The Pipeline

The data processing pipeline follows the Best Practices for raw data processing, from sequencer data (FASTQ files) to analysis-ready reads (BAM file).

Following the group's Best Practices definition, the data processing pipeline does all the processing at the sample level. There are two high-level parts of the pipeline:

### BWA alignment

This option is for datasets that have already been processed using a different pipeline or different criteria, and that you want to reprocess using this pipeline. One example is a BAM file that has been processed at the lane level, or that did not go through some of the Best Practices steps of the current pipeline. By using the optional BWA stage of the processing pipeline, your BAM file will be realigned from scratch before sample-level BAMs are created and the rest of the pipeline is run.

### Sample Level Processing

This is where the pipeline applies its main procedures: Indel Realignment and Base Quality Score Recalibration.

#### Indel Realignment

This is a two-step process. First we create targets using the RealignerTargetCreator (either for known indels only, or also including indels seen in the data), then we realign the targets using the IndelRealigner (see [Local realignment around indels]), with optional Smith-Waterman realignment. The IndelRealigner also fixes mate-pair information for reads that get realigned.
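Outside of Queue, the two steps correspond roughly to the following GATK calls. This is only a sketch: the pipeline drives these walkers internally, the file names are placeholders, and the known-indel inputs (configured via the --dbsnp / --extra_indels arguments above) are omitted here:

# Step 1: identify intervals that need realignment
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
  -R reference.fasta -I sample.bam \
  -o sample.realign.intervals

# Step 2: realign reads over those intervals (mate-pair information is fixed in this step)
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
  -R reference.fasta -I sample.bam \
  -targetIntervals sample.realign.intervals \
  -o sample.realigned.bam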

#### Base Quality Score Recalibration

This is a crucial step that re-adjusts the quality scores using statistics based on several different covariates. In this pipeline we utilize four: Read Group Covariate, Quality Score Covariate, Cycle Covariate, and Dinucleotide Covariate.
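The four covariates listed above map onto the two-pass CountCovariates / TableRecalibration recalibration, with each covariate requested explicitly. The sketch below assumes the standard argument names for those walkers (check the walker documentation) and omits the known-sites input needed to mask true variation; file names are placeholders:

# Pass 1: gather empirical mismatch statistics for the four covariates
java -jar GenomeAnalysisTK.jar -T CountCovariates \
  -R reference.fasta -I sample.realigned.bam \
  -cov ReadGroupCovariate -cov QualityScoreCovariate \
  -cov CycleCovariate -cov DinucCovariate \
  -recalFile sample.recal_data.csv

# Pass 2: rewrite the base qualities using the collected table
java -jar GenomeAnalysisTK.jar -T TableRecalibration \
  -R reference.fasta -I sample.realigned.bam \
  -recalFile sample.recal_data.csv \
  -o sample.recalibrated.bam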

### The Outputs

The Data Processing Pipeline produces three types of output for each file: a fully processed BAM file, a validation report on the input and output BAM files, and an analysis of base qualities before and after recalibration. If you look at the pipeline flowchart, the grey boxes indicate processes that generate an output.

#### Processed Bam File

The final product of the pipeline is one BAM file per sample in the dataset. It also provides one BAM list with all the bams in the dataset. This file is named <project name>.cohort.list, and each sample bam file has the name <project name>.<sample name>.bam. The sample names are extracted from the input BAM headers, and the project name is provided as a parameter to the pipeline.

#### Validation Files

We validate each unprocessed sample-level BAM file and each final processed sample-level BAM file. The validation is performed using Picard's ValidateSamFile. Because the parameters of this validation are very strict, we don't enforce that the input BAM has to pass all validation, but we provide the log of the validation as an informative companion to your input. The validation files are named <project name>.<sample name>.pre.validation and <project name>.<sample name>.post.validation.

Notice that even if your BAM file fails validation, the pipeline can still go through successfully. The validation is a strict report on the state of your BAM file. Some errors are not critical, and the output files (both pre.validation and post.validation) should give you some guidance on how to better organize your dataset in the BAM format.

#### Base Quality Score Recalibration Analysis

PDF plots of the base qualities are generated before and after recalibration for further analysis of the impact of recalibrating the base quality scores in each sample file. These graphs are explained in detail here. The plots are created in directories named <project name>.<sample name>.pre and <project name>.<sample name>.post.

### Examples

1. Example script that runs the data processing pipeline with its standard parameters and uses LSF for scatter/gathering (without bwa)

java \
  -Xmx4g \
  -Djava.io.tmpdir=/path/to/tmpdir \
  -jar path/to/GATK/Queue.jar \
  -S path/to/DataProcessingPipeline.scala \
  -p myFancyProjectName \
  -i myDataSet.list \
  -R reference.fasta \
  -D dbSNP.vcf \
  -run

2. Performing realignment and the full data processing pipeline on one pair-ended BAM file

java \
  -Xmx4g \
  -Djava.io.tmpdir=/path/to/tmpdir \
  -jar path/to/Queue.jar \
  -S path/to/DataProcessingPipeline.scala \
  -bwa path/to/bwa \
  -i test.bam \
  -R reference.fasta \
  -D dbSNP.vcf \
  -p myProjectWithRealignment \
  -bwape \
  -run
