Test that the GATK is correctly installed, and that the supporting tools like Java are in your path.
The command we're going to run is a very simple command that asks the GATK to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your GATK package is installed correctly.
Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.
Type the following command:
java -jar <path to GenomeAnalysisTK.jar> --help
replacing the <path to GenomeAnalysisTK.jar> bit with the path you have set up in your command-line environment.
You should see usage output similar to the following:
usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-I <input_file>] [-L
<intervals>] [-R <reference_sequence>] [-B <rodBind>] [-D <DBSNP>] [-H
<hapmap>] [-hc <hapmap_chip>] [-o <out>] [-e <err>] [-oe <outerr>] [-A] [-M
<maximum_reads>] [-sort <sort_on_the_fly>] [-compress <bam_compression>] [-fmq0] [-dfrac
<downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-S
<validation_strictness>] [-U] [-P] [-dt] [-tblw] [-nt <numthreads>] [-l
<logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h]
-T,--analysis_type <analysis_type> Type of analysis to run
-I,--input_file <input_file> SAM or BAM file(s)
-L,--intervals <intervals> A list of genomic intervals over which
to operate. Can be explicitly specified
on the command line or in a file.
-R,--reference_sequence <reference_sequence> Reference sequence file
-B,--rodBind <rodBind> Bindings for reference-ordered data, in
the form <name>,<type>,<file>
-D,--DBSNP <DBSNP> DBSNP file
-H,--hapmap <hapmap> Hapmap file
-hc,--hapmap_chip <hapmap_chip> Hapmap chip file
-o,--out <out> An output file presented to the walker.
Will overwrite contents if file exists.
-e,--err <err> An error output file presented to the
walker. Will overwrite contents if file
exists.
-oe,--outerr <outerr> A joint file for 'normal' and error
output presented to the walker. Will
overwrite contents if file exists.
...
If you see this message, your GATK installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.
Let's try to figure out what's not working.
First, make sure that your Java version is at least 1.6, by typing the following command:
java -version
You should see something similar to the following text:
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
If the version is less then 1.6, install the newest version of Java onto the system. If you instead see something like
java: Command not found
make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.
On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.
Test that Queue is correctly installed, and that the supporting tools like Java are in your path.
The command we're going to run is a very simple command that asks Queue to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your Queue package is installed correctly.
Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.
Type the following command:
java -jar <path to Queue.jar> --help
replacing the <path to Queue.jar> bit with the path you have set up in your command-line environment.
You should see usage output similar to the following:
usage: java -jar Queue.jar -S <script> [-jobPrefix <job_name_prefix>] [-jobQueue <job_queue>] [-jobProject <job_project>]
[-jobSGDir <job_scatter_gather_directory>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
<temp_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>] [-emailTLS] [-emailSSL] [-emailUser
<emailUsername>] [-emailPass <emailPassword>] [-emailPassFile <emailPasswordFile>] [-bsub] [-run] [-dot <dot_graph>]
[-expandedDot <expanded_dot_graph>] [-startFromScratch] [-status] [-statusFrom <status_email_from>] [-statusTo
<status_email_to>] [-keepIntermediates] [-retry <retry_failed>] [-l <logging_level>] [-log <log_to_file>] [-quiet]
[-debug] [-h]
-S,--script <script> QScript scala file
-jobPrefix,--job_name_prefix <job_name_prefix> Default name prefix for compute farm jobs.
-jobQueue,--job_queue <job_queue> Default queue for compute farm jobs.
-jobProject,--job_project <job_project> Default project for compute farm jobs.
-jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory> Default directory to place scatter gather
output for compute farm jobs.
-memLimit,--default_memory_limit <default_memory_limit> Default memory limit for jobs, in gigabytes.
-runDir,--run_directory <run_directory> Root directory to run functions from.
-tempDir,--temp_directory <temp_directory> Temp directory to pass to functions.
-emailHost,--emailSmtpHost <emailSmtpHost> Email SMTP host. Defaults to localhost.
-emailPort,--emailSmtpPort <emailSmtpPort> Email SMTP port. Defaults to 465 for ssl,
otherwise 25.
-emailTLS,--emailUseTLS Email should use TLS. Defaults to false.
-emailSSL,--emailUseSSL Email should use SSL. Defaults to false.
-emailUser,--emailUsername <emailUsername> Email SMTP username. Defaults to none.
-emailPass,--emailPassword <emailPassword> Email SMTP password. Defaults to none. Not
secure! See emailPassFile.
-emailPassFile,--emailPasswordFile <emailPasswordFile> Email SMTP password file. Defaults to none.
-bsub,--bsub_all_jobs Use bsub to submit jobs
-run,--run_scripts Run QScripts. Without this flag set only
performs a dry run.
-dot,--dot_graph <dot_graph> Outputs the queue graph to a .dot file. See:
http://en.wikipedia.org/wiki/DOT_language
-expandedDot,--expanded_dot_graph <expanded_dot_graph> Outputs the queue graph of scatter gather to
a .dot file. Otherwise overwrites the
dot_graph
-startFromScratch,--start_from_scratch Runs all command line functions even if the
outputs were previously output successfully.
-status,--status Get status of jobs for the qscript
-statusFrom,--status_email_from <status_email_from> Email address to send emails from upon
completion or on error.
-statusTo,--status_email_to <status_email_to> Email address to send emails to upon
completion or on error.
-keepIntermediates,--keep_intermediate_outputs After a successful run keep the outputs of
any Function marked as intermediate.
-retry,--retry_failed <retry_failed> Retry the specified number of times after a
command fails. Defaults to no retries.
-l,--logging_level <logging_level> Set the minimum level of logging, i.e.
setting INFO get's you INFO up to FATAL,
setting ERROR gets you ERROR and FATAL level
logging.
-log,--log_to_file <log_to_file> Set the logging location
-quiet,--quiet_output_mode Set the logging to quiet mode, no output to
stdout
-debug,--debug_mode Set the logging file string to include a lot
of debugging information (SLOW!)
-h,--help Generate this help message
If you see this message, your Queue installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.
Let's try to figure out what's not working.
First, make sure that your Java version is at least 1.6, by typing the following command:
java -version
You should see something similar to the following text:
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
If the version is less then 1.6, install the newest version of Java onto the system. If you instead see something like
java: Command not found
make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.
On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.
DiffEngine is a summarizing difference engine that allows you to compare two structured files -- such as BAMs and VCFs -- to find what are the differences between them. This is primarily useful in regression testing or optimization, where you want to ensure that the differences are those that you expect and not any others.
The GATK contains a summarizing difference engine called DiffEngine that compares hierarchical data structures to emit:
A list of specific differences between the two data structures. This is similar to saying the value in field A in record 1 in file F differs from the value in field A in record 1 in file G.
A summarized list of differences ordered by frequency of the difference. This output is similar to saying field A differed in 50 records between files F and G.
The GATK contains a private walker called DiffObjects that allows you access to the DiffEngine capabilities on the command line. Simply provide the walker with the master and test files and it will emit summarized differences for you.
The DiffEngine system compares to two hierarchical data structures for specific differences in the values of named nodes. Suppose I have two trees:
Tree1=(A=1 B=(C=2 D=3))
Tree2=(A=1 B=(C=3 D=3 E=4))
Tree3=(A=1 B=(C=4 D=3 E=4))
where every node in the tree is named, or is a raw value (here all leaf values are integers). The DiffEngine traverses these data structures by name, identifies equivalent nodes by fully qualified names (Tree1.A is distinct from Tree2.A, and determines where their values are equal (Tree1.A=1, Tree2.A=1, so they are).
These itemized differences are listed as:
Tree1.B.C=2 != Tree2.B.C=3
Tree1.B.C=2 != Tree3.B.C=4
Tree2.B.C=3 != Tree3.B.C=4
Tree1.B.E=MISSING != Tree2.B.E=4
This conceptually very similar to the output of the unix command line tool diff. What's nice about DiffEngine though is that it computes similarity among the itemized differences and displays the count of differences names in the system. In the above example, the field C is not equal three times, while the missing E in Tree1 occurs only once. So the summary is:
*.B.C : 3
*.B.E : 1
where the * operator indicates that any named field matches. This output is sorted by counts, and provides an immediate picture of the commonly occurring differences between the files.
Below is a detailed example of two VCF fields that differ because of a bug in the AC, AF, and AN counting routines, detected by the integrationtest integration (more below). You can see that in the although there are many specific instances of these differences between the two files, the summarized differences provide an immediate picture that the AC, AF, and AN fields are the major causes of the differences.
[testng] path count
[testng] *.*.*.AC 6
[testng] *.*.*.AF 6
[testng] *.*.*.AN 6
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000000.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000117.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AC 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AF 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000211.AN 1
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000598.AC 1
The DiffEngine codebase that supports these calculations is integrated into the integrationtest framework, so that when a test fails the system automatically summarizes the differences between the master MD5 file and the failing MD5 file, if it is an understood type. When failing you will see in the integration test logs not only the basic information, but the detailed DiffEngine output.
For example, in the output below I broke the GATK BAQ calculation and the integration test DiffEngine clearly identifies that all of the records differ in their BQ tag value in the two BAM files:
/humgen/1kg/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam -o /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp -L 1:10,000,000-10,100,000 -baq RECALCULATE -et NO_ET
[testng] WARN 22:59:22,875 TextFormattingUtils - Unable to load help text. Help output will be sparse.
[testng] WARN 22:59:22,875 TextFormattingUtils - Unable to load help text. Help output will be sparse.
[testng] ##### MD5 file is up to date: integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] Checking MD5 for /var/folders/Us/UsMJ3xRrFVyuDXWkUos1xkC43FQ/-Tmp-/walktest.tmp_param.05785205687740257584.tmp [calculated=e5147656858fc4a5f470177b94b1fc1b, expected=4ac691bde1ba1301a59857694fda6ae2]
[testng] ##### Test testPrintReadsRecalBAQ is going fail #####
[testng] ##### Path to expected file (MD5=4ac691bde1ba1301a59857694fda6ae2): integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest
[testng] ##### Path to calculated file (MD5=e5147656858fc4a5f470177b94b1fc1b): integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] ##### Diff command: diff integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest
[testng] ##:GATKReport.v0.1 diffences : Summarized differences between the master and test files.
[testng] See http://www.broadinstitute.org/gsa/wiki/index.php/DiffObjectsWalker_and_SummarizedDifferences for more information
[testng] Difference NumberOfOccurrences
[testng] *.*.*.BQ 895
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:2:266:272:361.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:245:474:254.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:5:255:178:160.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:158:682:495.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:6:195:591:884.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:165:236:848.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:191:223:910.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAE_0002_FC205W7AAXX:7:286:279:434.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:2:106:516:354.BQ 1
[testng] 4ac691bde1ba1301a59857694fda6ae2.integrationtest.-XAF_0002_FC205Y7AAXX:3:102:580:518.BQ 1
[testng]
[testng] Note that the above list is not comprehensive. At most 20 lines of output, and 10 specific differences will be listed. Please use -T DiffObjects -R public/testdata/exampleFASTA.fasta -m integrationtests/4ac691bde1ba1301a59857694fda6ae2.integrationtest -t integrationtests/e5147656858fc4a5f470177b94b1fc1b.integrationtest to explore the differences more freely
The system dynamically finds all classes that implement the following simple interface:
public interface DiffableReader {
@Ensures("result != null")
/**
* Return the name of this DiffableReader type. For example, the VCF reader returns 'VCF' and the
* bam reader 'BAM'
*/
public String getName();
@Ensures("result != null")
@Requires("file != null")
/**
* Read up to maxElementsToRead DiffElements from file, and return them.
*/
public DiffElement readFromFile(File file, int maxElementsToRead);
/**
* Return true if the file can be read into DiffElement objects with this reader. This should
* be uniquely true/false for all readers, as the system will use the first reader that can read the
* file. This routine should never throw an exception. The VCF reader, for example, looks at the
* first line of the file for the ##format=VCF4.1 header, and the BAM reader for the BAM_MAGIC value
* @param file
* @return
*/
@Requires("file != null")
public boolean canRead(File file);
See the VCF and BAMDiffableReaders for example implementations. If you extend this to a new object types both the DiffObjects walker and the integrationtest framework will automatically work with your new file type.
Most GATK walkers are really too complex to easily test using the standard unit test framework. It's just not feasible to make artificial read piles and then extrapolate from simple tests passing whether the system as a whole is working correctly. However, we need some way to determine whether changes to the core of the GATK are altering the expected output of complex walkers like BaseRecalibrator or SingleSampleGenotyper. In additional to correctness, we want to make sure that the performance of key walkers isn't degrading over time, so that calling snps, cleaning indels, etc., isn't slowly creeping down over time. Since we are now using a bamboo server to automatically build and run unit tests (as well as measure their runtimes) we want to put as many good walker tests into the test framework so we capture performance metrics over time.
To make this testing process easier, we've created a WalkerTest framework that lets you invoke the GATK using command-line GATK commands in the JUnit system and test for changes in your output files by comparing the current ant build results to previous run via an MD5 sum. It's a bit coarse grain, but it will work to ensure that changes to key walkers are detected quickly by the system, and authors can either update the expected MD5s or go track down bugs.
The system is fairly straightforward to use. Ultimately we will end up with JUnit style tests in the unit testing structure. In the piece of code below, we have a piece of code that checks the MD5 of the SingleSampleGenotyper's GELI text output at LOD 3 and LOD 10.
package org.broadinstitute.sting.gatk.walkers.genotyper;
import org.broadinstitute.sting.WalkerTest;
import org.junit.Test;
import java.util.HashMap;
import java.util.Map;
import java.util.Arrays;
public class SingleSampleGenotyperTest extends WalkerTest {
@Test
public void testLOD() {
HashMap<Double, String> e = new HashMap<Double, String>();
e.put( 10.0, "e4c51dca6f1fa999f4399b7412829534" );
e.put( 3.0, "d804c24d49669235e3660e92e664ba1a" );
for ( Map.Entry<Double, String> entry : e.entrySet() ) {
WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
"-T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout %s --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod " + entry.getKey(), 1,
Arrays.asList(entry.getValue()));
executeTest("testLOD", spec);
}
}
}
The fundamental piece here is to inherit from WalkerTest. This gives you access to the executeTest() function that consumes a WalkerTestSpec:
public WalkerTestSpec(String args, int nOutputFiles, List<String> md5s)
The WalkerTestSpec takes regular, command-line style GATK arguments describing what you want to run, the number of output files the walker will generate, and your expected MD5s for each of these output files. The args string can contain %s String.format specifications, and for each of the nOutputFiles, the executeTest() function will (1) generate a tmp file for output and (2) call String.format on your args to fill in the tmp output files in your arguments string. For example, in the above argument string varout is followed by %s, so our single SingleSampleGenotyper output is the variant output file.
When you add a walkerTest inherited unit test to the GATK, and then build test, you'll see output that looks like:
[junit] WARN 13:29:50,068 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:29:50,068 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit]
[junit] WARN 13:29:50,069 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05524470250256847817.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 3.0
[junit]
[junit] WARN 13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a]
[junit] WARN 13:30:39,407 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.05524470250256847817.tmp [calculated=d804c24d49669235e3660e92e664ba1a, expected=d804c24d49669235e3660e92e664ba1a]
[junit] WARN 13:30:39,408 WalkerTest - => testLOD PASSED
[junit] WARN 13:30:39,408 WalkerTest - => testLOD PASSED
[junit] WARN 13:30:39,409 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:30:39,409 WalkerTest - --------------------------------------------------------------------------------
[junit] WARN 13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit]
[junit] WARN 13:30:39,409 WalkerTest - Executing test testLOD with GATK arguments: -T SingleSampleGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.03852477489430798188.tmp --variant_output_format GELI -L 1:10,000,000-11,000,000 -m EMPIRICAL -lod 10.0
[junit]
[junit] WARN 13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534]
[junit] WARN 13:31:30,213 WalkerTest - Checking MD5 for /tmp/walktest.tmp_param.03852477489430798188.tmp [calculated=e4c51dca6f1fa999f4399b7412829534, expected=e4c51dca6f1fa999f4399b7412829534]
[junit] WARN 13:31:30,213 WalkerTest - => testLOD PASSED
[junit] WARN 13:31:30,213 WalkerTest - => testLOD PASSED
[junit] WARN 13:31:30,214 SingleSampleGenotyperTest -
[junit] WARN 13:31:30,214 SingleSampleGenotyperTest -
We keep all of the permenant GATK testing data in:
/humgen/gsa-scr1/GATK_Data/Validation_Data/
A good set of data to use for walker testing is the CEU daughter data from 1000 Genomes:
gsa2 ~/dev/GenomeAnalysisTK/trunk > ls -ltr /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.bam /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_1*.calls
-rw-rw-r--+ 1 depristo wga 51M 2009-09-03 07:56 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam
-rw-rw-r--+ 1 depristo wga 185K 2009-09-04 13:21 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.variants.geli.calls
-rw-rw-r--+ 1 depristo wga 164M 2009-09-04 13:22 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.lod5.genotypes.geli.calls
-rw-rw-r--+ 1 depristo wga 24M 2009-09-04 15:00 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SOLID.bam
-rw-rw-r--+ 1 depristo wga 12M 2009-09-04 15:01 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.454.bam
-rw-r--r--+ 1 depristo wga 91M 2009-09-04 15:02 /humgen/gsa-scr1/GATK_Data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.allTechs.bam
The tests depend on a variety of input files, that are generally constrained to three mount points on the internal Broad network:
*/seq/
*/humgen/1kg/
*/humgen/gsa-hpprojects/GATK/Data/Validation_Data/
To run the unit and integration tests you'll have to have access to these files. They may have different mount points on your machine (say, if you're running remotely over the VPN and have mounted the directories on your own machine).
Every file that generates an MD5 sum as part of the WalkerTest framework will be copied to <MD5>. integrationtest in the integrationtests subdirectory of the GATK trunk. This MD5 database of results enables you to easily examine the results of an integration test as well as compare the results of a test before/after a code change. For example, below is an example test for the UnifiedGenotyper that, due to a code change, where the output VCF differs from the VCF with the expected MD5 value in the test code itself. The test provides provides the path to the two results files as well as a diff command to compare expected to the observed MD5:
[junit] --------------------------------------------------------------------------------
[junit] Executing test testParameter[-genotype] with GATK arguments: -T UnifiedGenotyper -R /broad/1KG/reference/human_b36_both.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878.1kg.p2.chr1_10mb_11_mb.SLX.bam -varout /tmp/walktest.tmp_param.05997727998894311741.tmp -L 1:10,000,000-10,010,000 -genotype
[junit] ##### MD5 file is up to date: integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest
[junit] Checking MD5 for /tmp/walktest.tmp_param.05997727998894311741.tmp [calculated=ab20d4953b13c3fc3060d12c7c6fe29d, expected=0ac7ab893a3f550cb1b8c34f28baedf6]
[junit] ##### Test testParameter[-genotype] is going fail #####
[junit] ##### Path to expected file (MD5=0ac7ab893a3f550cb1b8c34f28baedf6): integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest
[junit] ##### Path to calculated file (MD5=ab20d4953b13c3fc3060d12c7c6fe29d): integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest
[junit] ##### Diff command: diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest
Examining the diff we see a few lines that have changed the DP count in the new code
> diff integrationtests/0ac7ab893a3f550cb1b8c34f28baedf6.integrationtest integrationtests/ab20d4953b13c3fc3060d12c7c6fe29d.integrationtest | head
385,387c385,387
< 1 10000345 . A . 106.54 . AN=2;DP=33;Dels=0.00;MQ=89.17;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:25:-0.09,-7.57,-75.74:74.78
< 1 10000346 . A . 103.75 . AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:24:-0.07,-7.27,-76.00:71.99
< 1 10000347 . A . 109.79 . AN=2;DP=31;Dels=0.00;MQ=88.85;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:26:-0.05,-7.85,-84.74:78.04
---
> 1 10000345 . A . 106.54 . AN=2;DP=32;Dels=0.00;MQ=89.50;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:25:-0.09,-7.57,-75.74:74.78
> 1 10000346 . A . 103.75 . AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:24:-0.07,-7.27,-76.00:71.99
> 1 10000347 . A . 109.79 . AN=2;DP=30;Dels=0.00;MQ=89.18;MQ0=0;SB=-10.00 GT:DP:GL:GQ 0/0:26:-0.05,-7.85,-84.74:78
Whether this is the expected change is up to you to decide, but the system makes it as easy as possible to see the consequences of your code change.
The walker test framework supports an additional syntax for ensuring that a particular java Exception is thrown when a walker executes using a simple alternate version of the WalkerSpec object. Rather than specifying the MD5 of the result, you can provide a single subclass of Exception.class and the testing framework will ensure that when the walker runs an instance (class or subclass) of your expected exception is thrown. The system also flags if no exception is thrown.
For example, the following code tests that the GATK can detect and error out when incompatible VCF and FASTA files are given:
@Test public void fail8() { executeTest("hg18lex-v-b36", test(lexHG18, callsB36)); }
private WalkerTest.WalkerTestSpec test(String ref, String vcf) {
return new WalkerTest.WalkerTestSpec("-T VariantsToTable -M 10 -B:two,vcf "
+ vcf + " -F POS,CHROM -R "
+ ref + " -o %s",
1, UserException.IncompatibleSequenceDictionaries.class);
}
During the integration test this looks like:
[junit] Executing test hg18lex-v-b36 with GATK arguments: -T VariantsToTable -M 10 -B:two,vcf /humgen/gsa-hpprojects/GATK/data/Validation_Data/lowpass.N3.chr1.raw.vcf -F POS,CHROM -R /humgen/gsa-hpprojects/GATK/data/Validation_Data/lexFasta/lex.hg18.fasta -o /tmp/walktest.tmp_param.05541601616101756852.tmp -l WARN -et NO_ET
[junit] [junit] Wanted exception class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries, saw class org.broadinstitute.sting.utils.exceptions.UserException$IncompatibleSequenceDictionaries
[junit] => hg18lex-v-b36 PASSED
Please do not put any extremely long tests in the regular ant build test target. We are currently splitting the system into fast and slow tests so that unit tests can be run in \< 3 minutes while saving a test target for long-running regression tests. More information on that will be posted.
An expected MG5 string of "" means don't check for equality between the calculated and expected MD5s. Useful if you are just writing a new test and don't know the true output.
Overload parameterize() { return true; } if you want the system to just run your calculations, not throw an error if your MD5s don't match, across all tests
If your tests all of a sudden stop giving equality MD5s, you can just (1) look at the .tmp output files directly or (2) grab the printed GATK command-line options and explore what is happening.
You can always run a GATK walker on the command line and then run md5sum on its output files to obtain, outside of the testing framework, the MD5 expected results.
Don't worry about the duplication of lines in the output ; it's just an annoyance of having two global loggers. Eventually we'll bug fix this away.
I have two licence questions. First of:
There are two licence notes in BaseTest.java. I'm assuming that the top one is the one that should be viewed as currently in use. Is this correct?
Secondly, under which licence are the test resources made available. Am I'm free to use and redistribute these? I'm asking since I'm in the process of refactoring my pipeline project to use GATK as a external dependency, but some of the tests I've written use the files provided with the GATK as a resource, and I'd like to keep them and distribute them with the project as I've done before in my GATK fork.
In addition to testing walkers individually, you may want to also run integration tests for your QScript pipelines.
ant target pipelinetest and run under pipelinetestrun.PipelineTest to run under the ant target.PipelineTestSpec and then run it via PipelineTest.exec().When building up a pipeline test spec specify the following variables for your test.
| Variable | Type | Description |
|---|---|---|
args |
String |
The arguments to pass to the Queue test, ex: -S scala/qscript/examples/HelloWorld.scala |
jobQueue |
String |
Job Queue to run the test. Default is null which means use hour. |
fileMD5s |
Map[Path, MD5] |
Expected MD5 results for each file path. |
expectedException |
classOf[Exception] |
Expected exception from the test. |
The following example runs the ExampleCountLoci QScript on a small bam and verifies that the MD5 result is as expected.
It is checked into the Sting repository under scala/test/org/broadinstitute/sting/queue/pipeline/examples/ExampleCountLociPipelineTest.scala
package org.broadinstitute.sting.queue.pipeline.examples
import org.testng.annotations.Test
import org.broadinstitute.sting.queue.pipeline.{PipelineTest, PipelineTestSpec}
import org.broadinstitute.sting.BaseTest
class ExampleCountLociPipelineTest {
@Test
def testCountLoci {
val testOut = "count.out"
val spec = new PipelineTestSpec
spec.name = "countloci"
spec.args = Array(
" -S scala/qscript/examples/ExampleCountLoci.scala",
" -R " + BaseTest.hg18Reference,
" -I " + BaseTest.validationDataLocation + "small_bam_for_countloci.bam",
" -o " + testOut).mkString
spec.fileMD5s += testOut -> "67823e4722495eb10a5e4c42c267b3a6"
PipelineTest.executeTest(spec)
}
}
To test if the script is at least compiling with your arguments run ant pipelinetest specifying the name of your class to -Dsingle:
ant pipelinetest -Dsingle=ExampleCountLociPipelineTest
Sample output:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour
[testng] => countloci PASSED DRY RUN
[testng] PASSED: testCountLoci
As of July 2011 the pipeline tests run against LSF 7.0.6 and Grid Engine 6.2u5. To include these two packages in your environment use the hidden dotkit .combined_LSF_SGE.
reuse .combined_LSF_SGE
Once you are satisfied that the dry run has completed without error, to actually run the pipeline test run ant pipelinetestrun.
ant pipelinetestrun -Dsingle=ExampleCountLociPipelineTest
Sample output:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] Checking MD5 for pipelinetests/countloci/run/count.out [calculated=67823e4722495eb10a5e4c42c267b3a6, expected=67823e4722495eb10a5e4c42c267b3a6]
[testng] => countloci PASSED
[testng] PASSED: testCountLoci
If you don't know the MD5s yet you can run the command yourself on the command line and then MD5s the outputs yourself, or you can set the MD5s in your test to "" and run the pipeline.
When the MD5s are blank as in:
spec.fileMD5s += testOut -> ""
You run:
ant pipelinetest -Dsingle=ExampleCountLociPipelineTest -Dpipeline.run=run
And the output will look like:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is , equal? = false
[testng] => countloci PASSED
[testng] PASSED: testCountLoci
When a pipeline test fails due to an MD5 mismatch you can use the MD5 database to diff the results.
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### Updating MD5 file: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] Checking MD5 for pipelinetests/countloci/run/count.out [calculated=67823e4722495eb10a5e4c42c267b3a6, expected=67823e4722495eb10a5e0000deadbeef]
[testng] ##### Test countloci is going fail #####
[testng] ##### Path to expected file (MD5=67823e4722495eb10a5e0000deadbeef): integrationtests/67823e4722495eb10a5e0000deadbeef.integrationtest
[testng] ##### Path to calculated file (MD5=67823e4722495eb10a5e4c42c267b3a6): integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] ##### Diff command: diff integrationtests/67823e4722495eb10a5e0000deadbeef.integrationtest integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] FAILED: testCountLoci
[testng] java.lang.AssertionError: 1 of 1 MD5s did not match.
If you need to examine a number of MD5s which may have changed you can briefly shut off MD5 mismatch failures by setting parameterize = true.
spec.parameterize = true
spec.fileMD5s += testOut -> "67823e4722495eb10a5e4c42c267b3a6"
For this run:
ant pipelinetest -Dsingle=ExampleCountLociPipelineTest -Dpipeline.run=run
If there's a match the output will resemble:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is 67823e4722495eb10a5e4c42c267b3a6, equal? = true
[testng] => countloci PASSED
[testng] PASSED: testCountLoci
While for a mismatch it will look like this:
[testng] --------------------------------------------------------------------------------
[testng] Executing test countloci with Queue arguments: -S scala/qscript/examples/ExampleCountLoci.scala -R /seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta -I /humgen/gsa-hpprojects/GATK/data/Validation_Data/small_bam_for_countloci.bam -o count.out -bsub -l WARN -tempDir pipelinetests/countloci/temp/ -runDir pipelinetests/countloci/run/ -jobQueue hour -run
[testng] ##### MD5 file is up to date: integrationtests/67823e4722495eb10a5e4c42c267b3a6.integrationtest
[testng] PARAMETERIZATION[countloci]: file pipelinetests/countloci/run/count.out has md5 = 67823e4722495eb10a5e4c42c267b3a6, stated expectation is 67823e4722495eb10a5e0000deadbeef, equal? = false
[testng] => countloci PASSED
[testng] PASSED: testCountLoci