Running the GATK for the first time
Once you've installed the GATK, whether from the source distribution or from a binary distribution, you can run the various analyses that it supports.
Test Your Installation
The first step is to test that the GATK is correctly installed, and that the supporting tools like Java are in your path. Type the following command:
java -jar <path to GenomeAnalysisTK.jar> --help
replacing <path to GenomeAnalysisTK.jar> with the path you have set up.
You should see usage output similar to the following:
usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-I <input_file>] [-L <intervals>] [-R
<reference_sequence>] [-B <rodBind>] [-D <DBSNP>] [-H <hapmap>] [-hc <hapmap_chip>] [-o <out>] [-e
<err>] [-oe <outerr>] [-A] [-M <maximum_reads>] [-sort <sort_on_the_fly>] [-compress
<bam_compression>] [-fmq0] [-dfrac <downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-S
<validation_strictness>] [-U] [-P] [-dt] [-tblw] [-nt <numthreads>] [-l <logging_level>] [-log
<log_to_file>] [-quiet] [-debug] [-h]
-T,--analysis_type <analysis_type>            Type of analysis to run
-I,--input_file <input_file>                  SAM or BAM file(s)
-L,--intervals <intervals>                    A list of genomic intervals over which
                                              to operate. Can be explicitly specified
                                              on the command line or in a file.
-R,--reference_sequence <reference_sequence>  Reference sequence file
-B,--rodBind <rodBind>                        Bindings for reference-ordered data, in
                                              the form <name>,<type>,<file>
-D,--DBSNP <DBSNP>                            DBSNP file
-H,--hapmap <hapmap>                          Hapmap file
-hc,--hapmap_chip <hapmap_chip>               Hapmap chip file
-o,--out <out>                                An output file presented to the walker.
                                              Will overwrite contents if file exists.
-e,--err <err>                                An error output file presented to the
                                              walker. Will overwrite contents if file
                                              exists.
-oe,--outerr <outerr>                         A joint file for 'normal' and error
                                              output presented to the walker. Will
                                              overwrite contents if file exists.
...
Troubleshooting install
If you don't see this message and instead get an error, there are a couple of things you should check. First, make sure that your Java version is at least 1.6 by typing the following command:
java -version
You should see something similar to the following text:
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
If the version is less than 1.6, install the newest version of Java on the system. If you instead see something like java: Command not found, make sure that Java is installed on your machine and that your PATH variable contains the path to the Java executables. On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top before your machine will default to running version 1.6, even if it has been installed.
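The version check described above can also be scripted. The following is a minimal sketch, not part of the GATK: the helper names extract_java_version and java_version_ok are hypothetical, and it assumes the version appears in double quotes on the first line of the java -version output, as in the example above. Note that java -version writes to standard error, so the usage line redirects it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the GATK or Java) for checking the Java version.

# Read `java -version` output on stdin and print the quoted version string,
# e.g. 1.6.0_12 from the line: java version "1.6.0_12"
extract_java_version() {
    sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed (exit status 0) if the version string in $1 is at least 1.6.
java_version_ok() {
    jv_major=${1%%.*}        # text before the first dot, e.g. 1
    jv_rest=${1#*.}          # text after the first dot, e.g. 6.0_12
    jv_minor=${jv_rest%%.*}  # text before the next dot, e.g. 6
    # Reject anything that is not plain digits rather than guess.
    case $jv_major in ''|*[!0-9]*) return 2 ;; esac
    case $jv_minor in ''|*[!0-9]*) return 2 ;; esac
    if [ "$jv_major" -gt 1 ]; then
        return 0             # newer numbering scheme (9, 11, 17, ...)
    fi
    [ "$jv_major" -eq 1 ] && [ "$jv_minor" -ge 6 ]
}

# Usage (2>&1 because `java -version` prints to standard error):
#   java_version_ok "$(java -version 2>&1 | extract_java_version)" \
#       && echo "Java is new enough" || echo "Java is too old or was not found"
```

If java is not on your PATH, the extracted string is empty and the check fails, which covers the Command not found case as well.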
Run the GATK
Now that we have correctly set up the GATK, let's run the toolkit on some example data. A common, simple analysis that people use the GATK for is counting the reads in a BAM file (the GATK is capable of much more powerful analyses, but this will serve as our example).
First, download our example data from the exampleFASTA directory of the GATK resource bundle. You should now have exampleBAM.bam and its associated index (a .bai file), and exampleFASTA.fasta and its associated files (a .dict file and a .fasta.fai file). This is everything you need to run a basic analysis with the following command:
java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam -T CountReads
After a few seconds you should see output that looks similar to this:
INFO 21:53:04,240 HelpFormatter - ---------------------------------------------------------------------------
INFO 21:53:04,243 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.0.4747, Compiled 2010/11/29 21:04:30
INFO 21:53:04,244 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 21:53:04,244 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO 21:53:04,244 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO 21:53:04,245 HelpFormatter - Program Args: -T CountReads -I packages/resources/exampleBAM.bam -R packages/resources/exampleFASTA.fasta
INFO 21:53:04,245 HelpFormatter - Date/Time: 2010/11/29 21:53:04
INFO 21:53:04,245 HelpFormatter - ---------------------------------------------------------------------------
INFO 21:53:04,245 HelpFormatter - ---------------------------------------------------------------------------
INFO 21:53:04,246 AbstractGenomeAnalysisEngine - Strictness is SILENT
INFO 21:53:04,555 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33
INFO 21:53:04,576 TraversalEngine - [PROGRESS] Traversed 33 reads in 0.01 secs (333.33 secs per 1M reads)
INFO 21:53:04,578 TraversalEngine - Total runtime 0.03 secs, 0.00 min, 0.00 hours
INFO 21:53:04,581 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%)
INFO 21:53:04,584 GATKRunReport - Aggregating data for run report
The results of the traversal indicate that the CountReadsWalker (which you specified with the command line option -T CountReads) counted 33 reads in the example BAM file, which is exactly what we expect to see. Please note that depending on exact logging level and GATK release, you may see slightly different info output. Everything is running correctly if you see the line:
INFO 21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33
somewhere in your output. The full list of command-line options is explained on the Built-in command-line arguments page. In the case above we use only three:
- -R for the reference file
- -I for the input bam file
- -T for the analysis name.
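If you want to check for the [REDUCE RESULT] line from a script rather than by eye, something like the following sketch may help. The helper name traversal_result is hypothetical, and the pattern assumes the log line has exactly the wording shown above, which may differ between GATK releases and logging levels.

```shell
#!/bin/sh
# Hypothetical helper (not part of the GATK): read GATK log output on stdin
# and print the number reported on the "[REDUCE RESULT]" line, if present.
traversal_result() {
    sed -n 's/.*\[REDUCE RESULT\] Traversal result is: \([0-9][0-9]*\).*/\1/p'
}

# Usage (2>&1 captures the log whether it goes to standard output or
# standard error; which one the GATK uses is not assumed here):
#   java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam \
#       -T CountReads 2>&1 | traversal_result
```

With the example data, the helper should print the read count from the line above; if it prints nothing, the result line was not found in the output.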
You can play around with changing the analysis type (though be warned, many require input files or arguments beyond what we've gone over). You can see the list of available analysis tools (walkers in GATK lingo) when you run the GATK with the --help option.
Try changing the command line option for the analysis to a count locus analysis, which counts bases on the genome that are covered by one or more reads:
java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam -T CountLoci -o output.txt
