Running the GATK for the first time

Once you've installed the GATK, whether from the source distribution or from a binary distribution, you can run the various analyses that it supports.

Test Your Installation

The first step is to test that the GATK is correctly installed, and that the supporting tools like Java are in your path. Type the following command:

java -jar <path to GenomeAnalysisTK.jar> --help

replacing <path to GenomeAnalysisTK.jar> with the path to the jar file you have set up.

You should see usage output similar to the following:

usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-I <input_file>] [-L <intervals>] [-R 
       <reference_sequence>] [-B <rodBind>] [-D <DBSNP>] [-H <hapmap>] [-hc <hapmap_chip>] [-o <out>] [-e 
       <err>] [-oe <outerr>] [-A] [-M <maximum_reads>] [-sort <sort_on_the_fly>] [-compress 
       <bam_compression>] [-fmq0] [-dfrac <downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-S 
       <validation_strictness>] [-U] [-P] [-dt] [-tblw] [-nt <numthreads>] [-l <logging_level>] [-log 
       <log_to_file>] [-quiet] [-debug] [-h]
 -T,--analysis_type <analysis_type>                         Type of analysis to run
 -I,--input_file <input_file>                               SAM or BAM file(s)
 -L,--intervals <intervals>                                 A list of genomic intervals over which 
                                                            to operate. Can be explicitly specified 
                                                            on the command line or in a file.
 -R,--reference_sequence <reference_sequence>               Reference sequence file
 -B,--rodBind <rodBind>                                     Bindings for reference-ordered data, in 
                                                            the form <name>,<type>,<file>
 -D,--DBSNP <DBSNP>                                         DBSNP file
 -H,--hapmap <hapmap>                                       Hapmap file
 -hc,--hapmap_chip <hapmap_chip>                            Hapmap chip file
 -o,--out <out>                                             An output file presented to the walker. 
                                                             Will overwrite contents if file exists.
 -e,--err <err>                                             An error output file presented to the 
                                                            walker.  Will overwrite contents if file 
                                                            exists.
 -oe,--outerr <outerr>                                      A joint file for 'normal' and error 
                                                            output presented to the walker.  Will 
                                                            overwrite contents if file exists.

...

Troubleshooting install

If you don't see this message and instead get an error, there are a couple of things to check. First, make sure that your Java version is at least 1.6 by typing the following command:

java -version

You should see something similar to the following text:

java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)

If the version is lower than 1.6, install the newest version of Java on the system. If you instead see something like java: Command not found, make sure that Java is installed on your machine and that your PATH variable contains the path to the Java executables. On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top before your machine will default to running version 1.6, even if it has been installed.
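The version check described above can also be scripted. The following is a minimal sketch, not part of the GATK: the function names parse_java_version and java_at_least_1_6 are made up for this example. It relies on the fact that java -version writes its report to standard error, and it accepts both the older "1.x" numbering shown above and the newer style (for example "11.0.2") used by later Java releases.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the GATK) for checking the Java version.

# Read `java -version` output on stdin and print the quoted version string
# from the first line that contains one, e.g. 1.6.0_12.
parse_java_version() {
    sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed if the given version string is at least 1.6.
# Handles both "1.6.0_12"-style and "11.0.2"-style strings.
java_at_least_1_6() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    # Keep only the leading digits of each component.
    major=${major%%[!0-9]*}
    minor=${minor%%[!0-9]*}
    [ -n "$major" ] || return 1
    [ "$major" -gt 1 ] && return 0
    [ "$major" -eq 1 ] && [ "${minor:-0}" -ge 6 ]
}

# Example use (java -version prints to stderr, hence 2>&1):
#   v=$(java -version 2>&1 | parse_java_version)
#   java_at_least_1_6 "$v" && echo "Java $v is new enough"
```

If the command substitution leaves v empty, java is most likely not on your PATH, which corresponds to the Command not found case described above.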

Run the GATK

Now that the GATK is set up correctly, let's run the toolkit on some example data. A common, simple analysis that people use the GATK for is counting the reads in a BAM file. The GATK is capable of much more powerful analyses, but this will serve as our example.

First, download our example data from the exampleFASTA directory of the GATK resource bundle. You should now have exampleBAM.bam and its associated index (a .bai file), and exampleFASTA.fasta and its associated files (a .dict file and a .fasta.fai file). This is everything you need to run a basic analysis with the following command:

java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta  -I exampleBAM.bam -T CountReads
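If the command fails immediately, one thing worth ruling out is a missing input or index file. The sketch below checks for the five files mentioned above before you run the GATK. The function name check_gatk_inputs is hypothetical, and the exact index and dictionary file names (exampleBAM.bam.bai, exampleFASTA.fasta.fai, exampleFASTA.dict) are assumptions about how the bundle names them; adjust them if your download differs.

```shell
#!/bin/sh
# Hypothetical helper (not part of the GATK): report any missing example files.
# Usage: check_gatk_inputs [directory]   (defaults to the current directory)
check_gatk_inputs() {
    dir=${1:-.}
    missing=0
    # Assumed file names for the BAM, its index, the reference,
    # the reference index, and the sequence dictionary.
    for f in exampleBAM.bam exampleBAM.bam.bai \
             exampleFASTA.fasta exampleFASTA.fasta.fai exampleFASTA.dict; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 only if every file was found.
    return "$missing"
}

# Example use:
#   check_gatk_inputs . && echo "all example files present"
```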

After a few seconds you should see output that looks like this:

INFO  21:53:04,240 HelpFormatter - --------------------------------------------------------------------------- 
INFO  21:53:04,243 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.0.4747, Compiled 2010/11/29 21:04:30 
INFO  21:53:04,244 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  21:53:04,244 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki 
INFO  21:53:04,244 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa 
INFO  21:53:04,245 HelpFormatter - Program Args: -T CountReads -I packages/resources/exampleBAM.bam -R packages/resources/exampleFASTA.fasta  
INFO  21:53:04,245 HelpFormatter - Date/Time: 2010/11/29 21:53:04 
INFO  21:53:04,245 HelpFormatter - --------------------------------------------------------------------------- 
INFO  21:53:04,245 HelpFormatter - --------------------------------------------------------------------------- 
INFO  21:53:04,246 AbstractGenomeAnalysisEngine - Strictness is SILENT 
INFO  21:53:04,555 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33 
INFO  21:53:04,576 TraversalEngine - [PROGRESS] Traversed 33 reads in 0.01 secs (333.33 secs per 1M reads) 
INFO  21:53:04,578 TraversalEngine - Total runtime 0.03 secs, 0.00 min, 0.00 hours 
INFO  21:53:04,581 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) 
INFO  21:53:04,584 GATKRunReport - Aggregating data for run report 

The results of the traversal indicate that the CountReadsWalker (which you specified with the command line option -T CountReads) counted 33 reads in the example BAM file, which is exactly what we expect to see. Note that depending on the exact logging level and GATK release, you may see slightly different INFO output. Everything is running correctly if you see the line:

INFO  21:53:04,556 Walker - [REDUCE RESULT] Traversal result is: 33 

somewhere in your output. The full list of command line options is given on the Built-in command-line arguments page. In the case above we only use three:

  • -R for the reference file
  • -I for the input BAM file
  • -T for the analysis name
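To check for the [REDUCE RESULT] line without reading the whole log, you can filter the output. This is a minimal sketch: traversal_result is a hypothetical helper, not a GATK feature, and it only assumes that the line has the form shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the GATK): read GATK log text on stdin and
# print the number from the first "[REDUCE RESULT] Traversal result is: N" line.
# Prints nothing if no such line is present.
traversal_result() {
    sed -n 's/.*[[]REDUCE RESULT[]] Traversal result is: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example use (2>&1 captures the log whichever stream it is written to):
#   java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta -I exampleBAM.bam \
#       -T CountReads 2>&1 | traversal_result
```

For the example data, the value printed should match the read count reported in the log above.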

You can play around with changing the analysis type (though be warned, many require input files or arguments beyond what we've gone over). You can see the list of available analysis tools (walkers in GATK lingo) when you run the GATK with the --help option.

Try changing the command line option for the analysis to a count locus analysis, which counts bases on the genome that are covered by one or more reads:

java -jar GenomeAnalysisTK.jar -R exampleFASTA.fasta  -I exampleBAM.bam -T CountLoci -o output.txt
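
Since the two commands on this page differ only in the analysis type and any extra arguments, a small wrapper can make it easier to experiment. This is a sketch under stated assumptions: run_walker and the GATK_JAR environment variable are names invented for this example, and the wrapper simply passes through the -R, -I, -T and -o options already shown on this page.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the GATK): run a walker on the example data.
# GATK_JAR is an assumed environment variable holding the path to the jar;
# it defaults to GenomeAnalysisTK.jar in the current directory.
# Usage: run_walker <analysis_type> [extra GATK arguments...]
run_walker() {
    walker=$1
    shift
    java -jar "${GATK_JAR:-GenomeAnalysisTK.jar}" \
        -R exampleFASTA.fasta -I exampleBAM.bam -T "$walker" "$@"
}

# Example use:
#   run_walker CountReads
#   run_walker CountLoci -o output.txt
```

Walkers that need further inputs will still require those arguments to be supplied after the analysis type.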