Tagged with #commandline
1 documentation article | 2 announcements | 2 forum discussions



Overview

This document describes how GATK commands are structured and how to add arguments to basic command examples.


Basic java syntax

Commands for GATK always follow the same basic syntax:

java [Java arguments] -jar GenomeAnalysisTK.jar [GATK arguments]

The core of the command is java -jar GenomeAnalysisTK.jar, which starts up the GATK program in a Java Virtual Machine (JVM). Any additional Java-specific arguments (such as -Xmx to increase memory allocation) should be inserted between java and -jar, like this:

java -Xmx4G -jar GenomeAnalysisTK.jar [GATK arguments]

The order of arguments between java and -jar is not important.
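
As a rough illustration of the structure described above, the following sketch assembles a command from its three parts and prints it rather than running it. The variable names are made up for this example, and -Djava.io.tmpdir=/tmp is included only as a second, standard JVM option; the file names are the placeholders used elsewhere in this document.

```shell
#!/bin/sh
# Hypothetical variable names; adjust the values to your own setup.
JAVA_ARGS="-Xmx4G -Djava.io.tmpdir=/tmp"   # Java arguments (order among them does not matter)
GATK_JAR="GenomeAnalysisTK.jar"            # path to the GATK jar
GATK_ARGS="-R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf"

# Java arguments go between `java` and `-jar`; GATK arguments follow the jar name.
CMD="java $JAVA_ARGS -jar $GATK_JAR $GATK_ARGS"

# Print the assembled command for inspection instead of executing it.
echo "$CMD"
```

Building the command as a plain string keeps the sketch simple, but it assumes that none of the paths contain spaces.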


GATK arguments

There are two universal arguments that are required for every GATK command (with very few exceptions, namely the clp-type utilities): -R for Reference (e.g. -R human_b37.fasta) and -T for Tool name (e.g. -T HaplotypeCaller).

Additional arguments fall in two categories:

  • Engine arguments like -L (for specifying a list of intervals) which can be given to all tools and are technically optional but may be effectively required at certain steps for specific analytical designs (e.g. the -L argument for calling variants on exomes);

  • Tool-specific arguments, which may be required, like -I (to provide an input file containing sequence reads to tools that process BAM files), or optional, like -alleles (to provide a list of known alleles for genotyping).

The ordering of GATK arguments is not important, but we recommend always passing the tool name (-T) and reference (-R) first for consistency. It is also a good idea to consistently order arguments by some kind of logic in order to make it easy to compare different commands over the course of a project. It’s up to you to choose what that logic should be.

All available engine and tool-specific arguments are listed in the tool documentation section. Arguments typically have both a long name (prefixed by --) and a short name (prefixed by -). The GATK command line parser recognizes both equally, so you can use whichever suits you, depending on whether you prefer commands to be more verbose or more succinct.

Finally, a note about flags. Flags are arguments that have boolean values, i.e. TRUE or FALSE. They are typically used to enable or disable specific features; for example, --keep_program_records will make certain GATK tools output additional information in the BAM header that would be omitted otherwise. In GATK, all flags are set to FALSE by default, so if you want to set one to TRUE, all you need to do is add the flag name to the command. You don't need to specify an actual value.
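
To make the last two points concrete, here is a small sketch that writes the same command with short and with long argument names, then enables a flag by simply appending its name. The long names shown (--analysis_type, --reference_sequence, --input_file, --out) are our understanding of the GATK 3.x equivalents of -T, -R, -I and -o, and the choice of PrintReads as a BAM-writing tool is an assumption for illustration; please check both against the tool documentation for your version. File and variable names are placeholders, and the command is only printed, not run.

```shell
#!/bin/sh
# Short-name form (file names are placeholders).
SHORT="java -jar GenomeAnalysisTK.jar -T PrintReads -R human_b37.fasta -I sample1.bam -o out.bam"

# The same command with long names (assumed GATK 3.x names; verify in the tool docs).
LONG="java -jar GenomeAnalysisTK.jar --analysis_type PrintReads --reference_sequence human_b37.fasta --input_file sample1.bam --out out.bam"

# A flag is switched on just by adding its name; no TRUE/FALSE value follows it.
KEEP_RECORDS="yes"   # hypothetical switch used only in this sketch
FLAGS=""
if [ "$KEEP_RECORDS" = "yes" ]; then
    FLAGS="--keep_program_records"
fi

CMD="$SHORT $FLAGS"

# Print both forms for comparison instead of executing them.
echo "$CMD"
echo "$LONG"
```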


Examples of complete GATK command lines

This is a very simple command that runs HaplotypeCaller in default mode on a single input BAM file containing sequence data and outputs a VCF file containing raw variants.

java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf

If the data is from exome sequencing, we should additionally provide the exome targets using the -L argument:

java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf -L exome_intervals.list

If we just want to genotype specific sites of interest using known alleles based on results from a previous study, we can change the HaplotypeCaller’s genotyping mode using -gt_mode, provide those alleles using -alleles, and restrict the analysis to just those sites using -L:

java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf -L known_alleles.vcf -alleles known_alleles.vcf -gt_mode GENOTYPE_GIVEN_ALLELES
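
Since the three commands above differ only in the arguments appended at the end, one way to keep them easy to compare is to define the shared part once and add to it. This is just a sketch of that idea, with made-up variable names; it prints the commands rather than running them.

```shell
#!/bin/sh
# Shared part of all three examples (hypothetical variable names).
BASE="java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf"

# Exome data: add the target intervals with -L.
EXOME_CMD="$BASE -L exome_intervals.list"

# Genotyping known alleles: restrict to the known sites, supply the alleles, switch mode.
GGA_CMD="$BASE -L known_alleles.vcf -alleles known_alleles.vcf -gt_mode GENOTYPE_GIVEN_ALLELES"

# Print each command on its own line for side-by-side comparison.
echo "$BASE"
echo "$EXOME_CMD"
echo "$GGA_CMD"
```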

For more examples of commands and for specific tool commands, see the tool documentation section.

Comments (0)

Consider this a public service announcement, since most GATK users probably also use Picard tools routinely. The recently released version 1.124 of the Picard tools includes many lovely improvements, bug fixes and even a few new tools (see release notes for full details) -- but you'll want to pay attention to one major change in particular.

From this version onward, the Picard release will contain a single JAR file containing all the tools, instead of one JAR file per tool as it was before. This means that when you invoke a Picard tool, you'll invoke a single JAR, then specify which tool (which they call CLP for Command Line Program) you want to run. This should feel completely familiar if you already use GATK regularly, but it does mean you'll need to update any scripts that use Picard tools to the new paradigm. Other than that, there's no change of syntax; Picard will still use e.g. I=input.bam where GATK would use -I input.bam.
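
To give an idea of what the script update looks like, here is a sketch that prints the old and new forms of the same invocation. SortSam and its arguments are used purely as an example, the file names are placeholders, and we are assuming the combined JAR is named picard.jar; check your download for the actual file name.

```shell
#!/bin/sh
# Example tool and arguments (placeholders chosen for illustration).
TOOL="SortSam"
TOOL_ARGS="I=input.bam O=sorted.bam SORT_ORDER=coordinate"

# Before 1.124: one JAR per tool.
OLD_CMD="java -jar ${TOOL}.jar $TOOL_ARGS"

# From 1.124 onward: a single JAR, with the tool name as the first argument.
NEW_CMD="java -jar picard.jar $TOOL $TOOL_ARGS"

# Print both forms for comparison instead of executing them.
echo "$OLD_CMD"
echo "$NEW_CMD"
```

Note that the tool arguments themselves are unchanged; only the part before them differs.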

We will need to update some of our own documentation accordingly over the near future; please bear with us as we go through this process, and let us know by commenting in this thread if you find any docs that have yet to be updated.

Comments (0)

I'm not sure why it hadn't occurred to us to do this before, but we've finally done it: an FAQ article that formally explains how GATK commands are structured, what the basic types of arguments are, and how to string them all together.

We realized that command structure requirements can be confusing if you are new to command line programs, if only because so many toolkits use fairly different ones. For example, Picard tools (which are also developed at the Broad!) have separate jar files for each tool in the toolkit, while GATK has one jar file containing all the tools. The Picard syntax for passing argument values is also different; they use = to join the argument name and value, while GATK commands just take a space.

So if that's something you need help with, check out the doc! We'd love to hear from people who are new to GATK about whether this is helpful and how we can improve it further.

Comments (3)

I'm working on adding RSEM to our RNAseq pipeline, which uses Queue. RSEM takes a number of inputs on the command line, so I have a case class and override commandLine for this to work. Nothing special there.

However, RSEM wants a prefix for the output sample names. If I give it sample_name, it will generate a whole bunch of files: sample_name.genes.results with expression values for genes, sample_name.isoforms.results with expression values for isoforms, sample_name.genome.bam, sample_name.genome.sorted.bam and sample_name.genome.sorted.bam.bai with mappings, etc.

What's the best way to handle this in terms of @Output?

Should I use (1):

case class rsem(inFq1: File, inFq2: File, prefix: String) extends ExternalCommonArgs {
   ...
   @Output val myPrefix = prefix
   ...

and then use the prefix in the downstream jobs? Or should I use (2):

case class rsem(inFq1: File, inFq2: File, prefix: String, bam: File, geneResults: File) extends ExternalCommonArgs {
   ...
   @Output val myBam = bam
   @Output val myGeneRes = geneResults
   ...

In (2), I would still use prefix in the def commandLine, of course.

Is there a preferred way to handle this in Queue?

Comments (1)

The docs make it look like "-L" is required:

java -Xmx2g -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T FastaAlternateReferenceMaker \
   -o output.fasta \
   -L input.intervals \
   --variant input.vcf \
   [--snpmask mask.vcf]

It's not; the command works fine when you just provide a VCF file with "--variant".

Thanks again.