This document describes how GATK commands are structured and how to add arguments to basic command examples.
Commands for GATK always follow the same basic syntax:
java [Java arguments] -jar GenomeAnalysisTK.jar [GATK arguments]
The core of the command is java -jar GenomeAnalysisTK.jar, which starts up the GATK program in a Java Virtual Machine (JVM). Any additional Java-specific arguments (such as -Xmx to increase memory allocation) should be inserted between java and -jar, like this:
java -Xmx4G -jar GenomeAnalysisTK.jar [GATK arguments]
The order of arguments between java and -jar is not important.
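As a small illustration of the placement rule above, the sketch below assembles the start of a command with two Java arguments. The -Djava.io.tmpdir setting and the /tmp path are assumptions added for illustration and do not come from the original text. The snippet only prints the command prefix; it does not run GATK.

```shell
# Hypothetical values for illustration; adjust them to your environment.
JAVA_ARGS="-Xmx4G -Djava.io.tmpdir=/tmp"   # JVM options go between `java` and `-jar`
GATK_JAR="GenomeAnalysisTK.jar"

# Build the command prefix. GATK arguments (-T, -R, ...) are appended after the jar.
GATK_PREFIX="java $JAVA_ARGS -jar $GATK_JAR"

# Print the prefix so it can be inspected before use.
echo "$GATK_PREFIX"
```

Swapping the two entries in JAVA_ARGS gives an equivalent command, since the order of the Java arguments does not matter.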
There are two universal arguments that are required for every GATK command (with very few exceptions): -R for Reference (e.g. -R human_b37.fasta) and -T for Tool name (e.g. -T HaplotypeCaller).
Additional arguments fall in two categories:
Engine arguments like -L (for specifying a list of intervals), which can be given to all tools and are technically optional but may be effectively required at certain steps for specific analytical designs (e.g. the -L argument for calling variants on exomes);
Tool-specific arguments, which may be required, like -I (to provide an input file containing sequence reads to tools that process BAM files), or optional, like -alleles (to provide a list of known alleles for genotyping).
The ordering of GATK arguments is not important, but we recommend always passing the tool name (-T) and reference (-R) first for consistency. It is also a good idea to order arguments by some consistent logic so that different commands are easy to compare over the course of a project. It’s up to you to choose what that logic should be.
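One way to keep argument order consistent across a project is to wrap the fixed part of the command in a small shell function. This is only a sketch: the function name print_hc_command is hypothetical, and it prints the command rather than running it, so the ordering can be checked first.

```shell
# Hypothetical helper: prints a HaplotypeCaller command with a fixed argument order.
# Usage: print_hc_command <input.bam> <output.vcf> [extra GATK arguments...]
print_hc_command() {
    bam="$1"
    out="$2"
    shift 2
    # Tool name and reference come first, then input and output, then any extras.
    echo java -Xmx4G -jar GenomeAnalysisTK.jar \
        -T HaplotypeCaller \
        -R human_b37.fasta \
        -I "$bam" \
        -o "$out" \
        "$@"
}

# Same ordering every time; extra arguments are appended at the end.
print_hc_command sample1.bam raw_variants.vcf
print_hc_command sample1.bam raw_variants.vcf -L exome_intervals.list
```

Removing the echo would make the function execute the command instead of printing it.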
All available engine and tool-specific arguments are listed in the tool documentation section. Arguments typically have both a long name (prefixed by --) and a short name (prefixed by -). The GATK command line parser recognizes both equally, so you can use whichever you prefer, depending on whether you want your commands to be more verbose or more succinct.
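To make the long/short equivalence concrete, the sketch below writes the same command both ways. The long names used here (--analysis_type, --reference_sequence, --input_file, --out) are the ones I understand GATK 3 to use for -T, -R, -I and -o; treat them as assumptions and confirm them in the tool documentation for your version. The snippet only stores and prints the two command strings.

```shell
# Short argument names (succinct form).
SHORT_FORM="java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37.fasta -I sample1.bam -o raw_variants.vcf"

# Long argument names (verbose form); assumed equivalents of the short names above.
LONG_FORM="java -jar GenomeAnalysisTK.jar --analysis_type HaplotypeCaller --reference_sequence human_b37.fasta --input_file sample1.bam --out raw_variants.vcf"

# Print both so they can be compared side by side.
echo "$SHORT_FORM"
echo "$LONG_FORM"
```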
Finally, a note about flags. Flags are arguments that have Boolean values, i.e. TRUE or FALSE. They are typically used to enable or disable specific features; for example, --keep_program_records will make certain GATK tools output additional information in the BAM header that would be omitted otherwise. In GATK, all flags are set to FALSE by default, so if you want to set one to TRUE, all you need to do is add the flag name to the command. You don't need to specify an actual value.
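The flag behavior described above can be sketched as follows. The choice of PrintReads as the tool and the output.bam file name are assumptions made for illustration; the original text only says that the flag applies to certain GATK tools. The snippet prints the two commands rather than running them.

```shell
# Without the flag, --keep_program_records keeps its default value (FALSE).
BASE_CMD="java -jar GenomeAnalysisTK.jar -T PrintReads -R human_b37.fasta -I sample1.bam -o output.bam"

# To set the flag to TRUE, append its name with no value after it.
WITH_FLAG="$BASE_CMD --keep_program_records"

echo "$BASE_CMD"
echo "$WITH_FLAG"
```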
This is a very simple command that runs HaplotypeCaller in default mode on a single input BAM file containing sequence data and outputs a VCF file containing raw variants.
java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf
If the data is from exome sequencing, we should additionally provide the exome targets using the -L argument:
java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf -L exome_intervals.list
If we just want to genotype specific sites of interest using known alleles based on results from a previous study, we can change the HaplotypeCaller’s genotyping mode using -gt_mode, provide those alleles using -alleles, and restrict the analysis to just those sites using -L:
java -Xmx4G -jar GenomeAnalysisTK.jar -R human_b37.fasta -T HaplotypeCaller -I sample1.bam -o raw_variants.vcf -L known_alleles.vcf -alleles known_alleles.vcf -gt_mode GENOTYPE_GIVEN_ALLELES
For more examples of commands and for specific tool commands, see the tool documentation section.
Is there a way to access argument tags in the arguments to a Qscript?
I have a script that takes a number of BAM files as input, and I would like to be able to tag them, e.g.:
--input:whole-genome some_long_name.bam --input:exome a_different_bam.bam
In a walker I do this and then look up the tags by calling getToolkit().getTags(argumentValue), but this isn't available to a qscript. Is there a good way to do this?
Previously I have been running a command like this:
java -jar /path/GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R /path/human_g1k_v37.fasta \
    -et NO_ET \
    -K /path/key \
    -out_mode EMIT_ALL_SITES \
    --input_file /path/bam \
    -L /path/intervals \
    -gt_mode GENOTYPE_GIVEN_ALLELES \
    --alleles /path/vcf \
    --dbsnp /path/dbsnp_135.b37.vcf \
    -o /path/my.vcf
But I was reading the documentation again and came across this statement: "GENOTYPE_GIVEN_ALLELES: only the alleles passed in from a VCF rod bound to the -alleles argument will be used for genotyping."
This led me to believe that there was no need to include the lines --input_file /path/bam \ and -L /path/intervals \
because they would be redundant. But when I try to run without those lines, I get back an error message: Walker requires reads but none were provided.
Can you give an explanation as to why both of those lines AND GENOTYPE_GIVEN_ALLELES would be needed?