# TutorialsStep-by-step tutorials that demonstrate how to use the tools in practice

#### View by tag

Created 2013-06-17 22:48:43 | Updated 2015-03-30 11:18:39 | Tags: official analyst filtering hardfilters

#### Objective

Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

• TBD

#### Steps

1. Extract the SNPs from the call set
2. Determine parameters for filtering SNPs
3. Apply the filter to the SNP call set
4. Extract the Indels from the call set
5. Determine parameters for filtering indels
6. Apply the filter to the Indel call set

### 1. Extract the SNPs from the call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_variants.vcf \
-L 20 \
-selectType SNP \
-o raw_snps.vcf


#### Expected Result

This creates a VCF file called raw_snps.vcf, containing just the SNPs from the original file of raw variants.

### 2. Determine parameters for filtering SNPs

SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

• QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

• FisherStrand (FS) 60.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

• RMSMappingQuality (MQ) 40.0

This is the Root Mean Square of the mapping quality of the reads across all samples.

• MappingQualityRankSumTest (MQRankSum) 12.5

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

### 3. Apply the filter to the SNP call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_snps.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "my_snp_filter" \
-o filtered_snps.vcf


#### Expected Result

This creates a VCF file called filtered_snps.vcf, containing all the original SNPs from the raw_snps.vcf file, but now the SNPs are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For SNPs that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each SNP failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

### 4. Extract the Indels from the call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R reference.fa \
-V raw_HC_variants.vcf \
-L 20 \
-selectType INDEL \
-o raw_indels.vcf


#### Expected Result

This creates a VCF file called raw_indels.vcf, containing just the Indels from the original file of raw variants.

### 5. Determine parameters for filtering Indels.

Indels matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the indel using the culprit annotation. Indels that do not match any of these conditions will be considered good and marked PASS in the output VCF file.

• QualByDepth (QD) 2.0

This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.

• FisherStrand (FS) 200.0

Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.

This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls.

### 6. Apply the filter to the Indel call set

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R reference.fa \
-V raw_indels.vcf \
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
--filterName "my_indel_filter" \
-o filtered_indels.vcf


#### Expected Result

This creates a VCF file called filtered_indels.vcf, containing all the original Indels from the raw_indels.vcf file, but now the Indels are annotated with either PASS or FILTER depending on whether or not they passed the filters.

For Indels that failed the filter, the variant annotation also includes the name of the filter. That way, if you apply several different filters (simultaneously or sequentially), you can keep track of which filter(s) each Indel failed, and later you can retrieve specific subsets of your calls using the SelectVariants tool. To learn more about composing different types of filtering expressions and retrieving subsets of variants using SelectVariants, please see the online GATK documentation.

Created 2013-06-17 21:31:21 | Updated 2015-05-16 07:04:42 | Tags: haplotypecaller variant-discovery

#### Objective

Call variants on a single genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

#### Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.

• TBD

#### Steps

1. Determine the basic parameters of the analysis
2. Call variants in your sequence data

### 1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

• Genotyping mode (–genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine if a sample has a specific genotype of interest and you are not interested in other alleles.

• Emission confidence threshold (–stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

• Calling confidence threshold (–stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.

### 2. Call variants in your sequence data

#### Action

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-L 20 \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 30 \
-o raw_variants.vcf


Note: This is an example command. Please look up what the arguments do and see if they fit your analysis before copying this. To see how the -L argument works, you can refer here: http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals#latest

#### Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Created 2013-06-17 21:39:48 | Updated 2015-05-08 22:01:39 | Tags: unifiedgenotyper variant-discovery locus-based

### Note: the UnifiedGenotyper has been replaced by HaplotypeCaller, which is a much better tool. UG is still available but you should really consider using HC instead.

#### Objective

Call variants on a haploid genome with the UnifiedGenotyper, producing a raw (unfiltered) VCF.

• TBD

#### Steps

1. Determine the basic parameters of the analysis
2. Call variants in your sequence data

### 1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

• Ploidy (-ploidy)

In its basic use, this is the ploidy (number of chromosomes) per sample. By default it is set to 2, to process diploid organisms' genomes, but it can be set to any other desired value, starting at 1 for haploid organisms, and up for polyploids. This argument can also be used to handle pooled data. For that purpose, you'll need to set -ploidy to Number of samples in each pool * Sample Ploidy. There is no fixed upper limit, but keep in mind that high-level ploidy will increase processing times since the calculations involved are more complex. For full details on how to process pooled data, see Eran et al. (in preparation).

• Genotype likelihood model (-glm)

This is the model that the program will use to calculate the genotype likelihoods. By default, it is set to SNP, but it can also be set to INDEL or BOTH. If set to BOTH, both SNPs and Indels will be called in the same run and be output to the same variants file.

• Emission confidence threshold (–stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

• Calling confidence threshold (–stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in the GATK's technical language, saying a site was called means that that site passed the confidence threshold test. For filtered, it's even more confusing, because in ordinary language, when people say that sites were filtered, they usually mean that those sites successfully passed a filtering test. However, in the GATK's technical language, the same phrase (saying that sites were filtered) means that those sites failed the filtering test. In effect, it means that those would be filtered out if the filter was used to actually remove low-confidence calls from the callset, instead of just tagging them. In both cases, both usages are valid depending on the point of view of the person who is reporting the results. So it's always important to check what is the context when interpreting results that include these terms.

### 2. Call variants in your sequence data

Run the following GATK command:

java -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R haploid_reference.fa \
-L 20 \
--glm BOTH \
--stand_call_conf 30 \
--stand_emit_conf 10 \
-o raw_ug_variants.vcf


This creates a VCF file called raw_ug_variants.vcf, containing all the sites that the UnifiedGenotyper evaluated to be potentially variant.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinctions made by the caller itself between low-confidence calls and the rest is very primitive, and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering), in order to move on to downstream analyses with the highest-quality call set possible.

Created 2013-07-03 00:53:53 | Updated 2015-01-26 18:11:45 | Tags: analyst readgroup indexing

#### Objective

Fix a BAM that is not indexed or not sorted, has not had duplicates marked, or is lacking read group information. These steps can be performed independently of each other but this order is recommended.

#### Prerequisites

• Installed Picard tools

#### Steps

1. Sort the aligned reads by coordinate order
2. Mark duplicates
4. Index the BAM file

#### Note

You may ask, is all of this really necessary? The GATK is notorious for imposing strict formatting guidelines and requiring the presence of information such as read groups that other software packages do not require. Although this represents a small additional processing burden upfront, the downstream benefits are numerous, including the ability to process library data individually, and significant gains in speed and parallelization options.

### 1. Sort the aligned reads by coordinate order

#### Action

Run the following Picard command:

java -jar SortSam.jar \
SORT_ORDER=coordinate


#### Expected Results

This creates a file called sorted_reads.bam containing the aligned reads sorted by coordinate.

#### Action

Run the following Picard command:

java -jar MarkDuplicates.jar \
METRICS_FILE=metrics.txt


#### Expected Results

This creates a file called dedup_reads.bam with the same content as the input file, except that any duplicate reads are marked as such. It also creates a file called metrics.txt that contains metrics regarding duplication of the data.

#### More details

During the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process (sometimes called dedupping in bioinformatics slang) identifies these reads as such so that the GATK tools know to ignore them.

#### Action

Run the following Picard command:

java -jar AddOrReplaceReadGroups.jar  \
RGID=group1 RGLB= lib1 RGPL=illumina RGPU=unit1 RGSM=sample1


#### Expected Results

This creates a file called addrg_reads.bam with the same content as the input file, except that the reads will now have read group information attached.

### 4. Index the BAM file

#### Action

Run the following Picard command:

java -jar BuildBamIndex.jar \


#### Expected Results

This creates an index file called addrg_reads.bai, which is ready to be used in the Best Practices workflow.

Since Picard tools do not systematically create an index file when they output a new BAM file (unlike GATK tools, which will always output indexed files), it is best to keep the indexing step for last.

Created 2015-04-26 02:02:50 | Updated 2015-04-26 02:08:43 | Tags: tutorials haplotypecaller bamout debugging

### 1. Overview

As you may know, HaplotypeCaller performs a local reassembly and realignment of the reads in the region surrounding potential variant sites (see the HaplotypeCaller method docs for more details on why and how this is done). So it often happens that during the calling process, the reads get moved to different mapping positions than what you can observe in the BAM file that you originally provided to HC as input.

These remappings usually explain most discordances between calls that are expected based on the original data and actual calls made by HaplotypeCaller, so it's very useful to be able to visualize what rearrangements the tool has made.

### 2. Generating the bamout for a single site or interval

To generate the bamout file for a specific site or interval, just run HaplotypeCaller on the region around the site or interval of interest using the -L argument to restrict the analysis to that region (adding about 500 bp on either side) and using the -bamout argument to specify the name of the bamout file that will be generated.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -L 20:10255630-10255840 -bamout bamout.bam


If you were using any additional parameters in your original variant calling (including -ERC and related arguments), make sure to include them in this command as well so that you can make an apples-to-apples comparison.

Then you open up both the original bam and the bamout file together in a genome browser such as IGV. On some test data from our favorite sample, NA12878, this is what you would see:

You can see that the bamout file, on top, contains data only for the ActiveRegion that was within the analysis interval specified by -L. The two blue reads represent the artificial haplotypes constructed by HaplotypeCaller (you may need to adjust your IGV settings to see the same thing on your machine).

You can see a whole group of reads neatly aligned, with an insertion in the middle. In comparison, the original data shown in the lower track has fewer reads with insertions, but has several reads with mismapped ends. This is a classic example of a site where realignment through reassembly has provided additional evidence for an indel, allowing HaplotypeCaller to call it confidently. In contrast, UnifiedGenotyper was not able to call this insertion confidently.

### 3. Generating the bamout for multiple intervals or the whole genome

Although we don't recommend doing this by default because it will cause slower performance and take up a lot of storage space, you can generate a bamout that contains many more intervals, or even covers the whole genome. To do so, just run the same command, but this time, pass your list of intervals to -L, or simply omit it if you want the entire genome to be included.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -o hc_variants.vcf -bamout bamout.bam


This time, if you zoom out a bit in IGV, you will see multiple stacks of reads corresponding to the various ActiveRegions that were identified and processed.

### 4. Forcing an output in a region that is not covered in the bamout

In some cases HaplotypeCaller does not complete processing on an ActiveRegion that it has started. This is typically because there is either almost no evidence of variation once the remapping has been done, or on the contrary, the region is very messy and there is too much complexity. In both cases, the program is designed to give up in order to avoid wasting time. This is a good thing most of the time, but it does mean that sometimes you will have no output in the bamout for the site you are trying to troubleshoot.

The good news is that in most cases it is possible to force HaplotypeCaller to go through with the full processing so that it will produce bamout output for your site of interest. To do so, simply add the flags -forceActive and -disableOptimizations to your command line, in addition to the -bamout argument of course.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R human_b37_20.fasta -I recalibrated.bam -L 20:10371667-10375021 -o hc_forced.vcf -bamout force_bamout.bam -forceActive -disableOptimizations


In this other region, you can see that the original mapping (middle track) was a bit messy with some possible evidence of variation, and in fact UnifiedGenotyper called a SNP in this region (top variant track). But HaplotypeCaller did not call the SNP, and did not output anything in our first bamout file (top read track). When you force an output in that region using the two new flags, you see in the forced bamout (bottom read track) that the remapped data is a lot cleaner and the evidence for variation is essentially gone.

It is also possible to force an ActiveRegion to be triggered at specific intervals; see the HaplotypeCaller tool docs for more details on how this is done.

Created 2013-07-02 00:16:14 | Updated 2015-04-23 21:52:27 | Tags: install rscript igv picard gsalib samtools r ggplot2 rstudio

#### Objective

Install all software packages required to follow the GATK Best Practices.

#### Prerequisites

To follow these instructions, you will need to have a basic understanding of the meaning of the following words and command-line operations. If you are unfamiliar with any of the following, you should consult a more experienced colleague or your systems administrator if you have one. There are also many good online tutorials you can use to learn the necessary notions.

• Basic Unix environment commands
• Binary / Executable
• Compiling a binary
• Command-line shell, terminal or console
• Software library

You will also need to have access to an ANSI compliant C++ compiler and the tools needed for normal compilations (make, shell, the standard library, tar, gunzip). These tools are usually pre-installed on Linux/Unix systems. On MacOS X, you may need to install the MacOS Xcode tools. See https://developer.apple.com/xcode/ for relevant information and software downloads. The XCode tools are free but an AppleID may be required to download them.

Starting with version 2.6, the GATK requires Java Runtime Environment version 1.7. All Linux/Unix and MacOS X systems should have a JRE pre-installed, but the version may vary. To test your Java version, run the following command in the shell:

java -version


This should return a message along the lines of ”java version 1.7.0_25” as well as some details on the Runtime Environment (JRE) and Virtual Machine (VM). If you have a version other than 1.7.x, be aware that you may run into trouble with some of the more advanced features of the Picard and GATK tools. The simplest solution is to install an additional JRE and specify which you want to use at the command-line. To find out how to do so, you should seek help from your systems administrator.

#### Software packages

1. BWA
2. SAMtools
3. Picard
4. Genome Analysis Toolkit (GATK)
5. IGV
6. RStudio IDE and R libraries ggplot2 and gsalib

Note that the version numbers of packages you download may be different than shown in the instructions below. If so, please adapt the number accordingly in the commands.

### 1. BWA

• Installation

Unpack the tar file using:

tar xvzf bwa-0.7.12.tar.bz2


This will produce a directory called bwa-0.7.12 containing the files necessary to compile the BWA binary. Move to this directory and compile using:

cd bwa-0.7.12
make


The compiled binary is called bwa. You should find it within the same folder (bwa-0.7.12 in this example). You may also find other compiled binaries; at time of writing, a second binary called bwamem-lite is also included. You can disregard this file for now. Finally, just add the BWA binary to your path to make it available on the command line. This completes the installation process.

• Testing

Open a shell and run:

bwa


This should print out some version and author information as well as a list of commands. As the Usage line states, to use BWA you will always build your command lines like this:

bwa <command> [options]


This means you first make the call to the binary (bwa), then you specify which command (method) you wish to use (e.g. index) then any options (i.e. arguments such as input files or parameters) used by the program to perform that command.

### 2. SAMtools

• Installation

Unpack the tar file using:

tar xvzf samtools-0.1.2.tar.bz2


This will produce a directory called samtools-0.1.2 containing the files necessary to compile the SAMtools binary. Move to this directory and compile using:

cd samtools-0.1.2
make


The compiled binary is called samtools. You should find it within the same folder (samtools-0.1.2 in this example). Finally, add the SAMtools binary to your path to make it available on the command line. This completes the installation process.

• Testing

Open a shell and run:

samtools


This should print out some version information as well as a list of commands. As the Usage line states, to use SAMtools you will always build your command lines like this:

samtools <command> [options]


This means you first make the call to the binary (samtools), then you specify which command (method) you wish to use (e.g. index) then any options (i.e. arguments such as input files or parameters) used by the program to perform that command. This is a similar convention as used by BWA.

### 3. Picard

• Installation

Unpack the zip file using:

tar xjf picard-tools-1.130.zip


This will produce a directory called picard-tools-1.130 containing the Picard jar files. Picard tools are distributed as a pre-compiled Java executable (jar file) so there is no need to compile them.

Note that it is not possible to add jar files to your path to make the tools available on the command line; you have to specify the full path to the jar file in your java command, which would look like this:

java -jar ~/my_tools/jars/picard.jar <Toolname> [options]


This syntax will be explained in a little more detail further below.

However, you can set up a shortcut called an "environment variable" in your shell profile configuration to make this easier. The idea is that you create a variable that tells your system where to find a given jar, like this:

PICARD = "~/my_tools/jars/picard.jar"


So then when you want to run a Picard tool, you just need to call the jar by its shortcut, like this:

INFO  16:29:15,499 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 INFO 16:29:15,618 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] INFO 16:29:15,618 TraversalEngine - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 16:29:15,679 TraversalEngine - Total runtime 0.08 secs, 0.00 min, 0.00 hours INFO 16:29:15,718 TraversalEngine - 0 reads were filtered out during traversal out of 33 total (0.00%) INFO 16:29:16,712 GATKRunReport - Uploaded run statistics report to AWS S3  This time however, if we look inside the working directory, there is a newly created file there called output.txt. [bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% ls -la drwxr-xr-x 9 vdauwera CHARLES\Domain Users 306 Jul 25 16:29 . drwxr-xr-x@ 6 vdauwera CHARLES\Domain Users 204 Jul 25 15:31 .. -rw-r--r--@ 1 vdauwera CHARLES\Domain Users 3635 Apr 10 07:39 exampleBAM.bam -rw-r--r--@ 1 vdauwera CHARLES\Domain Users 232 Apr 10 07:39 exampleBAM.bam.bai -rw-r--r--@ 1 vdauwera CHARLES\Domain Users 148 Apr 10 07:39 exampleFASTA.dict -rw-r--r--@ 1 vdauwera CHARLES\Domain Users 101673 Apr 10 07:39 exampleFASTA.fasta -rw-r--r--@ 1 vdauwera CHARLES\Domain Users 20 Apr 10 07:39 exampleFASTA.fasta.fai -rw-r--r-- 1 vdauwera CHARLES\Domain Users 5 Jul 25 16:29 output.txt  This file contains the result of the analysis: [bm4dd-56b:~/codespace/gatk/sandbox] vdauwera% cat output.txt 2052  This means that there are 2052 loci in the reference sequence that are covered by at least one or more reads in the BAM file. #### Discussion Okay then, but why not show the full, correct command in the first place? Because this was a good opportunity for you to learn a few of the caveats of the GATK command system, which may save you a lot of frustration later on. Beyond the common basic arguments that almost all GATK walkers require, most of them also have specific requirements or options that are important to how they work. You should always check what are the specific arguments that are required, recommended and/or optional for the walker you want to use before starting an analysis. Fortunately the GATK is set up to complain (i.e. terminate with an error message) if you try to run it without specifying a required argument. For example, if you try to run this: java -jar <path to GenomeAnalysisTK.jar> -T CountLoci -R exampleFASTA.fasta  the GATK will spit out a wall of text, including the basic usage guide that you can invoke with the --help option, and more importantly, the following error message: ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 2.0-22-g40f97eb): ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed ##### ERROR Please do not post this error to the GATK forum ##### ERROR ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments. ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR MESSAGE: Walker requires reads but none were provided. ##### ERROR ------------------------------------------------------------------------------------------  You see the line that says ERROR MESSAGE: Walker requires reads but none were provided? This tells you exactly what was wrong with your command. So the GATK will not run if a walker does not have all the required inputs. That's a good thing! But in the case of our first attempt at running CountLoci, the -o argument is not required by the GATK to run -- it's just highly desirable if you actually want the result of the analysis! There will be many other cases of walkers with arguments that are not strictly required, but highly desirable if you want the results to be meaningful. So, at the risk of getting repetitive, always read the documentation of each walker that you want to use! Created 2014-10-18 00:41:42 | Updated 2015-07-22 17:17:43 | Tags: genotyperefinement denovo ### Overview This tutorial describes step-by-step instruction for applying the Genotype Refinement workflow (described in this method article) to your data. ### Step 1: Derive posterior probabilities of genotypes In this first step, we are deriving the posteriors of genotype calls in our callset, recalibratedVariants.vcf, which just came out of the VQSR filtering step; it contains among other samples a trio of individuals (mother, father and child) whose family structure is described in the pedigree file trio.ped (which you need to supply). To do this, we are using the most comprehensive set of high confidence SNPs available to us, a set of sites from Phase 3 of the 1000 Genomes project (available in our resource bundle), which we pass via the --supporting argument.  java -jar GenomeAnalysisToolkit.jar -R human_g1k_v37_decoy.fasta -T CalculateGenotypePosteriors --supporting 1000G_phase3_v4_20130502.sites.vcf -ped trio.ped -V recalibratedVariants.vcf -o recalibratedVariants.postCGP.vcf  This produces the output file recalibratedVariants.postCGP.vcf, in which the posteriors have been annotated wherever possible. ### Step 2: Filter low quality genotypes In this second, very simple step, we are tagging low quality genotypes so we know not to use them in our downstream analyses. We use Q20 as threshold for quality, which means that any passing genotype has a 99% chance of being correct. java -jar$GATKjar -T VariantFiltration -R $bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibratedVariants.postCGP.Gfiltered.vcf  Note that in the resulting VCF, the genotypes that failed the filter are still present, but they are tagged lowGQ in the FILTER field. ### Step 3: Annotate possible de novo mutations In this third and final step, we tag variants for which at least one family in the callset shows evidence of a de novo mutation based on the genotypes of the family members. java -jar$GATKjar -T VariantAnnotator -R \$bundlePath/b37/human_g1k_v37_decoy.fasta -V recalibratedVariants.postCGP.Gfiltered.vcf -A PossibleDeNovo -ped trio.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf


The annotation output will include a list of the children with possible de novo mutations, classified as either high or low confidence.

See section 3 of the method article for a complete description of annotation outputs and section 4 for an example of a call and the interpretation of the annotation values.

Created 2012-07-24 21:24:42 | Updated 2014-07-28 22:32:00 | Tags: install java

#### Objective

Test that the GATK is correctly installed, and that the supporting tools like Java are in your path.

#### Prerequisites

• Basic familiarity with the command-line environment
• Understand what is a PATH variable

#### Steps

1. Invoke the GATK usage/help message
2. Troubleshooting

### 1. Invoke the GATK usage/help message

The command we're going to run is a very simple command that asks the GATK to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your GATK package is installed correctly.

Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.

#### Action

Type the following command:

java -jar <path to GenomeAnalysisTK.jar> --help


replacing the <path to GenomeAnalysisTK.jar> bit with the path you have set up in your command-line environment.

#### Expected Result

You should see usage output similar to the following:

usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-I <input_file>] [-L
<intervals>] [-R <reference_sequence>] [-B <rodBind>] [-D <DBSNP>] [-H
<hapmap>] [-hc <hapmap_chip>] [-o <out>] [-e <err>] [-oe <outerr>] [-A] [-M
<maximum_reads>] [-sort <sort_on_the_fly>] [-compress <bam_compression>] [-fmq0] [-dfrac
<downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-S
<validation_strictness>] [-U] [-P] [-dt] [-tblw] [-nt <numthreads>] [-l
<logging_level>] [-log <log_to_file>] [-quiet] [-debug] [-h]
-T,--analysis_type <analysis_type>                     Type of analysis to run
-I,--input_file <input_file>                           SAM or BAM file(s)
-L,--intervals <intervals>                             A list of genomic intervals over which
to operate. Can be explicitly specified
on the command line or in a file.
-R,--reference_sequence <reference_sequence>           Reference sequence file
-B,--rodBind <rodBind>                                 Bindings for reference-ordered data, in
the form <name>,<type>,<file>
-D,--DBSNP <DBSNP>                                     DBSNP file
-H,--hapmap <hapmap>                                   Hapmap file
-hc,--hapmap_chip <hapmap_chip>                        Hapmap chip file
-o,--out <out>                                         An output file presented to the walker.
Will overwrite contents if file exists.
-e,--err <err>                                         An error output file presented to the
walker. Will overwrite contents if file
exists.
-oe,--outerr <outerr>                                  A joint file for 'normal' and error
output presented to the walker. Will
overwrite contents if file exists.

...


If you see this message, your GATK installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.

### 2. Troubleshooting

Let's try to figure out what's not working.

#### Action

First, make sure that your Java version is at least 1.7, by typing the following command:

java -version


#### Expected Result

You should see something similar to the following text:

java version "1.7.0_12"
Java(TM) SE Runtime Environment (build 1.7.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)


#### Remedial actions

If the version is less then 1.7, install the newest version of Java onto the system. If you instead see something like

java: Command not found


make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.

Created 2012-08-09 04:08:01 | Updated 2013-06-17 21:09:33 | Tags: test official basic analyst intro queue developer install

#### Objective

Test that Queue is correctly installed, and that the supporting tools like Java are in your path.

#### Prerequisites

• Basic familiarity with the command-line environment
• Understand what is a PATH variable
• GATK installed

#### Steps

1. Invoke the Queue usage/help message
2. Troubleshooting

### 1. Invoke the Queue usage/help message

The command we're going to run is a very simple command that asks Queue to print out a list of available command-line arguments and options. It is so simple that it will ALWAYS work if your Queue package is installed correctly.

Note that this command is also helpful when you're trying to remember something like the right spelling or short name for an argument and for whatever reason you don't have access to the web-based documentation.

#### Action

Type the following command:

java -jar <path to Queue.jar> --help


replacing the <path to Queue.jar> bit with the path you have set up in your command-line environment.

#### Expected Result

You should see usage output similar to the following:

usage: java -jar Queue.jar -S <script> [-jobPrefix <job_name_prefix>] [-jobQueue <job_queue>] [-jobProject <job_project>]
[-jobSGDir <job_scatter_gather_directory>] [-memLimit <default_memory_limit>] [-runDir <run_directory>] [-tempDir
<temp_directory>] [-emailHost <emailSmtpHost>] [-emailPort <emailSmtpPort>] [-emailTLS] [-emailSSL] [-emailUser
[-expandedDot <expanded_dot_graph>] [-startFromScratch] [-status] [-statusFrom <status_email_from>] [-statusTo
<status_email_to>] [-keepIntermediates] [-retry <retry_failed>] [-l <logging_level>] [-log <log_to_file>] [-quiet]
[-debug] [-h]

-S,--script <script>                                                      QScript scala file
-jobPrefix,--job_name_prefix <job_name_prefix>                            Default name prefix for compute farm jobs.
-jobQueue,--job_queue <job_queue>                                         Default queue for compute farm jobs.
-jobProject,--job_project <job_project>                                   Default project for compute farm jobs.
-jobSGDir,--job_scatter_gather_directory <job_scatter_gather_directory>   Default directory to place scatter gather
output for compute farm jobs.
-memLimit,--default_memory_limit <default_memory_limit>                   Default memory limit for jobs, in gigabytes.
-runDir,--run_directory <run_directory>                                   Root directory to run functions from.
-tempDir,--temp_directory <temp_directory>                                Temp directory to pass to functions.
-emailHost,--emailSmtpHost <emailSmtpHost>                                Email SMTP host. Defaults to localhost.
-emailPort,--emailSmtpPort <emailSmtpPort>                                Email SMTP port. Defaults to 465 for ssl,
otherwise 25.
-emailTLS,--emailUseTLS                                                   Email should use TLS. Defaults to false.
-emailSSL,--emailUseSSL                                                   Email should use SSL. Defaults to false.
secure! See emailPassFile.
-bsub,--bsub_all_jobs                                                     Use bsub to submit jobs
-run,--run_scripts                                                        Run QScripts.  Without this flag set only
performs a dry run.
-dot,--dot_graph <dot_graph>                                              Outputs the queue graph to a .dot file.  See:
http://en.wikipedia.org/wiki/DOT_language
-expandedDot,--expanded_dot_graph <expanded_dot_graph>                    Outputs the queue graph of scatter gather to
a .dot file.  Otherwise overwrites the
dot_graph
-startFromScratch,--start_from_scratch                                    Runs all command line functions even if the
outputs were previously output successfully.
-status,--status                                                          Get status of jobs for the qscript
-statusFrom,--status_email_from <status_email_from>                       Email address to send emails from upon
completion or on error.
-statusTo,--status_email_to <status_email_to>                             Email address to send emails to upon
completion or on error.
-keepIntermediates,--keep_intermediate_outputs                            After a successful run keep the outputs of
any Function marked as intermediate.
-retry,--retry_failed <retry_failed>                                      Retry the specified number of times after a
command fails.  Defaults to no retries.
-l,--logging_level <logging_level>                                        Set the minimum level of logging, i.e.
setting INFO get's you INFO up to FATAL,
setting ERROR gets you ERROR and FATAL level
logging.
-log,--log_to_file <log_to_file>                                          Set the logging location
-quiet,--quiet_output_mode                                                Set the logging to quiet mode, no output to
stdout
-debug,--debug_mode                                                       Set the logging file string to include a lot
of debugging information (SLOW!)
-h,--help                                                                 Generate this help message


If you see this message, your Queue installation is ok. You're good to go! If you don't see this message, and instead get an error message, proceed to the next section on troubleshooting.

### 2. Troubleshooting

Let's try to figure out what's not working.

#### Action

First, make sure that your Java version is at least 1.6, by typing the following command:

java -version


#### Expected Result

You should see something similar to the following text:

java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)


#### Remedial actions

If the version is less then 1.6, install the newest version of Java onto the system. If you instead see something like

java: Command not found


make sure that java is installed on your machine, and that your PATH variable contains the path to the java executables.

On a Mac running OS X 10.5+, you may need to run /Applications/Utilities/Java Preferences.app and drag Java SE 6 to the top to make your machine run version 1.6, even if it has been installed.